SPARQL2Flink: Evaluation of SPARQL Queries on Apache Flink

Abstract: Existing SPARQL query engines and triple stores are continuously improved to handle increasingly massive datasets. Several approaches have been developed in this context proposing the storage and querying of RDF data in a distributed fashion, mainly using the MapReduce Programming Model and Hadoop-based ecosystems. New trends in Big Data technologies have also emerged (e.g., Apache Spark, Apache Flink); they use distributed in-memory processing and promise to deliver higher data processing performance. In this paper, we present a formal interpretation of some PACT transformations implemented in the Apache Flink DataSet API. We use this formalization to provide a mapping to translate a SPARQL query to a Flink program. The mapping was implemented in a prototype used to determine the correctness and performance of the solution. The source code of the project is available on GitHub under the MIT license.


Introduction
The amount and size of datasets represented in the Resource Description Framework (RDF) [1] language are increasing; this challenges the limits of existing triple stores and SPARQL query evaluation technologies, requiring more efficient query evaluation techniques. Several proposals documented in the state of the art use Big Data technologies for storing and querying RDF data [2][3][4][5][6]. Some of these proposals have focused on executing SPARQL queries on the MapReduce Programming Model [7] and its implementation, Hadoop [8]. However, more recent Big Data technologies have emerged (e.g., Apache Spark [9], Apache Flink [10], Google DataFlow [11]). They use distributed in-memory processing and promise to deliver higher data processing performance than traditional MapReduce platforms [12]. These technologies are widely used in research projects and in companies of all kinds (e.g., Google, Twitter, and Netflix, as well as small start-ups).
To analyze whether or not we can use these technologies to provide query evaluation over large RDF datasets, we will work with Apache Flink, an open-source platform for distributed stream and batch data processing. One of the essential components of the Flink framework is the Flink optimizer, called Nephele [13]. Nephele is based on the Parallelization Contracts (PACTs) Programming Model [14], which is in turn a generalization of the well-known MapReduce Programming Model. The output of the Flink optimizer is a compiled and optimized PACT program, which is a Directed Acyclic Graph (DAG)-based dataflow program. At a high level, Flink programs are regular programs written in Java, Scala, or Python. Flink programs are mapped to dataflow programs, which implement multiple transformations (e.g., filter, map, join, group) on distributed collections, which are initially created from sources (e.g., by reading from files). Results are returned via sinks, which may, for example, write the data to (distributed) files or to the standard output (e.g., to the command line terminal).
In [14], the set of initial PACT operations (i.e., map, reduce, cross, cogroup, match) is formally described from the point of view of distributed data processing. Hence, the main challenge that we need to address is how to transform SPARQL queries into Flink programs that use the DataSet API transformations. This paper presents an approach for SPARQL query evaluation over massive static RDF datasets through the Apache Flink framework. To summarize, the main contributions of this paper are the following:

1.
A formal definition of a subset of the Apache Flink transformations.

2.
A formal mapping to translate a SPARQL query into a Flink program based on the DataSet API transformations.

3.
An open-source implementation, called SPARQL2Flink, available on GitHub under the MIT license, which transforms a SPARQL query into a Flink program. We assume that, to deal with an RDF dataset, writing a SPARQL query is more accessible than writing a program using the Apache Flink DataSet API.
This research is preliminary work towards scalable query processing in a framework like Apache Flink. We chose Apache Flink among several other Big Data tools based on comparative studies such as [12,[15][16][17][18][19][20][21][22]. Flink provides streaming data processing that incorporates (i) a distributed dataflow runtime that exploits pipelined streaming execution for batch and stream workloads, (ii) exactly-once state consistency through lightweight checkpointing, (iii) native iterative processing, and (iv) sophisticated window semantics, supporting out-of-order processing. The results reported in this paper focus on the processing of SPARQL queries over static RDF data through the Apache Flink DataSet API. However, it is essential to note that this work is part of a general project that aims to process hybrid queries over massive static RDF data and append-only RDF streams. Examples are applications derived from the Internet of Things (IoT) that need to store, process, and analyze data in real or near real time. In the Semantic Web context, so far, there have been some technologies trying to provide this capability [23][24][25][26]. Further work is needed to optimize the resulting Flink programs to ensure that queries can be run over large RDF datasets as described in our motivation.
The remainder of the paper is organized as follows: In Section 2, we present a brief overview of RDF, SPARQL, the PACT Programming Model, and Apache Flink. In Section 3, we describe a formal interpretation of PACT transformations implemented in the Apache Flink DataSet API and the semantic correspondence between SPARQL Algebra operators and a subset of Apache Flink's transformations. In Section 4, we present an implementation of the transformations described in Section 3, as a Java library. In Section 5, we present the evaluation of the performance of SPARQL2Flink using an adaptation of the Berlin SPARQL Benchmark [27]. In Section 6, we present related work on SPARQL query processing of massive static RDF data using MapReduce-based technologies. Finally, Section 7 presents conclusions and interesting issues for future work.

Resource Description Framework
Resource Description Framework (RDF) [1] is a W3C recommendation for the representation of data on the Semantic Web. There are different serialization formats for RDF documents (e.g., RDF/XML, N-Triples, N3, Turtle). In the following, some essential elements of the RDF terminology are defined analogously to Perez et al. [1,28].

Definition 1 (RDF Terms and Triples). Assume there are pairwise disjoint infinite sets I, B, and L (IRIs, blank nodes, and literals). A tuple (s, p, o) ∈ (I ∪ B) × I × (I ∪ B ∪ L) is called an RDF triple. In this tuple, s is called the subject, p the predicate, and o the object. We denote by T (the set of RDF terms) the union of IRIs, blank nodes, and literals, i.e., T = I ∪ B ∪ L. An IRI (Internationalized Resource Identifier) is a generalization of a URI (Uniform Resource Identifier). URIs are common global identifiers for resources across the Web.

Definition 2 (RDF Graph). An RDF graph is a set of RDF triples. If G is an RDF graph, term(G) is the set of elements of T appearing in the triples of G, and blank(G) is the set of blank nodes appearing in G, i.e., blank(G) = term(G) ∩ B.

Definition 3 (RDF Dataset). An RDF dataset DS is a set DS = {g_0, (µ_1, g_1), (µ_2, g_2), . . . , (µ_n, g_n)} where g_0 and each g_i are RDF graphs, and each corresponding µ_i is a distinct IRI. g_0 is called the default graph, while each of the others is called a named graph.

SPARQL Protocol and RDF Query Language
SPARQL [29] is the W3C recommendation to query RDF. There are four query types: SELECT, ASK, DESCRIBE, and CONSTRUCT. In this paper, we focus on SELECT queries. The basic SELECT query consists of three parts separated by the keywords PREFIX, SELECT, and WHERE. The PREFIX part declares prefixes to be used in IRIs to make them shorter; the SELECT part identifies the variables to appear in the query result; the WHERE part provides the Basic Graph Pattern (BGP) to match against the input data graph. Terminology covering the concepts of Triple and Basic Graph Pattern, Mappings, Basic Graph Patterns and Mappings, Subgraph Matching, Value Constraint, Built-in Condition, Graph Pattern Expression, Graph Pattern Evaluation, and SELECT Result Form is defined by Perez et al. in [28,30]. We encourage the reader to refer to these papers before going ahead.

PACT Programming Model
The PACT Programming Model [14] is considered a generalization of MapReduce [7]. It operates on a key/value data model and is based on so-called Parallelization Contracts (PACTs). A PACT consists of a system-provided second-order function (called Input Contract) and a user-defined first-order function (UDF) which processes custom data types. The PACT Programming Model provides an initial set of five Input Contracts: two Single-Input Contracts, map and reduce, as known from MapReduce, which apply to user-defined functions with a single input; and three Multi-Input Contracts, cross, cogroup, and match, which apply to user-defined functions with multiple inputs. As in the previous subsection, we encourage the reader to refer to the work of Battre et al. [14] for a complete review of the definitions concerning the concepts of Single-Input Contract, mapping function, map, reduce, Multi-Input Contract, cross, cogroup, and match.

Apache Flink
The Stratosphere [31] research project aims at building a big data analysis platform which makes it possible to analyze massive amounts of data in a manageable and declarative way. In 2014, Stratosphere was open-sourced under the name Flink as an Apache Incubator project; it graduated to a top-level Apache project in the same year. Apache Flink [10] is an open-source framework for distributed stream and batch data processing. The main components of the Apache Flink architecture are the core, the APIs (e.g., DataSet, DataStream, Table & SQL), and the libraries (e.g., Gelly). The core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. The DataSet API allows processing finite datasets (batch processing), the DataStream API processes potentially unbounded data streams (stream processing), and the Table & SQL API allows the composition of queries from relational operators. The SQL support is based on Apache Calcite [32], which implements the SQL standard. The libraries are built upon those APIs. Apache Flink also provides an optimizer, called Nephele [13]. The Nephele optimizer is based on the PACT Programming Model [14] and transforms a PACT program into a Job Graph [33]. Additionally, Apache Flink provides several PACT transformations for data transformation (e.g., filter, map, join, group).

Mapping SPARQL Queries to an Apache Flink Program
In this section, we present our first two contributions: a formal description of a subset of Apache Flink's transformations and the semantic correspondence between SPARQL Algebra operators and those transformations.

PACT Data Model
We describe the PACT Data Model in a way similar to the one in [34], which is centered around the concepts of datasets and records. We assume a possibly unbounded universal multi-set of records T. A dataset T = {r_1, . . . , r_n} is then a bounded collection of records; consequently, each dataset is a subset of the possibly unbounded universal multi-set, i.e., T ⊆ T. A record r = [k_1 : v_1, . . . , k_n : v_n] is an unordered list of key-value pairs. The semantics of the keys and values, including their types, are left to the user-defined functions that manipulate them [34]. We employ the record keys to define some PACT transformations. It is possible to use numbers as record keys; in the special case where the set of keys of a record r is {1, 2, . . . , n} for some n ∈ N, we say r is a tuple. For the sake of simplicity, we write r = [v_1, . . . , v_n] instead of the tuple r = [1 : v_1, . . . , n : v_n]. Two records r_1 = [k_{1,1} : v_{1,1}, . . . , k_{1,n} : v_{1,n}] and r_2 = [k_{2,1} : v_{2,1}, . . . , k_{2,m} : v_{2,m}] are equal (r_1 ≡ r_2) iff n = m and, for each i ∈ {1, . . . , n}, there exists j ∈ {1, . . . , m} such that k_{1,i} = k_{2,j} and v_{1,i} = v_{2,j}.
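The record model above can be sketched in plain Java, outside Flink; the class name and Map-based representation below are illustrative assumptions, not part of the PACT API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A PACT-style record: an unordered list of key-value pairs.
// Backed by a Map, so two records are equal (r1 ≡ r2) exactly when
// they hold the same key-value pairs, regardless of insertion order.
class PactRecord {
    private final Map<String, Object> pairs = new LinkedHashMap<>();

    PactRecord put(String key, Object value) {
        pairs.put(key, value);
        return this;
    }

    Object get(String key) {
        return pairs.get(key);
    }

    // A record is a tuple when its keys are exactly {1, 2, ..., n};
    // r = [v1, ..., vn] abbreviates r = [1 : v1, ..., n : vn].
    static PactRecord tuple(Object... values) {
        PactRecord r = new PactRecord();
        for (int i = 0; i < values.length; i++) {
            r.put(String.valueOf(i + 1), values[i]);
        }
        return r;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof PactRecord && pairs.equals(((PactRecord) o).pairs);
    }

    @Override
    public int hashCode() {
        return pairs.hashCode();
    }
}
```

Because Map equality ignores order, the sketch reflects that records are unordered lists of key-value pairs.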

Formalization of Apache Flink Transformations
In this section, we propose a formal interpretation of the PACT transformations implemented in the Apache Flink DataSet API. That interpretation will be used to establish a correspondence with the SPARQL Algebra operators. This correspondence is necessary before establishing an encoding to translate SPARQL queries to Flink programs in order to exploit the capabilities of Apache Flink for RDF data processing.
In order to define the PACT transformations, we need some auxiliary notions. First, we define the record projection, which builds a new record made up of the key-value pairs associated with some specific keys. Second, we define the record value projection, which obtains the values associated with some specific keys. Next, we define the single dataset partition, which creates groups of records whose values for some keys are the same. Finally, we define the multiple dataset partition as a generalization of the single dataset partition. The single and multiple dataset partitions are crucial to the definitions of the reduce and cogroup transformations, because those transformations apply user functions over groups of records.
Record projection is defined as follows:

Definition 4 (Record Projection). Let r = [k_1 : v_1, . . . , k_n : v_n] ∈ T be a record and {i_1, . . . , i_m} ⊆ keys(r) be a set of keys. We define the projection of r over {i_1, . . . , i_m} (denoted as r(i_1, . . . , i_m)) as the record built from exactly those key-value pairs k_j : v_j of r such that k_j ∈ {i_1, . . . , i_m}. In this way, by means of a record projection, a new record is obtained only with the key-value pairs associated with some key in the set I = {i_1, . . . , i_m}.
Record value projection is defined as follows:

Definition 5 (Record Value Projection). Let r ∈ T be a record and [i_1, . . . , i_m] be a tuple of keys of r. We define the value projection of r over [i_1, . . . , i_m] as the tuple of values [v_{i_1}, . . . , v_{i_m}], where each v_{i_j} is the value associated with key i_j in r.

It is worth specifying that the record value projection takes a record and produces a tuple of values. Thus, in this operation, the key order in the tuple [i_1, . . . , i_m] is considered for the result construction. Likewise, the result of the record value projection may contain repeated elements. Let r_1 and r_2 be tuples of values; we say that r_1 and r_2 are equivalent (r_1 ≡ r_2) if both r_1 and r_2 contain exactly the same elements.
The notion of single dataset partition is defined as follows:

Definition 6 (Single Dataset Partition). Let T ⊆ T be a dataset and K a non-empty set of keys. We define a single dataset partition of T over the keys K as a set {T_1, . . . , T_m} that is a set partition of T such that two records of T belong to the same group T_i if and only if their value projections over K are equivalent. Intuitively, the single dataset partition creates groups of records where the values associated with the keys in K are the same.
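Reading records as plain Java maps, the record projection, record value projection, and single dataset partition can be sketched with the streams API (method and class names here are illustrative, not part of any library):

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class Partitioning {

    // Record projection: keep only the key-value pairs whose key is in the given set.
    static Map<String, Object> project(Map<String, Object> r, List<String> keys) {
        return r.entrySet().stream()
                .filter(e -> keys.contains(e.getKey()))
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    // Record value projection: the tuple of values for the given keys, in key order.
    static List<Object> projectValues(Map<String, Object> r, List<String> keys) {
        return keys.stream().map(r::get).collect(Collectors.toList());
    }

    // Single dataset partition: group the records of T so that, within a group,
    // all records agree on the values of the keys in K.
    static Collection<List<Map<String, Object>>> partition(
            List<Map<String, Object>> dataset, List<String> keys) {
        return dataset.stream()
                .collect(Collectors.groupingBy(r -> projectValues(r, keys)))
                .values();
    }
}
```

Note that the value projection respects key order, while the partition only compares the projected values, mirroring Definitions 4 to 6.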
Analogous to the single dataset partition, we define the multiple dataset partition below. Note that the multiple dataset partition is a generalization of the single dataset partition.

Definition 7 (Multiple Dataset Partition). Let T_1, T_2 ⊆ T be two datasets and K_1, K_2 two non-empty sets of keys. We define a multiple dataset partition of T_1 and T_2 over the keys K_1 and K_2 as a set {T'_1, . . . , T'_m} that is a set partition of T_1 ∪ T_2 such that two records belong to the same group if and only if their value projections (over K_1 for records of T_1, and over K_2 for records of T_2) are equivalent.

After defining the auxiliary notions, we define the map, reduce, filter, project, match, outer match, cogroup, and union PACT transformations.

Definition 8 (Map Transformation). Let T ⊆ T be a dataset and f a function over records, i.e., f : T → T. We define the map transformation as follows:

map_f(T) = { f(r) | r ∈ T }

Correspondingly, the map transformation takes each record r = [k_1 : v_1, . . . , k_n : v_n] of a dataset T and produces a new record r' = [k'_1 : v'_1, . . . , k'_m : v'_m] by means of a user function f. Records produced by function f can differ from the original records. First, the number of key-value pairs can be different, i.e., n ≠ m. Second, the keys k'_1, . . . , k'_m do not have to match the keys k_1, . . . , k_n. Last, the datatype associated with each value can differ.
Accordingly, we define the reduce transformation as follows:

Definition 9 (Reduce Transformation). Let T ⊆ T be a dataset, K a non-empty set of keys, and f a function over sets of records, i.e., f : P(T) → T. We define the reduce transformation as follows:

reduce_{K,f}(T) = { f(T_i) | T_i is a group of the single dataset partition of T over K }

In this way, the reduce transformation takes a dataset T and groups its records by means of the single dataset partition; in each group, the records have the same values for the keys in K. Then, it applies the user function f over each group and produces a new record.
Definition 10 (Filter Transformation). Let T ⊆ T be a dataset and f a predicate over records, i.e., f : T → {true, false}. We define the filter transformation as follows:

filter_f(T) = { r ∈ T | f(r) = true }

The filter transformation evaluates predicate f on every record of a dataset T and selects only those records for which f returns true.
Definition 11 (Project Transformation). Let T ⊆ T be a dataset and K = {k_1, . . . , k_m} a set of keys. We define the project transformation as follows:

project_K(T) = { r(k_1, . . . , k_m) | r ∈ T }

While the filter transformation selects specific records according to criteria expressed in the semantics of a function f, the project transformation obtains specific fields from the records of a dataset T. For this purpose, a record projection is applied to each record in T with respect to the set of keys K. It is worth highlighting that the result of a project transformation is a multi-set, since several records may have the same values for the keys in K.
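The single-input transformations above (map, reduce, filter, project) can be mimicked sequentially over plain Java collections; this is only an analogy of the parallel PACT semantics, with illustrative names:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class SingleInput {

    // Map transformation: apply the user function f to every record of T.
    static <A, B> List<B> map(List<A> t, Function<A, B> f) {
        return t.stream().map(f).collect(Collectors.toList());
    }

    // Filter transformation: keep only the records for which f returns true.
    static <A> List<A> filter(List<A> t, Predicate<A> f) {
        return t.stream().filter(f).collect(Collectors.toList());
    }

    // Reduce transformation: partition T by the values of the keys in K,
    // then apply the user function f to each group, producing one record per group.
    static List<Map<String, Object>> reduce(
            List<Map<String, Object>> t, List<String> keys,
            Function<List<Map<String, Object>>, Map<String, Object>> f) {
        return t.stream()
                .collect(Collectors.groupingBy(
                        r -> keys.stream().map(r::get).collect(Collectors.toList())))
                .values().stream()
                .map(f)
                .collect(Collectors.toList());
    }

    // Project transformation: record projection applied to every record of T.
    static List<Map<String, Object>> project(
            List<Map<String, Object>> t, List<String> keys) {
        return map(t, r -> r.entrySet().stream()
                .filter(e -> keys.contains(e.getKey()))
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue)));
    }
}
```

Since project returns a list, duplicate projected records are preserved, matching the multi-set remark in Definition 11.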
The previous PACT transformations take a dataset T as a parameter and produce a new dataset according to specific semantics. Nevertheless, many data sources are available, and it is often necessary to process and combine multiple datasets. In consequence, some PACT transformations take two or more datasets as parameters [14]. In the following, we present a formal interpretation of the essential multi-dataset transformations, including matching, grouping, and union.

Definition 12 (Match Transformation). Let T_1, T_2 ⊆ T be datasets, f : T_1 × T_2 → T a function, and K_1 and K_2 sets of keys. We define the match transformation as follows:

match_{K_1,K_2,f}(T_1, T_2) = { f(r_1, r_2) | r_1 ∈ T_1, r_2 ∈ T_2, and the value projection of r_1 over K_1 is equivalent to the value projection of r_2 over K_2 }

Thus, the match transformation takes each pair of records (r_1, r_2) built from datasets T_1 and T_2, and applies the user function f to those pairs for which the values in r_1 with respect to the keys in K_1 coincide with the values in r_2 with respect to the keys in K_2. This correspondence is checked through a record value projection. Intuitively, the match transformation groups and processes pairs of records related by some specific criterion.
In some cases, it is necessary to match and process a record in a dataset even if a corresponding record does not exist in the other dataset. The outer match transformation extends the match transformation to enable such a matching and is defined as follows:

Definition 13 (Outer Match Transformation). Let T_1, T_2 ⊆ T be datasets, f : T_1 × T_2 → T a function, and K_1 and K_2 sets of keys. We define the outer match transformation analogously to the match transformation, except that every record r_1 ∈ T_1 without a matching record in T_2 also contributes to the result through the user function f.

In this manner, the outer match transformation is similar to the match transformation, but it allows us to apply the user function f to a record r_1 even when there is no record r_2 that matches r_1 with respect to the keys K_1 and K_2.
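A sequential sketch of match and outer match follows (names are illustrative); the outer variant keeps every left record, passing null for the missing right side, which is one possible reading of the definition:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.stream.Collectors;

class MultiInput {

    private static List<Object> values(Map<String, Object> r, List<String> keys) {
        return keys.stream().map(r::get).collect(Collectors.toList());
    }

    // Match: apply f to every pair (r1, r2) whose key values coincide.
    static List<Map<String, Object>> match(
            List<Map<String, Object>> t1, List<Map<String, Object>> t2,
            List<String> k1, List<String> k2,
            BiFunction<Map<String, Object>, Map<String, Object>, Map<String, Object>> f) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> r1 : t1) {
            for (Map<String, Object> r2 : t2) {
                if (values(r1, k1).equals(values(r2, k2))) {
                    out.add(f.apply(r1, r2));
                }
            }
        }
        return out;
    }

    // Outer match: like match, but a left record with no partner is still
    // passed to f, paired with null (f must handle the null case).
    static List<Map<String, Object>> outerMatch(
            List<Map<String, Object>> t1, List<Map<String, Object>> t2,
            List<String> k1, List<String> k2,
            BiFunction<Map<String, Object>, Map<String, Object>, Map<String, Object>> f) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> r1 : t1) {
            boolean matched = false;
            for (Map<String, Object> r2 : t2) {
                if (values(r1, k1).equals(values(r2, k2))) {
                    out.add(f.apply(r1, r2));
                    matched = true;
                }
            }
            if (!matched) {
                out.add(f.apply(r1, null));
            }
        }
        return out;
    }
}
```

This pairwise comparison is only for exposition; an actual engine would match via hashing or sorting rather than a nested loop.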
In addition to the match and outer match transformations, the cogroup transformation enables us to group records of two datasets that coincide with respect to sets of keys. The cogroup transformation is defined as follows:

Definition 14 (CoGroup Transformation). Let T_1, T_2 ⊆ T be datasets, f : P(T) → T a function, and K_1 and K_2 sets of keys. We define the cogroup transformation as the application of f to each group of the multiple dataset partition of T_1 and T_2 over the keys K_1 and K_2.

Intuitively, the cogroup transformation builds groups with the records of datasets T_1 and T_2 for which the values of the keys in K_1 and K_2 are equal, and then applies the user function f over each one of those groups.
Finally, the union transformation creates a new dataset with every record of two datasets T_1 and T_2. It is defined as follows:

Definition 15 (Union Transformation). Let T_1, T_2 ⊆ T be datasets. We define the union transformation as follows:

union(T_1, T_2) = T_1 ∪ T_2

It is essential to highlight that records in datasets T_1 and T_2 can differ in the number of key-value pairs and in the types of the values.

Correspondence between SPARQL Algebra Operators and Apache Flink Transformations
In this section, we propose a semantic correspondence between SPARQL Algebra operators and the PACT transformations implemented in the Apache Flink DataSet API. We use the formalization of PACT transformations presented in the previous section to provide an intuitive and correct mapping of the semantic elements of SPARQL queries. It is important to remember that in this formalization a record is an unordered list of n key-value pairs. However, as described in Section 2.1, an RDF dataset is a set of triples, each composed of three elements ⟨s, p, o⟩. Hence, for this particular case, a record will be understood as an unordered list of three key-value pairs. Besides, we assume that each field of a record r can be accessed using indexes 0, 1, and 2. Likewise, we assume that RDF triple patterns are triples [s, p, o] where s, p, o can be variables or values. Finally, the result of the application of each PACT transformation is intended to be a set of solution mappings, i.e., sets of key-value pairs with RDF variables as keys, which will be represented as records with n key-value pairs.
Following, we present the definition of our encoding of SPARQL queries as PACT transformations. First, we define the encoding of the graph pattern evaluation as follows:

Definition 16 (Graph Pattern PACT Encoding). Let P be a graph pattern and D be an RDF dataset. The PACT encoding of the evaluation of P over D, denoted by ||P||_D, is defined recursively as follows:

1. If P is a triple pattern [s, p, o], then ||P||_D = map_{f_2}(filter_{f_1}(D)), where function f_1 keeps only the records of D that are compatible with the variables and values of [s, p, o], and function f_2 maps each such record to the solution mapping that binds every variable of [s, p, o] to the corresponding value.

2. If P is (P_1 AND P_2), then ||P||_D = match_{K,K,f}(||P_1||_D, ||P_2||_D), where K = vars(P_1) ∩ vars(P_2) and function f merges two compatible solution mappings into one.

3. If P is (P_1 OPT P_2), then ||P||_D = outer match_{K,K,f}(||P_1||_D, ||P_2||_D), where K = vars(P_1) ∩ vars(P_2) and function f merges two compatible solution mappings, returning the left solution mapping when the right one is absent.

4. If P is (P_1 UNION P_2), then ||P||_D = union(||P_1||_D, ||P_2||_D).

5. If P is (P_1 FILTER R), then ||P||_D = filter_f(||P_1||_D), where R is a boolean expression and function f evaluates R on each solution mapping.

In this way, the graph pattern PACT evaluation is encoded according to the recursive definition of a graph pattern P. More precisely, we have that: • If P is a triple pattern, then records of dataset D are filtered (by means of function f_1) to obtain only the records that are compatible with respect to the variables and values in [s, p, o]. Then, the filtered records are mapped (by means of function f_2) to obtain solution mappings that relate each variable to each possible value. • If P is a join (left join), i.e., it uses the SPARQL operator AND (OPT), then a match (outer match) transformation is performed between the recursive evaluations of subgraphs P_1 and P_2 with respect to the set K of variables shared by P_1 and P_2. • If P is a union graph pattern, then a union transformation is performed between the recursive evaluations of subgraphs P_1 and P_2. • Finally, if P is a filter graph pattern, then a filter transformation is performed over the recursive evaluation of subgraph P_1, where the user function f is built according to the structure of the filter expression R.
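Case 1 of the encoding can be sketched sequentially in plain Java: f_1 is the compatibility filter and f_2 turns each surviving triple into a solution mapping. Class and helper names below are illustrative, not the SPARQL2Flink code:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class TriplePatternEncoding {

    // A triple pattern position is a variable if it starts with '?'.
    static boolean isVar(String term) {
        return term != null && term.startsWith("?");
    }

    // A pattern position accepts a value if it is a variable or equals the value.
    static boolean compatible(String pattern, String value) {
        return isVar(pattern) || pattern.equals(value);
    }

    // f1: keep only the triples compatible with the pattern (filter);
    // f2: bind each variable of the pattern to the matching value (map).
    static List<Map<String, String>> evalTriplePattern(
            List<String[]> triples, String s, String p, String o) {
        return triples.stream()
                .filter(t -> compatible(s, t[0]) && compatible(p, t[1]) && compatible(o, t[2]))
                .map(t -> {
                    Map<String, String> sm = new HashMap<>();
                    if (isVar(s)) sm.put(s, t[0]);
                    if (isVar(p)) sm.put(p, t[1]);
                    if (isVar(o)) sm.put(o, t[2]);
                    return sm;
                })
                .collect(Collectors.toList());
    }
}
```

The filter-then-map shape is exactly the combination that Section 4 reports for each triple pattern within a BGP.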
In addition to the graph pattern evaluation, we present an encoding of the evaluation of SELECT and DISTINCT SELECT queries, as well as the ORDER BY and LIMIT modifiers. The selection encoding is defined as follows:

Definition 17 (Selection PACT Encoding). Let D be an RDF dataset, P be a graph pattern, K be a finite set of variables, and Q = ⟨P, K⟩ be a selection query over D. The PACT encoding of the evaluation of Q over D is defined as follows:

||Q||_D = project_K(||P||_D)

Correspondingly, the selection query is encoded as a project transformation over the evaluation of the graph pattern P associated with the query, with respect to the set of keys K formed by the variables in the SELECT part of the query. We make a subtle variation to define the distinct selection as follows:

Definition 18 (Distinct Selection PACT Encoding). Let D be an RDF dataset, P be a graph pattern, K be a finite set of variables, and Q* = ⟨P, K⟩ be a distinct selection query over D. The PACT encoding of the evaluation of Q* over D is defined as follows:

||Q*||_D = reduce_{K,f}(project_K(||P||_D))

where function f takes a set of records and returns the first of them. The definition of the distinct selection PACT encoding is similar to the general selection query encoding. The main difference is a reduction step (reduce transformation) in which the duplicate records, i.e., records with the same values for the keys in K (the distinct keys), are reduced to a single occurrence by means of the function f, which takes as a parameter a set of records that agree on the keys in K and returns the first of them (actually, it could return any of them).
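The distinct step can be sketched the same way: records are grouped by their values on the projected keys and each group collapses to its first element. This is illustrative code under the sequential reading used above, not the library's implementation:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class DistinctEncoding {

    // Reduce each group of records that agree on the keys in K
    // to a single representative (the first one; any would do).
    static List<Map<String, Object>> distinct(
            List<Map<String, Object>> t, List<String> keys) {
        return t.stream()
                .collect(Collectors.groupingBy(
                        r -> keys.stream().map(r::get).collect(Collectors.toList())))
                .values().stream()
                .map(group -> group.get(0))
                .collect(Collectors.toList());
    }
}
```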
The encoding of the evaluation of an order-by query is defined as follows:

Definition 19 (Order By PACT Encoding). Let D be an RDF dataset, P be a graph pattern, k be a variable, and Q* = ⟨P, k, flag⟩ be an order-by query over D. The PACT encoding of the evaluation of Q* over D is the result of applying a function order, which sorts the solution mappings of ||P||_D by the values of key k, in ascending or descending order according to flag.

Thereby, the graph pattern associated with the query is first evaluated according to the encoding of its precise semantics. Then, the resulting solution mappings are ordered by means of the function order. Currently, we only consider ordering with respect to one key, which is a simplification of the ORDER BY operator in SPARQL. Finally, the encoding of the evaluation of a limit query is defined as follows:

Definition 20 (Limit PACT Encoding). Let D be an RDF dataset, P be a graph pattern, m be an integer such that m ≥ 0, and Q* = ⟨P, m⟩ be a limit query over D. The PACT encoding of the evaluation of Q* over D is the result of applying a function limit, which keeps only the first m records of ||P||_D.

In this way, once the graph pattern associated with the query is evaluated, the result is shortened to the first m records, according to the query. Following the SPARQL semantics, if m > |M|, where M is the evaluation result, the result is equal to M.

Implementation
This section presents our last contribution: we implemented the transformations described in Section 3 as a Java library [35]. According to Apache Flink [10], a Flink program usually consists of four basic stages: (i) loading/creating the initial data, (ii) specifying the transformations of the data, (iii) specifying where to put the results of the computations, and (iv) triggering the program execution. The SPARQL2Flink [35] library, available on GitHub under the MIT license, is focused on the first three stages of a Flink program, and it is composed of two modules, called Mapper and Runner, as shown in Figure 1. The Mapper module is composed of two submodules:

Translate Query to a Logical Query Plan: this submodule uses the Jena ARQ library to translate the SPARQL query into a Logical Query Plan (LQP) expressed with SPARQL Algebra operators. The LQP is represented with an RDF-centric syntax provided by Jena, which is called SPARQL Syntax Expression (SSE) [36]. Listing 2 shows an LQP of the SPARQL query example.

Convert Logical Query Plan into Flink program: this submodule converts each SPARQL Algebra operator in the query to a transformation from the DataSet API of Apache Flink, according to the correspondence described in Section 3. For instance, each triple pattern within a Basic Graph Pattern (BGP) is encoded as a combination of filter and map transformations, the leftjoin operator is encoded as a leftOuterJoin transformation, whereas the project operator is expressed as a map transformation. Listing 3 shows the Java Flink program corresponding to the SPARQL query example.

Listing 3. Java Flink program.

/*** Environment and Source (static RDF dataset) ***/
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Triple> dataset = LoadTriples.fromDataset(env, params.get("dataset"));

/*** Applying Transformations ***/
DataSet<SolutionMapping> sm1 = dataset
    .filter(new Triple2Triple(null, "http://xmlns.com/foaf/0.1/name", null))
    .map(new Triple2SM("?person", null, "?name"));

DataSet<SolutionMapping> sm2 = dataset
    .filter(new Triple2Triple(null, "http://xmlns.com/foaf/0.1/mbox", null))
    .map(new Triple2SM("?person", null, "?mbox"));

DataSet<SolutionMapping> sm3 = sm1.leftOuterJoin(sm2)
    .where(new JoinKeySelector(new String[]{"?person"}))
    .equalTo(new JoinKeySelector(new String[]{"?person"}))
    .with(new LeftJoin(new String[]{"?person"}));

DataSet<SolutionMapping> sm4 = sm3
    .map(new Project(new String[]{"?person", "?name", "?mbox"}));

/*** Sink ***/
sm4.writeAsText(params.get("output") + "Result", FileSystem.WriteMode.OVERWRITE)
    .setParallelism(1);

env.execute("SPARQL Query to Flink Program");

The Runner module allows executing a Flink program (as a jar file) on an Apache Flink stand-alone or local cluster mode. This module is composed of two submodules: Load RDF Dataset, which loads an RDF dataset in N-Triples format, and Functions, which contains several Java classes that solve the transformations within the Flink program.

Evaluation and Result
In this section, we present the evaluation of the performance of the SPARQL2Flink library by reusing a subset of the queries defined by the Berlin SPARQL Benchmark (BSBM) [27]. On the one hand, experiments were performed to empirically prove the correctness of the results of a SPARQL query transformed into a Flink program. On the other hand, experiments were carried out to show that our approach processes data that can scale as much as permitted by the underlying technology, in this case, Apache Flink. All experiments carried out in this section are available in [37].
The SPARQL2Flink library does not implement the SPARQL protocol and cannot be used as a SPARQL endpoint. This does not impose a strong limitation on our approach; supporting the protocol is an engineering task left for future work. For this reason, we do not use the test driver proposed in BSBM. Instead, we followed these steps:

Generate Datasets from BSBM
BSBM is built around an e-commerce use case in which a set of products is offered by different vendors while consumers post reviews about the products [27]. Different datasets were generated using the BSBM data generator by setting up the number of products, number of producers, number of vendors, number of offers, and number of triples, as shown in Table 1. For each dataset, one file was generated in N-Triples format. The name of each dataset reflects its size (in megabytes or gigabytes). The ds20mb dataset was used to perform the correctness tests. The ds300mb, ds600mb, ds1gb, ds2gb, and ds18gb datasets were used to perform the scalability tests on the local cluster.

Verify Which SPARQL Query Templates Are Supported
The BSBM offers 12 different SPARQL query templates to emulate the search and navigation pattern of a consumer looking for a product [27]. We modified the query templates, omitting SPARQL operators and expressions that are not yet implemented in the library. The SPARQL query templates Q1, Q2, Q3, Q4, Q5, Q7, Q8, Q10, and Q11 were instantiated. Table 2 summarizes the list of queries that are Supported (S), Partially Supported (PS), and Not Supported (NS) by the SPARQL2Flink library. Where a query is Partially Supported, the table details how it was modified so that it could be transformed into a Flink program. The instantiated SPARQL queries can be seen in [37].

Transform SPARQL Query into a Flink Program through SPARQL2Flink
First, SPARQL2Flink converts each SPARQL query (i.e., Q1, Q2, Q3, Q4, Q5, Q7, Q8, Q10, and Q11) into a Logical Query Plan expressed in terms of SPARQL Syntax Expressions (SSE). Then, each Logical Query Plan was transformed into a Flink program (packaged in a .jar file) by the SPARQL2Flink library [35]. The Logical Query Plans and the Flink programs are available in [37].
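For illustration, a simple SELECT query with a filter might be expressed in SSE roughly as follows (a hand-written sketch of Jena's algebra notation, with prefixed names shown unexpanded for readability; it is not output produced by the library):

```
(project (?product ?label)
  (filter (> ?value 100)
    (bgp
      (triple ?product rdfs:label ?label)
      (triple ?product bsbm:productPropertyNumeric1 ?value))))
```

Each algebra operator in the plan (project, filter, bgp) is then mapped to the corresponding Flink DataSet transformations.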

Perform Results Correctness Tests on a Standalone Environment
A formal proof of correctness is beyond the scope of this paper. However, empirical correctness tests were performed on the results of the nine queries that SPARQL2Flink fully or partially supports. The nine queries were executed independently on Apache Jena 3.6.0 and on Apache Flink 1.10.0 without Hadoop, using the ds20mb dataset. Both applications were set up on a laptop with an Intel Core i5 2.8 GHz, 8 GB RAM, a 1 TB solid-state disk, and macOS Sierra. In this test, we compared the results of running each SPARQL query on Apache Jena with those of the corresponding Flink program on Apache Flink. The results of each query were compared manually, checking whether they were the same; in all cases, they were. All results are available in [37].
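The manual comparison could also be mechanized. The sketch below (our illustration, not part of the SPARQL2Flink library) treats each result file as a multiset of rows, since SPARQL SELECT results are unordered unless ORDER BY is used, while duplicates must still match:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ResultComparer {
    // Compare two query result sets as multisets of rows,
    // i.e., ignoring row order but respecting duplicates.
    public static boolean sameResults(List<String> jenaRows, List<String> flinkRows) {
        if (jenaRows.size() != flinkRows.size()) return false;
        Map<String, Integer> counts = new HashMap<>();
        for (String row : jenaRows) counts.merge(row.trim(), 1, Integer::sum);
        for (String row : flinkRows) {
            Integer c = counts.get(row.trim());
            if (c == null || c == 0) return false; // row not present in Jena's output
            counts.put(row.trim(), c - 1);
        }
        return true; // sizes match and every Flink row was matched
    }
}
```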

Carry out Scalability Tests on a Local Cluster Environment
An Apache Flink cluster needs at least one Job Manager and one or more Task Managers. The Job Manager is the master that coordinates and manages the execution of the program; the Task Managers are the workers (slaves) that execute parts of the parallel programs. The parallelism of task execution is determined by the Task Slots available on each Task Manager. To carry out the scalability tests, we set up an Apache Flink local cluster with one master node and fifteen slave nodes, each slave node with one Task Slot. Table 3 shows the specifications of each node. The flink-conf.yaml file is read by both the Job Manager and the Task Managers and contains all cluster configuration as a flat collection of YAML (YAML Ain't Markup Language, https://yaml.org, accessed on 23 March 2020) key-value pairs. Most of the configuration is the same for the Job Manager node and the Task Manager nodes. The parameters listed in Listing 4 were set for the scalability tests; the parameters not listed were kept at their default values. A detailed description of each parameter can be found in [38].
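For illustration, a flink-conf.yaml fragment of the kind described above might look as follows. The keys are standard Flink 1.10 configuration options, but the values shown here are placeholders, not the exact settings of our cluster (those are given in Listing 4):

```yaml
# Address of the Job Manager (master) node
jobmanager.rpc.address: master-node
jobmanager.rpc.port: 6123

# Memory for the master and for each worker process
jobmanager.heap.size: 1024m
taskmanager.memory.process.size: 1728m

# One Task Slot per slave node, as in our setup
taskmanager.numberOfTaskSlots: 1

# Default parallelism for programs that do not set it explicitly
parallelism.default: 1
```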
Two scalability tests were conducted. In both, the primary measure is the execution time of the nine queries over different datasets and cluster sizes. Based on the number of available nodes, we configured five different clusters: C1 with one master node and four slave nodes; C2 with one master node and seven slave nodes; C3 with one master node and eight slave nodes; C4 with one master node and eleven slave nodes; and C5 with one master node and fifteen slave nodes.
The first scalability test evaluates the performance of SPARQL2Flink on the ds300mb, ds600mb, ds1gb, and ds2gb datasets, which contain different numbers of triples. The test was performed on the C1, C3, and C5 clusters. Each dataset was replicated on each node, and each node was configured with Apache Flink 1.10.0. A query executed on one dataset under one cluster configuration counts as an individual test. For instance, query Q1 was executed on the ds300mb dataset on cluster C1; then the same query was executed on the same dataset on clusters C3 and C5. The remaining eight queries were executed on the other datasets and clusters in the same way, for a total of 108 tests. Figure 2 depicts the query execution times of the nine queries in the first scalability test.

The second scalability test was performed on the C2, C4, and C5 clusters. For this test, Apache Flink 1.10.0 and Hadoop 2.8.3 were configured to use HDFS to store the ds18gb dataset. As in the first test, the nine queries were executed while varying the cluster configuration, for a total of 27 additional tests. Figure 3 depicts the query execution times of the nine queries in the second scalability test.

Each SPARQL query transformed into a Flink program generates a plan with several tasks. A task is Apache Flink's basic unit of execution; it is where each parallel instance of an operator runs. In terms of tasks, query Q1 generates 14 tasks, Q2 generates 31, Q3 generates 17, Q4 generates 28, Q5 generates 18, Q7 generates 29, Q8 generates 23, Q10 generates 18, and Q11 generates 6. All Flink program plans are available at [37]. In particular, the first task of each plan is associated with dataset loading, the last task with the creation of the file containing the query results, and the intermediate tasks with query execution.
Table 4 reports the dataset loading time (dlt), the query execution time (qet), and their sum (dlt+qet). dlt refers to the time spent moving each triple from a file into the Apache Flink local cluster; qet refers to the query execution time. The file creation time was ignored; in the worst case, it was at most 373 milliseconds. All query processing times are in seconds.

Related Work
Several proposals have documented the use of Big Data technologies for storing and querying RDF data [2][3][4][5][6]. The most common way so far to query massive static RDF data has been to rewrite SPARQL queries over the MapReduce Programming Model [7] and execute them on Hadoop [8] ecosystems. A detailed comparison of existing approaches can be found in the survey presented in [3], which provides a comprehensive description of RDF data management in large-scale distributed platforms, where storage and query processing are performed in a distributed fashion but under centralized control. The survey classifies the systems according to how they implement three fundamental functionalities: data storage, query processing, and reasoning; this determines how the triples are accessed and the number of MapReduce jobs. Additionally, it details the solutions adopted to implement those functionalities.
Another survey is [39], which presents a high-level overview of RDF data management, focusing on the main approaches that have been adopted. The discussion covers centralized RDF data management, distributed RDF systems, and querying over Linked Open Data. For distributed RDF systems in particular, it identifies and discusses four classes of approaches: cloud-based solutions, partitioning-based approaches, federated SPARQL evaluation systems, and partial evaluation-based approaches.
Given the success of NoSQL [40] (for "not only SQL") systems, a number of authors have developed RDF data management systems based on these technologies. The survey in [41] provides a comprehensive study of the state of the art in data storage techniques, indexing strategies, and query execution mechanisms in the context of RDF data processing. Part of this study summarizes the approaches that exploit NoSQL database systems for building scalable RDF management systems. In particular, [42] is a recent example under the NoSQL umbrella that efficiently evaluates SPARQL queries using MongoDB and Apache Spark. This work proposes an effective data model for storing RDF data in a document database, called node-oriented partition, using a maximum replication factor of 2 (i.e., in the worst case, the data graph doubles in storage size). Each query is decomposed into a set of generalized star queries, which ensures that no join operations over multiple datasets are required. The authors propose an efficient and simple distributed algorithm for partitioning large RDF data graphs, based on the fact that each SPARQL query Q can be decomposed into a set of generalized star queries that can be evaluated independently of each other and used to compute the answers of the initial query Q [43].
In recent years, new trends in Big Data Technologies such as Apache Spark [9], Apache Flink [10], and Google DataFlow [11] have been proposed. They use distributed in-memory processing and promise to deliver higher performance data processing than traditional MapReduce platforms [12]. In particular, Apache Spark implements a programming model similar to MapReduce but extends it with two abstractions: Resilient Distributed Datasets (RDDs) [44] and Data Frames (DF) [45]. RDDs are a distributed, immutable, and fault-tolerant memory abstraction and DF is a compressed and schema-enabled data abstraction.
The survey [46] summarizes the approaches that use Apache Spark for querying large RDF data. For example, S2RDF [47] proposes a novel relational schema and relies on a translation of SPARQL queries into SQL to be executed using Spark SQL. The new relational partitioning schema for RDF data is called Extended Vertical Partitioning (ExtVP) [48]. In this schema, the RDF triples are distributed in pairs of columns, each one corresponding to an RDF term (the subject and the object). The relations are computed at data load time using semi-joins, akin to the concept of Join Indices [49] in relational databases, to limit the number of comparisons when joining triple patterns. Each triple pattern of a query is translated into a single SQL query, and the query performance is optimized using the set of statistics and additional data structures computed during the data pre-processing step. The authors in [50] propose and compare five different query processing approaches based on different join execution models (i.e., partitioned join and broadcast join) on Spark components like the RDD, DF, and SQL APIs. Moreover, they propose a formalization for evaluating the cost of SPARQL query processing in a distributed setting. Their main conclusion is that Spark SQL does not (yet) fully exploit the variety of distributed join algorithms and plans that could be executed using the Spark platform, and they propose some guidelines for more efficient implementations.
Google DataFlow [11] is a Programming Model and Cloud Service for batch and stream data processing with a unified API. It is built upon Google technologies, such as MapReduce for batch processing, FlumeJava [51] for programming model definition, and MillWheel [52] for stream processing. Google released the Dataflow Software Development Kit (SDK) as an open-source Apache project, named Apache Beam [53]. There are no works reported to the best of our knowledge that use Google Data-Flow to process massive static RDF datasets and RDF streams.
In a similar line to our approach, the authors in [54] propose FLINKer, a proposal to manage large RDF datasets and resolve SPARQL queries on top of Flink/Gelly. In practice, FLINKer uses Gelly to provide the vertex-centric view on graph processing and some DataSet API operators to support each of the transformations required to resolve SPARQL queries. FLINKer uses the Jena ARQ SPARQL processor to parse a given SPARQL query and generate a parse tree of operations, which are then resolved through the existing operators in the DataSet API of Flink, i.e., map, flatmap, filter, project, flatjoin, reducegroup, and iteration operators. The computation is performed through a sequence of iteration steps called supersteps. The main advantage of using Gelly as a backend is that Flink has native iteration support, i.e., iterations do not require new job scheduling overhead.
The main differences between FLINKer and our proposal are the formalization of a set of PACT transformations implemented in the Apache Flink DataSet API and a formal mapping to translate a SPARQL query into a Flink program based on the DataSet API. We also provide an open-source implementation of our proposal as a Java library, available on Github under the MIT license. Moreover, unlike FLINKer, we present an evaluation of the correctness and performance of our tool by reusing a subset of the queries defined by the Berlin SPARQL Benchmark (BSBM). We did not find a FLINKer implementation available to perform a comparison against our proposal. Table 5 summarizes a comparison of the main features of the approaches that use Apache Flink, namely FLINKer and SPARQL2Flink, and of Apache Spark in combination with MongoDB.

Conclusions and Future Work
We have presented an approach for transforming SPARQL queries into Apache Flink programs for querying massive static RDF data. The main contributions of this paper are the formal definitions of a subset of Apache Flink's transformations, the definition of the semantic correspondence between this subset and the SPARQL Algebra operators, and the implementation of our approach as a library.
For the sake of simplicity, we limit our approach to SELECT queries with SPARQL Algebra operators such as Basic Graph Pattern and AND. This work is the first step towards building a hybrid (batch and streaming) SPARQL query system on a scalable Big Data ecosystem. In this respect, the preliminary scalability tests with SPARQL2Flink show promising results for processing SPARQL queries over static RDF data. In all cases (i.e., Figures 2 and 3), the query execution time decreases as the number of nodes in the cluster increases. However, improvements are still needed to optimize query processing. It is important to note that our approach and its implementation do not apply any optimization techniques; the generated Flink programs process raw triple datasets.
The static RDF datasets used in our approach are serialized in a traditional plain format called N-Triples, which is the simplest textual representation of RDF data but also the most verbose, since it does not allow URI abbreviation. The triples are given in subject, predicate, object order as three complete URIs, each enclosed in angle brackets (< and >) and separated by spaces; each statement appears on a single line terminated by a period (.). Processing such raw files demands considerable time and computational resources, so performance and scalability arise as significant issues, and their resolution is closely related to the efficient storage and retrieval of the semantic data. Loading the raw triples into RAM significantly affects the dataset loading time (dlt), as can be seen in the column labeled dlt in Table 4; this value increases with the size of the dataset. For example, for the dataset with 69,494,080 triples, the loading time is 938 s using 4 nodes of the local cluster when processing query Q1. This is because the SPARQL2Flink library does not yet implement optimization techniques. In future work, we will focus on two aspects: optimization techniques and RDF stream processing. We will study how to adapt optimization techniques inherent to SPARQL query processing, such as HDT compression [55][56][57] and multi-way join operators [58,59].
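For reference, a single N-Triples statement with a URI in the object position has the following shape (the URIs are illustrative, not taken from the BSBM datasets):

```
<http://example.org/offer1> <http://example.org/vocab/product> <http://example.org/product1> .
```

Because every URI must be written out in full on every line, file size, and hence loading time, grows quickly with the number of triples.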
For RDF stream processing, we will extend the PACT Data Model to describe a formal interpretation of the data stream notion, the different window types, and the windowing operation necessary to establish an encoding that translates CQELS-QL [60] queries into DataStream PACT transformations. In practice, the DataStream API of Apache Flink comes with predefined window assigners for the most common use cases, namely tumbling windows, sliding windows, session windows, and global windows. The window assigner defines how elements are assigned to windows. In particular, we focus on the tumbling and sliding window assigners in combination with time-based and count-based windows.
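To make the window notions concrete, the arithmetic behind time-based tumbling and sliding window assignment can be sketched in plain Java. This is a simplified illustration of the assignment logic (assuming non-negative timestamps and no offset), not Flink's actual WindowAssigner API:

```java
import java.util.ArrayList;
import java.util.List;

public class WindowSketch {
    // Start of the single tumbling window of length `size` that contains `ts`.
    // Tumbling windows partition time into consecutive, non-overlapping buckets.
    public static long tumblingStart(long ts, long size) {
        return ts - (ts % size);
    }

    // Starts of all sliding windows of length `size`, advancing by `slide`,
    // that contain `ts`. Each element belongs to size/slide overlapping windows.
    public static List<Long> slidingStarts(long ts, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = ts - (ts % slide); // most recent window start at or before ts
        for (long s = lastStart; s > ts - size; s -= slide) {
            starts.add(s); // window [s, s + size) contains ts
        }
        return starts;
    }
}
```

For example, with a window size of 10 and a slide of 5, an element with timestamp 10 falls into the two windows starting at 10 and at 5; a tumbling window degenerates to the slide being equal to the size.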