Semantic Data Federated Query Optimization Based on Decomposition of Block-Level Subqueries

Yao, Yuan; Zhang, Yang

doi:10.3390/fi17110531


by Yuan Yao ¹,* and Yang Zhang ²,*

¹ School of Engineering, University of Glasgow, Glasgow G12 8QQ, UK

² School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China

* Authors to whom correspondence should be addressed.

Future Internet 2025, 17(11), 531; https://doi.org/10.3390/fi17110531

Submission received: 13 October 2025 / Revised: 12 November 2025 / Accepted: 17 November 2025 / Published: 20 November 2025

(This article belongs to the Special Issue Internet of Things Technology and Service Computing)


Abstract

The digital age and the rise of Internet of Things technology have led to an explosion of data, including vast amounts of semantic data. In the context of large-scale semantic data graphs, centralized storage struggles to meet the efficiency requirements of the queries. This has led to a shift towards distributed semantic data systems. In federated semantic data systems, ensuring both query efficiency and comprehensive results is challenging because of data independence and privacy constraints. To address this, we propose a query processing framework featuring a block-level star decomposition method for generating efficient query plans, augmented by auxiliary indexes to guarantee the completeness of the results. A specialized FEDERATEDAND BY keyword is introduced for federated environments, and a partition-based parallel assembly method accelerates the result integration. Our approach demonstrably improves query efficiency and is analyzed for its potential application in energy systems.

Keywords:

federated semantic data; distributed SPARQL query; query results assembly

1. Introduction

As we enter the era of big data [1], the rapid development of the Internet of Things (IoT) has led to explosive growth in data, including vast amounts of knowledge data. To fully utilize this data and facilitate the rapid development of various industries, many enterprises and institutions have constructed semantic data knowledge networks [2]. Semantic data are mainly established through Resource Description Framework (RDF) [3] graph technology, and queries are implemented using SPARQL (SPARQL Protocol and RDF Query Language) [4,5]. In traditional distributed semantic data scenarios, the semantic query process involves all data content. However, because of privacy and data security concerns, certain datasets may not be fully accessible. This is particularly true for data related to sensitive areas, such as government work. Consequently, the concept of semantic queries in a federated distributed environment has emerged. For instance, within the energy sector, the ongoing development of energy systems, such as smart grids, integrated energy systems, wind/solar farm clusters, and distributed energy resource management, has led to increasing structural complexity. Data sources are diverse and managed by various entities (including power generation companies, grid operators, and consumers), and such data are typically subject to stringent privacy and security requirements. Therefore, semantic queries in a federated distributed environment can be effectively applied to energy systems.

In this context, a federated environment semantic query refers to a scenario in which data are managed and maintained by the publishing organization itself, and the data storage and organization structure cannot be modified. Typically, only a query interface for matching is exposed, through which the various semantic query functions must be supported. Therefore, further research on federated query technology is required.

Figure 1 illustrates a data snippet from the federated semantic dataset. Semantic data are primarily represented using triples that follow the subject–predicate–object structure. Entities are represented by nodes, and the predicate relationships are represented by edges. Although the data sources shown are independent, with data from different domains managed by different organizations, the expression of semantics may have correlations, and the corresponding semantic graphs may be interconnected, forming a complete data graph. The query process is a graph-matching process. As shown in Figure 2, a simple query is matched in the dataset, and the results of other fields are obtained using entity and predicate information in the query. Therefore, for the complex query shown in Figure 3, the matching results indicated by the red lines can be obtained in the federated environment depicted in Figure 1. Our research focuses on the accurate and efficient acquisition of such cross-source query results.

Current federated distributed system implementations [6], such as FedX [7] and SPLENDID [8], have limitations in query strategy generation and intermediate result assembly. In the query strategy generation phase, these approaches decompose the input query triple by triple, which makes the subquery granularity too small. Each subquery searches for data sources independently, so subquery matching produces an excessive number of intermediate results for each triple. To assemble the intermediate results, these methods rely on a simple iterative process involving multiple many-to-many matching operations, which incurs significant overhead. Alternatively, the VALUES clause can be used to implement federated queries with bindings by combining the assembly and federation matching processes. However, this approach significantly hinders the parallel execution capabilities of the system.

Consider the execution of the sample query depicted in Figure 3. The FedX method splits the query into individual triples. It then groups triples that have a single destination site into the same subquery, based on the data sources that the triples reference. Consequently, the final decomposition and execution plan, as shown in Figure 4, consists of numerous subqueries, each containing only one or two triples. This approach leads to a significant number of intermediate results after the matching stage, which negatively affects the subsequent assembly phase.

Therefore, existing solutions can be improved in terms of the number of intermediate results, query execution speed, semantic expression of federated queries, and comprehensive consideration of federated node structures. To address the prevalent issues of excessive intermediate results and sluggish query performance in federated query systems, this study proposes a novel approach that transforms input queries into larger block-level subqueries. This technique coarsens the subquery granularity, effectively reducing the number of intermediate results. Subsequently, the system performs a secondary reorganization of block subqueries that may not yield accurate results, which ensures the overall accuracy of the retrieved data. In essence, this approach optimizes both the semantic query strategy and the intermediate result assembly process within a federated distributed environment. Our main contributions are as follows.

(1): Our approach adopts a large-block decomposition strategy rather than traditional triple-splitting methods. This strategy generates larger subqueries after decomposition, effectively reducing the number of intermediate results and remote executions while maintaining the accuracy of the query results. Consequently, it improves the overall query efficiency.
(2): We incrementally convert a complex SPARQL query into graph patterns and introduce a novel keyword, “FEDERATEDAND BY”, specifically for federated environments. This keyword enhances the comprehensiveness of semantic expression within the data federation.
(3): Our solution optimizes the partition and join order by reducing the intermediate rejoin computation. We also incorporate the Bulk Synchronous Parallel (BSP) model to implement parallel assembly within a parallel architecture. This combined approach improves the query efficiency by leveraging the parallel capabilities of the system.

The remainder of this paper is organized as follows: Section 2 introduces the related work and the current research status in this field. Section 3 introduces the proposed semantic query strategy generation scheme. Section 4 presents the intermediate result assembly scheme. Section 5 presents the experimental results and analyses. Finally, in Section 6, we draw conclusions.

2. Related Work

Building on the challenges and objectives outlined in the Introduction, this section reviews the existing approaches to federated semantic query processing. We analyze both traditional distributed systems and specialized federated frameworks, highlighting their limitations in handling complex query structures and intermediate result assembly, which are key issues that our method aims to address.

In traditional distributed environments, query optimization is achieved mainly by studying different data partitioning strategies [9,10,11,12], by applying cloud computing platforms and components [13,14,15,16], by adjusting the organization of data storage [17,18], and by enhancing the parallelism of query computation. These approaches, such as the work of Peng P [19], operate on the overall data content and are therefore not fully applicable to federated data environments. Nevertheless, processing in a federated environment is likewise executed in two stages: query strategy generation and intermediate result assembly.

2.1. Semantic Query Strategy Generation

In federated environments, the optimization of the query strategy generation phase in existing systems [20] can be divided into two main approaches: metadata indexing and SPARQL syntax processing. The DARQ federated distributed semantic data system, introduced by Quilitz B [21], is the most classical indexing implementation, in which query strategy generation is optimized through the declarative indexing of the “service description” and of the Q-Tree. In a subsequent study, Prasser F [22] identified defects of the Q-Tree in DARQ, mitigated the information loss in the transformation of semantic triples to 3D vectors, and enhanced the accuracy of indexing.

On the basis of DARQ’s “service description” and the overall declarative index, Görlitz O proposed a new federated semantic data system, SPLENDID [8], which is mainly based on the VOID metadata model [23], including the quantitative information of dataset triples, entities, predicates, subjects, and objects. The SemaGrow system [24] also uses the source selection method of SPLENDID and performs cost-based query planning based on dataset VOID statistics. Hibiscus [25], a federated distributed RDF system proposed by Saleem et al., relies heavily on the ability to compute metadata. For each source, Hibiscus defines a set of functions that map the attributes to their subjects and objects. MULDER [26] describes data sources in terms of RDF molecular templates and uses these templates for source selection, query decomposition, and optimization.

FedX [7], proposed by Schwarte A, and Lusail, proposed by Abdelaziz, are two examples of federated query systems that do not use indexes and instead generate strategies based on SPARQL syntax. FedX traverses the triple patterns of a given query and issues a SPARQL ASK query for each one. Each federation member that returns TRUE in response is identified as containing at least one triple that matches the pattern. FedX also implements a heuristic-based query planner that pushes computation to local endpoints, together with optional query execution plan generation. Lusail [27] also uses SPARQL language features: it checks for results by combining the input query triples with statements such as FILTER NOT EXISTS and LIMIT 1, and thereby obtains the data source selection.

2.2. Query Plan Intermediate Result Assembly

In the intermediate result assembly phase, the processing of classical federated query systems [28,29] is largely centralized and can usually be divided into natural, bound, and hash joins. The classical federated query system DARQ mainly provides two types of intermediate result join. The first is the regular natural join, in which, during the iteration of the join process, the inner relation is scanned for each binding in the outer relation, and each binding that satisfies the join condition is added to the result set. The second is the bound join: following the decomposition order of the subqueries, the result returned by the previous subquery after execution at its data site is sent along with the next subquery to the corresponding site, where the two are bound using the VALUES clause and filtering is applied during matching. This serial matching process is highly inefficient in a distributed environment. In contrast to DARQ, FedX does not join on individual triples; it uses bound joins in a block nested loop, which greatly reduces the number of endpoint requests.

SPLENDID also implements both natural and bound joins. For the former, dynamic programming, a flexible optimization strategy commonly used in traditional relational databases, is used to optimize the join order of the SPARQL basic graph pattern. Bound joins, on the other hand, can significantly reduce the network overhead owing to SPLENDID’s division into exclusive groups. ADERIS is an adaptive query engine with index-based nested loop joins. Fan et al. designed overlapping lists and n-hop lists for federated environments with high transmission costs and geographically distant members, taking the network topology into account, to efficiently reduce the cost of inter-member communication.

3. Semantic Query Strategy Generation

3.1. Approach Overview

Our query strategy generation method consists of two main phases: query decomposition and data-source localization. This subsection provides a high-level overview of these phases before delving into the technical details in the subsequent subsections.

Query decomposition phase: In this phase, the system processes the basic triples (subject–predicate–object) in a query. We break the query down into blocks, specifically star structures. The system employs a cost model to identify the central nodes in the query graph, selecting them sequentially. Finally, following a greedy decomposition strategy, the system transforms the original graph pattern into a series of star queries, each centered on a chosen node.
Data source localization phase: This phase processes the decomposed subqueries sequentially. It first constructs a metadata mapping from the metadata of the data sources in the federated environment and then constructs multi-source predicate auxiliary indexes. Based on these indexes, the subqueries are further decomposed, and the triples that belong to separate data sources are split off.

3.2. General Triple Pattern Query Processing

For the metadata information obtained from the federated environment, the predicate information is first managed uniformly under a space-for-time trade-off: in the offline phase, the metadata set is transformed into the mapping Map = {<P,S>}, which pairs each predicate P with the set S of data sources in which it occurs.

When performing star decomposition on the input query graph, both the selection of center nodes and the order of decomposition need to be considered. We determine both by defining a node cost model and comparing the resulting node costs, which yields suitable star center nodes and a decomposition order.

In the RDF star graph, the cost model of node v is specifically defined as follows:

Val(v) = \frac{|OutD(v)| \cdot \min(N(p_{1}) + N(p_{2}) + \dots + N(p_{n})) \cdot 100}{\frac{Fre(p_{1}) + Fre(p_{2}) + \dots + Fre(p_{n})}{n}}

(1)

where Val(v) is the cost value computed for node v; OutD(v) is the out-degree of node v in the input query, that is, the number of triples it is associated with; N(p_i) is the number of data sources in which the predicate p_i of a triple associated with v occurs; and Fre(p_i) is the frequency with which that predicate occurs in the federated environment (the higher the frequency, the larger the value).

Using the node cost model, the cost of each node in the input query graph Q is calculated and sorted to obtain the node queue List(v). The first node t of List(v) is taken as the center node S1.Center of the first star subquery S1. With S1.Center as the center node, the triples associated with it are found in the input query and added to S1; once the search is complete, the star S1 is added to the star subquery queue StarList. This procedure is repeated until the star decomposition is complete.

For the input complex query, as shown in Figure 3, the query decomposition result, as shown in Figure 5, can be obtained after the first stage of center selection and star decomposition.
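The greedy center selection described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the node values `val` are taken as precomputed with Equation (1), candidates are visited in descending order of value (the text does not state the sort direction), and a triple counts as "associated" with a center when the center is its subject or object. The function name `star_decompose` and the sample query are hypothetical.

```python
def star_decompose(triples, val):
    """Greedily split a query graph into star subqueries.

    triples: list of (subject, predicate, object) patterns.
    val: dict mapping node -> precomputed cost value Val(v).
    """
    remaining = list(triples)
    # Candidate centers ordered by cost value (descending order is an assumption).
    queue = sorted(val, key=lambda v: val[v], reverse=True)
    stars = []
    for center in queue:
        # Collect the not-yet-assigned triples that touch the current center.
        star = [t for t in remaining if center in (t[0], t[2])]
        if not star:
            continue
        stars.append((center, star))
        remaining = [t for t in remaining if t not in star]
        if not remaining:
            break
    return stars


# Hypothetical query graph and node values.
query = [
    ("?A", "p1", "?B"),
    ("?A", "p2", "?C"),
    ("?B", "p3", "?D"),
    ("?D", "p4", "?E"),
]
val = {"?A": 30.0, "?D": 20.0, "?B": 10.0, "?C": 1.0, "?E": 1.0}
stars = star_decompose(query, val)
for center, star in stars:
    print(center, star)
```

In this toy input, the four triples collapse into two stars, so two subqueries would be dispatched instead of four single-triple subqueries.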

After decomposing the original query into a series of smaller subqueries using a star-shaped approach, we refined the processing in two ways. First, we filtered out data sources (destination data sites) that were unlikely to contain relevant results, thereby avoiding unnecessary data queries, in Algorithm 1. Second, we leverage the constructed auxiliary index to optimize the order of execution of these subqueries, potentially improving the overall efficiency, in Algorithm 2. This filtering is particularly effective for subqueries with a simpler structure, as we can confidently avoid sending them to data sources that do not have any results.

Algorithm 1 Auxiliary Index Construction Algorithm

Input: Data source predicate metadata mapping: Map = {<P, S>}
Output: Auxiliary index: Index

1 for i = 1 to |Map| do
2     if |S_i| == 1 then
3         Map.remove(P_i, S_i);
4 Initialize an empty query result map Res_MAP;
5 for i = 1 to |Map| − 1 do
6     Res_i = getLocation(P_i, Res_MAP); // Getting search results
7     for j = i + 1 to |Map| do
8         if |S_i| == |S_j| then
9             Res_j = getLocation(P_j, Res_MAP); // Getting search results
10            Res_ij = federateQuery(P_i ∪ P_j); // Execute join queries on datasets
11            if Res_i ⋈ Res_j = Res_ij then
12                Index.put(P_i ∪ P_j, true);
13            else
14                Index.put(P_i ∪ P_j, false);
15 return Index;
16 function getLocation(P, Map):
17     if Map.getKeySet().contains(P) then
18         L = Map.get(P);
19     else
20         L = federateQuery(P); // Execute queries on the dataset
21         Map.put(P, L);
22     return L;

The algorithm first discards predicates whose destination is a single data source (lines 1–3). Only multi-source predicates (those involving data from multiple sources) are kept; their number is inherently low owing to dataset characteristics, and even fewer predicate pairs share the same data sources (typically around a hundred). For a predicate pair such as P_i and P_j, the getLocation function first obtains the matching results of each predicate from its corresponding data sources (lines 6–9). The predicates are then combined for further matching, and the results are compared. If the results are identical, this suggests a strong correlation between the predicates, indicating that they are likely to always appear together. In such cases, they can be merged into a single subquery for efficiency. Within the getLocation function, we apply a “space-for-time” approach that caches results to avoid redundant predicate checks across multiple executions.
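To make the comparison step concrete, the following sketch simulates the index construction over small in-memory sources. It is one possible reading, offered under several assumptions: each source is a Python set of triples rather than a remote endpoint, two predicates are joined on a shared subject, "same data sources" is read as equality of the source sets, and the combined query is evaluated inside each source and then unioned. All names (`build_aux_index`, `local_pairs`, `join_on_subject`) are hypothetical.

```python
from itertools import combinations


def local_pairs(triples, pred):
    """(subject, object) pairs for one predicate in one source."""
    return {(s, o) for (s, p, o) in triples if p == pred}


def join_on_subject(left, right):
    """Join two (subject, object) relations on the shared subject."""
    return {(s, o1, o2) for (s, o1) in left for (s2, o2) in right if s == s2}


def build_aux_index(sources, pred_map):
    """sources: name -> set of triples; pred_map: predicate -> set of source names."""
    # Keep only multi-source predicates, as in lines 1-3 of the pseudocode.
    multi = {p: s for p, s in pred_map.items() if len(s) > 1}
    cache = {}

    def get_location(pred):
        # Space-for-time cache: each predicate is fetched at most once.
        if pred not in cache:
            cache[pred] = set().union(
                *(local_pairs(sources[n], pred) for n in multi[pred])
            )
        return cache[pred]

    index = {}
    for pi, pj in combinations(sorted(multi), 2):
        if multi[pi] != multi[pj]:
            continue
        # Results fetched per predicate, then joined centrally.
        separate = join_on_subject(get_location(pi), get_location(pj))
        # Combined subquery evaluated inside each source, then unioned.
        together = set()
        for n in multi[pi]:
            together |= join_on_subject(
                local_pairs(sources[n], pi), local_pairs(sources[n], pj)
            )
        index[frozenset((pi, pj))] = separate == together
    return index


# Hypothetical data: entity "b" has its name in s1 but its age in s2.
sources = {
    "s1": {("a", "name", "A"), ("a", "age", "1"), ("b", "name", "B")},
    "s2": {("b", "age", "2"), ("c", "name", "C"), ("c", "age", "3")},
}
pred_map = {"name": {"s1", "s2"}, "age": {"s1", "s2"}, "email": {"s1"}}
index = build_aux_index(sources, pred_map)
print(index)
```

Here the pair (name, age) is marked false because sending both predicates together to each source would miss the cross-source match for entity "b"; such a pair must stay in separate subqueries to keep the result complete.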

Because star decomposition does not guarantee perfect matching for all triples, we employ a secondary reorganization algorithm based on maximal subset computation. This step ensures effective reorganization using the constructed auxiliary indices.

Algorithm 2 Query Reorganization Algorithm

Input: Predicate data source mapping for subqueries: StarMap = {<basicTriple, S>}
Output: Subquery reorganization collection: finalStarQuery

1 Initialize an empty Dictionary Dict;
2 while !StarMap.isEmpty() do
3     for each basicTriple, S in StarMap do
4         for each stat in S do
5             if Dict.contains(stat) then // Constructing a Dict reverse map
6                 Dict.get(stat).add(basicTriple);
7             else
8                 Dict.add(stat, basicTriple);
9             end if
10        end for
11    end for
12    largeSubset = getSubset(Dict); // Get the set with the highest number of values
13    finalStarQuery.add(largeSubset);
14    for each P, S in StarMap do
15        if S.contains(largeSubset.keySet()) then
16            S.remove(largeSubset.keySet());
17        end if
18    end for
19 return finalStarQuery;

For the star query group after the initial decomposition, as shown in Figure 5, after query reorganization, the decomposition result, as shown in Figure 6, can be obtained and sent to the corresponding destination data site.
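The reorganization step can be illustrated with a short Python sketch. It follows the reverse-map idea of Algorithm 2 but fills in details the pseudocode leaves open, so it should be read as an assumption-laden illustration: a triple is dropped from the working map once every one of its sources has been served, and ties between equally large groups are broken by source name. The function name `reorganize` and the sample mapping are hypothetical.

```python
def reorganize(star_map):
    """star_map: triple -> set of candidate sources.

    Returns a list of (source, triples) subqueries, largest group first.
    """
    pending = {t: set(s) for t, s in star_map.items() if s}
    plan = []
    while pending:
        # Reverse map: source -> triples it can still answer.
        by_source = {}
        for triple, srcs in pending.items():
            for src in srcs:
                by_source.setdefault(src, []).append(triple)
        # Largest group first; ties broken by source name for determinism.
        src, group = max(sorted(by_source.items()), key=lambda kv: len(kv[1]))
        plan.append((src, group))
        # Mark this source as served for every triple in the group.
        for triple in group:
            pending[triple].discard(src)
            if not pending[triple]:
                del pending[triple]
    return plan


# Hypothetical triples of one star and the sources that hold their predicates.
t1 = ("?A", "p1", "?B")
t2 = ("?A", "p2", "?C")
t3 = ("?B", "p3", "?D")
star_map = {t1: {"s1"}, t2: {"s1", "s2"}, t3: {"s2"}}
plan = reorganize(star_map)
for source, triples in plan:
    print(source, triples)
```

The triple `t2` appears in both subqueries because its predicate occurs in both sources, which is what keeps the combined result complete.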

3.3. Handling of Complex Structures

In the previous subsection, we introduced the processing approach and specific flow for an input query composed of a basic graph pattern in a federated setting. SPARQL also provides keywords such as UNION and OPTIONAL. Depending on the structure they act on, these keywords can be categorized into keywords that target triples and expressions and keywords that target entities.

For composition over regular basic graph patterns (BGPs), where the input is a SPARQL query, we define the following: if q1 and q2 are SPARQL queries, then (q1 UNION q2), (q1 OPTIONAL q2), (q1 AND q2), and (q1 FILTER Expr) are all SPARQL queries. Accordingly, the complex query example in Figure 7 can be transformed into:

((q1 AND (q2 UNION q3)) OPTIONAL q4) FILTER Expr

(2)

Based on this definition, we disassemble the complex input query into a merger of multiple BGP queries with keyword functions, and discuss the impact of keyword effects on BGP results. For an input query q and a set G of federated environment RDF graphs, the full exact matching result of q on G is denoted as ‖q‖. The following definitions can be given for different keyword functions:

(1) If the input query q = q1 AND q2, then ‖q‖ = ‖q1‖ ⋈ ‖q2‖
(2) If the input query q = q1 UNION q2, then ‖q‖ = ‖q1‖ ∪ ‖q2‖
(3) If the input query q = q1 OPTIONAL q2, then ‖q‖ = (‖q1‖ ⋈ ‖q2‖) ∪ (‖q1‖ \ ‖q2‖)
(4) If the input query q = q1 FILTER Expr, then ‖q‖ = θ_Expr ‖q1‖

where, in definition (3), “\” denotes the relative difference of sets: ‖q1‖ \ ‖q2‖ is the set of elements that belong to ‖q1‖ but not to ‖q2‖. Retaining these elements realizes the optional semantics. In definition (4), “θ_Expr” denotes a second filtering of the ‖q1‖ matching results according to the filter expression Expr. In the query strategy generation phase, the processing of such keywords in the input query is abstracted into this process for iterative execution; the processing of a PatternGroup can be iterated to ensure that keywords are executed accurately when they are nested.
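As a small worked example, the four definitions can be evaluated over lists of variable bindings in Python. The sketch below is illustrative only. It assumes that each solution is a dict from variable name to value and that the difference in definition (3) follows the usual SPARQL-algebra reading, namely the mappings of ‖q1‖ that are compatible with no mapping of ‖q2‖. The function names and the sample bindings are hypothetical.

```python
def compatible(m1, m2):
    """Two mappings are compatible if they agree on every shared variable."""
    return all(m2[k] == v for k, v in m1.items() if k in m2)


def join(a, b):
    """Definition (1): AND."""
    return [{**m1, **m2} for m1 in a for m2 in b if compatible(m1, m2)]


def union(a, b):
    """Definition (2): UNION."""
    return a + [m for m in b if m not in a]


def diff(a, b):
    """Mappings of a that are compatible with no mapping of b."""
    return [m1 for m1 in a if not any(compatible(m1, m2) for m2 in b)]


def optional(a, b):
    """Definition (3): OPTIONAL."""
    return join(a, b) + diff(a, b)


def filter_(a, pred):
    """Definition (4): FILTER with a Python predicate standing in for Expr."""
    return [m for m in a if pred(m)]


# Hypothetical matching results of four BGP subqueries.
q1 = [{"x": 1}, {"x": 2}, {"x": 3}]
q2 = [{"x": 1, "y": "a"}]
q3 = [{"x": 2, "y": "b"}]
q4 = [{"x": 1, "z": "k"}]

# ((q1 AND (q2 UNION q3)) OPTIONAL q4) FILTER Expr, with Expr = (x > 1)
result = filter_(optional(join(q1, union(q2, q3)), q4), lambda m: m["x"] > 1)
print(result)
```

Because each operator takes and returns a list of bindings, nested keyword groups can be evaluated by plain function composition, mirroring the iterative treatment of nested pattern groups described above.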

Entity keywords are used to restrict the content of entity variables. The processing idea is similar; after processing the basic graph pattern, according to the variable fields involved in the clause, it will be inserted into the corresponding subquery to facilitate the rapid application of the assembly process.

ORDER BY: single-variable sorting or expression-variable sorting, combined with the ascending ASC keyword and descending DESC keywords.
FEDERATEDAND BY: In a federated environment, each organization has its own datasets and returns its own query results; the different results are then combined according to the field given in the FEDERATEDAND BY clause. This keyword is proposed to meet a new functional requirement that arises when querying semantically federated data.

Traditional federated database querying relies on the merging of sets. It performs matching across all data sources in the query graph and merges the results to obtain a complete answer. However, this approach can be inefficient. In contrast, the proposed FEDERATEDAND BY keyword allows filtering based on a specific field. Results are returned only if this field has the same value in all relevant data sources. However, for the FEDERATEDAND BY keyword to be effective, the specified field must be present in at least two data sources; otherwise, the query becomes meaningless.

For example, when querying a federated environment containing multiple wind farms, the FEDERATEDAND BY keyword allows users to specify the fields used to retrieve the equipment present across all wind farms. Filters can then be applied during query processing, which reduces redundant steps and improves efficiency.
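The intended effect of the keyword can be approximated with a short Python sketch. The text gives neither a concrete syntax nor an evaluation procedure at this point, so the following is only one plausible reading: per-source result rows are kept when the value of the chosen field occurs in the results of every source. The function name `federated_and_by`, the field name `equipment`, and the wind farm data are all hypothetical.

```python
def federated_and_by(results_by_source, field):
    """Keep only rows whose `field` value occurs in every source's results."""
    if len(results_by_source) < 2:
        # The text requires the field to be present in at least two sources.
        raise ValueError("FEDERATEDAND BY needs at least two data sources")
    value_sets = [
        {row[field] for row in rows if field in row}
        for rows in results_by_source.values()
    ]
    common = set.intersection(*value_sets)
    return {
        src: [row for row in rows if row.get(field) in common]
        for src, rows in results_by_source.items()
    }


# Hypothetical per-farm results of the same subquery.
farms = {
    "farmA": [
        {"equipment": "turbine-T1", "count": 12},
        {"equipment": "anemometer", "count": 3},
    ],
    "farmB": [
        {"equipment": "turbine-T1", "count": 8},
        {"equipment": "lidar", "count": 1},
    ],
}
shared = federated_and_by(farms, "equipment")
print(shared)
```

Only the equipment type present in both farms survives, in contrast to a plain union, which would return all four rows.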

4. Intermediate Results Assembly

After generating the query execution plan through star decomposition and auxiliary indexing, the next critical phase is the assembly of the intermediate results. This phase directly impacts the overall query performance, particularly in federated environments, where data sources are distributed and network latency varies. This section first presents a fundamental algorithm for centralized assembly and then introduces a partitioning approach based on subquery characteristics, which enables partition-based centralized assembly and improves efficiency. Next, a cost computation model is defined to identify the optimal strategy for result assembly. Finally, distributed result assembly is introduced: a Bulk Synchronous Parallel (BSP) model is adopted to design a synchronization algorithm specifically for distributed assembly. Integrating this model into the result splicing process maximizes the parallel assembly efficiency.

4.1. Centralized Assembly Based on Partitioning

Because the most tedious step in result assembly is the frequent check of whether two subqueries are joinable, we first summarize the possible subquery relationships, as shown in Figure 8. Based on the relationships in the figure, two subqueries are joinable if and only if their subgraphs are in relationship (1) or (3).

Based on this, our approach proposes a partition-based optimization technique. First, we explore the join order for the joinable subqueries.

Take the two joinable subqueries of subgraph relation (1) in Figure 6 as an example: subqueries q1 and q2 have a common node ?B. Considering only the cost of this single join computation, that is, the number of matches performed by the join, and defining N(q) as the number of results returned by the federated matching of query q, the cost of a single join is

Cost(q_{1} \bowtie q_{2}) = N(q_{1}) \cdot N(q_{2})

(3)

If n denotes the mean number of matching results for a single triple, so that N(q) = n, then a single join has time complexity O(n^2). For the iterative process of basic centralized assembly, if a single input query is split into t subqueries, fully assembling all of them has time complexity O(n^t), which imposes a huge join cost. Because we cannot predict the result of a single join, we cannot perform a uniform cost calculation over the whole query process to select the optimal join plan; therefore, following a greedy strategy, we select the currently optimal join at each step. The cost of a single join is directly related to the number of results of the subqueries involved: the fewer results the subqueries have, the lower the join cost and the fewer intermediate results remain after the join.

Therefore, consider joining subqueries q1, q2, and q3, assuming that they are pairwise joinable and N(q1) > N(q2) > N(q3). When q1 and q2 are joined first, in the worst-case scenario (when the join produces the most results),

N(q_{1} \bowtie q_{2}) = N(q_{1})

(4)

In the next join step,

N((q_{1} \bowtie q_{2}) \bowtie q_{3}) = N(q_{3})

(5)

Cost((q_{1} \bowtie q_{2}) \bowtie q_{3}) = N(q_{1}) \cdot N(q_{2}) + N(q_{1}) \cdot N(q_{3})

(6)

It can be observed that the cost of the second join is calculated using the larger N(q1). Therefore, to minimize the result of each join step, we prioritize the subquery with the smallest N(q). When this example is executed by joining q2 and q3 first and then q1, we obtain:

N((q_{2} \bowtie q_{3}) \bowtie q_{1}) = N(q_{3})

(7)

Cost((q_{2} \bowtie q_{3}) \bowtie q_{1}) = N(q_{2}) \cdot N(q_{3}) + N(q_{1}) \cdot N(q_{3})

(8)

It follows that Cost((q2⋈q3)⋈q1) < Cost((q1⋈q2)⋈q3); therefore, at each step, we select the query with the smallest value of N(q) to start the assembly.
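A quick numeric check makes the gap between the two join orders tangible. The sketch below simply evaluates Equations (6) and (8) as stated in the text; the result counts are hypothetical values chosen so that N(q1) > N(q2) > N(q3). Subtracting the two expressions gives N(q2) · (N(q1) − N(q3)), which is positive under that ordering.

```python
def cost_q1_q2_first(n1, n2, n3):
    """Equation (6): join q1 with q2 first, then q3."""
    return n1 * n2 + n1 * n3


def cost_q2_q3_first(n1, n2, n3):
    """Equation (8): join q2 with q3 first, then q1."""
    return n2 * n3 + n1 * n3


# Hypothetical result counts with N(q1) > N(q2) > N(q3).
n1, n2, n3 = 1000, 100, 10
large_first = cost_q1_q2_first(n1, n2, n3)
small_first = cost_q2_q3_first(n1, n2, n3)
saving = large_first - small_first

print(large_first, small_first, saving)
```

With these counts, starting from the two smallest subqueries reduces the number of match operations by an order of magnitude.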

This subsection builds upon the join order selection and outlines the rules for partitioning subqueries. Our approach aims to avoid the repeated checking of whether pairs of subqueries can be joined. We achieve this by iteratively dividing all the subqueries into multiple partitions (Algorithm 3). Each partition guarantees that its subqueries can be joined, whereas subqueries that cannot be joined are placed in separate partitions. Consequently, during processing, we only need to consider joining subqueries from the same partition, which significantly reduces the number of joinability checks.

Algorithm 3 Subquery Partitioning Algorithm

Input: Subquery collection: Q = {q_1, q_2, …, q_n}
Output: Partition results: partQ = {p_1, p_2, …}

1 Initialize an empty Set partQ;
2 count = 0; // counter
3 while count < n do
4     hostQ = getSmall(Q); Q = Q − hostQ; // See line 14 for the definition of getSmall
5     Initialize an empty Set partPlan;
6     partPlan.add(hostQ); count++;
7     for each qq in Q do
8         if hostQ and qq are joinable then
9             partPlan.add(qq); count++; Q = Q − qq;
10        end if
11    end for
12    partQ.add(partPlan);
13 return partQ;
14 function getSmall(Q) // Returns the subquery with the fewest results
15     min = Integer.MAX_VALUE;
16     for each q in Q do
17         if q.size() < min then
18             min = q.size(); res = q;
19     return res;

The results of one round of partitioning and two rounds of partitioning for a series of subqueries entered after regular processing (left side of Figure 9) are shown in Figure 9 and Figure 10, respectively.
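One round of the partitioning procedure can be sketched in Python as follows. This is a simplified illustration under stated assumptions: each subquery is described only by its variable set and its result count, and two subqueries are treated as joinable when they share at least one variable, which stands in for the relationship check of Figure 8. The function name `partition` and the sample subqueries are hypothetical.

```python
def partition(subqueries):
    """subqueries: name -> (set of variables, result count).

    Returns a list of partitions; each starts with its host subquery.
    """
    remaining = dict(subqueries)
    parts = []
    while remaining:
        # Host = subquery with the fewest results (names break ties).
        host = min(sorted(remaining), key=lambda q: remaining[q][1])
        host_vars, _ = remaining.pop(host)
        part = [host]
        for name in sorted(remaining):
            # Joinability is approximated by a shared variable (assumption).
            if host_vars & remaining[name][0]:
                part.append(name)
                del remaining[name]
        parts.append(part)
    return parts


# Hypothetical subqueries with their variables and result counts.
subqueries = {
    "q1": ({"?A", "?B"}, 50),
    "q2": ({"?B", "?C"}, 5),
    "q3": ({"?C", "?D"}, 20),
    "q4": ({"?E", "?F"}, 7),
    "q5": ({"?F"}, 9),
}
parts = partition(subqueries)
print(parts)
```

Each partition is led by its smallest subquery, so the join within a partition can start from the cheapest operand, consistent with the join order rule derived earlier.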

The partition-based assembly iterates until the final outcome is obtained. However, serial execution significantly affects performance. To address this, we propose a parallelized architecture with three key components managed by a scheduler: (1) a Partition Controller, which implements the partitioning strategy to divide the intermediate results into independent units; (2) a FIFO Task Queue, which stores and manages the partitioned results in first-in-first-out order; and (3) Worker Threads, which independently claim tasks from the queue, extract the partitioned content, and perform the assembly within each partition.

To minimize the synchronization overhead, idle worker threads are suspended and awakened only when new tasks become available. Additionally, they sleep only when the queue is empty to avoid unnecessary context-switching.
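The queue-and-worker arrangement described above might look like the following Python sketch. It is an illustration under stated assumptions, not the system's actual scheduler: `assemble_partitions`, `assemble`, and the `STOP` sentinel are hypothetical names, and the per-partition assembly is passed in as an arbitrary function. The blocking `queue.Queue.get()` call is what keeps idle workers asleep until a task arrives.

```python
import queue
import threading
from typing import Callable, List, Sequence


def assemble_partitions(
    partitions: Sequence[list],
    assemble: Callable[[list], object],
    num_workers: int = 4,
) -> List[object]:
    """Run `assemble` on every partition using a FIFO queue and worker threads."""
    tasks: queue.Queue = queue.Queue()   # FIFO task queue
    results: List[object] = []
    results_lock = threading.Lock()
    STOP = object()                      # sentinel that tells a worker to exit

    def worker() -> None:
        while True:
            part = tasks.get()           # blocks while the queue is empty
            if part is STOP:
                break
            out = assemble(part)
            with results_lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for part in partitions:              # the controller enqueues each partition
        tasks.put(part)
    for _ in threads:                    # one sentinel per worker
        tasks.put(STOP)
    for t in threads:
        t.join()
    return results


# Toy usage: "assembly" is just summing the numbers in each partition.
totals = assemble_partitions([[1, 2], [3], [4, 5, 6]], sum, num_workers=2)
print(sorted(totals))  # [3, 3, 15]
```

Because workers finish in a nondeterministic order, the results are sorted before printing; a real assembler would more likely key each result by its partition.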

4.2. Distributed Assembly Based on BSP

Centralized assembly places heavy computational pressure on the control node. Even after the join order and processing have been optimized, considerable room for optimization remains, particularly in the parallelism of the processing. Therefore, this subsection assembles the intermediate query results in a distributed manner. We introduce the Bulk Synchronous Parallel (BSP) model to design a synchronization algorithm for distributed assembly, as illustrated in Figure 11.

The local computation phase of BSP allows different processors to perform local computations in parallel and independently, which suits our partitioned computation for joining the intermediate results. Barrier synchronization marks the end of one round of computation, after which the computation proceeds to the next superstep. The communication phase exchanges the intermediate results and prepares for the next superstep.

1. Local Computation Segment

In our setting, this phase is mainly responsible for joining the intermediate results once partitioning is complete. It proceeds according to Algorithm 4:

Algorithm 4 Algorithm for local computation on node S_{j}

Input: all intermediate results at the ith superstep on node S_{j}: θ^{i}(S_{j}); the intermediate results received on node S_{j} after the (i−1)th superstep computation: θ_{in}^{i-1}(S_{j})

Output: the fully assembled results produced in this superstep: AN; the intermediate results to be sent: θ_{out}^{i}(S_{j})

1 Initialize empty sets θ, QS;
2 θ = θ_{in}^{i-1}(S_{j}) ⋃ θ^{i}(S_{j}) // a collection of all possible matches
3 QS = θ_{in}^{i-1}(S_{j})
4 while !QS.isEmpty() do
5     Initialize an empty set QS_next;
6     for each q in QS do
7         for each p in θ do
8             if p and q are joinable then
9                 pq = p ⋈ q;
10                if pq is a completed result then
11                    AN.add(pq);
12                else
13                    QS_next.add(pq);
14                end if
15            end if
16    θ_{out}^{i}(S_{j}).add(QS_next);
17    QS = QS_next;
18 θ^{i+1}(S_{j}) = θ^{i}(S_{j}) ⋃ θ_{out}^{i}(S_{j})
19 return AN, θ_{out}^{i}(S_{j});

Consider the ith superstep. For each data site S_{j}, θ_{in}^{i-1}(S_{j}) denotes all the intermediate results received in the ith superstep, and θ^{i}(S_{j}) denotes all the intermediate results held on node S_{j} at the ith superstep.

In the ith superstep, Algorithm 4 evaluates the received results θ_{in}^{i-1}(S_{j}) against all the results held by S_{j}. For each received intermediate result q, we check whether it can be joined with an intermediate result p in θ. If the join result is a complete query match, it is returned to the control node for collection. If the join result is still an intermediate result, we reintroduce it as a received result, check whether it can be joined further with other results, and also insert it into θ_{out}^{i}(S_{j}), so that every potentially matching join is sent to the other sites in the communication step discussed below.
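The local computation of Algorithm 4 can be sketched as follows. The data model is an assumption introduced only for this illustration: each intermediate result is a hypothetical `Partial` record holding the set of subqueries it already covers and its variable bindings, two results are treated as joinable when they cover disjoint subqueries and agree on at least one shared variable, and a result is complete when it covers every subquery. The article does not specify these details.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Set, Tuple


@dataclass(frozen=True)
class Partial:
    covered: FrozenSet[str]                # ids of the subqueries already matched
    bindings: Tuple[Tuple[str, str], ...]  # sorted (variable, value) pairs


def make(covered: Iterable[str], bindings: Dict[str, str]) -> Partial:
    return Partial(frozenset(covered), tuple(sorted(bindings.items())))


def joinable(p: Partial, q: Partial) -> bool:
    """Assumed rule: disjoint coverage and agreement on a shared variable."""
    if p.covered & q.covered:
        return False
    bp, bq = dict(p.bindings), dict(q.bindings)
    shared = bp.keys() & bq.keys()
    return bool(shared) and all(bp[v] == bq[v] for v in shared)


def join(p: Partial, q: Partial) -> Partial:
    merged = {**dict(p.bindings), **dict(q.bindings)}
    return Partial(p.covered | q.covered, tuple(sorted(merged.items())))


def local_computation(
    theta_local: Set[Partial], theta_in: Set[Partial], all_subqueries: FrozenSet[str]
) -> Tuple[Set[Partial], Set[Partial]]:
    """Return (AN, theta_out) for one superstep on one node."""
    theta = set(theta_local) | set(theta_in)   # all possible matches
    qs = set(theta_in)                          # results still to be extended
    answers: Set[Partial] = set()
    theta_out: Set[Partial] = set()
    while qs:
        qs_next: Set[Partial] = set()
        for q in qs:
            for p in theta:
                if joinable(p, q):
                    pq = join(p, q)
                    if pq.covered == all_subqueries:
                        answers.add(pq)         # complete match
                    else:
                        qs_next.add(pq)         # still intermediate
        theta_out |= qs_next
        qs = qs_next
    return answers, theta_out


# Toy run: q2 and q3 matches are local, a q1 match was received.
ALL = frozenset({"q1", "q2", "q3"})
local = {make({"q2"}, {"?y": "b", "?z": "c"}), make({"q3"}, {"?z": "c", "?w": "d"})}
received = {make({"q1"}, {"?x": "a", "?y": "b"})}
answers, theta_out = local_computation(local, received, ALL)
```

In this toy run, the received `q1` match first joins with the local `q2` match on `?y`; the resulting intermediate result is placed in `theta_out` and, in the next loop iteration, joins with the local `q3` match on `?z` to form one complete answer. The loop terminates because every join strictly enlarges the set of covered subqueries.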

2. Communication Segment

This phase manages the exchange of data between nodes. Consider the ith superstep. A simple communication strategy is as follows: if an intermediate result pq in θ_{out}^{i}(S_{j}) has the same defined variables as the full result θ^{i}(S_{k}) of node S_{k}, then pq is sent from S_{j} to site S_{k}.

However, this communication strategy may produce duplicate results. For example, when the intermediate result pq in θ_{out}^{i}(S_{j}) has the same defined variables as the full result θ^{i}(S_{k}) of node S_{k}, pq is sent from S_{j} to site S_{k}. Similarly, when the intermediate result pq’ in θ_{out}^{i}(S_{k}) has the same defined variables as the full result θ^{i}(S_{j}) of node S_{j}, pq’ is sent from S_{k} to site S_{j}. In other words, the join result pq ⋈ pq’ is produced at both S_{j} and S_{k}. This wastes resources and increases the total evaluation time.

To avoid duplicate result computations, we introduce a “one-way communication” approach: when the situation above occurs, only joins passed in one direction are considered. We write S_{j} < S_{k} when the numbers of intermediate results satisfy |θ^{i}(S_{j})| \leq |θ^{i}(S_{k})|. If a complete matching result RS consists of intermediate results on S_{j}, S_{j+1}, \dots, S_{k} with |θ^{i}(S_{j})| \leq |θ^{i}(S_{j+1})| \leq \dots \leq |θ^{i}(S_{k})|, then RS is generated only at site S_{k}, not at every participating node. Therefore, in each communication phase, we send only from the node with fewer intermediate results to the node with more. In the aforementioned example, if |θ^{i}(S_{j})| \leq |θ^{i}(S_{k})|, then only pq is sent, from S_{j} to site S_{k}, so that the same join is not assembled multiple times in the distributed environment.
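The one-way rule, under which results travel only from the site holding fewer intermediate results to the site holding more, can be expressed as a small filter over candidate messages. The sketch below is illustrative: `should_send` and `plan_messages` are hypothetical names, and breaking ties by site identifier when two sites hold equally many results is an added assumption, since the text does not say how equal counts are handled.

```python
from typing import Dict, List, Tuple

Message = Tuple[str, str, str]  # (sender site, receiver site, intermediate result id)


def should_send(sender: str, receiver: str, sizes: Dict[str, int]) -> bool:
    """Allow traffic only from the smaller site to the larger one.

    Assumption: ties on size are broken by comparing site identifiers, so that
    exactly one direction is permitted for every pair of distinct sites.
    """
    return (sizes[sender], sender) < (sizes[receiver], receiver)


def plan_messages(candidates: List[Message], sizes: Dict[str, int]) -> List[Message]:
    """Keep only the candidate messages permitted by the one-way rule."""
    return [m for m in candidates if should_send(m[0], m[1], sizes)]


# Two sites that would otherwise exchange results in both directions.
sizes = {"S_j": 3, "S_k": 8}   # number of intermediate results held by each site
candidates = [("S_j", "S_k", "pq"), ("S_k", "S_j", "pq'")]
plan = plan_messages(candidates, sizes)
print(plan)  # [('S_j', 'S_k', 'pq')]
```

Only one of the two symmetric transfers survives, so the join of `pq` and `pq'` is assembled at a single site.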

3. Barrier Synchronization

All communications for the mth superstep should be completed before entering the (m + 1)th superstep. Some provisions also need to be made for the initial state of the system (i.e., the 0th superstep) and system termination conditions.

In the 0th superstep, only the local matches and local query results are available at each site. Because assembling the results before they have been gathered is unnecessary and does not fit our processing logic, no local computation is required in the 0th superstep, and the process moves directly to the communication phase. Each site S_{j} sends its results to the other sites according to the communication strategy described above.

A key issue in the BSP algorithm is the number of supersteps required for the system to terminate. For a single input query graph Q, the query strategy and decomposition have already been determined, so we assume that Q is decomposed into N_{subquery}(Q) subqueries. In the worst case, where each join step combines only one pair of subqueries, at most N_{subquery}(Q) - 1 steps are needed. Thus, the maximum number of supersteps in the distributed system is N_{subquery}(Q) - 1.
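The bound can be checked with a toy simulation of the worst case, in which every superstep manages to combine only a single pair of partial results. The function name `worst_case_supersteps` and the representation of partial results as sets of subquery indices are assumptions made purely for illustration.

```python
def worst_case_supersteps(n_subqueries: int) -> int:
    """Count supersteps when each superstep joins exactly one pair of fragments."""
    # Each fragment records which subqueries it already covers.
    fragments = [{i} for i in range(n_subqueries)]
    steps = 0
    while len(fragments) > 1:
        merged = fragments[0] | fragments[1]   # the single join of this superstep
        fragments = [merged] + fragments[2:]
        steps += 1
    return steps


print(worst_case_supersteps(5))  # 4
```

Each join reduces the number of fragments by one, so n fragments need n - 1 joins before a single fragment covering the whole query remains.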

5. Experiments

To validate the effectiveness of our proposed methods for both query strategy generation (Section 3) and intermediate result assembly (Section 4), we conducted comprehensive experiments using standard benchmarks and comparison systems.

5.1. Setting

We evaluated our approach using two benchmark datasets: LargeRDFBench [28] and WatDiv [29].

LargeRDFBench: This is a large-scale benchmark suite that provides real-world datasets and queries for evaluating semantic data management systems. It covers various categories and domains, including life science and gene semantics. It is available at https://github.com/dice-group/LargeRDFBench (accessed on 23 September 2024).

WatDiv: This is a synthetic semantic dataset generator developed at the University of Waterloo. It allows users to create datasets of different sizes based on specified parameters. We generated multiple random datasets of varying sizes and used the traditional METIS partitioning strategy to split the centralized data into slices across a federated distributed system. It is available at http://dsg.uwaterloo.ca/watdiv/ (accessed on 23 September 2024).

For testing, we focused on four common query graph structures: chain, star, snowflake, and complex. We selected three queries for each structure (L1–L3, S1–S3, F1–F3, and C1–C3) to evaluate the effectiveness of our query strategy generation without relying on a single structural model. The computation environment is shown in Table 1.

5.2. Feasibility Test

To evaluate the effectiveness of our query strategy generation approach, we conducted an ablation experiment comparing it with the fedQuery scheme, which uses traditional triple-based decomposition and localization methods. Both approaches were integrated into the basic iterative assembly scheme, and multiple runs were executed to compare the average number of query matches and the query strategy generation time. The results are shown in Figure 12.

Building on our optimized decomposition scheme, we then attached each of the two proposed assembly strategies, compared them with the basic iterative assembly scheme (basicAssm), and recorded the run times of the query test groups, as shown in Figure 13.

The test results demonstrate significant improvements in subquery execution compared with the fedQuery approach. For queries other than chain structures, our method typically reduces the number of subquery executions by 30–40%. This improvement was even more pronounced when the triples were distributed more uniformly across the data sources. Additionally, the use of indexes provides some reduction in the decomposition time.

While both decomposition strategies achieve similar overall query execution efficiency trends, our assembly scheme offers optimizations compared with the basic centralized iterative approach. The distributed assembly leads to better optimization when dealing with smaller intermediate results. This is because the overhead of setting up a distributed environment is lower for smaller datasets, allowing distributed processing to leverage its parallel-processing speed for greater efficiency gains. In conclusion, our proposed optimizations are particularly effective for handling complex queries.

5.3. Efficiency Test

To evaluate the effectiveness of our solutions in this paper, we compared them with two classic federated semantic query systems: FedX and SPLENDID. These systems are well-established references for SPARQL-querying efficiency. We focused on natural semantic data using the LargeRDFBench dataset. The experimental results are shown in Figure 14.

The experiment revealed key differences between the approaches. FedX incurs significantly more remote executions than the others. This is because it uses ASK statements to probe the data sources for each subquery during selection. SPLENDID, owing to its decomposition method, has slightly more remote executions than our approach (fedQuery).
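To make the cost of ASK-based probing concrete, the following sketch builds the probe queries that a source-selection step of this kind would issue. It is only an illustration: the endpoint URLs, the predicate IRIs, and the helper name `ask_probes` are hypothetical, and issuing one probe per triple pattern per endpoint with no caching is an assumed simplification rather than a description of FedX internals. Only the `ASK { ... }` query form itself is standard SPARQL.

```python
from typing import List, Tuple


def ask_probes(triple_patterns: List[str], endpoints: List[str]) -> List[Tuple[str, str]]:
    """Build one (endpoint, ASK query) pair per pattern and endpoint."""
    return [
        (endpoint, f"ASK {{ {pattern} }}")
        for pattern in triple_patterns
        for endpoint in endpoints
    ]


# Hypothetical query patterns and endpoints.
patterns = [
    "?d <http://example.org/name> ?n .",
    "?d <http://example.org/target> ?g .",
    "?g <http://example.org/label> ?l .",
]
endpoints = ["http://example.org/sparql/a", "http://example.org/sparql/b"]

probes = ask_probes(patterns, endpoints)
print(len(probes))  # 6
```

Under this assumption the number of remote requests spent on source selection grows with the product of the number of patterns and the number of endpoints, before any subquery has been evaluated.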

In terms of overall efficiency, SPLENDID’s federated query processing is less efficient than the others, often exceeding FedX’s runtime. In contrast, our proposed distributed parallel assembly process (BSPAssm) demonstrates better efficiency after accessing the same subqueries as FedX. Although the performance improvement for small queries (L1–L3) is modest, BSPAssm achieves a significant 20% runtime improvement for complex queries compared to FedX. This highlights the effectiveness of our optimization in the intermediate result assembly phase of the federated queries.

6. Conclusions

This study proposes a novel approach for processing semantic queries in federated databases. It departs from traditional methods that decompose queries based on individual triples. Instead, it uses a cost model for the star decomposition of the input query, minimizing redundant intermediate results. To ensure comprehensive results, this study introduces a metadata-based auxiliary index for subquery location and secondary reorganization. Additionally, it handles complex SPARQL queries with keywords by transforming them into basic graph patterns during decomposition. For intermediate result assembly, this study proposes a generic algorithm that analyzes subgraph relationships. It further introduces a partition-based optimization method to reduce the number of matches and a parallel structure to improve efficiency. Recognizing the potential for parallel processing in distributed environments, this study leverages the BSP synchronous parallel model for parallel assembly.

In future work, we will extend the processing to cover more comprehensively the complex query forms added in SPARQL 1.1, such as property path queries, aggregation queries, and keyword queries. We will also consider data scalability and adaptability, network quality, and related issues in query processing. Through more in-depth research, the potential applications of this technology across various sectors, such as energy systems, should be explored.

Author Contributions

Conceptualization, Y.Z. and Y.Y.; methodology, Y.Z. and Y.Y.; software, Y.Z.; validation, Y.Z. and Y.Y.; formal analysis, Y.Y.; investigation, Y.Z. and Y.Y.; resources, Y.Z.; data curation, Y.Y.; writing—original draft preparation, Y.Z. and Y.Y.; writing—review and editing, Y.Y.; visualization, Y.Y.; supervision, Y.Z.; project administration, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was funded by Yuan Yao.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Arena, F.; Pau, G. An overview of big data analysis. Bull. Electr. Eng. Inform. 2020, 9, 1646–1653.
2. Hitzler, P. A review of the semantic web field. Commun. ACM 2021, 64, 76–83.
3. Tomaszuk, D.; Hyland-Wood, D. RDF 1.1: Knowledge representation and data integration language for the Web. Symmetry 2020, 12, 84.
4. DuCharme, B. Learning SPARQL: Querying and Updating with SPARQL 1.1; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2013.
5. Tasar, C.O.; Komesli, M.; Unalir, M.O. A comparative review for question answering frameworks on the linked data. Recent Adv. Comput. Sci. Commun. 2021, 14, 1695–1705.
6. Papadaki, M.E.; Tzitzikas, Y.; Mountantonakis, M. A brief survey of methods for analytics over RDF knowledge graphs. Analytics 2023, 2, 55–74.
7. Schwarte, A.; Haase, P.; Hose, K.; Schenkel, R.; Schmidt, M. Fedx: Optimization techniques for federated query processing on linked data. In Proceedings of the International Semantic Web Conference, Bonn, Germany, 23–27 October 2011; Springer: Berlin/Heidelberg, Germany, 2011; pp. 601–616.
8. Görlitz, O.; Staab, S. SPLENDID: SPARQL Endpoint Federation Exploiting VOID Descriptions. COLD 2011, 782, 13–24.
9. Huang, J.; Abadi, D.J.; Ren, K. Scalable SPARQL querying of large RDF graphs. Proc. VLDB Endow. 2011, 4, 1123–1134.
10. Lee, K.; Liu, L.; Tang, Y.; Zhang, Q.; Zhou, Y. Efficient and customizable data partitioning framework for distributed big RDF data processing in the cloud. In Proceedings of the 2013 IEEE Sixth International Conference on Cloud Computing, Santa Clara, CA, USA, 28 June–3 July 2013; IEEE: New York, NY, USA, 2013; pp. 327–334.
11. Gurajada, S.; Seufert, S.; Miliaraki, I.; Theobald, M. TriAD: A Distributed Shared-nothing RDF Engine Based on Asynchronous Message Passing. In Proceedings of the 14th International Conference on ACM Special Interest Group on Management of Data, Snowbird, UT, USA, 22–27 June 2014; pp. 289–300.
12. Peng, P.; Zou, L.; Chen, L.; Zhao, D. Query Workload-based RDF graph fragmentation and allocation. In Proceedings of the 19th International Conference on Extending Database Technology, Bordeaux, France, 15–18 March 2016; pp. 377–388.
13. Merceedi, K.J.; Sabry, N.A. A comprehensive survey for hadoop distributed file system. Asian J. Res. Comput. Sci. 2021, 11, 46–57.
14. Gupta, S.K.; Yadav, S.K.; Soni, S.K. Exploring the Power of Big Data for IoT: A Comprehensive Review. In Proceedings of the 2023 International Conference on IoT, Communication and Automation Technology (ICICAT), Gorakhpur, India, 23–24 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–6.
15. Thakur, S.; Jha, S.K. Cloud Computing and its Emerging Trends on Big Data Analytics. In Proceedings of the 2023 4th International Conference on Electronics and Sustainable Communication Systems (ICESC), Coimbatore, India, 6–8 July 2023; IEEE: New York, NY, USA, 2023; pp. 1159–1164.
16. Schätzle, A.; Przyjaciel-Zablocki, M.; Lausen, G. PigSPARQL: Mapping SPARQL to pig latin. In Proceedings of the International Workshop on Semantic Web Information Management, Athens, Greece, 12–16 June 2011; pp. 1–8.
17. Papailiou, N.; Tsoumakos, D.; Konstantinou, I.; Koziris, N. H₂RDF+: An efficient data management system for big RDF graphs. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, Snowbird, UT, USA, 22–27 June 2014; pp. 909–912.
18. Shao, B.; Wang, H.; Li, Y. Trinity: A distributed graph engine on a memory cloud. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 22–27 June 2013; pp. 505–516.
19. Peng, P.; Zou, L.; Özsu, M.T.; Chen, L.; Zhao, D. Processing SPARQL queries over distributed RDF graphs. VLDB J. 2016, 25, 243–268.
20. Cheng, S.; Hartig, O. Source Selection for SPARQL Endpoints: Fit for Heterogeneous Federations of RDF Data Sources? In Proceedings of the QuWeDa@ ISWC, Hangzhou, China, 23–27 October 2022; pp. 5–16.
21. Quilitz, B.; Leser, U. Querying distributed RDF data sources with SPARQL. In Proceedings of the Semantic Web: Research and Applications: 5th European Semantic Web Conference, ESWC 2008, Tenerife, Canary Islands, Spain, 1–5 June 2008; Proceedings 5. Springer: Berlin/Heidelberg, Germany, 2008; pp. 524–538.
22. Prasser, F.; Kemper, A.; Kuhn, K.A. Efficient distributed query processing for autonomous RDF databases. In Proceedings of the 15th International Conference on Extending Database Technology, Berlin, Germany, 27–30 March 2012; pp. 372–383.
23. Cimiano, P.; Chiarcos, C.; McCrae, J.P.; Gracia, J. Modelling metadata of language resources. In Linguistic Linked Data: Representation, Generation and Applications; Springer: Berlin/Heidelberg, Germany, 2020; pp. 123–135.
24. Charalambidis, A.; Troumpoukis, A.; Konstantopoulos, S. SemaGrow: Optimizing federated SPARQL queries. In Proceedings of the 11th International Conference on Semantic Systems, Vienna, Austria, 15–17 September 2015; pp. 121–128.
25. Saleem, M.; Ngomo, A.C.N. HiBISCuS: Hypergraph-based source selection for SPARQL endpoint federation. In Proceedings of the European Semantic Web Conference, Crete, Greece, 25–29 May 2014; Springer: Cham, Switzerland, 2014; pp. 176–191.
26. Endris, K.M.; Galkin, M.; Lytra, I.; Mami, M.N.; Vidal, M.-E.; Auer, S. MULDER: Querying the linked data web by bridging RDF molecule templates. In Proceedings of the Database and Expert Systems Applications: 28th International Conference, DEXA 2017, Lyon, France, 28–31 August 2017; Proceedings, Part I 28. Springer International Publishing: Berlin/Heidelberg, Germany, 2017; pp. 3–18.
27. Abdelaziz, I.; Mansour, E.; Ouzzani, M.; Aboulnaga, A.; Kalnis, P. Lusail: A system for querying linked data at scale. Proc. VLDB Endow. 2017, 11, 485–498.
28. Azevedo, L.G.; de Souza Soares, E.F.; Souza, R.; Moreno, M. Modern Federated Database Systems: An Overview. In Proceedings of the 22nd International Conference on Enterprise Information Systems: ICEIS 2020, Virtual, 5–7 May 2020; Volume 1, pp. 276–283.
29. Gu, Z.; Corcoglioniti, F.; Lanti, D.; Mosca, A.; Xiao, G.; Xiong, J.; Calvanese, D. A systematic overview of data federation systems. Semant. Web 2024, 15, 107–165.

Figure 1. Federated Distributed Semantic Data Example.

Figure 2. Graph Matching Process Example.

Figure 3. Complex SPARQL query examples.

Figure 4. FedX method query decomposition results.

Figure 5. Preliminary results of the query decomposition.

Figure 6. Query decomposition results.

Figure 7. Example of a complex input query with keywords.

Figure 8. Summary of the Subgraph Relationships.

Figure 9. Centralized assembly results of the first round of partitioning.

Figure 10. Centralized assembly results of the second round of partitioning.

Figure 11. BSP Structure Schematic.

Figure 12. Results of the ablation experiments.

Figure 13. Comparison of run times for different assemblies.

Figure 14. Efficiency test run results.

Table 1. Hardware Setting.

Node	Processor	Memory	Hard Disk	Network
Controlling node	2.30 GHz with 8 cores	512 GB	8 TB	100 MBPS
Plain node	3.06 GHz	16 GB	200 GB	100 MBPS

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

