Semantic Data Federated Query Optimization Based on Decomposition of Block-Level Subqueries
Abstract
1. Introduction
- (1) Our approach adopts a large-block decomposition strategy instead of the traditional triple-by-triple splitting. Decomposition therefore yields larger subqueries, which reduces the number of intermediate results and remote executions while preserving the accuracy of the query results, and so improves overall query efficiency.
- (2) We incrementally convert a complex SPARQL query into graph patterns and introduce a new keyword, “FEDERATEDAND BY”, designed specifically for federated environments. This keyword makes semantic expression within the data federation more comprehensive.
- (3) Our solution optimizes the partitioning and join order to reduce the computation spent re-joining intermediate results. We also incorporate the Bulk Synchronous Parallel (BSP) model to assemble intermediate results in parallel, which improves query efficiency by exploiting the parallel capabilities of the system.
2. Related Work
2.1. Semantic Query Strategy Generation
2.2. Query Plan Intermediate Result Assembly
3. Semantic Query Strategy Generation
3.1. Approach Overview
- In this phase, the system processes the basic triple patterns (subject–predicate–object) of a query. It breaks the query down into blocks, with particular attention to star structures. The system uses a cost model to identify the central node in the query graph, selecting one central node at a time. Finally, following a greedy decomposition strategy, it transforms the original graph pattern into a series of star queries, each centered on a chosen node.
- Data source localization phase: This phase processes the decomposed subqueries sequentially. It first constructs a metadata mapping from the metadata of the data sources in the federated environment and then builds multi-source predicate-assisted indexes. Based on these indexes, the subqueries are decomposed further, and triple patterns whose data resides in separate data sources are split apart.
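The greedy star decomposition described in the first phase can be sketched as follows. This is a minimal illustration, not the authors' implementation: the source does not specify the cost model here, so the sketch assumes the cheapest choice is the node that touches the most remaining triple patterns. The names `Triple` and `decompose_into_stars` and the sample query are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def decompose_into_stars(triples: List[Triple]) -> List[Tuple[str, List[Triple]]]:
    """Greedily split a basic graph pattern into star-shaped subqueries."""
    remaining = list(triples)
    stars = []
    while remaining:
        # Assumed cost model: count how many remaining patterns touch each node.
        degree = Counter()
        for s, _, o in remaining:
            degree[s] += 1
            if o != s:
                degree[o] += 1
        # Choose the highest-degree node; ties are broken by name for determinism.
        center = min(degree, key=lambda node: (-degree[node], node))
        # Every remaining pattern that touches the center joins its star.
        star = [t for t in remaining if center in (t[0], t[2])]
        remaining = [t for t in remaining if center not in (t[0], t[2])]
        stars.append((center, star))
    return stars


query = [
    ("?film", "dbo:director", "?director"),
    ("?film", "dbo:starring", "?actor"),
    ("?film", "rdfs:label", "?title"),
    ("?director", "dbo:birthPlace", "?city"),
]
stars = decompose_into_stars(query)
```

In this example the three patterns around `?film` form one large block and the remaining pattern forms a second one, so two subqueries are dispatched instead of four single-triple requests.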
3.2. General Triple Pattern Query Processing
| Algorithm 1 Auxiliary Index Construction Algorithm |
| Input: Data source predicate metadata mapping: |
| Output: auxiliary index: |
| 1 for to do |
| 2 if then |
| 3 |
| 4 Initialize an empty query result map Res_MAP; |
| 5 for to do |
| 6 // Getting search results |
| 7 for to do |
| 8 if then |
| 9 // Getting search results |
| 10 // Execute join queries on datasets |
| 11 if then |
| 12 |
| 13 else |
| 14 |
| 15 return Index; |
| 16 function |
| 17 if then |
| 18 |
| 19 else |
| 20 // Execute queries on the dataset |
| 21 |
| 22 return ; |
| Algorithm 2 Query Reorganization Algorithm |
| Input: Predicate data source mapping for subqueries: |
| Output: Subquery reorganization collection: |
| 1 Initialize an empty Dictionary |
| 2 while then |
| 3 for each in do |
| 4 for each in do |
| 5 if then // Constructing a Dict reverse map |
| 6 |
| 7 else |
| 8 |
| 9 end if |
| 10 end for |
| 11 end for |
| 12 |
| // Define the method to get the set with the highest number of values. |
| 13 |
| 14 for each in do |
| 15 if then |
| 16 |
| 17 end if |
| 18 end for |
| 19 return |
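Because the symbols of Algorithm 2 are not recoverable from the listing above, the following is only a hedged sketch of the idea its comments describe: build a reverse map from data sources to predicates, repeatedly take the source that covers the most predicates, and group those predicates into one subquery for that source. The function name `reorganize`, the tie-breaking rule, and the sample mapping are assumptions.

```python
from typing import Dict, List, Set, Tuple


def reorganize(pred_sources: Dict[str, Set[str]]) -> List[Tuple[str, Set[str]]]:
    """Group predicates by data source, largest group first (greedy)."""
    # Predicates with no known source cannot be assigned and are skipped.
    pending = {pred: set(srcs) for pred, srcs in pred_sources.items() if srcs}
    groups: List[Tuple[str, Set[str]]] = []
    while pending:
        # Reverse map: data source -> predicates it can still answer.
        reverse: Dict[str, Set[str]] = {}
        for pred, sources in pending.items():
            for src in sources:
                reverse.setdefault(src, set()).add(pred)
        # Pick the source with the most predicates; ties are broken by name.
        best = min(reverse, key=lambda src: (-len(reverse[src]), src))
        groups.append((best, reverse[best]))
        # Predicates assigned to a group are removed from further rounds.
        for pred in reverse[best]:
            del pending[pred]
    return groups


mapping = {
    "dbo:director": {"dbpedia"},
    "dbo:starring": {"dbpedia", "lmdb"},
    "rdfs:label": {"dbpedia", "lmdb"},
    "lmdb:runtime": {"lmdb"},
}
groups = reorganize(mapping)
```

Here both sources initially cover three predicates, so the tie-break selects `dbpedia`, and only `lmdb:runtime` is left for `lmdb`. A real system would presumably break such ties with cost estimates rather than names.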
3.3. Handling of Complex Structures
- (1) If the input query q = q1 AND q2, then ‖q‖ = ‖q1‖ ⋈ ‖q2‖
- (2) If the input query q = q1 UNION q2, then ‖q‖ = ‖q1‖ ∪ ‖q2‖
- (3) If the input query q = q1 OPTIONAL q2, then ‖q‖ = (‖q1‖ ⋈ ‖q2‖) ∪ (‖q1‖ \ ‖q2‖)
- (4) If the input query q = q1 FILTER Expr, then ‖q‖ = θ_Expr ‖q1‖
- Here, in definition (3), “\” denotes the relative set difference: ‖q1‖ \ ‖q2‖ is the set of elements that belong to ‖q1‖ but not to ‖q2‖. Retaining these elements is what makes the second pattern optional. In definition (4), θ_Expr denotes a second filtering pass over the matching results of ‖q1‖ according to the expression Expr. In the query strategy generation phase, the handling of these keywords in the input query is abstracted into this process and executed iteratively. The processing of a PatternGroup can itself be iterated, which ensures that the keywords are executed accurately when they are nested.
- ORDER BY: sorts by a single variable or by an expression over variables, combined with the ASC (ascending) and DESC (descending) keywords.
- FEDERATEDAND BY: In a federated environment, each organization holds its own datasets and returns its own query results; the keyword “FEDERATEDAND BY” then composes these separate results. It is proposed to meet this new requirement, which arises when querying semantically federated structured data.
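The four definitions for AND, UNION, OPTIONAL, and FILTER can be illustrated with a small evaluator over sets of solution mappings. This is a sketch under stated assumptions: mappings are plain Python dictionaries, and the difference “\” is implemented with the usual SPARQL notion of compatibility (a mapping of ‖q1‖ is kept when no mapping of ‖q2‖ agrees with it on their shared variables). All function names and the sample data are hypothetical.

```python
from typing import Callable, Dict, List

Mapping = Dict[str, str]  # variable -> bound value


def compatible(m1: Mapping, m2: Mapping) -> bool:
    """Two mappings are compatible if they agree on every shared variable."""
    return all(m2[var] == val for var, val in m1.items() if var in m2)


def join(a: List[Mapping], b: List[Mapping]) -> List[Mapping]:
    """q1 AND q2: merge every compatible pair of mappings."""
    return [{**m1, **m2} for m1 in a for m2 in b if compatible(m1, m2)]


def union(a: List[Mapping], b: List[Mapping]) -> List[Mapping]:
    """q1 UNION q2: keep the mappings of both sides."""
    return a + b


def difference(a: List[Mapping], b: List[Mapping]) -> List[Mapping]:
    """Mappings of q1 that have no compatible partner in q2."""
    return [m1 for m1 in a if not any(compatible(m1, m2) for m2 in b)]


def optional(a: List[Mapping], b: List[Mapping]) -> List[Mapping]:
    """q1 OPTIONAL q2: extended mappings plus the unmatched rest of q1."""
    return join(a, b) + difference(a, b)


def filter_(a: List[Mapping], expr: Callable[[Mapping], bool]) -> List[Mapping]:
    """q1 FILTER Expr: keep the mappings for which Expr holds."""
    return [m for m in a if expr(m)]


q1 = [{"?x": "a", "?n": "Alice"}, {"?x": "b", "?n": "Bob"}]
q2 = [{"?x": "a", "?e": "alice@example.org"}]

with_email = optional(q1, q2)
only_a_names = filter_(q1, lambda m: m["?n"].startswith("A"))
```

Because each operator takes and returns a list of mappings, nested pattern groups can be evaluated by composing these calls, which mirrors the iterative processing of PatternGroup described above.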
4. Intermediate Results Assembly
4.1. Centralized Assembly Based on Partitioning
| Algorithm 3 Subquery Partitioning Algorithm |
| Input: Subquery Collection: |
| Output: Partition results: |
| 1 Initialize an empty Set |
| 2 //counter |
| 3 while do |
| 4 |
| // See line 14 for the definition of the method |
| 5 Initialize an empty Set |
| 6 |
| 7 for each in do |
| 8 if and are joinable then |
| 9 |
| 10 end if |
| 11 end for |
| 12 |
| 13 return |
| 14 function // The subquery that gets the fewest results |
| 15 |
| 16 for each in do |
| 17 if then |
| 18 |
| 19 return |
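The structure of Algorithm 3 that survives in the listing (pick the subquery with the fewest results, then collect the subqueries joinable with it into one partition) can be sketched as below. The joinability test is an assumption: two subqueries are treated as joinable when they share at least one variable. The names `partition_subqueries`, `variables`, and `result_counts` and the sample numbers are hypothetical.

```python
from typing import Dict, List, Set


def partition_subqueries(
    variables: Dict[str, Set[str]], result_counts: Dict[str, int]
) -> List[List[str]]:
    """Group subqueries into partitions seeded by the smallest result set."""
    pending = set(variables)
    partitions: List[List[str]] = []
    while pending:
        # Seed: the pending subquery that returned the fewest results.
        seed = min(pending, key=lambda q: (result_counts[q], q))
        pending.remove(seed)
        group = [seed]
        # Assumed joinability test: the subquery shares a variable with the seed.
        for q in sorted(pending):
            if variables[seed] & variables[q]:
                group.append(q)
        pending -= set(group)
        partitions.append(group)
    return partitions


variables = {
    "q1": {"?film", "?director"},
    "q2": {"?director", "?city"},
    "q3": {"?actor", "?award"},
    "q4": {"?film", "?title"},
}
result_counts = {"q1": 500, "q2": 40, "q3": 10, "q4": 900}
partitions = partition_subqueries(variables, result_counts)
```

Seeding each partition with the smallest intermediate result keeps the first join in every partition cheap, which is a plausible reading of why the listing defines a helper that returns the subquery with the fewest results.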
4.2. Distributed Assembly Based on BSP
- 1. Local Computation:
| Algorithm 4 Algorithm for local computation on node |
| Input: all intermediate results at the ith superstep on node : |
| the intermediate result after the (i−1)th superstep computation received on node : , |
| Output: the fully assembled results produced in this superstep: |
| Intermediate results to be sent: , |
| 1 Initialize an empty Set ; |
| 2 // A collection of all possible matches |
| 3 |
| 4 while do |
| 5 Initialize an empty Set |
| 6 for each in do |
| 7 for each in do |
| 8 if and are joinable then |
| 9 |
| 10 if is a completed result then |
| 11 |
| 12 else |
| 13 |
| 14 end if |
| 15 end if |
| 16 |
| 17 |
| 18 |
| 19 return ; |
- 2. Communication
- 3. Barrier Synchronization
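The three stages of a BSP superstep can be simulated in a single process as follows. This is a sketch under several assumptions that are not taken from the source: a partial match is a pair of (covered subquery ids, variable bindings), two partial matches are joinable when they cover disjoint subqueries, share at least one variable, and agree on all shared variables, and every node broadcasts its new incomplete results to all other nodes. The names `partial`, `try_join`, `local_computation`, and `bsp_assemble` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

Partial = Tuple[FrozenSet[str], FrozenSet[Tuple[str, str]]]


def partial(query_id: str, bindings: Dict[str, str]) -> Partial:
    return frozenset({query_id}), frozenset(bindings.items())


def try_join(a: Partial, b: Partial) -> Optional[Partial]:
    """Join two partial matches, or return None if they are not joinable."""
    (ids_a, bind_a), (ids_b, bind_b) = a, b
    da, db = dict(bind_a), dict(bind_b)
    shared = da.keys() & db.keys()
    if ids_a & ids_b or not shared or any(da[v] != db[v] for v in shared):
        return None
    return ids_a | ids_b, frozenset({**da, **db}.items())


def local_computation(work: Set[Partial], local: List[Partial], required: FrozenSet[str]):
    """Join `work` against the node's local matches until nothing new appears."""
    complete, incomplete, frontier = set(), set(), set(work)
    while frontier:
        item = frontier.pop()
        if item[0] == required:
            complete.add(item[1])
        elif item not in incomplete:
            incomplete.add(item)
            for mine in local:
                joined = try_join(item, mine)
                if joined is not None:
                    frontier.add(joined)
    return complete, incomplete


def bsp_assemble(local: Dict[str, List[Partial]], required: FrozenSet[str]):
    nodes = sorted(local)
    seen: Dict[str, Set[Partial]] = {n: set() for n in nodes}
    inbox: Dict[str, Set[Partial]] = {n: set() for n in nodes}
    results, superstep = set(), 0
    while superstep == 0 or any(inbox.values()):
        outbox: Dict[str, Set[Partial]] = {n: set() for n in nodes}
        for n in nodes:
            # 1. Local computation: superstep 0 starts from the node's own matches.
            received = set() if superstep == 0 else inbox[n]
            work = set(local[n]) if superstep == 0 else received - seen[n]
            complete, incomplete = local_computation(work, local[n], required)
            results |= complete
            new_items = incomplete - seen[n] - received
            seen[n] |= work | incomplete
            # 2. Communication: broadcast new incomplete results to the other nodes.
            for other in nodes:
                if other != n:
                    outbox[other] |= new_items
        # 3. Barrier synchronization: messages become visible only in the next superstep.
        inbox, superstep = outbox, superstep + 1
    return results


local_matches = {
    "n1": [partial("q1", {"?film": "film1", "?director": "dirA"}),
           partial("q1", {"?film": "film2", "?director": "dirB"})],
    "n2": [partial("q2", {"?director": "dirA", "?city": "Paris"})],
    "n3": [partial("q3", {"?film": "film1", "?title": "Title One"}),
           partial("q3", {"?film": "film2", "?title": "Title Two"})],
}
answers = bsp_assemble(local_matches, frozenset({"q1", "q2", "q3"}))
```

Broadcasting to every node is the simplest correct choice for a sketch, but it duplicates work across nodes; a real deployment would more likely route each intermediate result only to the nodes that can extend it.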
5. Experiments
5.1. Setting
5.2. Feasibility Test
5.3. Efficiency Test
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest

| Node | Processor | Memory | Hard Disk | Network |
|---|---|---|---|---|
| Controlling node | 2.30 GHz with 8 cores | 512 GB | 8 TB | 100 Mbps |
| Plain node | 3.06 GHz | 16 GB | 200 GB | 100 Mbps |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Yao, Y.; Zhang, Y. Semantic Data Federated Query Optimization Based on Decomposition of Block-Level Subqueries. Future Internet 2025, 17, 531. https://doi.org/10.3390/fi17110531
