An Efficient Distributed SPARQL Query Processing Scheme Considering Communication Costs in Spark Environments
Abstract
:1. Introduction
2. Related Work
3. The Proposed Distributed SPARQL Query Processing Scheme
3.1. Overall Architecture
3.2. Query Decomposition
3.3. Query Processing Procedure
4. Performance Evaluation
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Antoniou, G.; Harmelen, F.V. A Semantic Web Primer; MIT Press: Cambridge, MA, USA, 2004. [Google Scholar]
- Shadbolt, N.; Berners-Lee, T.; Hall, W. The Semantic Web Revisited. IEEE Intell. Syst. 2006, 21, 96–101. [Google Scholar] [CrossRef] [Green Version]
- Carroll, J.; Dickinson, I.; Dollin, C.; Reynolds, D.; Seaborne, A.; Wilkinson, K. Jena: Implementing the Semantic Web Recommendations. In Proceedings of the International Conference on World Wide Web—Alternate Track Papers & Posters, New York, NY, USA, 19–21 May 2004; pp. 74–83. [Google Scholar]
- Hassanzadeh, O.; Kementsietsidis, A.; Velegrakis, Y. Data Management Issues on the Semantic Web. In Proceedings of the International Conference on Data Engineering, Arlington, VA, USA, 1–5 April 2012; pp. 1204–1206. [Google Scholar]
- RDF 1.1 Concepts and Abstract Syntax. Available online: https://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/ (accessed on 7 December 2021).
- Decker, S.; Melnik, S.; Harmelen, F.V.; Fensel, D.; Klein, M.C.A.; Broekstra, J.; Erdmann, M.; Horrocks, I. The Semantic Web: The Roles of XML and RDF. IEEE Internet Comput. 2000, 4, 63–74. [Google Scholar] [CrossRef]
- Broekstra, J.; Kampman, A.; Harmelen, F.V. Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema. In Proceedings of the International Semantic Web Conference, Sardinia, Italy, 9–12 June 2002; pp. 54–68. [Google Scholar]
- Picalausa, F.; Luo, Y.; Fletcher, G.H.L.; Hidders, J.; Vansummeren, S. A Structural Approach to Indexing Triples. In Proceedings of the Extended Semantic Web Conference, Heraklion, Greece, 27–31 May 2012; pp. 406–421. [Google Scholar]
- Neumann, T.; Weikum, G. The RDF-3X engine for scalable management of RDF data. VLDB J. 2010, 19, 91–113. [Google Scholar] [CrossRef] [Green Version]
- Kang, S.; Shim, J.; Lee, S. Tridex: A lightweight triple index for relational database-based Semantic Web data management. Expert Syst. Appl. 2013, 40, 3421–3431. [Google Scholar] [CrossRef]
- SPARQL 1.1 Overview. Available online: https://www.w3.org/TR/sparql11-overview/ (accessed on 16 December 2021).
- Kim, K.; Moon, B.; Kim, H. R3F: RDF triple filtering method for efficient SPARQL query processing. World Wide Web 2015, 18, 317–357. [Google Scholar] [CrossRef]
- Hassan, M.; Bansal, K.S. RDF Data Storage Techniques for Efficient SPARQL Query Processing Using Distributed Computation Engines. In Proceedings of the International Conference on Information Reuse and Integration, Salt Lake City, UT, USA, 6–9 July 2018; pp. 323–330. [Google Scholar]
- Bonifati, A.; Martens, W.; Timm, T. An analytical study of large SPARQL query logs. VLDB J. 2020, 29, 655–679. [Google Scholar] [CrossRef] [Green Version]
- Kim, K.; Moon, B.; Kim, H. RG-index: An RDF graph index for efficient SPARQL query processing. Expert Syst. Appl. 2014, 41, 4596–4607. [Google Scholar] [CrossRef]
- Huang, J.; Abadi, D.J.; Ren, K. Scalable SPARQL Querying of Large RDF Graphs. VLDB Endow. 2011, 4, 1123–1134. [Google Scholar] [CrossRef]
- Kharrat, M.; Jedidi, A.; Gargouri, F. SPARQL Query Generation Based on RDF Graph. In Proceedings of the International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, Porto, Portugal, 9–11 November 2016; pp. 450–455. [Google Scholar]
- Wu, B.; Zhou, Y.; Yuan, P. Scalable SPARQL Querying Using Path Partitioning. In Proceedings of the International Conference on Data Engineering, Seoul, Korea, 13–17 April 2015; pp. 795–806. [Google Scholar]
- Hu, C.; Wang, X.; Yang, R.; Wo, T. ScalaRDF: A Distributed, Elastic and Scalable In-Memory RDF Triple Store. In Proceedings of the International Conference on Parallel and Distributed Systems, Wuhan, China, 13–16 December 2016; pp. 593–601. [Google Scholar]
- Wang, X.; Yang, T.; Chen, J.; He, L.; Du, X. RDF partitioning for scalable SPARQL query processing. Front. Comput. Sci. 2015, 9, 919–933. [Google Scholar] [CrossRef]
- Galárraga, L.; Hose, K.; Schenkel, R. Partout: A distributed engine for efficient RDF processing. In Proceedings of the International World Wide Web Conference, Seoul, Korea, 7–11 April 2014; pp. 267–268. [Google Scholar]
- Guo, X.; Gao, H.; Zou, Z. Leon: A Distributed RDF Engine for Multi-query Processing. In Proceedings of the International Conference on Database Systems for Advanced Applications, Chiang Mai, Thailand, 22–25 April 2019; pp. 742–759. [Google Scholar]
- Potter, A.; Motik, B.; Nenov, Y.; Horrocks, I. Dynamic Data Exchange in Distributed RDF Stores. IEEE Trans. Knowl. Data Eng. 2018, 30, 2312–2325. [Google Scholar] [CrossRef]
- Naacke, H.; Curé, O. On distributed SPARQL query processing using triangles of RDF triples. Open J. Semant. Web 2020, 7, 17–32. [Google Scholar]
- Jabeen, H.; Haziiev, E.; Sejdiu, G.; Lehmann, J. Dise: A Distributed in-Memory Sparql Processing Engine over Tensor Data. In Proceedings of the IEEE 14th International Conference on Semantic Computing (ICSC), San Diego, CA, USA, 3–5 February 2020; pp. 400–407. [Google Scholar]
- Hassan, M.; Bansal, S.K. S3QLRDF: Property Table Partitioning Scheme for Distributed SPARQL Querying of Large-Scale RDF data. In Proceedings of the IEEE International Conference on Smart Data Services (SMDS), Online, 18–24 October 2020; pp. 133–140. [Google Scholar]
- Lu, J.; Yang, C.; Wang, B.; Feng, J. FP-ExtVP: Accelerating Distributed SPARQL Queries by Exploiting Load-Adaptive Partitioning. In Proceedings of the IEEE International Conference on Big Data (Big Data), Online, 10–13 December, 2020; pp. 543–550. [Google Scholar]
- Ragab, M.; Eyvazov, S.; Tommasini, R.; Sakr, S. Systematic Performance Analysis of Distributed SPARQL Query Answering Using Spark-SQL; IOP Press: Bristol, UK, 2020; pp. 1–21. [Google Scholar]
- Kang, X.; Zhao, Y.; Yuan, P.; Jin, H. Grace: An Efficient Parallel SPARQL Query System over Large-Scale RDF Data. In Proceedings of the IEEE 24th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Dalian, China, 5–7 May 2021; pp. 769–774. [Google Scholar]
- Leng, Y.; Chen, Z.; Zhong, F.; Li, X.; Hu, Y.; Yang, C. BRGP: A balanced RDF graph partitioning algorithm for cloud storage. Concurr. Comput. Pract. Exp. 2017, 29, e3896. [Google Scholar] [CrossRef]
- Padiya, T.; Bhise, M. DWAHP: Workload Aware Hybrid Partitioning and Distribution of RDF Data. In Proceedings of the International Database Engineering & Applications Symposium, Bristol, UK, 12–14 July 2017; pp. 235–241. [Google Scholar]
- Zeng, K.; Yang, J.; Wang, H.; Shao, B.; Wang, Z. A distributed graph engine for web scale RDF data. VLDB Endow. 2013, 6, 265–276. [Google Scholar] [CrossRef] [Green Version]
- Ravindra, P.; Anyanwu, K. Nesting Strategies for Enabling Nimble MapReduce Dataflows for Large RDF Data. Proc. Int. J. Semant. Web Inf. Syst. 2014, 10, 1–26. [Google Scholar] [CrossRef]
- Elzein, N.M.; Majid, M.A.; Hashem, I.A.T.; Yaqoob, I.; Alaba, F.A.; Imran, M. Managing big RDF data in clouds: Challenges, opportunities, and solutions. Sustain. Cities Soc. 2018, 39, 375–386. [Google Scholar] [CrossRef]
- Quilitz, B.; Leser, U. Querying Distributed RDF Data Source with SPARQL. In Proceedings of the European Semantic Web Conferences, Tenerife, Spain, 1–5 June 2008; pp. 524–538. [Google Scholar]
- Feng, J.; Meng, C.; Song, J.; Zhang, X.; Feng, Z.; Zou, L. SPARQL Query Parallel Processing: A Survey. In Proceedings of the International Congress on Big Data, Honolulu, HI, USA, 25–30 June 2017; pp. 444–451. [Google Scholar]
- Papailiou, N.; Konstantinou, I.; Tsoumakos, D.; Karras, P.; Koziris, N. H2RDF+: High-Performance Distributed Joins over Large-Scale RDF Graphs. In Proceedings of the IEEE International Conference on Big Data, Silicon Valley, CA, USA, 6–9 October 2013; pp. 255–263. [Google Scholar]
- Wylot, M.; Hauswirth, M.; Cudré-Mauroux, P.; Sakr, S. RDF Data Storage and Query Processing Schemes: A Survey. ACM Comput. Surv. 2018, 51, 84. [Google Scholar] [CrossRef]
- Leida, M.; Chu, A. Distributed SPARQL Query Answering over RDF Data Streams. In Proceedings of the International Congress on Big Data, Santa Clara, CA, USA, 27 June–2 July 2013; pp. 369–378. [Google Scholar]
- Abdelaziz, I.; Harbi, R.; Khayyat, Z.; Kalnis, P. A Survey and Experimental Comparison of Distributed SPARQL Engines for Very Large RDF Data. VLDB Endow. 2017, 10, 2049–2060. [Google Scholar] [CrossRef] [Green Version]
- Zhou, J.; Bochmann, G.V.; Shi, Z. Distributed Query Processing in an Ad-Hoc Semantic Web Data Sharing System. In Proceedings of the International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum, Cambridge, MA, USA, 20–24 May 2013; pp. 687–695. [Google Scholar]
- Hammoud, M.; Rabbou, D.A.; Nouri, R.; Beheshti, S.; Sakr, S. DREAM: Distributed RDF Engine with Adaptive Query Planner and Minimal Communication. VLDB Endow. 2015, 8, 654–665. [Google Scholar] [CrossRef]
- Chen, X.; Chen, H.; Zhang, N.; Zhang, S. SparkRDF: Elastic Discreted RDF Graph Processing Engine with Distributed Memory. In Proceedings of the International Conference on Web Intelligence and Intelligent Agent Technology, Singapore, 6–9 December 2015; pp. 292–300. [Google Scholar]
- Zaharia, M.; Xin, R.S.; Wendell, P.; Das, T.; Armbrust, M.; Dave, A.; Meng, X.; Rosen, J.; Venkataraman, S.; Franklin, M.J.; et al. Apache Spark: A unified engine for big data processing. Commun. ACM 2016, 59, 56–65. [Google Scholar] [CrossRef]
- Li, M.; Tan, J.; Wang, Y.; Zhang, L.; Salapura, V. SparkBench: A Comprehensive Benchmarking Suite for in Memory Data Analytic Platform Spark. In Proceedings of the Conference on Computing Frontiers, Ischia, Italy, 18–21 May 2015; pp. 1–8. [Google Scholar]
- Zaharia, M.; Chowdhury, M.; Das, T.; Dave, A.; Ma, J.; McCauly, M.; Franklin, M.J.; Shenker, S.; Stoica, I. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation, San Jose, CA, USA, 25–27 April 2012; pp. 15–28. [Google Scholar]
- Zhang, M.; Chen, R.; Zhang, X.; Feng, Z.; Rao, G.; Wang, X. Intelligent RDD Management for High Performance In-Memory Computing in Spark. In Proceedings of the International Conference on World Wide Web Companion, Perth, Australia, 3–7 April 2017; pp. 873–874. [Google Scholar]
- Agathangelos, G.; Troullinou, G.; Kondylakis, H.; Stefanidis, K.; Plexousakis, D. RDF Query Answering Using Apache Spark: Review and Assessment. In Proceedings of the International Conference on Data Engineering Workshops, Paris, France, 16–20 April 2018; pp. 54–59. [Google Scholar]
- The LUBM Benchmark. Available online: http://swat.cse.lehigh.edu/projects/lubm/ (accessed on 6 December 2021).
- DBpedia. Available online: http://wiki.dbpedia.org/ (accessed on 6 December 2021).
Parameter | Value |
---|---|
Processor | Intel(R) Core(TM) i5 |
Memory | 64 GB |
No. of nodes | Master 1, Slave 3 |
Engine used | Spark 2.4.5 |
(s) | N1 | N2 | N3 | |||
---|---|---|---|---|---|---|
Existing Scheme | Proposed Scheme | Existing Scheme | Proposed Scheme | Existing Scheme | Proposed Scheme | |
Q1 | 2.578 | 2.557 | 2.532 | 2.542 | 3.421 | 3.411 |
Q2 | 2.571 | 2.671 | 2.515 | 2.424 | 2.997 | 2.987 |
Q3 | 3.784 | 3.532 | 3.564 | 3.487 | X | X |
Q4 | 7.601 | 7.710 | 7.331 | 6.109 | 1.927 | 1.887 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Lim, J.; Kim, B.; Lee, H.; Choi, D.; Bok, K.; Yoo, J. An Efficient Distributed SPARQL Query Processing Scheme Considering Communication Costs in Spark Environments. Appl. Sci. 2022, 12, 122. https://doi.org/10.3390/app12010122
Lim J, Kim B, Lee H, Choi D, Bok K, Yoo J. An Efficient Distributed SPARQL Query Processing Scheme Considering Communication Costs in Spark Environments. Applied Sciences. 2022; 12(1):122. https://doi.org/10.3390/app12010122
Chicago/Turabian StyleLim, Jongtae, Byounghoon Kim, Hyeonbyeong Lee, Dojin Choi, Kyoungsoo Bok, and Jaesoo Yoo. 2022. "An Efficient Distributed SPARQL Query Processing Scheme Considering Communication Costs in Spark Environments" Applied Sciences 12, no. 1: 122. https://doi.org/10.3390/app12010122
APA StyleLim, J., Kim, B., Lee, H., Choi, D., Bok, K., & Yoo, J. (2022). An Efficient Distributed SPARQL Query Processing Scheme Considering Communication Costs in Spark Environments. Applied Sciences, 12(1), 122. https://doi.org/10.3390/app12010122