A Hybrid Approach Combining R*-Tree and k-d Trees to Improve Linked Open Data Query Performance
1. Introduction
- We present a new hybrid index structure, the two-step index structure, designed for efficient LOD join queries. In particular, we consider flash-based solid-state drives (SSDs) an excellent storage medium for the memory-based k-d trees.
- We propose an efficient join query algorithm based on the two-step index structure for various SPARQL query types and a hot-cold segment identification algorithm that determines regions of high interest.
- We evaluate our index structure through extensive experiments using benchmark LOD datasets. Experimental results show that our hybrid approach exhibits better retrieval performance than existing approaches.
2. Background and Related Work
2.1. Overview of Linked Open Data
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX user: <http://dbpedia.org/person/>
user:Smith foaf:knows ?f .
?f foaf:project ?n }
- Star queries are sets of triple patterns that share the same subject or object variable (subject = subject or object = object). Usually, only subject-subject joins are considered (i.e., all triple patterns have the same subject).
- Chain and directed cycle queries are sequences of triple patterns in which the object variable of one pattern is the subject variable of the next (subject = object).
- Complex queries are a combination of star and chain queries.
- Tree queries contain both subject-subject and subject-object joins, along with some more complex query patterns.
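The query types listed above can be told apart by looking at which variables two triple patterns share. The following is a minimal sketch of that idea, not the authors' implementation: the `TriplePattern` type and the `join_types` and `classify` helpers are hypothetical names introduced here, and the sketch does not try to separate tree queries from complex queries.

```python
from typing import NamedTuple


class TriplePattern(NamedTuple):
    """One SPARQL triple pattern; terms starting with '?' are variables."""
    subject: str
    predicate: str
    obj: str


def is_variable(term):
    return term.startswith("?")


def join_types(a, b):
    """Return the kinds of join that link two patterns through a shared variable."""
    kinds = set()
    if is_variable(a.subject) and a.subject == b.subject:
        kinds.add("subject-subject")
    if is_variable(a.obj) and a.obj == b.obj:
        kinds.add("object-object")
    # The object of one pattern is the subject of the other (chain link).
    if is_variable(a.obj) and a.obj == b.subject:
        kinds.add("subject-object")
    if is_variable(a.subject) and a.subject == b.obj:
        kinds.add("subject-object")
    return kinds


def classify(patterns):
    """Classify a list of patterns as 'star', 'chain', 'complex', or 'none'."""
    kinds = set()
    for i, a in enumerate(patterns):
        for b in patterns[i + 1:]:
            kinds |= join_types(a, b)
    star = bool(kinds & {"subject-subject", "object-object"})
    chain = "subject-object" in kinds
    if star and chain:
        return "complex"
    if star:
        return "star"
    if chain:
        return "chain"
    return "none"


# The two triple patterns of the example query in this section.
example = [
    TriplePattern("user:Smith", "foaf:knows", "?f"),
    TriplePattern("?f", "foaf:project", "?n"),
]
kind = classify(example)
```

For the example query, `?f` is the object of the first pattern and the subject of the second, so the sketch reports a chain query.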
2.2. Hybrid Storage Structure
2.3. Related Work
3. Hybrid Index System
3.1. Extended Multidimensional Histogram
3.2. Two-Step Index Structure
3.3. Hot-Cold Segment Identification Algorithm
Algorithm 1. Hot-cold segment identification.
// Collecting training data for the decision tree
While (iterate time < predetermined time)
    If user accesses data Then
        RECENT = 1
        COUNT += 1
    End If
End While
While (there exist data in the decision table)
    If RECENT == 1 and COUNT >= threshold Then
        TYPE = 1 // Data are identified as hot data
    End If
End While
// Training decision tree
dTree = DecisionTreeClassifier(max_depth = 3) // Create decision tree
dTree.fit(train_data, train_label) // Train decision tree
result = dTree.predict(test_data) // Identify test data not accessed by the user
// Identified data are relocated to HDD and SSD
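The labelling phase of Algorithm 1 (everything before the decision tree is trained) can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the authors' code: the access-log format of `(timestamp, segment_id)` pairs, the function name `label_hot_segments`, and its parameters are all hypothetical. The classifier step is left out; the name `DecisionTreeClassifier` in Algorithm 1 matches the scikit-learn class, although the source does not state which library is used.

```python
from collections import Counter


def label_hot_segments(access_log, segments, window_start, window_end, threshold):
    """Label each segment 1 (hot) or 0 (cold) from accesses seen in a time window.

    access_log: iterable of (timestamp, segment_id) pairs (assumed format).
    segments:   all segment ids in the decision table.
    """
    count = Counter()
    recent = set()
    for timestamp, segment_id in access_log:
        if window_start <= timestamp < window_end:
            recent.add(segment_id)   # corresponds to RECENT = 1
            count[segment_id] += 1   # corresponds to COUNT += 1
    # TYPE = 1 only when the segment was accessed and reached the threshold.
    return {
        s: int(s in recent and count[s] >= threshold)
        for s in segments
    }


# Hypothetical access log: segment "a" is read twice inside the window,
# "b" once, and "c" only after the window has closed.
log = [(1, "a"), (2, "a"), (3, "b"), (50, "c")]
labels = label_hot_segments(log, ["a", "b", "c"], 0, 10, threshold=2)
```

The resulting labels play the role of `train_label` in Algorithm 1: they would be passed, together with the per-segment features, to the decision tree that later predicts the type of segments the user has not yet accessed.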
3.4. Two-Step SPARQL Query Processing
Algorithm 2. MDH*-based join processing.
If Q is in SPARQL triple patterns // Q: query
    For each pattern i in Q
        If FilterPhase(i) != null Then
            FilterPhase(i) = OVERLAP(FilterPhase(i-1), FilterPhase(i))
        Else break
    End For
    For each pattern i in Q
        If RefinePhase(i) != null Then
            RefinePhase(i) = JOIN(RefinePhase(i-1), RefinePhase(i))
        Else break
    End For
End If
Procedure JOIN(X, Y) // X, Y: two k-d tree input sets
    For each tuple x in k-d tree X
        For each tuple y in k-d tree Y
            If α ≠ 0 (or β ≠ 0) Then
                If x and y satisfy the join condition Then
                    x and y tuples are added to the result
                    α (or β) is decremented by 1
                End If
            Else break
        End For
    End For
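The JOIN procedure of Algorithm 2 is a nested-loop join that stops early once a counter reaches zero. The sketch below illustrates only that control flow. It assumes that α (or β) acts as a non-negative count of results still to be produced, which this excerpt does not define, and it uses plain Python lists in place of k-d trees. The function name `bounded_join`, the `matches` predicate, and the sample triples are hypothetical.

```python
def bounded_join(xs, ys, matches, budget):
    """Nested-loop join that stops once `budget` matching pairs have been found.

    xs, ys:  input tuples (stand-ins for the two k-d tree inputs X and Y).
    matches: predicate implementing the join condition.
    budget:  non-negative counter playing the role of alpha (or beta).
    """
    result = []
    for x in xs:
        for y in ys:
            if budget == 0:
                return result          # counter exhausted: stop scanning
            if matches(x, y):
                result.append((x, y))  # x and y are added to the result
                budget -= 1            # counter is decremented by 1
    return result


# Hypothetical triples for a chain join: object of x equals subject of y.
knows = [
    ("user:Smith", "foaf:knows", "user:Ann"),
    ("user:Smith", "foaf:knows", "user:Bob"),
]
projects = [
    ("user:Ann", "foaf:project", "P1"),
    ("user:Bob", "foaf:project", "P2"),
    ("user:Bob", "foaf:project", "P3"),
]


def chain_condition(x, y):
    return x[2] == y[0]


first_two = bounded_join(knows, projects, chain_condition, budget=2)
all_pairs = bounded_join(knows, projects, chain_condition, budget=10)
```

With a budget of 2 the scan ends before the third matching pair is reached; with a larger budget all three pairs are returned. Returning from the function as soon as the counter hits zero is a small simplification of the pseudocode, where the break leaves only the inner loop.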
4. Experimental Evaluation
4.1. Join Query Performance and Storage Amount
4.2. Performance of Hot-Cold Segment Identification Method
5. Conclusions and Future Work
Author Contributions
Institutional Review Board Statement
Informed Consent Statement
Conflicts of Interest
References
- Okoye, K. Linked Open Data: State-of-the-Art Mechanisms and Conceptual Framework. In Linked Open Data: Applications, Trends and Future Developments; Okoye, K., Ed.; IntechOpen: London, UK, 2020; Volume 3, pp. 158–190.
- Svoboda, M.; Mlynkova, I. Linked Data Indexing Methods: A Survey. In On the Move to Meaningful Internet Systems: OTM 2011 Workshops, Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2011; Volume 7046, pp. 474–483.
- Harth, A.; Hose, K.; Karnstedt, M.; Polleres, A.; Sattler, K.; Umbrich, J. Data Summaries for On-demand Queries over Linked Data. In Proceedings of the 19th International Conference on World Wide Web (WWW), Raleigh, NC, USA, 26–30 April 2010; pp. 411–420.
- Hartig, O. Querying a Web of Linked Data: Foundations and Query Execution; IOS Press: Amsterdam, The Netherlands, 2016; Volume 5.
- Umbrich, J.; Hose, K.; Karnstedt, M.; Harth, A.; Polleres, A. Comparing Data Summaries for Processing Live Queries over Linked Data. World Wide Web 2011, 14, 495–544.
- Guttman, A. R-trees: A Dynamic Index Structure for Spatial Searching. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), New York, NY, USA, 18–21 June 1984; Volume 14, pp. 47–57.
- Umbrich, J. A Hybrid Framework for Querying Linked Data Dynamically. Ph.D. Thesis, National University of Ireland, Galway, Ireland, 2012.
- Lynden, S.; Kojima, I.; Matono, A.; Nakamura, A.; Yui, M. A Hybrid Approach to Linked Data Query Processing with Time Constraints. In Proceedings of the WWW Workshop on Linked Data on the Web (LDOW) 2013, Rio de Janeiro, Brazil, 14 May 2013.
- Harth, A.; Decker, S. Optimized Index Structures for Querying RDF from the Web. In Proceedings of the 3rd Latin American Web Congress (LA-Web), Washington, DC, USA, 1 October–2 November 2005; pp. 71–81.
- Neumann, T.; Weikum, G. RDF-3X: A RISC-style Engine for RDF. In Proceedings of the 34th International Conference on Very Large Data Bases (VLDB), Auckland, New Zealand, 24–30 August 2008; pp. 647–659.
- Weiss, C.; Karras, P.; Bernstein, A. Hexastore: Sextuple Indexing for Semantic Web Data Management. In Proceedings of the 34th International Conference on Very Large Data Bases (VLDB), Auckland, New Zealand, 24–30 August 2008; pp. 1008–1019.
- Atre, M.; Chaoji, V.; Zaki, M.; Hendler, J. Matrix Bit Loaded: A Scalable Lightweight Join Query Processor for RDF Data. In Proceedings of the 19th International Conference on World Wide Web (WWW), Raleigh, NC, USA, 26–30 April 2010; pp. 41–50.
- Yuan, P.; Liu, P.; Wu, B.; Jin, H.; Zhang, W.; Liu, L. TripleBit: A Fast and Compact System for Large Scale RDF Data. In Proceedings of the 39th International Conference on Very Large Data Bases (VLDB), Riva del Garda, Trento, Italy, 30 August 2013; pp. 517–528.
- Quilitz, B.; Leser, U. Querying Distributed RDF Data Sources with SPARQL. In Proceedings of the 5th European Semantic Web Conference (ESWC), Lecture Notes in Computer Science, Canary Islands, Spain, 27 June 2008; Volume 5021, pp. 524–538.
- Langegger, A.; Wöß, W.; Blöchl, M. A Semantic Middleware for Virtual Data Integration on the Web. In Proceedings of the 5th European Semantic Web Conference (ESWC), Lecture Notes in Computer Science, Canary Islands, Spain, 27 June 2008; Volume 5021, pp. 493–507.
- Abdelaziz, I.; Mansour, E.; Ouzzani, M.; Aboulnaga, A.; Kalnis, P. Lusail: A System for Querying Linked Data at Scale. In Proceedings of the 44th International Conference on Very Large Data Bases (VLDB), Rio de Janeiro, Brazil, 27–31 August 2018; pp. 485–498.
- Lynden, S.; Yui, M.; Matono, A.; Nakamura, A.; Ogawa, H.; Kojima, I. Optimising Coverage, Freshness and Diversity in Live Exploration-based Linked Data Queries. In Proceedings of the 6th International Conference on Web Intelligence, Mining and Semantics (WIMS), Nimes, France, 13–15 June 2016; Volume 18.
- Tsatsanifos, G.; Sacharidis, D.; Sellis, T. On Enhancing Scalability for Distributed RDF/S Stores. In Proceedings of the 14th International Conference on Extending Database Technology, Uppsala, Sweden, 21–24 March 2011; pp. 141–152.
- Mountantonakis, M.; Tzitzikas, Y. Scalable Methods for Measuring the Connectivity and Quality of Large Numbers of Linked Datasets. ACM J. Data Inf. Qual. 2018, 9, 15.
- Fevgas, A.; Bozanis, P. A Spatial Index for Hybrid Storage. In Proceedings of the 23rd International Database Applications & Engineering Symposium (IDEAS), Athens, Greece, 10–12 June 2019; pp. 1–8.
- Sakr, S.; Wylot, M.; Mutharaju, R.; Le Phuoc, D.; Fundulaki, I. Linked Data: Storing, Querying, and Reasoning; Springer: Cham, Switzerland, 2018; Volume 4, pp. 51–83.
- Linking Open Data: W3C SWEO Community Project. Available online: http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData (accessed on 12 March 2017).
- Chawla, T.; Singh, G.; Pilli, E.S. JOTR: Join-Optimistic Triple Reordering Approach for SPARQL Query Optimization on Big RDF Data. In Proceedings of the 9th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Bangalore, India, 10–12 July 2018; pp. 1–7.
- Bayer, R.; McCreight, E.M. Organization and Maintenance of Large Ordered Indexes. Acta Inform. 1972, 1, 173–189.
- Litwin, W. Linear Hashing: A New Tool for File and Table Addressing. In Proceedings of the 6th International Conference on Very Large Data Bases (VLDB), Montreal, QC, Canada, 1–3 October 1980; pp. 212–223.
- Fagin, R.; Nievergelt, J.; Pippenger, N.; Strong, H.R. Extendible Hashing: A Fast Access Method for Dynamic Files. ACM Trans. Database Syst. 1979, 4, 315–344.
- Moten, D. 3D R-tree in Java. Available online: https://github.com/davidmoten/rtree-3d (accessed on 13 October 2019).
- Jin, C.; De-lin, S.; Fen-xiang, M. An Improved ID3 Decision Tree Algorithm. In Proceedings of the International Conference on Computer Science & Education (ICCSE), Nanning, China, 25–28 July 2009; pp. 127–130.
- Park, D.C.; Du, D. Hot Data Identification for Flash-based Storage Systems Using Multiple Bloom Filters. In Proceedings of the Mass Storage Systems and Technologies (MSST), Denver, CO, USA, 23–27 May 2011.
- Park, D.C. Hot and Cold Data Identification: Applications to Storage Devices and Systems. Ph.D. Thesis, The University of Minnesota, Minneapolis, MN, USA, 2012.
- SWAT Projects-The Lehigh University Benchmark (LUBM). Available online: http://swat.cse.lehigh.edu/projects/lubm (accessed on 10 March 2020).
- Vidal, V.; Casanova, M.; Menendez, E.; Arruda, N.; Pequeno, V.; Paes Leme, L. Using Changesets for Incremental Maintenance of Linkset Views. In Proceedings of the 17th International Conference on Web Information Systems Engineering (WISE), New York, NY, USA, 8–10 November 2016; pp. 196–203.
| Local Approach | Live Exploration Approach | Index Approach | Hybrid Approach |
Feature | Store collected data into a local repository | Query multiple SPARQL endpoints | Use summary and approximation indexes | Combine two storage and searching approaches |
Advantage | Excellent response time | Dynamic with up-to-date data | Efficient query processing | Trade-off between two approaches |
Disadvantage | Cannot reflect recent data | Slow response time | High maintenance cost | Bad join query performance |
Related work | QUAD [9], RDF-3X [10], Hexastore [11], Matrix [12], TripleBit [13] | DARQ [14], SemWIQ [15], LiveExplorer [4], Lusail [16], IRISelection [17] | MDH [5], QTree [3], MIDAS-RDF [18], SameAsPrefixIndex [19] | HybridEngine [7], HybridQuery [8], H-Grid [20], MapReduce+RDF-3X [21] |
Dataset | Size (MB) | Number of Triples | Number of Subjects | Number of Predicates | Number of Objects |
DBpedia | 3.94 | 31,050 | 4008 | 23 | 16,644 |
DrugBank | 144 | 766,920 | 19,693 | 119 | 274,864 |
LinkedGeoData | 327 | 2,207,295 | 552,541 | 1320 | 1,308,214 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Sun, Y.; Zhao, T.; Yoon, S.; Lee, Y. A Hybrid Approach Combining R*-Tree and k-d Trees to Improve Linked Open Data Query Performance. Appl. Sci. 2021, 11, 2405. https://doi.org/10.3390/app11052405