Efficient Complex Aggregate Queries with Accuracy Guarantee Based on Execution Cost Model over Knowledge Graphs
Abstract
1. Introduction
- Existing solutions. A straightforward method for handling aggregate queries over KGs is to perform additional aggregate operations on the answers returned by factoid queries. The effectiveness of this method depends on the quality of the answers provided by the factoid queries. In addition, this method tends to overlook answers with equivalent meanings but differing structures [4,6,9,12,13,14,15], resulting in inaccurate results. The concept of an "accurate" answer is often contingent upon the user’s query intent and may even be inherently ambiguous [16,17]. Moreover, the method is inefficient because of the extra time required to run the additional aggregate operations. Approximate aggregate queries with semantic-aware sampling (AQS) [18,19] was therefore proposed to tackle these issues. It uses a random walk based on KG embedding to collect high-quality random samples through semantic-aware sampling and calculates approximate aggregate results from the random samples using the bag of little bootstraps (BLB) method. It returns a confidence interval for the approximate result and ensures that the relative error of the approximate result is bounded by a predefined error bound. We discuss this in more detail in Section 2. AQS performs well for simple aggregate queries over KGs, in both effectiveness and efficiency. However, it cannot be directly extended to complex aggregate queries over KGs. Here, we call an aggregate query over a KG G (e.g., How many cars are produced in Germany?, see Figure 1, middle left part) simple if it is formed as a query graph Q with only one specific entity (e.g., Germany) and one predicate (e.g., product), and it aims to find the aggregate result for an attribute (e.g., the aggregate function COUNT over the wildcard attribute ∗) of the target entities (e.g., cars) that are semantically related to the given specific entity via the specific predicate.
We call a query complex if it is a combination of a set of simple aggregate queries (i.e., sub-queries). In Section 2, we formally define simple and complex aggregate queries. The reason why AQS cannot adapt to complex queries comes from the fact that even if each sub-query satisfies the predefined error bound (guaranteed by performing AQS for each sub-query), the accuracy of the complex query may still not be guaranteed. We next show this problem of AQS clearly.
- Efficiency vs. effectiveness trade-off problem. In AQS, efficiency refers to the response time of a query, and effectiveness (or accuracy) refers to the relative error of the approximate aggregate result regarding the ground truth. Intuitively, the accuracy of a complex query can be guaranteed if we set quite a small predefined error bound for each sub-query of the complex query. However, this would significantly increase the response time, as the AQS would require more time to collect additional samples to achieve a good approximate aggregate result. This creates an efficiency vs. effectiveness trade-off problem for complex aggregate queries using AQS. More precisely, the essence of this problem is how to configure appropriate predefined error bounds for each sub-query, so that the complex query’s accuracy is guaranteed and its response time is as short as possible. We clarify this in the following two examples.
- Our solution and contributions. As shown in Figure 2, we focus on the effectiveness vs. efficiency trade-off problem of AQS for aggregate queries of KGs. We start with the preliminaries of AQS and formally define this problem in Section 2. To solve it, we first study the relationship between effectiveness and efficiency on aggregate queries for KGs (Problem-1), then we leverage this relationship to achieve a balance between the two (Problem-2). For Problem-1, we first study the relationship between effectiveness and efficiency for simple aggregate queries (Problem-1.1), then we extend this to complex aggregate queries (Problem-1.2). To solve Problem-1.1, in Section 3, we first present a cost model of AQS for simple aggregate queries on the basis of Taylor’s theorem (Section 3.1). Then, we show how to determine the parameters of this cost model using a normal equation (Section 3.2). Next, we solve Problem-1.2 by exploring the relationship between the sub-query error bound and the complex aggregate query error bound, establishing a general form of the execution cost model for complex aggregate queries, with accuracy constraints given the relationship of the error bounds (Section 4.1). Finally, in Section 4.2, we employ a multi-objective optimization genetic algorithm to determine the optimal error bounds for sub-queries, thus minimizing the total response time for a complex aggregate query, while returning a sufficiently accurate approximate aggregate result for the complex query. In the context of simple aggregate queries, our approach using a cost model for AQS exhibited 1.8X efficiency improvement, on average, compared to the original AQS without the cost model. For complex aggregate queries, our method stands as the sole solution capable of meeting predefined user error bounds, while maintaining a commendable level of efficiency.
- We present an execution cost model for AQS based on Taylor’s theorem and obtain appropriate parameters for this cost model using the normal equation.
- We extend the cost model of AQS to complex aggregate queries, associated with a set of accuracy constraints on the complex aggregate queries, as well as the relationship between the sub-query error bound and complex query error bound.
- We leverage a genetic algorithm to determine the optimal error bounds for sub-queries, achieving a balance between efficiency and accuracy.
- We conducted experiments with three widely used real-world KGs, to demonstrate the effectiveness and efficiency of our method.
2. Preliminaries
- Approximate aggregate queries with semantic-aware sampling (AQS) [18] was recently developed as an efficient and effective solution for answering simple aggregate queries over KGs. When a simple aggregate query is posed against a KG G, AQS proceeds in three main steps: (1) Semantic-aware sampling. AQS runs a semantic-aware online sampling algorithm based on a random walk over G to gather answers from G, forming a random sample of entities that are likely to be semantically similar to the query graph Q. (2) Correctness validation of the random sample. AQS uses a greedy algorithm to filter the correct answers from the random sample, where a correct answer is one whose semantic similarity exceeds a predefined threshold. (3) Estimation and accuracy guarantee. From the validated sample, AQS estimates an approximate aggregate result. It then provides an accuracy guarantee for this estimate by iteratively computing a tight confidence interval (CI) at a high confidence level, whose half-width is also known as the margin of error. The CI covers the ground truth V with a probability equal to the confidence level. AQS terminates once the margin of error is small enough that the relative error of the estimate with respect to V is bounded above by the predefined error bound e with that probability (Theorem 2 in [18]).
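The stopping rule in step (3) can be illustrated with a small check. The sketch below assumes a symmetric confidence interval around a positive estimate and uses the sufficient condition moe ≤ v_hat · e / (1 + e), which follows from bounding |v_hat − V| / V by moe / (v_hat − moe) over the interval. The names `v_hat`, `moe`, and `moe_threshold` are illustrative, and the exact condition used by AQS should be taken from [18].

```python
def moe_threshold(v_hat: float, e: float) -> float:
    """Largest margin of error that still bounds the relative error by e."""
    if v_hat <= 0 or e <= 0:
        raise ValueError("v_hat and e must be positive")
    return v_hat * e / (1.0 + e)


def worst_case_relative_error(v_hat: float, moe: float) -> float:
    """Worst |v_hat - V| / V over all V in [v_hat - moe, v_hat + moe]."""
    if not 0 <= moe < v_hat:
        raise ValueError("need 0 <= moe < v_hat")
    # The ratio is largest at the lower end of the interval, V = v_hat - moe.
    return moe / (v_hat - moe)


# Hypothetical estimate and user error bound (5%).
v_hat, e = 1000.0, 0.05
limit = moe_threshold(v_hat, e)
worst = worst_case_relative_error(v_hat, limit)
print(limit, worst)
```

At the threshold, the worst-case relative error equals e, so any tighter interval keeps the relative error within the bound whenever the interval covers V.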
3. Execution Cost Model of AQS for Simple Aggregate Queries
3.1. Execution Cost of AQS
- Cost of semantic-aware random walk sampling. This step has three phases, where a query graph Q and a knowledge graph (KG) G are given. In the first phase, AQS finds the mapping node for the specific node that satisfies and . Afterward, a BFS is initiated from to extract an n-bounded subgraph with respect to , where each entity u from is within n hops from . In the second phase, AQS utilizes a KG embedding model (e.g., TransE [24]) to obtain the predicate similarity of each edge to the query edge . Consequently, AQS formulates a transition matrix based on the predicate similarities and performs random walk on until reaching convergence. Lastly, AQS employs the continuous sampling [25] technique to collect answers with greater semantic similarity to Q, forming a random sample S. Phases (1)–(2) are dedicated to random walk for , and the last phase is utilized for random sampling. Thus, the execution costs of phases (1)–(2) and phase (3) are denoted as and , respectively.
- Q1: “How many cars are produced in Germany?”
- Q2: “How many Spanish football players are there?”
- Q3: “How many football players are there in Germany?”
- Cost of Correctness validation for the random sample. After obtaining a random sample S through semantic-aware random walk sampling, AQS employs a process of correctness validation to identify the correct answers within S. Subsequently, AQS can directly utilize these correct answers for approximate result estimation. In essence, correctness validation involves enumerating all answers in S, making its execution cost dependent on the sample size . Similarly to the execution cost mentioned earlier, can also be estimated as . In other words, a smaller predefined error bound e necessitates a larger sample size , thereby resulting in a longer time for correctness validation.
- Cost of estimation and accuracy guarantee. After validating a random sample S for correctness, AQS employs two unbiased estimators and one consistent estimator for , and , respectively, to estimate the approximate aggregate result for the query in the form of a confidence interval . Simultaneously, AQS provides a robust accuracy guarantee by iteratively refining until . Intuitively, a smaller predefined error bound e requires more iterations for refinement; thus, the execution cost of this step is dependent on e, and can be expressed as . In summary, we define .
- General cost model based on Taylor’s theorem. Based on the above analysis, we can derive the total execution cost of AQS as . Since both and are related to the sample size determined by the predefined error bound e, we can simplify T by combining and together as :
3.2. Cost Model Training
- Estimation of the random-walk cost. To estimate the cost of phases (1)–(2) (Section 3.1), we perform only the first two phases of the first step in AQS, which construct an n-bounded subgraph and run the random walk until convergence. We repeat these phases r times and take the average running time over the r executions as the estimate for the given query.
- Estimation of . To estimate the parameters for a given n-th order Taylor polynomial, such as Equations (4)–(6), we first collect a set of statistics during the runtime as the training data. Next, we use the normal equation to obtain the optimal solution that minimizes the mean squared error (MSE). The details are presented below.
- Collecting training data. The focus of our data collection is the relationships among the sample size , the predefined error bound e, and the execution cost of each step in AQS. We collect run-time statistics in the form of quadruples , where represents an observation of the independent variable e (with a random value in the range ), denotes the total size of the sample required by AQS for a specific , records the execution time of continuous sampling and correctness validation in AQS for , and represents the execution time for estimation and the accuracy guarantee in AQS for . To compile the training data, we generate m observations of e as in our implementation, resulting in m quadruples. Subsequently, we extract information from these statistics to formulate distinct training data for different n-th order Taylor polynomials and to estimate the parameters .
- (1) Training data for . Based on the analysis in Section 3.1, is directly influenced by the sample size , which is primarily determined by the predefined error bound e. Therefore, we initially collect a set of tuples as training data, to estimate the parameters for (Equation (4)). Subsequently, we gather another set of tuples as training data, to estimate the parameters for (Equation (5)). By substituting Equation (4) with the optimal and Equation (5) with the optimal , we can derive the final execution cost model for .
- (2) Training data for . As mentioned in Section 3.1, is associated with the predefined error bound e. Thus, we collect a set of tuples as training data to estimate the parameters for (Equation (6)).
- Training using the normal equation. For illustrative purposes, we will use as an example to explain the training process, which is identical for both and . Given the training data collected for , the training using the normal equation proceeds as follows:
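As a minimal sketch of normal-equation training, the code below fits a polynomial of a chosen order to (error bound, execution time) observations by solving (XᵀX)θ = Xᵀy in pure Python. The observations, the polynomial order, the use of percent units for e (chosen here only to keep the system well conditioned), and all function names are assumptions made for illustration; they are not values or identifiers from the paper. The same routine could be applied to each of the tuple sets described above.

```python
from typing import List, Sequence


def design_row(x: float, order: int) -> List[float]:
    """Polynomial features [1, x, x^2, ..., x^order]."""
    return [x ** k for k in range(order + 1)]


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ theta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular normal matrix; need more distinct observations")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    theta = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * theta[c] for c in range(r + 1, n))
        theta[r] = (m[r][n] - s) / m[r][r]
    return theta


def fit_normal_equation(xs: Sequence[float], ys: Sequence[float], order: int) -> List[float]:
    """Least-squares polynomial coefficients from (X^T X) theta = X^T y."""
    rows = [design_row(x, order) for x in xs]
    k = order + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve_linear(xtx, xty)


def predict(theta: Sequence[float], x: float) -> float:
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(theta))


# Hypothetical training data: error bound in percent, measured time in seconds.
e_percent = [1.0, 2.0, 3.0, 4.0, 5.0]
seconds = [5.0 - 1.2 * p + 0.09 * p * p for p in e_percent]
theta = fit_normal_equation(e_percent, seconds, order=2)
print(theta, predict(theta, 2.5))
```

Because the synthetic data are generated from an exact quadratic, the fit recovers its coefficients up to rounding; with real run-time statistics the result is the least-squares (minimum MSE) approximation.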
4. Complex Aggregate Query Method Based on the Cost Model of AQS
4.1. Accuracy Constraints on Complex Aggregate Queries
- Case 1: .In this scenario, if , the true value of is , and the approximate query result is . We want to determine the accuracy requirements between e and that ensure . We consider two cases:When is to the right side of the confidence interval, specifically , we have.When is to the left side of the confidence interval, specifically , we have.In summary, if any sub-query satisfies , then is established under the accuracy condition of and .
- Case 2: .In this scenario, if , the true value of is , and the approximate query result is . Similarly to Case 1, we consider two cases and derive the accuracy requirements between e and . The explanation of this result can be summarized as follows:If any sub-query satisfies , then is established under the accuracy conditions of and .
- Case 3: .In this case, when and represents , the approximate query result is . Two scenarios are analyzed, and accuracy conditions between e and are derived. The explanation of the result can be summarized as follows:If any sub-query satisfies , then is established under the accuracy condition of .
- Case 4: .In the final case, when and represents , the query result is , we once again consider two scenarios and derive the corresponding accuracy conditions. The explanation of this result can be summarized as follows:If any sub-query satisfies , then is established under the condition of and .
- General form. The analysis above extends from these arithmetic operators to a broader set of operators, and all operations between sub-queries can be abstracted into a single symbol ⊕. The accuracy constraints on complex aggregate queries are then established as follows. Given a complex aggregate query, its result is calculated from an expression that connects the results of its sub-queries with a series of operators; abstracting these operators into ⊕, the final result is the ⊕-combination of the sub-query results. Operator priorities (e.g., among numeric operators, string operators, and logical operators) determine the order in which sub-query results are combined: operators with a higher priority are combined first, followed by operators with a lower priority, and parentheses can be used to modify this order. The error-bound constraints for the sub-queries can therefore be determined bottom-up. If the operator between two sub-queries holds the highest priority, the error bounds of those two sub-queries must first satisfy the constraint for that operator. If the combined result is then an operand of a further operator, the corresponding error-bound constraint is derived in the same way, and so forth. Ultimately, we derive the general form of the error-bound constraint for complex aggregate queries, expressed as Equation (9).
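The bottom-up idea can be sketched with textbook worst-case propagation of relative error bounds through +, −, ×, and ÷ for positive quantities. This is an illustration under stated assumptions, not the paper's Equation (9): the classes `Leaf` and `Node`, the function `propagate`, and the example values are hypothetical, and the `value` fields stand in for true sub-query results (or proxies for them), which are only needed for subtraction.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Leaf:
    value: float  # positive sub-query result (true value or a proxy)
    e: float      # relative error bound of that sub-query


@dataclass(frozen=True)
class Node:
    op: str       # one of "+", "-", "*", "/"
    left: "Expr"
    right: "Expr"


Expr = Union[Leaf, Node]


def propagate(expr: Expr) -> Leaf:
    """Return the value of expr and a worst-case relative error bound, bottom-up."""
    if isinstance(expr, Leaf):
        if expr.value <= 0 or expr.e < 0:
            raise ValueError("leaves need a positive value and a non-negative bound")
        return expr
    a, b = propagate(expr.left), propagate(expr.right)
    if expr.op == "+":
        # |da + db| <= ea*a + eb*b <= max(ea, eb) * (a + b)
        return Leaf(a.value + b.value, max(a.e, b.e))
    if expr.op == "-":
        if a.value <= b.value:
            raise ValueError("this sketch only handles positive differences")
        return Leaf(a.value - b.value, (a.e * a.value + b.e * b.value) / (a.value - b.value))
    if expr.op == "*":
        # (1 + ea)(1 + eb) - 1
        return Leaf(a.value * b.value, a.e + b.e + a.e * b.e)
    if expr.op == "/":
        if b.e >= 1:
            raise ValueError("divisor error bound must be below 1")
        # (1 + ea) / (1 - eb) - 1
        return Leaf(a.value / b.value, (a.e + b.e) / (1.0 - b.e))
    raise ValueError(f"unsupported operator: {expr.op}")


# Hypothetical complex query (Q1 + Q2) / Q3 with sub-query bounds of 2%, 3%, and 1%.
query = Node("/", Node("+", Leaf(400.0, 0.02), Leaf(600.0, 0.03)), Leaf(50.0, 0.01))
result = propagate(query)
print(result.value, result.e)
```

Comparing `result.e` with the user-desired bound for the complex query shows whether a given assignment of sub-query bounds is admissible; the parenthesized sub-expression is evaluated first, mirroring the priority order described above.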
4.2. Multi-Objective Optimization Based on Genetic Algorithm
Algorithm 1. Multi-objective optimization based on the genetic algorithm
Input: error bound constraint, the sub-queries’ cost models
Output: optimal combination of error bounds
initpop = initPopulation(s, n);
for each iteration do
    pop = encoding();
    fitness();
    crossover();
    mutation();
    decoding();
    roulettewheel();
    findBest();
return the best combination of error bounds;
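A runnable sketch of this kind of search is given below. It is a simplification rather than the authors' implementation: it uses a real-valued encoding (so explicit encoding and decoding steps are unnecessary), folds the accuracy constraint into the fitness as a feasibility check instead of treating it as a separate objective, and relies on hypothetical cost models of the form a + b/e and a hypothetical product-style constraint. All names and parameter values are assumptions for illustration, and the cost models are assumed to return positive values.

```python
import random
from typing import Callable, List, Optional, Sequence


def genetic_search(
    costs: Sequence[Callable[[float], float]],
    combined_error: Callable[[Sequence[float]], float],
    e_user: float,
    lo: float = 0.001,
    hi: float = 0.1,
    pop_size: int = 40,
    iterations: int = 200,
    mutation_rate: float = 0.2,
    seed: int = 0,
) -> List[float]:
    """Search sub-query error bounds that minimise the total predicted cost
    while keeping the combined error bound within e_user."""
    rng = random.Random(seed)
    n = len(costs)

    def total_cost(ind: Sequence[float]) -> float:
        return sum(cost(e) for cost, e in zip(costs, ind))

    def feasible(ind: Sequence[float]) -> bool:
        return combined_error(ind) <= e_user

    def fitness(ind: Sequence[float]) -> float:
        # Cheaper feasible individuals are fitter; infeasible ones are rarely selected.
        return 1.0 / total_cost(ind) if feasible(ind) else 1e-12

    def find_best(pop: List[List[float]], best: Optional[List[float]]) -> Optional[List[float]]:
        for ind in pop:
            if feasible(ind) and (best is None or total_cost(ind) < total_cost(best)):
                best = list(ind)
        return best

    # Initial population: random bounds plus the tightest setting as a safe start.
    pop = [[rng.uniform(lo, hi) for _ in range(n)] for _ in range(pop_size - 1)]
    pop.append([lo] * n)
    best = find_best(pop, None)

    for _ in range(iterations):
        weights = [fitness(ind) for ind in pop]
        children = [list(best)] if best is not None else []  # elitism
        while len(children) < pop_size:
            p1, p2 = rng.choices(pop, weights=weights, k=2)  # roulette-wheel selection
            child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]  # uniform crossover
            for i in range(n):  # Gaussian mutation, clipped to [lo, hi]
                if rng.random() < mutation_rate:
                    child[i] = min(hi, max(lo, child[i] + rng.gauss(0.0, 0.1 * (hi - lo))))
            children.append(child)
        pop = children
        best = find_best(pop, best)

    if best is None:
        raise ValueError("no feasible combination of error bounds was found")
    return best


# Hypothetical cost models (seconds) and a product-style constraint for two sub-queries.
costs = [lambda e: 2.0 + 0.5 / e, lambda e: 1.0 + 0.2 / e]
combined = lambda b: b[0] + b[1] + b[0] * b[1]
best = genetic_search(costs, combined, e_user=0.05)
print(best, combined(best), sum(c(e) for c, e in zip(costs, best)))
```

The returned combination always satisfies the constraint, and elitism ensures it is never more expensive than the tightest feasible setting seeded into the initial population.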
5. Experiments
5.1. Experimental Setup
- Datasets. We used three real-world datasets, as shown in Table 2. (1) DBpedia [27] is an open-domain knowledge base constructed from Wikipedia. (2) Freebase [28] is a knowledge base collected from many sources, including wiki contributions submitted by individuals and users. (3) YAGO [29] is a knowledge base containing information from Wikipedia, WordNet, and GeoNames. We used the CORE part of YAGO (excluding information from GeoNames) as our dataset.
- Query workload. We conducted experiments on the three datasets to evaluate our method. As shown in Tables 3 and 4, 10 of the 127 simple aggregate query instances and 8 of the 138 complex aggregate query instances were used in the experiments. Table 3 also lists, for each simple aggregate query, the aggregate operation and the corresponding attribute. Because COUNT calculates the number of entities and does not correspond to a specific attribute, it is represented with the * symbol. (1) We selected 10 factoid queries from QALD-4 [30,31] (a benchmark of factoid queries over DBpedia) and modified them to form simple aggregate queries using COUNT, AVG, and SUM (Table 3); we then generated complex aggregate queries based on these simple aggregate queries (Table 4). (2) We selected 12 factoid queries from WebQuestions [32], a factoid query benchmark over Freebase, and expanded them into AVG and SUM queries (Table 3), from which we generated complex aggregate queries (Table 4). (3) Finally, we generated queries for the YAGO dataset (Table 3) and derived complex aggregate queries in the same way as for DBpedia and Freebase (Table 4).
- Metrics. We used four quantitative metrics for assessment: (1) Error rate of the cost model (ERCM). ERCM measures the error of a query’s running time predicted by our cost model relative to its real running time. Given the predicted running time $\hat{t}$ and the real running time $t$, ERCM is calculated as $|\hat{t} - t| / t$. (2) Precision of the cost model with ERCM below 5% (PCM-5%). PCM-5% is the ratio of queries exhibiting ERCM ≤ 5% to the total number of queries. These first two metrics assess the accuracy of our execution cost model in predicting the execution times of AQS. To further evaluate the efficiency and effectiveness of the simple and complex aggregate queries using our execution cost model, we used two additional metrics: (3) Relative error of a query. Given the ground truth $V$ of a query and the result $\hat{V}$ returned by a specific method, the relative error is calculated as $|\hat{V} - V| / V$. (4) Response time. This is the time taken by the method to produce a response. Throughout our experiments, each query was executed at least five times. The reported metric values are averages computed across all queries.
- Comparison methods. To evaluate the efficiency and effectiveness of running complex aggregate queries over KGs using the cost model, we compared (1) our approach with several methods from the literature for graph queries over KGs: (2) AQS [18], the latest work supporting aggregate queries for KGs; (3) SGQ [9], a semantic-guided query algorithm for KGs; (4) GraB [7], a structure-similarity-based index-free query method; (5) QGA [33], a keyword-based graph search method; and (6) EAQ [34], another solution for aggregate queries on KGs, which collects candidate entities via link prediction and only computes aggregate results for simple queries. Since methods (3)–(6) do not support complex aggregate queries, we first extended them to handle complex aggregate queries by applying additional aggregate operations to the results returned for each sub-query to compute the final result.
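The metrics above can be computed with a few helper functions. The sketch below assumes the usual definitions (absolute deviation divided by the reference value); the function names and the sample timings are hypothetical and are not taken from the experiments.

```python
from typing import Sequence


def ercm(predicted: float, actual: float) -> float:
    """Error rate of the cost model: |predicted - actual| / actual."""
    if actual <= 0:
        raise ValueError("actual running time must be positive")
    return abs(predicted - actual) / actual


def pcm(predicted: Sequence[float], actual: Sequence[float], threshold: float = 0.05) -> float:
    """Fraction of queries whose ERCM is at most the threshold (PCM-5% by default)."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(1 for p, a in zip(predicted, actual) if ercm(p, a) <= threshold)
    return hits / len(actual)


def relative_error(estimate: float, truth: float) -> float:
    """Relative error of an approximate aggregate result."""
    if truth == 0:
        raise ValueError("ground truth must be non-zero")
    return abs(estimate - truth) / abs(truth)


# Hypothetical predicted and measured running times in milliseconds.
predicted_ms = [98.0, 208.0, 50.0]
actual_ms = [100.0, 200.0, 60.0]
print(pcm(predicted_ms, actual_ms))
print(relative_error(1030.0, 1000.0))
```

In this made-up sample, two of the three queries fall within the 5% threshold, so PCM-5% is two thirds.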
5.2. Evaluation of the Execution Cost Model of AQS
5.2.1. Effectiveness Evaluation
5.2.2. Efficiency Evaluation
5.2.3. Effect of the Size of Training Data
5.2.4. Discussion of the Execution Cost Model of AQS
5.3. Evaluation of Simple Aggregate Queries Based on the Cost Model
5.3.1. Effectiveness and Efficiency Evaluation
5.3.2. Discussion of Simple Aggregate Queries Based on the Cost Model
5.4. Evaluation of Complex Aggregate Queries Based on Cost Model
5.4.1. Necessity of the Optimization of Sub-Queries’ Error Bounds
5.4.2. Effect of the Crossover and Mutation Algorithms Used in Genetic Algorithms
5.4.3. Effectiveness and Efficiency Evaluation
5.4.4. Parameter Sensitivity
- User-desired error bound e. Figure 13 presents the relative error rate and query time of the complex aggregate query results for predefined error bounds ranging from 1% to 5%. The results show that the proposed method adapts to different accuracy requirements. In addition, as the predefined error bound became more lenient, the query response time decreased. This closely parallels the behavior observed for simple aggregate queries, which is expected because complex aggregate queries consist of multiple simple aggregate queries. The main reason for the improved efficiency is that a more relaxed error bound often requires fewer samples and sampling rounds, which reduces the search space and the number of iterations and ultimately leads to quicker query responses.
- Iterations r. Figure 14 depicts the relative error rate and query time of the complex aggregate query results across a range of iterations for the genetic algorithm, from 10 to 1000. It is evident from Figure 11 that when the number of iterations fell within the range of 10 to 50, there was a notable increase in the relative error rate. However, once the number of iterations surpassed 50, the relative error rate remained relatively stable. This behavior stemmed from the fact that when the number of iterations was insufficient, the genetic algorithm struggled to converge towards the optimal solution that satisfied Equation (10). Consequently, the error bound remained smaller, resulting in more accurate query results. Furthermore, Figure 14 shows that response times increased when the number of iterations was either too low or too high. With too few iterations, the genetic algorithm yields inaccurate results, which biases the error bounds of the sub-queries; while this bias may make the sub-query results more accurate, it also increases the time consumed by the sub-queries and therefore lengthens the complex aggregate query time. Conversely, with too many iterations, a significant amount of time is spent in the iterative process itself, with only marginal improvements in the relative error rate of the final result.
- Confidence level . Figure 15 presents the relative error rate and response time for complex aggregate query results, while varying the confidence level from 86% to 98%. The results depicted in Figure 15 reveal a conspicuous trend: as the confidence level increased, the relative error rate decreased. This phenomenon can be primarily attributed to the fact that higher confidence levels correspond to smaller half-widths of the result interval, resulting in a more tightly constrained confidence interval and more precise estimated results. Furthermore, as illustrated in Figure 15, an escalation in the confidence level was associated with a gradual increase in query response time. This can be explained by the need for more query time to obtain a narrower confidence interval when striving for a higher confidence level.
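The effect of the error bound and the confidence level on the amount of sampling can be illustrated with a textbook normal-approximation estimate of the required sample size. AQS itself relies on the bag of little bootstraps, so this is only a proxy under stated assumptions: the population mean and standard deviation are hypothetical, the target half-width is simplified to e times the mean, and `required_sample_size` is an illustrative name.

```python
import math
from statistics import NormalDist


def required_sample_size(mean: float, std: float, e: float, confidence: float) -> int:
    """Samples needed so that a normal-approximation CI half-width is at most e * mean."""
    if mean <= 0 or std <= 0 or e <= 0 or not 0 < confidence < 1:
        raise ValueError("invalid arguments")
    # Two-sided critical value for the requested confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = e * mean
    return math.ceil((z * std / half_width) ** 2)


# Hypothetical attribute distribution: mean 100, standard deviation 50.
for e in (0.01, 0.03, 0.05):
    print("e =", e, "n =", required_sample_size(100.0, 50.0, e, 0.95))
for conf in (0.86, 0.92, 0.98):
    print("confidence =", conf, "n =", required_sample_size(100.0, 50.0, 0.03, conf))
```

Under this approximation, the required sample size grows as the error bound shrinks or as the confidence level rises, which is consistent with the response-time trends reported for Figures 13 and 15.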
5.4.5. Discussion of Complex Aggregate Queries Based on the Cost Model
6. Related Work
- Online Aggregation. Online aggregation (OLA) is one of the earliest approaches for answering aggregate queries and was initially introduced in [35]. It relies on sampling techniques to provide approximate aggregate results over relational data. Subsequent research has considered (1) OLA for joins and group-by operations [36,37,38,39,40,41,42], (2) OLA in distributed environments [43,44,45,46,47,48], and (3) multi-query optimization for OLA [49,50]. However, none of these approaches can be readily applied to aggregate queries over knowledge graphs (KGs), because a KG’s schema-flexible structure differs significantly from the rigid schema of relational data. To bridge this gap, [18] devised a semantic-aware sampling method tailored for KGs, which brings OLA-style approximate aggregation to aggregate query processing over KGs.
- Factoid Query for KGs. Factoid queries are an important application of knowledge graph queries. They aim to retrieve clear, factual answers from structured knowledge stores (such as knowledge graphs or semantic databases) in response to specific questions raised by users [4,7,12,14,51,52]. Existing methods retrieve specific facts through keyword matching, entity recognition, and learning [4,6,7,8,9,12,13,14,15,51,52,53,54]. However, factoid queries are limited to retrieving direct, factual answers. They typically require a one-to-one mapping between the query and the answer and demand that the user provide a highly precise description of the input question; otherwise, the query response may contain a substantial error. Factoid queries are also unable to infer additional information concealed within the KGs. This limitation served as one of the motivations behind the introduction of aggregate queries.
- Aggregate Query for KGs. At present, the prevailing methods for aggregate queries over KGs are those built on factoid queries, typically SPARQL aggregate queries [8,55,56,57,58], and AQS. The factoid-query-based approach frequently incurs additional time overhead, because extra aggregate operations are applied to the factoid query results, and its effectiveness is contingent upon the quality of those results. AQS, in turn, still cannot fully support complex aggregate queries with its "sampling estimation" model. In light of these limitations, we introduced a cost model designed to improve the efficiency and accuracy of complex aggregate queries.
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Guha, R.V.; McCool, R.; Miller, E. Semantic Search. In Proceedings of the WWW 2003, Budapest, Hungary, 20–24 May 2003.
- Namaki, M.H.; Song, Q.; Wu, Y.; Yang, S. Answering Why-questions by Exemplars in Attributed Graphs. In Proceedings of the SIGMOD 2019, Amsterdam, The Netherlands, 30 June–5 July 2019; pp. 1481–1498.
- Agichtein, E.; Cucerzan, S.; Brill, E. Analysis of Factoid Questions for Effective Relation Extraction. In Proceedings of the SIGIR 2005, Salvador, Brazil, 15–19 August 2005.
- Khan, A.; Wu, Y.; Aggarwal, C.C.; Yan, X. NeMa: Fast Graph Search with Label Similarity. PVLDB 2013, 6, 181–192.
- Zou, L.; Huang, R.; Wang, H.; Yu, J.X.; He, W.; Zhao, D. Natural Language Question Answering over RDF: A Graph Driven Approach. In Proceedings of the SIGMOD 2014, Snowbird, UT, USA, 22–27 June 2014.
- Yang, S.; Han, F.; Wu, Y.; Yan, X. Fast Top-k Search in Knowledge Graphs. In Proceedings of the ICDE 2016, Helsinki, Finland, 16–20 May 2016.
- Jin, J.; Khemmarat, S.; Gao, L.; Luo, J. Querying Web-Scale Information Networks Through Bounding Matching Scores. In Proceedings of the WWW 2015, Florence, Italy, 18–22 May 2015.
- Zou, L.; Özsu, M.T.; Chen, L.; Shen, X.; Huang, R.; Zhao, D. gStore: A Graph-based SPARQL Query Engine. VLDB J. 2014, 23, 565–590.
- Wang, Y.; Khan, A.; Wu, T.; Jin, J.; Yan, H. Semantic Guided and Response Times Bounded Top-k Similarity Search over Knowledge Graphs. In Proceedings of the ICDE 2020, Dallas, TX, USA, 20–24 April 2020.
- Cui, W.; Xiao, Y.; Wang, H.; Song, Y.; Hwang, S.W.; Wang, W. KBQA: Learning Question Answering over QA Corpora and Knowledge Bases. PVLDB 2017, 10, 565–576.
- Bonifati, A.; Martens, W.; Timm, T. An Analytical Study of Large SPARQL Query Logs. PVLDB 2017, 11, 149–161.
- Khan, A.; Li, N.; Yan, X.; Guan, Z.; Chakraborty, S.; Tao, S. Neighborhood Based Fast Graph Search in Large Networks. In Proceedings of the SIGMOD 2011, Athens, Greece, 12–16 June 2011.
- Yang, S.; Wu, Y.; Sun, H.; Yan, X. Schemaless and Structureless Graph Querying. PVLDB 2014, 7, 565–576.
- Fan, W.; Li, J.; Ma, S.; Wang, H.; Wu, Y. Graph Homomorphism Revisited for Graph Matching. PVLDB 2010, 3, 1161–1172.
- Zheng, W.; Zou, L.; Peng, W.; Yan, X.; Song, S.; Zhao, D. Semantic SPARQL Similarity Search over RDF Knowledge Graphs. PVLDB 2016, 9, 840–851.
- Lissandrini, M.; Pedersen, T.B.; Hose, K.; Mottin, D. Knowledge Graph Exploration: Where Are We and Where Are We Going? ACM SIGWEB Newsletter: New York, NY, USA, 2020.
- Wu, Y.; Khan, A. Graph Pattern Matching. In Encyclopedia of Big Data Technologies; Springer: Berlin/Heidelberg, Germany, 2019.
- Wang, Y.; Khan, A.; Xu, X.; Jin, J.; Hong, Q.; Fu, T. Aggregate Queries on Knowledge Graphs: Fast Approximation with Semantic-aware Sampling. In Proceedings of the ICDE 2022, Kuala Lumpur, Malaysia, 9–12 May 2022.
- Wang, Y.; Khan, A.; Xu, X.; Ye, S.; Pan, S.; Zhou, Y. Approximate and Interactive Processing of Aggregate Queries on Knowledge Graphs: A Demonstration. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, 17–21 October 2022; Hasan, M.A., Xiong, L., Eds.; ACM: New York, NY, USA, 2022; pp. 5034–5038.
- Bao, J.; Duan, N.; Yan, Z.; Zhou, M.; Zhao, T. Constraint-based Question Answering with Knowledge Graph. In Proceedings of the COLING 2016, Osaka, Japan, 11–16 December 2016.
- Bordes, A.; Usunier, N.; Chopra, S.; Weston, J. Large-scale Simple Question Answering with Memory Networks. arXiv 2015, arXiv:1506.02075.
- Huang, X.; Zhang, J.; Li, D.; Li, P. Knowledge Graph Embedding Based Question Answering. In Proceedings of the WSDM 2019, Melbourne, Australia, 11–15 February 2019.
- Jin, J.; Khemmarat, S.; Gao, L.; Luo, J. A Distributed Approach for Top-k Star Queries on Massive Information Networks. In Proceedings of the ICPADS 2014, Hsinchu, Taiwan, 16–19 December 2014.
- Bordes, A.; Usunier, N.; García-Durán, A.; Weston, J.; Yakhnenko, O. Translating Embeddings for Modeling Multi-relational Data. In Proceedings of the NIPS 2013, Lake Tahoe, NV, USA, 5–10 December 2013.
- Li, Y.; Wu, Z.; Lin, S.; Xie, H.; Lv, M.; Xu, Y.; Lui, J.C. Walking with Perception: Efficient Random Walk Sampling via Common Neighbor Awareness. In Proceedings of the ICDE 2019, Macao, China, 8–11 April 2019.
- Mangasarian, O.L. Nonlinear Programming; SIAM: Philadelphia, PA, USA, 1994.
- Lehmann, J.; Isele, R.; Jakob, M.; Jentzsch, A.; Kontokostas, D.; Mendes, P.N.; Hellmann, S.; Morsey, M.; van Kleef, P.; Auer, S.; et al. DBpedia—A Large-Scale, Multilingual Knowledge Base Extracted from Wikipedia. Semant. Web 2015, 6, 167–195.
- Bollacker, K.D.; Evans, C.; Paritosh, P.; Sturge, T.; Taylor, J. Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge. In Proceedings of the SIGMOD 2008, Vancouver, BC, Canada, 9–12 June 2008.
- Hoffart, J.; Suchanek, F.M.; Berberich, K.; Weikum, G. YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia. Artif. Intell. 2013, 194, 28–61.
- QALD-4. Available online: http://qald.aksw.org/index.php?x=challenge&q=4 (accessed on 9 December 2014).
- Unger, C.; Forascu, C.; Lopez, V.; Ngomo, A.C.N.; Cabrio, E.; Cimiano, P.; Walter, S. Question Answering over Linked Data (QALD-4). In Proceedings of the Working Notes for CLEF 2014 Conference, Sheffield, UK, 15–18 September 2014.
- Berant, J.; Chou, A.; Frostig, R.; Liang, P. Semantic Parsing on Freebase from Question-answer Pairs. In Proceedings of the EMNLP 2013, Seattle, WA, USA, 18–21 October 2013.
- Han, S.; Zou, L.; Yu, J.X.; Zhao, D. Keyword Search on RDF Graphs—A Query Graph Assembly Approach. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM 2017, Singapore, 6–10 November 2017; Lim, E., Winslett, M., Sanderson, M., Fu, A.W., Sun, J., Culpepper, J.S., Lo, E., Ho, J.C., Donato, D., Agrawal, R., et al., Eds.; ACM: New York, NY, USA, 2017; pp. 227–236.
- Li, Y.; Ge, T.; Chen, C. Online Indices for Predictive Top-k Entity and Aggregate Queries on Knowledge Graphs. In Proceedings of the 2020 IEEE 36th International Conference on Data Engineering (ICDE), Dallas, TX, USA, 20–24 April 2020. [Google Scholar]
- Hellerstein, J.M.; Haas, P.J.; Wang, H.J. Online Aggregation. In Proceedings of the SIGMOD 1997, Tucson, AZ, USA, 13–15 May 1997.
- Haas, P.J.; Hellerstein, J.M. Ripple Joins for Online Aggregation. In Proceedings of the SIGMOD 1999, Philadelphia, PA, USA, 1–3 June 1999.
- Jermaine, C.M.; Arumugam, S.; Pol, A.; Dobra, A. Scalable Approximate Query Processing with the DBO Engine. In Proceedings of the SIGMOD 2007, Beijing, China, 12–14 June 2007.
- Luo, G.; Ellmann, C.J.; Haas, P.J.; Naughton, J.F. A Scalable Hash Ripple Join Algorithm. In Proceedings of the SIGMOD 2002, Madison, WI, USA, 3–6 June 2002.
- Li, F.; Wu, B.; Yi, K.; Zhao, Z. Wander Join: Online Aggregation via Random Walks. In Proceedings of the SIGMOD 2016, San Francisco, CA, USA, 26 June–1 July 2016.
- Park, Y.; Mozafari, B.; Sorenson, J.; Wang, J. VerdictDB: Universalizing Approximate Query Processing. In Proceedings of the SIGMOD 2018, Houston, TX, USA, 10–15 June 2018.
- Acharya, S.; Gibbons, P.B.; Poosala, V. Congressional Samples for Approximate Answering of Group-By Queries. In Proceedings of the SIGMOD 2000, Dallas, TX, USA, 16–18 May 2000.
- Ding, B.; Huang, S.; Chaudhuri, S.; Chakrabarti, K.; Wang, C. Sample + Seek: Approximating Aggregates with Distribution Precision Guarantee. In Proceedings of the SIGMOD 2016, San Francisco, CA, USA, 26 June–1 July 2016.
- Shi, Y.; Meng, X.; Wang, F.; Gan, Y. You Can Stop Early with COLA: Online Processing of Aggregate Queries in the Cloud. In Proceedings of the CIKM 2012, Maui, HI, USA, 29 October–2 November 2012.
- Wu, S.; Jiang, S.; Ooi, B.C.; Tan, K. Distributed Online Aggregation. PVLDB 2009, 2, 443–454.
- Condie, T.; Conway, N.; Alvaro, P.; Hellerstein, J.M.; Gerth, J.; Talbot, J.; Elmeleegy, K.; Sears, R. Online Aggregation and Continuous Query Support in MapReduce. In Proceedings of the SIGMOD 2010, Indianapolis, IN, USA, 6–10 June 2010.
- Condie, T.; Conway, N.; Alvaro, P.; Hellerstein, J.M.; Elmeleegy, K.; Sears, R. MapReduce Online. In Proceedings of the NSDI 2010, San Jose, CA, USA, 28–30 April 2010.
- Pansare, N.; Borkar, V.R.; Jermaine, C.; Condie, T. Online Aggregation for Large MapReduce Jobs. PVLDB 2011, 4, 1135–1145.
- Zeng, K.; Agarwal, S.; Dave, A.; Armbrust, M.; Stoica, I. G-OLA: Generalized On-Line Aggregation for Interactive Analysis on Big Data. In Proceedings of the SIGMOD 2015, Melbourne, Australia, 31 May–4 June 2015.
- Wu, S.; Ooi, B.C.; Tan, K.L. Continuous Sampling for Online Aggregation over Multiple Queries. In Proceedings of the SIGMOD 2010, Indianapolis, IN, USA, 6–10 June 2010.
- Wang, Y.; Luo, J.; Song, A.; Dong, F. OATS: Online Aggregation with Two-Level Sharing Strategy in Cloud. Distrib. Parallel Databases 2014, 32, 467–505.
- Cheng, J.; Yu, J.X.; Ding, B.; Yu, P.S.; Wang, H. Fast Graph Pattern Matching. In Proceedings of the 24th International Conference on Data Engineering, ICDE 2008, Cancún, México, 7–12 April 2008.
- Ma, S.; Cao, Y.; Fan, W.; Huai, J.; Wo, T. Strong simulation: Capturing topology in graph pattern matching. ACM Trans. Database Syst. (TODS) 2014, 39, 1–46.
- Zheng, W.; Zou, L.; Lian, X.; Wang, D.; Zhao, D. Graph similarity search with edit distance constraint in large graph databases. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, San Francisco, CA, USA, 27 October–1 November 2013; pp. 1595–1600.
- Zheng, W.; Lian, X.; Zou, L.; Hong, L.; Zhao, D. Online Subgraph Skyline Analysis over Knowledge Graphs. IEEE Trans. Knowl. Data Eng. 2016, 28, 1805–1819.
- Hu, X.; Duan, J.; Dang, D. Scalable Aggregate Keyword Query over Knowledge Graph. Future Gener. Comput. Syst. 2020, 107, 588–600.
- Unger, C.; Bühmann, L.; Lehmann, J.; Ngomo, A.N.; Gerber, D.; Cimiano, P. Template-based Question Answering over RDF Data. In Proceedings of the WWW 2012, Lyon, France, 16–20 April 2012.
- Höffner, K.; Lehmann, J.; Usbeck, R. CubeQA—Question Answering on RDF Data Cubes. In Proceedings of the ISWC 2016, Kobe, Japan, 17–21 October 2016.
- Hu, X.; Dang, D.; Yao, Y.; Ye, L. Natural Language Aggregate Query over RDF Data. Inf. Sci. 2018, 454–455, 363–381.
Notations | Descriptions |
---|---|
G | A knowledge graph |
| | A simple aggregate query over G with a query graph Q and an aggregate function |
| | A complex aggregate query over G consisting of sub-queries and an aggregate function |
V | The ground truth of ; |
| | The ground truth of |
| | The estimated approximate result of |
| | The estimated approximate result of |
e | A predefined error bound |
| | An error bound for |
| | A set of error bounds to achieve a balance between effectiveness and efficiency for a complex aggregate query |
| | A user-input confidence level |
| | The confidence interval (CI) at confidence level for complex aggregate query |
| | The confidence interval (CI) at confidence level for sub-query |
| | The half width of ’s CI, called the Margin of Error (MoE) of |
| | The half width of ’s CI, called the Margin of Error (MoE) of |
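
The notations above tie together an estimated result, its confidence interval (CI), the Margin of Error (MoE, the half width of the CI), and the predefined error bound e. The following minimal Python sketch illustrates how these quantities can interact for an AVG-style estimate. It is only an illustration under stated assumptions: it uses a plain percentile bootstrap rather than the bag of little bootstraps used by AQS, and the function names `bootstrap_moe` and `moe_meets_error_bound`, the parameter defaults, and the sample values are hypothetical. The stopping condition MoE ≤ V̂ · e / (1 + e) follows from requiring |V̂ − V| / V ≤ e for every V inside the CI, assuming V̂ − MoE > 0.

```python
import random
import statistics


def bootstrap_moe(sample, alpha=0.05, n_resamples=1000, seed=0):
    """Return (estimate, moe) for the sample mean.

    Assumption: a plain percentile bootstrap stands in for BLB here.
    moe is the half width of the (1 - alpha) percentile interval.
    """
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(sample), (hi - lo) / 2


def moe_meets_error_bound(estimate, moe, e):
    """True if every value in [estimate - moe, estimate + moe] keeps the
    relative error of `estimate` within e (assumes estimate - moe > 0)."""
    return moe <= estimate * e / (1 + e)


if __name__ == "__main__":
    # Hypothetical attribute values (e.g., prices) of sampled entities.
    prices = [21000.0, 25500.0, 19800.0, 30250.0, 27400.0] * 40
    v_hat, moe = bootstrap_moe(prices, alpha=0.05)
    print(v_hat, moe, moe_meets_error_bound(v_hat, moe, e=0.05))
```

If the check fails, a sampling-based method would typically draw more samples and re-estimate; how many additional samples to draw is outside the scope of this sketch.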
Datasets | Nodes | Edges | Node-Types | Edge-Predicates |
---|---|---|---|---|
DBpedia | 4,521,912 | 15,045,801 | 359 | 676 |
Freebase | 5,706,539 | 48,724,743 | 11,666 | 5118 |
YAGO2 | 7,308,072 | 36,624,106 | 6543 | 101 |
QID | Queries | Aggregate Function |
---|---|---|
| | How many cars are produced in Germany? | COUNT (∗) |
| | How many movies are there in Denmark? | COUNT (∗) |
| | What’s the average price of cars that are produced in Germany? | AVG (price) |
| | What is the total salary of Spanish football players? | SUM (salary) |
| | How many movies were directed by Steven Spielberg? | COUNT (∗) |
| | What’s the average rating of the movies that were directed by Steven Spielberg? | AVG (rating) |
| | What is the total box office of Hans Zimmer’s films? | SUM (box office) |
| | How many companies are there in England? | COUNT (∗) |
| | What is the average population of Chinese cities? | AVG (population) |
| | What is the total GDP of Chinese cities? | SUM (GDP) |
QID | Queries | |
---|---|---|
| | How many cars are produced in China and Germany? | (+) |
| | What’s the total salary of soccer players from Spain and Portugal? | () |
| | What’s the average price of cars that are produced in China and Germany? | () |
| | How many times more islands are there in Oceania than in the Pacific? | (/)) |
| | What’s the total box office of films directed by Steven Spielberg? | (×) |
| | How many times more films are directed by Steven Spielberg than by Hans Zimmer? | () |
| | How many more companies are there in England than in Spain? | |
| | How many times greater is the total length of rivers in China than in Brazil? | () |
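
The complex queries above combine the results of simple sub-queries with arithmetic operators (for example a sum, a difference, or a ratio). As a rough illustration of how sub-query estimates and their Margins of Error might be combined, the sketch below applies plain interval arithmetic. This is an assumption made for illustration, not the error-bound allocation based on the execution cost model described in this paper; the `Estimate` class, the function names, and the numbers are hypothetical. Note also that if each sub-query CI holds at confidence level 1 − α, the union bound only guarantees that both hold jointly with probability at least 1 − 2α.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    value: float  # approximate result of one sub-query
    moe: float    # margin of error (half width of its CI)

    @property
    def lo(self) -> float:
        return self.value - self.moe

    @property
    def hi(self) -> float:
        return self.value + self.moe


def combine_sum(a: Estimate, b: Estimate) -> Estimate:
    # [a.lo + b.lo, a.hi + b.hi] has half width a.moe + b.moe.
    return Estimate(a.value + b.value, a.moe + b.moe)


def combine_diff(a: Estimate, b: Estimate) -> Estimate:
    # [a.lo - b.hi, a.hi - b.lo] also has half width a.moe + b.moe.
    return Estimate(a.value - b.value, a.moe + b.moe)


def combine_ratio(a: Estimate, b: Estimate):
    """Return (point estimate, lower bound, upper bound) for a / b.

    Only valid when both intervals are strictly positive; the resulting
    interval is generally not symmetric around the point estimate.
    """
    if a.lo <= 0 or b.lo <= 0:
        raise ValueError("ratio bounds require strictly positive intervals")
    return a.value / b.value, a.lo / b.hi, a.hi / b.lo


if __name__ == "__main__":
    # Hypothetical COUNT estimates for two sub-queries.
    cars_china = Estimate(value=100.0, moe=5.0)
    cars_germany = Estimate(value=50.0, moe=2.0)
    print(combine_sum(cars_china, cars_germany))
    print(combine_diff(cars_china, cars_germany))
    print(combine_ratio(cars_china, cars_germany))
```

The bounds are conservative: they hold whenever both sub-query intervals contain their ground truths, but they can be wider than necessary, which is one reason to choose per-sub-query error bounds carefully.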
Effectiveness results (%):

Datasets | COUNT: PCM-5% | COUNT: ERCM | AVG: PCM-5% | AVG: ERCM | SUM: PCM-5% | SUM: ERCM |
---|---|---|---|---|---|---|
DBpedia | 86.08 | 7.21 | 88.38 | 6.97 | 87.63 | 6.77 |
Freebase | 85.56 | 6.89 | 88.67 | 8.88 | 84.75 | 5.64 |
YAGO2 | 84.04 | 6.24 | 82.75 | 6.98 | 88.00 | 5.62 |
Efficiency results (s):

Datasets | COUNT: Collecting | COUNT: Training | COUNT: Total | AVG: Collecting | AVG: Training | AVG: Total | SUM: Collecting | SUM: Training | SUM: Total |
---|---|---|---|---|---|---|---|---|---|
DBpedia | 10.69 | 1.40 | 11.09 | 11.30 | 1.40 | 12.70 | 10.08 | 1.40 | 11.48 |
Freebase | 8.45 | 1.49 | 10.94 | 8.01 | 1.14 | 9.15 | 8.80 | 1.77 | 10.57 |
YAGO2 | 62.57 | 1.50 | 64.07 | 81.42 | 1.50 | 82.92 | 43.72 | 1.51 | 45.23 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ye, S.; Xu, X.; Wang, Y.; Fu, T. Efficient Complex Aggregate Queries with Accuracy Guarantee Based on Execution Cost Model over Knowledge Graphs. Mathematics 2023, 11, 3908. https://doi.org/10.3390/math11183908