Learned Query Optimization by Constraint-Based Query Plan Augmentation
Abstract
:1. Introduction
- (1)
- We recognize that the equivalence information among queries is critical in learning-based query optimization models. We propose Celo, a novel learned optimizer that directly integrates the information of equivalent queries into the learning-based model.
- (2)
- We propose identifying equivalent SQL statements using constant DCs. We also address how to use equivalent queries to augment the alternative execution plans of queries for learning-based query optimization.
- (3)
- We conducted extensive experiments on two real-world datasets. The experimental results demonstrate that Celo achieves significant improvements in cost and latency over traditional and learning-based query optimizers.
2. Related Work
3. Preliminaries of Learned Query Optimization
3.1. Query Plan Generation
3.2. The TCNN Model for Query Plan Cost Prediction
3.3. Problem Definition
4. Constraint-Enhanced Query Optimizer
4.1. The DBMS Architecture
- Parse. The query is passed to the parser, which analyzes the SQL statement based on database syntax rules. The database retrieves the relevant table definitions, validates the query, and converts it into a parse tree.
- Rewrite. The parse tree is then passed to the rewriter. The rewriter modifies and standardizes the query based on the rules and views defined in the database tables, generating a rewritten parse tree. This step supports view expansion, rule application, and other semantic-based rewrites.
- Optimization. The rewritten parse tree is sent to the query optimizer, where CELO plays a key role. At this stage, CELO intervenes by setting query hints in PostgreSQL. The traditional optimizer uses statistics collected from the database tables—such as table size, data distribution, and indexes—to generate an optimized query plan tree. CELO further enhances the plan by extending the query plan set using denial constraints (DCs) and predicting the best query plan through a TCNN model.
- Execution. The query plan generated by the optimizer is passed to the executor. The executor follows the query plan, accessing and manipulating the actual data in the database tables, and ultimately returning the results to the user.
4.2. The Celo Model Overview
- Plan expertise collection. This section encompasses two elements: query optimizer and executed plans. Our model assumes the existence of a sample workload consisting of a query representative of the user’s total workload. For each query in , the query optimizer generates multiple query plans and executes the optimal one using Bao. The query, its execution plan, and its latency are added to the experience module for model training as well as providing equivalent plans for user queries in Q.
- Equivalent query discovery. This section encompasses three elements: constant DCs, equivalent predicates, and equivalent queries. We design a constraint-based discovery method based on a set of denial constraints (DCs). Based on the constraints, we convert them into equivalent predicates, which include a pair of equivalent predicates {, }. By replacing the predicates in a query in Q with equivalent predicates, we obtain equivalent queries.
- Query plan augmentation. This section encompasses three elements: experience, equivalent plans, and augmented plan sets. For each equivalent query statement identified in the previous step, we search the corresponding query plan in the experience, which serves as the equivalent plans. Then, we added them to the set of query plans generated by the database query optimizer, thereby expanding it into the augmented plan set. Thus, the Celo model overcomes the drawback of limited candidate query plans by providing only a set of hints, offering more high-quality query plans than BAO. Then, the TCNN model can predict the optimal query plan among the augmented query plans, which may contain more potentially well-performed plans for execution.
- TCNN model training. This section encompasses two elements: training and the TCNN model. Once the query execution is complete, the combination of the query itself, the selected query plan, and the execution latency is added to Celo’s experience as training data to incrementally improve the TCNN model. Therefore, it is stipulated that after executing a certain number of queries, Celo will retrain the TCNN model. This allows the TCNN model to adapt to the current workload, making the retrained model more reliable in selecting the best set of hints for each query in Q. When retraining the TCNN model, a specific amount of training data is retrieved from the experience and used in the training process.
4.3. Equivalent Query Discovery
4.4. Query Plan Augmentation
4.5. TCNN Model Training
4.6. The Whole Procedure
Algorithm 1 The whole procedure of Celo. |
|
5. PostgreSQL Integration
6. Experiments
6.1. Experimental Setup
- Dataset.
- Training data generation.
- Baselines.
- Implementation details.
6.2. Experimental Results
6.2.1. Execution Time
6.2.2. Performance Relative to Bao
6.2.3. Absolute Performance Improvement
6.2.4. Latency Distribution Analysis
6.2.5. Performance Evaluation over Constraint-Independent Queries
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Lan, H.; Bao, Z.; Peng, Y. A Survey on Advancing the DBMS Query Optimizer: Cardinality Estimation, Cost Model, and Plan Enumeration. Data Sci. Eng. 2021, 6, 86–101. [Google Scholar] [CrossRef]
- Poosala, V.; Ioannidis, Y.E. Selectivity Estimation without the Attribute Value Independence Assumption. In Proceedings of the 23rd International Conference on Very Large Data Bases, Athens, Greece, 25–29 August 1997; pp. 486–495. [Google Scholar]
- Wu, Y.; Patel, J.M.; Jagadish, H.V. Structural Join Order Selection for XML Query Optimization. In Proceedings of the 19th International Conference on Data Engineering (Cat. No.03CH37405), Bangalore, India, 5–8 March 2003; pp. 443–454. [Google Scholar]
- Weiss, C.; Karras, P.; Bernstein, A. Hexastore: Sextuple indexing for semantic web data management. Proc. VLDB Endow. 2008, 1, 1008–1019. [Google Scholar] [CrossRef]
- Kipf, A.; Kipf, T.; Radke, B.; Leis, V.; Boncz, P.A.; Kemper, A. Learned Cardinalities: Estimating Correlated Joins with Deep Learning. In Proceedings of the 9th Biennial Conference on Innovative Data Systems Research (CIDR’19), Asilomar, CA, USA, 13–16 January 2019. [Google Scholar]
- Krishnan, S.; Yang, Z.; Goldberg, K.; Hellerstein, J.M.; Stoica, I. Learning to Optimize Join Queries With Deep Reinforcement Learning. CoRR 2018. abs/1808.03196. [Google Scholar]
- Sun, J.; Li, G. An End-to-End Learning-based Cost Estimator. Proc. VLDB Endow. 2019, 13, 307–319. [Google Scholar] [CrossRef]
- Marcus, R.C.; Negi, P.; Mao, H.; Zhang, C.; Alizadeh, M.; Kraska, T.; Papaemmanouil, O.; Tatbul, N. Neo: A Learned Query Optimizer. Proc. VLDB Endow. 2019, 12, 1705–1718. [Google Scholar] [CrossRef]
- Marcus, R.; Negi, P.; Mao, H.; Tatbul, N.; Alizadeh, M.; Kraska, T. Bao: Making Learned Query Optimization Practical. In Proceedings of the 2021 International Conference on Management of Data, Xi’an, China, 20–25 June 2021; pp. 1275–1288. [Google Scholar]
- Zhou, X.; Li, G.; Chai, C.; Feng, J. A learned query rewrite system using monte carlo tree search. Proc. VLDB Endow. 2021, 15, 46–58. [Google Scholar] [CrossRef]
- Chu, X.; Ilyas, I.F.; Papotti, P. Discovering Denial Constraints. Proc. VLDB Endow. 2013, 6, 1498–1509. [Google Scholar] [CrossRef]
- Zhou, X.; Chai, C.; Li, G.; Sun, J. Database Meets Artificial Intelligence: A Survey. IEEE Trans. Knowl. Data Eng. 2022, 34, 1096–1116. [Google Scholar] [CrossRef]
- Dutt, A.; Wang, C.; Nazi, A.; Kandula, S.; Narasayya, V.R.; Chaudhuri, S. Selectivity Estimation for Range Predicates using Lightweight Models. Proc. VLDB Endow. 2019, 12, 1044–1057. [Google Scholar] [CrossRef]
- Hasan, S.; Thirumuruganathan, S.; Augustine, J.; Koudas, N.; Das, G. Multi-Attribute Selectivity Estimation Using Deep Learning. CoRR 2019. abs/1903.09999. [Google Scholar]
- Heimel, M.; Kiefer, M.; Markl, V. Self-Tuning, GPU-Accelerated Kernel Density Models for Multidimensional Selectivity Estimation. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Australia, 31 May–4 June 2015; pp. 1477–1492. [Google Scholar]
- Heule, S.; Nunkesser, M.; Hall, A. Hyperloglog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm. In Proceedings of the 16th International Conference on Extending Database Technology, Genoa, Italy, 18–22 March 2013; pp. 683–692. [Google Scholar]
- Ioannidis, Y.E. The History of Histograms (abridged). In Proceedings of the 29th International Conference on Very Large Data Bases, Berlin, Germany, 9–12 September 2003; pp. 19–30. [Google Scholar]
- Leis, V.; Radke, B.; Gubichev, A.; Kemper, A.; Neumann, T. Cardinality Estimation Done Right: Index-Based Join Sampling. In Proceedings of the 8th Biennial Conference on Innovative Data Systems Research (CIDR’17), Chaminade, CA, USA, 8–11 January 2017. [Google Scholar]
- Lu, X.; Guan, J. A new approach to building histogram for selectivity estimation in query processing optimization. Comput. Math. Appl. 2009, 57, 1037–1047. [Google Scholar] [CrossRef]
- Müller, M.; Moerkotte, G.; Kolb, O. Improved Selectivity Estimation by Combining Knowledge from Sampling and Synopses. Proc. VLDB Endow. 2018, 11, 1016–1028. [Google Scholar] [CrossRef]
- Ortiz, J.; Balazinska, M.; Gehrke, J.; Keerthi, S.S. An Empirical Analysis of Deep Learning for Cardinality Estimation. CoRR 2019. abs/1905.06425. [Google Scholar]
- Wu, W.; Naughton, J.F.; Singh, H. Sampling-Based Query Re-Optimization. In Proceedings of the 2016 International Conference on Management of Data, San Francisco, CA, USA, 26 June–1 July 2016; pp. 1721–1736. [Google Scholar]
- Yang, Z.; Kamsetty, A.; Luan, S.; Liang, E.; Duan, Y.; Chen, X.; Stoica, I. NeuroCard: One Cardinality Estimator for All Tables. Proc. VLDB Endow. 2020, 14, 61–73. [Google Scholar] [CrossRef]
- Yang, Z.; Liang, E.; Kamsetty, A.; Wu, C.; Duan, Y.; Chen, X.; Abbeel, P.; Hellerstein, J.M.; Krishnan, S.; Stoica, I. Selectivity Estimation with Deep Likelihood Models. CoRR 2019. abs/1905.04278. [Google Scholar]
- Bennett, K.P.; Ferris, M.C.; Ioannidis, Y.E. A Genetic Algorithm for Database Query Optimization. In Proceedings of the 4th International Conference on Genetic Algorithms, San Diego, CA, USA, 13–16 July1991; pp. 400–407. [Google Scholar]
- Fegaras, L. A New Heuristic for Optimizing Large Queries. In Proceedings of the 9th International Conference, DEXA’98, Vienna, Austria, 24–28 August 1998; Volume 1460, pp. 726–735. [Google Scholar]
- Marcus, R.; Papaemmanouil, O. Deep Reinforcement Learning for Join Order Enumeration. In Proceedings of the First International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, Houston, TX, USA, 10 June 2018; pp. 3:1–3:4. [Google Scholar]
- Stillger, M.; Lohman, G.M.; Markl, V.; Kandil, M. LEO—DB2’s LEarning Optimizer. In Proceedings of the 27th International Conference on Very Large Data Bases, Roma, Italy, 11–14 September 2001; pp. 19–28. [Google Scholar]
- Waas, F.; Pellenkoft, A. Join Order Selection—Good Enough Is Easy. In Proceedings of the 17th British National Conference on Databases, BNCOD 17, Exeter, UK, 3–5 July 2000; Volume 1832, pp. 51–67. [Google Scholar]
- Yu, X.; Li, G.; Chai, C.; Tang, N. Reinforcement Learning with Tree-LSTM for Join Order Selection. In Proceedings of the 2020 IEEE 36th International Conference on Data Engineering (ICDE), Dallas, TX, USA, 20–24 April 2020; pp. 1297–1308. [Google Scholar]
- Zhu, R.; Wu, Z.; Chai, C.; Pfadler, A.; Ding, B.; Li, G.; Zhou, J. Learned Query Optimizer: At the Forefront of AI-Driven Databases. In Proceedings of the 25th International Conference on Extending Database Technology (EDBT), Edinburgh, UK, 29 March–1 April 2022; pp. 1–4. [Google Scholar]
- Ilyas, I.F.; Markl, V.; Haas, P.J.; Brown, P.; Aboulnaga, A. CORDS: Automatic Discovery of Correlations and Soft Functional Dependencies. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France, 13–18 June 2004; pp. 647–658. [Google Scholar]
- Pena, E.H.M.; Falk, E.; Meira, J.A.; de Almeida, E.C. Mind Your Dependencies for Semantic Query Optimization. J. Inf. Data Manag. 2018, 9, 3–19. [Google Scholar]
- Kossmann, J.; Naumann, F.; Lindner, D.; Papenbrock, T. Workload-driven, Lazy Discovery of Data Dependencies for Query Optimization. In Proceedings of the 12th Annual Conference on Innovative Data Systems Research (CIDR’22), Chaminade, CA, USA, 9–12 January 2022. [Google Scholar]
- Mou, L.; Li, G.; Zhang, L.; Wang, T.; Jin, Z. Convolutional Neural Networks over Tree Structures for Programming Language Processing. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 1287–1293. [Google Scholar]
- Xu, L.; Choy, C.; Li, Y. Deep sparse rectifier neural networks for speech denoising. In Proceedings of the 2016 IEEE International Workshop on Acoustic Signal Enhancement (IWAENC), Xi’an, China, 13–16 September 2016; pp. 1–5. [Google Scholar]
- Thompson, W.R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 1933, 25, 285–294. [Google Scholar] [CrossRef]
- Kossmann, J.; Papenbrock, T.; Naumann, F. Data dependencies for query optimization: A survey. VLDB J. 2022, 31, 1–22. [Google Scholar] [CrossRef]
- Mahdavi, M.; Abedjan, Z. Baran: Effective Error Correction via a Unified Context Representation and Transfer Learning. Proc. VLDB Endow. 2020, 13, 1948–1961. [Google Scholar] [CrossRef]
- Rekatsinas, T.; Chu, X.; Ilyas, I.F.; Ré, C. HoloClean: Holistic Data Repairs with Probabilistic Inference. Proc. VLDB Endow. 2017, 10, 1190–1201. [Google Scholar] [CrossRef]
- Szlichta, J.; Godfrey, P.; Gryz, J. Fundamentals of Order Dependencies. Proc. VLDB Endow. 2012, 5, 1220–1231. [Google Scholar] [CrossRef]
- Chen, X.; Wang, Z.; Liu, S.; Li, Y.; Zeng, K.; Ding, B.; Zhou, J.; Su, H.; Zheng, K. Base: Bridging the gap between cost and latency for query optimization. Proc. VLDB Endow. 2023, 16, 1958–1966. [Google Scholar] [CrossRef]
- Chen, T.; Gao, J.; Chen, H.; Tu, Y. LOGER: A learned optimizer towards generating efficient and robust query execution plans. Proc. VLDB Endow. 2023, 16, 1777–1789. [Google Scholar] [CrossRef]
- Chen, X.; Chen, H.; Liang, Z.; Liu, S.; Wang, J.; Zeng, K.; Su, H.; Zheng, K. Leon: A new framework for ml-aided query optimization. Proc. VLDB Endow. 2023, 16, 2261–2273. [Google Scholar] [CrossRef]
- Xu, X.; Zhao, Z.; Zhang, T.; Kang, R.; Sun, L.; Chen, J. COOOL: A Learning-To-Rank Approach for SQL Hint Recommendations. arXiv 2023, arXiv:2304.04407. [Google Scholar]
- Woltmann, L.; Thiessat, J.; Hartmann, C.; Habich, D.; Lehner, W. Fastgres: Making learned query optimizer hinting effective. Proc. VLDB Endow. 2023, 16, 3310–3322. [Google Scholar] [CrossRef]
Layer | Description |
---|---|
Input Layer | . |
First TreeConv Layer | , where is the weight matrix of the first convolution layer. |
Second TreeConv Layer | , where is the weight matrix of the second convolution layer. |
Third TreeConv Layer | , where is the weight matrix of the third convolution layer. |
Dynamic Pooling | . |
First Linear Layer | , where is the weight matrix, and is the bias term. |
ReLU Layer | , where is the ReLU activation function. |
Second Linear Layer (Output Layer) | , where is the weight matrix, is the bias term. |
Query | Bao | Celo |
---|---|---|
: SELECT MAX(cast_info.note),MIN(title.title) FROM movie_link,title,kind_type,cast_info,movie_info WHERE movie_link.movie_id=title.id AND title.kind_id=kind_type.id AND cast_info.movie_id=title.id AND movie_info.movie_id=title.id AND title.title NOT LIKE ‘%(#1.33)’ AND cast_info.note NOT LIKE ‘%nan%’ AND movie_info.id > 5,486,631 and movie_info.id < 7,020,540 and movie_info.info_type_id = 3; | 0.39 | 0.37 |
: SELECT MAX(cast_info.note),MIN(title.title) FROM movie_link,title,kind_type,cast_info,movie_info WHERE movie_link.movie_id=title.id AND title.kind_id=kind_type.id AND cast_info.movie_id=title.id AND movie_info.movie_id=title.id AND title.title NOT LIKE ‘%(#1.33)’ AND cast_info.note NOT LIKE ‘%nan%’ AND movie_info.id > 5,486,631 and movie_info.id < 7,020,540; | 84.81 | 0.43 |
: SELECT MAX(movie_info.info),MAX(name.name_pcode_nf) FROM cast_info,name,movie_info,info_type,person_info WHERE cast_info.person_id=name.id AND movie_info.info_type_id=info_type.id AND person_info.person_id=name.id AND person_info.info_type_id=info_type.id AND movie_info.info NOT LIKE ‘th H%’ AND name.name_pcode_nf NOT LIKE ‘%P1453%’ AND name.surname_pcode NOT LIKE ‘%S35’ AND name.name_pcode_cf LIKE ‘A6265%’ AND name.id > 1,739,579 and name.id < 2,701,135 and name.gender = ‘f’; | 51.71 | 50.38 |
: SELECT MAX(movie_info.info),MAX(name.name_pcode_nf) FROM cast_info,name,movie_info,info_type,person_info WHERE cast_info.person_id=name.id AND movie_info.info_type_id=info_type.id AND person_info.person_id=name.id AND person_info.info_type_id=info_type.id AND movie_info.info NOT LIKE ‘th H%’ AND name.name_pcode_nf NOT LIKE ‘%P1453%’ AND name.surname_pcode NOT LIKE ‘%S35’ AND name.name_pcode_cf LIKE ‘A6265%’ AND name.id > 1,739,579 and name.id < 2,701,135; | 141.45 | 50.48 |
Table | DCs | ||
---|---|---|---|
n | ¬( < 1,739,580 ∧ ≠ ‘m’) | n.id < 1,739,580 AND n.gender = ‘m’ | n.id < 1,739,580 |
n | ¬( > 1,739,579 ∧ < 2,701,135 ∧ ≠ ‘f’) | n.id > 1,739,579 AND n.id < 2,701,135 AND n.gender = ‘f’ | n.id > 1,739,579 AND n.id < 2,701,135 |
¬( > 11,658,936 ∧ < 14,695,656 ∧ ≠ 16) | .id > 11,658,936 AND .id < 14,695,656 AND .info_type_id = 16 | .id > 11,658,936 AND .id < 14,695,656 | |
¬( > 5,486,631 ∧ < 7,020,540 ∧ ≠ 3) | .id > 5,486,631 AND .id < 7,020,540 AND .info_type_id = 3 | .id > 5,486,631 AND .id < 7,020,540 | |
¬( < 12,670,689 ∧ ≠ 1) | .id < 12,670,689 AND .role_id = 1 | .id < 12,670,689 | |
¬( > 12,670,688 ∧ < 20,122,662 ∧ ≠ 2) | .id > 12,670,688 AND .id < 20,122,662 AND .role_id = 2 | .id > 12,670,688 AND .id < 20,122,662 | |
¬( < 1,274,247 ∧ ≠ 1) | .id < 1,274,247 AND .company_type_id = 1 | .id < 1,274,247 | |
¬( > 1,274,246 ∧ < 2,609,130 ∧ ≠ 2) | .id > 1,274,246 AND .id < 2,609,130 AND .company_type_id = 2 | .id > 1,274,246 AND .id < 2,609,130 |
Table | DCs | ||
---|---|---|---|
hospital | ¬( = ∧ ≠ ) | GROUP BY hospital.city, hospital.zip | GROUP BY hospital.city |
hospital | ¬( = ∧ ≠ ) | GROUP BY hospital.city, hospital.country | GROUP BY hospital.city |
hospital | ¬( = ∧ ≠ ) | GROUP BY hospital.city, hospital.zip | GROUP BY hospital.zip |
hospital | ¬( = ∧ ≠ ) | GROUP BY hospital.zip, hospital.state | GROUP BY hospital.zip |
hospital | ¬( = ∧ ≠ ) | GROUP BY hospital.zip, hospital.country | GROUP BY hospital.zip |
hospital | ¬( = ∧ ≠ ) | GROUP BY hospital.country, hospital.state | GROUP BY hospital.country |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Ye, C.; Duan, H.; Zhang, H.; Wu, Y.; Dai, G. Learned Query Optimization by Constraint-Based Query Plan Augmentation. Mathematics 2024, 12, 3102. https://doi.org/10.3390/math12193102
Ye C, Duan H, Zhang H, Wu Y, Dai G. Learned Query Optimization by Constraint-Based Query Plan Augmentation. Mathematics. 2024; 12(19):3102. https://doi.org/10.3390/math12193102
Chicago/Turabian StyleYe, Chen, Haoyang Duan, Hua Zhang, Yifan Wu, and Guojun Dai. 2024. "Learned Query Optimization by Constraint-Based Query Plan Augmentation" Mathematics 12, no. 19: 3102. https://doi.org/10.3390/math12193102
APA StyleYe, C., Duan, H., Zhang, H., Wu, Y., & Dai, G. (2024). Learned Query Optimization by Constraint-Based Query Plan Augmentation. Mathematics, 12(19), 3102. https://doi.org/10.3390/math12193102