5.1. Experimental Settings
Subject databases, versions, and benchmarks. We surveyed the subject systems of prior research on configuration tuning for database systems [2,6,7,8,9,10,14,16,28,29,31,39,40,41,42,43,44] and chose five widely used database systems to evaluate the PCK approach: MySQL (https://www.mysql.com/ (accessed on 25 September 2020)), PostgreSQL (https://www.postgresql.org/ (accessed on 26 September 2021)), SQLite (https://www.sqlite.org/ (accessed on 26 September 2021)), Redis (https://redis.io/ (accessed on 11 October 2020)), and Cassandra (http://cassandra.apache.org/ (accessed on 13 October 2020)). MySQL is an open-source relational database management system (RDBMS). PostgreSQL is an open-source object-relational database management system (ORDBMS). SQLite is an open-source embedded RDBMS. Redis is an open-source in-memory data structure store. Cassandra is an open-source column-oriented NoSQL database management system. In this experiment, we choose two representative versions each for PostgreSQL, SQLite, and Cassandra, and three each for MySQL and Redis. The time gaps between these releases are about five years for PostgreSQL, nearly half a year for SQLite, about five and a half years for Cassandra, and approximately three years each for MySQL and Redis. In addition, we use sysbench (https://github.com/akopytov/sysbench (accessed on 25 September 2020)) for MySQL, pgbench (https://www.postgresql.org/docs/11/pgbench.html (accessed on 26 September 2021)) for PostgreSQL, a customized workload for SQLite, YCSB [45] for Cassandra, and Redis-Bench (https://redis.io/topics/benchmarks (accessed on 11 October 2020)) for Redis.
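Concretely, one measurement amounts to applying a configuration, running the benchmark, and extracting a throughput figure from its report. The sketch below parses the transactions-per-second number from sysbench-style summary text; the sample string and the regular expression are assumptions modeled on sysbench 1.x output and may need adjusting for other versions or benchmarks.

```python
import re

def parse_sysbench_tps(output: str) -> float:
    """Extract the transactions-per-second figure from sysbench text output.

    Assumes a summary line shaped like 'transactions: 25050 (417.45 per sec.)',
    as printed by sysbench 1.x; adjust the pattern for other versions.
    """
    match = re.search(r"transactions:\s+\d+\s+\(([\d.]+) per sec\.\)", output)
    if match is None:
        raise ValueError("no transactions-per-second line found")
    return float(match.group(1))

# Illustrative sysbench-style summary (abbreviated, not a real measurement).
sample_output = """
SQL statistics:
    transactions:                        25050  (417.45 per sec.)
    queries:                             400800 (6679.20 per sec.)
"""

print(parse_sysbench_tps(sample_output))  # 417.45
```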
Parameters. For each database system, we use domain expertise to identify a subset of parameters considered critical to performance, as in [2,9,10]. Reducing the number of considered parameters shrinks the search space exponentially, and numerous existing approaches [9,12] also adopt this manual feature selection strategy. Note that even with only these parameters, the search space is still enormous, and exhaustive search is infeasible.
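A reduced parameter space of this kind can be sampled at random to generate candidate configurations. In the sketch below, the knob names are real MySQL system variables, but the value ranges and the choice of subset are placeholder assumptions, not the ones selected in the paper:

```python
import random

# Illustrative reduced parameter space; the ranges are assumptions.
PARAM_SPACE = {
    "innodb_buffer_pool_size": [2**p * 1024 * 1024 for p in range(7, 13)],  # 128 MB .. 4 GB
    "innodb_log_file_size": [2**p * 1024 * 1024 for p in range(5, 11)],     # 32 MB .. 1 GB
    "max_connections": range(100, 1001, 100),
    "table_open_cache": range(400, 4001, 400),
}

def sample_configuration(space, rng=random):
    """Draw one configuration uniformly at random from the reduced space."""
    return {name: rng.choice(list(values)) for name, values in space.items()}

random.seed(0)
config = sample_configuration(PARAM_SPACE)
print(config)
```

Even this four-knob toy space has 6 × 6 × 10 × 10 = 3600 points; with the parameter counts in Table 1 the space grows multiplicatively, which is why exhaustive search is infeasible.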
Table 1 summarizes the database systems and versions, along with the benchmarks, the numbers of selected parameters, and the performance metrics.
Running environment. To avoid interference when collecting samples from different subject database systems, we conduct experiments on separate servers and computers. In addition, we ensure a consistent running environment across different versions of the same subject system. The running environments for the subject database systems are listed as follows.
MySQL, PostgreSQL: The physical server is equipped with two 2-core Intel(R) Core(TM) i5-4590 CPU @ 3.30 GHz processors, 4 GB RAM, and a 64 GB disk, running CentOS 6.5 and Java 1.8.0.
SQLite: The computer is equipped with an Intel(R) Core(TM) i5-4460 CPU @ 3.20 GHz processor, 8 GB RAM, and a 1 TB disk, running Windows 10 and Java 1.8.0_291.
Redis: The cloud server is equipped with two 2-core Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50 GHz processors, 4 GB RAM, and a 53.7 GB disk, running CentOS 7.6 and Java 1.8.0_261.
Cassandra: The physical server is equipped with two 4-core Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00 GHz processors, 32 GB RAM, and an 86 GB disk, running CentOS 6.5 and Java 1.8.0_211.
Baseline Algorithms. To evaluate the performance of the PCK approach, we compare it with six state-of-the-art algorithms: DeepPerf [10], CART [3], Finetune [46], DDC [47], Model-shift [33], and Ottertune [2]. We provide a brief description of each algorithm as follows.
DeepPerf and CART establish performance prediction models in the target domain directly. They treat performance prediction as a nonlinear regression problem and apply different machine learning methods, namely a Deep Neural Network (DNN) and the Classification and Regression Trees (CART) technique, to learn this nonlinear model.
Finetune and DDC are two widely used transfer learning schemes. Finetune is a network-based transfer learning method: it freezes part of the network pre-trained in the source domain and transfers it as part of the DNN used in the target domain. DDC is a mapping-based transfer learning method that maps instances from the source and target domains into a new data space.
The Model-shift approach shifts a model learned in the source domain to predict system performance in the target domain using linear regression models; CART is applied to build the performance prediction models in the source domain.
Ottertune is a transfer learning approach that exploits source samples to learn a performance model in the target domain. A Gaussian Process (GP) model is used to predict unobserved response values.
5.2. Evaluation of Prediction Accuracy
Data collection. We use a random sampling strategy to generate a set of configurations for each database and test them on the database system under the given workloads on different versions to obtain performance measurements. A configuration–performance pair is regarded as a sample. The numbers of samples for MySQL, PostgreSQL, SQLite, Redis, and Cassandra are 294, 300, 280, 323, and 1129, respectively. The collection of all source samples serves as auxiliary training data. In the target domain, a subset of these samples is selected randomly as the training dataset; the remaining samples serve as the testing dataset.
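The random target-domain split can be sketched as follows; the sample contents and training-set size here are illustrative stand-ins for the measured configuration–performance pairs:

```python
import random

def holdout_split(samples, n_train, seed=42):
    """Randomly pick n_train target samples for training; the rest for testing."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    train = [samples[i] for i in indices[:n_train]]
    test = [samples[i] for i in indices[n_train:]]
    return train, test

# Toy stand-in for the 294 MySQL configuration-performance pairs.
target_samples = [(f"config_{i}", float(i)) for i in range(294)]
train_set, test_set = holdout_split(target_samples, n_train=30)
print(len(train_set), len(test_set))  # 30 264
```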
Evaluation metric. Holdout validation is employed to compare the prediction accuracy of the different methods. We use the training dataset and the auxiliary training data to generate a performance model for each method, and then use this model to predict the performance values of the configurations in the testing dataset. We select the Mean Relative Error (MRE) as the metric for evaluating prediction accuracy, which is computed as follows:

$$\mathrm{MRE} = \frac{1}{N} \sum_{i=1}^{N} \frac{|y_i - \hat{y}_i|}{y_i} \times 100\%$$

where $N$ is the total number of configurations in the testing dataset, and $y_i$ and $\hat{y}_i$ represent the actual performance value and the predicted performance value, respectively. We choose this metric because it is widely used to measure the accuracy of prediction models [5,10,33,34].
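A minimal implementation of the MRE metric (relative errors averaged over the testing dataset and expressed as a percentage):

```python
def mean_relative_error(actual, predicted):
    """MRE = (1/N) * sum(|y_i - yhat_i| / y_i) * 100%."""
    assert len(actual) == len(predicted) and len(actual) > 0
    n = len(actual)
    return sum(abs(y - yh) / y for y, yh in zip(actual, predicted)) / n * 100

# Two test configurations, each mispredicted by 10% of its actual value.
print(mean_relative_error([100.0, 200.0], [110.0, 180.0]))  # 10.0
```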
Performance results. We run the PCK method and the six baseline algorithms on the five subject database systems independently. The size of the training dataset in the target domain is set to n × c, where c is the number of selected configuration parameters for each subject database system (shown in the column "# of selected parameters" of Table 1), and n ranges from 1 to 15. To evaluate the consistency and stability of the approaches, for each sample size of each subject database system, we repeat the random sampling, training, and testing process 5 times. We then report and compare the mean of the MREs obtained with the 7 approaches for each sample size. The experimental results of the five subject database systems for the larger version changes are listed in Table 2, Table 3, Table 4, Table 5, and Table 6, respectively.
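This evaluation protocol (training-set sizes n × c, five repeated random splits per size, mean MRE per size) can be sketched as below; the mean-value predictor is a trivial stand-in for the actual models being compared:

```python
import random
import statistics

def mre(actual, predicted):
    """Mean relative error, as a percentage."""
    return sum(abs(y - p) / y for y, p in zip(actual, predicted)) / len(actual) * 100

def evaluate(samples, c, n_values, repeats=5, seed=0):
    """Mean MRE per training size n*c, averaged over `repeats` random splits."""
    rng = random.Random(seed)
    results = {}
    for n in n_values:
        size = n * c
        scores = []
        for _ in range(repeats):
            shuffled = samples[:]
            rng.shuffle(shuffled)
            train, test = shuffled[:size], shuffled[size:]
            mean_perf = statistics.mean(y for _, y in train)  # trivial predictor
            scores.append(mre([y for _, y in test], [mean_perf] * len(test)))
        results[size] = statistics.mean(scores)
    return results

# Toy samples: (configuration id, measured throughput).
samples = [(i, 100.0 + (i % 7)) for i in range(294)]
print(evaluate(samples, c=10, n_values=range(1, 4)))
```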
As expected, our proposed method achieves better performance than the other six algorithms. Specifically, across the five subject database systems, PCK outperforms all six baselines: 24.99–53.18% improvement over DeepPerf, 14.20–51.19% over CART, 4.42–46.58% over Finetune, 25.26–53.13% over DDC, 9.34–50.03% over Model-shift, and 46.80–69.42% over Ottertune. The improvement percentages are shown in Figure 4. For simplicity, the data in Figure 4 are obtained by averaging the MRE data over the different sample sizes for each subject database system.
DeepPerf tackles the performance prediction problem directly by training a performance model with a DNN, and Finetune and DDC are two DNN-based transfer learning approaches. The experimental results above show that the MRE results of PCK are better than those of these three DNN-based performance modeling methods. This is because training a high-performance DNN often requires a large amount of training data, whereas we can only provide a small sample due to the high cost of performance measurement.
From another aspect, we find that the two DNN-based transfer learning methods in most cases achieve higher prediction accuracy than the simple DNN-based performance prediction model across the five database systems. Similarly, the Model-shift approach transfers the CART-based performance model from the source to the target using linear regression models, and it also achieves a better MRE than CART. This observation verifies the effectiveness of transfer learning for the performance prediction task.
The validity of the above-mentioned transfer learning methods shows that a correlation exists between the source and target. Our proposed PCK method takes advantage of this correlation to build a high-performance prediction model in the target domain. In the experimental results, PCK achieves higher prediction accuracy than the other four transfer learning methods under almost all sample sizes.
From another point of view, PCK requires a much smaller number of target samples to reach the same level of prediction accuracy as the baseline approaches. This means that PCK outperforms the six baseline algorithms not only in prediction accuracy but also in measurement effort.
To further verify our conclusion, we conduct similar experiments across different version changes. The MRE comparisons among the different approaches for MySQL and Redis under different version change scenarios (versions 5.5–5.7 and 5.7–8.0 for MySQL, and versions 4.0.1–5.0.0 and 5.0.0–6.0.5 for Redis) are listed in Table 7, Table 8, Table 9, and Table 10, respectively. The experimental results confirm that PCK outperforms the state-of-the-art baselines in terms of prediction accuracy and measurement cost in almost all experimental settings in this paper.
5.3. Trade-Off on Choosing K
In this part, we discuss the trade-off in choosing the number of clusters (K) and illustrate the influence of different K values on prediction accuracy. To explore this question, we systematically vary the value of K from 1 to 10 in each subject system and measure the prediction accuracy in each case. Taking MySQL (version 5.5–8.0) as an example, the experimental results are shown in Table 11.
Our results in Table 11 indicate that PCK achieves the highest prediction accuracy for almost all sample sizes when K equals 3 in MySQL. The MRE of PCK decreases as K increases appropriately, which demonstrates that clustering different smooth regions can help boost the performance of the prediction model. However, the prediction accuracy does not continue to improve once K exceeds a certain threshold, such as 3 in the above example. This is due to insufficient target samples: if a large K is chosen in PCK, the target samples available in each region are too few to learn a reliable performance model. Consequently, the choice of K is key to whether the PCK method can achieve high prediction accuracy.
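This trade-off can be reproduced in miniature. The sketch below is a deliberately simplified stand-in for PCK's clustering: it partitions one-dimensional samples into K equal-width bins and predicts each test point with the per-bin training mean. On toy data with two smooth regions, K = 2 matches the structure, K = 1 underfits, and K = 5 splits regions across bin boundaries and degrades accuracy:

```python
import statistics

def mre(actual, predicted):
    """Mean relative error, as a percentage."""
    return sum(abs(y - p) / y for y, p in zip(actual, predicted)) / len(actual) * 100

def kbin_predict(train, test, k):
    """Partition the single feature x into k equal-width bins, then predict
    each test point with the mean performance of its bin's training samples."""
    xs = [x for x, _ in train]
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / k or 1.0
    bins = {}
    for x, y in train:
        b = min(int((x - lo) / width), k - 1)
        bins.setdefault(b, []).append(y)
    overall = statistics.mean(y for _, y in train)  # fallback for empty bins
    preds = []
    for x, _ in test:
        b = min(max(int((x - lo) / width), 0), k - 1)
        preds.append(statistics.mean(bins[b]) if b in bins else overall)
    return preds

# Toy piecewise data: performance jumps at x = 50, mimicking two smooth regions.
data = [(x, 100.0 if x < 50 else 200.0) for x in range(100)]
train, test = data[::2], data[1::2]
for k in (1, 2, 5):
    print(k, round(mre([y for _, y in test], kbin_predict(train, test, k)), 2))
```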
To further verify the necessity of clustering in PCK, we compare the MRE in the different clusters (K = 3) with the MRE without clustering (K = 1) in MySQL; the result is shown in Figure 5. We observe that the MRE without clustering is higher than the MRE in each of the three clusters for almost all sample sizes, and the optimal prediction accuracy is achieved in this case (K = 3). In addition, PCK achieves its best prediction performance at system-specific K values in Redis (version 4.0.1–6.0.5), PostgreSQL (version 9.3–11.0), Cassandra (version 2.1.0–3.11.6), and SQLite (version 3.31.1–3.36.0). The comparatively small K for Cassandra is explained by its larger configuration space, in which insufficient measurements may lead to inaccurate partitioning.