Ensemble Learning-Based Reactive Power Optimization for Distribution Networks

Reactive power optimization of distribution networks is of great significance for improving power quality and reducing power loss. However, traditional methods for reactive power optimization of distribution networks either consume a lot of calculation time or have limited accuracy. In this paper, a novel data-driven approach based on ensemble learning is proposed to simultaneously improve the accuracy and reduce the calculation time of reactive power optimization. Specifically, k-fold cross-validation is used to train multiple sub-models, which are merged to obtain high-quality optimization results through the proposed ensemble framework. The simulation results show that the proposed approach outperforms popular baselines, such as the light gradient boosting machine, convolutional neural network, case-based reasoning, and multi-layer perceptron. Moreover, its calculation time is much lower than that of traditional heuristic methods, such as the genetic algorithm.


Introduction
Reactive power optimization is one of the most widely used means of reducing power loss and improving power quality by regulating the state of equipment such as shunt capacitor banks, on-load tap changers (OLTC), and static var compensators (SVC). As a crucial component of the planning and scheduling of distribution networks, reactive power optimization is of great importance for both practical engineering and theoretical study [1].
Traditional methods of reactive power optimization fall into two categories: heuristic algorithms [2] and mathematical programming algorithms [3]. Specifically, mathematical programming algorithms mainly consist of dynamic programming, linear programming, and non-linear programming. Although these mathematical programming algorithms have low complexity and fast computational speed, they have difficulty dealing with non-linear and high-dimensional reactive power optimization problems, which results in limited optimization accuracy. The popular heuristic algorithms mainly include particle swarm optimization (PSO), simulated annealing (SA), and the genetic algorithm (GA). Although these heuristic algorithms significantly outperform mathematical programming algorithms in terms of optimization accuracy, they involve heavy computational burdens, especially for large-scale distribution networks [4]. Therefore, it is necessary to develop a new method with fast computational speed and high accuracy.
Driven by the development of smart meters, sensors, and communication technologies, the historical data stored in supervisory control and data acquisition systems show explosive growth, which brings opportunities for the application of data-driven technology in reactive power optimization. The existing data-driven algorithms for reactive power optimization fall into two categories: similarity-based algorithms [5] and model-based algorithms [6]. The main contributions of this paper are as follows: (1) A fully data-driven and scalable method is proposed for reactive power optimization of distribution networks without solving complex physical models. Additionally, the proposed approach can be applied to different distribution networks by simply fine-tuning the structures and parameters. (2) Each sub-model has its own advantages and disadvantages, while the proposed approach can draw on their complementary strengths to improve the optimization accuracy. To improve the generalization of the ensemble model, k-fold cross-validation is employed to train it. (3) Numerical experiments on a real-world dataset are performed to validate the effectiveness of the ensemble framework for reactive power optimization of distribution networks. The simulation results show that the proposed approach achieves state-of-the-art accuracy. Further, its calculation time is much lower than that of traditional heuristic methods, such as GA.
The rest of this paper is organized as follows: Section 2 formulates the reactive power optimization model. Section 3 describes the application of ensemble learning in reactive power optimization. Simulations and results are discussed in Section 4. Section 5 summarizes the conclusions.

Reactive Power Optimization Model
Normally, the goal of reactive power optimization is to reduce power loss and improve the power quality of distribution networks [11]. Without loss of generality, the changes of power loss and voltage offset are defined as a comprehensive objective function of reactive power optimization in this paper:

max F_1 = W (P_loss^0 − P_loss) / P_loss^0 + (1 − W) (dU^0 − dU) / dU^0

with

P_loss = Σ_{l=1}^{N} R_l (P_l^2 + Q_l^2) / U_l^2,  dU = Σ_{i=1}^{n} |U_i − U_0|

where W is the weight (i.e., W is 0.5 in this paper), which is used to balance the power loss and voltage offset; P_loss^0 is the power loss before reactive power optimization; P_loss is the power loss after reactive power optimization; dU^0 is the voltage offset before reactive power optimization; dU is the voltage offset after reactive power optimization; n is the number of nodes in distribution networks; N is the number of branches in distribution networks; U_0 is the rated voltage; U_i is the voltage of node i; R_l is the resistance of branch l; P_l is the active power of the terminal node of branch l; Q_l is the reactive power of the terminal node of branch l; and U_l is the voltage of the terminal node of branch l. Additionally, the reactive power optimization model of distribution networks has to meet the following constraints:

(1) Power flow constraints in distribution networks:

P_i = U_i Σ_{j=1}^{n} U_j (G_ij cos δ_ij + B_ij sin δ_ij)
Q_i = U_i Σ_{j=1}^{n} U_j (G_ij sin δ_ij − B_ij cos δ_ij)

where δ_ij is the phase difference of the voltage between node i and node j, G_ij is the conductance between node i and node j, and B_ij is the susceptance between node i and node j.
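For concreteness, the two components of the objective can be sketched in Python. The function names are illustrative, and the absolute-value form of the voltage offset is an assumption; the paper's exact offset definition may differ.

```python
import numpy as np

def power_loss(R, P, Q, U):
    # P_loss = sum_l R_l (P_l^2 + Q_l^2) / U_l^2, over all branches l
    R, P, Q, U = map(np.asarray, (R, P, Q, U))
    return float(np.sum(R * (P**2 + Q**2) / U**2))

def voltage_offset(U_nodes, U0):
    # dU = sum_i |U_i - U_0|, expressed here relative to the rated
    # voltage U0 (the relative form is an assumption for illustration)
    return float(np.sum(np.abs(np.asarray(U_nodes) - U0)) / U0)
```

With these two quantities evaluated before and after dispatch, F_1 follows directly from the weighted-difference formula above.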
(2) Current and voltage constraints in distribution networks:

U_{i,min} ≤ U_i ≤ U_{i,max},  i = 1, 2, …, n
I_l ≤ I_{l,max},  l = 1, 2, …, N

where U_{i,max} is the upper bound of voltage for node i, U_{i,min} is the lower bound of voltage for node i, and I_{l,max} is the upper bound of current for branch l.
(3) Equipment constraints in distribution networks:

0 ≤ Q_{C,i} ≤ Q_{C,max},  i = 1, 2, …, n_C
T_{i,min} ≤ T_i ≤ T_{i,max},  i = 1, 2, …, n_T
0 ≤ Q_{SVC,i} ≤ Q_{SVC,max},  i = 1, 2, …, n_SVC

where n_C is the number of nodes with shunt capacitor banks, n_T is the number of nodes with OLTC, n_SVC is the number of nodes with SVC, Q_{C,max} is the maximum reactive power generated by the shunt capacitor bank, T_{i,min} is the minimum tap position of the OLTC, T_{i,max} is the maximum tap position of the OLTC, and Q_{SVC,max} is the maximum reactive power generated by the SVC. Moreover, different sub-models (i.e., neural networks) are used to map the complex relationship between power loads and dispatching strategies. Considering that these sub-models have difficulty dealing with constraints directly, the penalty function method is employed to transform the reactive power optimization model into an unconstrained optimization problem, which yields a new form of the comprehensive objective function.
F_2 = F_1 − λ_1 Σ_{i=1}^{n} ε(ΔU_i) ΔU_i^2 − λ_2 Σ_{l=1}^{N} ε(ΔI_l) ΔI_l^2

where F_2 is the new form of the comprehensive objective function, λ_1 is the penalty coefficient of the voltage constraints, ε(·) is the step function, λ_2 is the penalty coefficient of the current constraints, ΔU_i is the violation of the voltage bounds at node i, and ΔI_l is the violation of the current bound on branch l.
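The penalty transformation can be sketched in a few lines of Python. The quadratic penalty on bound violations and all names below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def penalized_objective(f1, u, u_min, u_max, i_branch, i_max,
                        lambda1=1e3, lambda2=1e3):
    """Subtract weighted constraint violations from F_1 to obtain F_2.

    The step function eps(x) is realized implicitly: np.maximum(x, 0)
    is zero exactly when the constraint is satisfied, so only violated
    bounds contribute to the (assumed quadratic) penalty terms.
    """
    u = np.asarray(u, dtype=float)
    i_branch = np.asarray(i_branch, dtype=float)
    over_v = np.maximum(u - u_max, 0.0) + np.maximum(u_min - u, 0.0)
    over_i = np.maximum(i_branch - i_max, 0.0)
    return f1 - lambda1 * np.sum(over_v**2) - lambda2 * np.sum(over_i**2)
```

A feasible operating point leaves F_1 unchanged, while any bound violation is penalized in proportion to its squared magnitude.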
Note that dynamic reactive power optimization involves the third constraint, while static reactive power optimization does not need to consider it. In this paper, the time interval control strategy is used to divide a day into several time intervals [12]. Then, the dynamic reactive power optimization is simplified to multiple static reactive power optimizations within the intervals. Therefore, the third constraint is not added to the comprehensive objective function, since it has been implicitly considered by the time interval control strategy.

Framework of the Proposed Method
Ensemble learning is a popular meta approach of machine learning that obtains strong performance by combining the forecasting results from multiple different sub-models [13]. As one of the contributions of this paper, this section presents a framework that can ensemble three popular sub-models to obtain dispatching strategies of reactive power optimization, as shown in Figure 1.

First of all, the power loads are regarded as original features to train Model 1, which outputs the forecast values (i.e., the dispatching strategy). The power loads and the predicted dispatching strategy of the distribution networks are considered as new input features for the next sub-model. Then, the new input features are used to train Model 2, which predicts the dispatching strategy of the training set and test set. The power loads and the predicted dispatching strategies of Model 1 and Model 2 are considered as new input features for the next sub-model. Similarly, the new input features are used to train Model 3, which predicts the dispatching strategy of the test set. Finally, the final results are obtained by averaging the forecast values of all sub-models.
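The cascaded ensemble described above can be sketched as follows. A least-squares linear model stands in for each sub-model purely for illustration; the paper's actual sub-models are a CNN, an MLP, and LightGBM, and all names here are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    # Stand-in sub-model: least-squares linear regression with a bias term.
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Z: np.hstack([Z, np.ones((len(Z), 1))]) @ w

def sequential_ensemble(X_tr, y_tr, X_te, n_models=3):
    """Each sub-model sees the original loads plus all earlier models'
    predicted dispatching strategies; the final dispatch is the average
    of every sub-model's test-set output."""
    preds_te, F_tr, F_te = [], X_tr, X_te
    for _ in range(n_models):
        model = fit_linear(F_tr, y_tr)
        p_tr, p_te = model(F_tr), model(F_te)
        preds_te.append(p_te)
        # Augment the feature matrices with this model's predictions.
        F_tr = np.hstack([F_tr, p_tr.reshape(len(p_tr), -1)])
        F_te = np.hstack([F_te, p_te.reshape(len(p_te), -1)])
    return np.mean(preds_te, axis=0)
```

Swapping `fit_linear` for a CNN, MLP, or LightGBM wrapper with the same fit/predict interface recovers the framework of Figure 1.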
Traditional hold-out validation depends on just one train-test split, which makes its performance depend on how the data are divided into the training set and test set. In contrast, k-fold cross-validation is a popular resampling technique, which is widely used to improve the generalization of different models in computer vision [14]. The technique has a single parameter k, which refers to the number of groups into which a given dataset is divided. So far, k-fold cross-validation has shown outstanding performance in different fields, such as classification and prediction tasks. As another contribution of this paper, k-fold cross-validation is generalized from computer vision to the training process of each sub-model for reactive power optimization. The specific framework is shown in Figure 2.
Firstly, the samples in the training set are divided into k equal groups. The samples in the first k − 1 groups are used to train a sub-model, which predicts the dispatching strategies of the samples in the kth group and the test set. Secondly, the samples in all groups except the (k − 1)th are utilized to train another sub-model, which predicts the dispatching strategies of the samples in the (k − 1)th group and the test set. Proceeding in this way, k sub-models are trained, and together they predict the dispatching strategies of every sample in the training set and the test set. Finally, the predicted dispatching strategies of the training set are treated as a new feature, which is used to train the next sub-model, and the averaged test-set predictions over the k sub-models are the output of this sub-model.
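A minimal sketch of this out-of-fold scheme follows, with a generic `fit` callback standing in for the actual sub-models (all names are illustrative):

```python
import numpy as np

def kfold_oof(X, y, X_test, fit, k=4):
    """Train k sub-models, one per held-out fold.

    Returns the out-of-fold predictions for the training set (used as
    the next sub-model's new feature) and the k-model average for the
    test set, mirroring Figure 2.
    """
    idx = np.arange(len(X))
    folds = np.array_split(idx, k)
    oof = np.empty(len(X))
    test_preds = []
    for fold in folds:
        tr = np.setdiff1d(idx, fold)
        model = fit(X[tr], y[tr])          # train on the other k-1 folds
        oof[fold] = model(X[fold])         # predict the held-out fold
        test_preds.append(model(X_test))   # predict the test set
    return oof, np.mean(test_preds, axis=0)
```

Any model exposing a fit-then-predict interface can be plugged in as `fit`, so the same routine serves the CNN, MLP, and LightGBM sub-models alike.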
Compared with other data-driven-based methods, CNN, MLP, and LightGBM have better performance in many fields. Therefore, they are employed as examples to verify the effectiveness of the proposed ensemble framework [15]. Note that these three models may be replaced with other advanced models in future work. In the following sections, this paper shows how to employ sub-models to map the non-linear relationship between power loads and dispatching strategies, which is one of the contributions.

Convolutional Neural Network
The emergence of CNN has greatly promoted the development of deep learning and artificial intelligence. So far, CNN has been widely used in various fields, such as target detection, fault diagnosis, time-series prediction, and semantic segmentation, due to its powerful feature extraction capability [16]. As shown in Figure 3, a simple CNN structure consists of a convolutional layer, a pooling layer, and a dense layer.

Specifically, the convolutional operation is performed to extract features of the input data, and then a bias vector is added to obtain the output data of the convolutional layers:

Y_con = σ_con(W_con ∗ X_con + B_con)

where Y_con is the output data of the convolutional layers, X_con is the input data of the convolutional layers, σ_con(·) is the activation function of the convolutional layers, W_con represents the weights of the convolutional layers, and B_con represents the bias vectors of the convolutional layers. Note that the output data of the convolutional layers is utilized as the input data to the following maximum pooling layers.

As shown in Figure 4, the maximum pooling layer is employed to reduce the dimensionality of the input data:

Y_pool = max_{(p,q)∈R} X_pool

where Y_pool is the output data of the maximum pooling layers, X_pool is the input data of the maximum pooling layers, and R is the domain of definition for the maximum pooling layers. Note that the output data of the maximum pooling layers is utilized as the input data to the following convolutional layers or dense layers.

To reshape the multi-dimensional tensors into a one-dimensional vector, a flatten layer is inserted between the dense layers and the last maximum pooling layer. Moreover, the vectors from the flatten layer are fed to a dense layer to obtain the dispatching strategies:

Y_dense = σ_dense(W_dense X_dense + B_dense)

where Y_dense is the output data of the dense layers, X_dense is the input data of the dense layers, σ_dense(·) is the activation function of the dense layers, W_dense represents the weights of the dense layers, and B_dense represents the bias vectors of the dense layers.
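The three layer operations can be illustrated with a NumPy sketch. A 1-D convolution is used for brevity, and all shapes and names are illustrative rather than the paper's exact configuration.

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)

def conv1d(x, w, b):
    # Y_con = sigma_con(W_con * X_con + B_con): valid cross-correlation
    # of the input signal x with the kernel w, plus bias, then ReLU.
    n = len(x) - len(w) + 1
    return relu(np.array([x[i:i + len(w)] @ w for i in range(n)]) + b)

def max_pool(y, size):
    # Y_pool: maximum over each non-overlapping pooling region R.
    return np.array([y[i:i + size].max()
                     for i in range(0, len(y) - size + 1, size)])

def dense(x, W, b, act):
    # Y_dense = sigma_dense(W_dense X_dense + B_dense)
    return act(W @ x + b)
```

Chaining `conv1d`, `max_pool`, a flatten step, and `dense` reproduces the forward pass of the simple CNN in Figure 3.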

Multi-Layer Perceptron
Normally, the MLP consists of multiple dense layers. In this paper, the encoder-decoder pipeline is used to map the non-linear relationship between power loads and dispatching strategies, as shown in Figure 5.

For the encoder, low-dimensional latent variables can be obtained by feeding the input data to multiple dense layers:

Y_en = σ_en(W_en X_en + B_en)

where Y_en is the output data of the encoder, X_en is the input data of the encoder, σ_en(·) is the activation function of the encoder, W_en represents the weights of the encoder, and B_en represents the bias vectors of the encoder. Note that the output data of the encoder is used as the input data of the decoder.

For the decoder, the dispatching strategies can be obtained by feeding the latent variables to multiple dense layers:

Y_de = σ_de(W_de X_de + B_de)

where Y_de is the output data of the decoder, X_de is the input data of the decoder, σ_de(·) is the activation function of the decoder, W_de represents the weights of the decoder, and B_de represents the bias vectors of the decoder.
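The encoder-decoder forward pass is simply a chain of the dense-layer formula above. A minimal NumPy sketch follows; the layer shapes and activations are illustrative, not the tuned configuration from the experiments.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
relu = lambda x: np.maximum(x, 0.0)

def encoder_decoder(x, enc_layers, dec_layers):
    """Apply Y = sigma(W X + B) layer by layer.

    enc_layers compresses the power loads into a low-dimensional latent
    vector; dec_layers maps that latent vector to the dispatching
    strategy. Each layer is a (W, b, activation) triple.
    """
    h = x
    for W, b, act in enc_layers + dec_layers:
        h = act(W @ h + b)
    return h
```

With a sigmoid on the final decoder layer, the outputs land in (0, 1) and can be rescaled to the equipment's dispatch ranges.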

Light Gradient Boosting Machine
LightGBM is a high-performance and distributed gradient boosting framework improved from the decision tree, which is widely used for regression and classification tasks [17]. Specifically, multiple decision trees are trained in an additive manner to forecast the residual errors of the prior models. Suppose that a LightGBM model with n_tr trees is trained on n_sa samples; the additive training process can be represented as:

ŷ_i^(t) = ŷ_i^(t−1) + f_t(x_i)

where f_t(·) is the learned function of the tth decision tree, and ŷ_i^(t) is the current forecast for sample i after t trees. During each iteration, the current forecasts ŷ_i^(t) and the learned function f_t(·) are updated by minimizing the loss function:

L^(t) = Σ_{i=1}^{n_sa} D(y_i, ŷ_i^(t)) + Ω(f_t)

where Ω(·) is a regularization term, and D(·) is the distance between the current forecasts ŷ_i^(t) and the real values y_i, such as the mean squared error (MSE):

D(y_i, ŷ_i^(t)) = (y_i − ŷ_i^(t))^2

Moreover, LightGBM can be seen as an improved version of extreme gradient boosting in the following aspects: Firstly, gradient-based one-side sampling (GOSS) is incorporated into LightGBM. GOSS achieves a good balance between accuracy and the number of samples: more attention is paid to samples with larger gradients during training, since they have a greater impact on the gain. Secondly, LightGBM employs leaf-wise growth with a depth limitation rather than the traditional level-wise algorithm to improve accuracy. Thirdly, exclusive feature bundling (EFB) is utilized to reduce the dimension of features: new features are obtained by bundling mutually exclusive features together. Fourthly, the histogram algorithm is used to identify the optimal segmentation point in LightGBM; it constructs a histogram of fixed width by discretizing successive floating-point eigenvalues into multiple integers.
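The additive residual-fitting loop above can be sketched with a deliberately trivial weak learner (a single constant per round standing in for LightGBM's leaf-wise trees); names and the learning-rate shrinkage are illustrative.

```python
import numpy as np

def boost_residuals(X, y, n_trees=50, lr=0.1):
    """Gradient boosting sketch for the MSE loss.

    For MSE, the negative gradient at step t is exactly the residual
    y - y_hat^(t-1), so each round fits the residual and the ensemble
    update is y_hat^(t) = y_hat^(t-1) + lr * f_t(x).
    """
    pred = np.zeros_like(y, dtype=float)
    for _ in range(n_trees):
        residual = y - pred        # negative gradient of the MSE loss
        f_t = residual.mean()      # weakest possible 'tree': one constant
        pred = pred + lr * f_t     # additive update with shrinkage
    return pred
```

Replacing the constant learner with a depth-limited regression tree, plus GOSS sampling and histogram-based splits, yields the actual LightGBM procedure.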

Parameters and Data Description
In order to fully test the performance of the proposed ensemble model, the modified IEEE 33-bus radial distribution network and the modified IEEE 69-bus radial distribution network are employed for simulation and analysis. The parameters (e.g., resistance and reactance of branches) can be found in [18,19], and the topologies are shown in Figures 6 and 7.
For the modified IEEE 33-bus radial distribution network, the rated voltage is 10 kV. The OLTC includes 17 different tap positions, which vary from −8 to 8. Generally, decentralized capacitor banks and SVCs at the end of feeders can reduce the power loss and voltage offset. Therefore, the capacities and locations of the equipment are assumed as follows: Six shunt capacitor banks are added at Node 17, and seven shunt capacitor banks are added at Node 32. The capacity of each shunt capacitor bank is 100 kvar. The SVC is added at Node 8, and its reactive power varies from 0 to 500 kvar.

For the modified IEEE 69-bus radial distribution network, the rated voltage is 10 kV, and the OLTC also includes 17 tap positions. The reactive power of each SVC varies from 0 to 400 kvar. Seven shunt capacitor banks are added at Node 17, Node 26, Node 51, and Node 67, and the capacity of each shunt capacitor bank is 100 kvar. The SVCs are added at Node 9, Node 33, and Node 44.
The smart meter dataset of London is used to provide the power loads of the modified IEEE 33-bus and 69-bus radial distribution networks. This dataset contains hourly household power load curves in 112 blocks from November 2011 to February 2014 [20]. The power loads of three adjacent blocks are randomly selected to simulate the electricity consumption of each node in the distribution networks. Only 5000 samples are retained for simulation after data cleaning, since the collection time of each block differs. Further, 80% of the samples are randomly selected to train each model, and 10% of the samples are randomly selected as the validation set. The rest are employed to evaluate the performance of the trained models. The active power and reactive power form the input feature of one sample. For the modified IEEE 33-bus radial distribution network, the input feature is a 1 × 64 vector. For the modified IEEE 69-bus radial distribution network, the input feature is a 1 × 136 vector. Before training the sub-models, dispatching strategies should be obtained as labels. In this paper, the GA is run 40 times independently, and the best dispatching strategy is taken as the label of each sample.
All programs for reactive power optimization are implemented in PyCharm with deep learning libraries (e.g., TensorFlow 1.0 and Keras 2.0). The simulations run on a laptop with a dual-core 2.40 GHz Intel(R) Core(TM) i3-3110M processor and 6 GB of memory.
Furthermore, the probing method is used to find the appropriate structures and parameters for the sub-models and baselines by performing multiple experiments and fine-tuning the parameters [21]: (1) For the CNN, it includes a convolutional layer, a maximum pooling layer, a flatten layer, and a dense layer with 4 units. The number of convolutional filters is 16, and the size of the convolutional kernel is 2 × 2. The pool size is 3 × 3. The activation function of the dense layer is the sigmoid function, and the others are the rectified linear unit (ReLU) function. The optimizer is the adaptive moment estimation (Adam) algorithm, and the loss function is the MSE between the forecast labels and the real labels. (2) For the MLP, the activation functions of the middle layers are all ReLU functions, and the activation function of the output layer is the sigmoid function. The loss function and optimizer are the same as those of the CNN. (3) For LightGBM, the boosting type is the traditional gradient boosting decision tree, and the maximum tree depth for base learners is 5. The boosting learning rate is 0.005, and the number of boosted trees is 1000. The minimum number of data needed in a child is 80, and the sub-sample ratio of the training instances is 0.8. The maximum number of tree leaves for base learners is 25, and the sub-sample ratio of columns is 1. (4) The parameters of the CBR are the same as the algorithm in [5]. (5) For GA, the size of chromosomes is 50, and the number of iterations is 300. The probability of variation is 0.2, and the probability of chiasma is 0.5.
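The reported LightGBM settings map naturally onto the scikit-learn-style parameter names of the `lightgbm` package; the dictionary below is a hypothetical rendering of them, shown as plain data so it can be passed to `lightgbm.LGBMRegressor(**lgbm_params)`.

```python
# Hypothetical LightGBM configuration mirroring the reported settings.
lgbm_params = {
    "boosting_type": "gbdt",     # traditional gradient boosting decision tree
    "max_depth": 5,              # maximum tree depth for base learners
    "learning_rate": 0.005,      # boosting learning rate
    "n_estimators": 1000,        # number of boosted trees
    "min_child_samples": 80,     # minimum number of data needed in a child
    "subsample": 0.8,            # sub-sample ratio of the training instances
    "num_leaves": 25,            # maximum tree leaves for base learners
    "colsample_bytree": 1.0,     # sub-sample ratio of columns
}
```
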

Effect of k-Fold Cross-Validation
To compare the performance of k-fold cross-validation and traditional hold-out validation, LightGBM, MLP, and CNN are used as sub-models to form an ensemble model. The value of k varies from 2 to 15 with a step size of 1, and each case is repeated 30 times. The mean loss functions (i.e., the MSE between forecast and real labels) on the test set are shown in Table 1. The following conclusions can be drawn from Table 1: (1) Compared with traditional hold-out validation, k-fold cross-validation shows smaller loss functions, which indicates that k-fold cross-validation outperforms hold-out validation. This is because every fold appears in the training set k − 1 times, which ensures that each sample contributes to training and thus enables the sub-models to represent the latent features better. (2) With the increase in k, the loss function first decreases and then increases, which indicates that k should be neither too small nor too large. In addition, the training time of each sub-model increases linearly with k. Hence, the loss function and the training time should be considered together when k is set. In general, four is a good starting point for k, and higher or lower values may suit other datasets. (3) Although the proposed ensemble model requires some time to pre-train the sub-models before using them, this training time is not very long and is acceptable in practical engineering.

The Effect of the Order on Performance
In order to analyze the influence of the sub-models' order on the performance of the proposed method, 15 cases with different orderings are set, and each case is repeated 30 times. The mean loss functions (i.e., MSE) on the test set are shown in Table 2 and Figure 8.
The following conclusions can be drawn from Figure 8 and Table 2: (1) Comparing the loss functions of Case 6, Case 13, Case 14, and Case 15, it is found that multiple different sub-models are more conducive to improving the performance of the ensemble model than multiple identical sub-models. (2) Sometimes, the performance of the ensemble model composed of different sub-models may not be better than that of another ensemble model with the same sub-models, because the performance of the former is significantly affected by the order of different sub-models. For example, the loss function of Case 2 is larger than those of Case 13, Case 14, and Case 15. Generally, different sub-models can be selected to form the proposed ensemble model, and their order should be determined by the loss function of the validation set.

Comparative Analysis with Baselines
Normally, the dynamic reactive power optimization can be simplified into multiple static reactive power optimization problems using the time interval control strategy [12]. Specifically, the power load curve is divided into multiple time intervals, and then the static reactive power optimization is performed in each time interval to obtain a comprehensive dispatching strategy for dynamic reactive power optimization. Therefore, the static reactive power optimization can be used as an example to validate the effectiveness of the proposed method in this section.
To illustrate the effectiveness of the proposed ensemble model, the traditional heuristic algorithm (e.g., GA) and popular data-driven-based algorithms (e.g., CBR, MLP, CNN, and LightGBM) are used as the baselines. Each method is repeated 30 times, and the mean results of the test set are shown in Table 3.
The following conclusions can be drawn from Table 3: (1) Generally, the smaller the power loss and voltage offset, the better the performance of the model. Note that power loss and voltage offset are sometimes conflicting metrics. Therefore, the comprehensive objective function is introduced to balance them and evaluate the model performance in an integrated manner; the larger the comprehensive objective function, the better the performance of the model. Specifically, the average comprehensive objective function of CBR is the smallest, which shows that the dispatching strategies of historical cases found by CBR are not well suited to current cases, since historical power loads may differ significantly from current ones. (2) Although the performance of the ensemble model is slightly weaker than that of GA with regard to the average comprehensive objective function and its variance, the ensemble model outperforms the other data-driven-based algorithms (e.g., CNN, CBR, MLP, and LightGBM), since its average comprehensive objective function is larger than theirs. This shows that the ensemble model can extract better performance from multiple sub-models for reactive power optimization. (3) The online calculation time is one of the important metrics for evaluating the performance of each model for reactive power optimization. Normally, suitable dispatching strategies should be obtained within 60 s [22], during which real-time power systems obtain the observations and then calculate solutions for all power equipment. For a single reactive power optimization of the modified IEEE 69-bus radial distribution network, the online time consumptions of the ensemble model, GA, CNN, MLP, LightGBM, and CBR are 0.23 s, 64.77 s, 0.08 s, 0.06 s, 0.09 s, and 4.37 s, respectively.
Although data-driven-based algorithms require some time to pre-train models, their online time consumptions are much lower than those of traditional heuristic algorithms, such as GA. (4) Further, the online calculation time of GA increases significantly with the size of the distribution network (e.g., the number of nodes and equipment), while the online calculation time of the ensemble model is not sensitive to the network size, which shows that the proposed model is also suitable for reactive power optimization of large-scale distribution networks. This is one of the advantages of the proposed model: its online calculation time is very short, making it well suited to the real-time optimization of power systems.
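As one possible illustration of how power loss and voltage offset can be combined into a single score, the sketch below uses a hypothetical weighted form normalized against a no-control baseline, so that larger values indicate better dispatch strategies; the paper's exact formulation may differ.

```python
def comprehensive_objective(p_loss, v_offset, p_base, v_base, w=0.5):
    """Hypothetical comprehensive objective: a weighted sum of the
    relative improvements in power loss and voltage offset versus a
    no-control baseline. Larger is better; w trades off the two
    (conflicting) metrics."""
    return w * (1 - p_loss / p_base) + (1 - w) * (1 - v_offset / v_base)

# a strategy that cuts the loss from 120 kW to 90 kW and the maximum
# voltage offset from 0.08 p.u. to 0.05 p.u., equally weighted
score = comprehensive_objective(90.0, 0.05, 120.0, 0.08)
```

With this form, a strategy that changes nothing scores 0, and any reduction in either metric raises the score, which matches the "larger is better" reading of Table 3.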

Reactive Power Optimization with Renewable Energy Sources
In order to achieve carbon neutrality, the integration of renewable energy sources into distribution networks has become increasingly popular in recent years. To test the performance of different models for reactive power optimization of distribution networks with renewable energy sources, the IEEE 33-bus radial distribution network is again modified, as shown in Figure 9. In particular, the first PV system is added to Node 24 and the second PV system is added to Node 21. The first wind turbine (WT) is added to Node 25, and the second wind turbine is added to Node 12. The power factor of a node with a wind turbine or PV system is assumed to be fixed at 0.95. The power generation data of the renewable energy sources originate from the National Renewable Energy Laboratory [23,24], with a time resolution of 1 h. To ensure that the penetration of renewable energy sources in the distribution network is between 10% and 50%, the original power generation of the renewable energy sources is scaled up appropriately. Each method is repeated 30 times, and the mean results of the test set are shown in Table 4.
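The penetration scaling described above can be sketched as follows; the profiles and the 30% target are illustrative, not the NREL data used in the paper. Penetration is taken here as the ratio of total renewable energy to total demand over the horizon.

```python
def scale_to_penetration(gen_series, load_series, target):
    """Scale a renewable generation profile so that its share of the
    total energy demand equals `target` (e.g., 0.10-0.50), mirroring
    the setup where the original profiles are scaled to the studied
    penetration level."""
    current = sum(gen_series) / sum(load_series)
    factor = target / current
    return [g * factor for g in gen_series]

gen = [10.0, 30.0, 20.0]     # kW, hourly PV/WT output (illustrative)
load = [100.0, 120.0, 80.0]  # kW, hourly demand (illustrative)
scaled = scale_to_penetration(gen, load, 0.30)
```

A uniform scale factor preserves the shape of the generation profile, so only the penetration level changes between the studied cases, not the temporal variability.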
Regardless of the penetration level, the comprehensive objective function value of the proposed ensemble model is the largest, which shows that the ensemble model outperforms the other data-driven-based algorithms (e.g., CNN, CBR, MLP, and LightGBM) for reactive power optimization of distribution networks at different penetration levels.

Conclusions
To improve the accuracy and reduce the calculation time of reactive power optimization, a novel ensemble learning-based model is presented in this paper. Through the simulation analysis on two radial distribution networks, the following conclusions are obtained: (1) The accuracy of models trained by k-fold cross-validation is higher than that of models trained by hold-out validation. In addition, k should be neither too small nor too large; k = 4 is a good starting point, although higher or lower values may suit other datasets. (2) Multiple different sub-models are more conducive to improving the performance of the ensemble model than multiple identical sub-models. Additionally, the performance of the ensemble model is significantly affected by the order of the different sub-models. Normally, different sub-models can be selected to form the proposed ensemble model, and their order should be determined by the loss function on the validation set. (3) The proposed ensemble model outperforms other data-driven-based algorithms (e.g., CNN, CBR, MLP, and LightGBM) in terms of optimization accuracy and stability. In addition, its calculation time is much lower than that of traditional heuristic methods (e.g., GA), especially for large-scale distribution networks. (4) Regardless of the penetration level, the ensemble model performs better than other data-driven-based algorithms (e.g., CNN, CBR, MLP, and LightGBM) for reactive power optimization of distribution networks.