A New Hyper-Parameter Optimization Method for Power Load Forecast Based on Recurrent Neural Networks

Abstract: The selection of hyper-parameters plays a critical role in prediction tasks based on recurrent neural networks (RNNs). Traditionally, the hyper-parameters of machine learning models are selected through simulation and human experience. In recent years, several algorithms based on Bayesian optimization (BO) have been developed to determine optimal hyper-parameter values; most of them require gradients to be calculated. In this work, particle swarm optimization (PSO) is used within the BO framework to develop a new hyper-parameter optimization method. The proposed algorithm (BO-PSO) is free of gradient calculation, and its particles can naturally be optimized in parallel, so the computational complexity is effectively reduced, which means that better hyper-parameters can be obtained for the same amount of computation. Experiments on real-world power load data show that the proposed method outperforms the existing state-of-the-art algorithms, BO with limited-memory BFGS with bounds (BO-L-BFGS-B) and BO with truncated Newton (BO-TNC), in terms of prediction accuracy. The prediction errors of the different models show that BO-PSO is an effective hyper-parameter optimization method.


Introduction
The selection of hyper-parameters has always been a key problem in the practical application of machine learning models: the generalization performance of a model depends on a reasonable choice of its hyper-parameters. Many research works address hyper-parameter tuning of machine learning models, including convolutional neural networks (CNNs) [1]. At present, neural network models have achieved remarkable results in fields such as image recognition, fault detection and classification (FDC) [2][3][4], and natural language processing. In practice, selecting hyper-parameters by experience and a large number of trials is not only time-consuming and computationally expensive but also does not always maximize model performance [5,6]. As models grow more complex, the number of hyper-parameters increases and the parameter space becomes very large. Trying every hyper-parameter combination is infeasible: it is time-consuming and imposes a heavy knowledge and labor burden.
Therefore, as an alternative to manual selection, many straightforward optimization methods are used for automatic hyper-parameter estimation in engineering practice, such as grid search [7,8] and random search [9]. These methods evaluate candidate hyper-parameter settings and finally select the one with the best performance. In recent years, the Bayesian optimization (BO) algorithm [10,11] has become very popular for hyper-parameter estimation in machine learning [12]. Unlike grid search and random search, the BO framework is sequential: each search for the current optimum builds on the previous search results and makes full use of the information in the existing data [13], which the other methods ignore. BO uses a limited sample to construct a posterior probability distribution of the black-box function and thereby find its optimal value. In BO, the mapping from hyper-parameters to model generalization accuracy is represented by a surrogate model, and the hyper-parameter tuning problem is turned into the problem of maximizing an acquisition function. The acquisition function describes how likely a candidate point is to attain the maximum or minimum of the model's generalization accuracy; this function may be high-dimensional and have many extreme points.
In basic BO, gradient-based methods are commonly used to find the maximum of the acquisition function, such as limited-memory BFGS with bounds (L-BFGS-B) [14] and truncated Newton (TNC) [15]. However, these methods require the gradient with respect to the variables to be computable. In high-dimensional spaces, computing first or second derivatives is complex, and the result is not guaranteed to be globally optimal. The PSO method [16] is conceptually simple, easy to implement, and computationally efficient, and has been applied successfully in many fields [17][18][19]. However, when the mapping from hyper-parameters to the model's loss or generalization accuracy lacks an explicit mathematical formula, PSO and other optimization methods cannot be applied directly to hyper-parameter estimation [20,21]. In this paper, we use PSO to maximize the acquisition function. PSO can complete this task well without computing the gradient of the variables, and when PSO finds better maximizers of the acquisition function, the generalization accuracy of the machine learning model improves with high probability.
To address the high computational complexity of maximizing the acquisition function, this paper proposes the BO-PSO algorithm, which combines the high sample efficiency of BO with the simplicity of PSO. Additionally, with scientific and technological progress and rapid socio-economic development, the demand for electric power is increasing. Accurate power load forecasting is very important for the stability of the power system, the guarantee of power service, and the rational utilization of power. Scholars have proposed a variety of prediction methods, including time series methods and multiple linear regression [22,23]. However, with the increasing intelligence of the grid, power load data are becoming more and more complex. Power load forecasting is a nonlinear time series problem, and more accurate forecasting relies on machine learning algorithms; in recent years, deep learning methods have played a vital role in this field [24].
To address the issues above, we conduct the following study: based on recurrent neural network (RNN) and long short-term memory (LSTM) models, the method proposed in this paper is used instead of manual selection to determine the hyper-parameters, and power load forecasting is carried out on a real time-series data set. The experimental results show that the method is effective for hyper-parameter tuning of machine learning models and can be applied effectively to power load forecasting.
The remainder of this paper is organized as follows: BO, PSO, RNN and LSTM are introduced in Section 2. The BO-PSO is introduced in Section 3. Furthermore, the experimental results are demonstrated in Section 4. The paper is concluded in Section 5.

Bayesian Optimization
Bayesian optimization (BO) was first proposed by Pelikan et al. of the University of Illinois Urbana-Champaign (UIUC) in 1998 [25]. It finds the optimal value of a function by constructing a posterior probability of the output of the black-box function from a limited set of known sample points. Because the BO algorithm is very sample-efficient, it is especially useful when evaluating the objective function is expensive, when derivatives with respect to the variables cannot be obtained, or when the function has multiple peaks. The BO method has two core components: a probabilistic surrogate model built from a prior distribution, and an acquisition function. BO is a sequential model in which the posterior probability is updated with the new sample points at each iteration. At the same time, to avoid falling into local optima, BO algorithms usually add some randomness to trade off random exploration against the posterior distribution. BO is one of the few hyper-parameter estimation methods with a good convergence theory [26].
Taking the optimization of model hyper-parameters with BO as the research direction, the problem of finding the global maximum or minimum of the black-box objective function is defined as (this paper takes maximization as the example):

$$x^* = \arg\max_{x \in \mathcal{X}} f(x),$$

where $x \in \mathcal{X}$ and $\mathcal{X}$ is the hyper-parameter space. Suppose the existing data are $D_{1:t} = \{(x_i, y_i)\}, i = 1, 2, \dots, t$, where $y_i$ is the generalization accuracy of the model under the hyper-parameters $x_i$; in the following, $D_{1:t}$ is abbreviated as $D$. We hope to estimate the maximum of the objective function within a limited number of iterations. If $y$ is regarded as a noisy observation of the generalization accuracy, then $y = f(x) + \varepsilon$, where the noise $\varepsilon$ is independent and identically distributed with $p(\varepsilon) = N(0, \sigma_\varepsilon^2)$. The goal of hyper-parameter estimation is to find $x^*$ in the $d$-dimensional hyper-parameter space.
One problem with this maximum expected accuracy framework is that the true sequential accuracy is typically computationally intractable. This has led to the introduction of many myopic heuristics known as acquisition functions, which are maximized to select the next evaluation point:

$$x_{t+1} = \arg\max_{x \in \mathcal{X}} \alpha(x \mid D).$$

There are three commonly used acquisition functions: probability of improvement (PI), expected improvement (EI), and upper confidence bound (UCB). These acquisition functions trade off exploration against exploitation.
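As a concrete illustration, the three acquisition functions have simple closed forms in terms of the surrogate's posterior mean $\mu$ and standard deviation $\sigma$. The sketch below uses only the standard library; the function names and the margin parameter `xi` are our own choices, not taken from the paper:

```python
import math

def _phi(z):
    # Standard normal probability density function.
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _Phi(z):
    # Standard normal cumulative distribution function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probability_of_improvement(mu, sigma, y_best, xi=0.01):
    # PI: probability that f(x) exceeds the incumbent y_best by margin xi.
    return _Phi((mu - y_best - xi) / sigma)

def expected_improvement(mu, sigma, y_best, xi=0.01):
    # EI: expected amount by which f(x) exceeds y_best.
    z = (mu - y_best - xi) / sigma
    return (mu - y_best - xi) * _Phi(z) + sigma * _phi(z)

def upper_confidence_bound(mu, sigma, gamma=2.0):
    # UCB: optimistic bound; gamma trades exploration against exploitation.
    return mu + gamma * sigma
```

All three take a single posterior $(\mu, \sigma)$ pair; in practice they are evaluated over a batch of candidate points and the argmax is taken.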
In recent years, BO has been widely used in machine learning model hyper-parameters estimation and model automatic selection [27][28][29][30][31], which promotes the research of BO method for hyper-parameters estimation in many aspects [32][33][34][35]. The flow of the BO algorithm is shown in Figure 1.

Particle Swarm Optimization
Particle swarm optimization (PSO) [36,37] is a method based on swarm intelligence, first proposed by Kennedy and Eberhart in 1995 [38]. Because of its simplicity of implementation, the PSO algorithm has been used successfully in machine learning, signal processing, adaptive control, and so on. The PSO algorithm first initializes m particles randomly; each particle is a potential solution, within the search space, to the problem to be solved.
In each iteration, the velocity and position of each particle are updated using two values: the particle's own best value ($p_b$) and the best value found so far by the whole population ($g_b$). Suppose there are $m$ particles in the $d$-dimensional search space; the velocity $v$ and position $x$ of the $i$-th particle at iteration $t$ are

$$v_i^t = (v_{i1}^t, v_{i2}^t, \dots, v_{id}^t), \quad x_i^t = (x_{i1}^t, x_{i2}^t, \dots, x_{id}^t).$$

The best position of the particle and the overall best position of the population at iteration $t$ are

$$p_b = (p_{i1}^t, p_{i2}^t, \dots, p_{id}^t), \quad g_b = (g_1^t, g_2^t, \dots, g_d^t).$$

At iteration $t + 1$, the velocity and position of the particle are updated as follows:

$$v_i^{t+1} = \omega v_i^t + c_1 r_1 (p_b - x_i^t) + c_2 r_2 (g_b - x_i^t),$$
$$x_i^{t+1} = x_i^t + v_i^{t+1},$$

where $\omega$ is the inertia weight coefficient, which trades off global search ability against local search ability, and $c_1$ and $c_2$ are the learning factors of the algorithm. If $c_1 = 0$, the algorithm easily falls into, and cannot jump out of, a local optimum; if $c_2 = 0$, the convergence of PSO becomes slow. $r_1$ and $r_2$ are random variables uniformly distributed in $[0, 1]$.
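The update rules above can be sketched in a few lines of plain Python. This is an illustrative implementation (function and variable names are ours, not the paper's code); it maximizes a user-supplied function over box bounds:

```python
import random

def pso_maximize(f, bounds, n_particles=30, n_iter=100,
                 w=0.7298, c1=1.49618, c2=1.49618, seed=0):
    # Minimal PSO sketch: f is the objective to maximize,
    # bounds is a list of (lo, hi) pairs, one per dimension.
    rng = random.Random(seed)
    dim = len(bounds)
    x = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    pb = [xi[:] for xi in x]                      # per-particle best positions
    pb_val = [f(xi) for xi in x]
    g = max(range(n_particles), key=lambda i: pb_val[i])
    gb, gb_val = pb[g][:], pb_val[g]              # global best
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # velocity update: inertia + cognitive + social terms
                v[i][d] = (w * v[i][d]
                           + c1 * r1 * (pb[i][d] - x[i][d])
                           + c2 * r2 * (gb[d] - x[i][d]))
                # position update, clipped to the search bounds
                x[i][d] = min(max(x[i][d] + v[i][d], bounds[d][0]), bounds[d][1])
            val = f(x[i])
            if val > pb_val[i]:
                pb[i], pb_val[i] = x[i][:], val
                if val > gb_val:
                    gb, gb_val = x[i][:], val
    return gb, gb_val
```

Note that no derivative of `f` is ever computed, which is exactly the property exploited later when `f` is a non-differentiable acquisition surface.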
In each iteration of the PSO algorithm, only the optimal particle can transmit the information to other particles. The algorithm generally has two termination conditions: a maximum number of iterations or a sufficiently good fitness value.

Recurrent Neural Network
The recurrent neural network (RNN) does not rigidly memorize all fixed-length sequences. Given the input sequence $x_t$, it determines the output sequence $y_t$ by storing time-step information in a hidden state $h_t$. The network structure is shown in Figure 2, where $h_t$ is determined by the input of the current time step and the hidden state of the previous time step:

$$h_t = \phi(x_t \omega_{xh} + h_{t-1} \omega_{hh} + b_h),$$

where $\phi$ is the activation function, $\omega_{xh}$ and $\omega_{hh}$ are the weights, and $b_h$ is the hidden-layer bias.

Long short-term memory (LSTM) is a kind of gated recurrent neural network, carefully designed to avoid the long-term dependence problem. On the basis of the RNN, it introduces three gates, namely the input gate, forget gate and output gate, as well as memory cells with the same shape as the hidden state, to record additional information. As shown in Figure 3, the gates of the LSTM take as input the current time-step input $x_t$ and the previous hidden state $h_{t-1}$, and their outputs are computed by fully connected layers, so the values of the three gates all lie in the range $[0, 1]$. Suppose the number of hidden units is $l$; given the input $x_t$ at time step $t$ and the previous hidden state $h_{t-1}$, the input gate $I_t$, forget gate $F_t$, output gate $O_t$, candidate memory cell $\tilde{C}_t$, and memory cell $C_t$ are calculated as follows:

$$I_t = \sigma(x_t \omega_{xi} + h_{t-1} \omega_{hi} + b_i),$$
$$F_t = \sigma(x_t \omega_{xf} + h_{t-1} \omega_{hf} + b_f),$$
$$O_t = \sigma(x_t \omega_{xo} + h_{t-1} \omega_{ho} + b_o),$$
$$\tilde{C}_t = \tanh(x_t \omega_{xc} + h_{t-1} \omega_{hc} + b_c),$$
$$C_t = F_t \odot C_{t-1} + I_t \odot \tilde{C}_t,$$

where $\sigma$ is the sigmoid activation function, $\omega_{xi}, \omega_{hi}, \omega_{xf}, \omega_{hf}, \omega_{xo}, \omega_{ho}, \omega_{xc}, \omega_{hc}$ are the weights, and $b_i, b_f, b_o, b_c$ are the biases. Once the memory cell is obtained, the flow of information from the memory cell to the hidden state $h_t$ is controlled by the output gate:

$$h_t = O_t \odot \tanh(C_t),$$

where the tanh function ensures that the hidden-state element values lie between −1 and 1.
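The gate equations map directly onto code. The sketch below (our own illustrative helper, with an assumed parameter-dictionary layout, not the paper's implementation) computes one LSTM time step in NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    # One LSTM time step following the gate equations above.
    # p maps names to arrays; shapes: W_x* (n_in, l), W_h* (l, l), b_* (l,).
    i_t = sigmoid(x_t @ p["W_xi"] + h_prev @ p["W_hi"] + p["b_i"])      # input gate
    f_t = sigmoid(x_t @ p["W_xf"] + h_prev @ p["W_hf"] + p["b_f"])      # forget gate
    o_t = sigmoid(x_t @ p["W_xo"] + h_prev @ p["W_ho"] + p["b_o"])      # output gate
    c_tilde = np.tanh(x_t @ p["W_xc"] + h_prev @ p["W_hc"] + p["b_c"])  # candidate cell
    c_t = f_t * c_prev + i_t * c_tilde   # memory cell update
    h_t = o_t * np.tanh(c_t)             # hidden state, bounded in (-1, 1)
    return h_t, c_t
```

Because $o_t \in (0,1)$ and $\tanh(C_t) \in (-1,1)$, every element of the returned hidden state is strictly inside $(-1, 1)$, as stated above.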


BO-PSO
The BO algorithm based on particle swarm optimization (BO-PSO) is an iterative process. PSO is used to maximize the acquisition function to obtain the next point $x_{t+1}$ to be evaluated; then the objective function is evaluated as $y_{t+1} = f(x_{t+1}) + \varepsilon$; finally, the existing data $D$ are updated with the new sample $\{(x_{t+1}, y_{t+1})\}$, and the posterior distribution of the probabilistic surrogate model is updated to prepare for the next iteration.

Algorithm Framework
The effectiveness of BO depends to some extent on the acquisition function. The acquisition function is generally non-convex and multi-peaked, so a non-convex optimization problem must be solved over the search space $\mathcal{X}$. PSO has the advantages of simplicity, few parameters to adjust, and fast convergence, and it does not require the derivative of the objective function. Therefore, this paper chooses PSO to optimize the acquisition function and obtain new sample points.
First, we need to select a surrogate model. The function-approximation properties and the ability to quantify uncertainty of the Gaussian process (GP) make it a popular choice of surrogate model. A Gaussian process is a nonparametric model determined by a mean function and a covariance function (a positive-definite kernel); every finite subset of a Gaussian process follows a multivariate normal distribution. Assuming the output expectation of the model is 0, the joint distribution of the existing data $D$ and a new sample point $(x_{t+1}, y_{t+1})$ can be expressed as

$$\begin{bmatrix} y_{1:t} \\ y_{t+1} \end{bmatrix} \sim N\left(0, \begin{bmatrix} K + \sigma_\varepsilon^2 I & k \\ k^\top & k(x_{t+1}, x_{t+1}) \end{bmatrix}\right),$$

where the Gram matrix and cross-covariance vector are

$$K = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_t) \\ \vdots & \ddots & \vdots \\ k(x_t, x_1) & \cdots & k(x_t, x_t) \end{bmatrix}, \quad k = [k(x_{t+1}, x_1), \dots, k(x_{t+1}, x_t)]^\top,$$

$I$ is the identity matrix and $\sigma_\varepsilon^2$ is the noise variance. Predictions at a new $x$ can be made by conditioning on the original observations. The posterior distribution of $y_{t+1}$ is

$$p(y_{t+1} \mid D, x_{t+1}) = N\big(\mu_t(x_{t+1}), \sigma_t^2(x_{t+1})\big),$$

with mathematical expectation and variance

$$\mu_t(x_{t+1}) = k^\top (K + \sigma_\varepsilon^2 I)^{-1} y_{1:t},$$
$$\sigma_t^2(x_{t+1}) = k(x_{t+1}, x_{t+1}) - k^\top (K + \sigma_\varepsilon^2 I)^{-1} k.$$

The ability of a GP to express a distribution over functions depends only on the covariance function. The Matern-5/2 covariance function is one popular choice:

$$k(x, x') = \left(1 + \frac{\sqrt{5}\,r}{\ell} + \frac{5 r^2}{3 \ell^2}\right) \exp\left(-\frac{\sqrt{5}\,r}{\ell}\right), \quad r = \|x - x'\|.$$

The second choice we need to make is the acquisition function. Although our method is applicable to most acquisition functions, we use UCB, which is popular, in our experiments. GP-UCB was proposed by Srinivas et al. in 2009 [39]. The UCB strategy increases the value of the confidence boundary of the surrogate model as much as possible, and its acquisition function is

$$\alpha_{\mathrm{UCB}}(x \mid D) = \mu(x) + \gamma\, \sigma(x),$$

where $\gamma$ is a parameter that controls the trade-off between exploration (visiting unexplored areas of $\mathcal{X}$) and exploitation (refining our belief by querying close to previous samples). This parameter can be fixed to a constant value.
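The GP posterior and the UCB acquisition can be computed directly from these formulas. The following NumPy sketch (our own illustration; function names, the unit kernel variance, and the jitter value are our assumptions) implements the Matern-5/2 kernel, the posterior mean and variance, and the UCB score:

```python
import numpy as np

def matern52(X1, X2, length_scale=1.0, variance=1.0):
    # Matern-5/2 covariance between two point sets (rows are points).
    d = np.sqrt(((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1))
    r = np.sqrt(5.0) * d / length_scale
    return variance * (1.0 + r + r ** 2 / 3.0) * np.exp(-r)

def gp_posterior(X, y, X_new, noise=1e-4, **kern):
    # Posterior mean and variance at X_new given observations (X, y),
    # following mu = k^T (K + s^2 I)^-1 y and the matching variance formula.
    K = matern52(X, X, **kern) + noise * np.eye(len(X))
    K_s = matern52(X, X_new, **kern)
    K_ss = matern52(X_new, X_new, **kern)
    mu = K_s.T @ np.linalg.solve(K, y)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mu, np.clip(np.diag(cov), 0.0, None)

def ucb(mu, var, gamma=2.0):
    # UCB acquisition: posterior mean plus gamma standard deviations.
    return mu + gamma * np.sqrt(var)
```

At the training points themselves, the posterior mean reproduces the observations (up to the noise level) and the posterior variance collapses toward zero, which is the behavior BO relies on.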

Algorithm Steps
BO-PSO consists of the following steps: (i) assume a surrogate model for the black-box function $f$; (ii) define an acquisition function $\alpha$ based on the surrogate model of $f$, and maximize $\alpha$ by PSO to decide the next evaluation point; (iii) observe the objective function at the point specified by the maximization of $\alpha$, and update the GP model using the observed data. BO-PSO repeats (ii) and (iii) until the stopping conditions are met. The framework of Algorithm 1 is as follows:
Algorithm 1. BO-PSO.
Input: surrogate model for $f$, acquisition function $\alpha$.
Output: optimal hyper-parameter vector $x^*$.
Step 1. Initialize the hyper-parameter vector $x_0$;
Step 2. For $t = 1, 2, \dots, T$ do:
Step 3. Use PSO to maximize the acquisition function and obtain the next evaluation point: $x_{t+1} = \arg\max_{x \in \mathcal{X}} \alpha(x \mid D)$;
Step 4. Evaluate the objective function value $y_{t+1} = f(x_{t+1}) + \varepsilon_{t+1}$;
Step 5. Update the data: $D_{t+1} = D \cup \{(x_{t+1}, y_{t+1})\}$, and update the surrogate model;
Step 6. End for.
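The whole loop can be sketched compactly. The self-contained example below is our illustration of the scheme, not the authors' code: it works in one dimension for brevity, uses a squared-exponential kernel in place of Matern-5/2 to keep the GP short, and runs a small vectorized PSO as the inner maximizer of the UCB acquisition:

```python
import numpy as np

def bo_pso_maximize(f, bounds, n_init=5, n_outer=15, gamma=2.0, seed=0):
    # 1-D BO-PSO sketch: GP surrogate + UCB acquisition + inner PSO.
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    X = list(rng.uniform(lo, hi, n_init))   # random initial design
    Y = [f(x) for x in X]

    def ucb(xq):
        # GP posterior mean + gamma * std at query points xq (RBF kernel).
        Xa, Ya = np.array(X), np.array(Y)
        K = np.exp(-0.5 * (Xa[:, None] - Xa[None, :]) ** 2) + 1e-6 * np.eye(len(Xa))
        Ks = np.exp(-0.5 * (Xa[:, None] - xq[None, :]) ** 2)
        mu = Ks.T @ np.linalg.solve(K, Ya)
        var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
        return mu + gamma * np.sqrt(np.clip(var, 0.0, None))

    def pso_argmax(acq, n_particles=20, n_iter=40, w=0.7298, c=1.49618):
        # Inner PSO: maximize the acquisition, no gradients needed.
        x = rng.uniform(lo, hi, n_particles)
        v = np.zeros(n_particles)
        pb, pb_val = x.copy(), acq(x)
        g = pb[np.argmax(pb_val)]
        for _ in range(n_iter):
            r1, r2 = rng.random(n_particles), rng.random(n_particles)
            v = w * v + c * r1 * (pb - x) + c * r2 * (g - x)
            x = np.clip(x + v, lo, hi)
            val = acq(x)
            better = val > pb_val
            pb[better], pb_val[better] = x[better], val[better]
            g = pb[np.argmax(pb_val)]
        return g

    for _ in range(n_outer):
        x_next = float(pso_argmax(ucb))   # step (ii): maximize acquisition by PSO
        X.append(x_next)                  # step (iii): evaluate f, update data
        Y.append(f(x_next))
    best = int(np.argmax(Y))
    return X[best], Y[best]
```

Each outer iteration performs exactly Steps 3–5 of Algorithm 1: an inner PSO search over the cheap acquisition surface, one expensive evaluation of $f$, and a surrogate update.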

Results
To verify the effectiveness of BO-PSO, we select a one-year power load data set from the city of Nanchang and determine the hyper-parameters of the RNN and LSTM models with the optimization method proposed in this paper to realize power load forecasting.

Data Sets and Setups
In this study, the power load data set includes 35,043 records sampled at 15-min intervals. The data set was normalized to eliminate magnitude differences and divided into a training set and a test set, with a test-set size of 3000.
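The paper does not publish its preprocessing code, so the sketch below is one plausible version of the pipeline just described: min-max normalization, sliding windows whose length is the "feature length" hyper-parameter, and a chronological split with the last 3000 windows held out (the function name and window layout are our assumptions):

```python
import numpy as np

def prepare_series(load, feature_len, test_size=3000):
    # Min-max normalize, build sliding windows of length `feature_len`,
    # and split chronologically: the last `test_size` windows form the test set.
    load = np.asarray(load, dtype=float)
    lo, hi = load.min(), load.max()
    z = (load - lo) / (hi - lo)                      # scale to [0, 1]
    X = np.stack([z[i:i + feature_len] for i in range(len(z) - feature_len)])
    y = z[feature_len:]                              # next-step targets
    return (X[:-test_size], y[:-test_size]), (X[-test_size:], y[-test_size:])
```

A chronological split (rather than a random one) avoids leaking future load values into the training set, which matters for time-series forecasting.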
In hyper-parameter optimization of machine learning models, when the hyper-parameters are subject to boundary constraints, the methods available in BO to optimize the acquisition function include L-BFGS-B, TNC, SLSQP, and so on. Among them, L-BFGS-B and TNC are the most commonly used. We choose these two methods for comparison, and the corresponding BO frameworks are named BO-L-BFGS-B and BO-TNC, respectively.
The PSO algorithm can guarantee convergence when the parameters $(\omega, c_1, c_2)$ satisfy $-1 < \omega < 1$ and $0 < c_1 + c_2 < 4(1 + \omega)$ [40]. $c_1$ and $c_2$ affect the expected value and variance of the particle positions: the smaller the variance, the more concentrated the optimization results and the more stable the optimization. Other research results show that the constant inertia weight $\omega = 0.7298$ and acceleration coefficients $c_1 = c_2 = 1.49618$ have good convergence characteristics [41], and the PSO parameters in this experiment are set accordingly.
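The convergence condition is cheap to verify programmatically; the small helper below (our own, purely illustrative) checks a parameter triple against it:

```python
def pso_params_convergent(w, c1, c2):
    # Convergence condition from the text: -1 < w < 1 and 0 < c1 + c2 < 4(1 + w).
    return -1.0 < w < 1.0 and 0.0 < c1 + c2 < 4.0 * (1.0 + w)
```

The recommended setting $\omega = 0.7298$, $c_1 = c_2 = 1.49618$ satisfies the condition, since $c_1 + c_2 \approx 2.99 < 4(1 + 0.7298) \approx 6.92$.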
To ensure the comparability of the experiments, all training and testing are run with the same Python code package. The surrogate model is a Gaussian process with zero mean function and the Matern-5/2 covariance function, whose kernel and likelihood hyper-parameters are optimized by maximizing the log likelihood. The acquisition function is UCB, and each optimization run is randomly initialized with 5 observations.
In this experiment, RNN and LSTM are selected as the base models for power forecasting. The search space of the hyper-parameters is shown in Table 1.

Comparative Prediction Results
In order to compare the accuracy of the prediction models, the normalized mean square error (NMSE), the normalized median square error (NMDSE), and the coefficient of determination ($R^2$) are selected to evaluate the performance of the hyper-parameters. The calculation formulas are as follows:

$$\mathrm{NMSE} = \frac{\frac{1}{n}\sum_{t=1}^{n}(y_t - \hat{y}_t)^2}{\frac{1}{n}\sum_{t=1}^{n}(y_t - \bar{y})^2}, \quad \mathrm{NMDSE} = \frac{\operatorname{median}_t (y_t - \hat{y}_t)^2}{\frac{1}{n}\sum_{t=1}^{n}(y_t - \bar{y})^2}, \quad R^2 = 1 - \frac{\sum_{t=1}^{n}(y_t - \hat{y}_t)^2}{\sum_{t=1}^{n}(y_t - \bar{y})^2},$$

where $y_t$ is the actual power consumption at time $t$, $\hat{y}_t$ is the predicted power consumption at time $t$, and $\bar{y}$ is the mean of the actual values.

The values of $R^2$ for the models in Table 2 are close to 1, which means that all of these models achieve an excellent fit, and BO-PSO performs better than the other methods in terms of $R^2$. A visual comparison of the NMSE values is shown in Figure 5. After optimizing the models, the NMSE of the BO-PSO method, whether its maximum, minimum, or average value, is smaller than that of the other two methods, which shows that BO-PSO is effective in this experiment and better than the two comparison methods. After comparing the NMSE and $R^2$ values of the three methods, the final hyper-parameter values (feature length, number of network units, batch size of training data) are shown in Table 3.

To compare the effects of the three methods more intuitively, the hyper-parameters in Table 3 are used to train the RNN and LSTM models. The first 300 predictions on the test set using the proposed method and the comparison methods are shown in Figures 6 and 7; the power load shows an obvious fluctuation pattern, and the predictions of the proposed method are closest to the true values. The point-by-point prediction error of each model is also shown. The prediction curves fit well for all six models, while the error curves of RNN-BO-PSO and LSTM-BO-PSO fluctuate smoothly and their errors are clearly smaller than those of the other models in Figures 6 and 7. The resulting NMSE and NMDSE values of the models are shown in Table 4.
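The three metrics can be computed in a few lines. The sketch below uses common definitions of these quantities (the paper's exact formulas are not reproduced in this extract, so treat the NMSE/NMDSE normalization by the target variance as our assumption); $R^2$ follows the standard $1 - \mathrm{SSE}/\mathrm{SST}$ form:

```python
import numpy as np

def nmse(y, y_hat):
    # Mean squared error normalized by the variance of the actual values.
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean((y - y_hat) ** 2) / np.var(y)

def nmdse(y, y_hat):
    # Median of the squared errors, normalized the same way.
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.median((y - y_hat) ** 2) / np.var(y)

def r2(y, y_hat):
    # Coefficient of determination: 1 - SSE / SST.
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)
```

Lower NMSE and NMDSE indicate better predictions, while $R^2$ closer to 1 indicates a better fit, matching how the tables above are read.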
Figure 8 compares the two indicators. As can be seen from the chart, the two models optimized with the BO-PSO method have the smallest NMSE and NMDSE. Thus, the optimization method in this paper can not only replace the manual selection of hyper-parameters but also improve the performance of the model. From the results in Figures 6-8 and Table 4, we can see that BO-PSO is an effective hyper-parameter optimization method.

Conclusions
Hyper-parameter selection has long been a key problem in the practical application of machine learning models. In this paper, the BO-PSO algorithm is proposed, which exploits the computational simplicity of PSO and the high sample efficiency of BO for time-series data modeling. In this method, the PSO algorithm is used to maximize the acquisition function and obtain new points to be evaluated, which removes the gradient calculation required by the basic methods and greatly reduces the computational complexity. Finally, we use BO-PSO to determine the hyper-parameters of the RNN and LSTM models and compare the optimization results with the BO-L-BFGS-B and BO-TNC methods. The prediction errors of the different models show that BO-PSO is an effective hyper-parameter optimization method that can be applied to power load forecasting based on neural network models. However, the algorithm has not yet been tested in high-dimensional spaces. In future work, we plan to continue improving BO-PSO so that it also runs effectively in high-dimensional spaces.