Bus Load Forecasting Method of Power System Based on VMD and Bi-LSTM

: The effective prediction of bus load can provide an important basis for power system dispatching and planning and energy consumption to promote environmental sustainable development. A bus load forecasting method based on variational modal decomposition (VMD) and bidirectional long short-term memory (Bi-LSTM) network was proposed in this article. Firstly, the bus load series was decomposed into a group of relatively stable subsequence components by VMD to reduce the interaction between different trend information. Then, a time series prediction model based on Bi-LSTM was constructed for each sub sequence, and Bayesian theory was used to optimize the sub sequence-related hyperparameters and judge whether the sequence uses Bi-LSTM to improve the prediction accuracy of a single model. Finally, the bus load prediction value was obtained by superimposing the prediction results of each subsequence. The example results show that compared with the traditional prediction algorithm, the proposed method can better track the change trend of bus load, and has higher prediction accuracy and stability.


Introduction
With the development of energy environment, the proportion of power consumption on the user side in energy consumption is gradually increasing. The large-scale introduction of distributed generation and the diversification of user behavior characteristics pose new challenges to power dispatching planning. Because the power on the user side is mostly collected from the bus side, accurate prediction of bus load is of great reference significance to power planning. The bus load can be predicted by exploring the internal connection and development law between the load influencing factors and the bus load. Accurate load forecasting can be applied to many fields such as power system planning, market transactions, dispatching, etc., and serves as an important basis for the work of related departments. Electric energy consumption has become one of the most important modes of energy consumption in the world. The accurate prediction of electric energy consumption can also predict the future energy development trend, provide a basis for the development of renewable energy, and help the sustainable development of humans and the environment.
In recent years, the research on the theory of bus load forecasting has become increasingly mature, and its forecasting methods can be divided into three categories: statistical forecasting methods, intelligent forecasting methods, and combined forecasting methods. The statistical forecasting algorithm analyzes the time series based on the implicit time dependence and recursive relationship between the bus loads at different times, and then obtains the short-term bus load forecast values, including time series models, gray models, etc. However, the time series model has a suitable prediction effect when the relationship between load and time is linear or exponential, but the relationship between bus load and time is irregular and does not have obvious linear or exponential relationship; thus, the application scenarios have certain limitations. The intelligent prediction method realizes prediction by extracting the features of the data. Typical representatives are Deep Neural Network (DNN) [1], Support Vector Machine (SVM) [2], long and short-term memory (LSTM) Network [3][4][5], DeepAR [6], N-BEATS [7], Transformer [8], etc. These methods have the ability to model the nonlinear load process, can better adapt to nonlinear spikes and more accurately model the data characteristics of the load, and have better forecasting accuracy. They have become the main research direction of short-term bus load forecasting in recent years.
With the increasing application of data preprocessing theories such as wavelet transform, empirical mode decomposition (EMD) [9,10], ensemble empirical mode decomposition (EEMD) [11], and variational mode decomposition (VMD) [12,13], in view of the nonlinear and non-stationary characteristics of the load sequence, the data preprocessing method is used to decompose the original sequence, and each sub-sequence is predicted separately, and the prediction result is obtained by superimposing and reconstructing. With the progress of data preprocessing methods, bus load forecasting has more processing methods, and the combined forecasting method has been developed. Combination forecasting methods can be divided into two categories. One is to weight and synthesize the forecast results of different forecasting methods to obtain the combined forecast. The forecast results are easily affected by weight distribution. The other is based on the nonlinearity of the bus load sequence non-stationary characteristics, using signal processing methods such as wavelet transform, EMD, and VMD to decompose the original sequence, separately model each sub-sequence component, and reconstruct its prediction results through superposition to obtain the combined prediction results that meet the accuracy requirements. Reference [14] uses EMD to decompose the bus load for multi-step prediction, which has achieved ideal prediction results, but EMD is prone to mode aliasing and cannot choose the number of decomposed components. VMD can find the optimal solution of the natural modal function model through repeated iterations within a limited number of times, which can effectively avoid modal aliasing and improve robustness. Reference [15] proposed four hybrid models based on four decomposition methods, EMD, VMD, wavelet packet transform, and intrinsic time-scale decomposition to forecast agricultural commodity futures prices. The results showed that VMD contributed the most in improving the forecasting ability.
In intelligent prediction methods, the forecasting model based on LSTM in the field of deep learning has great prediction performance. Reference [16] improved the multi-level gated LSTM prediction model to effectively improve the accuracy of bus load prediction. However, the hyperparameters of LSTM are artificially set before the machine learning model starts the learning process, rather than parameters such as weights and biases obtained by training. The choice of hyperparameters plays a vital role in the improvement of model performance. Therefore, it is necessary to carry out parameter adjustment work according to different application scenarios. Hyperparameter optimization methods mainly include grid search method, random search method, Bayesian optimization algorithm, and so on. Among them, the grid search method is an exhaustive search method that traverses the hyperparameter space. It has the disadvantages of being time-consuming and low efficiency when searching in the high-dimensional space. The random search algorithm avoids the above to a certain extent through sparse and simple sampling. However, it still has the disadvantage of not being able to use prior knowledge to select the next set of hyperparameters. The basic idea of Bayesian optimization algorithm is to use prior knowledge to approximate the posterior distribution of the unknown objective function and then adjust the hyperparameters, which significantly improve the efficiency and accuracy of search in high-dimensional space. The data preprocessing method is selected for noise reduction preprocessing before bus load curve prediction in order to obtain more stable prediction results. In reference [17], the bidirectional long-term and short-term memory (Bi-LSTM) network is used to predict the bus load sequence, which effectively improves the learning ability of the network to historical data. However, with the combination prediction model dividing the complete sequence into multiple subsequences, not all sequences are suitable for bidirectional training. Therefore, it is necessary to use the optimization method to select the Bi-LSTM.
Based on this, a bus load forecasting method based on VMD and Bayesian Optimization Bi-LSTM is proposed in this paper. Firstly, VMD is used to stabilize the original bus load series, which is decomposed into a group of subsequence components with different frequencies. Then, the LSTM neural network prediction model of each subsequence component is constructed, the network related super parameters are optimized by Bayesian theory, and whether Bi-LSTM is used is judged to improve the prediction accuracy of a single model. Finally, the prediction results of each subsequence are superimposed to obtain the predicted value of bus load. In the second part of this paper, we expound the theoretical part of the proposed method and compare and verify the method through an example in the third part. The example results show that compared with the traditional algorithm, the prediction model constructed in this paper has a significant improvement in the accuracy of single-step prediction and multi-step prediction, and can better track the change trend of bus load.
The rest of the paper is organized as follows. Section 2 introduces the theories of methods that are used in the bus load forecasting method of power system, which includes VMD, LSTM, BiLSTM, and Bayesian optimization method, and establishes the VMD-BiLSTM combined prediction model. There is a case study in Section 3 to test the model and compare the model proposed in this paper with SVM, LSTM, Bayesian-LSTM, Bayesian-BiLSTM, and VMD-BiLSTM methods to prove its effectiveness. The conclusion and future studies are presented in Section 4.

Theoretical Framework
Here, we mainly explain the theoretical framework of the bus load forecasting in this article.

Construction of Variational Mode Decomposition Function
Variational modal decomposition (VMD) is an adaptive signal processing method proposed by Dragomiretskiy, which can be effectively applied to the smoothing processing of nonlinear and non-stationary time series [13]. It iteratively searches for the optimal solution of the variational mode, continuously updates each mode function and center frequency, and obtains a number of Intrinsic Mode Functions (IMF) with a certain bandwidth.
In the process of VMD, each natural mode is a finite bandwidth with a center frequency, so the variational problem can be defined as seeking k natural mode functions u k (t) and making the bandwidth of each mode is the smallest, and the sum of each mode is equal to the input signal f. The specific construction steps are as follows: (1) Through the Hilbert transform, the analytical signal of the modal function u k (t) is obtained: Among them, δ(t) is the Dirichlet function, * is the convolution symbol.
(2) Perform frequency mixing on the analytical signal to transform the frequency spectrum of each mode to the fundamental frequency band: (3) The constraints of the optimized variational model are: In the formula, K is the number of IMFs, {u k } = {u 1 , u 2 ,..., u k } is IMFs, and {ω k } = {ω 1 , ω 2 , ..., ω k } is the center frequency of u k .

Solution of Variational Mode Decomposition Function
(1) Using the quadratic penalty factor α and the Lagrangian multiplication operator λ (t), the constrained problem is turned into a non-constrained problem.
Extended Lagrangian function expression: (2) Initializeû 1 k , ω k ,λ 1 : Iteratively updateû k , ω k ,λ n under the condition of ω ≥ 0: Until ∑ k û n+1 k −û n k 2 2 / û n k 2 2 < ε. By using the VMD, different scales or trend components can be decomposed from the bus load sequence step by step to form a series of sub-sequence components with different time scales. The sub-sequences have stronger stationarity than the original series, and regularity help to improve forecast accuracy.

LSTM Operation Rules
The long and short-term memory (LSTM) network is a special modified version of the cyclic neural network. While retaining the cyclic feedback mechanism, the topology of the LSTM network controls the accumulation speed of information by introducing a gating unit, selectively adding new information, and selectively forgetting. The previously accumulated information solves the long-term dependence problem in sequence modeling.
Compared with ordinary RNN, LSTM neural network is also composed of an input layer, output layer, and hidden layer, but its hidden layer is replaced by ordinary neurons with memory modules containing gating mechanism. Its internal structure is shown in the Figure 1. The memory unit c t is the core component of the LSTM memory module. It is realized by controlling the forget gate, input gate, and output gate. It contains the long-term memory information of the sequence. The hidden layer state h t contains the short-term memory information of the sequence, which is updated faster than memory unit update speed [16].
Suppose a total of k time steps of the input vector sequence are divided into x 1 , x 2 , ..., x k according to the input time sequence, and the t-th time step is taken for analysis.
The LSTM operation rules are as follows: (1) Update the output of the forget gate, select the historical information that the memory unit needs to keep, and control the degree of influence of c t−1 on c t : (2) Update the two parts of the output of the input gate, select the current input information that the memory unit needs to retain, and control the degree of influence of x t on c t : (3) Update the cell status according to the input gate and forget gate: (4) Update the output gate output, select the output information that the memory unit needs to retain, and control the degree of influence of c t on h t : (5) Update the forecast output at the current moment: Among them, f t , i t , and o t represent the calculation results of the forget gate, input gate, and output gate at time t; W f , Wi, and Wo represent the weight matrix of the forget gate, input gate, and output gate, respectively; b f , b i , and b o represent the weight matrix of the forget gate, input gate, and output gate, respectively. The bias term of the forget gate, input gate, and output gate; o is the dot product symbol of the matrix element; σ(x) is the Sigmoid activation function of each gate: Its value range is (0, 1). Through the conversion of Sigmoid function, the input can be converted into probability values, so it is widely used as the activation function of artificial neural network transmission.
When x t is input to the network, it will be processed by a tanh neural layer and three gates at the same time as the hidden layer vector h t−1 of the previous time step. The tanh function is a hyperbolic tangent activation function, and its expression is: Its value range is (−1, 1), the output is centered at the origin, and the convergence speed is faster than that of Sigmoid. It is usually used as the activation function of the output gate.
The tanh layer will create a new candidate state vector c t . The forgetting gate f t determines what information to discard and retain from the cell state c t−1 of the previous time step. The input gate it determines how to update the candidate state vector. After the cell state is updated, the output gate o t decides how to filter the new state vector c t into output information h t .

Training Process of LSTM
The training algorithm of LSTM neural network mainly includes two categories: back propagation algorithm over time and real-time cyclic learning algorithm. The concept of backpropagation algorithm is clear and it has advantages in computational efficiency. This paper selects this method to train LSTM neural network.
The back propagation method expands the LSTM into a deep feedforward neural network in time sequence, and then further calculates the relevant parameter gradients according to the error back propagation algorithm of the feedforward network, and trains the model [18]. The LSTM network sequence expansion diagram is shown in Figure 2. The specific training steps are as follows in Figure 3. (1) Forward calculation of the output value of the LSTM memory module. The c t and h t of the current time step are calculated by the gating mechanism of LSTM and are retained for the calculation of the next time step. When the calculation of the last step is completed, the hidden layer vector h k will be used as the output and the prediction corresponding to this set of sequences' value (tag value) to compare.
(2) Reverse calculation of the error item value of each memory module, including two reverse propagation directions in chronological order and at network level.
Calculate the loss function value according to the comparison result of step 1, and select the square sum error function as the loss function of LSTM; the expression is: (3) According to the corresponding error term, calculate the gradient of each weight, and the gradient descent method iterates the parameters including W, U, V, c t , and h t : Through the gating mechanism and perfect parameter update rules, LSTM realizes the selection and screening of the input information flow, and improves the processing ability of the recurrent neural network for long sequences.
The back-propagation algorithm can compare the predicted value with the real value by using the error function after the forward calculation, and then optimize the network parameters. Therefore, the back-propagation algorithm can back calculate and optimize many parameters of LSTM. The cyclic learning algorithm can deal with some sequence problems, but it has serious long-term dependence problems after multi-stage propagation-the gradient tends to disappear or explode, and it is difficult to continue to optimize within the number of iterations in many cases. However, the introduction of gating mechanism in LSTM solves the gradient disappearance problem and performs well in sequence processing.

Bidirectional LSTM
The bidirectional long and short-term memory (Bi-LSTM) network is derived from the bidirectional cyclic neural network [19]. Its main feature is to increase the learning function of the neural network for future information, thereby overcoming the defect that the unidirectional LSTM network can only process historical information. The Bi-LSTM mainly splits the ordinary LSTM into two directions, but the two LSTMs are connected to the same output layer. Such a structure can provide complete upper and lower sequence information for the input sequence of the output layer. Using a Bi-LSTM network to model bus load forecasting is to input historical data into a forward LSTM network and a reverse LSTM network at the same time, so as to capture a complete time series global information. The Bi-LSTM neural network structure diagram is as follows in Figure 4. Because the bus load has certain regularity and planning, the joint grasp of historical and future information can better learn the characteristics and laws of the load curve.
The update formula of the back-to-forward cycle neural network layer is: The update formula of the looping neural network layer from front to back is: The two layers of recurrent neural networks are superimposed and input to the hidden layer: Among them, h 1 , t, and h 2 , t are the hidden units of the front pass layer and the back pass layer at time t, respectively; y t is the model output at time t; f (*), g(*) are optional activation functions; W h1,t , W h2 , t, W h1 , W h2 , U h1 , and U h2 are the weight matrices of the corresponding objects; and b h1 , b h2 , and b y are the bias terms of the corresponding objects.

Bayesian Optimization Theory Gaussian Regression Process
In the feasible region, uniformly select points that obey the multi-dimensional normal distribution as candidate solutions to establish a Gaussian regression model [20]: Among them, K is the covariance matrix, x is the input value, and y is the response output value.
Through the training set, the updated value y* is obtained according to the posterior formula: Among them, K * is the covariance of the training set, and K * * is the covariance of the newly added sample.
Update the Gaussian regression model according to the updated value: The Gaussian regression model considers the relationship between y N and y N+1 and establishes input and output functions to provide a search basis for parameter optimization.

Acquisition Function Process
On the basis of the Gaussian regression model, it is necessary to use the collection function to solve the optimal solution of the function. This paper selects the expectation improvement method to use the mathematical expectation to solve the optimal solution.
Among them, φ(x) is the probability density of the normal distribution, ϕ(x) is the standard normal distribution about x, µ is the mean value of the input value x, and σ is the variance of the input value x.
The expected value is used for parameter optimization, and the next optimal value is effectively sought within a certain range, so as to find the optimal parameter within the number of iterations [20].

Hyperparameter Optimization of LSTM
The hyperparameters of the LSTM neural network used for bus load prediction can be divided into two categories: structural hyperparameters and training hyperparameters. The structural hyperparameters mainly include the number of hidden neurons in the network, etc [21]. The number of hidden layer neurons determines the expressive ability of the network, but also determines whether the network is over-fitting and the network is time-consuming. Reasonable hidden layer neurons help improve the performance of the network's predictive ability. The use of the bidirectional long-term memory network improves the network's ability to learn historical data, but it also leads to slow network prediction and different effects. Choosing different curves to predict whether to use a Bi-LSTM can save prediction time while ensuring prediction accuracy [22].
The training hyperparameters of the LSTM neural network mainly include learning rate, L2 regularization parameters, etc. A suitable initial learning rate helps to significantly improve the iterative convergence speed and prediction accuracy of deep learning models. L2 regularization parameters are additional items of the network loss function and can prevent the over-fitting problem to a certain extent.
In summary, this paper will use Bayesian optimization algorithm to optimize and debug the number of hidden layer neurons of LSTM neural network, whether to use Bi-LSTM, initial learning rate, and L2 regularization parameters.

VMD-Bi-LSTM Combined Prediction Model
The power system bus load itself has fluctuating characteristics and is affected by distributed power dispatch and user-side behavior characteristics. Its curve has a certain degree of non-linear and non-stationary characteristics. The use of conventional learning forecasting methods to improve the forecasting accuracy is relatively limited. Considering the outstanding advantages of variational modal decomposition technology in sequence smoothing processing and the excellent performance of LSTM networks in time series data modeling, this paper proposes a VMD-Bi-LSTM combined forecasting model. The specific modeling process is as follows, as shown in the Figure 5. (1) In view of the non-stationary characteristics of the bus load sequence, the VMD method is used to decompose, and each IMF component and residual component are obtained; (2) Normalize each sub-sequence component separately, and divide the training sample and the test sample according to the same ratio; (3) Construct an LSTM neural network prediction model for each sub-sequence component, and use Bayesian optimization algorithm to optimize the hyperparameters of a single model to obtain the most suitable hyperparameter combination for decomposing the sequence and determine whether to use Bi-LSTM in sub-sequences; (4) Train the prediction model after hyperparameter optimization, use the trained prediction model to perform multi-step extension prediction, and superimpose the reconstruction to obtain the multi-step prediction value of bus load; (5) Compared with actual data, the multi-step prediction performance of the prediction model is evaluated by calculating error indicators.

Case Study
Here, we analyze the example of the method proposed in this paper. The example takes the bus load data of Canberra, Australia, from 16 January to 22 January 2016 as the data set, including 270 time steps (30 min as a time step). The first 244 time steps of the data are used as the training sequence and the last 36 time steps are used as the verification sequence.

Temporal Data Decomposition
The original bus load sequence is decomposed byVMD, and six groups of IMF components and one group of residual components are separated step by step. The decomposition results are shown in Figure 6. Decomposing the bus load sequence through variational modal decomposition shows that the natural modal function consists of multiple sub-sequences with small amplitude and stable frequency. Although the frequency of the remainder is relatively unstable, the amplitude is small, which affects the overall bus load. The forecast trend change of the load has less impact.

Hyperparameter Optinization of VMD-LSTM
On the basis of smoothing the original sequence, it is necessary to construct the LSTM network prediction model of the sub-sequence components, and to optimize the related structure hyperparameters and training hyperparameters. The hyperparameter results obtained by using Bayesian optimization algorithm in this paper are shown in the Table 1.

Model Evaluation Index
In order to visually analyze the predicted error value, the curve correlation coefficient is introduced to evaluate the shape difference between the predicted curve and the real curve, and the root mean square error (RMSE) is introduced to visually analyze the deviation between the observed value and the true value. The accuracy is analyzed, and the standard error is selected as the criterion for predicting the dispersion level.
In order to evaluate the forecasting effect of the proposed bus load forecasting method, the RMSE, NRMSE, standard error, and correlation coefficient are selected as the overall forecasting results evaluation index of the short-term bus load forecasting method.
where σ A and σ B are the standard deviation of dataset A and dataset B, respectively, and the standard deviation formula is as follows: where y i is the true value of bus load andŷ i is the predicted value of bus load. The correlation coefficient r is used to characterize the accuracy of the prediction curve and the calibration curve. The closer the correlation coefficient is to 1, the higher the prediction accuracy.

Model Evaluation Index
After optimizing the Bayesian parameters of each sub-sequence of the bus load, the long and short-term memory network is trained, and the next time step is predicted in Figure 7, and the single-step prediction results of each sub-sequence component are integrated into the historical monitoring data. The new input sequence of the single-step forecasting model can realize the multi-step rolling forecast of each component, which can realize the rolling forecast load value for a period of time in the future. The rolling prediction method is used to predict multiple time steps in the future, and the prediction results are shown in Figure 8.  The multi-step prediction results of bus load can be obtained by superimposing the predicted values of each subsequence, and the error analysis of each subsequence is carried out by using the prediction evaluation index. The error results are shown in Table 2. From the training prediction results and multi-step prediction results, it can be seen that the e RMSE of IMF6, which accounts for a large proportion of the original sequence of bus load, is 1.3635 and e STD is 0.8233 in the next 12 time steps, with small prediction error. With the increase in prediction time steps, the prediction e RMSE in the next 36 time steps will increase to 20.2677, e STD and e NRMSE will also increase accordingly, but the r will increase from 0.9521 to 0.992. The shape of the prediction curve will gradually stabilize and be closer to the shape of the original curve. With the increase in time steps, the prediction accuracy will decline, but the prediction curve will gradually remain stable; the shape gradually approaches the original curve, which has suitable prediction stability. The prediction e NRMSE of other subsequences is lower than 16, the r is higher than 0.98, the prediction error is small, and the prediction shape remains suitable. The prediction results of a single stable natural mode function subsequence meet expectations.
The decomposition remainder of bus load has many burrs and unstable frequency, so the prediction correlation coefficient is low. However, due to its small amplitude, its prediction error is also small, and its impact on the overall prediction result of bus load is relatively small.
The multi-step prediction results of each subsequence and remainder are superimposed to obtain the multi-step prediction curve of bus load, as shown in Figure 9. The prediction curve fits with the real curve and has accurate prediction results. In order to verify the prediction performance of the proposed method, different models in various cases are selected for comparative analysis. The SVM that uses radial basis function as kernel function for prediction is compared with LSTM to verify the advantages of LSTM in time series prediction. The VMD-Bayesian-BiLSTM model is compared with EEMD-Bayesian-BiLSTM model and EMD-Bayesian-BiLSTM to verify the advantages of VMD in short-term combination forecasting. The VMD-LSTM combined prediction model and LSTM model are selected for prediction to verify the prediction accuracy and stability of the combined prediction model. The VMD-Bayesian-LSTM model is compared with VMD-LSTM model, and the Bayesian-BiLSTM model is compared with the Bayesian-LSTM model and LSTM to verify the improvement effect of Bayesian optimization theory on LSTM prediction accuracy and the necessity to consider the applicability of BiLSTM in sequence prediction. The comparison results are shown in Table 3 and Figure 10.  The comparison of different models shows that the prediction accuracy of each time step of LSTM is greatly improved compared with SVM. After decomposing the bus load sequence, the e RMSE , r and other indices of VMD-LSTM are greatly improved on the basis of the LSTM model, which greatly improves the prediction accuracy and stability. The comparison of VMD-Bayesian-BiLSTM and EEMD-Bayesian-BiLSTM and the comparison of VMD-Bayesian-BiLSTM and EMD-Bayesian-BiLSTM shows that the prediction error of the prediction model considering VMD in multiple time steps is lower than that of EMD and EEMD. With the proposal of Bayesian optimization theory, the e RMSE of 36 time steps of combined prediction is reduced from 23.9219 to 14.9219, the e NRMSE is reduced from 0.0088 to 0.0055, the e STD is reduced from 20.9806 to 14.8022, and the results of LSTM are also greatly improved after Bayesian optimization. The results of the Bayesian-BiLSTM model are improved compared with those without considering Bi-LSTM. The comparison of various data shown in Figure 10 verifies that the VMD-LSTM combined prediction model based on Bayesian optimization proposed in this paper has more accurate prediction results and more stable multi-step prediction results. In order to verify the applicability of the model proposed in this paper, this paper randomly selects two groups of 270 time step data sets in the bus load data set in Canberra, Australia, in 2016, which are divided into data set 2 and data set 3 for verification. In order to eliminate the regional influence, three groups of 270 time step data sets are randomly selected from the 2014 Beijing bus load data set, namely data set 4, data set 5, and data set 6, and their errors are tested in Table 4.  It can be seen from Table 4 that through the test of multiple groups of randomly selected data in the same time step, the error value is always controlled in a low range, and the average error of six groups of data has suitable prediction results. This shows that the proposed method has appropriate prediction stability and adaptability.

Conclusions and Future Studies
Aiming at the current research hotspot in the field of deep learning, this paper studies the bus load forecasting, establishes the bus load forecasting method based on VMD-BiLSTM, and draws the following conclusions: (1) The VMD method is used to deal with the non-stationary characteristics of bus load series and reduce the interaction between different time scale information, which is conducive to further mining the characteristics of original series and improving the prediction performance of the model; (2) The cyclic network structure and gating mechanism of LSTM neural network are used to capture the temporal correlation of each sub sequence component, so as to track the change trend of bus load more effectively. Compared with other models, the VMD-LSTM combined prediction model has a significant improvement in multi-step prediction accuracy; (3) Bayesian optimization algorithm is used to optimize the super parameter combination of LSTM neural network to overcome the adverse effect of empirical selection on the improvement of model prediction performance; (4) Bayesian optimization considers the applicability of the sequence to the bidirectional neural network. While using the Bi-LSTM network to enhance the training ability, considering the applicability of the sequence, the prediction performance of the network has been optimized and improved.
This method has certain feasibility in the field of bus load forecasting, can be applied in the direction of energy consumption forecasting and power production planning, and is conducive to the planning and development of clean energy in the future and the sustainable development of energy.
The model proposed in this paper has achieved suitable results in short-term prediction, but it is uncertain whether it can ensure the prediction accuracy and stability in long-term prediction when the data support is sufficient, and whether it is adaptable in other areas. Due to time constraints, this paper does not compare it with ETS, ARIMA, and Thet to reflect the optimization performance. In the future, this proposed model can be implemented in different areas to validate its effectiveness and compared to alternative approaches and stronger baselines. Moreover, by taking a new variety of data input for sustainability studies, the model can be implemented for carbon emission forecasting. The author will study and optimize the excellent prediction methods in the field of machine learning in order to establish an accurate prediction model that can be applied in the field of sustainability.  Acknowledgments: This work was supported by State Grid Zhejiang Electric Power Co., Ltd. Science and technology project: Research on bus load forecasting method in spot market (5211TZ1900S4). Thanks for the information and data provided by power system company.

Conflicts of Interest:
The authors declare that there is no conflict of interests regarding the publication of this paper.

Nomenclature δ(t)
The Dirichlet function K The number of IMFs/the covariance matrix u k IMF ω k The center frequency of u k α The quadratic penalty factor λ (t) The Lagrangian multiplication operator c t The memory unit h t The hidden layer state x k The input vector sequence f t The calculation results of the forget gate i t The calculation results of the input gate o t The calculation results of the output gate W f The weight matrix of the forget gate W i The weight matrix of the input gate W o The weight matrix of the output gate b f The weight matrix of the forget gate b i The weight matrix of the input gate b o The weight matrix of the output gate o The dot product symbol of the matrix element

σ(x)
The Sigmoid activation function of each gate x The input value y The response output value y* The updated value through the training set K* The covariance of the training set K** The covariance of the newly added sample The probability density of the normal distribution