Research on Short-Term Load Forecasting Based on Optimized GRU Neural Network

Abstract: Accurate short-term load forecasting ensures the safe and stable operation of power grids, but nonlinear loads increase the complexity of forecasting. To solve the problem of mode aliasing in historical data and to fully explore the relationships among time-series characteristics in load data, this paper proposes a gated recurrent unit network model optimized by the sparrow search algorithm (SSA-GRU). First, complementary ensemble empirical mode decomposition (CEEMD) is used to decompose the original data into characteristic components. The combined SSA-GRU model then predicts each characteristic component, and the component predictions are combined to obtain the final result, completing the short-term load forecast. Taking the real data of a company as an example, this paper compares the combined CEEMD-SSA-GRU model with the EMD-SSA-GRU, SSA-GRU, and GRU models. Experimental results show that this model has a better prediction effect than the other models.


Introduction
With the rapid development of power systems, load forecasting has attracted great attention from power companies and consumers and has become an important direction of modern power system research. The periodicity, fluctuation, continuity, and randomness of power loads increase the complexity and difficulty of load forecasting.
Under a completely free power-market operation mode, the load forecasting problem affects the power dispatching of power companies and the production plans of power-consuming enterprises [1]. Among forecasting tasks, short-term load forecasting plays an important role in guiding and regulating the operation of power companies; accurate prediction results allow daily production plans to be arranged more reasonably. Short-term load forecasting (STLF) of the power system refers to forecasting the load over the next few hours to several days [2]. STLF is an important foundation for the reliable operation of modern power systems and an important link in energy management systems. Its results serve as an important reference for dispatching departments when determining daily, weekly, and monthly dispatching plans and when reasonably arranging unit start-stop, load distribution, and equipment maintenance [3]. With the continuous expansion of modern power systems, higher requirements are placed on STLF, and STLF technology is increasingly becoming a key technology in the power industry.
Short-term load forecasting methods can be divided into three categories: traditional forecasting techniques, improved traditional techniques, and artificial intelligence techniques. Traditional techniques include regression analysis [4], the least-squares method [5], and the exponential smoothing method [6]. Improved techniques include the time-series method [7], models based on autoregression and moving averages [8], support vector machines [9], etc. However, most traditional and improved traditional techniques are linear prediction models, while the relationship between load and other characteristic factors in load forecasting is complex and nonlinear, so these methods are not effective for power load forecasting [10].

GRU
In order to solve the problem that a feed-forward neural network cannot retain previous information, scholars proposed a feedback neural network, the recurrent neural network (RNN), which can transfer information between layers, form a memory of that information, and allow it to persist, giving the network a certain memory capacity. Its structure is shown in Figure 1, where t represents the time, X represents the input layer, S represents the hidden layer, and O represents the output layer. S and O are calculated as follows:

S_t = f(U X_t + W S_{t−1}) (1)
O_t = g(V S_t) (2)

where S_t in Formula (1) represents the hidden-layer value at time t, f(·) represents the activation function of the hidden layer, X_t represents the input vector at time t, U represents the parameter matrix, W represents the weight matrix, and S_{t−1} is the state of the hidden layer at the previous time. O_t in Equation (2) represents the output at time t, V represents the parameter matrix, and g(·) represents the activation function of the output layer. g(·) generally adopts the softmax function, and f(·) can be the sigmoid or tanh function.
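As a minimal illustration of Formulas (1) and (2), one RNN step can be sketched in plain Python. The scalar weights U, W, V are hypothetical toy values; a real network uses matrices:

```python
import math

def rnn_step(x_t, s_prev, U, W, V):
    """One recurrent step: S_t = f(U*x_t + W*S_{t-1}), O_t = g(V*S_t).

    f is tanh here; g is left as the identity for this scalar toy example.
    """
    s_t = math.tanh(U * x_t + W * s_prev)  # hidden state, Formula (1)
    o_t = V * s_t                          # output, Formula (2)
    return s_t, o_t

# With zero input and zero previous state, the hidden state stays at 0.
s, o = rnn_step(0.0, 0.0, U=0.5, W=0.9, V=1.0)
```

Feeding each s_t back in as s_prev for the next step is what gives the RNN its memory.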
Although the RNN solves the problem that a feed-forward neural network cannot remember information, it has shortcomings. It can only handle short-term dependence; long-term dependence is difficult to capture with an RNN, and its memory capacity is limited. Moreover, when the sequence is long, its learning and memory abilities decline, and the vanishing gradient problem appears.
In order to solve the problems of the RNN, a variant, the long short-term memory network (LSTM), was proposed. The LSTM not only resolves these problems but can also handle both short-term and long-term dependence, realizing long- and short-term memory. Its network structure is shown in Figure 2. As Figure 2 shows, the LSTM network structure is much more complex than that of the RNN: the LSTM introduces a cell state C_t to memorize information, together with a gating structure to maintain and control information, i.e., an input gate, a forget gate, and an output gate.

Although the LSTM solves the problem that RNNs cannot carry out long-term memory, its network structure is complex and its convergence speed is slow, which affects the training process and results in power load forecasting and complicates training. In order to solve these problems, a variant of the LSTM, the gated recurrent unit network (GRU), was proposed on the basis of the LSTM. It streamlines the LSTM and simplifies the network structure, and it is a widely used neural network at present. The GRU replaces the three gates of the LSTM with two gates, i.e., an update gate and a reset gate. The network structure is shown in Figure 3, where z_t denotes the update gate and r_t the reset gate. These two gates control the degree to which information is transferred. The inputs of both gates are the input x_t at the current time and the hidden state h_{t−1} at the previous time. The two gates are calculated as follows:

z_t = σ(W_z · [h_{t−1}, x_t]) (3)
r_t = σ(W_r · [h_{t−1}, x_t]) (4)

where x_t represents the input at time t, h_{t−1} represents the hidden state at the previous time, [·, ·] represents the concatenation of two vectors, W_z and W_r represent weight matrices, and σ(·) represents the sigmoid function. The GRU discards and memorizes input information through the two gate structures and then calculates the candidate hidden state h̃_t, as shown in Formula (5):

h̃_t = tanh(W_h · [r_t ⊙ h_{t−1}, x_t]) (5)

where tanh(·) represents the tanh activation function, W_h represents the weight matrix, and ⊙ represents the element-wise product. Finally, the state h_t at the current time is calculated through the update gate, as shown in Formula (6):

h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t (6)

According to the above formulas, the GRU stores and filters information through the two gates, retains important features through the gate functions, and captures dependencies through learning to obtain the best output value.
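A minimal sketch of one GRU step, implementing Equations (3) through (6) with scalar toy weights (each W below is a hypothetical pair acting on [h_{t−1}, x_t], mimicking the concatenation):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gru_step(x_t, h_prev, Wz, Wr, Wh):
    """One GRU step for a scalar hidden state."""
    z_t = sigmoid(Wz[0] * h_prev + Wz[1] * x_t)               # update gate, Eq. (3)
    r_t = sigmoid(Wr[0] * h_prev + Wr[1] * x_t)               # reset gate, Eq. (4)
    h_cand = math.tanh(Wh[0] * (r_t * h_prev) + Wh[1] * x_t)  # candidate state, Eq. (5)
    h_t = (1.0 - z_t) * h_prev + z_t * h_cand                 # new hidden state, Eq. (6)
    return h_t

# With all-zero weights: z = r = 0.5, the candidate is 0, so h_t = 0.5 * h_prev.
h = gru_step(0.3, 1.0, Wz=(0.0, 0.0), Wr=(0.0, 0.0), Wh=(0.0, 0.0))
```

The update gate z_t interpolates between keeping the old state and adopting the candidate, which is how the GRU retains or discards information.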
When the same effect is achieved, the training time of the GRU is shorter. Especially with large training data sets, training and prediction with the GRU perform better and save much time. Therefore, this paper selects a GRU neural network model for short-term power load forecasting to achieve a short training time and a good forecasting effect.
In the prediction process of the GRU model, the number of hidden-layer neural units, the learning rate, the mini-batch size, and the number of iterations need to be considered. These parameter values affect the model's fitting effect, training duration, generalization ability, and degree of convergence. After many experiments and observation of the model's loss value, a set of empirical parameters was obtained. With one hidden layer, 50 neurons, a learning rate of 0.005, a batch size of 50, and 100 iterations, the GRU prediction model achieves a balanced calculation efficiency and prediction effect.
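The empirical parameters above can be collected in a configuration block; the dictionary keys below are hypothetical names, since the paper does not specify an implementation framework:

```python
# Empirical GRU hyperparameters reported in the text.
GRU_HYPERPARAMS = {
    "hidden_layers": 1,      # one hidden GRU layer
    "hidden_units": 50,      # neurons per hidden layer
    "learning_rate": 0.005,  # optimizer step size
    "batch_size": 50,        # mini-batch size per training step
    "iterations": 100,       # training epochs/iterations
}
```

In a Keras-style framework this would map to, e.g., a single `GRU(50)` layer trained with these settings, though the exact API is an assumption here.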

Model Comparison
In this section, the GRU model and the BP model are trained, and their prediction performance is compared. The power load data used in this paper are the real power load data of an industrial user's factory. The training set consists of the user's real power load data from 1 January 2018 to 31 December 2020, with one sampling point collected at the same time node every day, for a total of 731 sampling points. The daily power consumption load data of January 2021 serve as the test set, with a total of 31 data points. The electrical equipment in this factory is not affected by the external weather and operates under constant temperature and humidity all year round. The original power load data, after processing of missing and abnormal values, are shown in Figure 4.
The normalized power load data are shown in Figure 5.
In this experiment, the GRU model and the BP model are trained on the training set. After training, the data for the next month are predicted. Finally, the prediction results of the two models are compared with the actual values using the error evaluation indicators above. The experimental results are shown in Table 1.
It can be seen from Figure 6 that the GRU model curve follows the test curve more closely and is smoother, whereas the BP model curve fluctuates more and has difficulty tracking local fluctuations in time. The main reason is that the BP neural network cannot remember and store information, while the GRU has a long-term memory function and can better retain previous data. To compare the prediction results of the two models more intuitively, Figure 7 shows the values and curves of the models' prediction errors. For the maximum relative error and the mean absolute error, the BP model gives 8.04% and 1.96%, respectively, while the GRU model gives 3.51% and 1.63%. Compared with the BP model, the GRU model improves the maximum relative error by 56.3% and the mean absolute error by 16.8%. However, the GRU neural network still has problems in the training process: training is slow, the model parameters are obtained from experience, and it easily falls into local optima, complicating training and increasing its difficulty.
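The error indicators used in this comparison can be sketched as follows; the function names are illustrative, and each function returns a fraction (multiply by 100 for the percentages quoted in the text):

```python
def max_relative_error(y_true, y_pred):
    """Largest |error| relative to the true value."""
    return max(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred))

def mean_absolute_error(y_true, y_pred):
    """MAE: average absolute deviation, in the units of the load data."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """MAPE: mean absolute percentage error, as a fraction."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """RMSE: root mean squared error."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5
```

These metrics are computed on the 31 test-set points for each model.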


Sparrow Search Algorithm
In order to make the GRU model find the optimal parameters automatically during training instead of selecting them manually from experience, the intelligent sparrow search algorithm (SSA) is used to optimize the parameters of the GRU model. The SSA is a swarm intelligence optimization algorithm inspired mainly by sparrows' foraging and anti-predation behavior. Assuming there are n sparrows in a search space, the population they compose can be expressed as:

X = [x_{1,1}, x_{1,2}, ..., x_{1,m}; x_{2,1}, x_{2,2}, ..., x_{2,m}; ...; x_{n,1}, x_{n,2}, ..., x_{n,m}]

where m represents the dimension of the variables to be optimized. The fitness values of the sparrow population can be expressed as:

F_X = [f(x_1); f(x_2); ...; f(x_n)]

where f represents the fitness value.
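A minimal sketch of initializing the n × m sparrow population and evaluating the fitness vector F_X; `init_population` and `evaluate` are hypothetical helper names, and the sphere objective is a stand-in for the real fitness function:

```python
import random

def init_population(n, m, lower, upper, seed=0):
    """Initialize n sparrows at random positions in an m-dimensional space."""
    rng = random.Random(seed)
    return [[rng.uniform(lower, upper) for _ in range(m)] for _ in range(n)]

def evaluate(population, objective):
    """Fitness vector F_X: one value f(x_i) per sparrow."""
    return [objective(x) for x in population]

pop = init_population(n=5, m=3, lower=-1.0, upper=1.0)
fitness = evaluate(pop, objective=lambda x: sum(v * v for v in x))  # toy sphere objective
```

For the SSA-GRU model, each dimension of a sparrow's position would encode one GRU hyperparameter.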

Update Discoverer Location
In the search process, discoverers with high fitness obtain food first, and at the same time they provide the followers with the area and direction of the food. The discoverer therefore searches over a wider scope with a stronger search ability. The location update is described as follows:

X_{i,k}^{t+1} = X_{i,k}^t · exp(−i / (α · iter_max)),  if R_2 < ST
X_{i,k}^{t+1} = X_{i,k}^t + Q · L,                     if R_2 ≥ ST   (12)

where t is the current iteration number; k = 1, 2, 3, ..., m; iter_max is the maximum number of iterations; X_{i,k} is the position of the ith sparrow in the kth dimension; α is a random number in (0, 1); R_2 is the warning value in (0, 1); ST is the safety value in (0.5, 1]; Q is a random number; and L represents a 1 × m matrix. From Equation (12), when R_2 < ST, the discoverer has not found predators around the current foraging environment; the search space is safe, and the discoverer can continue a more extensive search. When R_2 ≥ ST, predators are present; the discoverer quickly sends an alarm signal to the other sparrows, and all sparrows fly to other safe places to find food.
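A sketch of the discoverer update in Equation (12); `update_discoverer` is an illustrative name, and following common SSA implementations Q is assumed to be standard-normal and L an all-ones vector (so Q · L adds the same step to every dimension):

```python
import math
import random

def update_discoverer(x, i, iter_max, R2, ST, rng=random.Random(0)):
    """Discoverer position update, Equation (12), for one sparrow x
    (a list of coordinates); i is the sparrow's 1-based index."""
    alpha = rng.uniform(1e-9, 1.0)  # alpha drawn from (0, 1)
    if R2 < ST:
        # No predators detected: shrink the position exponentially (wide search).
        factor = math.exp(-i / (alpha * iter_max))
        return [xk * factor for xk in x]
    # Predators detected: jump away with a Gaussian step Q (L is all ones).
    Q = rng.gauss(0.0, 1.0)
    return [xk + Q for xk in x]

# In the safe branch the multiplier lies in (0, 1), so every coordinate shrinks.
shrunk = update_discoverer([2.0, -3.0], i=1, iter_max=100, R2=0.3, ST=0.8)
```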

Update Follower Position
When foraging, the behavior of the discoverer is watched by some followers. If the discoverer finds better food, a follower quickly detects it and immediately competes for the food. The location update is described as follows:

X_i^{t+1} = Q · exp((X_w^t − X_i^t) / i²),               if i > n/2
X_i^{t+1} = X_P^{t+1} + |X_i^t − X_P^{t+1}| · A⁺ · L,    otherwise   (13)

where X_P and X_w are the current best and worst positions of the discoverer, A is a 1 × m matrix, and A⁺ = A^T (A A^T)^{−1}. When i > n/2, the ith follower has not found food and needs to continue to look for food elsewhere.
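A sketch of the follower update in Equation (13); `update_follower` is an illustrative name, and as in common SSA implementations A is assumed to have entries drawn from {+1, −1}, so the A⁺ · L term reduces to a random-sign step of magnitude 1/m per dimension:

```python
import math
import random

def update_follower(x, i, n, x_best, x_worst, rng=random.Random(0)):
    """Follower position update, Equation (13); i is the 1-based rank."""
    m = len(x)
    if i > n / 2:
        # Starving follower: fly elsewhere, scaled by a Gaussian factor Q.
        Q = rng.gauss(0.0, 1.0)
        return [Q * math.exp((xw - xk) / (i * i)) for xk, xw in zip(x, x_worst)]
    # Otherwise: move near the best discoverer position X_P with a 1/m-sized step.
    sign = rng.choice([-1.0, 1.0])
    return [xp + abs(xk - xp) * sign / m for xk, xp in zip(x, x_best)]

# With x_best at the origin and |x - x_best| = 1 in each of m = 2 dimensions,
# the follower lands at distance 1/2 from the best position per dimension.
step = update_follower([1.0, 1.0], i=1, n=4, x_best=[0.0, 0.0], x_worst=[5.0, 5.0])
```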

Update the Guard Position
For convenience of expression, we call the sparrows that sense danger while without food vigilantes. In the simulation, vigilantes account for 10% to 20% of the population. The location update is described as follows:

X_i^{t+1} = X_b^t + β · |X_i^t − X_b^t|,                        if f_i > f_b
X_i^{t+1} = X_i^t + K · (|X_i^t − X_w^t| / ((f_i − f_w) + ε)),  if f_i = f_b   (14)

where X_b is the current global optimal position; β and K are step-control parameters; f_i represents the fitness value of the current sparrow; f_b and f_w represent the current global best and worst fitness values, respectively; and ε is a small constant that avoids division by zero. From Equation (14), if f_i > f_b, the vigilante is at the edge of the population and is easily attacked by predators. If f_i = f_b, the vigilante is at the center of the population and has realized the threat; to avoid being attacked by the predator, it must move closer to other sparrows to reduce the risk.
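A sketch of the vigilante update; `update_vigilante` is an illustrative name, with β drawn from a standard normal and K from [−1, 1] as in common SSA implementations:

```python
import random

def update_vigilante(x, x_best, x_worst, f_i, f_b, f_w,
                     eps=1e-50, rng=random.Random(0)):
    """Vigilante position update for one sparrow x."""
    beta = rng.gauss(0.0, 1.0)   # step-control parameter
    K = rng.uniform(-1.0, 1.0)   # step-control parameter
    if f_i > f_b:
        # At the edge of the population: move toward the global best X_b.
        return [xb + beta * abs(xk - xb) for xk, xb in zip(x, x_best)]
    # f_i == f_b: at the center, move relative to the worst position.
    return [xk + K * abs(xk - xw) / ((f_i - f_w) + eps)
            for xk, xw in zip(x, x_worst)]

# A sparrow already at the global best position stays there (|x - X_b| = 0).
unchanged = update_vigilante([1.0], x_best=[1.0], x_worst=[0.0],
                             f_i=2.0, f_b=1.0, f_w=3.0)
```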
According to the design of sparrow search algorithm, the parameter optimization process of SSA is shown in Figure 8.

Comparison of Optimization Algorithms
In order to test the optimization ability of the sparrow search algorithm, particle swarm optimization (PSO), the genetic algorithm (GA), and the artificial bee colony algorithm (ABC) are introduced for experimental comparison. We test and compare the fitness values of the four algorithms on the Griewank multi-peak test function. The dimension of the Griewank test function is set to 30 and the search range to [−600, 600]. The maximum number of iterations of each algorithm is set to 1000 and the population size to 100. The parameter settings of the four optimization algorithms are shown in Table 2:
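The Griewank benchmark used here can be implemented directly from its standard definition:

```python
import math

def griewank(x):
    """Griewank function: f(x) = 1 + sum(x_i^2)/4000 - prod(cos(x_i / sqrt(i)))."""
    s = sum(v * v for v in x) / 4000.0
    p = 1.0
    for i, v in enumerate(x, start=1):
        p *= math.cos(v / math.sqrt(i))
    return 1.0 + s - p

# The global minimum is f(0, ..., 0) = 0 in any dimension, including the
# 30-dimensional setting used in this comparison.
value = griewank([0.0] * 30)
```

Its many regularly spaced local minima are what make it a useful stress test for an optimizer's ability to escape local optima.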

After the test function is run, the fitness value of each algorithm is shown in Figure 9. Among the four algorithms, the SSA has the fastest convergence speed and the highest convergence accuracy, obtaining the best fitness value most quickly during iteration; its optimization ability is the best. It can be concluded that the SSA has the advantages of high search accuracy, fast convergence, and strong stability. Therefore, the SSA is used to optimize the neural network parameters in this paper.

CEEMD
Empirical mode decomposition (EMD) is an adaptive data-driven method for signal analysis. It analyzes a signal based on the time-scale characteristics of the data themselves and decomposes the original signal into a series of intrinsic mode functions (IMFs) and a residual component. However, the EMD method suffers from serious mode aliasing. In 2010, Yeh et al. proposed the complementary ensemble empirical mode decomposition algorithm (CEEMD), an improved version of EMD that can resolve this phenomenon.
CEEMD changes the extreme points of the original signal by adding pairs of white-noise signals with opposite signs, and cancels the noise in the signal through repeated averaging. The decomposition process is shown in Figure 10.

The decomposition proceeds as follows:
(1) First, n groups of white noise with opposite signs are added to the original signal S(t) to obtain pairs of new signals, as shown in Equation (15):

M_i1(t) = S(t) + N_i(t)
M_i2(t) = S(t) − N_i(t)   (15)

where N_i(t) represents the added white noise, and M_i1(t) and M_i2(t) denote the signals obtained by adding the positive and negative white noise, respectively.
(2) Then, EMD decomposition is performed on the 2n signals obtained, yielding a group of IMF components for each signal; the jth IMF component of the ith signal is recorded as C_ij, and the last IMF component is taken as the residual component RES.
(3) Finally, the 2n groups of IMF components obtained are averaged, and the components obtained by CEEMD decomposition of the original signal S(t) are expressed as:

IMF_j = (1 / 2n) · Σ_{i=1}^{2n} C_ij   (16)

where IMF_j represents the jth IMF component obtained after decomposition.
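Steps (1) and (3) can be sketched directly; `ceemd_pairs` and `average_components` are illustrative names, and the EMD step (2) is omitted since the pairing and averaging alone already show why the opposite-sign noise cancels:

```python
def ceemd_pairs(signal, noises):
    """Step (1): for each white-noise sequence N_i, build the pair
    M_i1 = S + N_i and M_i2 = S - N_i, giving 2n noise-added signals."""
    pairs = []
    for noise in noises:
        pairs.append([s + n for s, n in zip(signal, noise)])
        pairs.append([s - n for s, n in zip(signal, noise)])
    return pairs

def average_components(components):
    """Step (3): average the component sets elementwise across the 2n signals."""
    count = len(components)
    return [sum(vals) / count for vals in zip(*components)]

# Because each noise sequence appears with both signs, averaging the raw
# pairs recovers the original signal exactly; after EMD the same averaging
# cancels the injected noise in each IMF.
signal = [1.0, 2.0, 3.0]
pairs = ceemd_pairs(signal, noises=[[0.3, -0.1, 0.2], [0.05, 0.4, -0.3]])
recovered = average_components(pairs)
```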


Introduction to Combination Model
The essence of CEEMD-SSA-GRU model prediction is to add the complementary ensemble empirical mode decomposition algorithm CEEMD on top of the SSA-GRU prediction model. Instead of training on and predicting the training-set load data directly, the CEEMD algorithm first decomposes the training-set load data into several subsequences, which are then predicted by the SSA-GRU prediction model.
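The combination scheme can be sketched as a decompose-predict-sum pipeline; `combined_forecast` is an illustrative name, and the `predict` callable stands in for a trained SSA-GRU model applied per component:

```python
def combined_forecast(components, predict):
    """Predict each decomposed component separately, then sum the
    component forecasts pointwise into the final load forecast."""
    forecasts = [predict(c) for c in components]
    return [sum(vals) for vals in zip(*forecasts)]

# Toy check with an identity "predictor": the summed component forecasts
# reconstruct the pointwise sum of the components.
components = [[1.0, 2.0], [0.5, -0.5], [0.1, 0.1]]
total = combined_forecast(components, predict=lambda c: list(c))
```

Summing the per-component predictions works because CEEMD decomposition is additive: the IMFs plus the residual reconstruct the original series.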

Model Example Analysis
In CEEMD decomposition, a signal-to-noise ratio Nstd of 0.01 to 0.5, a number of white-noise additions NR of 50 to 300, and a maximum iteration number Maxiter of no more than 5000 usually give a good decomposition effect. After several decomposition tests, the final parameter values in this paper are set to Nstd = 0.2, NR = 200, and Maxiter = 5000. This section selects the preprocessed training-set load data of Section 2.2.2 and decomposes them with the CEEMD and EMD algorithms. The decomposition results are shown in Figures 11 and 12.

The prediction results based on the CEEMD prediction model and the EMD prediction model are then compared and analyzed.
The parameter settings of the CEEMD algorithm are the same as above. The ranges of the parameters n, ε, v, and β in the SSA-GRU model are initialized to [10, 200], [0.001, 0.01], [50, 256], and [100, 1000], respectively. The SSA initialization parameters are set according to Table 3. After constructing the CEEMD-SSA-GRU and EMD-SSA-GRU models, both are simulated to predict the data of the next month. The prediction results of the two models are shown in Table 4, and the result curves are shown in Figures 13 and 14. The curve of the CEEMD-SSA-GRU model fits the real curve better than that of the EMD-SSA-GRU model, and more of its extreme points are close to the real values. To compare the predictions of the two models more intuitively, the prediction errors of the two models are calculated; their comparison is shown in Figure 15. For the maximum relative error and the mean absolute error, the EMD-SSA-GRU model gives 2.038% and 0.80%, respectively, while the CEEMD-SSA-GRU model gives 1.98% and 0.64%. By comparison, the CEEMD-SSA-GRU model improves the maximum relative error by 16.8% and the mean absolute error by 20.0%. This direct error analysis shows that the CEEMD-SSA-GRU model has higher prediction accuracy.
Each error evaluation index is calculated for the prediction results of the two models, as shown in Table 5. Compared with the EMD-SSA-GRU model, the CEEMD-SSA-GRU prediction model improves the MAPE, MAE, and RMSE by 20.0%, 20.1%, and 19.5%, respectively.

Results
We compare the prediction effect of the CEEMD-SSA-GRU model with those of the single GRU model, the SSA-optimized GRU model, and the EMD-SSA-GRU model. The comparison of the prediction curves of the four models is shown in Figure 16, and the comparison of prediction errors is shown in Figure 17.
According to the error evaluation index formulas, the comparison of the error evaluation indexes of the four models is shown in Table 6.

Conclusions
The fitting degree of the prediction curves, from high to low, is: the CEEMD-SSA-GRU model, the EMD-SSA-GRU model, the SSA-GRU model, and the GRU model; the predicted values of the CEEMD-SSA-GRU model are closest to the extreme points of the real curve. It can be seen from Table 6 that for MAPE, the CEEMD-SSA-GRU model is 60.7% lower than the GRU model, 39.0% lower than the SSA-GRU model, and 20.0% lower than the EMD-SSA-GRU model. For MAE, the CEEMD-SSA-GRU model is 60.8%, 39.1%, and 20.1% lower, respectively. For RMSE, the CEEMD-SSA-GRU model is 59.2%, 38.5%, and 19.5% lower, respectively.
The prediction accuracy of the CEEMD-SSA-GRU model reaches 99.36%, and its prediction results are the most accurate. Its prediction accuracy is clearly better than that of the other three models, and its curve fits the real curve most closely. Therefore, the CEEMD-SSA-GRU model has advantages in short-term power load forecasting and can better provide reliable forecasting trends for industrial users.

Data Availability Statement: The load forecasting data used to support the results of this study have not been provided because they are private data of enterprises.

Conflicts of Interest:
The authors declare that there are no conflicts of interest regarding the publication of this paper.