Ultra-Short-Term Load Demand Forecast Model Framework Based on Deep Learning

Abstract: Ultra-short-term load demand forecasting is significant to the rapid response and real-time dispatching of the power demand side. Considering the many random factors that affect the load, this paper combines convolution, long short-term memory (LSTM), and gated recurrent unit (GRU) algorithms to propose an ultra-short-term load forecasting model based on deep learning. Firstly, more than 100,000 pieces of historical load and meteorological data from Beijing in the three years from 2016 to 2018 were collected, and the meteorological data were divided into 18 types considering the actual meteorological characteristics of Beijing. Secondly, after the standardized processing of the time-series samples, a convolution filter was used to extract higher-order sample features and reduce the number of training parameters. On this basis, the LSTM layer and GRU layer were used for modeling based on time series. A dropout layer was introduced after each layer to reduce the risk of overfitting. Finally, load prediction results were output by a dense layer. In the model training process, the mean square error (MSE) was used as the objective optimization function to train the deep learning model and find the optimal hyperparameters. In addition, based on the average training time, training error, and prediction error, this paper verifies the effectiveness and practicability of the proposed load prediction model by comparing it with four other models: GRU, LSTM, Conv-GRU, and Conv-LSTM.


Introduction
At present, the power system reform in China is underway, and the spot market in pilot provinces such as Guangdong and Zhejiang will be implemented [1]. Since the electricity spot market has the characteristics of complex trading varieties, high trading frequency, and fluctuating price, the forecasting level of ultra-short-term load is significant. It can help power market members make trading decisions in the energy market, capacity market, auxiliary service market, and demand-side response market [2]. Additionally, ultra-short-term load forecasting is beneficial to arrange the operation mode of a power network and the maintenance plan of a unit reasonably, and can improve the economic and social benefits of a power system.
Load forecasting methods are divided into two categories: classical statistical forecasting technologies and intelligent forecasting technologies. Classical load forecasting methods mainly include the exponential sliding average [3], linear regression [4], auto-regressive integrated moving average [5,6], the dynamic regression method [7], and the generalized auto-regressive conditional heteroskedastic approach [8]. The prediction model based on statistics has a relatively simple structure and a clear prediction principle, but its prediction accuracy is low, and it is often only applicable to cases with a small amount of data. Based on machine learning theory, an intelligent forecasting model can fit the complex nonlinear relationship between the load and its influencing factors.

Convolutional Neural Network
Different dimensions of convolution filters are used to process different types of data. One-dimensional convolution is often used in sequence models, such as natural language processing; two-dimensional convolution is applied in the field of computer vision and image processing; and three-dimensional convolution is suitable for the medical and video-processing fields. The deep learning model framework constructed in this paper uses one-dimensional convolution to process time series data related to electrical load.
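As a minimal illustration of how a one-dimensional convolution slides a filter along a load series (a toy sketch with a hand-picked smoothing kernel, not the paper's learned filters):

```python
import numpy as np

def conv1d(x, kernel):
    """Valid 1-D convolution (cross-correlation) of a series with a filter."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

# toy load series and a simple smoothing filter (illustrative values only)
load = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
smooth = conv1d(load, np.array([0.25, 0.5, 0.25]))
print(smooth)  # [2. 3. 4.]
```

In the model itself, the filter weights are learned during training rather than fixed in advance.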

Long Short-Term Memory
The LSTM neural network is a special recurrent neural network (RNN) that introduces weighted connections with memory and feedback functions. Compared with the feedforward neural network, LSTM can avoid gradient explosion and gradient disappearance, so it can achieve continuous learning over longer time series [42]. The LSTM hidden layer structure is shown in Figure 2. The core of the LSTM is the cell state that stores information, three gate structures with different functions [43] (the input gate, forget gate, and output gate), and memory cells of the same shape as the hidden state. The LSTM uses two gates to control the content of the unit state C: the forget gate determines how much of the previous unit state C_{t−1} is retained at the current moment, and the input gate determines how much of the current network input X_t is saved to the unit state. The LSTM uses the output gate to control how much of the unit state C_t contributes to the current output value H_t.
Input gate:
$$I_t = \sigma(X_t W_{xi} + H_{t-1} W_{hi} + b_i)$$
Forget gate:
$$F_t = \sigma(X_t W_{xf} + H_{t-1} W_{hf} + b_f)$$
Output gate:
$$O_t = \sigma(X_t W_{xo} + H_{t-1} W_{ho} + b_o)$$
Calculation of candidate memory cells:
$$\tilde{C}_t = \tanh(X_t W_{xc} + H_{t-1} W_{hc} + b_c)$$
Calculation of memory cells:
$$C_t = F_t \odot C_{t-1} + I_t \odot \tilde{C}_t$$
The calculation of the hidden state:
$$H_t = O_t \odot \tanh(C_t)$$
where W_xi, W_xf, W_xo and W_hi, W_hf, W_ho are the weight parameters; b_i, b_f, b_o are the bias parameters; W_xc, W_hc, and b_c are the parameters of the candidate memory cell; H_{t−1} is the output value of the network layer at the previous moment; X_t is the current input value; and I_t, F_t, O_t are the gate structures that control whether the memory unit needs to be updated, whether it needs to be set to 0, and whether it needs to be reflected in the activation vector.
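The LSTM gate equations can be sketched numerically as follows; the hidden size, random weights, and series length here are illustrative only, not the paper's trained parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following the gate equations; p holds W_x*, W_h*, b_*."""
    i_t = sigmoid(x_t @ p["W_xi"] + h_prev @ p["W_hi"] + p["b_i"])      # input gate
    f_t = sigmoid(x_t @ p["W_xf"] + h_prev @ p["W_hf"] + p["b_f"])      # forget gate
    o_t = sigmoid(x_t @ p["W_xo"] + h_prev @ p["W_ho"] + p["b_o"])      # output gate
    c_tilde = np.tanh(x_t @ p["W_xc"] + h_prev @ p["W_hc"] + p["b_c"])  # candidate cell
    c_t = f_t * c_prev + i_t * c_tilde   # new cell state
    h_t = o_t * np.tanh(c_t)             # new hidden state
    return h_t, c_t

# toy dimensions: 4 input features (as in the paper), 8 hidden units (assumed)
rng = np.random.default_rng(0)
d_in, d_h = 4, 8
p = {name: rng.standard_normal(shape) * 0.1
     for name, shape in [
         ("W_xi", (d_in, d_h)), ("W_hi", (d_h, d_h)), ("b_i", (d_h,)),
         ("W_xf", (d_in, d_h)), ("W_hf", (d_h, d_h)), ("b_f", (d_h,)),
         ("W_xo", (d_in, d_h)), ("W_ho", (d_h, d_h)), ("b_o", (d_h,)),
         ("W_xc", (d_in, d_h)), ("W_hc", (d_h, d_h)), ("b_c", (d_h,))]}
h = c = np.zeros(d_h)
for x_t in rng.standard_normal((10, d_in)):  # run over a 10-step toy series
    h, c = lstm_step(x_t, h, c, p)
print(h.shape)  # (8,)
```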

Gate Recurrent Unit
GRU is another kind of recurrent neural network (RNN), and its actual performance is similar to that of LSTM in many cases. GRU was also proposed to solve problems such as vanishing gradients in long-term memory and back-propagation. Compared with LSTM, GRU can achieve comparable results and is easier to train, which can greatly improve training efficiency [44]. Therefore, GRU tends to be used in many cases.
As shown in Figure 3, the structure of the GRU input and output is similar to that of a traditional RNN. The GRU uses the update gate and the reset gate to update and reset the information. As shown in Equations (1) and (2), the structure is similar to that of the LSTM gate.

Update gate and reset gate:
$$Z_t = \sigma(X_t W_{xz} + H_{t-1} W_{hz} + b_z)$$
$$R_t = \sigma(X_t W_{xr} + H_{t-1} W_{hr} + b_r)$$
Calculation of candidate memory cells:
$$\tilde{H}_t = \tanh(X_t W_{xh} + (R_t \odot H_{t-1}) W_{hh} + b_h)$$
The calculation of the hidden state:
$$H_t = Z_t \odot H_{t-1} + (1 - Z_t) \odot \tilde{H}_t$$
where W_xz, W_xr, W_xh and W_hz, W_hr, W_hh are the weight parameters; b_z, b_r, b_h are the bias parameters; H_{t−1} is the output value of the network layer at the previous moment; X_t is the current input value; and Z_t, R_t are the gate structures that control whether the memory unit needs to be updated, whether it needs to be set to 0, and whether it needs to be reflected in the activation vector.
Compared with LSTM, the GRU has one fewer gate inside and fewer parameters, but it can achieve the same function as LSTM, so the GRU is sometimes more practical. Its ability to learn time series is also strong [45].
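The GRU update can be sketched in the same style as the LSTM step; again, the hidden size and random weights are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step: update gate Z, reset gate R, candidate state H-tilde."""
    z_t = sigmoid(x_t @ p["W_xz"] + h_prev @ p["W_hz"] + p["b_z"])  # update gate
    r_t = sigmoid(x_t @ p["W_xr"] + h_prev @ p["W_hr"] + p["b_r"])  # reset gate
    h_tilde = np.tanh(x_t @ p["W_xh"] + (r_t * h_prev) @ p["W_hh"] + p["b_h"])
    # new hidden state: convex combination of the old state and the candidate
    return z_t * h_prev + (1.0 - z_t) * h_tilde

rng = np.random.default_rng(1)
d_in, d_h = 4, 8  # 4 input features as in the paper; hidden size assumed
p = {name: rng.standard_normal(shape) * 0.1
     for name, shape in [
         ("W_xz", (d_in, d_h)), ("W_hz", (d_h, d_h)), ("b_z", (d_h,)),
         ("W_xr", (d_in, d_h)), ("W_hr", (d_h, d_h)), ("b_r", (d_h,)),
         ("W_xh", (d_in, d_h)), ("W_hh", (d_h, d_h)), ("b_h", (d_h,))]}
h = np.zeros(d_h)
for x_t in rng.standard_normal((10, d_in)):
    h = gru_step(x_t, h, p)
print(h.shape)  # (8,)
```

Note the parameter count: three weight pairs instead of the LSTM's four, which is why GRU trains faster.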

Data Description
This paper collected three years of load data for Beijing from 2016 to 2018 (the sampling interval is 15 min, for a total of 105,163 data points) and meteorological data (including temperature and weather condition descriptions) as experimental samples. Among them, the temperature value for each 15-min interval is generated from the daily maximum and minimum temperatures according to an arithmetic relationship. The network training was carried out under the TensorFlow deep learning framework [46], and the Adam optimization algorithm [47] was used for optimization. The computer used in the experiment was configured with a 2.2 GHz Intel Core i7 processor and 16 GB of 1600 MHz DDR3 memory.

Feature Engineering
Traditional machine learning methods, such as SVM and shallow neural networks, rely on the experience of the relevant staff to manually construct features when building models, while deep neural networks perform end-to-end training that automatically extracts features from the sample data and can greatly improve work efficiency. This paper combines deep convolution, LSTM, and GRU to simplify the construction of sample features. Because the deep neural network can capture the general periodicity of features, this paper no longer selects the forecast day type (workday or weekend) as an input feature. The high precision of the experimental results indicates that the combination of convolution, LSTM, and GRU fully extracts the features in the sample data.
According to the collected raw data, the input and output of the model constructed in this paper are shown in Table 1.
Much of the literature has selected only temperature as the meteorological factor affecting load and has not considered weather conditions. However, in actual situations, the impact of weather conditions on the daily load is very significant. Especially in areas such as Beijing, extreme weather such as haze has a great impact on the load, so it is not sensible to consider the impact of temperature alone. Therefore, the characteristics of historical weather conditions (such as fog, clouds, etc.) are also considered. There are 18 types, and the text is digitized using category features: the information is mapped into a vector, and the conversion result is shown in Table 2.
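The category-to-vector conversion can be sketched as follows; the category names and code assignments below are hypothetical examples, not the paper's actual Table 2 mapping:

```python
import numpy as np

# hypothetical subset of the 18 weather categories; the paper's actual
# assignments are given in its Table 2
WEATHER_CODES = {"sunny": 0, "cloudy": 1, "overcast": 2,
                 "light rain": 3, "fog": 4, "haze": 5}
N_CATEGORIES = 18

def encode_weather(condition):
    """Map a textual weather description to a one-hot vector of length 18."""
    vec = np.zeros(N_CATEGORIES)
    vec[WEATHER_CODES[condition]] = 1.0
    return vec

v = encode_weather("haze")
print(v[5])  # 1.0
```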

Data Preprocessing
In the data preprocessing stage, in order to eliminate the influence of different physical dimensions, the original data need to be standardized. This paper uses the Z-score method to standardize all sample data. The formula is as follows:
$$\hat{x}^{(i)} = \frac{x^{(i)} - \mu}{\sigma}$$
where x̂(i) represents the normalized data value, and μ and σ are the mean and standard deviation of the original samples, respectively. Standardization puts all features on the same basic unit of measure for modeling and calculation, which suits the probabilistic calculations of neural network training and prediction; for example, the value of the Sigmoid function lies between 0 and 1, as does the output of the last node. The weather condition assignment refers to the mapping results of Table 2. The first five rows of the data preprocessing result are shown in Table 3.
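The Z-score standardization above amounts to the following; the load values are toy numbers for illustration:

```python
import numpy as np

def zscore(x):
    """Standardize a sample vector to zero mean and unit standard deviation."""
    mu, sigma = x.mean(), x.std()
    return (x - mu) / sigma

# toy load series; real samples come from the preprocessed dataset
load = np.array([480.0, 520.0, 505.0, 470.0, 525.0])
load_std = zscore(load)
print(load_std.mean(), load_std.std())  # approximately 0.0 and 1.0
```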

Deep Learning Network Prediction Framework
The deep learning framework constructed in this paper consists of two convolutional layers, one LSTM layer and one GRU layer.
As Figure 4 shows, firstly, the historical meteorological data and the load data are preprocessed and combined, and then the overall time series is sampled. Then, the convolution filter is used to extract higher-order sample features and reduce the number of training parameters. The ReLU function [21] is used as the activation function. Next, the LSTM layer or GRU layer is used for time series-based modeling, and a dropout layer is introduced after each layer to reduce the risk of overfitting. Finally, the load prediction result is output by a dense layer. The overall construction process of this deep learning model is as follows:
Step 1: Data preprocessing. The input has 4 characteristics at a single moment (see Table 1), with a total of 105,163 training samples. Time step and batch size are adjustable hyperparameters, so the input data are stored in a 3-dimensional tensor (batch size × time step × 4).
Step 2: Model training. Eighty percent of the sample data is set as the training set and 20% as the test set; the processed training set data is then input into the deep learning model for training. The model outputs the next four consecutive 15-min points, i.e., one hour of load forecast.
Step 3: Adjust the model hyperparameters.
Continue to optimize the model and compare the accuracy of models with different hyperparameters.
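The layer stack described above can be sketched in Keras as below. This is a sketch under stated assumptions, not the paper's exact configuration: the filter counts, kernel size, time step, and dropout rate are illustrative values, while the 50 hidden units per recurrent layer and the 4-point output follow the paper:

```python
from tensorflow.keras import layers, models

TIME_STEP, N_FEATURES, HORIZON = 96, 4, 4  # time step is an assumed value

model = models.Sequential([
    # convolutional layers extract higher-order features from the series
    layers.Conv1D(32, kernel_size=4, activation="relu",
                  input_shape=(TIME_STEP, N_FEATURES)),
    layers.Dropout(0.2),
    layers.Conv1D(32, kernel_size=4, activation="relu"),
    layers.Dropout(0.2),
    # recurrent layers perform the time series-based modeling
    layers.GRU(50, return_sequences=True),
    layers.Dropout(0.2),
    layers.LSTM(50),
    layers.Dropout(0.2),
    # dense layer outputs four consecutive 15-min points (one hour)
    layers.Dense(HORIZON),
])
model.compile(optimizer="adam", loss="mse")
print(model.output_shape)  # (None, 4)
```

With the data stored as a (batch size × time step × 4) tensor, `model.fit` on the 80% training split and `model.predict` on the test split implement Steps 2 and 3.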


Hyperparameters of Deep Learning Model
In order to obtain the optimal structure of the above deep learning model, this paper uses the vertical comparison method to adjust the number of hidden layer nodes, the time step, and the batch size of the improved RNN. When analyzing the influence of one of the parameters on the prediction result, the remaining parameters are fixed. The parameters selected throughout the experimental process are shown in Table 4. In this paper, the mean square error (MSE) is used for error evaluation. The expression is as follows:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
MSE is a convenient way to measure the "average error" and can evaluate the degree of variation of the data; the smaller the MSE, the better the prediction model describes the experimental data. Here, y_i represents the actual load value, ŷ_i represents the load forecast value, and n represents the number of load forecast points. The value of n in this deep learning model is 4.
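The MSE criterion amounts to the following; the load values are toy numbers for illustration:

```python
import numpy as np

def mse(y, y_hat):
    """Mean square error over the n forecast points."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean((y - y_hat) ** 2)

# n = 4 forecast points, matching the model's one-hour output
actual   = [620.0, 615.0, 608.0, 600.0]
forecast = [618.0, 617.0, 605.0, 604.0]
print(mse(actual, forecast))  # 8.25
```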
According to Figure 5, the number of epochs in the training process is 5, and each training run basically converges by the second epoch. According to the trend in the figure, the overall error of the model is decreasing, and the error is already within an acceptable range.
The final experimental scenario showed a tendency toward overfitting, so the model training was stopped, and the optimal model parameters were obtained as shown in Table 5.

Evaluation Index
In order to test the prediction effect of the model, it is necessary to select appropriate evaluation criteria. This paper uses the coefficient of determination, denoted by R², for evaluation. The expression is as follows:
$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$
R² is generally the best measure of linear regression and usually indicates the quality of the model. R² ranges between 0 and 1; the closer it is to 1, the higher the goodness of fit. Here, y_i represents the actual load value, ŷ_i represents the load forecast value, ȳ represents the average of the actual load, and n represents the number of load forecast points.
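The coefficient of determination can be computed directly from its definition; the load values below are toy numbers for illustration:

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination R^2."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    ss_res = np.sum((y - y_hat) ** 2)      # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

actual   = [620.0, 615.0, 608.0, 600.0]
forecast = [618.0, 617.0, 605.0, 604.0]
print(r_squared(actual, forecast))
```

A perfect forecast gives R² = 1, while predicting the mean of the actual load for every point gives R² = 0.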

Results
In order to verify the superiority of the proposed model, this section describes the model training process in detail. The model proposed in this paper is compared with four other deep learning models, whose details are as follows (Model 5 is the abbreviation of the model proposed in this paper):
Model 1 (GRU): The preprocessed data is input directly to the GRU layer, without a convolution filter layer. The GRU hidden layer has 50 units;
Model 2 (LSTM): The preprocessed data is input directly to the LSTM layer, without a convolutional layer for filtering. The LSTM hidden layer has 50 units;
Model 3 (Conv-LSTM): The preprocessed data is first input to the convolutional layer for filtering, and then two LSTM layers are used for prediction. The kernel size in the Conv layer is 4 × 4. The LSTM hidden layer has 50 units;
Model 4 (Conv-GRU): The preprocessed data is first input to the convolutional layer for filtering, and then two GRU layers are used for prediction. The kernel size in the Conv layer is 4 × 4. The GRU hidden layer has 50 units;
Model 5 (Conv-GRU-LSTM): The preprocessed data is first input to the convolutional layer for filtering, and then a GRU layer and an LSTM layer are used for prediction. The kernel size in the Conv layer is 4 × 4. The GRU and LSTM hidden layers have 50 units each.

Training Process Analysis
In order to reflect the superiority of the proposed deep learning framework, two deep learning models without a convolutional layer are introduced for comparison with the three models constructed in the framework. The five models are all trained using the optimal parameters obtained in Table 5, and the number of epochs was set to 20. The training time and accuracy of the five models are demonstrated as follows.
According to Figure 6, in terms of training time, it can be seen that the introduction of convolution has a great influence on the training time of deep neural networks, which is positively related to the number of parameters that need to be trained. Conv-GRU had the shortest training time of the five models and LSTM the longest: the LSTM training time was almost five times that of Conv-LSTM, and the GRU training time was more than three times that of Conv-GRU. According to Figure 7, as the training deepens, both the LSTM and GRU models show a tendency to overfit, which may be due to the complexity of the training parameters, while Conv-LSTM, Conv-GRU, and Conv-GRU-LSTM become more and more stable. This is because the deep learning framework proposed in this paper can greatly reduce the number of parameters that need to be trained while ensuring prediction accuracy, ultimately reducing the model training time cost.

Forecast Results Display
In order to further verify the superiority of the deep learning framework in this paper, the five models are used to predict 288 consecutive load points over the last three days of 2018. The prediction results, error, and R² are shown in Figures 8 and 9 and Table 6. The expression of the error is as follows:
Error = (predicted value − real value)/real value (14)
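Equation (14) is a per-point relative error; the load values below are toy numbers for illustration:

```python
def point_error(predicted, real):
    """Relative error of Equation (14): (predicted - real) / real."""
    return (predicted - real) / real

print(point_error(618.0, 600.0))  # 0.03, i.e., a 3% over-forecast
```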

As Figure 8 shows, the five deep learning models generally have excellent prediction accuracy and strong stability, proving the feasibility of applying the deep learning method to ultra-short-term load forecasting. According to the calculated R² values, the results of the five deep learning models were all greater than 0.9. Conv-LSTM had the best goodness of fit and Conv-GRU-LSTM the second best, which further proves the superiority of the deep learning framework proposed in this paper.
According to the experimental results, although the Conv-LSTM model had the highest coefficient of determination (0.9705), judging from the model training times in Figure 6, the training time of the Conv-GRU-LSTM model was much lower than that of the Conv-LSTM model. Therefore, considering all factors, the Conv-GRU-LSTM model is more practical. Especially when dealing with a large amount of sample data, the superiority of the model proposed in this paper is even more significant.

Conclusions
With the acceleration of the power market reform process, the importance of ultra-short-term load forecasting for grid companies and emerging purchase and sale companies is becoming more apparent. At the same time, affected by many uncertain factors, the future load changes present uncertainty. In comparison with the traditional point forecasting method, the deep learning framework can actively mine the hidden information in historical data, which is conducive to the decision-making and execution of electricity purchase and sale strategies of each power trading subject, and further promotes the economics of electricity market trading.
When using large-scale data for load forecasting, conventional prediction methods often lead to an excessively complicated model and an excessive computational cost in the training process. In this paper, convolution was combined with LSTM and GRU to construct the Conv-GRU-LSTM ultra-short-term load forecast model. The main research conclusions are as follows: (1) Making use of power system big data, this paper collected more than 100,000 historical load data points and made full use of the ability of deep learning neural networks to automatically extract features, simplifying the input features and reducing the manual feature construction process. The coefficient of determination of the Conv-GRU-LSTM model is 0.9639, which is very close to 1. Comprehensively considering the training time, the final experimental results show that the learning framework combining convolution with LSTM and GRU has an excellent ability for feature mining.
(2) The model proposed in this paper is compared with the other four models including GRU, LSTM, Conv-GRU, and Conv-LSTM. The results show that the Conv-GRU-LSTM model proposed in this paper presents comprehensive advantages in training time and prediction accuracy.
(3) This paper aims at ultra-short-term load forecasting over the next hour. The input samples span three years, so the forecasting results are not affected by seasonal changes. Therefore, the model in this paper can be applied to short-term load forecasting in all periods of the year.

Discussion
Although the deep learning framework proposed in this paper can be applied well to ultra-short-term load forecasting, there is still room for improvement. Further research can be carried out in the following two aspects: (1) The model hyperparameters can be further adjusted, such as the number of hidden layers and nodes. Meanwhile, the prediction model in this paper can also be generalized to photovoltaic power generation prediction and wind power prediction through hyperparameter adjustment; (2) The deep learning framework constructed in this paper can also be combined with multi-task learning. With reference to transfer learning and the coupling relationships of different energy sources in an integrated energy system, this model can also be introduced to improve the accuracy of multi-load prediction.