Analysis of Different Neural Networks and a New Architecture for Short-Term Load Forecasting

Short-term load forecasting (STLF) has been widely studied because it plays a very important role in improving the economy and security of electric system operations. Many types of neural networks have been successfully used for STLF. In most of these methods, common neural networks were used, but without a systematic comparative analysis. In this paper, we first compare the performance of the most frequently used neural networks on the load dataset from the State Grid Sichuan Electric Power Company (China). Then, considering the disadvantages of current neural networks, we propose a new recurrent neural network (RNN) architecture, called the gate-RNN, for STLF. Evaluating all the methods on our dataset, the results demonstrate that the performance of the different neural network methods is related to the time scale of the data, and that our proposed method is more accurate on shorter time scales, particularly when the time scale is smaller than 20 min.


Introduction
Accurate short-term load forecasting (STLF) can play a significant role in power construction planning and power grid operation, and also has crucial implications for the sustainable development of power enterprises. STLF predicts future loads over horizons of minutes to weeks. Because of the nonlinearity, non-stationarity, and non-seasonality of the load, it is very challenging to predict accurately. Inaccurate load forecasting may increase operating costs [1]. By contrast, with an accurate electric load forecasting method, fundamental operating functions, such as unit maintenance, reliability analysis, and unit commitment, can be performed more efficiently [2]. Thus, it is essential for power suppliers to build an effective model that can predict power loads, accomplish a balance between production and demand, reduce production costs, and implement pricing schemes for various demand responses. According to the length of the forecast period, power load forecasting is divided into four categories: long-term load forecasting, medium-term load forecasting, STLF, and ultra-STLF [3].
There have been many efforts to develop accurate STLF, and many methods have been proposed [4][5][6][7][8][9][10][11][12][13][14][15][16]. In much earlier works, researchers attempted to forecast load precisely using mathematical statistics approaches. The most representative is the regression analysis approach [4], which uses a set of functional linear regression models. In [5], the authors used the Kalman filter to develop a very short-term load predictor, and the Box-Jenkins autoregressive integrated moving average approach was proposed in [6]. In [17], the relationships between demand and driver variables were evaluated using semi-parametric additive models. Additionally, in [18], the authors came up with a new SVD-based exponential smoothing formulation. Based on linear regression and patterns, univariate models were proposed for 34 daily cycles of a load time series in [19]. In [20], by combining a Bayesian neural network with the hybrid Monte Carlo algorithm, the authors proposed a new model for STLF. A modified artificial bee colony algorithm, an extreme learning machine, and the wavelet transform were combined to construct a novel STLF method in [21].
However, mathematical statistics approaches are based on linear analysis; thus, it is difficult for them to handle nonlinear and non-stationary prediction problems. For better learning of nonlinear features, machine learning is a good approach. Two broad categories of methods exist: support vector regression (SVR) and artificial neural networks (ANNs). In the SVR category, a generic strategy for STLF based on SVR has been proposed [22][23][24][25]. This method makes two considerable improvements to SVR-based load forecasting: the procedure for generating model inputs, and the selection of model inputs by feature selection algorithms. By combining SVR with other algorithms, many hybrid methods have been proposed. An SVR model combined with the differential empirical mode decomposition algorithm and autoregression was proposed in [26][27][28], and provides higher accuracy and interpretability and better generalization ability. To extend the SVR method, a chaotic genetic algorithm was presented to improve forecasting performance in [29]. In [30,31], the authors combined support vector machines (SVMs) with a genetic algorithm, and combined SVR with ant colony optimization, to forecast the system load. In [32], an approach based on double seasonal exponential smoothing was used for STLF. Besides these studies, in [7], to achieve better regression and forecasting performance, an SVR-based STLF approach was proposed in which preprocessing of the input data is required. Simultaneously, STLF models were developed using fuzzy logic and an adaptive neuro-fuzzy inference system [33]; they provide efficient load forecasting and have become an alternative approach for STLF in Turkey.
For ANNs, the backpropagation neural network (BPNN) was the first ANN method used for load forecasting. In [34], the authors presented a BPNN approach with a rough set for complicated STLF with dynamic and nonlinear factors to enhance the performance of predictions. By combining a Bayesian neural network and a BPNN, Ningl et al. [35] proposed a Bayesian-backpropagation method to forecast the hourly power load of weekdays and weekends. Based on the BPNN, the authors discussed the relationship between the daily load and weather factors in [36]. Because the BPNN is a type of feedforward ANN, it cannot learn the features of time-sequential data, but power load data can be considered sequential data. Recurrent neural networks (RNNs) have therefore been introduced into STLF. In [37][38][39][40], the authors proposed using an RNN to capture a compact and robust representation. A multiscale bi-linear RNN was proposed for STLF [41]. Additionally, the long short-term memory (LSTM) network, a more complex type of RNN, has been used in STLF [42][43][44]. However, the performance of LSTM is not always satisfactory.
Although neural networks have been frequently used in short-term power load prediction, there has barely been a systematic comparison of the role of neural networks in this problem, to determine how they address the key problems and which approach is least adversely affected by them. Therefore, we systematically compare the advantages and disadvantages of different types of commonly used neural network methods and then analyze the performance of the neural networks at different time scales. Simultaneously, according to the differences in network performance, we design a new RNN architecture that balances memory and the current scenario at any time, by applying the idea of highway neural networks [45] in the time dimension. Thus, our main contributions in this paper are as follows: first, we systematically analyze the performance of commonly used neural networks on our power load data; second, according to the advantages and disadvantages of these networks, we propose a new neural network architecture, the gate-RNN, for STLF.

Methods
To explore the performance of different types of neural networks applied to STLF, we use four types: the three most commonly used neural networks and an improved neural network that we call the gate-RNN.

BPNN
A BPNN is a type of feedforward neural network (FNN). The simple architecture of a BPNN is shown in Figure 1, where the neurons in each layer of the BPNN are connected only with the neurons of adjacent layers; there are no connections among neurons in the same layer. The BPNN contains three types of layers: the input layer, the hidden layers, and the output layer. The input layer feeds data into the neural network. The output layer outputs the neural network's computational results. The hidden layers are the layers between the input layer and the output layer. The values of the connections between different layers are weights, denoted by w_i, where i denotes the i-th layer. All the knowledge that the neural network has learned is stored in the weights.

Figure 1. Architecture of a BPNN that contains L layers. Typically, Layer 1 is the input layer, which inputs data into the neural network, and the last layer, Layer L, is the output layer, which outputs the predicted values. The W between every pair of layers is the weight, which is the knowledge of the network.
The goal of training a BPNN is to determine a set of suitable weights W, by training on the training dataset, so that the network obtains the correct output when test data are input. Suppose there is a training set {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, which contains n tuples. Each tuple contains input data x_i and a target label y_i. To train a BPNN, the first step is forward computing, which can be computed as

z^l = W^l a^(l−1),   a^l = f(z^l),    (1)

where z^l is the l-th layer's input vector, a^l is the l-th layer's output vector, and f(·) is an activation function.
In this paper, we chose the rectified linear unit (ReLU) as our activation function, which is defined as

f(x) = max(0, x).    (2)

After forward computing, the BP network needs to update its weights using the losses calculated from the target labels so that the network can determine the suitable W, where W is the set {W^1, ..., W^L} and L is the number of layers. One of the most commonly used update methods is gradient descent with BP. The BP process starts by defining a loss function, Loss. The loss function measures the distance between the outputs of the BP network and the true targets. The mean-square error (MSE) is a common loss function for prediction. It is defined as Equation (3):

Loss = (1/2) Σ_{i=1}^{n_L} (y_i^L − a_i^L)^2,    (3)

where a_i^L is the i-th output of the output layer, y_i^L is the i-th target label of the output layer, and n_L is the total number of outputs in the output layer. The factor 1/2 simplifies the computation of the derivative of the loss. Because a_i^L is computed from W, the loss function Loss is a function of the weights W.
To use the gradient descent method to determine the optimal W, we need to compute the gradient ∇W, which is defined as

∇W = ∂Loss/∂W,    (4)

and update W using

W = W − α ∇W,    (5)

where α is the learning rate, which controls the learning step during each update. For better computation of the gradient of W in each layer, we define an error term for the i-th neuron of the l-th layer as δ_i^l = ∂Loss/∂z_i^l.
In the output layer, we can compute the gradient of w_ij^L directly by combining Equation (3) and using the chain rule:

∂Loss/∂w_ij^L = δ_i^L a_j^(L−1),   with   δ_i^L = (a_i^L − y_i^L) f'(z_i^L).    (6)

Then, we compute the other layers' error terms using back propagation:

δ_i^l = f'(z_i^l) Σ_j w_ji^(l+1) δ_j^(l+1).    (7)

Then, we compute the remaining layers' weight gradients from back to front in matrix form as

∇W^l = δ^l (a^(l−1))^T.    (8)

Thus, the learning process of the BPNN can be presented as Algorithm 1.
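As a sketch of the forward computing and backpropagation update described above: the 10-64-1 layer sizes follow the experimental configuration later in the paper, and the helper names are ours, not the authors'.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(z.dtype)

# Layer sizes: 10 past load values in, 64 hidden neurons, 1 predicted value out.
rng = np.random.default_rng(0)
sizes = [10, 64, 1]
W = [rng.normal(0.0, np.sqrt(2.0 / (m + n)), size=(n, m))
     for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x):
    """Forward computing: z^l = W^l a^(l-1), a^l = f(z^l)."""
    a, zs, acts = x, [], [x]
    for l, Wl in enumerate(W):
        z = Wl @ a
        a = relu(z) if l < len(W) - 1 else z   # linear output layer
        zs.append(z)
        acts.append(a)
    return zs, acts

def backward(x, y, alpha=0.0001):
    """One gradient-descent step: backpropagate the MSE error terms."""
    zs, acts = forward(x)
    delta = acts[-1] - y                       # output-layer error term
    for l in range(len(W) - 1, -1, -1):
        gradW = np.outer(delta, acts[l])       # gradient of W^l
        if l > 0:
            delta = (W[l].T @ delta) * relu_grad(zs[l - 1])
        W[l] -= alpha * gradW
```

Repeating `backward` over the training samples drives the loss down; a few hundred steps on a single sample are enough to see it converge.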

Algorithm 1:
The BP network update process with gradient descent.

Input: the input dataset D = {(x, y^L)}; the learning rate α = 0.0001.
Output: the weights W of the BP network after training.
1 Initialize the weights W randomly with a scale determined by k, where k is the sum of the input and output dimensions;
2 For each sample (x, y^L) ∈ D, set a^1 = x;
3 Do forward computing;
4 Compute the loss by Equation (3);
5 Compute the error terms of each layer by back propagation;
6 Update W by Equation (5);
7 Repeat from line 2 until convergence.

RNN
STLF is a type of prediction based on previous time steps: it forecasts the load value at the next time step from historical load information. We can therefore consider it a sequential problem. Among all types of neural networks, the RNN is good at solving sequential problems. An RNN contains recurrent connections. A simple architecture of an RNN is shown in Figure 2. It contains an input layer, a recurrent layer, and an output layer. The difference between an RNN and a BPNN is that an RNN has connections among the neurons of the same recurrent layer.
For a better understanding of the computation of an RNN, we can unroll the RNN in the time dimension, as shown in Figure 3. In the unrolled RNN, the neurons in the hidden layer at each time step t can be considered one layer of an FNN. If the number of time steps is T, then the unrolled RNN has T hidden layers. Based on the unrolled network, we can train the RNN using backpropagation through time (BPTT) [46], specifically epoch-wise BPTT, using the following steps: 1. Forward computing: the output of the network at time t is computed as

h(t) = f(W_in x(t) + W_rec h(t − 1)),   y(t) = g(W_out h(t)),

where W_in, W_rec, and W_out are the input, recurrent, and output weight matrices, respectively.

2. Define the loss as

E = Σ_{t=1}^{T} E(t),   E(t) = (1/2) (d(t) − y(t))^2,

where E(t) is the error at time t; at each time step, d(t) represents the target label and y(t) is the output of the network at time t. 3. Compute the gradient ∇w of each weight w by the chain rule: defining e(τ) = ∂E(τ)/∂y(τ) and propagating it backward through the unrolled network yields the error term at each time step, and the gradient of each weight is the sum, over all time steps, of the products of the error terms and the corresponding layer inputs. 4. Then, update the weights in the RNN using Equation (5) until the network converges.
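Steps 1 to 4 can be sketched in NumPy for a single-feature load series. The matrix names W_in, W_rec, and W_out are our labels for the input, recurrent, and output weights (tanh hidden activation and a linear output are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid = 1, 4                      # one load value per step, 4 hidden units
W_in = rng.normal(0.0, 0.5, (n_hid, n_in))
W_rec = rng.normal(0.0, 0.5, (n_hid, n_hid))
W_out = rng.normal(0.0, 0.5, (1, n_hid))

def forward(xs):
    """Unrolled forward pass: one hidden state per time step."""
    h = np.zeros(n_hid)
    hs, ys = [h], []
    for x in xs:
        h = np.tanh(W_in @ x + W_rec @ h)
        hs.append(h)
        ys.append(W_out @ h)
    return hs, ys

def bptt(xs, ds):
    """Epoch-wise BPTT: sum the gradients over the whole sequence."""
    hs, ys = forward(xs)
    gW_in, gW_rec, gW_out = (np.zeros_like(W_in), np.zeros_like(W_rec),
                             np.zeros_like(W_out))
    dh_next = np.zeros(n_hid)
    for t in reversed(range(len(xs))):
        e = ys[t] - ds[t]                   # e(t) = dE(t)/dy(t)
        gW_out += np.outer(e, hs[t + 1])
        dh = W_out.T @ e + dh_next          # error flowing into h(t)
        dz = dh * (1.0 - hs[t + 1] ** 2)    # through the tanh nonlinearity
        gW_in += np.outer(dz, xs[t])
        gW_rec += np.outer(dz, hs[t])       # hs[t] is h(t-1)
        dh_next = W_rec.T @ dz
    return gW_in, gW_rec, gW_out

def sgd_step(xs, ds, lr=0.05):
    """One gradient-descent update of all three weight matrices."""
    gi, gr, go = bptt(xs, ds)
    W_in[:] -= lr * gi
    W_rec[:] -= lr * gr
    W_out[:] -= lr * go
```

Calling `sgd_step` repeatedly on a short sequence reduces the summed squared error over the sequence.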

LSTM
LSTM is a type of RNN, but it has a more complicated architecture. A common LSTM network consists of many LSTM blocks, which are called memory cells. An LSTM cell stores the input for some period of time. The flow of information into and out of the cell is determined by the values in the cell and regulated by three gates. A classical cell architecture is shown in Figure 4. The cell receives input x_t at time t, and the output h_{t−1} and state vector C_{t−1} from time t − 1. There are three gates in the cell to control its computation: the input gate, the output gate, and the forget gate. The input gate selectively records new information into the cell state. The output gate determines which information is worth outputting. The forget gate selectively forgets some information and retains the more valuable information. The forward pass of an LSTM unit with a forget gate is computed by the following equations:

f_t = σ_g(W_f x_t + U_f h_{t−1} + b_f),
i_t = σ_g(W_i x_t + U_i h_{t−1} + b_i),
o_t = σ_g(W_o x_t + U_o h_{t−1} + b_o),
c_t = f_t ∘ c_{t−1} + i_t ∘ σ_c(W_c x_t + U_c h_{t−1} + b_c),
h_t = o_t ∘ σ_h(c_t),

where x_t is the input vector to the cell, W ∈ R^{h×d} and U ∈ R^{h×h} are the weight matrices, and b are the bias vector parameters, where d and h refer to the number of input features and the number of hidden units, respectively. f_t, i_t, and o_t are the outputs of the forget gate, input gate, and output gate, respectively. h_t is the hidden state vector, which is also called the output vector, and c_t is the cell state vector. σ_g, σ_c, and σ_h are activation functions, where σ_g is the sigmoid function and σ_c and σ_h are the hyperbolic tangent function.
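The forward pass above can be sketched directly in NumPy; the parameter-dictionary layout and initialization scale are illustrative choices of ours:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(d, h, seed=0):
    """One W (h x d), U (h x h) and bias per gate: f, i, o, plus the candidate c."""
    rng = np.random.default_rng(seed)
    P = {}
    for g in "fioc":
        P["W" + g] = rng.normal(0.0, 0.1, (h, d))
        P["U" + g] = rng.normal(0.0, 0.1, (h, h))
        P["b" + g] = np.zeros(h)
    return P

def lstm_step(x_t, h_prev, c_prev, P):
    """One forward step of an LSTM cell with a forget gate."""
    f = sigmoid(P["Wf"] @ x_t + P["Uf"] @ h_prev + P["bf"])   # forget gate
    i = sigmoid(P["Wi"] @ x_t + P["Ui"] @ h_prev + P["bi"])   # input gate
    o = sigmoid(P["Wo"] @ x_t + P["Uo"] @ h_prev + P["bo"])   # output gate
    c = f * c_prev + i * np.tanh(P["Wc"] @ x_t + P["Uc"] @ h_prev + P["bc"])
    h = o * np.tanh(c)                                        # output vector
    return h, c
```

Because h_t = o_t ∘ tanh(c_t) with o_t in (0, 1), every component of the output stays strictly inside (−1, 1).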
To update the parameters of the LSTM, a common way is to use the epoch-wise BPTT algorithm, as with the RNN; the details can be found in [47].

Gate-RNN
As we can see, the BPNN is an FNN. If BPNNs are used to predict load, they weaken the relationships in the time dimension. The RNN and LSTM, by contrast, are good at capturing temporal features. However, an RNN can only memorize historical information through the weights among its neurons, which is too simple a mechanism for handling the input information and the memorized information, while, in our experiments, we found that LSTM was too complicated to train stably. Thus, we propose an RNN cell that can control the computation of the input information and the memorized information, converges well in the training process, and achieves excellent results in ultra-STLF.
Figure 5 shows the cell architecture of our gate-RNN. At time t, the input vector of the cell is x_t and the historical information is the output h_{t−1} at time t − 1. When input vector x_t enters the cell, it is divided into three branches: two branches compute the two gates' values, G and 1 − G, and one computes the cell's state S. Gate G controls the effect of the cell state on the output, and gate 1 − G controls the effect of the historical information on the output. The output of the cell h_t combines the outputs of both gates. We use the "+" operator as our combination approach, which adds the values of both gates at the same position. The computation of the cell is as follows:

g_t = σ(W_g x_t),   s_t = σ_s(W_s x_t),   h_t = g_t ∘ s_t + (1 − g_t) ∘ W_h h_{t−1},    (16)

where σ is the sigmoid function and ∘ denotes element-wise multiplication.

As for the LSTM, W ∈ R^{h×d} denotes the weight matrix parameters. Many gate-RNN cells are combined to obtain a layer and then a neural network. To update the gate-RNN, we compute the gradient of the weights at each time step. We also define an error term at the t-th time step (analogous to the l-th layer) as δ_t = ∂Loss/∂h_t to better compute the gradient of W in each layer. In the t-th time step, we can compute the gradients of W_g, W_s, and W_h directly by combining Equation (3) and using the chain rule, as follows:

∇W_g = [δ_t ∘ (s_t − W_h h_{t−1}) ∘ g_t ∘ (1 − g_t)] x_t^T,    (17)
∇W_s = [δ_t ∘ g_t ∘ σ_s'(W_s x_t)] x_t^T,    (18)
∇W_h = [δ_t ∘ (1 − g_t)] (h_{t−1})^T.    (19)

Then, we update the entire network using Algorithm 2.
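One possible reading of the cell described above can be sketched in NumPy. Note that the exact placement of the W_h transform on the history branch and the choice of tanh for the state activation are our assumptions, since the text does not fully specify them:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate_rnn_step(x_t, h_prev, Wg, Ws, Wh):
    """One gate-RNN step: G gates the new state S computed from the input,
    and 1 - G gates the transformed history h_{t-1}; "+" merges the branches."""
    G = sigmoid(Wg @ x_t)                  # gate from the input branch
    S = np.tanh(Ws @ x_t)                  # cell state from the input branch
    return G * S + (1.0 - G) * (Wh @ h_prev)
```

With G near 1, the cell follows the current input; with G near 0, it passes the history through, which is the highway-network trade-off applied along the time dimension.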

Algorithm 2:
The update process of gate-RNN.

Input: the input dataset D = {(x, y^L)}; the learning rate α = 0.0001.
Output: the weights W of the gate-RNN network after training.

Experiment Data Processing and Experiment Settings
Our dataset contains an entire year's electrical load for 2016 from the State Grid Sichuan Electric Power Company in Sichuan Province, China. The distribution of the data values is shown in Figure 6. The average, variance, standard deviation, and coefficient of variation are 16,965.71, 11,692,942.24, 3,419.49, and 4.96, respectively. To avoid overlap between the training data and test data, all the data were sorted by time, and the first three quarters of the total data were chosen as our training data and the remainder as our test data. That is, we used nine months of load values as the training set and the remaining three months of load values as the test set. Because it is suggested to train neural networks on raw data, we did not use any regularization or pay special attention to the special days of the year, such as holidays and Chinese New Year. The original data record the electrical load at 1-min intervals. According to the data processing method in [7], the data were sampled at different time scales. For a better analysis of neural network performance, the data were sampled at 5-min, 20-min, 30-min, and 40-min time scales after dividing the training data and test data. The sizes of the training and test datasets for the different time scales are shown in Table 1 and visualized in Figure 7. As we can see, as the time scale becomes larger, the size of the dataset becomes smaller, because sampling at different time scales divides the data into different sizes. For training the BPNN, the inputs of the neural network were the last 10 samples and the output was the predicted load value at the 11th moment. For the RNN-series methods, the input was the previous nine load values and the current load value, and the output was the predicted load value at the next time step. All the input values are the nearest ten values, ordered in time, in the same way as in [35,43]. The complete structures of the ANNs are shown in Figure 8. After predicting the eleventh load value, the second to the eleventh values are used to predict the twelfth value. We iterate this procedure until the last value has been predicted. The prediction mechanism of the neural network is shown in Figure 9. In the experiment, the following criteria were used to evaluate all the mentioned methods: the root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE), which are widely used in STLF [1,7,17]. They are calculated as follows:

RMSE = sqrt( (1/N) Σ_{i=1}^{N} (x_i − ẋ_i)^2 ),
MAE = (1/N) Σ_{i=1}^{N} |x_i − ẋ_i|,
MAPE = (100/N) Σ_{i=1}^{N} |x_i − ẋ_i| / x_i,

where x_i is the actual load value, ẋ_i is the predicted load value, and N is the number of test samples. These criteria represent three types of deviation between the forecast and actual values: the smaller the criteria, the higher the forecasting accuracy. MAE is the basic metric for STLF, RMSE is sensitive to regression points with large deviations, and MAPE considers both the error and its ratio to the true value.
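The resampling, the 10-value sliding windows, and the three criteria described above can be sketched as follows (the function names are illustrative):

```python
import numpy as np

def resample(load_1min, scale_min):
    """Sample the 1-min load series at a coarser time scale (5, 20, 30, 40 min)."""
    return load_1min[::scale_min]

def make_windows(series, n_in=10):
    """Each input is the last 10 load values; the target is the 11th value."""
    X = np.array([series[i:i + n_in] for i in range(len(series) - n_in)])
    y = np.asarray(series[n_in:])
    return X, y

def rmse(actual, pred):
    return float(np.sqrt(np.mean((np.asarray(actual) - np.asarray(pred)) ** 2)))

def mae(actual, pred):
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(pred))))

def mape(actual, pred):
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    return float(100.0 * np.mean(np.abs(actual - pred) / actual))
```

For example, for actual values (100, 200) and predictions (110, 190), RMSE and MAE are both 10, while MAPE averages the 10% and 5% relative errors to 7.5%.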
In order to balance performance and accuracy, we ran several trials to decide the size of all the neural networks on the 30-min time scale dataset; the trials' results on the 30-min test set are shown in Table 2. The number after each model name is the number of neurons. We tried a three-layer BPNN architecture, and the results show that it can forecast the load values well when the number of hidden neurons is 64; more or fewer hidden neurons cause a decrease in performance. This means that a three-layer BPNN with 64 hidden neurons is a good choice for this load forecasting task, so we did not add more layers to the BPNN. Likewise, the results show that one recurrent layer with four neurons is a good choice for this task. The specific configuration of the mentioned neural networks for STLF was as follows. The BPNN contained three layers: the first layer was the input layer, which contained 10 neurons, each receiving a sampled load value; the second layer was a hidden layer of size 64; and the last layer was the output layer, which contained a single neuron representing the predicted value. The RNN, LSTM, and gate-RNN input layers contained one neuron, which fed into a 4-neuron RNN, LSTM, or gate-RNN cell, and the output was the same as that of the BPNN. The recurrent cell processed the values of the last 10 moments to predict the value of the next moment. All the networks in this paper were implemented using the Keras framework. The loss function of these methods was Equation (22). The optimizer of the neural networks was RMSProp, and its parameters were set to the Keras defaults except for the learning rate. During the training process, the learning rate for epochs 1 to 5 was 0.01 and the learning rate for the remaining epochs was 5 × 10^−5. The total number of epochs was 55 and the weight decay was 5 × 10^−5.
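The learning-rate schedule described above can be written as a plain function; such a function could be handed to a Keras `LearningRateScheduler` callback during `model.fit`. Zero-based epoch indices are our assumption:

```python
EPOCHS = 55
WEIGHT_DECAY = 5e-5

def lr_schedule(epoch):
    """Learning rate 0.01 for the first five epochs, then 5e-5 thereafter."""
    return 0.01 if epoch < 5 else 5e-5

# e.g. keras.callbacks.LearningRateScheduler(lr_schedule) inside model.fit(...)
schedule = [lr_schedule(e) for e in range(EPOCHS)]
```

The large initial rate moves the weights quickly toward a good region, and the small rate for the remaining 50 epochs fine-tunes them.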

Comparison of the Models with the Baseline Method
To evaluate the effectiveness of the neural network methods mentioned in the section above, several of the most commonly used traditional methods were chosen as baselines: SVR [22], decision tree (DT) [9], the autoregressive integrated moving average model (ARIMA) [48], and the random forest (RF) proposed in [49]. SVR is one of the most popular methods for forecasting short-term load, and DT, ARIMA, and the RF introduced in [49] are useful methods. All these methods have been used in STLF at different time scales, namely the 5-min, 20-min, 30-min, and 40-min scales [9,48,49]. The results of all the methods on the three-month test set are shown in Tables 3-6. From these tables, we can see that, among the eight methods, the neural network methods, except LSTM, achieved better comprehensive performance than the traditional methods at all time scales. At the smaller time scales, specifically 5 min and 20 min, the gate-RNN achieved better performance than any other method, and ARIMA achieved the best performance among the traditional methods. At the larger time scales, 30 min and 40 min, BP achieved better comprehensive performance than any other method, while DT achieved the best performance among the traditional methods. This shows that the performance of the methods is influenced by the time scale. Table 3. Results of eight methods on the 5-min time scale on the three-month test dataset.

Method        RMSE     MAPE (%)   MAE
SVR [22]      830.00   3.80       682.88
DT [9]        517.70   2.14       394.55
ARIMA [48]    670

To clearly illustrate the forecast values, from the three-month test set, we chose 1000 time steps of actual and predicted electric load values obtained by the eight methods and plotted them on the 30-min time scale, as shown in Figure 10. The orange line represents the real load values, the blue line the predicted values, and the green line the average of the real load values. Furthermore, 1000 time steps are nearly one month; from Figure 10, we can see that the blue line is far from the green line and close to the orange line. All of the methods' predicted values are much closer to the real values, which means that the predicted values are not simply related to the average value over some months or even one year. Specifically, it can be seen that the predicted values obtained by the LSTM method have the maximum error with respect to the original data compared with the other seven methods.

Performance Analysis of the Neural Networks
In this section, we analyze the performance of the four neural network methods, including the training process of the different neural networks and their forecasting performance. To better compare these methods, we used four time scales to process the data: 5 min, 20 min, 30 min, and 40 min.

Training Process Analysis for the Neural Networks
To assess whether the neural networks were trained properly and learned the distribution of the data, a direct approach is to check whether the train loss and test loss converged. From Figures 12-15, we can see that the RNN and gate-RNN train loss and test loss converged well for all time scales, whereas LSTM had large fluctuations and BP had slight fluctuations. For the 5-min time scale, even though LSTM fluctuated strongly, all the methods achieved small train and test losses, and the trends of the train loss and test loss were the same. This means that all the methods learned the distribution of the training data and achieved good performance on the test set. When the time scale was larger than 5 min, even though the trends of the train loss and test loss were the same, their convergence values differed, and the losses increased as the time scale increased. This means that, as the time scale increased, the performance of the neural networks became worse.
From Figures 12-15, we can also see that our proposed method converged after approximately five epochs of training, faster than any of the other methods; BP converged in approximately ten epochs and the RNN in approximately seven. This suggests that our proposed method is well suited to forecasting ultra short-term load. We believe it benefits from the suitable gate mechanism: the RNN has no gate mechanism and cannot select its memorized information, while the LSTM was too complicated to train on these data.

Examination of All the Neural Networks
After analyzing the train loss and test loss, we analyzed the criteria for each method at the different time scales. The results for the four neural networks are shown in Tables 7-9, which present the RMSE, MAPE, and MAE, respectively, for the four time scales. Table 7 shows that the gate-RNN achieved the best performance for the 5-min and 20-min time scales, the RNN for the 30-min time scale, and BP for the 40-min time scale. The results in Tables 8 and 9 show the same pattern as Table 7. Overall, the gate-RNN achieved good results among the four methods, which again demonstrates that our proposed method is better at forecasting ultra short-term load. Further visualization results are shown in Figure 16, where we can see intuitively that the performance of all the methods became worse as the time scale increased. Overall, all the methods had a similar trend: they began at a relatively small criterion value and ended with a large one. Figure 16 also shows that our method achieved the best criterion values for small time scales, but, as the time scale increased, our method's performance became slightly worse. The performance of LSTM on all criteria for the different time scales was the worst, with criterion values far from those of the other neural network methods. We believe this is caused by the following main points. First, consider the time scale of the data. The larger the time scale, the more likely that unpredictable conditions, such as sudden weather changes and holidays, will affect the load value; even though the neural networks could predict the trend of the load, the load values would show large turbulence. Second, consider the neural networks' features. As we know, the complexity of the architectures increases gradually in the following order: BP, RNN, gate-RNN, and LSTM. The more complex the architecture of a neural network, the more difficult it is to train and the more data are needed to train it. After we process the data into different time scales, the data size changes according to the time scale. For example, if the total number of data items recorded every 1 min is T_total, then, at the 5-min time scale, the number of data items becomes T_5min, which is one fifth of the total, that is, five times smaller than T_total. All of the time scales can be represented by T_total = 5 × T_5min = 20 × T_20min = 30 × T_30min = 40 × T_40min. Thus, when using these different time scales to process the data for training, the actual data size decreased as the time scale increased, and the performance therefore became worse with larger time scales.
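The relation between T_total and the resampled dataset sizes can be checked with a quick calculation; treating the series as covering every minute of 2016, a leap year, is our simplifying assumption:

```python
# Minutes in 2016 (a leap year): 366 days * 24 h * 60 min.
T_total = 366 * 24 * 60

# Number of samples left after resampling to each time scale,
# so that T_total = 5*T_5min = 20*T_20min = 30*T_30min = 40*T_40min.
sizes = {scale: T_total // scale for scale in (5, 20, 30, 40)}
```

The 40-min dataset ends up roughly eight times smaller than the 5-min one, which is consistent with the degradation observed for the more complex architectures at large time scales.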

Conclusions
The aim of our work was to explore the performance of different types of neural networks for STLF and, based on the insufficiency of current neural networks, to propose a new neural network architecture called the gate-RNN. Our proposed method is extremely suitable for handling ultra-STLF. From the experimental results, we found that neural networks have the STLF ability and can achieve better performance than the traditional methods. However, the more complex a neural network's architecture, the more data are needed to train it. Considering that different neural networks have different characteristics, for our dataset, the gate-RNN achieved the highest score overall, while, for large time scales, BP achieved better performance than the RNNs.
In future research, we plan to improve our work in two aspects. First, we will extend our data to multivariate data by collecting more electrical load data and adding more relevant factors that influence load forecasting, such as weather and breaking news information. Second, based on the gate-RNN, we plan to design a loss function that can reflect the forecasting value online, to train the neural network more efficiently.

Figure 2. Architecture of a simple RNN: the inputs connect with each hidden neuron, and the hidden neurons have connections with every other hidden neuron and have outputs.
Figure 3. The RNN unrolled along the time dimension.

Figure 4. A cell of LSTM. The cell receives external input x_t and cell state C_{t−1} and outputs cell state C_t and current output h_t. The cell contains three gates: input gate i_t, output gate o_t, and forget gate f_t.

Figure 5. The cell of gate-RNN. It contains two gates to control the input information and the history information: it uses G to control the input information and 1 − G to control the history information.

Figure 6. The distribution of raw data values.

Figure 7. The numbers of training data and test data at different time scales.

Figure 9. The prediction mechanism of the neural network.

Figure 11. All eight methods' prediction error distributions on the 30-min time scale: (a) distribution of ARIMA; (b) distribution of DT; (c) distribution of RF; (d) distribution of SVR; (e) distribution of BP; (f) distribution of RNN; (g) distribution of LSTM; (h) distribution of gate-RNN.

Figure 12. Neural networks' losses on the train and test datasets for the 5-min time scale: (a) loss of BPNN; (b) loss of RNN; (c) loss of LSTM; and (d) loss of gate-RNN. The blue line is the train loss curve and the orange line is the test loss curve.

Figure 13. Neural networks' losses on the train and test datasets for the 20-min time scale: (a) loss of BPNN; (b) loss of RNN; (c) loss of LSTM; and (d) loss of gate-RNN. The blue line is the train loss curve and the orange line is the test loss curve.

Figure 14. Neural networks' losses on the train and test datasets for the 30-min time scale: (a) loss of BPNN; (b) loss of RNN; (c) loss of LSTM; and (d) loss of gate-RNN. The blue line is the train loss curve and the orange line is the test loss curve.

Figure 15. Neural networks' losses on the train and test datasets for the 40-min time scale: (a) loss of BPNN; (b) loss of RNN; (c) loss of LSTM; and (d) loss of gate-RNN. The blue line is the train loss curve and the orange line is the test loss curve.

Figure 16. Visualization of the three criteria for the neural networks at different time scales: (a) RMSE of the neural networks for different time scales; (b) MAE of the neural networks for different time scales; and (c) MAPE of the neural networks for different time scales.
1 For each sample (x, y^L) ∈ D, set a^1 = x;
2 Do forward computing by using Equation (16);
3 Compute the loss by Equation (3);
4 Compute the last time step T error terms δ_g^T = ∂Loss/∂g_T, δ_h^T = ∂Loss/∂h_T, δ_s^T = ∂Loss/∂s_T;
5 Compute the gradients of W_s, W_h, and W_g by using Equations (17)-(19);
6 Update W by Equation (5);
7 Repeat from line 1 until convergence.

Table 1. The data size of different time scales.
Figure 8. The complete structures of the BP neural network and the RNNs. (a) The structure of the BP neural network: the input layer contains 10 neurons, the hidden layer contains 64 neurons, and the output layer contains only one neuron. (b) The structure of the RNNs: the input contains one neuron, which receives the current load value; the hidden layer contains four units; and the output contains one neuron, which outputs the predicted load value. If each unit in the right figure is a neuron, it is the structure of the RNN; if each unit is an LSTM cell, it is the structure of the LSTM; and if each unit is a gate-RNN cell, it is the structure of the gate-RNN.

Table 2. Results of different hidden layer sizes on the 30-min time scale test set.

Table 4. Results of eight methods on the 20-min time scale on the three-month test dataset.

Table 6. Results of eight methods on the 40-min time scale on the three-month test dataset.

Table 7. Results of the RMSE for different time scales.

Table 8. Results of the MAPE for different time scales.

Table 9. Results of the MAE for different time scales.