A High Precision Artificial Neural Networks Model for Short-Term Energy Load Forecasting

Abstract: Load forecasting is one of the most important research topics in smart grid technology, because the accuracy of load forecasting strongly influences the reliability of smart grid systems. In the past, load forecasts were obtained with traditional analysis techniques such as time series analysis and linear regression. Since load forecasting focuses on aggregated electricity consumption patterns, researchers have recently integrated deep learning approaches with machine learning techniques. In this study, an accurate deep neural network algorithm for short-term load forecasting (STLF) is introduced. The forecasting performance of the proposed algorithm is compared with the performances of five artificial intelligence algorithms that are commonly used in load forecasting. The Mean Absolute Percentage Error (MAPE) and the Cumulative Variation of Root Mean Square Error (CV-RMSE) are used as accuracy evaluation indexes. The experimental results show that the MAPE and CV-RMSE of the proposed algorithm are 9.77% and 11.66%, respectively, indicating very high forecasting accuracy.


Introduction
Nowadays, there is a persistent need to accelerate the development of low-carbon energy technologies in order to address the global challenges of energy security, climate change, and economic growth [3]. Among the various green technologies to be developed, smart grids [1] are particularly important because they are key to the integration of several other low-carbon energy technologies [2], such as the charging of electric vehicles, the grid connection of variable renewable energy sources, and demand response.
The forecast of electricity load is important for power system scheduling adopted by energy providers [4]. Namely, inefficient storage and discharge of electricity could incur unnecessary costs, while even a small improvement in electricity load forecasting could reduce production costs and increase trading advantages [4], particularly during the peak electricity consumption periods. Therefore, it is important for electricity providers to model and forecast electricity load as accurately as possible, in both short-term [5][6][7][8][9][10][11][12] (one day to one month ahead) and medium-term [13] (one month to five years ahead) periods.
With the development of big data and artificial intelligence (AI) technology, new machine learning methods have been applied to the power industry, where large amounts of electricity data need to be carefully analyzed. Classical statistical approaches to load forecasting include the autoregressive integrated moving average (ARIMA) method. In [17], the authors used an ANN-based method reinforced by a wavelet denoising algorithm. The wavelet method was used to decompose the electricity load data into signals of different frequencies; the wavelet denoising algorithm therefore provides clean electricity load data for neural network training and improves load forecasting accuracy.
In this study, a new load forecasting model based on a deep learning algorithm is presented. The forecasting accuracy of the proposed model is within the required range, and the model has the advantages of simplicity and high forecasting performance. The major contributions of this paper are: (1) the introduction of a precise deep neural network model for energy load forecasting; (2) a comparison of the performances of several forecasting methods; and (3) a novel research direction in time sequence forecasting based on convolutional neural networks.

Methodology of Artificial Neural Networks
Artificial neural networks (ANNs) are computing systems inspired by biological neural networks. The general structure of an ANN contains neurons, weights, and biases. Owing to their powerful modeling ability, ANNs are still very popular in the machine learning field. There are many ANN structures used for machine learning problems, but the Multilayer Perceptron (MLP) [32] is the most commonly used type. The MLP is a fully connected artificial neural network; its structure is shown in Figure 1. In general, the MLP consists of one input layer, one or more hidden layers, and one output layer. The MLP network presented in Figure 1 is the most common MLP structure, which has only one hidden layer. In the MLP, all the neurons of the previous layer are fully connected to the neurons of the next layer. In Figure 1, x1, x2, x3, ..., x6 are the neurons of the input layer, h1, h2, h3, h4 are the neurons of the hidden layer, and y1, y2, y3, y4 are the neurons of the output layer. In the case of energy load forecasting, the input is the past energy load, and the output is the future energy load. Although the MLP structure is very simple, it provides good results in many applications. The most commonly used algorithm for MLP training is the backpropagation algorithm.
Although MLPs are very good at modeling and pattern recognition, convolutional neural networks (CNNs) provide better accuracy in highly non-linear problems, such as energy load forecasting. The CNN uses the concept of weight sharing. A one-dimensional convolution layer and a pooling layer are presented in Figure 2. The lines of the same color denote the same shared weight, and each set of shared weights can be treated as a kernel. After the convolution process, the inputs x1, x2, x3, ..., x6 are transformed into the feature map c1, c2, c3, c4. The next step in Figure 2 is pooling, wherein the feature map of the convolution layer is down-sampled and its dimension is reduced. For instance, in Figure 2 the dimension of the feature map is 4, and after the pooling process it is reduced to 2. Pooling is an important procedure for extracting the important convolution features.
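To make the weight-sharing idea concrete, the following is a minimal NumPy sketch of a 1D convolution (cross-correlation, as used in CNNs) with one shared kernel, followed by max pooling. The toy input of length 6, the kernel values, and the pooling size of 2 are illustrative assumptions chosen to reproduce the 6 → 4 → 2 dimension reduction of Figure 2.

```python
import numpy as np

def conv1d_valid(x, kernel, bias=0.0):
    """1D 'valid' convolution with a single shared kernel (weight sharing)."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) + bias
                     for i in range(len(x) - k + 1)])

def max_pool1d(feature_map, pool_size=2):
    """Non-overlapping max pooling: keep the strongest response per window."""
    n = len(feature_map) // pool_size
    return np.array([feature_map[i * pool_size:(i + 1) * pool_size].max()
                     for i in range(n)])

# Toy example: 6 inputs, kernel of size 3 -> feature map of length 4 -> 2 pooled values
x = np.array([0.2, 0.5, 0.7, 0.6, 0.4, 0.1])   # x1 ... x6 (e.g., past load values)
kernel = np.array([0.5, 1.0, 0.5])              # one shared set of weights
c = conv1d_valid(x, kernel)                     # c1 ... c4 (feature map)
p = max_pool1d(c, pool_size=2)                  # pooled features, length 2
print(c, p)
```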
The other popular solution to the forecasting problem is the Long Short-Term Memory (LSTM) network [33]. The LSTM is a recurrent neural network that has been used to solve many time sequence problems. The structure of the LSTM is shown in Figure 3, and its operation is described by the following equations:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) (1)
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) (2)
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) (3)
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t (4)
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) (5)
h_t = o_t \odot \tanh(C_t) (6)

where x_t is the network input, h_t is the output of the hidden layer, \sigma denotes the sigmoid function, C_t is the cell state, and \tilde{C}_t denotes the candidate value of the cell state. In addition, there are three gates in the LSTM: i_t is the input gate, o_t is the output gate, and f_t is the forget gate. The LSTM is designed to solve the long-term dependency problem and, in general, provides good forecasting results.
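For readers who prefer code to symbols, the following is a minimal NumPy sketch of a single LSTM cell step that mirrors Equations (1)-(6); the weight shapes, dimensions, and initialization are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time step following Equations (1)-(6).
    W and b hold the parameters of the forget, input, candidate, and output parts."""
    z = np.concatenate([h_prev, x_t])              # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])             # forget gate, Eq. (1)
    i_t = sigmoid(W["i"] @ z + b["i"])             # input gate, Eq. (2)
    C_tilde = np.tanh(W["C"] @ z + b["C"])         # candidate cell state, Eq. (3)
    C_t = f_t * C_prev + i_t * C_tilde             # new cell state, Eq. (4)
    o_t = sigmoid(W["o"] @ z + b["o"])             # output gate, Eq. (5)
    h_t = o_t * np.tanh(C_t)                       # hidden output, Eq. (6)
    return h_t, C_t

# Illustrative dimensions: 1 input feature, 4 hidden units
rng = np.random.default_rng(0)
n_in, n_hid = 1, 4
W = {k: rng.normal(scale=0.1, size=(n_hid, n_hid + n_in)) for k in "fiCo"}
b = {k: np.zeros(n_hid) for k in "fiCo"}
h, C = np.zeros(n_hid), np.zeros(n_hid)
for x_t in [0.3, 0.5, 0.4]:                        # a short load sequence
    h, C = lstm_step(np.array([x_t]), h, C, W, b)
```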

The Proposed Deep Neural Network
The structure of the proposed deep neural network, DeepEnergy, is shown in Figure 4. Unlike general forecasting methods based on the LSTM, DeepEnergy uses a CNN structure. The input layer carries the information on the past load, and the output values represent the future energy load. There are two main processes in DeepEnergy: feature extraction and forecasting. The feature extraction is performed by three convolution layers (Conv1, Conv2, and Conv3) and three pooling layers (Pooling1, Pooling2, and Pooling3). Conv1-Conv3 are one-dimensional (1D) convolutions, and the feature maps are all activated by the Rectified Linear Unit (ReLU) function. The kernel sizes of Conv1, Conv2, and Conv3 are 9, 5, and 5, respectively, and the depths of the feature maps are 16, 32, and 64, respectively. The pooling method of Pooling1 to Pooling3 is max pooling with a pooling size of 2. Therefore, after each pooling process, the dimension of the feature map is divided by 2, which extracts the important features for the deeper layers.

In the forecasting process, the first step is to flatten the Pooling3 layer into one dimension and to construct a fully connected structure between the Flatten layer and the Output layer. In order to fit the values previously normalized to the range [0, 1], the sigmoid function is chosen as the activation function of the output layer. Furthermore, in order to overcome the overfitting problem, the dropout technique [34] is adopted in the fully connected layer. Dropout is an efficient way to prevent overfitting in artificial neural networks: during the training process, randomly chosen neurons are temporarily dropped. As shown in Figure 4, the output values of the chosen neurons (the gray circles) are set to zero in a given training iteration, and the chosen neurons are changed randomly during the training process.
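As a rough illustration only, the DeepEnergy architecture described above could be sketched in Keras as follows. The input and output lengths of 24 × 7 and 24 × 3 hours are taken from the experiment reported later, while the padding mode, dropout rate, and optimizer are assumptions that are not specified in the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_deepenergy(input_hours=24 * 7, output_hours=24 * 3, dropout_rate=0.5):
    """Sketch of the DeepEnergy CNN: three 1D conv + max-pooling stages,
    followed by a flatten layer, dropout, and a sigmoid output layer."""
    model = models.Sequential([
        layers.Input(shape=(input_hours, 1)),                                 # past load in [0, 1]
        layers.Conv1D(16, kernel_size=9, padding="same", activation="relu"),  # Conv1
        layers.MaxPooling1D(pool_size=2),                                      # Pooling1
        layers.Conv1D(32, kernel_size=5, padding="same", activation="relu"),  # Conv2
        layers.MaxPooling1D(pool_size=2),                                      # Pooling2
        layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),  # Conv3
        layers.MaxPooling1D(pool_size=2),                                      # Pooling3
        layers.Flatten(),
        layers.Dropout(dropout_rate),                                          # dropout in the FC part
        layers.Dense(output_hours, activation="sigmoid"),                      # future load in [0, 1]
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

model = build_deepenergy()
model.summary()
```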
Furthermore, the flowchart of the proposed DeepEnergy method is presented in Figure 5. Firstly, the raw energy load data are loaded into memory. Then, data preprocessing is executed and the data are normalized to the range [0, 1] in order to fit the characteristics of the machine learning model. To validate the generalization performance of DeepEnergy, the data are split into training data and testing data, and the training data are used to train the proposed model. After the data preparation, the proposed DeepEnergy network is created and initialized. Before the training, the training data are randomly shuffled to force the proposed model to learn the complicated relationships between the input and output data. The training data are then split into several batches, and the model is trained on all of the batches in the order of the shuffled data. During the training process, if the desired Mean Square Error (MSE) is not reached in the current epoch, the training continues until either the desired MSE or the maximal number of epochs is reached; once the maximal number of epochs is reached, the training process stops regardless of the MSE value. Finally, the performance is evaluated to demonstrate the feasibility and practicability of the proposed method.
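A minimal training-loop sketch following the flowchart logic is given below; it assumes that `model`, `x_train`, and `y_train` come from the previous sketches, and the MSE threshold, epoch budget, and batch size are illustrative values rather than the paper's settings.

```python
import numpy as np

# Assumed to exist from the previous sketches / data preparation:
# model   - the compiled DeepEnergy network
# x_train - shape (n_samples, 24 * 7, 1), normalized past load
# y_train - shape (n_samples, 24 * 3), normalized future load

desired_mse = 1e-3        # illustrative stopping threshold
max_epochs = 200          # illustrative epoch budget
batch_size = 32           # illustrative batch size

# Shuffle the training data once, as described in the flowchart
order = np.random.permutation(len(x_train))
x_shuf, y_shuf = x_train[order], y_train[order]

for epoch in range(max_epochs):
    history = model.fit(x_shuf, y_shuf, batch_size=batch_size,
                        epochs=1, shuffle=False, verbose=0)
    mse = history.history["loss"][-1]
    if mse <= desired_mse:        # stop as soon as the desired MSE is reached
        break
# Training also stops when max_epochs is exhausted, regardless of the MSE.
```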

Experimental Results
In the experiment, the public electricity consumption data of a USA district, i.e., the 2016 electric load dataset provided by the Electric Reliability Council of Texas, were used. Since the support vector machine (SVM) [35] is a popular machine learning technique, an SVM with a radial basis function (RBF) kernel was chosen in the experiment to demonstrate the SVM performance. Besides, the random forest (RF) [36], decision tree (DT) [37], MLP, LSTM, and the proposed DeepEnergy network were also implemented and tested. The load forecasting results of all the methods are shown in Figures 6-11. In the experiment, the training data covered two months and the test data covered one month. In order to evaluate the performances of all listed methods, the dataset was divided into 10 partitions. In the first partition, the training data consisted of the energy load data collected in January and February 2016, and the test data consisted of the data collected in March 2016. In the second partition, the training data were the data collected in February and March 2016, and the test data were the data collected in April 2016. The remaining partitions follow the same pattern.
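A minimal pandas sketch of this rolling partition scheme is given below; it assumes an hourly load series `load` indexed over 2016 and is not the authors' own preprocessing code.

```python
import pandas as pd

def make_partitions(load: pd.Series, n_partitions: int = 10):
    """Rolling partitions: two consecutive months for training,
    the following month for testing (Jan+Feb -> Mar, Feb+Mar -> Apr, ...)."""
    partitions = []
    for i in range(n_partitions):
        train_months = [i + 1, i + 2]          # calendar months, 1-based
        test_month = i + 3
        train = load[load.index.month.isin(train_months)]
        test = load[load.index.month == test_month]
        partitions.append((train, test))
    return partitions

# Example (assumed): 'load' is an hourly pd.Series covering 2016
# partitions = make_partitions(load)
```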
In Figures 6-11, the red curves denote the forecasting results of the corresponding models, and the blue curves represent the ground truth. The vertical axes represent the energy load (MWh), and the horizontal axes denote the time (hour). The energy load from the past (24 × 7) h was used as the input of each forecasting model, and the predicted energy load for the next (24 × 3) h was the output. After the models received the past (24 × 7) h of data, they forecasted the energy load for the next (24 × 3) h (red curves in Figures 6-11), while the measured values are illustrated by the blue curves. The differences between the red and blue curves reflect the performances of the corresponding models. For the sake of a fair comparison, the testing data were not used during the training of the models. According to the results presented in Figures 6-11, the proposed DeepEnergy network has the best prediction performance among all of the models.
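The input/output pairs could be built with a sliding window, as in the following NumPy sketch; the min-max normalization and the variable names are assumptions based on the preprocessing described earlier, not the authors' code.

```python
import numpy as np

def make_samples(series, input_len=24 * 7, output_len=24 * 3, step=1):
    """Slide a window over the hourly load series:
    each sample maps the past 168 h to the following 72 h."""
    x, y = [], []
    for start in range(0, len(series) - input_len - output_len + 1, step):
        x.append(series[start:start + input_len])
        y.append(series[start + input_len:start + input_len + output_len])
    x = np.asarray(x)[..., np.newaxis]   # shape (n, 168, 1) for the 1D CNN
    y = np.asarray(y)                    # shape (n, 72)
    return x, y

# Example with min-max normalization to [0, 1] (assumed preprocessing):
# scaled = (load_values - load_values.min()) / (load_values.max() - load_values.min())
# x_train, y_train = make_samples(scaled)
```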

In order to evaluate the performance of the forecasting models more precisely, the Mean Absolute Percentage Error (MAPE) and the Cumulative Variation of Root Mean Square Error (CV-RMSE) were employed. The MAPE and CV-RMSE are defined by Equations (7) and (8), respectively:

MAPE = \frac{100\%}{N} \sum_{n=1}^{N} \left| \frac{y_n - \hat{y}_n}{y_n} \right| (7)

CV\text{-}RMSE = \frac{\sqrt{\frac{1}{N} \sum_{n=1}^{N} (y_n - \hat{y}_n)^2}}{\bar{y}} \times 100\% (8)

where y_n denotes the measured value, \hat{y}_n is the estimated value, \bar{y} is the mean of the measured values, and N represents the sample size.
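A direct implementation of Equations (7) and (8) could look as follows; the example values in the comment are purely illustrative.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, Equation (7), in percent."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def cv_rmse(y_true, y_pred):
    """CV-RMSE, Equation (8): RMSE normalized by the mean measured value, in percent."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return 100.0 * rmse / np.mean(y_true)

# Illustrative example (not data from the paper):
# print(mape([100, 110, 120], [98, 113, 118]), cv_rmse([100, 110, 120], [98, 113, 118]))
```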
The detailed experimental results are presented numerically in Tables 1 and 2. As shown in Tables 1 and 2, the MAPE and CV-RMSE of the DeepEnergy model are the smallest among all models, with average values of 9.77% and 11.65%, respectively. The MAPE of the MLP model is the largest among all of the models, with an average error of about 15.47%. On the other hand, the CV-RMSE of the SVM model is the largest among all models, with an average error of about 17.47%. According to the average MAPE and CV-RMSE values, the electric load forecasting accuracy of the tested models in descending order is: DeepEnergy, RF, LSTM, DT, SVM, and MLP. The red curve in Figure 11, which corresponds to the DeepEnergy algorithm, follows the ground truth more closely than the curves in Figures 6-10, which further verifies that the proposed DeepEnergy algorithm has the best prediction performance. Therefore, the DeepEnergy STLF algorithm proposed in this paper is shown to be practical and effective. Although the LSTM generally performs well on time sequence problems, in this study its training loss did not decrease fast enough to handle this forecasting problem, because the size of the input and output data is too large for the traditional LSTM neural network; therefore, the traditional LSTM is not well suited to this kind of prediction. Finally, the experimental results show that the proposed DeepEnergy network provides the best results in energy load forecasting.

Discussion
Traditional machine learning methods, such as the SVM, random forest, and decision tree, are widely used in many applications, and in this study they also provide acceptable results. Regarding the SVM, the input data are mapped into a higher-dimensional space by the kernel function, so the selection of the kernel function is very important; in order to model the nonlinear energy load, the RBF is chosen as the SVM kernel. Compared with the SVM, the learning concept of the decision tree is much simpler: the decision tree is a flowchart-like structure that is easy to understand and interpret. However, a single decision tree does not have the ability to solve complicated problems. Therefore, the random forest, which combines numerous decision trees, provides a model ensemble solution. In this paper, the experimental results of the random forest are better than those of the decision tree and the SVM, which shows that the model ensemble solution is effective in energy load forecasting. Regarding the neural networks, the MLP is the simplest ANN structure; although the MLP can model the nonlinear energy forecasting task, its performance in this experiment is not outstanding. On the other hand, the LSTM considers the relationships between time steps during training; according to the results, the LSTM can deal with time sequence problems, and its forecast trend is roughly correct. However, the proposed CNN structure, named DeepEnergy, achieves the best results in the experiment. The experiments demonstrate that the most important features can be extracted by the designed 1D convolution and pooling layers, which confirms that the CNN structure is effective for forecasting and that the proposed DeepEnergy gives outstanding results. This paper not only provides a comparison of traditional machine learning and deep learning methods, but also opens a new research direction in energy load forecasting.
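For completeness, the following scikit-learn sketch shows how the traditional baselines discussed above (RBF-kernel SVM, decision tree, random forest) could be set up for the same 168 h → 72 h task; all hyperparameters are assumptions and not the settings reported in the paper, and the data names refer to the earlier sliding-window sketch.

```python
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Illustrative baselines for the multi-output forecasting task
baselines = {
    "SVM (RBF kernel)": MultiOutputRegressor(SVR(kernel="rbf", C=1.0, gamma="scale")),
    "Decision tree": DecisionTreeRegressor(max_depth=10),
    "Random forest": RandomForestRegressor(n_estimators=100),
}

# x_train / y_train / x_test / y_test are assumed to come from the sliding-window
# preparation above, flattened to 2D for scikit-learn: (n_samples, 168) and (n_samples, 72).
# for name, reg in baselines.items():
#     reg.fit(x_train.reshape(len(x_train), -1), y_train)
#     y_hat = reg.predict(x_test.reshape(len(x_test), -1))
#     print(name, mape(y_test, y_hat), cv_rmse(y_test, y_hat))
```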

Conclusions
This paper proposes a powerful deep convolutional neural network model, DeepEnergy, for energy load forecasting. The proposed network was validated in an experiment in which the load data from the past seven days were used as the model input. In the experiment, data from the coast area of the USA were used, and the historical electricity demand of consumers was considered. According to the experimental results, DeepEnergy can precisely predict the energy load for the next three days. In addition, the proposed algorithm was compared with five AI algorithms commonly used in load forecasting. The comparison showed that the performance of DeepEnergy was the best among all tested algorithms, namely, DeepEnergy had the lowest values of both MAPE and CV-RMSE. According to all of the obtained results, the proposed method can reduce monitoring expenses, the initial cost of hardware components, and long-term maintenance costs in future smart grids. At the same time, the results verify that the proposed DeepEnergy STLF method has strong generalization ability and robustness, and can thus achieve very good forecasting performance.