Integration of Ensemble GoogLeNet and Modified Deep Residual Networks for Short-Term Load Forecasting

Abstract: Due to the strong volatility of the electrical load, together with the time-consuming training and overfitting problems of published forecasting methods, short-term electrical demand is difficult to forecast accurately and robustly. Given the excellent weight-sharing and feature-extraction capabilities of convolution, a novel hybrid method based on ensemble GoogLeNet and modified deep residual networks for short-term load forecasting (STLF) is proposed to address these issues. Specifically, an ensemble GoogLeNet with a dense block structure is used to strengthen feature extraction and generalization capability. Meanwhile, a group normalization technique is used to normalize the outputs of the previous layer. Furthermore, a modified deep residual network is introduced to alleviate the vanishing gradient problem and thereby improve the forecasting results. The proposed model is also adopted to conduct probabilistic load forecasting with Monte Carlo dropout. Two acknowledged public datasets are used to evaluate the performance of the proposed methodology. Multiple experiments and comparisons with existing state-of-the-art models show that this method achieves accurate prediction results, strong generalization capability, and satisfactory coverage for different prediction intervals, along with reduced operation times.


Introduction
Modern power systems are in a new era in which traditional and renewable energy generation modes coexist, the latter especially generated by wind and photovoltaic (PV) sources. Due to design codes [1], most installed PV capacity is from 1 to 30 MWp, while PV stations are generally connected at 10 or 35 kV, voltage levels that belong to distribution networks. The integration of new energy sources brings challenges not only to regional energy forecasts, but also to total energy demand prediction.
Accurate load forecasting plays an important role in ensuring the safe and stable economic operation of the power system, which is conducive to not only maintaining the normal survival function of a city, but also to optimizing the allocation of resources and alleviating the increasingly tense energy pressure. Short-term load forecasting (STLF) is the basic work of power enterprise planning, scheduling, dispatching, power consumption, and other departments.
In particular, STLF usually denotes a prediction with a lead time of one hour to a few days, and one-hour-ahead forecasting is the case most studied by researchers [2][3][4][5][6][7][8][9][10][11][12][13]. The STLF methodologies proposed in the literature can be divided into two categories, namely statistical and artificial intelligence (AI) methods. Many statistical approaches have been proposed in recent decades, such as multiple regression models [14], exponential smoothing [15], Kalman filters [16], and the autoregressive moving average method [17]. Nevertheless, these traditional methods cannot yield high accuracy when facing the inevitable large uncertainties in load data and volatile meteorological data. By contrast, with the development of machine learning, the performance of AI and hybrid methods outweighs that of statistical ones. Support vector regression (SVR), one of the most prevalent machine learning methods, has great generalization ability and allows for a rapid global solution; it does not perform well, however, when dealing with large datasets. The typical back-propagation neural network (BPNN) was proposed to handle large datasets due to its powerful mapping ability, but BPNN is easily trapped in a local optimal solution [18]. Meanwhile, gradient vanishing is more likely to occur as the network gets deeper. To address this, Chen et al. proposed a deep residual neural network [19] to forecast the 24 h load of the next day using eleven inputs, including historical load and temperature and one-hot codes of weekday, season, and festival. Another problem with BPNN is that the intrinsic characteristics of the time series are neglected. To resolve this, the recurrent neural network (RNN) [20] was proposed to take into account the short memory of previous time series. The disadvantages of gradient explosion and vanishing, however, restrict the RNN's scalability.
To make up for this deficiency, long short-term memory (LSTM) was obtained by adding a state vector and introducing a gate mechanism to the RNN. A hybrid model [3] combining multivariable linear regression (MLR) and an LSTM neural network forecasted low- and high-frequency components, respectively, based on ensemble empirical-mode decomposition (EEMD) technology. This model makes the most of the advantages of both parts: the LSTM captures the fine-grained, high-frequency components, while the MLR excels at fitting low-frequency time series. LSTM, however, is relatively complicated and results in high computing costs. Therefore, the gated recurrent unit (GRU), one of the most popular LSTM variants, was proposed to combine the state vector and output vector and to reduce the number of gates to two. The EGM model [21] combined EEMD, GRU, and MLR to forecast the load one hour ahead with only one variable, namely the historical load; it can accurately predict local characteristics with strong randomness. A hybrid model [22] combining variational mode decomposition and long short-term memory networks, taking relevant factors into consideration with Bayesian optimization algorithms, obtained high forecasting accuracy. An ensemble method [23] exploited the full wavelet packet transform to decompose features into various components and then fed them into a trained neural network; this method reduced the MAPE by 20% relative to traditional neural network methods. Reference [24] proposed a fuzzy theory-based machine learning model for workday and weekend short-term load forecasting, which enhanced forecasting stability and accuracy.
The previous methods take only temporal features into consideration, while spatial and shape characteristics can also affect the real-time load. The convolutional neural network (CNN) was developed into LeNet by Yann LeCun in 1989 to address issues in the field of computer vision, including handwritten digit recognition. The core concepts of the CNN are local correlation and weight sharing. A hybrid model [4] combined CNN, LSTM, and GRU, where the LSTM and GRU both have the ability to store short-term and long-term memories, while the CNN has a significant ability to extract features from the input variables. A CNN-LSTM neural network [25] was proposed to predict residential energy consumption; this network extracts temporal and spatial features from the input variables that influence energy consumption. An improved CNN [26] for probabilistic load prediction was put forward by discretizing the continuous load range. A hybrid model combining CNN with multiple wavelets [27] was proposed for deployment in practice. However, the problems of overfitting, together with huge computational effort and time consumption, remain. To resolve this, GoogLeNet was proposed as a new deep learning structure by Szegedy et al. in 2014; it extracts more characteristics with the same amount of computation, whereas previously proposed structures required numerous hidden layers to obtain decent results. A 22-layer GoogLeNet [28] was proposed and evaluated for classification and detection in context. GoogLeNet's inception structure strengthens the computing power of the entire network, and the whole model is lightweight. GoogLeNet has emerged as one of the most influential improved network structures, adopting an inception mechanism that applies multi-scale processing to images. Adopted as a forecasting methodology, GoogLeNet has also been utilized in various fields, such as wind power generation forecasting and sun coverage forecasting.
A GoogLeNet-based CNN [29] was proposed to forecast solar irradiance. However, GoogLeNet has weak extensibility and has rarely been adopted as a standalone forecasting model.

Contributions and Findings
In this paper, we aimed to expand the existing neural network structure of CNN by utilizing dense blocks, residual network structure, and realization techniques. We learned from the GoogLeNet [28] structure and proposed an end-to-end neural network instead of simply stacking convolutional and hidden layers.
The contributions of this work are threefold. First, a hybrid model consisting of an ensemble GoogLeNet and modified deep residual neural networks for short-term load forecasting is proposed. The proposed model involves no external variables except temperature and applies no complicated feature extraction technology except Pearson correlation analysis. It is the first time that GoogLeNet has been combined with dense blocks and a deep residual network structure. The input of each inception module comes from the concatenation of the previous inception modules' outputs and the inputs of the first inception module. Second, the output of the last inception module, together with one-hot codes, is fed to a deep residual network in order to enhance the fitting ability and to avoid gradient vanishing. Third, we obtain a probabilistic load forecast by applying Monte Carlo (MC) dropout to the proposed model trained for point forecasting. The Winkler score at different confidence levels and the pinball loss are calculated to estimate the performance of the probabilistic load forecasting.
To assess the model's generalization ability, two widely recognized datasets are utilized. Then, compared with various methods, including ANN + RF, HEFM, LWSVR-MGOA and WT-ELM-MABC, the proposed model obtained the best performance.
To evaluate the model's robustness, two measures were adopted. First, we added Gaussian noise with a 1 °F standard deviation to the observed temperature at the forecast time to see how the model performs with modified temperature. Second, we replaced the actual temperature with the temperature forecasted by the proposed model. Experiments using both measures showed that the proposed model is strongly robust against large temperature uncertainty.
The training time and forecasting performance of the various models are also compared.

Structure
The remainder of this article is organized as follows. In Section 2, we present the framework of the proposed model based on an ensemble GoogLeNet with dense block structure and modified deep residual neural networks. The input variable selection and Pearson correlation coefficient are also provided. In Section 3, the experiments of the STLF conducted by adopting the proposed model are illustrated. We then analyze the test results compared with other existing state-of-the-art methods. Section 4 concludes the paper.

Proposed Hybrid STLF Method
In this work, an hourly load forecasting model is proposed based on an ensemble GoogLeNet with dense block structure and deep residual neural networks. The fundamental structure of each subsection of the hybrid model is presented first. The inputs of the overall model are selected by correlation analysis and then processed prior to the introduction of the elementary structure. The preliminary forecast produced by the ensemble GoogLeNet is then passed through a modified deep residual neural network to improve prediction accuracy. During the training process, we found that some modifications, such as to the learning rate, should be made to enhance the model's learning power. The implementation and evaluation indices of probabilistic load forecasting by MC dropout are also presented.

Model Input Selection
The aim of this part is to appropriately select the historical time range of the inputs while reducing the amount of computation as much as possible. The dimension of the inputs depends on their historical time range and directly determines the complexity of the matrices produced during the computation of the network. The Pearson correlation coefficient, which measures the linear correlation between two time series, is applied to select the historical data width.
If the linear correlation between X and Y is strong, then the Pearson correlation coefficient is very close to 1. Sliding-window technology is utilized to obtain time-lagged data. Figure 1 shows the Pearson correlation coefficients between load data at different time lags, and Figure 2 shows the Pearson correlation coefficients between load and lagged temperature data at different time lags. The raw hourly load and temperature data used in Figures 1 and 2 are from ISO New England (ISO-NE) [30]. As seen from Figure 1, the periodicities in the raw load data are 24 h, seven days, 52 weeks, and 12 months, according to the different time lags. Larger time lags can reveal more information in the input time series, but require more time and computation. To avoid the parameter explosion resulting from too many inputs and to preserve the most influential information, only the previous 48 h of load data are taken into consideration. Temperature is a significant and influential external variable, yet the linear correlation between load and temperature data is weak, as can be seen from Figure 2, where the coefficient is less than 0.3. Nevertheless, high temperatures in summer and low temperatures in winter greatly increase the electricity demand of air conditioners, which accounts for a large proportion of energy consumption. As can be seen from Figure 3, there is a strong nonlinear correlation between daily load and daily temperature; thus, the previous 48 h of temperature data are also used to tune the prediction results. Note that the size of these input time variables is small.
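As a minimal illustration of this lag-selection step, the sketch below computes Pearson correlation coefficients between an hourly load series and its lagged copies. The synthetic series with a 24 h periodicity is an assumption for demonstration only, not the ISO-NE data:

```python
import numpy as np

def lagged_pearson(series, lag):
    """Pearson correlation between a series and its copy shifted by `lag` steps."""
    x = series[lag:]
    y = series[:-lag]
    return np.corrcoef(x, y)[0, 1]

# Synthetic hourly load with a 24 h periodicity (illustrative only).
hours = np.arange(24 * 60)
load = (100.0
        + 10.0 * np.sin(2 * np.pi * hours / 24)
        + np.random.default_rng(0).normal(0.0, 1.0, hours.size))

r24 = lagged_pearson(load, 24)  # in phase with the daily cycle -> strongly positive
r12 = lagged_pearson(load, 12)  # half a cycle out of phase -> negative
```

On real load data, the same routine applied over a range of lags would reveal the daily and weekly correlation peaks described above.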
Additionally, the objective information concerning time, including weekday, weekend, season, holiday and nonholiday, is turned into one-hot codes which are added as binary inputs to assist the model in extracting the periodic and unusual time features of electric demand time series.
Linear interpolation is applied in the preprocessing stage to cope with missing data. The data are then normalized as given in the following formulas:

Norm_Load = (Load − Load_min) / (Load_max − Load_min), (1)

Norm_T = (T − T_min) / (T_max − T_min), (2)

where Norm_Load and Norm_T denote the normalized load and temperature data, respectively.
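A minimal preprocessing sketch of these two steps, assuming min-max scaling into [0, 1] (the exact normalization constants used in the experiments are not specified here):

```python
import numpy as np

def fill_missing_linear(x):
    """Linearly interpolate NaN entries in a 1-D array (missing-data handling)."""
    x = np.asarray(x, dtype=float).copy()
    nans = np.isnan(x)
    x[nans] = np.interp(np.flatnonzero(nans), np.flatnonzero(~nans), x[~nans])
    return x

def min_max_normalize(x):
    """Scale a series into [0, 1] by min-max normalization."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

load = fill_missing_linear([100.0, np.nan, 120.0, 130.0])  # gap filled with 110.0
norm_load = min_max_normalize(load)                        # [0, 1/3, 2/3, 1]
```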

GoogLeNet Architecture
A great number of improved network structures for image processing were developed after 2012, when AlexNet was first introduced to implement image classification tasks. The original intention of GoogLeNet was to solve two problems: (1) overfitting in the circumstance of too many parameters, and (2) increased calculation [8] due to a sharply increasing number of layers.
GoogLeNet consists of two different types of layers. First, the convolution layer plays a critical role in extracting various features from the input data. The convolution layer preserves the spatial relationships hidden in the data by computing features over small regions of the input; each input datum is viewed as a matrix. The size of the small region depends directly on the filter, which is also a matrix applied to the input data. The output of the convolution layer is a matrix, namely a feature map, obtained by sliding the filter over the entire input matrix and computing a value at each window position. After the convolution operation, a padding operation is also needed for successful concatenation. Second, the pooling layer retains the most informative features while decreasing the dimension of the input representations. Pooling layers are of three types: max pooling, average pooling, and sum pooling. In this work, we adopt max pooling, where the largest value is taken in each sliding window. The pooling operation makes the input of the next layer smaller, reduces the number of parameters and the amount of calculation, and mitigates overfitting.
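The convolution and max-pooling operations described above can be sketched in a few lines. This is a toy 1-D example with an assumed difference filter, not the model's actual layers:

```python
import numpy as np

def conv1d_valid(x, w):
    """'Valid' 1-D convolution (cross-correlation): slide filter w over input x."""
    n = len(x) - len(w) + 1
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(n)])

def max_pool1d(x, size):
    """Non-overlapping max pooling: keep the largest value in each window."""
    n = len(x) // size
    return np.array([x[i * size:(i + 1) * size].max() for i in range(n)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
feature_map = conv1d_valid(x, np.array([1.0, -1.0]))  # local difference filter
pooled = max_pool1d(x, 2)                             # halves the length
```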
The core idea of GoogLeNet lies in the inception mechanism, which has been combined with bidirectional LSTM to predict the 48 points of the next day [31]. First, the current simple representative of the inception module is limited to three filter sizes, namely 1 × 1, 3 × 3, and 5 × 5; this choice was made out of convenience rather than necessity. In addition to the convolution layers, a simple inception module contains a max pooling layer with a 3 × 3 filter, as it is essential for the success of current convolutional networks, as seen in Figure 4a. The input can be processed by these different operations in parallel. The module then integrates all the parallel branches by concatenating their outputs into a single output vector that serves as the input of the subsequent layer.
Second, 1 × 1 convolutions elegantly reduce the dimension wherever the computational requirements would otherwise become too great. This idea comes from the success of embeddings: 1 × 1 convolutions are used to reduce computation before the 3 × 3 and 5 × 5 convolution layers. In addition to the dimension reductions, the adoption of the rectified linear unit (ReLU) activation has also improved the performance of the inception module.
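A quick parameter count illustrates why the 1 × 1 bottleneck reduces computation. The channel sizes used here (256 in, 32 bottleneck, 64 out) are illustrative assumptions, not the paper's configuration:

```python
def conv_params(in_ch, out_ch, k):
    """Number of weights in a k x k convolution layer (biases ignored)."""
    return k * k * in_ch * out_ch

# Direct 5x5 convolution: 256 -> 64 channels.
direct = conv_params(256, 64, 5)                              # 409,600 weights

# 1x1 bottleneck down to 32 channels first, then the 5x5 convolution.
bottleneck = conv_params(256, 32, 1) + conv_params(32, 64, 5)  # 59,392 weights
```

Under these assumed sizes, the bottleneck version needs roughly 7x fewer weights for the same input and output shapes.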
Specifically, the ReLU has the form of Equation (3):

ReLU(x_i) = max(0, x_i), (3)

where x_i is the input of the i-th node of a layer. The modified inception module is depicted in Figure 4b. The inputs of the inception module derive from the preprocessed data, as described in Section 2.1.
Instead of simply stacking inception modules layer by layer, we modified the connections between different inception modules based on the structure of the dense block. Dense blocks originate in the DenseNet structure, whose advantages include alleviating gradient vanishing and strengthening the transmission of features. In a traditional CNN, the number of connections equals the number of layers L, whereas in DenseNet it is L(L + 1)/2. In general, the input of a layer is the concatenation of the outputs of all previous layers. A transition layer between different inception modules is adopted to keep the number of features from growing as more inception modules are added. During the training of a neural network, the parameters of each layer keep changing across iterations, which continuously changes the input distribution of the next layer; this is the problem of internal covariate shift. The shifting input distributions require adjusting the learning rate and carefully initializing the weight parameters during the iterative process. To address this problem, batch normalization (BN) normalizes the output of the previous layer, which tends to accelerate training. Nevertheless, BN's error is very sensitive to the batch size. Group normalization (GN) is adopted instead due to its independence of the batch size and its satisfactory accuracy. GN converts the data into data with zero mean and unit variance. The fundamental conversion formula of normalization methods, including BN and GN, is:

X'_i = (X_i − μ_i) / σ_i, (4)

where X'_i denotes the normalized output of the previous layer, μ_i represents the mean value of vector X_i, and σ_i stands for the standard deviation (std) of vector X_i. The details of computing μ_i and σ_i are illustrated in Appendix A.
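A NumPy sketch of group normalization over a single (channels × length) feature map, computed independently of any batch dimension. The group count and tensor shape are illustrative:

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Group normalization over a (channels, length) feature map.

    Channels are split into groups; the mean and std are computed per group,
    so the result does not depend on the batch size (unlike batch norm)."""
    c, l = x.shape
    g = x.reshape(num_groups, (c // num_groups) * l)
    mu = g.mean(axis=1, keepdims=True)
    sigma = g.std(axis=1, keepdims=True)
    return ((g - mu) / (sigma + eps)).reshape(c, l)

x = np.random.default_rng(1).normal(5.0, 2.0, size=(8, 16))
y = group_norm(x, num_groups=2)  # each group now has ~zero mean, ~unit variance
```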

Deep Residual Neural Network
The layer-number fallback mechanism is implemented by the residual neural network shown in Figure 5a, which adds a shortcut between the input and output of the network and improves the prediction accuracy of the overall deep learning model. As shown in Figure 5a, instead of training a direct mapping from X to H(X), the network learns a residual mapping G(X), and the output of the residual block becomes

H(X) = G(X) + X. (5)

Note that G(X) and X must possess the same dimension.
A deep residual neural network can readily be obtained by stacking ample residual blocks. Simply stacking residual blocks, however, inevitably limits the optimization ability. Thus, residual networks of residual networks (RoR) [32], a novel residual network structure, was proposed to improve the optimization ability of residual networks. A three-level residual network, RoR-3, illustrated in Figure 5b, contains a shortcut above all residual blocks, a shortcut above each residual block group (which holds two residual blocks), and a shortcut inside each residual block; these are referred to as the first-level shortcut, the second-level shortcut, and the inner-level shortcut, respectively. Specifically, if the layer in the first-level shortcut is set to 1 and K residual block groups are stacked, the forward propagation can be described by

X_K = X_0 + Σ_{i=1}^{K} G_i(X_{i−1}), (6)

where X_0 is the input of the residual neural network and X_i is the output of the i-th residual block group. The total loss of the residual network back-propagated to X_0 can be obtained by

∂Q/∂X_0 = ∂Q/∂X_K · (1 + ∂/∂X_0 Σ_{i=1}^{K} G_i(X_{i−1})), (7)

where Q denotes the total loss of the RoR network. The "1" in Equation (7) implies that the gradient at the final layer can be back-propagated directly to the input layer, which indicates that the gradient vanishing problem can likely be alleviated.
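The forward propagation with identity shortcuts can be sketched with toy residual branches. The linear-ReLU branch G and its weights below are illustrative assumptions, not the model's trained layers:

```python
import numpy as np

def residual_block(x, weight):
    """One residual block: output = G(x) + x, with G a toy linear-ReLU branch."""
    g = np.maximum(0.0, weight * x)  # residual branch G(X)
    return g + x                     # identity (inner-level) shortcut

def ror_forward(x0, weights):
    """A stack of residual blocks plus a first-level shortcut from the input X0."""
    x = x0
    for w in weights:
        x = residual_block(x, w)
    return x + x0                    # first-level shortcut over all blocks

x0 = np.array([1.0, -2.0, 3.0])
out = ror_forward(x0, [0.5, 0.25])
```

Because every output is the branch output plus an unmodified copy of its input, the gradient always has a direct identity path back to X0, which is the mechanism behind the "1" term discussed above.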

Proposed Model Structure
The diagram of the hybrid model proposed in this work can be seen in Figure 6. As shown, we use the hourly electrical load data and one-hot codes, together with hourly temperature data, as the input data sources of our model. First, preprocessing is conducted on the electrical load and temperature data, which may contain singular values due to measurement errors of the electrical devices. In the next stage, the corrected dataset is normalized to facilitate model learning.
Second, the load and temperature values of the most recent 48 h are delivered to the ensemble GoogLeNet. The input of each modified GoogLeNet unit is the concatenation of the outputs of all previous modified GoogLeNet units and the input of the ensemble GoogLeNet. The output of the ensemble GoogLeNet with dense block structure is concatenated with the one-hot codes of the weekday, weekend, season, and holiday/nonholiday distinctions and then passed through the fully connected (FC) layers.
Last, the output of the FC layers is viewed as the input of the deep residual networks, which contain a right and a left column of residual blocks. The input of the right column of residual block groups is the output of the FC layers (except for the second group in the right column, whose input is connected to the first blue spot in the left column). The outputs of the residual blocks in the same layer are averaged (represented by the gray spots on the right). The output of each gray spot is attached to all blue spots in the following layers. The input of the first residual block of the left column is also the output of the FC layers. Beginning from the second layer, the input of each residual block in the left column is acquired by averaging all links from the gray spots together with the connection from the output of the FC layers. The output of the modified deep residual network is the predicted hourly load. It is believed that the added column of residual block groups, together with the shortcut connections, can improve forecasting ability and alleviate the gradient vanishing problem of the back-propagated error.

Probabilistic Load Forecasting
Probabilistic load forecasting is likely to have more practical significance than point forecasting. Owing to the uncertainties of both the model and the data, the proposed model can also be used for probabilistic load forecasting. In this work, MC dropout [33] is adopted to quantify the prediction uncertainty. Generally speaking, dropout is a technique that randomly drops units with probability p in a neural network. In practice, this amounts to performing M stochastic forward passes through the network and averaging the results. We use the training dataset containing x' and y' to acquire the hidden knowledge in f^W(·), an unknown neural network with parameters W to be trained. The prediction uncertainty [19] can be quantified as follows:

Var(y) = Var(f^W(x')) + σ² ≈ (1/M) Σ_{m=1}^{M} (ŷ'_m)² − ȳ² + σ², (8)

where M denotes the number of dropout test passes, ŷ'_m is the m-th observed output, ȳ denotes the mean of all M outputs of the test dataset corresponding to each hour, and W is the trained parameters. The first equality in Equation (8) derives from the canonical conditional variance formula. The term σ² measures the inherent noise in the dataset, which can be estimated on a validation dataset.
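A toy sketch of MC dropout: dropout stays active at test time, M stochastic passes are averaged, and the spread of the passes estimates the model-uncertainty term. The tiny linear "network", dropout rate, and weights below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

def stochastic_forward(x, w, p=0.1):
    """One forward pass with dropout kept active (each input unit dropped w.p. p)."""
    mask = rng.random(x.shape) >= p
    return ((x * mask) / (1.0 - p)) @ w   # inverted-dropout scaling

def mc_dropout_predict(x, w, m=500):
    """M stochastic passes -> predictive mean and model-uncertainty variance."""
    outs = np.array([stochastic_forward(x, w) for _ in range(m)])
    y_bar = outs.mean(axis=0)
    model_var = (outs ** 2).mean(axis=0) - y_bar ** 2  # (1/M) sum y_m^2 - y_bar^2
    return y_bar, model_var

x = np.array([1.0, 2.0, 3.0])
w = np.array([[0.2], [0.4], [0.6]])
y_bar, model_var = mc_dropout_predict(x, w)  # mean near the deterministic 2.8
```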
After obtaining the prediction uncertainties, the comprehensive standard deviation is obtained:

σ_t = sqrt( α · Var(f^W(x')) + β · σ² ), (9)

where α and β are parameters to be estimated; α represents the share of the comprehensive standard deviation occupied by model uncertainty, while β indicates the share occupied by data uncertainty. The details of the computation of Var(f^W(x')) and σ² are illustrated in Appendix B. When comparing methods with equally accurate coverage, we prefer the method that yields the tightest intervals. The generally acknowledged score formula proposed by Winkler, namely the Winkler score [34], jointly evaluates coverage and interval width. For a central (1 − α) × 100% prediction interval, the Winkler score is defined as follows:

W_t = δ_t, if L_t ≤ y_t ≤ U_t;
W_t = δ_t + 2(L_t − y_t)/α, if y_t < L_t; (10)
W_t = δ_t + 2(y_t − U_t)/α, if y_t > U_t;

where L_t and U_t denote the lower and upper bounds, respectively, y_t is the actual value at time t, and δ_t = U_t − L_t is the interval width. Here α corresponds to the confidence level of the interval.
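The Winkler score described above can be written directly as a small function (the variable names are ours):

```python
def winkler_score(y, lower, upper, alpha):
    """Winkler score of one observation for a central (1 - alpha) interval."""
    delta = upper - lower                     # interval width
    if y < lower:
        return delta + 2.0 * (lower - y) / alpha  # width plus miss penalty
    if y > upper:
        return delta + 2.0 * (y - upper) / alpha  # width plus miss penalty
    return delta                              # covered: width only

inside = winkler_score(100.0, 90.0, 110.0, alpha=0.1)  # covered -> 20.0
below = winkler_score(85.0, 90.0, 110.0, alpha=0.1)    # missed -> 120.0
```

Covered observations are scored by interval width alone, so among methods with equal coverage the tightest intervals win, exactly the selection criterion stated above.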
The pinball loss function [34] is an error measure for quantile forecasts. The pinball loss is obtained as:

Pinball(y_t, ŷ_{t,q}) = (q/100)(y_t − ŷ_{t,q}), if y_t ≥ ŷ_{t,q};
Pinball(y_t, ŷ_{t,q}) = (1 − q/100)(ŷ_{t,q} − y_t), if y_t < ŷ_{t,q}; (11)

where y_t is the actual load at time t, q denotes the quantile (q = 1, 2, ..., 99), and ŷ_{t,q} is the quantile forecast at the q-th quantile. The pinball loss of the complete probabilistic load forecast is obtained by summing the pinball losses of all quantile forecasts.
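Likewise, the pinball loss at a single quantile can be sketched as:

```python
def pinball_loss(y, y_hat_q, q):
    """Pinball loss at quantile q (1..99) for one observation."""
    tau = q / 100.0
    if y >= y_hat_q:
        return tau * (y - y_hat_q)          # under-forecast penalty
    return (1.0 - tau) * (y_hat_q - y)      # over-forecast penalty

# At the 90th quantile, over-forecasting is penalized much less than
# under-forecasting by the same amount:
over = pinball_loss(100.0, 110.0, q=90)   # forecast 10 above the actual
under = pinball_loss(100.0, 90.0, q=90)   # forecast 10 below the actual
```

The asymmetric penalty is what makes the loss minimized by the true q-th quantile rather than by the mean.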

Results and Discussion
The models in each experiment are trained with the Adam optimizer using the default parameters mentioned in [35]. The models are implemented with Keras 2.2.4 with TensorFlow 1.11.0 as the backend in a Python 3.6 environment. Note that adaptive adjustment of the learning rate is used during the training process. Two different public datasets are utilized, both of which contain hourly load and temperature data. The first is the ISO-NE dataset [30] from New England, covering March 2003 to December 2014. The second is the North American Utility dataset [36], covering 1 January 1985 to 12 October 1992.

Comparison Criteria
Mean absolute percentage error (MAPE), root mean square error (RMSE) [2], R-Square, and mean absolute error (MAE) are among the most significant indices for comparing the results of various models on STLF problems. They are defined as follows:

MAPE = (100/N) Σ_{t=1}^{N} |actual_t − pred_t| / actual_t, (12)

RMSE = sqrt( (1/N) Σ_{t=1}^{N} (actual_t − pred_t)² ), (13)

MAE = (1/N) Σ_{t=1}^{N} |actual_t − pred_t|, (14)

R-Square = 1 − Σ_{t=1}^{N} (actual_t − pred_t)² / Σ_{t=1}^{N} (actual_t − mean(actual))², (15)

where mean(actual) denotes the average of the actual time series.
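These four indices can be written compactly as follows (the data values are illustrative only):

```python
import numpy as np

def mape(actual, pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * np.mean(np.abs((actual - pred) / actual))

def rmse(actual, pred):
    """Root mean square error."""
    return np.sqrt(np.mean((actual - pred) ** 2))

def mae(actual, pred):
    """Mean absolute error."""
    return np.mean(np.abs(actual - pred))

def r_square(actual, pred):
    """Coefficient of determination: 1 minus residual over total sum of squares."""
    ss_res = np.sum((actual - pred) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

actual = np.array([100.0, 200.0, 300.0])
pred = np.array([110.0, 190.0, 300.0])
```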

Experiment 1: Decision of the Filters Parameter
Two inception modules are adopted in the ensemble GoogLeNet, and the filter sizes are 1 × 1, 3 × 3, and 5 × 5. The ISO-NE data of 2014 are used as the test set, and the previous five years of data are used as the training set. Ten percent of the training set is used as a validation set to tune the hyperparameters. The North American data of the one-year period before 12 October 1992 are utilized as the test set, and the data ranging from 12 October 1986 to 12 October 1991 are treated as the training set; ten percent of this training set is likewise used as a validation set. The magnitudes of the load data in the ISO-NE and North American datasets are on the order of 1 × 10^4 and 1 × 10^3, respectively. The inputs of both tests contain two days of load and temperature data. We change the learning rate of the first training phase from 0.001 to 0.0012 in the second test on the North American dataset. The remaining hyperparameters are unchanged except for the filters parameter of the Conv1D layers, which denotes the output dimension of Conv1D.
As illustrated in Table 1, the method is sensitive to the filters parameter of each convolutional layer. For example, with 64 filters, the MAPE on the North American test set is 1.41 for the year 1992, a deviation of 0.84 from the result with 128 filters. The MAPE on the ISO-NE dataset is lowest when the number of filters is set to 128. Figure 7 illustrates the prediction results for two different weeks in 2014 for the ISO-NE dataset. The larger the filters parameter of each convolutional layer, the more trainable parameters there are, and the more features hidden in the dataset are likely to be extracted. Once the number of filters exceeds a certain value, however, overfitting easily occurs and the performance on the test set worsens.
As seen from Table 1, the prediction performance worsens when the filters parameter becomes 128 for the North American Utility dataset. The results also show that the proposed hybrid model with ensemble GoogLeNet and deep residual networks has strong generalization ability. Consequently, the filters parameter is set to 128 for the ISO-NE dataset and 64 for the North American Utility dataset. The subsequent experiments are based on this criterion.

Experiment 2: Validation of the Integrated Method
To demonstrate the effectiveness of the integrated method, we compared the performance of the proposed method with the typical GoogLeNet and the ResNetPlus [19] models.
The ISO-NE data are split into a training set, a validation set, and a test set. The five years of data before the test data are used to train the three models, with ten percent of the training set adopted as a validation set to adjust the hyperparameters. We tested the three models on the test set containing the 2014 data and obtained the results depicted in Table 2. The proposed model shows the best fitting ability and reaches the highest R-Square, namely 0.9898. Meanwhile, the proposed model obtains lower values of MAE (168.47 MW) and RMSE (271.23 MW) than the other two baseline models. The MAPE of the proposed model is also reduced by 0.23% with respect to the typical GoogLeNet model.
Additionally, the monthly comparison of R-Square, MAE, RMSE, and MAPE for the typical GoogLeNet, ResNetPlus [19], and the proposed model is given in Table 3. Specifically, the average R-Square for ResNetPlus [19] is 0.971, which the proposed model improves by 0.014. The MAE and RMSE of the proposed model fluctuate from 139.1 to 228.9 and from 189.3 to 546.3, respectively, and the upper and lower bounds of these fluctuation intervals are lower than those of the typical GoogLeNet and ResNetPlus. In addition, the proposed method provides the lowest MAPE among the tested models. Therefore, the integration of ensemble GoogLeNet and the modified residual network is shown to be more effective for STLF tasks.
In this experiment, the typical GoogLeNet performs comparably to ResNetPlus. The features extracted from the input matrix by GoogLeNet, which consists of convolutional layers, represent the temporal and spatial relationships hidden in the load and temperature time series. These hidden relationships, together with other temporal information, serve as inputs of the deep residual networks, which enhance the learning ability of the architecture. The better performance thus demonstrates the superiority of the integrated model.

Experiment 3: Comparison with Existing Models
Test 1: In this test, ANN + RF [37], HEFM [37], and LWSVR-MGOA [38] are considered for comparison with the proposed methodology. To evaluate the validity of the proposed model, historical ISO-NE data from January 2013 to July 2015 are used. A winter month (January 2015) and a summer month (July 2015) are tested. The results in Table 4 imply that the MAPEs of the proposed model decrease by 19.28% and 17.17% in the winter month and the summer month, respectively. Notably, the load in the summer month is difficult to forecast due to the effect of other meteorological factors, such as humidity and wind speed.
Test 2: This test forecasts daily loads of the ISO-NE dataset for the year 2006. The training set spans 1 March 2003 to 31 December 2005. The WT-ELM-MABC model proposed in [39] and the ResNetPlus model proposed in [19] are also trained on the 2003-2005 data. The MAPE of each month in 2006 for the three models is listed in Table 5. The WT-ELM-MABC model produced better results than the proposed model for four months; however, the overall MAPE of the proposed model is reduced by 3.11% on the ISO-NE dataset.

Experiment 4: Robustness Estimation
Actual temperature data were used in the previous experiments. In this experiment, we use the North American Utility dataset to validate the robustness of the proposed model. The training and test sets are divided as in [8]. The data cover 1 January 1988 to 12 October 1992. To evaluate the proposed model, the hourly loads of the two-year period before 12 October 1992 are predicted, with all data prior to the current forecasting point used for training. A validation set comprising 10% of the training set is used to tune the hyperparameters of the proposed model. Gaussian noise with a standard deviation of 1 °F is added to the measured temperature, as in [8,19,39,40], to emulate the predicted temperature. The results are compared in Table 6: the MAPE of the proposed model is smaller than that of the other three available methods on the given test. The proposed strategy reduces the MAPE by at least 4.91% and 6.34% in the two scenarios, while its MAPE increases by only 0.315% when the actual temperature is replaced by the noisy temperature.
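The temperature perturbation used in this robustness test can be sketched as follows (a minimal illustration; the function name and seeding are our own, and the paper only specifies the 1 °F standard deviation):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_temperature(actual_temp_f, std_f=1.0):
    """Perturb measured temperature (°F) with zero-mean Gaussian noise
    to emulate an imperfect temperature forecast, as in the robustness
    test; std_f = 1.0 matches the 1 °F standard deviation used here."""
    return actual_temp_f + rng.normal(0.0, std_f, size=np.shape(actual_temp_f))
```

Feeding the model this perturbed series instead of the measured one reproduces the scenario in which only a temperature forecast, not the true temperature, is available at prediction time.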
The actual temperature can also be replaced in a second way: we use the temperature predicted for the forecasting time instead of the actual temperature. One-hot codes for seasons, months, and hours serve as inputs of the temperature predictor, which is a slightly simplified and modified version of the proposed load forecasting model. The temperature forecasting model achieves RMSE = 2.213 and MAPE = 3.723%, as illustrated in Figure 8. The predicted temperature then replaces the actual temperature, and the resulting MAPE for the load forecast is 1.589%, which is comparable to the 1.587% obtained with the actual temperature. It is therefore reasonable to conclude that the proposed method is strongly robust against temperature uncertainty.
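The one-hot calendar inputs of the temperature predictor can be encoded as in the following sketch. The paper does not specify the exact feature layout, so the ordering, sizes, and helper names here are assumptions:

```python
import numpy as np

def one_hot(index, size):
    """Return a length-`size` vector with a single 1 at `index`."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

def calendar_features(season, month, hour):
    """Hypothetical calendar encoding: one-hot season (4), month (12),
    and hour-of-day (24), concatenated into one 40-dimensional vector.
    `season` is 0-3, `month` is 1-12, `hour` is 0-23."""
    return np.concatenate([
        one_hot(season, 4),
        one_hot(month - 1, 12),
        one_hot(hour, 24),
    ])
```

One-hot codes avoid imposing an artificial ordering on categorical calendar variables, which is why they are a common choice for such inputs.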

Experiment 5: Training Time Comparison
In this experiment, we train the model on ISO-NE data under the same conditions, with a time span of six years (2004 to 2009). Various models listed in [41], including modified ELM, modified SVR, modified error correction (ErrCor), modified improved second order (ISO), and ResNetPlus [19], are compared with the proposed model. The training times of these state-of-the-art models are compared in Table 7. We also consider the MAPEs of the years 2010 and 2011 for these methods. The training time of the proposed method is reduced by 44.54% compared with the modified ELM, and its MAPEs are reduced by 3.07% and 7.38% for 2010 and 2011, respectively. The results show that the proposed model not only needs less time to train but also achieves better performance than the other five published models.

Experiment 6: Probabilistic Load Forecasting
We used the ISO-NE dataset to illustrate probabilistic load forecasting with the MC dropout technique. A dropout rate of p = 0.1 is applied to the previously proposed model, except for the inception module, input layer, and output layer. The parameters in (9) are estimated by the proposed hybrid model, trained for 30 epochs (with M = 30 in Formula (8)); the two estimated parameter values are 0.78 and 0.47. An instance of the 95% coverage rate for two different weeks in 2014 is shown in Figure 9. The tested coverage rates produced by the proposed method for various z-scores, which stand for the number of standard deviations, are listed in Table 8. The results show that the proposed model gives satisfactory tested coverages for different intervals.
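The core of the MC dropout procedure, keeping dropout active at inference and aggregating M stochastic forward passes into a predictive mean and standard deviation, can be illustrated with a toy stand-in model. The real model here is the hybrid network; `toy_forward` below is only an assumed placeholder used to make the sketch runnable:

```python
import numpy as np

rng = np.random.default_rng(7)

def mc_dropout_forecast(forward, x, m=30):
    """Run m stochastic forward passes (dropout left ON) and return the
    predictive mean and standard deviation across the passes."""
    samples = np.stack([forward(x) for _ in range(m)])
    return samples.mean(axis=0), samples.std(axis=0)

# Toy stand-in for the hybrid network: a linear map whose weights are
# randomly dropped (inverted dropout, p = 0.1) on every call.
W = np.ones(3)
P = 0.1
def toy_forward(x):
    mask = rng.random(W.shape) >= P
    return x @ (W * mask) / (1.0 - P)

mu, sd = mc_dropout_forecast(toy_forward, np.ones((5, 3)), m=200)
```

A (1 - alpha) prediction interval is then formed as mu ± z * sd for the chosen z-score, which is how the coverage rates in Table 8 are obtained.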
In addition, as indicated in Formula (10), the Winkler scores for the ISO-NE dataset in this experiment are 16.7 (α = 0.5) and 28.7 (α = 0.1). From Formula (11), the pinball loss obtained is 1.828. Both indices can be regarded as benchmark performance for this dataset.
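Both interval-quality indices can be computed with their standard definitions, sketched below. Formulas (10) and (11) in the paper may differ in averaging or normalization, so this is an illustration of the standard forms rather than a reproduction of the paper's exact formulas:

```python
def winkler_score(y, lower, upper, alpha):
    """Winkler score of a central (1 - alpha) interval [lower, upper]:
    the interval width, plus a penalty of (2 / alpha) times the amount
    by which the observation y falls outside the interval."""
    score = upper - lower
    if y < lower:
        score += 2.0 / alpha * (lower - y)
    elif y > upper:
        score += 2.0 / alpha * (y - upper)
    return score

def pinball_loss(y, q_hat, q):
    """Pinball (quantile) loss of the quantile forecast q_hat at
    probability level q for the observation y."""
    return q * (y - q_hat) if y >= q_hat else (1.0 - q) * (q_hat - y)
```

Lower values of both indices indicate sharper, better-calibrated probabilistic forecasts, which is why they serve as the benchmark figures quoted above.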

Conclusions
A novel STLF model based on ensemble GoogLeNet and modified deep residual neural networks was proposed in this paper. The ensemble GoogLeNet with dense block structure and the modified deep residual neural networks provided high accuracy and suitable generalization ability. Various experiments conducted on two commonly accepted public datasets proved the validity of the methodology. The selection of the parameter "filters" was studied for the ISO-NE and North American Utility datasets. Comparisons with existing models revealed that the proposed methodology outperforms them in both prediction accuracy and robustness. Furthermore, the time spent in the training process was largely reduced compared with other models. It is also worth noting that the proposed methodology can be applied directly to probabilistic forecasting and achieves satisfactory results.
Much further work remains. Since we have only briefly considered deep neural networks, other deep architectures (e.g., LSTM or GRU) could be combined with the model to strengthen its performance. The implementation of probabilistic load forecasting can be investigated further. Moreover, more meteorological variables can be taken into consideration to improve the prediction accuracy.

Conflicts of Interest:
The authors declare no conflicts of interest.

Appendix A: Computation of μ_i and σ_i
To address the issue of internal covariate shift, we adopt a GN layer, which is independent of the batch size. To explain the difference between the two methods, the feature map tensor is briefly illustrated in Figure A1, where (H, W) denote the spatial axes (height and width), C denotes the channel axis, and N denotes the batch axis. For each feature x_i, the mean and standard deviation are computed as

μ_i = (1/m) Σ_{k ∈ M_i} x_k, (A1)

σ_i = √( (1/m) Σ_{k ∈ M_i} (x_k − μ_i)² + ε ), (A2)

where ε denotes a small constant and M_i represents the set of feature-map indices over which the statistics are computed, whose size is m.
The main difference between the two methods lies in the definition of the set M_i. In GN, the set M_i is defined as

M_i = { k | k_N = i_N, ⌊k_C / (C/Z)⌋ = ⌊i_C / (C/Z)⌋ }, (A3)

where the hyper-parameter Z denotes the number of groups (Z = 32 or 64), C/Z is the number of channels in each group, and ⌊·⌋ is the floor operation. The second condition in (A3) means that the indices k and i fall in the same group of channels, presuming each group of channels is stored in sequential order along the C axis. In Figure A1, a simplified case of GN has 3 groups (Z = 3), each of which owns 2 channels.
In both normalization methods, a per-channel linear transformation is learned to compensate for a potential loss of representational capability:

y_i = γ x̂_i + β, (A4)

where x̂_i denotes the normalized feature, and γ and β are trainable scale and shift parameters.
Once M_i is determined by Equation (A3), the GN layer is defined by Equations (4), (A1), and (A4). The features in the same group are normalized together by the same μ_i and σ_i.
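The GN computation described by Equations (A1)-(A4) can be sketched in NumPy for an (N, C, H, W) tensor. This is a minimal illustration with the grouping of Equation (A3), not the paper's implementation:

```python
import numpy as np

def group_norm(x, z=2, eps=1e-5, gamma=None, beta=None):
    """Group Normalization over an (N, C, H, W) tensor: the C channels
    are split into z groups, and each group is normalized with its own
    mean and standard deviation (Eqs. (A1)-(A3)), independently of the
    batch size N.  gamma/beta apply the per-channel scale and shift of
    Eq. (A4) when provided."""
    n, c, h, w = x.shape
    assert c % z == 0, "channels must divide evenly into groups"
    g = x.reshape(n, z, c // z, h, w)
    mu = g.mean(axis=(2, 3, 4), keepdims=True)          # Eq. (A1)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    g = (g - mu) / np.sqrt(var + eps)                   # Eq. (A2)
    y = g.reshape(n, c, h, w)
    if gamma is not None and beta is not None:          # Eq. (A4)
        y = y * gamma.reshape(1, c, 1, 1) + beta.reshape(1, c, 1, 1)
    return y
```

Because the statistics are computed per sample and per group rather than across the batch axis N, the normalization behaves identically for any batch size, which is the property exploited here.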