Stacking Ensemble Methodology Using Deep Learning and ARIMA Models for Short-Term Load Forecasting

: Short-Term Load Forecasting is critical for reliable power system operation, and the search for enhanced methodologies has been a constant ﬁeld of investigation, particularly in an increasingly competitive environment where the market operator and its participants need to better inform their decisions. Hence, it is important to continue advancing in terms of forecasting accuracy and consistency. This paper presents a new deep learning-based ensemble methodology for 24 h ahead load forecasting, where an automatic framework is proposed to select the best Box-Jenkins models (ARIMA Forecasters), from a wide-range of combinations. The method is distinct in its parameters but more importantly in considering different batches of historical (training) data, thus beneﬁting from prediction models focused on recent and longer load trends. Afterwards, these accurate predictions, mainly the linear components of the load time-series, are fed to the ensemble Deep Forward Neural Network. This ﬂexible type of network architecture not only functions as a combiner but also receives additional historical and auxiliary data to further its generalization capabilities. Numerical testing using New England market data validated the proposed ensemble approach with diverse base forecasters, achieving promising results in comparison with other state-of-the-art methods.


Introduction
The large-scale and complex nature of Power Systems Operation in a constantly changing environment poses a series of challenges for policy makers, regulators, market operators and participants (both the generation and demand sides) and transmission and distribution operators, among others [1]. One of these challenges is the task of electrical power load forecasting. This is crucial for the smooth operation of the power grid-which runs on a set of tight tolerance requirements-and the decision-making policies of utility companies, i.e., influencing the purchase and generation of electric power in the power market and consequently affecting operation costs, energy efficiency, load switching and infrastructural development [2]. In this regard, the increased level of variable renewable energy sources directly rises the required spinning reserve volume to offset the fluctuation of renewable generation [3]. Factors such as an expanded network of distributed energy resources and a higher share of demand-side management further add to the complexity of this prediction task [4], which is already influenced by several uncertain factors including climate and seasonal factors [5,6].
Given that a large part of market decisions and utilities operations have an hourly, daily or weekly basis, these types of forecasting lead times (from 1 h to 168 h) are of most interest. An interval that is well categorized in the electric power system literature is the "Short-Term Load Forecast (STLF)" [7], which has long been a very active field of research, are compact in terms of user-dependence and computationally cost-efficient. To fulfill this objective, we chose an ensemble method based on a deep neural network, which due to its scalable and non-linearity traits is a great tool for enhancing feature extraction and better mapping the input space [33]. Thus, an ensemble approach has the potential to improve generalization and increase model stability (less output variance). In turn, different ARIMA models contribute to increasing input diversity through a stacked generalization framework. The straightforward formulation and the relatively small number of hyperparameters favors its robustness to model time-series based on the autocorrelation between past samples (only requires prior data), making it one of the preferred methods to benchmark proposed STLF models.
A brief description of the several methodological stages and its testing is listed: (i) a data selection process is used to assemble batches of training data with different sizes, ranging from recent to longer-term trends; (ii) a unit-root test is used to determine the need to include integration (number of time-series differences) in the ARIMA model parameters; (iii) after using the Box-Jenkins method, i.e., correlation analysis, a number of different seasonal and non-seasonal ARIMA models are fitted to the training data; (iv) the most promising models are hand-picked based on the Bayesian Information Criteria; (v) this process occurs automatically, a windowing technique is applied and the ARIMA model is rolled (updated) with the recently tested data; (vi) these models, which are able to capture most of the linear components in the time-series, work as diversified forecasters fed to the ensemble method in a stacking manner; (vii) the deep neural network is trained offline with not only the forecasters but also with additional load and exogenous variables with significant levels of Pearson correlation; (viii) the proposed methodology was tested using the well known aggregated load data from the New England region; (ix) the numerical results showed the validity of the proposed ensemble methodology, outperforming the standalone shallow neural network without and with the ARIMA forecasters and the exogenous inputs, and the standalone (different base learners) ARIMA models.
The content of this paper is summarized below. The most relevant theoretical background and formulation regarding the ARIMA models (forecasters) and the deep neural network architecture (assembler), including its non-linear computations, are given in Section 2; a detailed explanation of the different stages of the proposed methodology, as well as the considered parameter selection, is provided in Section 3; Section 4 introduces the case study and the underlying key statistics, which supported not only the design of the proposed methodology but also the analysis of results; Section 5 presents the most relevant numerical results and their interpretation, which includes a graphical analysis. Lastly, a summary of the main conclusions of this work is presented in Section 6.

Background
This section introduces the theoretical fundamentals behind each stage and building block considered in the proposed methodology.

Box-Jenkins Method
To study and model a generic time-series data X with a total of n length sequentially spaced time observations, such as X = (χ 1 , χ 2 , . . . , χ n t −1 , χ n t ), a simple and common choice is to use the Box-Jenkins method to identify and fit the most suitable Autoregressive Integrated Moving Average (ARIMA) time-series models. These statistical models constitute a good approach to tackle the forecasting task, as mentioned in the Introduction.
These models translate a linear relationship with the observed output data and rely on the assumption that future observations can be approximated based on past observations. Models that can be written using a single equation as expressed in Equation (1), with an autoregressive term, AR(p), where p is the autoregressive polynomial order and a moving average term, MA(q), where q is the moving average polynomial order, hence translating a linear combination of the preceding error terms up to lag q. Thus, the polynomial orders or degrees, p, q ∈ Z + , represent the maximum value for the AR and MA terms, respectively.
In turn, the integrated part is commonly represented by the letter d, which stands for the differencing order. In the case of d = 0, the ARIMA model is simply an ARMA model [34]: where µ denotes the constant term (vertical translation), χ t−i denotes the lagged (past) time-series (independent variable) and ε t−j represents the unknown lagged error terms, i.e., the random error term that accounts for the difference between the actual observation and the theoretical value given by the regression-based model, often referred as white Gaussian noise. Therefore, the actual observation, χ t , is influenced by factors beyond the independent variable. In turn, the autoregressive and moving average coefficients are denoted by ϕ i and θ i , respectively. Typically, when using the Box-Jenkins notation, a backshift operator, B, is used to represent the lagged terms such that B χ t = χ t−1 . Thus, the model in Equation (1) is re-written as demonstrated in Equation (2).
In addition, most of the time-series encountered in engineering problems, including traditional energy load forecasting, reveal nonstationary characteristics, meaning that the unconditional joint probability distribution changes with time (a non-constant mean and variance across observations). These traits in the time-series reveal the presence of trends and/or seasonal variations. Consequently, since the Box-Jenkins models assume stationarity, one approach is to obtain the first differences of the time-series, i.e., χ t − χ t−1 = (1 − B)χ t , which results in a general version of the ARIMA(p, d, q) model with an added generic difference term, as shown in Equation (3).
To determine the order at which the original time-series must be integrated to acquire stationarity, authors often resort to unit root tests to assess the stationarity hypothesis. For the vast range of time-series, when the unit root test determines that the data needs to be integrated, due to variance in the time-series, the first and second differences are usually ruled sufficient for removing the trends [10]. In this scenario, where the time-series reveals clear features of seasonality, e.g., daily, monthly and semesterly cycles, among others, it is advantageous to use a seasonal ARIMA(p, d, q) × (P, D, Q) (SARIMA) model [35]. The same letters but capitalized, P, D and Q, describe the seasonal orders of the autoregressive, moving average and differentiation terms; conversely, the lower-case letters represent the non-seasonal term orders. As such, the generic abbreviated form of the multiplicative model can be written as follows Equation (4): where the superscript s distinguishes the seasonal terms from the non-seasonal, whereas the seasonal autoregressive and moving average coefficients are denoted by Φ P and Θ Q , respectively. All the remaining parameters were already defined in Equations (1) and (2). These statistical models are especially tailored to handle linear problems, but the same cannot be said about hard non-linear problems, i.e., they do not follow the normal distribution assumption [36].
With the groundwork notation introduced, the next step concerns the question of how to identify ARIMA(p, d, q) order levels. To this end, the typical choice is to apply the well known Box and Jenkins method, which relies on the autocorrelation function (ACF), a measure of the mutual dependence between distinct lags, and the partial autocorrelation function (PACF) to draw inferences about an adequate model [25]. This last function is a measure of the relationship between the current observation and its lags excluding the effects of intermediate observations, which is useful to specify AR terms.

Deep Neural Networks
A Deep Neural Network (DNN) is a type of machine learning approach, with the particularity of having multiple (hidden) layers between the input and output layers (at least two). The term "deep" is used to symbolize the more complex structure of layers and nodes, which significantly increases the number of weights and bias terms, deriving a more abstract and high-dimensional feature mapping from the provided input and output data [37,38]. In the same fashion as its classic shallow ANN counterpart, it draws inspiration from human brain functioning, particularly the information flow in biological neurons [39]. Moreover, its ability to cope with non-linearities renders it a suitable tool to handle a vast range of engineering problems from pattern recognition and data classification to time-series modeling, etc.
A common architecture for the DNN is the Deep Feedforward Neural Network (DFNN), forming a directed acyclic graph as represented in Figure 1 meaning that the flow of information only proceeds forward, simplifying the backpropagation training process. In each DFNN layer, every neuron output is evaluated by the activation function f j , with a cost given by a biased weighted sum a j , which works as a threshold. Hence, one can state that for any two consecutive layers l − 1 and l, the neuron (node) response can be formulated as follows: where j ∈ ]0, m] is an index representing an arbitrary neuron response in layer l; and the i ∈ ]0, n] index denotes an arbitrary input node from the previous layer l − 1 (notice the flexibility of this formulation where the previous layer can either be an input layer or an adjacent hidden layer). Accordingly, n is the total number of neurons/input nodes in layer l − 1, and m is the number of neurons in the connecting layer l, with n and m ∈ Z + . y j is the output for neuron j; x i is the input signal from neuron i; b j is the respective neuron bias; and lastly ω ij is the weight of the synaptic connection between neurons i and j. This output response (5) is then one of the inputs for all the nodes of the proceeding layer in sequential form. Thus, one can compute the output response of all of the hidden layers in a DFNN, with an arbitrary number of L hidden layers, as a composite function g of the original input x (input layer), as denoted in Equation (6) [40]: where h (l) (x) denotes the output of an arbitrary hidden layer l (superscript), and a (l) (x) is an abbreviated form for the biased weighted sum for an arbitrary hidden layer l (preactivation cost).

Output layer Input layer
Hidden layers In the considered training process, the DFNN weights are updated by using an offline supervised strategy, where the weights are adjusted to minimize the error in the output layer with known target samples. The scaled conjugate gradient backpropagation algorithm was chosen to perform the training task given its balance between performance and speed, which is particularly important for deep networks.

Proposed Methodology
This section presents a comprehensive description of the proposed load forecasting methodology and discusses its implementation details. First, the proposed methodology is guided by the ensemble methods (learning) concept, where an improved predictive performance is expected when employing/combining multiple methods versus any of the individual methods. More specifically, this work is centered on a stacking ensemble type approach, which in a broad sense involves training a machine learning method to merge the predictions of some other methods [41]. To reinforce this point, the literature in the forecasting field documents report that when dealing with an unstable and varying data pattern, it is convenient to use dissimilar models to improve time-series forecasting accuracy [42]. Much of the reasoning behind the wide-adoption (as shown the Introduction section) of ensemble methods resides in the increased diversity among the base classifiers/forecasters. When using a stacking strategy, the meta-learning model (assembler) is used to integrate the output of base models [43].
Following this line of thought, a stacking ensemble approach is proposed in this work where a DFNN combines of the Box-Jenkins predictors. For better context and for learning purposes, additional input data from the load time-series and relevant related exogenous variables are also fed to the DFNN. An overview of the most relevant stages and main blocks that constitute the proposed methodology is given in Figure 2. A more thorough account of these stages and blocks is provided in the subsections below. Additionally, the pseudo code of the main implementation stages is given in Appendix A.

Prediction Models: ARIMA Forecasters
Given this ensemble approach to perform STLF, ARIMA models were chosen to serve as reference inputs for the combiner DFNN. This approach of coupling soft computing methods with ARIMA model has proven to increase accuracy, as reported in [36,42]. While ARMA models are unable to capture the non-linearity of the series, they can capture the linear traits of the time-series, which is worthy information to be fed as inputs to the assembler (non-linear) DFNN [44].
Before identifying and fitting the ARIMA models, a significant question concerns the suitable number of observations to estimate the models. In this regard, a commonsense rule is to have more observations than model parameters. This implies that, for example, in a SARIMA model as defined in Equation (4), a number of p + d + q + P + D + Q + 1 observations is required to estimate the model. However, when dealing with data exhibiting substantial random variation, this minimum requirement is insufficient, and observations must exceed the parameters several fold [45]. Hence, a hard rule can often be misleading given the underlying variability of the data. As such, instead of following a narrow approach, in this work, a conscious decision was made to assemble a pool of ARIMA models.
These differed from each other not only in parameters but also in the training sample window (number of observations used to fit the model). This procedure results in very different ARIMA models, where some will better capture the most recent trends (when the training window is shorter), while others with a longer training window (more samples) will provide a fuller picture of the past time-series behavior. Hence, in this work, five different training windows (in months) were selected, i.e., the number of past observations (training samples) used to fit the model was given by a specific number of prior months defined in the variable δ = [1,2,3,6,12].
For each selected training window, stationarity was evaluated through the Unit Root hypothesis using the Augmented Dickey Fuller (ADF) test. Based on the test's p-value, the methodology decides the order of integration. In our case study, the first differences proved to be sufficient, but with a 12-month training window, in some instances, the unit-root null hypothesis was rejected in favor of the alternative hypothesis (ARMA model is suitable). Having sorted the integration order d, the ACF and PACF functions were built, proceeding far as 15 days prior, and a list of the most promising AR and MA terms was established. With this task completed, all the parameter combinations were tested, and the respective ARIMA models were assembled together with their mirroring SARIMA models, where the seasonal P and Q orders depict daily and weekly seasonalities. This process of automatic model selection, based on the information provided by Unit Root Tests and correlation analysis, is frequently referred as "Automatic ARIMA" [46].
Having identified the several models identified, a Maximum likelihood estimation (MLE) algorithm was used to estimate the model coefficients, i.e., the coefficients that maximize the log-likelihood function. Subsequently, the Bayesian Information Criterion (BIC) was used to select the three best models for each training window (for a total of 15 ARIMA forecasters), by determining the set of model parameters ϕ p , Φ P , µ, θ q , Θ q that minimize the BIC value, as expressed in Equation (7): where k denotes the number of model parameters, n length is the number of observations and RSS is the residual sum of square, i.e., a measure of the variance in the error term of the model [47]. Notice that while all combinations were tested, this does not imply that if, for example, p = 24, all AR terms between 1 and p have non-null coefficients. On the contrary, since the model complexity is penalized, as observed in Equation (7), this reinforces the need to select models with strong correlations. This offline process is performed iteratively in an automatic manner, and its different stages are illustrated in Figure 3. The selected training window is only valid for testing in the subsequent week, and the forecasting performance is stored upon test completion. Thereafter, the most recent data (testing data) are added to the training window (batch), while the oldest data are removed, similarly to a Last In First Out procedure. Therefore, it ensures a training window (batch) with a fixed size. This windowing process is carried out until all testing data are evaluated and the respective ARIMA Forecasters are stored. The process concludes after analyzing all the training windows and their correspondent prediction models, producing the key prediction inputs for the ensemble method.
This offline process was performed iteratively in an automatic manner, and its different stages are illustrated in Figure 3. The selected training window is only valid for testing in the subsequent week, and the forecasting performance is stored upon test completion. Thereafter, the most recent data (testing data) were added to the training window (batch), while the oldest data were removed, as in a Last In First Out procedure. This ensures a training window (batch) with a fixed size. This windowing process was carried out until all the testing data were evaluated and the respective ARIMA Forecasters were stored. The process concluded after analyzing all the training windows and their correspondent prediction models, producing the key prediction inputs for the ensemble method.

Ensembler DFNN
As previously discussed in the Introduction and Section 2.2, DNNs are a great solution for solving a wide range of problems given their ability to explore and model deep nonlinear relationships.
With respect to the DFNN architecture, its input layer was built with 26 nodes that account for the following: 15 ARIMA forecasters (three best forecasts for each of the five training windows), ŷ 1 t δ 1 , . . . ,ŷ1 t δ 5 ,ŷ2 t δ 1 , . . . ,ŷ2 t δ 5 ,ŷ3 t δ 1 , . . . ,ŷ3 t δ 5 , which are stacked and fed to the ensemble method providing information about the expected linear trends; the six past lagged values of the actual load data (x t−24 , x t−25 , x t−26 , x t−32 , x t−48 , x t−168 ); four date/time related exogenous variables, namely hour, weekday, binary holiday indicator and season of the year; information about the time-sample of the 24 h ahead target load exog t hour , exog t weekday , exog t holiday , exog t season ; and finally, the lagged relative humidity information was also fed to the DFNN exog t−24 relHum . The addition of exogenous variables to forecast the load in the ensemble method is of crucial importance since electricity demand does not depend exclusively on the autocorrelation of its own data (endogenous). In fact, a comprehensive assortment of exogenous factors has an impact on the load [44].
Concerning the hidden layer(s) configuration and after performing a few trial simulations, it was found that a three hidden layer DFNN (meaning a total of five layers, including the output and input layers), with 20, 10 and 5 neurons in each hidden layer, respectively, was found to be a good fit. These hidden sizes ensured a good level of complexity and abstraction without using too many neurons and, consequently, weights, which can often result in overfitting. In this regard, note that an optimal paradigm for the definition of the best model does not exist and likewise for the number of neurons in the hidden layer(s). As such, the conventional path is to rely on past experiences and trial-and-error procedures.
The hyperbolic tangent sigmoid transfer function, σ(x) = 2 1+exp(−2x) − 1, was chosen as the activation function for the hidden layers. The DFNN flow ends in the output layer, with a single neuron, producing the 24 h ahead forecasting,ŷ t . In addition, a linear transfer function, out(x) = αx, was used in this layer, which is a very common choice for approximation and regression tasks.
With the architecture defined, the next stage was preprocessing (8) the input and target data, ensuring that the data always fell within the range [−1, 1] by using the equation below: where x norm is the preprocessed input sample; x is the original input sample; and x min and x max are the minimum and maximum values of all input samples. All the training samples are employed to form the learning batch; as such, a Batch Gradient Descent paradigm is considered in this work. This implies that all samples contribute to training and validation errors before updating DFNN weights and biases. The input and target data spanned over the period of 1 year prior to the testing of the trained ensemble methodology in the subsequent month. These datasets were divided randomly into training set and validation set by using a typical 70% to 30% ratio in order to improve the generalization capabilities.
This validation set provides a separate (unseen) set of representative data-samples that are used to perform an unbiased evaluation of the model fit (validation error), different from the training error, during the training phase (weights and biases update). When the validation error starts to decouple from the training error, i.e., does not decrease for several iterations, further continuation of the training process may invariably result in overfitting. This decoupling, therefore, constituted the main stopping criterion for DNN training even though a maximum number of learning iterations was also defined. Next, the essential training process of the ensemble DFNN was carried out. The scaled conjugate gradient (SCG) algorithm [48] was chosen to perform this supervised learning task given that no critical user-dependent parameters are needed, making it a fully automated algorithm. In addition, its step size scaling mechanism improves the speed of convergence by avoiding a time-consuming line search per iteration, which is a common calculation of other second order algorithms [10,49].
Furthermore, in order to improve generalization during the training process, a modified version was given by Equation (9), modPerfF, from the standard performance function using the mean squared error (MSE), and it was chosen by adding the mean of the sum of squares of the network weights and biases (MSW). This forces the network to have smaller weights: where η is the generalization ratio (a good compromise was achieved with its value set to 0.3), N S is the number of training samples, N w is the number of weights in the network,ŷ i is the DFNN output and y i is the desired (target) output response.

Case Study
In order to evaluate the proposed methodology's forecasting accuracy, the well known real case study from the New England region system is considered. This region in Northeastern US spreads across six states, namely Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island and Vermont. The hourly system load samples are provided by the New England Independent System Operator (ISO-NE) and are available in [50]. In this work, data from 2014 to 2015 were selected for training different forecasters. The data from 2016 were applied to test the models, and Figure 4 illustrates its hourly and monthly behavior. A brief data analysis revealed a mean load value of 13.504 GW and significant amount of standard deviation, 2.426 GW (∼18% of the mean value), with a moderately positively skewed and Leptokurtic distribution (skewness = 0.779 and kurtosis = 3.870). This implies a distribution with longer tail and more observations located to the right (higher) of the mean value. In addition, as illustrated in Figure 4, the summer months registered not only higher load values but also greater variabilities largely due to the cooling related loads. By contrast, the spring months had a smaller mean and standard deviation. Accordingly, the month of June had the highest mean and standard values, 16.700 GW and 2.9269 GW, respectively, as well as the maximum recorded load value of 23.973 GW, which took place on 3 June. The month of August recorded the lowest mean value of 11.480 GW and also the minimum observed load value of 9.149 GW (7 August), while the minimum standard deviation occurred in the spring month of April with a value of 1.3383 GW.

Results
In order to demonstrate the proposed methodology's behavior in a representative manner for the entire year, four different scenarios (out-of-sample scenarios) were considered, comprising four months corresponding to four different meteorological seasons, including holidays (typically similar to weekends in terms of load profile).The following months were selected to serve as testing data from the New England region: (i) winter month (February 2016); (ii) spring month (April 2016); (iii) summer month (August 2016); and (iv) fall month (October 2016). In order to ensure the consistency, the results of the ensemble DFNN were the mean results of chosen error metrics over 50 simulations, thereby avoiding skewed inferences based on specific random data divisions and random DFNN weight initialization.
In order to gauge the forecasting accuracy of the 24 ahead predictions, the common scale-invariant error metrics MAPE(%) (mean absolute percentage error) and RMSE (MW) (root mean square error) were used, respectively, Equations (10) and (11), which are defined as follows: where N test is the number of testing samples (length of the target and forecast vector); and load i andload i are the actual and predicted load value at each time-step i, respectively. The obtained load forecasting error metrics are displayed in Tables 1 and 2. Notably, in terms of MAPE (%) (Table 1), the ensemble method was able to outperform the ARIMA forecasters (input Predictions Models) for all the testing periods. This translates the sizeable monthly accuracy gains of MAPE (%) in comparison with the mean accuracy of all the ARIMA forecasters of ∼23%, ∼14%, ∼12% and ∼26% for the months of February, April, August and October, respectively, and accuracy gains of ∼16%, ∼2%, ∼8% and ∼14% when comparing the ensemble methodology versus the best ARIMA Forecaster in the same testing months. Table 2 indicates a similar pattern for the RMSE, where one can see monthly error improvements of ∼22%, ∼11%, ∼13% and ∼28% in comparison with the mean accuracy of all the ARIMA forecasters and RMSE accuracy gains of 12%, ∼1%, ∼10% and ∼18%, respectively, for the four different testing months. In addition, we can oberve that the methodology's accuracy is not specifically dependent on the time-series scale and variance characteristics, i.e., the largest MAPE and RMSE values took place in the month of April, which as observed previously is not the month with the highest mean or standard deviation.  Figure 5 illustrates the forecasting performance of the proposed ensemble methodology in comparison with the selected ARIMA forecasters (inputs) for two indiscriminately selected fortnights, i.e., the period from 1 to 14 February (winter) and the period from 14 to 27 August (summer). This figure also underlines the ensemble methodology's good performance in resembling the real load profile. More importantly, these examples reveal the ability of DFNN to generalize beyond the provided ARIMA forecasters and extrapolate further, thereby enabling the ensemble forecast tomatch the real load more closely. This behavior is illustrated in the zoomed secondary axis in both plots, where one can clearly observe the ensemble forecast outside of the region mapped by the 15 selected ARIMA forecasters (input prediction models). In addition, it is also clear that, as expected, the ARIMA forecasters are a suitable option to capture the linear components of the time-series but struggle to reproduce the faster changing variations of the morning and evening peak loads. This confirms the need for a stacking ensemble methodology where a deep network with nonlinear mapping abilities, fed with additional data, compensates this shortcoming and provides relevant error accuracy gains. To complement these results, a comparative analysis was made against a common shallow NN architecture with a single hidden layer with 35 nodes, also trained using the SCG algorithm, for benchmarking purposes and also for evaluating the effect of the added exogenous input variables in the input dataset (the same exogenous variables were fed to the ensemble methodology). Ergo, two different input datasets are considered. Likewise, a support vector machine regression with a Gaussian kernel function and input standardization is used to map the input domain, building the decision boundaries to achieve the hyperplane that allow us to predict continuous load outputs. The average error metrics over the course of 50 runs are shown in Tables 3 and 4, highlighting not only the superior performance of the the ensemble method for all the testing periods but also the clear benefit of including related exogenous information in the input datasets when we compare the errors between the two NNs. A brief statistical analysis reveals gains in terms of RMSE between ∼1.9% to ∼17.5% versus the NN with exogenous inputs; between ∼10.5% to ∼32.9% versus the NN without exogenous inputs; and between ∼3.4% to ∼33.7% versus the Regression-SVM. An identical account is also depicted by the MAPE (%) values, revealing considerable monthly accuracy improvements (between ∼11.5% to ∼33.8%) for all the months versus the other three forecasting approaches, with the exception being the mean monthly MAPE (%) of August where the gain was more moderate (∼1.5%).  Another important feature of the proposed methodology and its comparative NN peers was the relatively low levels of standard variation across the 50 runs, as one can notice in Figure 6, by the narrow range of the interquartile distribution in the box-plot and the small number of outliers, both for the MAPE and RMSE error metrics. More specifically, the ensemble methodology even with a large input dataset is in line with the NN with exogenous inputs variance, and its median line "sits" lower in the box-plots for all four testing months. In the months of February, April, August and October, the recorded values of standard deviation were, respectively, 0.132%, 0.252%, 0.217% and 0.113% in terms of MAPE (%) and 26  Last but not least, Figures 7 and 8 are presented to illustrate two different weeks (first week of April and third week of October, respectively) of forecasting loads versus the real load, as well as the individual hourly deviation (signed error) of each comparative method and the proposed approach below (in the form of bar plots). These bars allow a better understanding of the difference between predicted and real hourly loads, and the smaller (better) magnitudes seen in the Ensemble DNN reinforce the inferences made by not only analyzing the error tables but also the error distribution in Figure 6, with a slightly positive bias in terms of error signal for all the considered methods.

Conclusions
The task of short-term load forecasting plays a crucial part for a better functioning electric power system, enabling better scheduling, lower generation costs, better planning and better count of load flows. This task has received new attention, given the increasing adoption of self-consumption and demand response mechanisms. A review of the latest developments in the STLF problem revealed a broad field of study, with researchers trying to explore the significant breakthroughs in ML and DL in order to reduce forecasting error. An increasingly consolidated trend is the use of ensemble or fusion methods (particularly black-box type models) in order to explore the different generalization capabilities of dissimilar methods and to increase the diversity of solutions (among the base classifiers/inputs).
By considering all these aspects, this study proposes a new stacking ensemble methodology that uses a pool of different ARIMA forecasters that not only differ in the orders of autoregressive and moving average terms but also in terms of the batch (training) sizes in which they are adjusted. These prediction models are fitted according to their stationarity hypotheses in an automatic manner, and they are mainly tailored for modelling the linear components of the load time-series; their information is fed as forecaster inputs to a DFNN given its well-known ability to map hard non-linear relationshipsby using a gradient based training algorithm. This deep black-box architecture was designed to include another set of endogenous (lagged load data) and exogenous information in order to allow the network to generalize beyond ARIMA forecasters. To improve the ANN learning process, an early stop mechanism is not only employed but a modified performance function is also employed, which improves generalization.
This implementation philosophy will ensure a reduction in the high variance of single NN based models and in generalization errors. The other focal point for researchers is improvements in data selection, pre-processing and feature extraction techniques, which are essential to "clean" the time-series from noisy data, including outliers or seasonal events, or to precisely uncover some "hidden" features (e.g., high frequency behaviors). As such, CA and different Box-Jenkins models were used to facilitate the modeling process, i.e., the extraction of hard (non-linear) relationships from the set of explanatory input variables.
The effectiveness of the proposed method was evaluated in the New England case study by assessing the error in four different months (from different seasons of the year) of 2016. The obtained error metrics revealed the ensemble methodology's ability to improve forecasting accuracy in comparison with the same error metrics for the input prediction models, achieving error improvements on the order of 10% to 25% (mean terms) in comparison with the base learners' prediction accuracy. The RMSE underscores well the effectiveness of the proposed methodology, since all the monthly RMSEs were lower than the lowest monthly standard deviation in the time-series (which occurred in April 2016).
The comparative analysis validated the ensemble methodology by confirming the improvements when using a DNN versus shallow NN, and it also allowed backing up the decision to consider relevant exogenous variables, i.e., there are individual contributions made to the ensemble approach by the different individual ARIMA forecasters that coupled with a correlated input dataset for the DFNN, which has produced meaningful error improvements. Moreover, the relatively small levels of variance between simulations of the DFNN show the consistency of the architecture independently of some random parameters, such as weight initialization and data division. These results attest the suitability of the designed approach to create the intended input diversity in the base learners with the different Box-Jenkins models. These models were assembled and evaluated from a pool of options based on data and information-criteria, thus successfully tackling the common model identification (hyperparameter decision) problem when using these statistical models, thus corroborating the design approach of the proposed stacking ensemble methodology.
In terms of future works, the authors will be looking to extend the diversity of the input forecasters (with different types of base learners), address the error bias and study different combinations of dissimilar methods, as well as increasing the number of training windows and its range, in order to gauge longer term trends with the forecasting models and its effect in the overall combined load forecast. Funding: This work is funded by FCT/MCTES through national funds and, where applicable, co-funded EU funds under the project UIDB/EEA/50008/2020.

Data Availability Statement:
Publicly available datasets were analyzed in this study. This data can be found here: https://www.iso-ne.com/isoexpress/web/reports/load-and-demand/accessionnumber (accessed on 4 November 2021).

Conflicts of Interest:
The authors declare no conflict of interest.

Nomenclature
The following abbreviations, acronyms and variables are used in this manuscript: Output of an arbitrary hidden layer l receiving an input x; ISO-NE New England indepedent system operator (regional transmission); k Number of (ARIMA) model parameters; (i) Architecture: 3 hidden layers with sizes (number of neurons) [20,10,5], respec-