Error Compensation Enhanced Day-Ahead Electricity Price Forecasting

Abstract: The evolution of electricity markets has led to increasingly complex energy trading dynamics, and the integration of renewable energy sources as well as the influence of several external market factors has contributed to price volatility. Therefore, day-ahead electricity price forecasting models, typically based on neural networks, play a crucial role in the optimal behavior of market agents. The most prominent models and benchmarks improve the accuracy of predictions and the time for convergence through a priori processing of the dataset that is used for the training of the neural network, such as hyperparameter tuning and feature selection techniques. What has been overlooked so far is the possible benefit of a posteriori processing, which would consider the effects of parameters that could refine the predictions once they have been made. Such a parameter is the estimation of the residual training error. In this study, we investigate the effect of residual training error estimation for the day-ahead price forecasting task and propose an error compensation deep neural network model (ERC-DNN) that focuses on the minimization of prediction error, while reinforcing error stability through the integration of an autoregression module. The experiments on the Nord Pool power market indicated that this approach yields improved error metrics when compared to the baseline deep learning structure in different training scenarios, and the refined predictions for each hourly sequence shared a more stable error profile. The proposed method contributes towards the development of more flexible hybrid neural network models and the potential integration of the error estimation module in future benchmarks, given a small and interpretable set of hyperparameters.


Introduction
Modern energy markets follow increasingly complex processes in order to perform efficient electricity trading that balances supply and demand while reacting to the dynamics derived from the unique characteristics and challenges of each energy system. One of the main challenges that urge the development of more sophisticated techniques for the coordinated production and supply of electricity is price volatility [1]. The price of electricity can fluctuate due to several factors and the sudden peaks and valleys in the price curves could lead to suboptimal energy market agent behavior, hindering the ability of those entities to execute economic transactions in the electricity market to the best of their envisaged capacity. Some of the most notable factors that could cause price fluctuations include seasonal trends [2], weather conditions [3], penetration of renewable energy sources [4], challenges involving economic growth and changes in fuel cost [5], supply availability [6] and neighboring market dynamics [7]. It can be easily observed that load and generation process of model selection through those benchmarks rely primarily on hyperparameter optimization, feature selection and regularization techniques [27].
Recent research projects and reviews highlighted interesting short-term electricity price forecasting approaches that utilize elements from statistical and machine learning methods. Alamaniotis et al. [28] proposed a multiple regression model based on relevance vector machines for day-ahead electricity price forecasting, contributing towards the development of optimal bidding strategies in electricity markets. Moreover, Alamaniotis et al. [29] developed a hybrid forecasting model featuring relevance vector machines in a linear regression ensemble method for efficient short-term price forecasting. Zhang et al. [30] presented a forecasting method that aggregates the combined predictions from CNN and RNN structures in a gradient boosting regressor, yielding improved performance. Additionally, this study highlighted the importance of elastic net regularization for the stability and reliability of this combinatorial method. Alamaniotis et al. [31] developed a combinatorial approach that couples load and price forecasting and modifies forecasted load demand through the implementation of smart scheduling algorithms. Chinnathambi et al. [32] developed a multi-stage day-ahead forecasting model based on the autoregressive integrated moving average (ARIMA) statistical approach and the consequent residual error forecast that improves the performance of the initial predictions for different time periods. This research project provides some useful insights on the utilization of post-processing factors, such as the error, for the improvement of statistical methods. Chang et al. [33] proposed a forecasting model that utilizes wavelet transform and an LSTM network featuring the Adam stochastic gradient optimizer, demonstrating that a well-optimized recurrent neural network could capture and process the nonlinear patterns in this task efficiently.
Su et al. [34] utilized the least squares regression boosting algorithm to predict natural gas spot prices, outperforming existing approaches, such as linear regression. Atef and Eltawil [35] conducted a comparison between support vector regression (SVR) and LSTM electricity price forecasting models, concluding that while both methods could be suitable for this predictive task, the deep learning approach outperforms the regression model in terms of error metrics. Bissing et al. [36] investigated the different combinations of regression, namely the ARIMA and Holt-Winters models, for day-ahead forecasting and provided some interesting results regarding the performance benefits of hybrid implementations. Xu and Baldick [37] compared different neural network architectures and some state-of-the-art statistical methods, concluding that neural network models could perform better for price forecasting while yielding lower mean absolute error. Zhang et al. [38] studied the performance of deep recurrent neural networks for electricity price forecasts in a deregulated market, providing useful insights on the suitability of this neural network type as a multivariate time series model. Lago et al. [39] presented a review of state-of-the-art price forecasting models covering statistical, machine learning and hybrid approaches. Furthermore, this research work provided a useful open-access benchmark including a regression and a deep neural network model that utilize hyperparameter optimization for future model comparisons. Tao et al. [40] proposed a bias compensation LSTM network utilizing the LightGBM algorithm for feature selection. This work contributed significantly towards the development of hybrid short-term forecasting models since the introduction of residual error analysis for recurrent neural networks is a novel approach that could refine time series predictions.
Vega-Márquez et al. [41] approached the electricity price forecasting task from a univariate time series perspective and tested well-known deep learning and statistical methods through hyperparameter optimization, distinguishing LSTM, CNN and regression tree methods as the most performant. Jiang et al. [42] utilized a decomposition-selection-ensemble forecasting system that adapts to different data characteristics and focuses on accurate and stable price predictions. Li et al. [43] presented a price forecasting model based on variational mode decomposition and sparse Bayesian learning of time series, showing that aggregate predictions derived from components featuring simple characteristics could outperform state-of-the-art models. Pourdaryaei et al. [44] investigated the impact of different optimization methods for day-ahead price forecasting. This research work focuses mostly on the pre-processing and learning steps, while the impact of post-processing optimization techniques remains unexplored.
After a thorough overview of the literature, it is important to note that while a plethora of forecasting models exist and deep neural networks have been some of the most frequently used models, the effect of error compensation for the state-of-the-art feed-forward DNN is not sufficiently covered. We can observe that benchmarks and relevant studies utilize hyperparameter optimization as well as feature selection to tune the models and achieve lower error metrics, but fewer studies have applied post-processing techniques in order to refine and improve the predictions. Therefore, while there are recent studies that utilize error residuals for this short-term forecasting task, the application of this technique on the simple yet highly performant DNN is not thoroughly explored. As a result, the potential utilization of an error estimation module for benchmarks utilizing the DNN model as an additional tuning tool remains an open question. In this study, we identified these research gaps and developed a hybrid error compensation deep neural network model, the ERC-DNN, which utilizes a feed-forward deep neural network for day-ahead electricity price predictions, as well as an autoregression module, which operates on the hourly residual error sequences and performs a step-by-step error estimation to refine the predicted prices. The main goals of this research project are: (i) to showcase the improvement of price predictions in terms of error metrics; (ii) to investigate the stability of hourly predicted sequences after the error refinement; and (iii) to provide insights into the suitability of error estimation modules in modern benchmarks for future integration, when the appropriate parameters are defined. This hybrid approach was evaluated on the dataset of the Nord Pool market following the guidelines of the benchmark presented in [39], and through different training scenarios that highlight the positive impact of error refinement. 
Moreover, the resulting error metrics of this approach are compared to a baseline DNN structure developed using well-known configuration and training practices in order to achieve a similar score to the DNN benchmark with a static set of hyperparameters that does not alter the tests and produces consistent results during recalibration. Additionally, the error metrics of ERC-DNN are compared to the benchmark scores despite the differences in training epochs and hyperparameter optimization in order to highlight the overall effect of the error estimation module. Section 2 presents the main methods utilized in the implementation of the proposed forecasting approach with references to the core components of the network, as well as information regarding the dataset and the configuration of the experiments. Furthermore, this section defines the error metrics used to evaluate the performance of ERC-DNN. Section 3 discusses the results of the experiments and compares performance metrics to the baseline and benchmark models. Finally, in Section 4, the advantages, as well as the challenges of this hybrid model, are outlined. Additionally, comments regarding the impact of this model as a standalone project, the potential expansion of the proposed architecture, and the integration of this model to more complex forecasting structures and open-access benchmarks in the future are included, in the hope that they contribute to the intelligence gathered in this area of research.

Feedforward Deep Neural Network
The feedforward deep neural network is an acyclic artificial neural network [45] that follows a simple layer structure and extends the MLP architecture for the purposes of function approximation. The base unit of the feedforward DNN is the neuron which is a node designed to receive a specified number of inputs, perform computations and pass the output to connected nodes found deeper in the network. The value of the output at each node is determined by activation functions, such as the rectified linear unit and hyperbolic tangent [46]. The neurons of the DNN are organized into layers and the connections of those layers denote the computation path from the input to the output. The simplest and most frequently used DNN structure contains the input layer, where input features are passed to the first set of neurons, several hidden layers that perform additional computations and tune the learnable parameters of the network, and the output layer where one or more output values are generated at each node. For the purposes of this study, we consider the role of the feedforward DNN for the supervised learning task of regression [47] since we focus on the prediction of the electricity price for the next day. Based on this task, the goal of the DNN is to learn the mapping function that describes the complex relationship between the input variables and the output variables. As a general example, we consider the fully connected DNN presented in Figure 1. The DNN features an input layer i containing k inputs, a variable number f of hidden layers h, where each one contains a variable number of neurons z and, finally, an output layer o containing j neurons for the predictions of j outputs.
Energies 2022, 14, x FOR PEER REVIEW 5 of 23
The main learnable parameters of the DNN are the weights and biases [48]. Those parameters are initially randomized and iteratively refined through the training process since the network will be able to predict the output after several passes of the training dataset, called epochs. Weights quantify the influential strength that a change in the input could have on the output, and biases are constant offsets added to the weighted sum of the inputs at each neuron, essentially quantifying the extent to which the network assumes that the output should have specific values regardless of the input. The training process of the DNN mainly follows the back-propagation algorithm [49], where the generated output values are compared to the desired output and the value of the error, which is calculated by one of several pre-specified loss functions [50], is fed back to the network in order to adjust the weights. Since the goal of this training process is to minimize the error function and consequently discover the best weights, optimization methods, such as gradient descent, need to be specified for the training process.
The DNN architecture shows an impressive performance in time series forecasting tasks and it is widely used in the energy sector as a standalone network or as a member of hybrid and ensemble learning methods. However, the default configuration of this structure may not always be sufficient for the generation of accurate predictions due to several training scenarios that need to be avoided, such as the existence of local minima [51] of the error function that could hinder the convergence of the network and the occurrence of overfitting or underfitting that are connected to the relative complexity of the model and the dataset structure. Most deep learning models achieve optimal performance either by following a set of best practices or by exhaustively searching for the best training configuration through hyperparameter optimization [52]. Some of the most important hyperparameters include the number of neurons and layers, the choice of activation function, the choice of optimizer and the associated learning rate [53], the number of training epochs, regularization [54] and the application of early stopping [55]. The search space of those hyperparameters could be large and the total training time needed for the derivation of the best set of hyperparameters could be restrictive for models aimed at short-term and real time forecasts. Therefore, while we often see meticulous and time consuming hyperparameter optimization approaches being suitable for benchmarks, many deep learning approaches rely on the results of experiments with different combinations of best practices complemented by feature selection techniques, in order to derive their baseline models and conduct comparisons. The interpretation of those results, given a specified set of parameters, requires considerable effort towards the practical evaluation of a network and the overall demystification of the black-box structure that provides added value to research work.
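The forward computation and the back-propagation update described in this section can be illustrated with a minimal pure-Python sketch. It is not the network used in this study: the layer sizes (k = 3 inputs, z = 4 hidden neurons, j = 2 outputs), the squared-error loss, the learning rate and all function names are assumptions chosen only to keep the example small.

```python
import random

def relu(v):
    return v if v > 0.0 else 0.0

def forward(x, W1, b1, W2, b2):
    """Forward pass: k inputs -> z hidden ReLU neurons -> j linear outputs."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    output = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(W2, b2)]
    return hidden, output

def train_step(x, target, W1, b1, W2, b2, lr):
    """One back-propagation update; returns the loss measured before the update."""
    hidden, output = forward(x, W1, b1, W2, b2)
    # Gradient of the loss 0.5 * sum((output - target)^2) w.r.t. each output
    d_out = [o - t for o, t in zip(output, target)]
    loss = 0.5 * sum(d * d for d in d_out)
    # Propagate the error back through the ReLU layer (derivative 1 if active, else 0)
    d_hidden = [sum(d_out[j] * W2[j][i] for j in range(len(W2))) if hidden[i] > 0.0 else 0.0
                for i in range(len(hidden))]
    # Gradient-descent updates of weights and biases
    for j in range(len(W2)):
        for i in range(len(hidden)):
            W2[j][i] -= lr * d_out[j] * hidden[i]
        b2[j] -= lr * d_out[j]
    for i in range(len(W1)):
        for n in range(len(x)):
            W1[i][n] -= lr * d_hidden[i] * x[n]
        b1[i] -= lr * d_hidden[i]
    return loss

def init_layer(n_out, n_in, rng):
    """Small random initial weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
W1, b1 = init_layer(4, 3, rng)   # z = 4 hidden neurons, k = 3 inputs
W2, b2 = init_layer(2, 4, rng)   # j = 2 outputs
x, target = [0.2, 0.5, 0.1], [0.3, 0.7]
losses = [train_step(x, target, W1, b1, W2, b2, lr=0.1) for _ in range(200)]
```

Each call to `train_step` corresponds to one pass over a single training example, so the list `losses` traces how the error shrinks as the weights and biases are refined.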

Autoregressive Forecasting Model and Model Selection
Autoregressive models constitute a class of simple time series models used to forecast future values of the target variable based on previous observations of the same variable, called lags [56]. The target variable is linearly dependent on the lags and this relationship occurs due to some degree of correlation between lags of adjacent time steps. The number of lags utilized in the construction of an autoregressive model determines the order of the model and it is usually derived from the inspection of partial autocorrelations. The maximum lag at time step $t-n$ beyond which all other partial autocorrelations are close to zero is often used as an indicator of the order, and the model is expected to perform adequately when including lags up to that time step. The definition of the autoregressive model is made complete by the estimation of the coefficients $\phi_i$ that are multiplied by each lag, the constant term $c$, as well as the error term $\varepsilon_t$. The estimation of those parameters is usually achieved with the use of the ordinary least squares method [57]. In order to present a general example, we consider the autoregressive model of order $p$ for the prediction of the value $y_t$ on the next time step of the sequence formed by the variable $y$ with time lags ranging from $y_{t-1}$ to $y_{t-p}$. The formula that defines this autoregressive model given the previously mentioned parameters is the following:

$$y_t = c + \sum_{i=1}^{p} \phi_i \, y_{t-i} + \varepsilon_t$$

Since autoregressive models are widely used forecasting tools with several applications in the energy sector, a few core elements need to be explored for optimal performance and the fairness of the model selection process. First, the stationarity of the data needs to be investigated since statistical models often perform better when no trend or seasonality is present.
Different implementations of the autoregressive model take into consideration constant and time-dependent trends but the potential inaccurate detection of the trends and their effects on the time series forecast could sometimes lead to larger error terms. In this situation, the augmented Dickey-Fuller test [58] is utilized to determine the stationarity of a time series. According to this method, the null hypothesis assumes that a unit root exists in a time series sample and the alternate hypothesis rejects the previous assumption and considers that the time series is stationary. The p-value of the statistic results in the rejection of the null hypothesis when it is lower than 0.05. Alternatively, the comparison is between the values of the statistic and the critical values of the Dickey-Fuller t-distribution, where the value of the statistic must be more negative than the critical values to confirm stationarity. The stationarity criterion imposes restrictions to the autoregressive model that could often be seen as necessary countermeasures towards the overall reduction of uncertainty.
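As an illustration of how the constant term and the lag coefficients of an order-p autoregressive model can be estimated with ordinary least squares, the following pure-Python sketch solves the normal equations directly. The function names are hypothetical, the example uses p = 2 on a synthetic noise-free series, and a statistical library would normally be preferred in practice.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_ar(series, p):
    """Estimate the constant c and the coefficients phi_1..phi_p by OLS."""
    # Each row holds the intercept column followed by the lags y_{t-1}..y_{t-p}
    rows = [[1.0] + [series[t - i] for i in range(1, p + 1)]
            for t in range(p, len(series))]
    targets = [series[t] for t in range(p, len(series))]
    m = p + 1
    # Normal equations: (X^T X) beta = X^T y
    XtX = [[sum(r[a] * r[b] for r in rows) for b in range(m)] for a in range(m)]
    Xty = [sum(r[a] * y for r, y in zip(rows, targets)) for a in range(m)]
    beta = solve_linear_system(XtX, Xty)
    return beta[0], beta[1:]

def predict_next(series, c, phi):
    """One-step-ahead forecast: c + sum_i phi_i * y_{t-i}."""
    return c + sum(ph * series[-i] for i, ph in enumerate(phi, start=1))

# Synthetic series generated by y_t = 1.0 + 0.5 * y_{t-1} - 0.3 * y_{t-2}
series = [0.0, 3.0]
for _ in range(28):
    series.append(1.0 + 0.5 * series[-1] - 0.3 * series[-2])
c, phi = fit_ar(series, 2)
next_value = predict_next(series, c, phi)
```

Because the synthetic series contains no noise, the fitted parameters should recover the generating values up to floating-point error; on real data the error term absorbs the part of the sequence that the lags cannot explain.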
Second, the selection of the best autoregressive model plays a crucial role towards the minimization of forecasting error and several information criteria could be considered for the statistical evaluation of fitness to the data, such as the Akaike Information Criterion (AIC) [59], the Bayesian Information Criterion (BIC) [60] and the Hannan-Quinn Information Criterion (HQIC) [61]. The Akaike Information Criterion provides an estimation of information loss given the number of estimated model parameters $k$ and the maximum value $\hat{L}$ of the likelihood function for the model with the following formula:

$$\mathrm{AIC} = 2k - 2\ln(\hat{L})$$

Furthermore, the Bayesian Information Criterion follows a similar formula with a slightly altered first term that features the sample size $n$ of the observed data:

$$\mathrm{BIC} = k\ln(n) - 2\ln(\hat{L})$$

Lastly, the Hannan-Quinn Information Criterion utilizes the previously mentioned parameters in order to derive a more consistent fitness evaluation metric when compared to the AIC and follows the formula:

$$\mathrm{HQIC} = 2k\ln(\ln(n)) - 2\ln(\hat{L})$$

The selection of models with the lowest values of information criteria and the search for lags that have high autocorrelation values could result in a more accurate estimation of the target variable.
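The three criteria can be computed from the maximized log-likelihood with a few lines of Python. The sketch below uses the standard definitions AIC = 2k - 2 ln(L), BIC = k ln(n) - 2 ln(L) and HQIC = 2k ln(ln(n)) - 2 ln(L); the Gaussian log-likelihood helper is an assumption about the error distribution, and the function names are hypothetical.

```python
import math

def gaussian_log_likelihood(residuals):
    """Maximized log-likelihood assuming i.i.d. Gaussian residuals (an assumption)."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n  # maximum-likelihood variance estimate
    return -0.5 * n * (math.log(2.0 * math.pi) + math.log(sigma2) + 1.0)

def information_criteria(log_likelihood, k, n):
    """Return AIC, BIC and HQIC for k estimated parameters and n observations."""
    aic = 2.0 * k - 2.0 * log_likelihood
    bic = k * math.log(n) - 2.0 * log_likelihood
    hqic = 2.0 * k * math.log(math.log(n)) - 2.0 * log_likelihood
    return {"AIC": aic, "BIC": bic, "HQIC": hqic}

# Hypothetical values: a model with 3 parameters fitted on 50 observations
scores = information_criteria(log_likelihood=-100.0, k=3, n=50)
```

For a fixed likelihood, the three criteria differ only in how strongly they penalize additional parameters, which is why they may favor different lag orders for the same error sequence.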

Proposed Model Structure
This research project focused on the design and implementation of a hybrid day-ahead electricity price forecasting model based on the well-known feedforward deep neural network architecture, with an additional error compensation module that estimates the prediction error and contributes towards the refinement of the final prediction. At the first step, the dataset of the model is constructed, and market data is processed in order to derive the input features, consisting of electricity price lags and exogenous variables relevant to the price time series, as well as the output features of the targeted electricity price sequences for the next day. The dataset is split into training and validation sets, undergoes normalization and is fed to the input layer of the feedforward deep neural network. At the second step, the deep neural network is trained for m epochs featuring an early-stopping mechanism that monitors the decrease of the loss function for the avoidance of overfitting with a specified patience interval, proportional to the number of epochs. Consequently, after m epochs or after the loss function stops decreasing in that patience interval, 24 sequences are generated at the output layer, each one denoting the electricity price prediction for the i-th hour of the next day.
At the third step, the sequences are inverted back to their original values and the residual forecasting error for each hourly sequence is calculated from the training set. The residual training error at every hour $h$ for the price $p$ of the day of interest $d$, given the known values of the training dataset and the predicted output, is defined by the formula:

$$e_{d,h} = p_{d,h} - \hat{p}_{d,h}$$

where $e_{d,h}$ denotes the residual error and $\hat{p}_{d,h}$ the price predicted by the DNN. Following this step, the residual error sequences are fed to an autoregressive model for their step-by-step estimation, resulting in the derivation of coefficients that are used to predict the error value of the next hour based on historical error data. The final price prediction is derived from the addition of the estimated error and the price forecast of the feedforward DNN. The structure of this model is presented in Figure 2 and this forecasting approach is used in our case study featuring several experiments on different training scenarios for the interpretation and analysis of the error compensation process. We refer to this model as ERC-DNN in the remainder of this paper.
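The error compensation step can be illustrated with a minimal Python sketch for a single hourly sequence. It assumes that the residual is the actual price minus the DNN forecast, which is consistent with adding the estimated error back to the forecast, and, for brevity only, it replaces the autoregressive model of this study with a first-order one. The function names and the synthetic numbers are hypothetical.

```python
def residuals(actual, predicted):
    """Residual training error of one hourly sequence: actual minus predicted price."""
    return [a - p for a, p in zip(actual, predicted)]

def fit_ar1(errors):
    """OLS fit of e_d = c + phi * e_{d-1} (first order used here only for brevity)."""
    x, y = errors[:-1], errors[1:]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    phi = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / var_x
    return mean_y - phi * mean_x, phi

def compensate(forecast, last_error, c, phi):
    """Refined price: DNN forecast plus the estimated next error."""
    return forecast + (c + phi * last_error)

# Synthetic example: the DNN always predicts 10.0 and its error follows an AR(1) pattern
true_errors = [2.0]
for _ in range(9):
    true_errors.append(0.5 + 0.8 * true_errors[-1])
predicted = [10.0] * len(true_errors)
actual = [10.0 + e for e in true_errors]

errors = residuals(actual, predicted)
c, phi = fit_ar1(errors)
refined = compensate(forecast=10.0, last_error=errors[-1], c=c, phi=phi)
```

In this toy setting the residuals are perfectly predictable, so the refined value coincides with the next actual price; with real market data the autoregressive module only recovers the systematic part of the error.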

Case Study and Experiments
In this section, we present a case study consisting of several experiments used to test the forecasting performance of the proposed ERC-DNN model and investigate the impact of error compensation on the stability of error profiles for each hour in the day-ahead electricity price prediction task. The dataset used for our experiments contains hourly observations of day-ahead electricity prices, as well as the exogenous sequences that represent the day-ahead forecast of load and the day-ahead forecast of wind generation for the Nord Pool energy market during the time period between 1 January 2013 and 24 December 2018. The dataset is freely available in [62] and was used by the open-access benchmark of [39] to evaluate the performance of the standard feedforward deep neural network. The data is organized according to the feature formation proposed by the benchmark. The input features include historical day-ahead prices from the previous three days as well as the prices from one week ago, labeled as $p_{d-1,h}$, $p_{d-2,h}$, $p_{d-3,h}$ and $p_{d-7,h}$, respectively, where $d$ denotes the day of interest and $h$ denotes the hour ranging from 1 to 24. Additionally, the day-ahead forecasts of the two exogenous variables are included for the day of prediction, made available on the previous day, essentially defining a set of 48 features. Furthermore, historical values of each exogenous variable for the previous day and one week ago are included. Lastly, a feature representing the day of the week as a binary vector with 7 elements is included, resulting in a total of 241 input features. The output features consist of the 24 h of day-ahead electricity prices. The dataset is split into a training set of the first 4 years, including the hourly observations from 2013 through 2016, and a validation set of the last 2 years, including years 2017 and 2018, similar to the benchmark model.
According to the review and benchmark of [39], the recommended minimum testing period for the evaluation of electricity price forecasting models includes one year of observations since the common practice of including a total of four weeks, one for each season, could be unsuitable due to inadequate representation of the average model performance, the potential exclusion of extreme events that could have an impact on dataset values and the possibility of selecting only the weeks where the model shows improved performance.
Therefore, following these recommendations and acknowledging the two-year period used in the benchmark, we believe that the selection of testing period in this review is a suitable evaluation practice and utilize it for the evaluation of our model. Moreover, we acknowledge that the training period varies between price forecasting models and select the maximum available historical data in the remainder of this dataset for our case study in order to have a sufficient number of observations for the convergence of the deep neural network. Following the guidelines of the open access benchmark, we first constructed a baseline feedforward deep neural network of 4 layers for this multivariate time series forecasting task. The base DNN implements a set of best practices and consists of a fixed set of hyperparameters in order to exclude the performance benefits of hyperparameter optimization and isolate the effects of error compensation in our comparison. The exclusion of hyperparameter tuning at the preprocessing and training steps highlights the role of error estimation as an additional computational layer that reinforces the interpretability of the performance improvement through a smaller and simpler set of parameters. It is evident that the best set of hyperparameters for a forecasting model designed to perform well on a specific machine learning task is dependent on several factors including the dataset, the forecasting horizon, system or application constraints and the intended architecture. The search space for those optimal parameters is large and the resulting optimal set is often chosen based on the improvement of error metrics without having a direct and easily interpretable association to the architecture of the model. 
On the other hand, error estimation presents the simple concept of error refinement through the discovery of the coefficients that define the polynomial which best fits the residual error sequences, providing a prediction of the error value that could correct the final prediction of the network by bringing the initial forecast closer to the target. Therefore, error estimation operates independently from the computational structure of the deep neural network, and the search goal shifts towards the selection of parameters that could prevent the error values from exhibiting large variations and irregular patterns, instead of proposing a set of parameters that attempt to configure a black-box approach.
The baseline model achieves comparable performance to the open-access benchmark in terms of error metrics as we will analyze in the following sections. The DNN structure contains an input layer of 241 neurons, two fully connected hidden layers with 100 and 52 neurons, respectively, and an output layer with 24 neurons for the prediction of the 24 hourly sequences of the day-ahead prices. The activation function is the rectified linear unit (ReLU) [63] and the optimizer is based on stochastic gradient descent [64] with a learning rate of 0.0005 for the avoidance of local minima. The dataset is normalized using min-max normalization and the neural network features an early-stopping mechanism with a patience interval that is equal to 10% of the total number of epochs in order to ensure the stability of predictions and the avoidance of overfitting. Figure 3 presents the structure of the baseline DNN, which is used to derive the day-ahead price predictions as a core component of the ERC-DNN model.
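The normalization and early-stopping rules described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names are hypothetical, and the choice to monitor a validation loss (rather than the training loss) is an assumption, since the text only states that the patience equals 10% of the total number of epochs.

```python
def min_max_fit(values):
    """Return the (minimum, maximum) pair learned from the given values."""
    return min(values), max(values)


def min_max_scale(values, lo, hi):
    """Map values linearly so that lo becomes 0.0 and hi becomes 1.0."""
    return [(v - lo) / (hi - lo) for v in values]


def min_max_invert(scaled, lo, hi):
    """Map normalized values back to the original price scale."""
    return [s * (hi - lo) + lo for s in scaled]


def early_stopping(val_losses, total_epochs, patience_fraction=0.10):
    """Return (last_epoch_run, best_epoch) for a sequence of monitored losses.

    Training stops once the loss has not improved for `patience` epochs,
    where patience is a fraction of the total epoch budget (assumed 10%).
    """
    patience = max(1, round(patience_fraction * total_epochs))
    best_loss, best_epoch = float("inf"), -1
    last_epoch = -1
    for epoch, loss in enumerate(val_losses[:total_epochs]):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return last_epoch, best_epoch


# Toy prices (illustrative values only, not taken from the dataset).
prices = [20.0, 35.0, 50.0]
lo, hi = min_max_fit(prices)
scaled = min_max_scale(prices, lo, hi)
restored = min_max_invert(scaled, lo, hi)

# Toy loss curve: improves for 20 epochs, then plateaus at a worse value.
losses = [1.0 / (e + 1) for e in range(20)] + [0.06] * 80
stop_epoch, best_epoch = early_stopping(losses, total_epochs=100)
print(scaled, stop_epoch, best_epoch)
```

With a budget of 100 epochs the patience is 10 epochs, so in this toy curve training halts 10 epochs after the last improvement. In practice the minimum and maximum would be fitted on the training split only and reused for the test split.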
The DNN is trained and the sequences for price prediction are generated at the output. The experiments presented in this work consider three training scenarios, with 10, 100 and 1000 epochs, respectively, for the investigation of error compensation in a scenario where the values of error are large and the network is not near convergence, a moderate scenario where the error has improved but there is still room for further training and a training scenario where the error of the network could marginally improve after a large number of epochs. In all three experiments, the residual error sequences for each hour are calculated and their stationarity is verified by the augmented Dickey-Fuller test. Additionally, the inspection of the partial autocorrelation function for each error sequence reveals that after the first 24 lags the partial autocorrelations decay to values near zero. The results of the stationarity test as well as the observation of the partial autocorrelation function encourage the integration of an autoregressive model for the estimation of each error sequence. Therefore, the residual error sequences are passed to an AR model utilizing a window of 24 lags for the prediction of the next value of error in each sequence. After the fitting of the model to the data, the autoregression coefficients are computed and the estimated hourly error sequences are added to the electricity price forecasts for the refinement of the final prediction. Furthermore, the information criteria of AIC, BIC and HQIC were examined for the suitability of the 24-lag autoregressive model and the potential refinement of the model selection process when a threshold for feature autocorrelation is set at 0.2, 0.3 and 0.4. This additional experiment could contribute towards the appropriate selection of hyperparameters that could be included in future benchmarks adopting this technique for post-prediction processing of the model. 
Since hyperparameter optimization for this type of forecasting task already considers a sizable set of hyperparameters, the choice between the window length and the more complex threshold inspection based on information criteria could often be an important decision that could determine the size of the search space and the overall computational burden for the recalibration of a model or benchmark, given that short-term and real-time forecasting models need to recalibrate relatively fast. Figure 4 presents the diagram for the autoregressive model of the ERC-DNN used in the experiments.
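The error compensation step described in this section can be sketched for a single hourly error sequence as follows. This is an illustration under stated assumptions, not the authors' code: the residual is taken as target minus forecast (so the estimated error is added to the forecast), the AR(24) coefficients are estimated with the Yule-Walker equations solved by the Levinson-Durbin recursion (the paper does not state which estimator was used), and the residual sequence is simulated rather than produced by the DNN.

```python
import random


def autocovariances(series, max_lag):
    """Biased sample autocovariances for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    c = [v - mean for v in series]
    return [sum(c[t] * c[t - k] for t in range(k, n)) / n
            for k in range(max_lag + 1)]


def levinson_durbin(acov, order):
    """Solve the Yule-Walker equations; returns the AR coefficients."""
    phi, var = [], acov[0]
    for k in range(1, order + 1):
        acc = acov[k] - sum(phi[j] * acov[k - 1 - j] for j in range(k - 1))
        refl = acc / var
        phi = [phi[j] - refl * phi[k - 2 - j] for j in range(k - 1)] + [refl]
        var *= 1.0 - refl * refl
    return phi


def fit_ar(errors, order=24):
    """Fit an AR(order) model to one residual error sequence."""
    mean = sum(errors) / len(errors)
    return mean, levinson_durbin(autocovariances(errors, order), order)


def predict_error(history, mean, phi):
    """One-step-ahead error estimate from the last len(phi) residuals."""
    return mean + sum(c * (history[-1 - j] - mean) for j, c in enumerate(phi))


# Simulated residuals (target minus forecast) for one hour: a biased,
# autocorrelated sequence standing in for the output of a trained DNN.
random.seed(0)
errors = [2.0]
for _ in range(1199):
    errors.append(2.0 + 0.7 * (errors[-1] - 2.0) + random.gauss(0.0, 1.0))

mean, phi = fit_ar(errors[:1000], order=24)

baseline_abs, compensated_abs = [], []
for t in range(1000, 1200):
    e_hat = predict_error(errors[:t], mean, phi)
    baseline_abs.append(abs(errors[t]))             # raw forecast error
    compensated_abs.append(abs(errors[t] - e_hat))  # after adding e_hat

baseline_mae = sum(baseline_abs) / len(baseline_abs)
compensated_mae = sum(compensated_abs) / len(compensated_abs)
print(round(baseline_mae, 3), round(compensated_mae, 3))
```

In the full model the same procedure would be repeated for each of the 24 hourly sequences. The size of the improvement in this toy run depends entirely on the simulated bias and autocorrelation and says nothing about the reported results.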

Performance Metrics
In this section, we define the performance metrics used in our experiments to compare forecasting error and to examine the effect of error refinement on the stability of error metrics for each hourly sequence of this day-ahead forecasting task. Four error metrics were used to cover different characteristics of the performance evaluation process. Mean absolute error (MAE) [66] provides an easily interpretable and natural error metric that is indifferent to the direction of errors. This performance metric was used as the loss function for the training of the deep neural network, for the configuration of the early-stopping mechanism, and for the evaluation of the ERC-DNN approach. Given the predicted values $y_i$ and real values $x_i$ in a set of $n$ samples, the mean absolute error is computed by the formula:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| x_i - y_i \right|$$

Furthermore, mean absolute percentage error (MAPE) [67] was used as a scale-independent performance metric, since it is a widely used measure for time series regression tasks and could provide a generalized percentage score for forecasting models. Given the same parameters as for the calculation of MAE, MAPE is computed by the formula:

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{x_i - y_i}{x_i} \right|$$

Moreover, the error metrics of mean squared error (MSE) [68] and root mean squared error (RMSE) [69] are included in the performance evaluation of the experiments, since they provide quadratic loss functions that measure the forecasting uncertainty while focusing on the impact of large errors. The MSE can be expressed as the sum of the variance and the squared bias, further contributing to the performance analysis of a model. Additionally, the values of RMSE increase with the variance of the frequency distribution of error magnitudes, resulting in larger values when large error values are present [70]. Given the same parameters used for the computation of the previously described error functions, the formulae for MSE and RMSE are the following:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - y_i \right)^2$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( x_i - y_i \right)^2}$$
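As a small worked example, the four metrics described in this section can be computed in plain Python using their standard definitions. The function names and the toy price values are illustrative assumptions, and MAPE is expressed in percent.

```python
import math


def mae(actual, predicted):
    """Mean absolute error between real values x_i and predictions y_i."""
    return sum(abs(x - y) for x, y in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error in percent (requires nonzero x_i)."""
    return 100.0 * sum(abs((x - y) / x)
                       for x, y in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean squared error."""
    return sum((x - y) ** 2 for x, y in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(mse(actual, predicted))


# Toy values, not taken from the dataset.
actual = [30.0, 32.0, 28.0, 40.0]      # x_i
predicted = [29.0, 33.5, 28.0, 36.0]   # y_i

print(mae(actual, predicted))   # 1.625
print(mse(actual, predicted))   # 4.8125
print(rmse(actual, predicted))
print(mape(actual, predicted))
```

The single error of 4.0 contributes 16 of the 19.25 total squared error, which illustrates why MSE and RMSE react more strongly to large errors than MAE does. MAPE is undefined when a real price is zero, so such hours would need separate handling.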

Results
In this section, we present the results of the experiments, with figures comparing the error metrics of the ERC-DNN and the baseline DNN for each training scenario. This comparison provides an overview of the stability and performance refinement that occurred in each hourly price sequence after the autoregressive error compensation module was added to the DNN architecture. Additionally, the overall performance of the model for each scenario is presented based on aggregated error metrics, in order to examine the generalized improvement in prediction accuracy stemming from the error estimation process. Furthermore, we explore information criteria for the selection of a refined autoregressive model and discuss the value of implementing a threshold method instead of the window of lagged error observations for error estimation. Since the performance metrics did not fluctuate greatly between consecutive executions, the results presented in this section are averages over 10 executions of each experiment. It is worth noting that the baseline DNN structure presented in this work performs similarly to the DNN model of the open-access benchmark [39]: it achieves an MAE of 1.987, a MAPE of 6.895 and an RMSE of 3.877 after 4000 training epochs, while the DNN benchmark configuration with the lowest error metrics achieved an MAE of 1.797, a MAPE of 5.738 and an RMSE of 3.474 after hyperparameter optimization. Therefore, the resulting ERC-DNN model uses a highly performant neural network component for the experiments.
First, we consider the training scenario of 10 epochs. The main purpose of this experiment is to present the effect of error compensation on the DNN forecast when the error has larger values that fluctuate greatly from sequence to sequence. In the simple univariate case, we could assume that this scenario refers to a network that has not reached convergence and could be unstable or not properly trained, while in the multivariate case we could observe that each output sequence differs greatly from the desired values and error magnitudes vary for each hour. Error compensation has the greatest impact in this scenario, as the accurate error estimation leads to a larger prediction refinement. In the subplots of Figure 5, we can observe that after the implementation of error compensation, large errors are no longer present, and this greatly improves the MSE and RMSE scores of the model. Moreover, the error profile for each hourly sequence is stabilized, so that the performance for each hourly predicted sequence is close to the average model performance.
The second experiment considers the training scenario of 100 epochs. In this task, the neural network reaches a more acceptable forecasting performance with each hourly sequence having similar error metrics. As can be observed from the subplots of Figure 6, there are slight error variations between the hourly sequences showing that the network is still unable to predict every hour of the day-ahead prediction equally well. The effect of error compensation in the ERC-DNN improves the forecasting performance and the error metrics are lower than those presented in the open-access benchmark. Since neural network models on sufficiently large datasets do not typically converge after 100 epochs and the values of error are not distinctly high, the slight error variations observed in the baseline evaluation are passed down to the ERC-DNN. Therefore, when compared to the 10-epoch scenario, the performance of the model improved in a similar way but the stability improvement of error among hourly sequences was not as drastic.
The third scenario considers 1000 training epochs and refers to models that are near finalization, where the model converges to predicted values close to the target output and the error metrics remain relatively low. Through this experiment, we can observe that the error metrics could follow more consistent patterns, in this case denoting that the first hourly sequences of the day-ahead forecasting task are predicted more accurately when compared to the last few hours. This phenomenon could be a cause of concern when the model is deployed for real-world applications since the model could generate substantially divergent values for the last few hours of each day. The error compensation improves the performance of this model and flattens the previously described effect, resulting in more consistently accurate predictions. However, it is worth noting that as the neural network is close to reaching convergence, the error values are considerably lower, and the overall refinement of predictions is smaller for larger numbers of epochs. The subplots of Figure 7 visualize this scenario.
Overall, we can observe that across all four performance metrics, the integration of the error compensation module refined the predictions and resulted in improved performance in every training scenario, denoting that better and substantially more stable error metrics can be derived even in situations where the neural network is not close to convergence. Table 1 presents the overall error metric comparison that cohesively depicts the impact of this post-processing error estimation model. Hyperparameter optimization considers a large space of training parameters in search of a combination that produces optimal error metrics after training. These parameters are specified before the training process starts and affect the error of the model during the training iterations. After the inspection of the results presented in this work, the argument for the inclusion of parameters that regulate error estimation and affect the error after the initial training is complete, such as the window of lagged observations for the definition of an autoregressive model, or the choice of error estimation method could be valid as future benchmarks could consider the full spectrum of error optimization, in an attempt at setting the new standard for model comparisons, where prediction refinement becomes one of the core final steps. However, expanding the search space and introducing additional hyperparameters is not always a viable option, especially when we consider the potential lack of computing power or the time restrictions imposed by the short recalibration period of real-time models. 
In this study, the consideration of an autoregressive model utilizing a window of 24 lagged observations for error estimation was a reasonable and computationally inexpensive choice, since the total execution time of the experiments was not dramatically increased. Additionally, the execution of the experiments considered the parameters that could encourage the usage of an autoregressive model for this task, such as the augmented Dickey-Fuller test of stationarity, the computation of partial autocorrelations, and the computation of information criteria for the error estimator.
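The partial autocorrelation inspection and the information criteria mentioned above can be illustrated with a short sketch. It rests on several assumptions: the error sequence is simulated, the partial autocorrelations are obtained as the reflection coefficients of the Levinson-Durbin recursion, and the criteria use the common Gaussian approximations AIC = n ln σ² + 2k, BIC = n ln σ² + k ln n and HQIC = n ln σ² + 2k ln ln n, which may differ by constants from the values a particular statistics library reports. The sketch compares full windows of 1 to 24 lags; it does not implement the autocorrelation threshold method.

```python
import math
import random


def autocovariances(series, max_lag):
    """Biased sample autocovariances for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    c = [v - mean for v in series]
    return [sum(c[t] * c[t - k] for t in range(k, n)) / n
            for k in range(max_lag + 1)]


def pacf_and_variances(acov, max_order):
    """Levinson-Durbin recursion.

    Returns (pacf, variances): pacf[k-1] is the partial autocorrelation at
    lag k and variances[k] is the innovation variance of the AR(k) fit.
    """
    phi, pacf, variances = [], [], [acov[0]]
    for k in range(1, max_order + 1):
        acc = acov[k] - sum(phi[j] * acov[k - 1 - j] for j in range(k - 1))
        refl = acc / variances[-1]
        phi = [phi[j] - refl * phi[k - 2 - j] for j in range(k - 1)] + [refl]
        pacf.append(refl)
        variances.append(variances[-1] * (1.0 - refl * refl))
    return pacf, variances


def information_criteria(variance, order, n):
    """Approximate AIC, BIC and HQIC of an AR(order) fit on n samples."""
    base = n * math.log(variance)
    return {"AIC": base + 2 * order,
            "BIC": base + order * math.log(n),
            "HQIC": base + 2 * order * math.log(math.log(n))}


# Simulated error sequence with a known AR(2) structure.
random.seed(1)
errors = [0.0, 0.0]
for _ in range(2000):
    errors.append(0.5 * errors[-1] - 0.3 * errors[-2] + random.gauss(0.0, 1.0))
errors = errors[2:]
n = len(errors)

pacf, variances = pacf_and_variances(autocovariances(errors, 24), 24)
scores = {k: information_criteria(variances[k], k, n) for k in range(1, 25)}
best = {name: min(scores, key=lambda k: scores[k][name])
        for name in ("AIC", "BIC", "HQIC")}
print(best)
```

Because BIC penalizes each extra lag more heavily than HQIC, and HQIC more heavily than AIC for this sample size, the order selected by BIC is never larger than the one selected by HQIC or AIC. A stationarity check such as the augmented Dickey-Fuller test would normally precede this step and is not reproduced here.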
While the search for the optimal window size based on partial autocorrelations could be regarded as an important step in model selection, and as a potential hyperparameter in more complex optimization problems, the investigation of different model selection criteria that could introduce additional hyperparameters is equally necessary. This work explored the information criteria threshold selection method as an alternative to the simpler window selection. The information criteria threshold selection method iteratively fits the autoregressive model using only the lagged observations whose autocorrelation function (ACF) value surpasses a specified threshold. The three information criteria scores of AIC, BIC and HQIC are computed and the model that achieves the lowest score for each hourly error sequence is selected. After examining the scores extracted from this alternative model selection approach in Tables 2-4, we observed that in the scenario of 10 epochs, where the error compensation model achieves the greatest prediction refinement, not all error sequences led to improved information criteria when lagged observations over a certain autocorrelation threshold were selected, since the values depend on the error sequences generated by the DNN. This also holds true for the 100- and 1000-epoch scenarios. Furthermore, the improvement of the information criteria is negligible when compared to the 24-lag window method. Consequently, even in a scenario where all hourly error sequences were able to benefit from the threshold method, the increase in forecasting performance would not be impactful enough to justify the computational burden of iteratively searching for the model that satisfies those criteria. Hence, the simple window method for autoregressive error estimation would be the preferred method for ERC-DNN, and the window size would be an appropriate hyperparameter for tuning that model.
Table 2. Comparison between the 24-lag window method and the threshold method based on AIC scores for the 10-epoch scenario of ERC-DNN. Cells colored in green denote an improvement in information criteria score while cells colored in blue denote worse overall scores when compared to the window method.

Discussion
This paper presented an error compensation deep neural network for the task of day-ahead electricity price forecasting. The proposed model used an autoregressive module to estimate hourly residual error sequences and refine the predictions of the neural network model. This approach was tested in three different training scenarios, where the values of the error were high, moderate, and low, in order to cover several potential network behaviors, ranging from fairly unstable to nearly convergent. The ERC-DNN yielded improved error metrics in every training scenario when compared to the baseline model. In detail, the error compensation method stabilized the performance of the poorly trained network in the first scenario, decreasing the value of MAE from 8.581 to 2.137. Additionally, significant performance improvements were observed in the moderate and the longer training scenarios, with the values of MAE decreasing from 3.068 to 1.507 in the 100-epoch experiment and from 2.156 to 1.105 in the 1000-epoch experiment. This forecasting approach resulted in improved error metrics when compared to the benchmark results presented in [39].
The improvement of forecasting performance is not the only benefit provided by this approach, since the error compensation method manages to create more consistent predictions, resulting in multivariate models that can predict each hourly sequence at a similar level of accuracy. The inclusion of an autoregressive module resulted in a clear and interpretable approach to error improvement since it operates on the output of neural networks. Therefore, error estimation and refinement through this approach could be easily associated with the analysis of hourly residual error sequences instead of searching for the optimal combination of structural parameters that configure complex deep neural networks in a black-box approach. The design, implementation and testing of this method provide some useful insights towards the development of more robust and stable hybrid models, as well as the integration of error compensation as an additional optimization option for benchmarks during post-processing. However, one potential disadvantage of this method is its dependence on the error sequences and their characteristics. In this project, we implemented several methods, such as stationarity and autocorrelation analysis, to ensure that the autoregressive module would behave appropriately. In scenarios where those methods yield inconsistent results, this approach may not result in substantial error improvements. As a result, we believe that the analysis of error sequences is a crucial step that should precede the integration of that data in post-processing techniques and should not be omitted. Since most hybrid models and benchmarks utilize hyperparameter optimization to search for the optimal combination of parameters that minimize the error metrics, the integration of error compensation could introduce a wide set of additional parameters that would increase the overall complexity of the models and potentially render that refinement more computationally expensive. While the simple choice of the window size in an autoregressive error estimation model seems to be an appropriate hyperparameter for the configuration of this method, the consideration of more complex estimation methods could result in refinement techniques that greatly increase the execution time of those models.
The contribution of this work is not limited to the research and development of electricity price forecasting models since there are several ways this approach could benefit market participants and the grid. Firstly, this approach could reduce the price uncertainty of generators while assisting them indirectly in the maximization of profit. Since generators often need to select the highest price after inspecting offers from different markets in order to sell the production [71], this method could lead to more informed decisions due to the increased stability and forecasting performance. Secondly, trading companies could develop more robust short-term contracts due to the availability of more accurate price estimates. Lastly, the grid could benefit from more stable and accurate price predictions since the effect of price volatility could lead to more blackouts and the urgent usage of reserves.
This project attempted to cover several research gaps through the investigation of the error compensation effect on the well-known DNN structure used in open-access benchmarks and several forecasting applications. While recent studies shared a similar direction in the implementation of error compensation on the LSTM structure [40] as well as more traditional statistical methods for different forecasting tasks, this study considered the feedforward deep neural network as the building block for the development of performant forecasting models that include error estimation. The examination of the results in conjunction with recent research findings derived from statistical and machine learning models reinforces the concept that error estimation is a beneficial post-processing technique for deep learning models in the energy sector. There are several additional aspects regarding this method that could be explored in future work. First, a wide comparison of error estimation models ranging from simple statistical approaches to the increasingly complex neural network models could contribute towards the optimal model selection of the error refinement module post-training. Second, the ERC-DNN model could be tested on many electricity markets that display different price characteristics, such as different levels of price fluctuations in an attempt to study the effects of the unique price curve behavior on the training error. Additionally, the inspection of distinctly different error sequences could result in useful insights into the behavior of the model and the adaptability to different market dynamics. Lastly, the benefits of hyperparameter optimization could be studied in combination with error compensation, in an attempt to quantify the overall performance improvement and the computational tradeoff for short-term and real-time applications.

Data Availability Statement: Data is available in a publicly accessible, general-purpose repository. The data used in this study are openly available in Zenodo at 10.5072/zenodo.715409, reference number [62]. The dataset was processed as the input for the design and performance assessment of the day-ahead electricity price forecasting approach described in this article. This dataset was chosen for this work due to the insightful comments presented in the open-access benchmark referenced