Selection of Calibration Windows for Day-Ahead Electricity Price Forecasting

We conduct an extensive empirical study on the selection of calibration windows for day-ahead electricity price forecasting, which involves six-year-long datasets from three major power markets and four autoregressive expert models fitted either to raw or transformed prices. Since the variability of prediction errors across windows of different lengths and across datasets can be substantial, selecting ex-ante one window is risky. Instead, we argue that averaging forecasts across different calibration windows is a robust alternative and introduce a new, well-performing weighting scheme for averaging these forecasts.

Fezzi and Mosetti [29] propose a simple, two-step approach that uses the first step to determine the optimal window length (ranging from only a few to 350 days) for each model and the second step to compare forecasting capabilities across models. They argue that improvements over selecting ex-ante one window of 'typical' size are significant for the considered datasets from the Nordic and Italian markets. Hubicka et al. [30] go one step further and propose a novel concept in energy forecasting that combines day-ahead predictions across different calibration windows (ranging from 28 to 728 days). Using data from the Global Energy Forecasting Competition 2014 [31], they show that this kind of averaging yields better results than selecting ex-ante only one 'optimal' window length. In this paper, we extend their analysis to other datasets, predictive models with more explanatory variables, and transformed price and consumption/load series. Most importantly, we introduce a new, well-performing weighting scheme for averaging forecasts.
The remainder of the paper is structured as follows. In Section 2 we introduce the methodology. We start with the preliminaries, including variance stabilization via the area hyperbolic sine transformation (Section 2.1). Next, we present the expert autoregressive models with (i.e., ARX) and without (i.e., AR) exogenous variables that are used to compute the day-ahead price forecasts (Section 2.2). Then, we describe the three datasets from three major power markets with 2.5-3 year long out-of-sample test periods (Section 2.3) and the rolling window framework we apply (Section 2.4). Finally, we introduce a novel weighting scheme for averaging forecasts across calibration windows (Section 2.5). In Section 3 we evaluate the obtained results in terms of the classical error measure for point forecasts (i.e., the mean absolute error, MAE) and the Giacomini and White [32] test for conditional predictive ability (CPA) to determine significant differences in forecasting accuracy. Finally, in Section 4 we wrap up the results and conclude.

Preliminaries
As in many EPF studies, the modeling is implemented here within a 'multivariate' framework, which mimics price setting in day-ahead auction markets [27]. We explicitly use a 'day × hour', matrix-like structure with $P_{d,h}$ representing the electricity price for day $d$ and hour $h$. Given the recommendations of Uniejewski et al. [26], we calibrate our models not only to raw prices but also to transformed data, i.e., $X_{d,h} = f(P_{d,h})$, where $f(\cdot)$ is an appropriately chosen variance stabilizing transformation (VST). Lower variation and/or less spiky behavior of the input data usually allows the forecasting model to yield more accurate predictions [33].
For electricity markets with only positive prices, the logarithm is the most popular choice for a VST. However, since two out of three datasets analyzed here exhibit negative values, the log-transform is not feasible in our case. Instead, we use the area hyperbolic sine transformation:

$$X_{d,h} = \operatorname{asinh}(p_{d,h}) = \log\left(p_{d,h} + \sqrt{p_{d,h}^2 + 1}\right), \tag{1}$$

where $p_{d,h} = \frac{1}{b}(P_{d,h} - a)$ are 'normalized' prices, $a$ is the median of $P_{d,h}$ in the calibration window and $b$ is the sample median absolute deviation (MAD) around the median. The asinh can be used for negative data and its implementation is straightforward. Moreover, it has been found to perform well in several EPF studies [25,27,34]. The inverse transformation is the hyperbolic sine, i.e., $p_{d,h} = \sinh(X_{d,h})$. After computing the forecasts $\hat{X}_{d,h}$, we apply it to obtain the price predictions:

$$\hat{P}_{d,h} = b \sinh(\hat{X}_{d,h}) + a. \tag{2}$$
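The transformation and its inverse can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the MAD is computed without any consistency factor, which is an assumption because the text does not say whether one is applied.

```python
import math
from statistics import median


def fit_vst(prices):
    """Return (a, b): the median and the raw MAD of the calibration-window prices."""
    a = median(prices)
    b = median(abs(p - a) for p in prices)  # assumption: no scaling factor applied
    if b == 0:
        raise ValueError("MAD is zero; prices cannot be normalized")
    return a, b


def to_asinh(prices, a, b):
    """Normalize prices and apply the area hyperbolic sine."""
    return [math.asinh((p - a) / b) for p in prices]


def from_asinh(values, a, b):
    """Invert the transformation: hyperbolic sine, then undo the normalization."""
    return [b * math.sinh(x) + a for x in values]
```

Note that `a` and `b` are estimated on the calibration window only and then reused to map the forecasts back to price levels.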

Expert Models
We consider four autoregressive models, each consisting of 24 submodels, one for each hour of the day. Following [17,22,27], we refer to them as expert models, because they are built on some prior knowledge of experts. All four models come in two variants:
• Benchmark versions that work on raw data. Since the identity $f(P_{d,h}) \equiv P_{d,h}$ is a special case of a VST, to simplify the notation we refer to it as ID and to the resulting data as 'ID-transformed'.
• Modified versions that work on asinh-transformed data.
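The first model, denoted by ARX1, is a simple autoregressive structure with an exogenous variable (hence X in the name) originally proposed by Misiorek et al. [9] and later used in several EPF studies [13,15,17,18,22,25,35-37]. Within this model, the VST-transformed price on day $d$ and hour $h$ is given by:

$$X_{d,h} = \beta_{h,0} + \beta_{h,1} X_{d-1,h} + \beta_{h,2} X_{d-2,h} + \beta_{h,3} X_{d-7,h} + \beta_{h,4} X_{d-1,\min} + \beta_{h,5} C_{d,h} + \sum_{i=1}^{3} \beta_{h,i+5} D_i + \varepsilon_{d,h}, \tag{3}$$

where $C_{d,h}$ is the consumption (or load) forecast and $D_1 \equiv D_{Sat}$, $D_2 \equiv D_{Sun}$, $D_3 \equiv D_{Mon}$ are day-of-the-week dummies. Setting $\beta_{h,5} \equiv 0$ in Equation (3), i.e., discarding the exogenous variable, yields the second model, denoted by AR1.

The third autoregressive structure is the well-performing expert DoW,nl model of Ziel and Weron [27], only expanded to include one exogenous variable (consumption or load forecast; as in [25]). Within this model, denoted by ARX2, the VST-transformed price on day $d$ and hour $h$ is given by:

$$X_{d,h} = \beta_{h,1} X_{d-1,h} + \beta_{h,2} X_{d-2,h} + \beta_{h,3} X_{d-7,h} + \beta_{h,4} X_{d-1,24} + \beta_{h,5} X_{d-1,\max} + \beta_{h,6} X_{d-1,\min} + \beta_{h,7} C_{d,h} + \sum_{i=1}^{7} \beta_{h,i+7} D_i + \varepsilon_{d,h}. \tag{4}$$

The VST-transformed price for the last load period of the previous day, i.e., $X_{d-1,24}$, is included to take advantage of the fact that prices for early morning hours depend more on the previous day's price at midnight than on the price for the same hour [22,27,38]. Compared to ARX1, the maximum of the previous day's VST-transformed prices, i.e., $X_{d-1,\max}$, and dummies for the remaining days of the week ($D_4 \equiv D_{Tue}$, $D_5 \equiv D_{Wed}$, $D_6 \equiv D_{Thu}$, $D_7 \equiv D_{Fri}$) are also included. As there are seven dummies in Equation (4), one of them plays the role of the intercept, hence the missing $\beta_{h,0}$ term. The fourth model, denoted by AR2, is obtained from Equation (4) by setting $\beta_{h,7} \equiv 0$, i.e., by discarding the exogenous variable.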
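To make the structure of the simplest expert model concrete, the sketch below assembles the ARX1 regressors for one day and hour: an intercept, the prices for the same hour one, two and seven days earlier, the previous day's minimum, the exogenous forecast and the Saturday, Sunday and Monday dummies. The function name, the 0-based indexing and the weekday coding (0 = Monday, ..., 6 = Sunday) are assumptions made for illustration.

```python
def arx1_regressors(X, C, weekday, d, h):
    """Regressor row of the ARX1 submodel for day d and hour h (0-based).

    X[d][h]    : VST-transformed prices, one list of 24 values per day
    C[d][h]    : VST-transformed consumption or load forecasts
    weekday[d] : 0 = Monday, ..., 5 = Saturday, 6 = Sunday (assumed coding)
    """
    if d < 7:
        raise ValueError("at least 7 days of history are needed")
    return [
        1.0,                               # intercept
        X[d - 1][h],                       # same hour yesterday
        X[d - 2][h],                       # same hour two days ago
        X[d - 7][h],                       # same hour one week ago
        min(X[d - 1]),                     # minimum of yesterday's 24 prices
        C[d][h],                           # exogenous variable
        1.0 if weekday[d] == 5 else 0.0,   # Saturday dummy
        1.0 if weekday[d] == 6 else 0.0,   # Sunday dummy
        1.0 if weekday[d] == 0 else 0.0,   # Monday dummy
    ]
```

Dropping the `C[d][h]` entry gives the corresponding AR1 row; the coefficients of each of the 24 hourly submodels would then be estimated by least squares on the rows within the calibration window.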

Datasets
For the test ground we have chosen three major power markets that differ in geographic location and generation mix:
• Nord Pool (NP; Northern Europe): a hydro-dominated (over 50% of generation) market exhibiting strong seasonal variations,
• PJM Interconnection (Northeastern United States): the world's largest competitive wholesale electricity market with a balanced generation mix (ca. a third each of coal, gas and nuclear),
• EPEX Germany and Austria (Central Europe): a developed market with a rapidly growing share of renewables (wind, solar and biomass; currently over 33% of generation) and pronounced negative prices; the latter are natural in electricity trading, since plant flexibility is limited and costly, so incurring a negative price for a few hours can actually be economically optimal [34,39].
The first dataset comprises two time series at hourly resolution: Nord Pool system prices and day-ahead consumption prognosis for four Nordic countries (Denmark, Finland, Norway and Sweden) from the period 1 January 2013 to 31 July 2018. The time series plotted in Figure 1 were constructed using publicly available data (source: https://www.nordpoolgroup.com/historical-market-data/) and preprocessed to account for missing values and changes to/from the daylight saving time, analogously as in Weron [37]. The missing data values (corresponding to the changes to the daylight saving/summer time and eight hourly consumption figures for Norway) were substituted by the arithmetic average of the neighboring values. The 'doubled' values (corresponding to the changes from the daylight saving/summer time) were substituted by the arithmetic average of the two values for the 'doubled' hour. The ARX1 and ARX2 models (for both VSTs) were calibrated to the Nord Pool dataset.
The second dataset also comprises two time series at hourly resolution: prices and day-ahead load forecasts for the Commonwealth Edison (COMED) zone in the PJM market from the period 10 April 2012 to 2 April 2018, see Figure 2. The time series were constructed using publicly available data (source: https://dataminer.pjm.com) and preprocessed to account for missing values and changes to/from the daylight saving time, analogously as the Nord Pool dataset. Note that although the electricity price units are similar in Figures 1 and 2 (EUR/MWh vs. USD/MWh), the scales on the y-axis are different because of the much more volatile behavior of the PJM market, particularly in early 2014. All four models introduced in Section 2.2 were calibrated to the PJM dataset, resulting in eight different price forecasts for each calibration window: four models × two VSTs.
The third dataset comprises hourly prices from the EPEX market for Germany and Austria for the period from 6 August 2010 to 28 July 2016, see Figure 3; it has been used in two recent EPF studies [26,27]. Note that unlike the Nord Pool and PJM datasets, it does not include an exogenous variable. Consequently, only the 'pure-price' AR1 and AR2 models (for both VSTs) were calibrated to the EPEX dataset.
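The two preprocessing rules for the daylight saving time changes are simple enough to sketch. The helper names are hypothetical, and the first function assumes that gaps are isolated (each missing value has two observed neighbors), which is how the text describes them.

```python
def fill_missing(values):
    """Replace each None by the arithmetic average of its two neighbors."""
    out = list(values)
    for i, v in enumerate(out):
        if v is None:
            if i == 0 or i == len(out) - 1 or out[i + 1] is None:
                raise ValueError("only isolated interior gaps are supported")
            out[i] = (out[i - 1] + out[i + 1]) / 2
    return out


def merge_doubled(first, second):
    """Collapse the two observations of a 'doubled' hour into their average."""
    return (first + second) / 2
```

For example, `fill_missing([10.0, None, 14.0])` fills the gap with 12.0, and `merge_doubled(20.0, 30.0)` gives 25.0.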

The Rolling Window Scheme
All three datasets are long enough to comprehensively evaluate day-ahead price predictions: the PJM and EPEX time series span 6 years (2184 days), while the more recent Nord Pool series are 5 months shorter (2038 days).Since the longest calibration window we consider is 3 years long, the out-of-sample test periods span 2.5-3 years, see Figures 1-3.
Like the majority of EPF studies, we consider a rolling window scheme. Initially, the first 1092 (= 3 × 364) days are used for calibration of the expert models. Since this study is concerned with the optimal choice of the calibration window length, we consider 28- to 1092-day windows. For windows shorter than 1092 days, the calibration sample is left-truncated so that it ends on the same day as the 1092-day window. Then, the WAW weights (see Section 2.5) are selected based on price predictions for the following day, i.e., 29 December 2015 for Nord Pool, 7 April 2015 for PJM and 2 August 2013 for EPEX. The remaining observations constitute the 1091-day (for PJM and EPEX) or 945-day (for Nord Pool) out-of-sample test periods. Once the price forecasts are obtained for 30 December 2015 (Nord Pool), 8 April 2015 (PJM) and 3 August 2013 (EPEX), the calibration windows are rolled forward by one day and price forecasts for the 24 h of the next day are computed. This procedure is repeated until the predictions for the last day in the test period are obtained.
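The indexing behind this scheme can be sketched as a small generator. The function name and the 0-based, end-exclusive index convention are assumptions; the first target day it yields is the one used only for selecting the WAW weights, and the remaining ones form the out-of-sample test period.

```python
def rolling_windows(n_days, window, longest=1092):
    """Yield (start, end, target) day indices for one calibration window length.

    Days are numbered from 0; the calibration sample is days start..end-1 and
    the forecast is made for day `target`. All window lengths share the same
    end day, so shorter windows are simply left-truncated.
    """
    if not 1 <= window <= longest:
        raise ValueError("window length must be between 1 and `longest`")
    for target in range(longest, n_days):
        yield target - window, target, target
```

With 2184 days (PJM, EPEX) this gives 1092 target days, i.e., one weight-selection day plus the 1091-day test period; with 2038 days (Nord Pool) it gives 946.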

Averaging Forecasts across Calibration Windows
Hubicka et al. [30] have recently proposed a novel concept in energy forecasting and shown that averaging day-ahead EPFs across different calibration windows yields better results than selecting ex-ante only one 'optimal' window length. Their idea is inspired by the econometric literature, where some researchers argue that forecasting performance is sensitive to the choice of the calibration window and that in the presence of structural breaks (i.e., abrupt and unexpected changes in the underlying process) it may be advisable to combine forecasts based on windows of different lengths [40,41]. The rationale behind this approach is that longer windows allow for more precise estimation of model parameters, while shorter ones adapt better to changes. Consequently, forecasts obtained from models calibrated over windows of different lengths will address distinct features of the underlying process. A similar concept, but limited to two window lengths (2- and 3-year), has been recently considered in load forecasting [42,43].
Here, we extend the analysis of Hubicka et al. [30] to other datasets, predictive models with more explanatory variables, VST-transformed price and consumption/load series and, most importantly, a new weighting scheme for averaging forecasts. As discussed in Section 2.4, we first compute EPFs for 1065 different calibration window lengths, ranging from 28 to 1092 days; we use Win(T) to denote the forecast for a calibration window of length T days. Obviously, we do not know ex-ante which T leads to the most accurate predictions and, as we will see in Section 3, the variability of forecast errors across the T's and across the datasets can be substantial. Hence, selecting ex-ante one window length is risky. However, as Hubicka et al. [30] argue, a simple arithmetic average of the Win(T)'s across many T's is a robust alternative that outperforms many Win(T)'s. We denote such a forecast by AW($\mathcal{T}$), where $\mathcal{T}$ is the set of window lengths used. We use Matlab's notation for the latter, e.g., $\mathcal{T}$ = (28,728) refers to 28- and 728-day windows, $\mathcal{T}$ = (28:28:728) to 26 windows, namely 28-, 56- (= 2 × 28), ..., 728-day, and $\mathcal{T}$ = (28:1092) to all 1065 windows ranging from 28 to 1092 days.
While AW(28:1092) is robust, it is not computationally efficient, especially if more CPU-demanding predictive models than regression are used; it involves repeating the estimation exercise over 1000 times and for committee machines of neural networks this can be a cumbersome task [18]. A more 'sophisticated' approach is to cherry-pick only several calibration windows, and not necessarily the best ones, as the suboptimal forecasts may actually average better. Hubicka et al. [30] recommend AW(28:28:84, 714:7:728) as it combines accurate predictions for a mix of short- and long-term windows with computational efficiency and is not significantly outperformed by any other window set in their study. Since we consider calibration windows of up to three years, in Section 3 we will also evaluate AW($\mathcal{T}$)'s for $\mathcal{T}$'s including windows longer than two years.
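The window-set notation and the equally weighted average can be illustrated with a short Python sketch. The helper names are hypothetical; the parser accepts single values, `start:stop` and `start:step:stop` ranges separated by commas, mirroring the colon notation used in the text.

```python
def expand_windows(spec):
    """Expand a window-set string such as '28:28:84, 714:7:728' into a list."""
    lengths = []
    for part in spec.split(","):
        fields = [int(f) for f in part.split(":")]
        if len(fields) == 1:                 # a single window, e.g. '728'
            lengths.append(fields[0])
        elif len(fields) == 2:               # start:stop with unit step
            lengths.extend(range(fields[0], fields[1] + 1))
        elif len(fields) == 3:               # start:step:stop
            start, step, stop = fields
            lengths.extend(range(start, stop + 1, step))
        else:
            raise ValueError("cannot parse window range: " + part)
    return lengths


def aw_forecast(forecast_by_window, windows):
    """Arithmetic average of the forecasts for the given window lengths."""
    return sum(forecast_by_window[T] for T in windows) / len(windows)
```

For instance, `expand_windows("28:28:84, 714:7:728")` returns the six windows 28, 56, 84, 714, 721 and 728, while `"28:1092"` expands to all 1065 window lengths.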
Let us now extend the AW($\mathcal{T}$) concept and introduce a new, non-uniform weighting scheme for averaging forecasts; we denote the resulting price predictions by WAW($\mathcal{T}$). Inspired by the results of Stock and Watson [44] in a macroeconomic context and of Nowotarski et al. [13] in an EPF context, we propose to determine weights based on the inverse of the mean absolute error (MAE) for a certain period in the past:

$$w_T = \frac{1 / \mathrm{MAE}_{\tau,T}}{\sum_{T' \in \mathcal{T}} 1 / \mathrm{MAE}_{\tau,T'}}, \tag{5}$$

where $w_T$ is the weight corresponding to a window of length $T$ and $\mathrm{MAE}_{\tau,T}$ is the MAE for a period of $\tau$ days directly preceding the target day and for a calibration window of length $T$.
Using this approach we assign larger weights to windows that have performed well in the past. Note that, as opposed to earlier studies [13,44] which used large τ's, we suggest using very short weight-selection periods, so that the weighting scheme catches only the most recent dependencies. In fact, based on a limited simulation study, τ = 1 day (i.e., 24 h) seems to be close to optimal and is used in this paper. Hence, when assigning weights for day d, we look at the performance of each calibration window on the previous day, i.e., d − 1. However, the concept is more general and can be extended to different τ's.
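Finally, the WAW($\mathcal{T}$) weighted forecast for day $d$ and hour $h$ is given by:

$$\hat{P}_{d,h} = \sum_{T \in \mathcal{T}} w_T \hat{P}_{d,h,T}, \tag{6}$$

where $\hat{P}_{d,h,T}$ is the prediction for day $d$ and hour $h$ obtained for a calibration window of length $T$.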
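A minimal sketch of the weighting scheme follows, assuming that the inverse-MAE weights are normalized to sum to one so that the combination is a weighted average. The function names are hypothetical, and a window with a zero MAE on the previous day would need special handling that is not shown here.

```python
def waw_weights(mae_by_window):
    """Inverse-MAE weights, normalized to sum to one.

    mae_by_window maps each window length T to the MAE of its forecasts
    over the weight-selection period (the previous day in this paper).
    """
    inverse = {T: 1.0 / mae for T, mae in mae_by_window.items()}
    total = sum(inverse.values())
    return {T: inv / total for T, inv in inverse.items()}


def waw_forecast(forecast_by_window, weights):
    """Weighted average of the forecasts for one day and hour."""
    return sum(weights[T] * forecast_by_window[T] for T in weights)
```

As a worked example, if the 56-day window had an MAE of 2 and the 728-day window an MAE of 4 on the previous day, the weights are 2/3 and 1/3; forecasts of 30 and 36 then combine to 32, whereas the equally weighted average would be 33.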

Evaluation in Terms of MAE
As the main evaluation criterion we consider the mean absolute error (MAE) for the full out-of-sample test period of D = 1091 (for PJM and EPEX) or 945 days (for Nord Pool):

$$\mathrm{MAE} = \frac{1}{24 D} \sum_{d=1}^{D} \sum_{h=1}^{24} |E_{d,h}|, \tag{7}$$

where $E_{d,h} = P_{d,h} - \hat{P}_{d,h}$ is the prediction error for day $d$ and hour $h$. Results for selected windows (T) and window sets ($\mathcal{T}$) are summarized in Tables 1-4 and plotted in Figures 4-7. Note that when calibrated to asinh-transformed PJM data, MAEs of all expert models tend to explode for short windows. This is particularly visible in the rightmost columns in Figures 5 and 6, i.e., respectively for ARX2 and AR2, where MAEs in excess of $10^{10}$ are denoted by ∞. As a robustness check, we have also evaluated the results in terms of the root mean squared error (RMSE) and observed only slight differences, e.g., for the ARX2(PJM, ID) model and calibration window sets containing the shortest windows (i.e., shorter than 2 months) WAW was slightly outperformed by AW averaging. Overall, however, the results were qualitatively very similar and are not reported here.
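For completeness, the error measure can be computed as in the sketch below; the function name and the list-of-days data layout are assumptions.

```python
def mean_absolute_error(actual, predicted):
    """MAE over all days and hours.

    actual and predicted are lists of days, each day being a list of
    24 hourly prices (real and forecasted, respectively).
    """
    total, count = 0.0, 0
    for day_actual, day_predicted in zip(actual, predicted):
        for price, forecast in zip(day_actual, day_predicted):
            total += abs(price - forecast)
            count += 1
    return total / count
```

With two days of constant prices 10 and 20 and forecasts 12 and 19, the hourly absolute errors are 2 and 1, so the MAE is 1.5.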
Although there are three datasets in our study, we essentially have four test cases: (i) ARX-type models for Nord Pool in Table 1 and Figure 4; (ii) ARX-type models for PJM in Table 2 and Figure 5; (iii) AR-type models for PJM in Table 3 and Figure 6; and (iv) AR-type models for EPEX in Table 4 and Figure 7. The following conclusions can be drawn regarding performance across the models and VSTs (for a given window or window set):
• Comparing the two expert model structures, i.e., ARX1 vs. ARX2 and AR1 vs. AR2, we can clearly see that the more parsimonious one is outperformed by the larger one. In two test cases (ARX for Nord Pool and AR for EPEX) this is true irrespective of the VST, i.e., ARX2(NP, ID) and AR2(EPEX, ID) yield more accurate predictions than ARX1(NP, asinh) and AR1(EPEX, asinh), respectively. This result supports the observations made by Ziel and Weron [27] that AR(X)2 is a very competitive expert model and can outperform much richer structures.
• Regarding the two VSTs, ID and asinh, clearly the latter leads to better forecasts than identity (i.e., fitting models to raw data). This can be seen, for instance, by comparing the levels of gray dotted curves representing Win(T) forecasts in the left and corresponding right panels of Figures 4-7. This result provides strong support for using variance stabilizing transformations. Although asinh was not the best VST in the study of Uniejewski et al. [26], it is straightforward to implement, its computational cost is negligible and on average it performs very well. We recommend it for EPF.
• Autoregressive models with and without the exogenous variable can be compared on the PJM dataset (Table 2 and Figure 5 vs. Table 3 and Figure 6). As is often reported in the EPF literature [1], models with the load forecast as an explanatory variable perform better; in our study by 2-8%, except for a few isolated cases when MAEs of the AR1(PJM, asinh) model exploded.
Now, with respect to the performance of a given model/VST across windows (T) and window sets ($\mathcal{T}$) we can observe that:
• The behavior of Win(T) as a function of T is very unpredictable. For ARX models calibrated to raw Nord Pool data the MAEs decrease with T and 3-year windows are preferred, see the gray dotted curves in the left panels of Figure 4. However, if the data is asinh-transformed beforehand then the choice of a single window length is not that clear-cut, see the right panels in Figure 4. The picture is completely different for the PJM dataset: now the gray curves have a minimum around 350-450 days and the very long windows are as bad as (or even worse than) the very short ones, see Figures 5-6. For the EPEX dataset the gray curves also have a minimum around 350-450 days, but now the very long windows are better than the very short ones, see Figure 7. The origins of these minima around 350-450 days are not clear. However, a possible explanation is that time series characteristics changed over time in the PJM and EPEX markets, see the decrease in volatility in Figures 2 and 3, and too long windows include past spikes which are not so pronounced recently.
• Comparing Win(T) with AW and WAW averaging we note that in some cases, i.e., for ARX2(NP, asinh), ARX2(PJM, asinh) and AR1(EPEX, *), the latter outperform all Win(T)'s. In many cases averaging outperforms most Win(T)'s, while for AR1(PJM, ID) and AR2(EPEX, asinh) there are a few T's for which Win(T) is slightly better than any considered average. However, selecting those T's ex-ante is unlikely. Hence, the presented results support the concept of averaging forecasts across calibration windows.
• Regarding AW and WAW averaging we can observe that the new approach proposed in this paper almost always outperforms the equally weighted scheme of Hubicka et al. [30], i.e., the dashed lines in Figures 4-7 are almost always lower than the solid ones of the same color. Moreover, when compared to AW averaging, the WAW scheme decreases the negative impact of a poorly performing calibration window (or a window subset) on the resulting forecast, hence yielding an even more robust outcome.
• An ex-ante choice of $\mathcal{T}$ is not trivial. However, a mix of short- and long-term windows typically outperforms an average across all windows. Recall that Hubicka et al. [30] recommend AW(28:28:84, 714:7:728) as it combines accurate predictions with computational efficiency and is not significantly outperformed by any other window set in their study. Our results show that including ca. 3-year windows, e.g., (1078:7:1092), instead of ca. 2-year windows, e.g., (714:7:728), does not bring visible benefits. However, we suggest using slightly longer windows at the shorter end, i.e., (56:28:112) instead of (28:28:84), as MAEs for the latter may explode.
Summing up, we recommend using the WAW(56:28:112, 714:7:728) averaging scheme.It is computationally efficient (requires generating only six forecasts for the six calibration windows) and exhibits very good performance across all four autoregressive models, both transformations and all three datasets.

The CPA Test and Statistical Significance
The obtained MAE values can be used to provide a ranking of models, but do not allow us to draw statistically significant conclusions on the outperformance of the forecasts of one model by those of another. Therefore, we use the Giacomini and White [32] test for conditional predictive ability (CPA), which can be regarded as a generalization of the commonly used Diebold and Mariano [45] test for unconditional predictive ability. While both tests can be used for nested and non-nested models, as long as the calibration window does not grow with the sample size [46], only the CPA test accounts for parameter estimation uncertainty and hence is the preferred option. Here, one statistic for each pair of models is computed based on the 24-dimensional vector of errors for each day:

$$\Delta_{X,Y,d} = \|E_{X,d}\| - \|E_{Y,d}\|, \tag{8}$$

where $\|E_{Z,d}\| = \sum_{h=1}^{24} |E_{d,h}|$ for model Z; by 'model Z' we mean here one of the four autoregressive structures defined in Section 2.2 calibrated either on a window of a certain length, i.e., Win(T), or a combined forecast of one of the four autoregressive models for a set of windows, i.e., AW($\mathcal{T}$) or WAW($\mathcal{T}$). For each model pair and each dataset we compute the p-value of the CPA test [32] with null $H_0: \phi = 0$ in the regression:

$$\Delta_{X,Y,d} = \phi' X_{d-1} + \varepsilon_d, \tag{9}$$

where $X_{d-1}$ contains elements from the information set on day d − 1, i.e., a constant and lags of $\Delta_{X,Y,d}$; note that the parameters of each of the two models are also estimated using data up to day d − 1.
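The sketch below shows one common way to compute such a p-value for one-step-ahead forecasts: a Wald-type statistic built from the loss differential multiplied by the conditioning variables, compared with a chi-squared distribution. It assumes that the conditioning set consists of a constant and a single lag of the loss differential (hence two degrees of freedom, for which the chi-squared tail probability has a closed form); the function names are hypothetical and the authors' implementation may differ in these choices.

```python
import math


def daily_loss(errors_by_day):
    """Sum of absolute hourly errors for each day (the L1 norm of the error vector)."""
    return [sum(abs(e) for e in day) for day in errors_by_day]


def cpa_pvalue(loss_x, loss_y):
    """p-value of a one-step CPA-type test with instruments (1, lagged differential)."""
    delta = [lx - ly for lx, ly in zip(loss_x, loss_y)]
    # z_d = instruments known on day d-1, multiplied by the differential on day d
    z = [(delta[d], delta[d - 1] * delta[d]) for d in range(1, len(delta))]
    s1 = sum(a for a, _ in z)
    s2 = sum(b for _, b in z)
    m11 = sum(a * a for a, _ in z)
    m12 = sum(a * b for a, b in z)
    m22 = sum(b * b for _, b in z)
    det = m11 * m22 - m12 * m12
    if det <= 0:
        raise ValueError("degenerate loss differential series")
    # Quadratic form s' M^{-1} s, written out for the 2x2 case
    stat = (s1 * (m22 * s1 - m12 * s2) + s2 * (m11 * s2 - m12 * s1)) / det
    # Survival function of the chi-squared distribution with 2 degrees of freedom
    return math.exp(-stat / 2.0)
```

A small p-value indicates that the two forecasts differ significantly in conditional predictive ability; the sign of the average loss differential then tells which of the two is better.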
In Figures 8-15 we visualize the obtained p-values using 'chessboards', analogously as in [18,19,25-27] for the Diebold-Mariano test, i.e., we use a heat map to indicate the range of the p-values: the closer they are to zero (→ dark green) the more significant is the difference between the forecasts of a set on the X-axis (better) and the forecasts of a set on the Y-axis (worse). For instance, the first row in the right panels of Figures 8-15 is green, indicating that the forecasts of all four expert models for every window set significantly outperform those for Win(28). Actually, in the left panels the first row is also green except for the black squares corresponding to the very poorly performing Win(728) and Win(1092) predictions for the ARX1 model fitted to raw PJM prices, see Figure 10. On the other hand, the columns which correspond to WAW(28:28:84, 1078:7:1092) averaging are green in both panels of Figure 8, meaning that this window set leads to significantly better forecasts than all others for the ARX1 model fitted to ID- or asinh-transformed Nord Pool prices. Note that due to the explosive nature of models calibrated to asinh-transformed PJM data over very short windows (see Section 3.1), in the right panels of Figures 10-13 all window sets were left-truncated at 40 days, i.e., all windows of less than 40 days were discarded from the average for models AW(28:1092) through WAW(28:728), while window T = 28 was replaced by T = 40 for the remaining ones.
Overall, the CPA test results confirm and emphasize the observations made in Section 3.1. In particular, the majority of AW and WAW averaged forecasts significantly outperform those of the five selected calibration window lengths, see the mostly green rows corresponding to Win(T)'s for T = 28, 56, 364, 728, 1092. Moreover, in cases such as ARX2(PJM, ID) and ARX2(NP, ID), where some of the Win(T) forecasts perform well, the best T varies, so an ex-ante choice of the correct calibration window (364 and 1092 days, respectively, in these two cases) is problematic. On the other hand, the averaged forecasts (especially those based on the new WAW scheme) are less prone to this instability. The mixed short- and long-term window sets are strong performers in these two cases, and in general across all datasets and models (note the mostly green columns corresponding to these $\mathcal{T}$'s).
Comparing the averaging schemes, it is worth noting that in no case does AW significantly outperform the corresponding WAW scheme. On the other hand, the opposite can be observed for several cases, across all datasets, both transformations and different numbers of averaged forecasts. This result reinforces our recommendation of using the WAW scheme for combining multiple forecasts for EPF, regardless of the number of predictions combined.
Finally, the WAW(56:28:112, 714:7:728) averaging scheme recommended above performs strongly (see the largely green rightmost columns) and, in most cases, is not significantly outperformed by any other model (see the mostly black bottom rows). This is especially worth emphasizing, because the best model is different for almost every test case, see Tables 1-4 and Figures 4-7.

Conclusions
In this paper, we report on a comprehensive empirical study on the selection of calibration windows for day-ahead EPF. Our starting point was the paper of Hubicka et al. [30], who proposed a novel concept in energy forecasting that combined day-ahead predictions across different calibration windows. We have extended their analysis to much longer datasets, predictive models with more explanatory variables, and VST-transformed price and consumption/load series. Most importantly, we have introduced a new, well-performing WAW weighting scheme for averaging forecasts.
Firstly, we have confirmed the observations of Ziel and Weron [27], that AR(X)2 is a very competitive expert model, and of Uniejewski et al. [26], that models calibrated to asinh-transformed prices (and consumption/load forecasts) outperform by a large margin structures fitted to raw prices. Since the area hyperbolic sine transformation is straightforward to implement and its computational cost is negligible, we recommend it for EPF.
Moreover, we have shown that the majority of AW and WAW averaged forecasts significantly outperform those obtained from fitting a model to one ex-ante selected window length. Interestingly, in no case did AW significantly outperform the corresponding WAW scheme. On the other hand, the opposite can be observed for several cases, across all datasets, both transformations and different numbers of averaged forecasts.
As noted by Hubicka et al. [30], the mixed short- and long-term window sets are strong performers; in our case, this is true across all datasets and models. However, we suggest using slightly longer windows at the shorter end, because MAEs may explode for the shortest windows, especially if models with more variables are considered. On the other hand, including 3- instead of 2-year windows does not bring significant benefits. Overall, we recommend the WAW(56:28:112, 714:7:728) averaging scheme. It performs very well and, in most cases, is not significantly outperformed by any other forecast.

Figure 3. EPEX hourly system prices for Germany and Austria from 6 August 2010 to 28 July 2016. The vertical dashed lines mark the end of the initial 1092-day calibration window (i.e., 6 August 2010-1 August 2013). Weights for the WAW approach are selected based on price predictions for the following day (i.e., 2 August 2013 for the initial calibration windows) and the remaining observations constitute the 1091-day out-of-sample test period.

Figure 4. Mean absolute errors (MAE) for the Nord Pool (NP) dataset as a function of the window length T = 28, ..., 1092 (gray circles) and obtained by combining forecasts: solid lines for AW($\mathcal{T}$) and dashed lines for WAW($\mathcal{T}$) averages across all windows in a given range, solid lines with symbols for AW($\mathcal{T}$) and dashed lines with symbols for WAW($\mathcal{T}$) averages with cherry-picked windows. MAEs are plotted for ARX1 (top) and ARX2 models (bottom), calibrated to raw (left) and asinh-transformed prices (right). Note the different scales. Out-of-range lines or symbols are not visible.

Figure 5. Mean absolute errors (MAE) for the PJM dataset as a function of the window length T = 28, ..., 1092 (gray circles) and obtained by combining forecasts (the color/line/symbol scheme and the location of panels are the same as in Figure 4). Note the different scales. Out-of-range lines or symbols are not visible.

Figure 6. Mean absolute errors (MAE) for the PJM dataset as a function of the window length T = 28, ..., 1092 (gray circles) and obtained by combining forecasts (the color/line/symbol scheme is the same as in Figure 4). MAEs are plotted for AR1 (top) and AR2 models (bottom), calibrated to raw (left) and asinh-transformed prices (right). Note the different scales. Out-of-range lines or symbols are not visible.

Figure 7. Mean absolute errors (MAE) for the EPEX dataset as a function of the window length T = 28, ..., 1092 (gray circles) and obtained by combining forecasts (the color/line/symbol scheme and the location of panels are the same as in Figure 6). Note the different scales. Out-of-range lines or symbols are not visible.

Figure 8. Results of the conditional predictive ability (CPA) test [32] for selected window sets and the ARX1 model fitted to raw (left) or asinh-transformed (right) Nord Pool data. We use a heat map to indicate the range of the p-values: the closer they are to zero (→ dark green) the more significant is the difference between the forecasts of a set on the X-axis (better) and the forecasts of a set on the Y-axis (worse).

Figure 9. Results of the CPA test for selected window sets and the ARX2 model fitted to raw (left) or asinh-transformed (right) Nord Pool data. We use a heat map to indicate the range of the p-values: the closer they are to zero (→ dark green) the more significant is the difference between the forecasts of a set on the X-axis (better) and the forecasts of a set on the Y-axis (worse).

Figure 10. Results of the CPA test for selected window sets and the ARX1 model fitted to raw (left) or asinh-transformed (right) PJM data. We use a heat map to indicate the range of the p-values: the closer they are to zero (→ dark green) the more significant is the difference between the forecasts of a set on the X-axis (better) and the forecasts of a set on the Y-axis (worse).

Figure 11. Results of the CPA test for selected window sets and the ARX2 model fitted to raw (left) or asinh-transformed (right) PJM data. We use a heat map to indicate the range of the p-values: the closer they are to zero (→ dark green) the more significant is the difference between the forecasts of a set on the X-axis (better) and the forecasts of a set on the Y-axis (worse).

Figure 12. Results of the CPA test for selected window sets and the AR1 model fitted to raw (left) or asinh-transformed (right) PJM prices. We use a heat map to indicate the range of the p-values: the closer they are to zero (→ dark green) the more significant is the difference between the forecasts of a set on the X-axis (better) and the forecasts of a set on the Y-axis (worse).

Figure 13. Results of the CPA test for selected window sets and the AR2 model fitted to raw (left) or asinh-transformed (right) PJM prices. We use a heat map to indicate the range of the p-values: the closer they are to zero (→ dark green) the more significant is the difference between the forecasts of a set on the X-axis (better) and the forecasts of a set on the Y-axis (worse).

Figure 14. Results of the CPA test for selected window sets and the AR1 model fitted to raw (left) or asinh-transformed (right) EPEX prices. We use a heat map to indicate the range of the p-values: the closer they are to zero (→ dark green) the more significant is the difference between the forecasts of a set on the X-axis (better) and the forecasts of a set on the Y-axis (worse).

Figure 15. Results of the CPA test for selected window sets and the AR2 model fitted to raw (left) or asinh-transformed (right) EPEX prices. We use a heat map to indicate the range of the p-values: the closer they are to zero (→ dark green) the more significant is the difference between the forecasts of a set on the X-axis (better) and the forecasts of a set on the Y-axis (worse).
In Equation (3), $X_{d-1,h}$, $X_{d-2,h}$ and $X_{d-7,h}$ account for the autoregressive effects of the previous days (i.e., the same hour yesterday, two days ago and one week ago), $X_{d-1,\min}$ is the minimum of the previous day's 24 hourly VST-transformed prices and the exogenous variable $C_{d,h}$ refers to the consumption (or load) forecast for day $d$ and hour $h$ (known on day $d-1$ and VST-transformed). The three dummy variables $D_{Sat}$, $D_{Sun}$ and $D_{Mon}$ model the weekly seasonality, and are defined as $D_i = 1$ for $d = i$ and zero otherwise. Finally, the $\varepsilon_{d,h}$'s are assumed to be independent and identically distributed normal variables. The second model, denoted by AR1, is obtained from Equation (3) by setting $\beta_{h,5} \equiv 0$, i.e., by discarding the exogenous variable.

Table 1. Mean absolute errors (MAE) of the ARX1 and ARX2 models fitted to ID- or asinh-transformed Nord Pool (NP) data for selected windows (T) and window sets ($\mathcal{T}$). Relative improvements (%chng) of the Win(T) and WAW($\mathcal{T}$) forecasts with respect to Win(728) are also reported.

Table 2. Mean absolute errors (MAE) of the ARX1 and ARX2 models fitted to ID- or asinh-transformed PJM data for selected windows (T) and window sets ($\mathcal{T}$). Relative improvements (%chng) of the Win(T) and WAW($\mathcal{T}$) forecasts with respect to Win(728) are also reported. MAEs in excess of $10^{10}$ are denoted by ∞; for these values we do not provide %chng.

Table 3. Mean absolute errors (MAE) of the AR1 and AR2 models fitted to ID- or asinh-transformed PJM data for selected windows (T) and window sets ($\mathcal{T}$). Relative improvements (%chng) of the Win(T) and WAW($\mathcal{T}$) forecasts with respect to Win(728) are also reported. MAEs in excess of $10^{10}$ are denoted by ∞; for these values we do not provide %chng.

Table 4. Mean absolute errors (MAE) of the AR1 and AR2 models fitted to ID- or asinh-transformed EPEX data for selected windows (T) and window sets ($\mathcal{T}$). Relative improvements (%chng) of the Win(T) and WAW($\mathcal{T}$) forecasts with respect to Win(728) are also reported.