Efficient Forecasting of Electricity Spot Prices with Expert and LASSO Models

Recent electricity price forecasting (EPF) studies suggest that the least absolute shrinkage and selection operator (LASSO) leads to well performing models that are generally better than those obtained from other variable selection schemes. By conducting an empirical study involving datasets from two major power markets (Nord Pool and PJM Interconnection), three expert models, two multi-parameter regression (called baseline) models and four variance stabilizing transformations combined with the seasonal component approach, we discuss the optimal way of implementing the LASSO. We show that using a complex baseline model with nearly 400 explanatory variables, a well chosen variance stabilizing transformation (asinh or N-PIT), and a procedure that recalibrates the LASSO regularization parameter once or twice a day indeed leads to significant accuracy gains compared to the typically considered EPF models. Moreover, by analyzing the structures of the best LASSO-estimated models, we identify the most important explanatory variables and thus provide guidelines to structuring better performing models.


Introduction
One of the challenges of short-term electricity price forecasting (EPF) is variable (or feature) selection.
However, there is no standard approach when the number of potential explanatory variables is large. So far, the typical approach has been to select predictors in an ad-hoc fashion or using expert knowledge [1]. Rarely has a formal validation procedure been carried out, and hardly ever has an automated selection or shrinkage algorithm been utilized.

The earliest known examples of statistically sound variable selection in EPF include Karakatsani and Bunn [2] and Misiorek [3], who used stepwise regression to eliminate statistically insignificant variables in parsimonious autoregression (AR) and regime-switching models in a multivariate setting, i.e., separately for the individual load periods; see [4] for a discussion of the uni- and multivariate frameworks in EPF. In the machine learning literature, Amjady and Keynia [5] proposed a feature selection algorithm based on mutual information, which was later utilized in [6,7], among others. On the other hand, Gianfreda and Grossi [8] used a very simple technique: single-step elimination of insignificant predictors in a regression setting.
To our best knowledge, Barnes and Balda [9] were the first to apply regularization in EPF. Namely, in a study concerning the profitability of battery storage, they utilized ridge regression to compute forecasts of NYISO electricity prices for a model with more than 50 regressors. 'Putting big data analytics to work', as they referred to it, Ludwig et al. [10] used random forests and the least absolute shrinkage and selection operator (LASSO or Lasso) as a feature selection algorithm to choose the relevant out of the 77 available weather stations. In parallel, in the machine learning literature, González et al. [11] utilized random forests, while Keles et al. [7] combined the k-Nearest-Neighbor algorithm with backward elimination to select the most appropriate inputs out of more than 50 fundamental parameters or lagged versions of these parameters.
The next qualitative change came with the papers of Ziel [12,13], who used the LASSO to sparsify very large (100+) sets of model parameters, either utilizing B-splines in a univariate setting or, more efficiently, within a multivariate framework. In the first thorough comparative study, Uniejewski et al. [14] evaluated six automated selection and shrinkage procedures (single-step elimination, forward and backward stepwise regression, ridge regression, LASSO and elastic nets) applied to a baseline model with 100+ regressors. They concluded that using the LASSO and elastic nets could bring significant accuracy gains compared to commonly used EPF models. Finally, in a study on the optimal model structure for short-term EPF, i.e., uni- vs. multivariate, Ziel and Weron [4] considered autoregressive models with 200+ potential explanatory variables (but no exogenous ones) and concluded that both uni- and multivariate LASSO-implied structures significantly outperform autoregressive benchmarks, and that combining their forecasts can bring further improvements in predictive accuracy.
Yet, despite these very recent advances in the use of regularization techniques, and particularly the LASSO, for EPF, a few issues remain open:
• What is the optimal structure of the baseline model, i.e., which of the available regressors should we consider?
• How should we choose the tuning (regularization) parameter λ?
• Should we use a variance stabilizing transformation (VST) prior to estimating the model? If yes, which of the known ones [15]?
To this end, we conduct a comprehensive empirical study that involves a nearly 5-year dataset from one of the major European power markets (Nord Pool), six model structures (with 700+ variants), and evaluation in terms of the classical error measure for point forecasts (i.e., the Mean Absolute Error, MAE) and the Diebold-Mariano (DM) test [16] for significant differences in forecasting accuracy.
For the benchmarks we consider a naive similar-day scheme and three parsimonious autoregressive models with exogenous variables (ARX). Two of them are built on popular structures, originally proposed in [4,17], and the third one is a new specification motivated by the results presented in [14, Table 2]. After Uniejewski et al. [14] and Ziel [13], we refer to all three ARX benchmarks as expert models, since they are built on some prior knowledge of experts. As the baseline structures that are 'shrunk' via the LASSO we use two multi-parameter ARX models: one has already been considered in the EPF context [14], the other is a newly proposed specification. Furthermore, we propose a two-step procedure that merges automated variable selection via the LASSO with ordinary least squares (OLS) estimation of its weights, and compare the obtained point forecasts across four variance stabilizing transformations (VSTs) and 39 schemes of selecting the tuning parameter λ.
The remainder of the paper is structured as follows. In Section 2 we first introduce the dataset and the rolling window scheme, then discuss seasonal decomposition and the four VSTs used in this study. In Section 3 we introduce the considered model structures. Then, in Section 4, we compare the predictive performance, in terms of the MAE and the DM-test, across models, VSTs and λ-selection schemes. Finally, in Section 5 we wrap up the results and conclude.

The test ground
For the test ground we have chosen a major European power market, Nord Pool. The dataset comprises system prices, day-ahead consumption prognoses for four Nordic countries (Denmark, Finland, Norway and Sweden) and day-ahead wind power generation prognoses for the period 01.01.2013-19.09.2017, all at hourly resolution. The time series plotted in Fig. 1 were constructed using data published by the Nordic power exchange Nord Pool (www.nordpoolspot.com) and preprocessed to account for missing values and changes to/from daylight saving time, analogously as in [18]. The missing data values (corresponding to the changes to daylight saving/summer time and eight hourly consumption figures for Norway) were substituted by the arithmetic average of the neighboring values. The 'doubled' values (corresponding to the changes from daylight saving/summer time) were substituted by the arithmetic average of the two values for the 'doubled' hour.
Like the majority of EPF studies, we consider a rolling window scheme. Each day, the 728-day calibration window is rolled forward by 24 hours and price forecasts for the 24 hours of the next day are computed. Initially, all considered models (their short-term and long-term components) are calibrated to data from 1.01.2013 to 29.12.2014 and forecasts for all 24 hours of 30.12.2014 are determined. Then the window is rolled forward by one day and forecasts for all 24 hours of the next day are computed.
This procedure is repeated until predictions for the last day in the sample (i.e., 19.09.2017) are made.
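The rolling scheme above can be sketched as a generic Python loop; `calibrate` and `predict` are hypothetical stand-ins for the calibration and forecasting routines of the models discussed in Section 3:

```python
import numpy as np

def rolling_window_forecasts(prices, calibrate, predict, window_days=728):
    """Roll a fixed-length calibration window forward one day at a time.

    `prices` is an (n_days, 24) array of hourly prices; `calibrate` fits a
    model on the window, `predict` returns the next day's 24 forecasts.
    Both callables are user-supplied stand-ins (hypothetical API).
    """
    n_days = prices.shape[0]
    forecasts = []
    for d in range(window_days, n_days):
        window = prices[d - window_days:d]        # 728 x 24 calibration sample
        model = calibrate(window)
        forecasts.append(predict(model, window))  # forecasts for day d
    return np.array(forecasts)

# Toy usage: a 'model' that simply repeats the last observed day.
toy = np.arange(730 * 24, dtype=float).reshape(730, 24)
out = rolling_window_forecasts(toy, calibrate=lambda w: w[-1],
                               predict=lambda m, w: m)
```

With 730 days of data and a 728-day window, the loop produces forecasts for exactly the last two days, mirroring how the out-of-sample period grows one day at a time.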

Seasonal decomposition
Recently published research leaves no doubt that seasonal decomposition of the price series leads to significant improvements in forecasting accuracy [19-21]. Here we follow the Seasonal Component AutoRegressive (SCAR) modeling framework of Nowotarski and Weron [19], i.e., the series of electricity spot prices is decomposed in an additive manner into a long-term seasonal component (LTSC) and the remaining stochastic component with short-term periodicities. Then the two are modeled separately and their day-ahead forecasts are added up to yield the final price forecast. Note that, following [21], we add an additional step to the original SCAR algorithm, in which we decompose the exogenous series (consumption prognosis and wind power generation prognosis; see Section 2.1). Moreover, in contrast to other papers built around the SCAR concept, (i) we do not limit the analysis to log-transformed data but consider more general variance stabilizing transformations (see Section 2.3), and (ii) we work only with the S9 wavelet-based LTSC [18,22], which has been found to be robust and well performing across a range of datasets and forecasting models [20,21].
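The additive logic of the SCAR framework can be illustrated with a toy decomposition; note that a centered moving average is used below as a stand-in for the S9 wavelet filter, purely for illustration:

```python
import numpy as np

def decompose_additive(prices, window=7 * 24):
    """Additive split: price = LTSC + stochastic component.

    A centered moving average stands in for the S9 wavelet smoother used in
    the paper (an assumption for illustration only); any low-pass filter
    yielding a smooth LTSC fits the same additive scheme.
    """
    kernel = np.ones(window) / window
    half = window // 2
    padded = np.pad(prices, half, mode='edge')
    ltsc = np.convolve(padded, kernel, mode='same')[half:half + len(prices)]
    return ltsc, prices - ltsc

# Toy hourly series: level 50 with a daily sinusoidal cycle.
hourly = 50 + 10 * np.sin(np.arange(60 * 24) * 2 * np.pi / 24)
ltsc, stoch = decompose_additive(hourly)
```

By construction the two components add back to the original series; the daily cycle stays in the stochastic part, while the LTSC captures the slowly varying level.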

Variance stabilizing transformations
The basic idea behind a variance stabilizing transformation (VST) is to reduce the variation of the price data. Lower variation and/or less spiky behavior of the input data usually allows the forecasting model to yield more accurate predictions [22]. Let Y_{d,h} denote the deseasonalized electricity price in the day-ahead market for day d and hour h, and X_{d,h} = f(y_{d,h}) the transformed data. After computing the forecasts, we apply the inverse transformation to obtain the deseasonalized electricity spot price forecasts: Ŷ_{d,h} = f^{-1}(X̂_{d,h}). We use four VST classes that have been found to perform well in an EPF context [15]. Three of them work on 'normalized' values, y_{d,h} = (1/b)(Y_{d,h} − a), where a is the median of Y_{d,h} in the 728-day calibration window and b is the sample median absolute deviation (MAD) around the median. The last class builds on the probability integral transform (PIT) and does not require normalization. In Figure 2 we illustrate all four VSTs on a sample dataset.
The area hyperbolic sine (asinh) transformation is nothing else but the inverse of the hyperbolic sine:

X_{d,h} = asinh(y_{d,h}) = log( y_{d,h} + (y_{d,h}^2 + 1)^{1/2} ),

with inverse y_{d,h} = sinh(X_{d,h}). In the context of electricity prices it was used, e.g., in [4,23]. The mirror-log (mlog) and polynomial (poly) transformations were introduced by Uniejewski et al. [15] as generalizations of the robust Box-Cox transformation for λ = 0 and λ ≠ 0, respectively. The former is a straightforward generalization of the log-transformation, with a mirror image of the logarithm for negative values:

X_{d,h} = mlog(y_{d,h}) = sgn(y_{d,h}) [ log(|y_{d,h}| + 1/c) + log(c) ],

with inverse y_{d,h} = sgn(X_{d,h}) (1/c) ( exp(|X_{d,h}|) − 1 ). It is a one-parameter family; here we use c = 1/3, as suggested in [15]. Similarly to the standard log-transformation and the asinh, the mlog exhibits a log-damping effect. The polynomial transformation is defined as:

X_{d,h} = poly(y_{d,h}) = sgn(y_{d,h}) [ (|y_{d,h}| + (c/λ)^{1/(λ−1)})^λ − (c/λ)^{λ/(λ−1)} ].

It is a two-parameter family; here we use λ = 0.125 and c = 0.05, as suggested in [15]. Both the mlog and the poly have a slope of c at the origin.
The last class we consider is based on the so-called probability integral transform (PIT), U_{d,h} = F_Z(Z_{d,h}), where F_Z is the cumulative distribution function (cdf) of Z. The N-PIT is defined as:

X_{d,h} = Φ^{-1}( F̂_Y(Y_{d,h}) ),

with inverse Y_{d,h} = F̂_Y^{-1}( Φ(X_{d,h}) ), where Φ is the standard normal cdf and F̂_Y is the empirical cdf of Y_{d,h}. The N-PIT, misleadingly called the Nataf transformation, was used in [24] to normalize Spanish electricity prices. It was also used, together with the t-PIT (a Student-t transformed PIT), in a recent extensive EPF study [15]. In Figure 2 we can clearly see that the histogram of the N-PIT-transformed prices looks like the standard normal density.
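To make the four classes concrete, a minimal Python sketch of the transformations is given below. The functional forms of the mlog and poly follow our reading of [15] with the parameter values quoted above; the N-PIT uses a rank-based empirical cdf, so the exact construction of F̂_Y is an assumption:

```python
import numpy as np
from statistics import NormalDist

def normalize(Y):
    """'Normalize' by the median a and the MAD b of the calibration sample."""
    a = np.median(Y)
    b = np.median(np.abs(Y - a))
    return (Y - a) / b, a, b

def asinh_vst(y):
    return np.arcsinh(y)                      # inverse: np.sinh

def mlog_vst(y, c=1/3):
    # sgn(y) * [log(|y| + 1/c) + log(c)] = sgn(y) * log(c*|y| + 1)
    return np.sign(y) * np.log(c * np.abs(y) + 1.0)

def mlog_inv(x, c=1/3):
    return np.sign(x) * (np.exp(np.abs(x)) - 1.0) / c

def poly_vst(y, lam=0.125, c=0.05):
    s = (c / lam) ** (1.0 / (lam - 1.0))
    return np.sign(y) * ((np.abs(y) + s) ** lam - s ** lam)

def poly_inv(x, lam=0.125, c=0.05):
    s = (c / lam) ** (1.0 / (lam - 1.0))
    return np.sign(x) * ((np.abs(x) + s ** lam) ** (1.0 / lam) - s)

def npit_vst(Y):
    """Empirical N-PIT: map ranks to (0,1), then apply the inverse normal cdf."""
    n = len(Y)
    ranks = np.argsort(np.argsort(Y)) + 1     # ranks 1, ..., n
    nd = NormalDist()
    return np.array([nd.inv_cdf(r / (n + 1.0)) for r in ranks])

# Example: transform a sample of (hypothetical) deseasonalized prices.
rng = np.random.default_rng(0)
Y = rng.normal(40.0, 10.0, 500)
y, a, b = normalize(Y)
```

A quick numerical check confirms the slope-c behavior of the mlog at the origin and that the stated inverses recover the input; the N-PIT is monotone, so it preserves the ordering of the prices.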

The forecasting framework
Let us now summarize the forecasting framework. For each autoregressive model discussed in Section 3, i.e., all models except the naive benchmark, it consists of five steps:
1. (a) Decompose the series of prices in the calibration window of 728 × 24 = 17472 hours using S9 wavelet smoothing, i.e., utilizing nine decomposition levels [18,22], into a trend-seasonal component T_{d,h} and a stochastic component Y_{d,h} (with short-term periodicities): P_{d,h} = T_{d,h} + Y_{d,h}. (b) Compute persistent forecasts of the LTSC independently for each of the 24 hours of the next day, i.e., T̂_{d*+1,h} ≡ T_{d*,h} for h = 1, ..., 24, where d* is the last day in the calibration window. (c) Decompose the exogenous series in the calibration window of (728 + 1) × 24 = 17496 hours using the same type of wavelet smoothing as in Step 1(a). Note that this window extends one day further, since the consumption and wind power forecasts are known one day in advance.
2. Normalize, y_{d,h} = (1/b)(Y_{d,h} − a), and transform, X_{d,h} = f(y_{d,h}), the stochastic component, where f is one of the VSTs defined in Section 2.3 and (a, b) = (median, MAD) of Y_{d,h} in the calibration window. Analogously normalize and transform the stochastic component(s) of the exogenous series.
3. Calibrate one of the autoregressive models discussed in Section 3 to X_{d,h} and compute forecasts for each of the 24 hours of the next day: X̂_{d*+1,h}, h = 1, ..., 24.
4. Back-transform the obtained forecasts: Ŷ_{d*+1,h} = b · f^{-1}(X̂_{d*+1,h}) + a (for the N-PIT no normalization is involved).
5. Add the persistent LTSC forecasts to yield the final price forecasts: P̂_{d*+1,h} = T̂_{d*+1,h} + Ŷ_{d*+1,h}.
Note that compared to recent EPF papers utilizing the SCAR approach [19-21], we introduce important changes. Firstly, instead of the logarithm we use one of the four VSTs defined in Section 2.3, which can also be applied to price series with negative values, like those from the German EEX market [25].
Secondly, the order of data transformation and decomposition is swapped. Now, we first seasonally decompose the data and then transform the stochastic part; the same applies to the exogenous series (consumption and wind power generation prognoses).

Benchmarks
The first benchmark, the so-called naive method, belongs to the class of similar-day techniques. It is defined by P̂_{d,h} = P_{d−7,h} for Monday, Saturday and Sunday, and P̂_{d,h} = P_{d−1,h} otherwise. As Contreras et al. [26] argue, forecasting procedures that are not calibrated carefully surprisingly often fail to outperform this extremely simple method.
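As a minimal illustration, the naive rule can be coded directly; the `prices` mapping from dates to 24-hour price vectors is a hypothetical container:

```python
import datetime as dt

def naive_forecast(prices, date):
    """Similar-day benchmark: use last week's prices for Monday, Saturday
    and Sunday, and yesterday's prices otherwise. `prices` maps a date to
    its 24 hourly prices (illustrative container, not the paper's code)."""
    lag = 7 if date.weekday() in (0, 5, 6) else 1  # Mon=0, Sat=5, Sun=6
    return prices[date - dt.timedelta(days=lag)]

# Toy usage: constant daily profiles keyed by date.
hist = {dt.date(2017, 9, d): [float(d)] * 24 for d in range(1, 19)}
```

For a Monday (e.g., 18.09.2017) the rule reaches back seven days; for a midweek day it simply repeats the previous day's profile.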
The second benchmark, denoted by ARX1, is a simple autoregressive structure originally proposed by Misiorek et al. [17] and later used in a number of EPF studies [13,14,18,20,27-30]:

X_{d,h} = β_{h,1} X_{d−1,h} + β_{h,2} X_{d−2,h} + β_{h,3} X_{d−7,h} + β_{h,4} X^{min}_{d−1} + β_{h,5} C_{d,h} + β_{h,6} D_Sat + β_{h,7} D_Sun + β_{h,8} D_Mon + ε_{d,h},    (5)

where X_{d−1,h}, X_{d−2,h} and X_{d−7,h} refer to the transformed and deseasonalized price series. They account for the autoregressive effects of the previous days (the same hour yesterday, two days ago and one week ago), while X^{min}_{d−1} is the minimum of the previous day's 24 hourly preprocessed prices. The exogenous variable C_{d,h} refers to the prognosis of consumption for day d and hour h (actually, to a transformed and deseasonalized series). The three dummy variables D_Sat, D_Sun and D_Mon account for the weekly seasonality and equal one for the respective day of the week and zero otherwise. Finally, the ε_{d,h}'s are assumed to be independent and identically distributed normal variables. Note that there is no intercept in Eqn. (5); this is not a major shortcoming, since the prices are deseasonalized (i.e., centered) beforehand.

The third benchmark is the expert_DoW,nl model of Ziel and Weron [4], only expanded to include one exogenous variable (consumption prognosis). The model, denoted by ARX2, is given by the following formula:

X_{d,h} = β_{h,1} X_{d−1,h} + β_{h,2} X_{d−2,h} + β_{h,3} X_{d−7,h} + β_{h,4} X^{min}_{d−1} + β_{h,5} X^{max}_{d−1} + β_{h,6} X_{d−1,24} + β_{h,7} C_{d,h} + Σ_{i=1}^{7} β_{h,7+i} D_i + ε_{d,h}.    (6)

The price for the last load period of the previous day, i.e., X_{d−1,24}, is included to take advantage of the fact that prices for early morning hours depend more on the previous day's price at midnight than on the price for the same hour; see [13,31] for a discussion. Compared to ARX1, also the maximum of the previous day's prices, i.e., X^{max}_{d−1}, and dummies for the remaining days of the week (D_4 ≡ D_Tue, D_5 ≡ D_Wed, D_6 ≡ D_Thu, D_7 ≡ D_Fri) are included. As there are seven dummies in Eqn. (6), one of them plays the role of the intercept.

The last benchmark, denoted by ARX3, is based on the results of Uniejewski et al. [14]. The authors used shrinkage methods (in particular, LASSO and elastic nets) to find the most commonly selected regressors of a multi-variable ARX model. We have analyzed these results and used them to construct a benchmark model that combines the knowledge of experts with the outcome of automated variable selection. Compared to ARX2, this benchmark includes more autoregressive terms and two additional exogenous variables: the consumption prognosis for the previous week, i.e., C_{d−7,h}, and the wind power generation prognosis for the forecasted day and hour, i.e., W_{d,h}. Again, the latter two refer to transformed and deseasonalized values.

Baseline autoregressive models
An indisputable advantage of using automatic variable selection is an almost unlimited number of initially considered explanatory variables. As a result, the knowledge of experts becomes less important. However, the human factor is not completely eliminated: the forecaster still has to choose an appropriate baseline model that includes all variables of (potential) interest. To see how much this choice impacts the forecasting accuracy, we consider two different baseline models. The first one, denoted by bARX1, is essentially the fARX model of Uniejewski et al. [14], with the consumption prognosis C_{d,h} as the first exogenous variable and the wind power generation prognosis W_{d,h} as the second; both after decomposition and transformation. Note that, in contrast to the benchmark autoregressive models discussed in Section 3.1, here we treat holidays as the eighth day of the week (i.e., D_1 = ... = D_7 = 0 for national holidays in Norway; they were identified using the Time and Date AS web page: www.timeanddate.com/holidays). Consequently, like the ARX1 model, but unlike ARX2 and ARX3, the bARX1 regression has no intercept.
The second baseline model, denoted by bARX2, is much richer (394 vs. 104 variables). It uses information about prices from all hours of the whole previous week, the exogenous variables play a much more important role (192 vs. 4 variables), and the part accounting for the weekly seasonality is more complex (the dummies are additionally multiplied by the average price from the previous day and by the last known spot price; transformed and deseasonalized). Note that for h = 24, β_{h,i,1,8} ≡ 0 for all i. Moreover, compared to bARX1, the model includes an explicit holiday dummy variable (D_8), hence the bARX2 regression includes an intercept.

LASSO-estimated models
Let us write the structure of both baseline models in a more compact form:

X_{d,h} = Σ_i β_{h,i} V^i_{d,h} + ε_{d,h},

where the V^i_{d,h}'s are the regressors and the β_{h,i}'s are the corresponding coefficients. Shrinkage (also known as regularization) fits the full model with all predictors using an algorithm that shrinks the coefficients of the less important explanatory variables towards zero [32]. The least absolute shrinkage and selection operator (LASSO or Lasso) of Tibshirani [33] may actually shrink some of the coefficients to zero and hence perform variable selection.
The LASSO can be treated as a generalization of linear regression in which, instead of minimizing only the Residual Sum of Squares (RSS), the sum of the RSS and a linear penalty function of the β's is minimized:

β̂_h = argmin_{β_h} { Σ_d ( X_{d,h} − Σ_i β_{h,i} V^i_{d,h} )^2 + λ Σ_i |β_{h,i}| },

where λ ≥ 0 is a tuning (or regularization) parameter. Note that for λ = 0 we get the standard least squares estimator, for λ → ∞ all β_{h,i}'s tend to zero, while for intermediate values of λ we balance between minimizing the RSS and shrinking the coefficients towards zero (and each other). Selecting a 'good' value for λ is critical. However, to our best knowledge, the optimal choice of λ has not been discussed in the EPF context so far. Here, we compare alternative λ-selection schemes for a grid of 25 exponentially decreasing λ's ranging from 10^0 to 10^{−6}, and select the tuning parameter for which the mean absolute (prediction) error in the D-day validation period is the smallest.
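The estimation step can be sketched with a plain coordinate-descent LASSO and the grid search just described. This is a minimal illustration of the RSS + λΣ|β| objective and the validation-MAE rule, not the authors' implementation; all function names are ours:

```python
import numpy as np

def soft_threshold(rho, thr):
    return np.sign(rho) * np.maximum(np.abs(rho) - thr, 0.0)

def lasso_cd(V, x, lam, n_iter=200):
    """Coordinate-descent LASSO for RSS + lam * sum|beta| (a sketch).

    For this objective the coordinate update soft-thresholds at lam/2."""
    n, p = V.shape
    beta = np.zeros(p)
    z = (V ** 2).sum(axis=0)                  # per-column curvature
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with the j-th contribution added back.
            resid = x - V @ beta + V[:, j] * beta[j]
            rho = V[:, j] @ resid
            beta[j] = soft_threshold(rho, lam / 2.0) / z[j]
    return beta

def select_lambda(V_train, x_train, V_val, x_val,
                  grid=np.logspace(0, -6, 25)):
    """Pick lambda from the 25-point grid (10^0 .. 10^-6) minimizing the
    validation-period MAE, as described above."""
    return min(grid, key=lambda lam: np.mean(
        np.abs(x_val - V_val @ lasso_cd(V_train, x_train, lam))))

# Toy check: the target depends only on the first of three regressors.
rng = np.random.default_rng(0)
V = rng.normal(size=(50, 3))
x = 2.0 * V[:, 0]
beta = lasso_cd(V, x, lam=1e-6)
```

With a negligible penalty the fit recovers the OLS solution, while a huge λ zeroes out every coefficient; the grid search then trades these extremes off on the validation sample.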
Models for which the LASSO operator is used to estimate the parameters are denoted by *Lasso^D_i, where i = 1, 2 refers to one of the two baseline models (bARX1 and bARX2, respectively), D is the length of the validation window and the asterisk represents the λ-selection scheme. We consider eight window lengths: D = 1 (one day), 7 (one week), 28 (one 'month'), 60 (two months), 91 (one quarter), 182 (half a year), 273 (three quarters) and 364 (one year), and six λ-selection schemes:
• 24xNLasso^D_i, where for each day and each hour in the test period one λ is selected,
• 2xNLasso^D_i, where for each day in the test period two λ's are selected, one for the on-peak and one for the off-peak hours,
• 1xNLasso^D_i, where for each day in the test period one λ is selected,
• 24Lasso^D_i, where for each hour one λ is selected for the whole test period,
• 2Lasso^D_i, where two λ's are selected (on-peak and off-peak) for the whole test period,
• 1Lasso^D_i, where one λ is selected for the whole test period.

LassOLS-type models
To answer the question of how important the shrinkage property is for reducing the prediction errors, we introduce here a modification of the *Lasso^D_i class of models in which the LASSO operator is used to select the explanatory variables but not to estimate the β_h's; instead, we use OLS to obtain parameter estimates for the LASSO-selected variables. In this way, we automatically create 24 different 'expert', LASSO-implied models for each day. Since these model structures depend on the choice of λ, like in Section 3.3 we consider six model schemes: nxNLassOLS^D_i and nLassOLS^D_i, with n = 1, 2, 24, which correspond to the nxNLasso^D_i and nLasso^D_i models, respectively.
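The two-step logic is a thin wrapper around any LASSO routine; a sketch, with our own function names and an artificial 'selector' in the usage example:

```python
import numpy as np

def lassols(V, x, lam, lasso_solver):
    """Two-step *LassOLS: the LASSO picks the variables, OLS re-estimates
    the coefficients of the selected ones. `lasso_solver` is any routine
    returning a coefficient vector (e.g. a coordinate-descent LASSO)."""
    active = np.flatnonzero(lasso_solver(V, x, lam))  # selected variables
    beta = np.zeros(V.shape[1])
    if active.size:
        beta[active] = np.linalg.lstsq(V[:, active], x, rcond=None)[0]
    return beta

# Toy usage with a stand-in 'selector' that flags only the first regressor.
rng = np.random.default_rng(1)
V = rng.normal(size=(40, 3))
x = 2.0 * V[:, 0]
beta = lassols(V, x, lam=0.1,
               lasso_solver=lambda V, x, lam: np.array([1.0, 0.0, 0.0]))
```

Because the OLS refit is unpenalized, the selected coefficients are estimated without the LASSO's shrinkage bias, which is exactly the property the comparison in Section 4 isolates.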

Evaluation in terms of MAE and the DM-test
As the main evaluation criterion we consider the Mean Absolute Error (MAE) for the full out-of-sample test period of 631 days, i.e., 29.12.2015 to 19.09.2017:

MAE = (1 / (631 · 24)) Σ_{d=1}^{631} Σ_{h=1}^{24} |E_{d,h}|,

where E_{d,h} = P_{d,h} − P̂_{d,h} is the prediction error for day d and hour h. Additionally, we use the Diebold-Mariano (DM) test [16] to draw statistically significant conclusions on the outperformance of the forecasts of one model by those of another. In the EPF literature, the DM test is usually performed separately for each of the 24 hours of the day [1]. However, Ziel and Weron [4] have recently introduced a different approach, where only one statistic for each pair of models is computed based on the 24-dimensional vector of errors for each day:

∆_{X,Y,d} = ||E_{X,d}||_1 − ||E_{Y,d}||_1,

where ||E_{X,d}||_1 = Σ_{h=1}^{24} |E_{X,d,h}| is the daily sum of the absolute errors of model X. For each model pair and each dataset we compute the p-value of two one-sided DM tests: (i) a test with the null hypothesis H_0: E(∆_{X,Y,d}) ≤ 0, i.e., the outperformance of the forecasts of Y by those of X, and (ii) the complementary test with the reverse null H^R_0: E(∆_{X,Y,d}) ≥ 0, i.e., the outperformance of the forecasts of X by those of Y. As in the standard DM test, we assume that the loss differential series is covariance stationary.
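Both criteria are straightforward to compute; the sketch below implements the MAE and the 'multivariate' DM statistic from daily absolute-error sums, using the standard normal asymptotics of the DM test (variable names are ours):

```python
import numpy as np
from statistics import NormalDist

def mae(E):
    """Mean Absolute Error over all days and hours of the test period."""
    return np.abs(E).mean()

def dm_multivariate(E_X, E_Y):
    """One 'multivariate' DM statistic per model pair, following the daily
    loss differential delta_d = ||E_X,d||_1 - ||E_Y,d||_1.

    E_X, E_Y are (n_days, 24) arrays of prediction errors. Returns the
    statistic (asymptotically N(0,1)) and the p-value of the one-sided test
    whose rejection supports 'X outperforms Y'."""
    delta = np.abs(E_X).sum(axis=1) - np.abs(E_Y).sum(axis=1)
    n = len(delta)
    stat = np.sqrt(n) * delta.mean() / delta.std(ddof=1)
    p_x_better = NormalDist().cdf(stat)   # small => X significantly better
    return stat, p_x_better

# Toy usage: model X has clearly smaller errors than model Y.
rng = np.random.default_rng(2)
E_X = rng.normal(0.0, 1.0, size=(200, 24))
E_Y = rng.normal(0.0, 2.0, size=(200, 24))
stat, p_x_better = dm_multivariate(E_X, E_Y)
```

Note that this sketch uses the simple sample standard deviation of the daily differentials; a long-run variance estimator could be substituted without changing the structure.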

Performance across model classes and VSTs
In Table 1 we report the MAEs in the 631-day test period for the 30 considered models and five VSTs (Id denotes the identity 'transformation'). The models include the four benchmarks, the two baseline models, 12 *LassOLS^91_i models and 12 *Lasso^91_i models, i.e., all LASSO-type models use a 91-day validation window for selecting λ. In Figure 3 we summarize the results of the 'multivariate' DM-test for the same 30 models and the N-PIT transformation (i.e., corresponding to the last column in Table 1), while in Figure 4 we do the same for the two best performers in Table 1 across the VSTs (i.e., corresponding to rows 26-27 in Table 1). The 'chessboards' present the p-values of the conducted pairwise comparisons in graphical form. Green and yellow squares indicate statistical significance at the 5% level, with the darkest green corresponding to p-values close to zero. Red squares indicate weak significance, with a p-value between 5% and 10%, while black squares denote no significance, i.e., a p-value of 10% or more.
Several interesting conclusions can be drawn:
• All considered models (except bARX2 with the identity transform, i.e., without applying a VST) beat the naive benchmark by a large margin, see Tab. 1 and Fig. 3. This indicates that they all are highly efficient forecasting tools. The poor performance of bARX2 may be due to identification issues of a model with nearly 400 explanatory variables calibrated to as few as 728 daily observations.
• On the other hand, the smaller baseline model (bARX1) performs reasonably well. This is in contrast to the results reported in [14]. The reason may be that for a regression model with 104 variables, the increased length of the calibration window (from one to two years) leads to a much better performance.
• LASSO-type models based on bARX2 significantly outperform all others (the lower left 'quadrant' of the chessboard in Fig. 3 is black and the upper right 'quadrant' is green). In particular, they improve accuracy by ca. 7% in comparison to the same models based on the smaller bARX1 baseline model.
• The LASSO operator treated as a shrinkage technique (i.e., a forecasting tool on its own; the *Lasso^D_i models) significantly outperforms the two-step procedure where the LASSO is only used for variable selection and OLS for estimation (the *LassOLS^D_i models).
• The ARX3 model strongly outperforms the other benchmarks, obtaining almost as good results as the LASSO-based models built on bARX1, see Tab. 1 and the relatively black row in Fig. 3 corresponding to the ARX3 model.
• The choice of the best performing transformation depends on the complexity of the model, e.g., for LASSO models based on bARX2 the asinh and N-PIT transformations lead to the most accurate forecasts (see Tab. 1), and the benefit from using a VST increases with the complexity of the model. Moreover, there are no significant differences between using the asinh and the N-PIT for the two best performing models, but both lead to significantly better predictions than the remaining VSTs (see Fig. 4).
• In almost all cases, models that reestimate (reselect) λ on a daily basis, i.e., nxNLassOLS^D_i and nxNLasso^D_i, outperform models with a fixed tuning parameter, i.e., nLassOLS^D_i and nLasso^D_i, see Tab. 1 and Fig. 3.
To better understand the impact of applying a VST, it is necessary to refer to the article of Uniejewski et al. [15], where 16 transformations were tested across 12 datasets. Importantly, only models without exogenous variables were used in the cited study (an AR counterpart of our ARX2 and its neural network equivalent), and the price series were not decomposed into the LTSC and the stochastic component. The N-PIT turned out to be the best performing VST overall, at least in terms of the MAE. However, for Nord Pool system prices covering a similar time period as in our study (30.07.2010-28.07.2016), the N-PIT was worse than the asinh, mlog and poly. Note that the same conclusion holds here for the autoregressive benchmarks (including ARX2). This suggests that the conclusions of [15] are likely to remain unchanged when using a model with exogenous variables and/or a model that is fitted to deseasonalized series. From this perspective, the N-PIT gains the most from model complexity, and for even more complex models the N-PIT would probably outperform the other VSTs.

Performance across λ-selection schemes
Up to this point, all LASSO-type models have used a 91-day validation window for selecting λ. A pertinent question is whether this is the optimal choice of D. To address it, we now consider eight validation window lengths: D = 1 (one day), 7 (one week), 28 (one 'month'), 60 (two months), 91 (one quarter), 182 (half a year), 273 (three quarters) and 364 (one year; this window is illustrated in Fig. 1). Moreover, the results discussed in Section 4.2 clearly show that (i) the bARX2 baseline model leads to significantly better performing LASSO-type models than the smaller bARX1 model, and that (ii) the *Lasso^91_i models significantly outperform the *LassOLS^91_i models. Hence, to simplify the comparison, in this Section we limit the analysis to the *Lasso^D_2 models.
In Table 2 we report the MAEs in the 631-day test period for 39 different λ-selection schemes and five VSTs. We can see that:
• As observed in Section 4.2, models that reestimate (reselect) λ on a daily basis, i.e., nxNLasso^D_i, outperform models with a fixed tuning parameter, i.e., nLasso^D_i. Yet, considering their significantly lower computational requirements, it is still reasonable to take the latter models as benchmarks.
• As the length of the λ-selection window increases, the extreme values of the tuning parameter are selected less often and finally, for D = 364 days, only a few intermediate λ's are used; for a sample illustration see Fig. 5.
• Selecting validation windows of less than 60 days is not advisable. In the case of reestimating (reselecting) λ's on a daily basis, i.e., in the nxNLasso^D_i models, it is recommended to use much longer validation windows.
• We cannot definitively reject or accept the hypothesis that choosing λ for each hour separately leads to an improvement in the accuracy of the forecasts. Overall, the best results in terms of the MAE are obtained for the 2xNLasso^D_2 and 1xNLasso^D_2 models with D ≥ 273 days, but the differences between them and other well performing models are insignificant (DM-test results are not reported here).

Conclusions
This paper reports on a comprehensive empirical study on the optimal way of implementing the LASSO in electricity price forecasting. We find that if certain conditions are met, the choice of only one tuning parameter for a given model and dataset leads to similar but inferior forecasts compared to the more complex and much more time-consuming procedure of selecting the λ's on a daily basis. On the other hand, selecting one instead of 24 values per day is recommended, due to the lower computational burden and the rather moderate gains in predictive accuracy of the latter approach.
Secondly, we confirm the findings of [4,14] that the use of the LASSO operator leads to a significant improvement in forecasting accuracy. Interestingly, the ARX3 model, which combines the knowledge of experts and the outcome of automated variable selection in [14], is the best performer out of all expert models considered in this study. This shows that the LASSO is also a very useful tool for the construction of expert models. Moreover, our study indicates that the size of the baseline model has a surprisingly large impact on the quality of the forecasts. Although the LASSO method always uses only a small part of the available regressors, providing additional information in the underlying model significantly improves the accuracy of the obtained forecasts.
Finally, we confirm and reinforce the conclusions of [15]. Namely, we show that the results remain qualitatively the same when considering exogenous variables and using the SCAR approach that decomposes the price series into the LTSC and the remaining variability. The results presented in Table 1 indicate that the gains from using an appropriate variance stabilizing transformation increase with the complexity of the model. Lastly, using the Diebold-Mariano test, we find that for more complex models it is advisable to use the asinh or the N-PIT transformation.

Figure 3. Results of the 'multivariate' DM-test for the same 30 models as in Table 1 and the N-PIT transformation. We use a heat map to indicate the range of the p-values: the closer they are to zero (→ dark green), the more significant is the difference between the forecasts of a model on the X-axis (better) and the forecasts of a model on the Y-axis (worse).

Figure 4. Results of the 'multivariate' DM-test for the best performers in Table 1 (i.e., 1xNLasso^91_2 and 2xNLasso^91_2) across the VSTs. Like in Figure 3, we use a heat map to indicate the range of the obtained p-values for the pairwise comparisons.

Figure 5. Histograms displaying the number of hours a given value of λ was selected in the whole test period (24 × 631 = 15144 hours), across different D's and for the N-PIT-transformed 2xNLasso^D_2 model. Recall that we consider 25 exponentially decreasing λ's ranging from 10^0 to 10^{−6}; they are plotted on a logarithmic scale on the X-axis.

Table 1. Mean Absolute Errors (MAE) in the 631-day test period for the 30 considered models (in rows) and five VSTs (in columns; the first 'transformation' is the identity, Id). The models include the four benchmarks, the two baseline models, 12 *LassOLS^91_i models and 12 *Lasso^91_i models; note that all LASSO-type models use a 91-day validation window for selecting λ. A colormap is used to indicate better (→ green) and worse (→ red) results.

Table 2. Mean Absolute Errors (MAE) in the 631-day test period for the 39 different λ-selection schemes (in rows) and five VSTs (in columns; the first 'transformation' is the identity, Id). A colormap is used to indicate better (→ green) and worse (→ red) results; note that this time the coloring is independent for each column, to better emphasize the differences between the D's.