Energy Markets Forecasting. From Inferential Statistics to Machine Learning: The German Case

In this work, we investigate a probabilistic method for electricity price forecasting that outperforms traditional approaches. We start by considering statistical methods for point forecasting, comparing their performance in terms of efficiency, accuracy, and reliability, and we then exploit Neural Network approaches to derive a hybrid model for probabilistic forecasting. By testing its output on German electricity price data, we show that our solution reaches high standards of both efficiency and precision.


Introduction
The electricity market has always attracted considerable interest due to the importance of its product. One can trace the beginnings of electricity price prediction back to the early 1990s. It should be remembered that electricity is a peculiar commodity: it cannot be stored economically and, therefore, necessitates a continuous balancing of production and consumption; moreover, it depends on weather conditions, the season, and the nature of human activity. These characteristics are unique, and they lead to a price dynamic that is not found in any other market. It is often possible to recognize daily, weekly, or yearly seasonality, as well as sudden, short-term, and generally unforeseen price spikes. Moreover, the cycles characterizing this commodity may present irregularities, which are difficult to model properly. Electricity Price Forecasting (EPF) is a field of energy prediction focused on forecasting spot and forward prices on the wholesale power markets. It has become a significant element for power companies in their decision-making and investment processes. Therefore, as usually happens within financially related areas, a number of EPF-related questions have to consider specific time horizon constraints. Within the standard literature, it is customary to identify three types of forecasting horizons, namely short, medium, and long-term, which are then better specified according to the particular financial problem. As a general rule, one can consider: short-term, a period ranging from a few minutes to a few days ahead; medium-term, a period ranging from a few days to a few months; and long-term, a period ranging from some months to years. In most cases, the forecast horizons are of the first two types, with particular emphasis on day-ahead predictions.
Nevertheless, it is worth underlining that each of the above-mentioned horizons has its own importance; indeed, a company is typically obliged to plan its activities at different time scales. According to the particular time scale of interest, different modeling approaches can be exploited. In particular, most of the techniques used belong to the following five groups: statistical models, computational intelligence models, reduced-form models, fundamental models, and multi-agent models. In this paper, we focus on the first two.

Point Forecast
The series for the day-ahead price is usually obtained from an auction held once a day, where all hourly prices for the next day are announced at once. Load periods can last less than one hour in the intraday markets, e.g., at the European Power Exchange (EPEX), where half-hourly and quarter-hourly products are also traded. In both cases, the day can be partitioned into a finite number of load periods $h = 1, \ldots, H$. Therefore, it is common to denote by $P_{d,h}$ the price referring to day $d$ and load period $h$. A point forecast of $P_{d,h}$, denoted by $\hat{P}_{d,h}$, is usually understood in the EPF literature as the expected value of the random variable describing the price, i.e., $\mathbb{E}(P_{d,h})$. One could extend this notion to quantile forecasts, which are a reasonable starting point for probabilistic predictions.

Probabilistic Forecast
There exist two fundamental approaches to probabilistic prediction. The most popular one is based on a point prediction and the associated error distribution. A different choice is to consider the distribution of the spot price directly, as, e.g., in Quantile Regression Averaging (QRA), see Nowotarski and Weron [10]. The focus can be on selected quantiles, prediction intervals, or the overall predictive distribution. Let us consider the average price at a future point in time, i.e., $\hat{P}_{d,h} = \mathbb{E}(P_{d,h})$, as a "point prediction". We can write

$$P_{d,h} = \hat{P}_{d,h} + \epsilon_{d,h}, \qquad \epsilon_{d,h} \sim F_{\epsilon},$$

where $F_{\epsilon}$ denotes the distribution of the errors associated with $\hat{P}_{d,h}$. Hence, the distribution of errors has the same shape as the distribution of prices; it is only shifted to the left by $\hat{P}_{d,h}$ on the horizontal axis. The corresponding quantile function $\hat{q}_{\alpha,\epsilon}$ is likewise shifted with respect to $\hat{q}_{\alpha,P}$. Equivalently, in terms of the inverse empirical cumulative distribution function (CDF), we have $\hat{F}_P^{-1}(\alpha) = \hat{P}_{d,h} + \hat{F}_{\epsilon}^{-1}(\alpha)$, therefore obtaining the basic framework for the generation of probabilistic predictions from distributions of prediction errors. Let us underline that probabilistic predictions can be useful, e.g., when one needs information to support large future investments.

Ensemble Forecast
The concept of probabilistic prediction is very general. Still, it may not suffice to solve many energy prediction problems, and it might be difficult to verify its validity. Indeed, $\hat{P}_{d,h}$ is considered alone, independently of the predictions for the adjacent hours. Rather than considering the $H$ univariate price distributions, we can focus on the $H$-dimensional distribution $F_P$ of the price vector $P_d = (P_{d,1}, \ldots, P_{d,H})$. Therefore, we need a prediction of the multivariate distribution, but many models cannot supply such a distribution prediction directly. The latter problem can be overcome by computing an ensemble forecast. We define an ensemble as a set of $M$ paths $\hat{E}_M(P_d) = (\hat{P}_d^1, \ldots, \hat{P}_d^M)$ simulated from a forecast model, typically obtained by Monte Carlo methods.
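As a minimal illustration of a Monte Carlo ensemble of day-ahead paths, the sketch below simulates $M$ paths over $H = 24$ load periods from a toy mean-reverting AR(1) price model. The model and all parameter values (`phi`, `mu`, `sigma`) are illustrative assumptions, not estimated from real market data.

```python
import numpy as np

def simulate_ensemble(last_price, phi=0.8, mu=40.0, sigma=5.0, H=24, M=200, seed=0):
    """Simulate M price paths of length H from a toy AR(1) model around mean mu."""
    rng = np.random.default_rng(seed)
    paths = np.empty((M, H))
    p = np.full(M, last_price, dtype=float)
    for h in range(H):
        # one AR(1) step per load period, shared across all M paths
        p = mu + phi * (p - mu) + sigma * rng.standard_normal(M)
        paths[:, h] = p
    return paths

ens = simulate_ensemble(55.0)
print(ens.shape)                                   # (M paths, H load periods)
print(np.quantile(ens[:, 23], [0.05, 0.5, 0.95]))  # hour-24 predictive quantiles
```

Any marginal or joint predictive quantity (quantiles, PIs, path-dependent functionals) can then be read off the simulated paths.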

Modelling Approaches
Usually, the techniques for electricity price forecasting (EPF) are divided into five groups of models: statistical, computational intelligence, reduced-form, fundamental, and multi-agent. We will focus on statistical and computational intelligence (CI) models.

Statistical Approaches
Statistical approaches predict the current price through a weighted combination of past prices and current values of exogenous variables associated with electricity, such as weather forecasts or demand, usually via a linear regression. Autoregressive terms account for the dependencies between today's prices and those of the previous days; e.g., a possible structure could be:

$$P_{d,h} = \beta_{h,1} P_{d-1,h} + \beta_{h,2} P_{d-2,h} + \beta_{h,3} P_{d-7,h} + \beta_{h,4} P_{d-1,\min} + \beta_{h,5} L_{d,h} + \beta_{h,6} D_{sat} + \beta_{h,7} D_{sun} + \beta_{h,8} D_{mon} + \epsilon_{d,h}, \quad (2)$$

where $P_{d,h}$ denotes the price for day $d$ and hour $h$, $P_{d-1,\min}$ represents the minimum of the 24 hourly prices of the previous day (an example of a non-linear component), $L_{d,h}$ is the load forecast for day $d$ and hour $h$ (known on day $d-1$), and $(D_{sat}, D_{sun}, D_{mon})$ are three dummies modelling the weekly seasonality. Some authors call such parsimonious structures expert models, because they exploit a certain amount of experts' prior knowledge. The standard approach for estimating model (2) is Ordinary Least Squares (OLS), implemented by exploiting the electricity prices of the past $D$ days in order to predict prices for the next day(s). The optimal value of $D$ is not given a priori; instead, it should be chosen so as to have an estimation sample long enough to be informative, but not so long as to give exaggerated weight to events far in the past. Overall, there are no standard choices: many studies assume one or two years of hourly prices, but some use short time windows of just 10-13 days, while others use up to four years, see, e.g., Marcjasz et al. [11].
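A rolling-window OLS estimation of a model of the type in (2) can be sketched as follows. The data here are synthetic stand-ins (arrays `prices` and `load` of shape days × 24), the toy calendar treats day 0 as a Monday, and the window length of 365 days is one of the common choices mentioned above; all of these are assumptions for illustration only.

```python
import numpy as np

def expert_features(prices, load, d, h):
    """Regressors for day d, hour h: lagged prices, yesterday's minimum, load, weekday dummies."""
    dow = d % 7  # toy calendar: day 0 taken as Monday
    return np.array([
        prices[d - 1, h],        # same hour, previous day
        prices[d - 2, h],        # same hour, two days ago
        prices[d - 7, h],        # same hour, previous week
        prices[d - 1].min(),     # minimum of yesterday's 24 hourly prices (non-linear term)
        load[d, h],              # day-ahead load forecast, known on day d-1
        float(dow == 5),         # Saturday dummy
        float(dow == 6),         # Sunday dummy
        float(dow == 0),         # Monday dummy
    ])

def fit_and_forecast(prices, load, d_target, h, window=365):
    """Fit OLS on the past `window` days and return the point forecast for (d_target, h)."""
    days = range(d_target - window, d_target)
    X = np.array([expert_features(prices, load, d, h) for d in days])
    y = np.array([prices[d, h] for d in days])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return expert_features(prices, load, d_target, h) @ beta

rng = np.random.default_rng(0)
prices = 40 + 5 * rng.standard_normal((400, 24))   # synthetic hourly prices
load = 50 + 5 * rng.standard_normal((400, 24))     # synthetic load forecasts
print(fit_and_forecast(prices, load, 399, 12))
```

In practice, one model per hour $h$ is estimated, and the window is re-fit each day as new prices arrive.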
Autoregressive models represent the largest subset of statistical models, including shrinkage techniques such as the least absolute shrinkage and selection operator (LASSO), but also Ridge regression and elastic nets.
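The appeal of shrinkage in this setting is automatic variable selection over a large candidate lag set. A minimal sketch with scikit-learn's `Lasso` (assumed available; the data and the choice `alpha=0.1` are illustrative) shows how LASSO zeroes out irrelevant regressors:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 50))   # e.g., 50 candidate lagged-price regressors
# only columns 0 and 7 truly matter in this synthetic target
y = 2.0 * X[:, 0] - 1.5 * X[:, 7] + 0.1 * rng.standard_normal(500)

lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)   # indices of regressors kept by LASSO
print(selected)
```

Ridge regression and elastic nets follow the same `fit`/`coef_` pattern, differing only in the penalty; in applications, `alpha` is usually chosen by cross-validation.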
Statistical models are attractive because one can attach physical interpretations to the regressors, which is useful for identifying significant variables. Moreover, they allow operators to better understand a model's behavior, finally obtaining models with a meaningful structure. A well-known disadvantage of statistical models is the representation of nonlinear factors, even if these can be approximated using linear terms under certain circumstances. However, non-linear dependencies can be explicitly included through nonlinear variables, like $P_{d-1,\min}$ in (2). Otherwise, spot electricity prices, as well as exogenous variables, can be transformed by nonlinear functions before fitting a statistical model.

Computational Intelligence Methods (CI)
CI methods have been developed to solve problems that cannot be treated efficiently with statistical methods, e.g., in the presence of non-linear terms, or when a distribution-free approach is required. They combine different elements to create approaches that are able to analyze complex dynamic systems. In particular, they do not need exact knowledge and can also cope with incomplete information. CI models are flexible and are used for short-term predictions, being able to deal with different types of non-linearities, but at the cost of high computational complexity. It is worth mentioning that statistics is more concerned with the analysis of finite samples, misspecification of models, and computational considerations, while probabilistic modeling is inherent to computational intelligence. Regarding Artificial Neural Network (ANN) alternatives, different options are effectively used. In general, one can focus on three main aspects to identify a specific ANN algorithm for a prediction problem: data characteristics, solution complexity, and desired prediction accuracy, see [12].
In order to summarize the conceptual difference between these two approaches, we can observe Figure 1. Computational intelligence works more like a "black box", in which we can only decide a few parameters, and each time we train a network we can get similar but slightly different results; the statistical approach, instead, models an explicit relationship between the data.

The German Electricity Market
We will use price data from the German electricity market. The EPEX is the most relevant trading platform for electricity prices in Europe, also being the most important in terms of volume. It is also relevant from a policy point of view since, for example, several regulatory calculations are based on the day-ahead EPEX price, such as the feed-in tariffs for renewable energies. It offers trading, clearing, and settlement in both the day-ahead and intraday markets. Day-ahead hourly prices in Germany are traded on EPEX and are referred to as "Phelix" (Physical Electricity Index). This index is the daily mean value of the system price for electricity traded on the spot market, calculated as the arithmetic mean of the 24 hourly market clearing prices.
The day-ahead market is the primary market for electricity trading. Here, buyers and sellers set up hourly contracts for the delivery of electricity the following day. This is done through a daily auction, usually at 12:00, where the market clearing price, which is commonly known as the spot price, is determined by matching supply and demand. Most market designs choose a uniform-price auction market (see [13]): buyers with bids that are above the clearing price pay that price, and suppliers with bids below that price receive the same price.
The demand for electricity is a function of temperature, seasonality, and consumption patterns, from which the periodic character of electricity prices is derived. Demand is very price inelastic in the short term, because consumers have few options available to them in response to price changes. Positive price peaks are often caused by high (unexpected) demand, cf. Figure 2. On the other hand, the merit-order curve plays a decisive role in the electricity price formation process. It is derived by arranging the suppliers' offers according to rising marginal costs. The point of intersection of the demand curve with the merit-order curve, also called supply curve, determines the market clearing price, i.e., the electricity spot market price. In times of low demand, base-load power plants, such as nuclear and coal-fired power plants, generally serve as price-determining technologies. These plants are inflexible due to their high start-up and shutdown costs. In contrast, in times of high demand, the prices are set by expensive peak-load power plants, such as gas and oil-fired power plants. These plants have high flexibility and high marginal costs, and they lead to the convex shape of the merit-order curve. With the lowest marginal costs, renewable energy sources sit at the lower end of the merit-order curve. Wind and solar energy have attracted the most attention in Germany, and regulatory changes are also the main driver for the growth of renewable energies, as several subsidies and political measures have recently been introduced. The large share of intermittent renewable energy sources makes forecasting and describing prices in changing energy markets a difficult task. The hours of increased renewable energy supply cause difficulties for inflexible plants that should run continuously.
This is because inflexible base-load plants have shutdown and start-up costs that force them to accept negative marginal returns in order to keep generating electricity continuously. This has a lowering effect on electricity prices. Negative prices are mainly caused by high wind production in times of low demand; therefore, negative price peaks mainly occur at night. For further details, we refer the reader to [15].

The Probabilistic Approach
The probabilistic approach is based on a prediction that takes the form of a probability distribution of future quantities or events. It allows for treating different types of problems, also extending the spectrum of the point estimation approach, e.g., long-term forecast becomes feasible.

The Problem
We can use a point forecast of the electricity spot price, i.e., the "best estimate" or expected value of the spot price, to define the probabilistic forecasting problem, see [16]. Notice that the actual price at time $t$, $P_t$, can be expressed as:

$$P_t = \hat{P}_t + \epsilon_t, \quad (3)$$

where $\hat{P}_t$ refers to the point forecast of the spot price at time $t$, made at a previous time, and $\epsilon_t$ is the corresponding error. In most EPF papers, the analysis stops here, because the authors are primarily interested in point forecasts, see [8].
The construction of prediction intervals (PIs) is a very natural extension from point to probabilistic forecasts. Several methods can be used for this purpose; a popular one considers both the point prediction and the corresponding error: at the $(1-\alpha)$ confidence level, the center of the PI is set equal to $\hat{P}_t$, and its bounds are determined by the $\frac{\alpha}{2}$-th and $(1-\frac{\alpha}{2})$-th quantiles of the CDF of $\epsilon_t$. As an example, for 90% PIs, the 5% and 95% quantiles of the error term are required.
A forecaster can further extend this approach by considering multiple PIs. As a final result, we obtain a series of quantiles at many levels.
A set of 99 quantiles ($q = 1\%, 2\%, \ldots, 99\%$) is also an appropriate discretization of the price distribution. Generally, a density prediction corresponding to (3) can be understood as a set of PIs for all $\alpha \in (0, 1)$. The calculation of a probabilistic prediction requires an estimate of $\hat{P}_t$ and of the distribution of $\epsilon_t$. Analogously, we can reformulate it by inverting the CDFs of $P_t$ and $\epsilon_t$, namely:

$$\hat{F}_{P_t}^{-1}(\alpha) = \hat{P}_t + \hat{F}_{\epsilon_t}^{-1}(\alpha).$$

The probabilistic forecast can then be defined as a probability distribution w.r.t. future quantities. Such an identification can be carried out using random variables. For our purposes, the latter implies modelling the distribution $\hat{F}_{P_t}$ itself, which is the way implied by the QRA method. It is worth mentioning that we are not considering the Probability Density Function (PDF) of $\epsilon_t$; moreover, since the 24 hourly prediction distributions have to be created in a single step, their interconnected probability properties have to be taken into account all at once. Following [16], we can either try to model the correlation between the marginal distributions, or we can simulate the 24-hour paths characterizing the day-ahead dynamics, hence using a multivariate model.

Construction of Probabilistic Forecasts
There are several ways to construct a probabilistic interval. We report four of them, mainly following [16].

Historical Simulation
We can use a historical simulation approach to compute the PIs. This is a path-dependent approach that consists of evaluating sample quantiles of the empirical distribution of the errors $\epsilon_t$.
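The idea above can be sketched in a few lines: center the interval on the point forecast and add the empirical error quantiles. The stored forecast errors here are synthetic stand-ins; in practice, they come from past one-step-ahead forecasts.

```python
import numpy as np

def historical_pi(point_forecast, past_errors, alpha=0.10):
    """(1 - alpha) PI: point forecast plus empirical alpha/2 and 1 - alpha/2 error quantiles."""
    lo = np.quantile(past_errors, alpha / 2)
    hi = np.quantile(past_errors, 1 - alpha / 2)
    return point_forecast + lo, point_forecast + hi

rng = np.random.default_rng(2)
errors = rng.normal(0.0, 3.0, size=1000)   # stand-in for stored past forecast errors
lo, hi = historical_pi(42.0, errors, alpha=0.10)
print(lo, hi)
```

No distributional assumption is made; the quality of the interval depends entirely on how representative the stored errors are.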

Distribution-Based Probabilistic Predictions
When the models used for the time series are driven by Gaussian noise, which is the case, e.g., for AR, ARMA, and ARIMA methods, we directly use the Normal distribution to model the density forecasts, computing the PIs accordingly via analytical tools. This differs from historical simulation in that the standard deviation of the error considered in modelling the density, which we indicate by $\hat{\sigma}$, is first calculated; then, the lower and upper bounds of the PIs are set according to an $N(0, \hat{\sigma}^2)$ distribution.
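A minimal sketch of this distribution-based construction, assuming Gaussian residuals (here simulated), uses the standard-normal inverse CDF from the Python standard library:

```python
import numpy as np
from statistics import NormalDist

def gaussian_pi(point_forecast, residuals, alpha=0.10):
    """(1 - alpha) PI under the assumption residuals ~ N(0, sigma_hat^2)."""
    sigma_hat = np.std(residuals, ddof=1)       # estimated error standard deviation
    z = NormalDist().inv_cdf(1 - alpha / 2)     # e.g., about 1.645 for a 90% interval
    return point_forecast - z * sigma_hat, point_forecast + z * sigma_hat

rng = np.random.default_rng(3)
res = rng.normal(0.0, 2.0, size=500)            # stand-in for model residuals
print(gaussian_pi(40.0, res))
```

If the Gaussian assumption is violated (e.g., heavy-tailed spike-driven errors), these analytical intervals will be miscalibrated, which is precisely when historical simulation or bootstrapping is preferable.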

Bootstrapped PIs
An alternative method is the bootstrap, which is often used within the NN scenario. When used to produce one step-ahead forecasts, the bootstrap approach works as follows:

1. Provide an estimate of the parameters $\hat{\theta}$ characterizing the model, obtaining a fit with the corresponding set of residuals $\hat{\epsilon}_t$.
2. Provide a set of simulated data, hence not real-world based, whose distribution is steered by the estimated parameter set $\hat{\theta}$ and the normalized residuals $\epsilon_t^*$. For a general autoregressive model of order $r$, with starting values $P_1^* = P_1, \ldots, P_r^* = P_r$, we recursively define $P_t^* = f_{\hat{\theta}}(P_{t-1}^*, \ldots, P_{t-r}^*) + \epsilon_t^*$.
3. Provide a new estimate of the model on the simulated data and compute the bootstrap one step-ahead forecast for the new time step $t = T + 1$.
4. Starting from the bottom, repeat steps 2 and 3 $N$ times to obtain the bootstrapped price predictions $\hat{P}_{T+1}^1, \ldots, \hat{P}_{T+1}^N$.
5. Calculate the sample quantiles of the quantities obtained in step 4 in order to provide the requested PIs.

This type of construction is more accurate, because it takes both historical forecast errors and parameter uncertainty into account. However, it is computationally heavier.
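The steps above can be sketched for the simplest autoregressive case, an AR(1) estimated by OLS; the data, model order, and number of bootstrap replications are all illustrative assumptions.

```python
import numpy as np

def fit_ar1(x):
    """Step 1: OLS estimate of phi and centred residuals for x_t = phi * x_{t-1} + eps_t."""
    phi = (x[:-1] @ x[1:]) / (x[:-1] @ x[:-1])
    resid = x[1:] - phi * x[:-1]
    return phi, resid - resid.mean()

def bootstrap_forecasts(x, n_boot=500, seed=0):
    rng = np.random.default_rng(seed)
    phi, resid = fit_ar1(x)
    T = len(x)
    preds = np.empty(n_boot)
    for b in range(n_boot):
        # Step 2: simulate a pseudo-series from phi_hat and resampled residuals
        sim = np.empty(T)
        sim[0] = x[0]
        draws = rng.choice(resid, size=T - 1, replace=True)
        for t in range(1, T):
            sim[t] = phi * sim[t - 1] + draws[t - 1]
        # Step 3: re-estimate on the pseudo-series, forecast T+1 with a fresh error draw
        phi_b, _ = fit_ar1(sim)
        preds[b] = phi_b * x[-1] + rng.choice(resid)
    return preds

rng = np.random.default_rng(4)
x = np.empty(300); x[0] = 0.0
for t in range(1, 300):
    x[t] = 0.7 * x[t - 1] + rng.normal()

preds = bootstrap_forecasts(x)                 # Step 4: N bootstrapped predictions
print(np.quantile(preds, [0.05, 0.95]))        # Step 5: sample quantiles give the 90% PI
```

Because each replication re-estimates the model, the resulting interval widens to reflect parameter uncertainty, not just residual noise.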

Quantile Regression Averaging
The method proposed by Nowotarski and Weron in [10] is called QRA, and it involves quantile regression on a group of point forecasts from individual forecasting models. It is not necessary to split the probabilistic forecast into a point forecast and an error term distribution, because it works directly with the electricity spot price distribution. The quantile regression problem can be expressed as

$$Q_{P_t}(q \mid X_t) = X_t \beta_q,$$

where $Q_{P_t}(q \mid X_t)$ represents the conditional $q$-th quantile of the electricity price distribution, $X_t$ are the explanatory variables, also called regressors, and $\beta_q$ is a vector of parameters for quantile $q$. The parameters can then be estimated by minimizing the loss function for the given $q$-th quantile. There is no limitation on the components of $X_t$: as long as it contains predictions from individual models, the method is considered to be QRA.

Validity
In the case of a probabilistic forecast, the most important problem is that we do not know the actual distribution of the underlying process. It is not possible to compare the predicted distribution with the true distribution of the electricity spot price using only the price values observed in the past. Probabilistic forecasts can be evaluated in several ways, and the chosen method also depends on the final goal that we aim to reach. Of course, we can rely on tests and parameters to check the validity of the model and to have criteria for choosing the optimal one. An evaluation is usually based on reliability, sharpness, and resolution.
Statistical consistency between distributional forecasts and observations is called reliability (also calibration or unbiasedness). For example, if a 90% PI covers 90% of the observed prices, then this PI is considered reliable, well-calibrated, or unbiased. Sharpness means how closely the predicted distribution covers the true distribution, i.e., the concentration of the predicted distributions. This is a property of the forecasts only, in contrast to reliability, which is a joint property of predictions and observations. Finally, resolution represents how strongly the predicted density changes over time; in other words, the ability to provide probabilistic predictions (e.g., of wind power) depending on the forecast conditions, e.g., wind direction.
To formally check whether there is "unconditional coverage" (UC), i.e., whether the observed price falls within the interval at the nominal rate, we can use the approach of Kupiec (1995), which tests whether $I_{d,h}$, the indicator series of "hits and misses", behaves as an i.i.d. Bernoulli sequence with mean $(1 - \alpha)$; i.e., violations are assumed to be independent. Because the Kupiec test is not based on the order of the PI violations, but only on their total number, Christoffersen (1998) introduced the independence test and the conditional coverage (CC) test.
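The Kupiec unconditional-coverage test reduces to a likelihood-ratio statistic on the violation count, compared with a $\chi^2(1)$ distribution (95% critical value 3.841). A sketch on simulated hit series follows; the two hit rates are illustrative.

```python
import numpy as np
from math import log

def kupiec_lr(hits, alpha):
    """LR statistic for H0: P(violation) = alpha; compare with the chi^2(1) critical value 3.841."""
    n = len(hits)
    n1 = int(np.sum(hits))                 # observed number of PI violations
    n0 = n - n1
    pi_hat = n1 / n
    if pi_hat in (0.0, 1.0):
        return float("inf")
    ll0 = n0 * log(1 - alpha) + n1 * log(alpha)          # log-likelihood under H0
    ll1 = n0 * log(1 - pi_hat) + n1 * log(pi_hat)        # log-likelihood at the MLE
    return -2 * (ll0 - ll1)

rng = np.random.default_rng(6)
good_hits = rng.random(1000) < 0.10    # violations at the nominal 10% rate
bad_hits = rng.random(1000) < 0.25     # far too many violations
print(kupiec_lr(good_hits, 0.10), kupiec_lr(bad_hits, 0.10))
```

A large statistic for the second series signals that the intervals are miscalibrated, while the first remains near its nominal coverage.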
It is generally more difficult to test the goodness-of-fit of a predictive distribution than to assess the reliability of a PI. A very common method is to use the probability integral transform (PIT), $\mathrm{PIT}_{d,h} = \hat{F}(P_{d,h})$. If the distribution forecast matches the actual distribution of the spot price process, then $\mathrm{PIT}_{d,h}$ is uniformly distributed and independent, which can be verified by a statistical test, see [17].
In contrast to reliability, which is a joint property of observations and predictions, sharpness is a property of predictions only. Sharpness is strongly related to the notion of proper scoring rules. Indeed, scoring rules evaluate reliability and sharpness simultaneously [17]. The pinball loss (PL) and the continuous ranked probability score (CRPS) are the two most popular proper scoring rules in energy forecasting: the former for quantile predictions and the latter for distribution predictions. The pinball loss is a particular case of an asymmetric piece-wise linear loss function:

$$\mathrm{PL}(\hat{q}_{\alpha}, P, \alpha) = \begin{cases} (1 - \alpha)\,(\hat{q}_{\alpha} - P), & P < \hat{q}_{\alpha}, \\ \alpha\,(P - \hat{q}_{\alpha}), & P \geq \hat{q}_{\alpha}, \end{cases}$$

so the PL depends on the quantile forecast and the actually observed price. The PL is a strictly proper score for the $\alpha$-th quantile. In order to obtain an aggregated score, the PL can be averaged over different quantiles. It is also necessary to draw statistically significant conclusions regarding the outperformance of one model's forecasts by those of another. For this purpose, we use the Diebold-Mariano (DM) test, which is an asymptotic z-test of the hypothesis that the mean of the loss differential series

$$\delta_{d,h} = S_1(\hat{P}_{d,h}, P_{d,h}) - S_2(\hat{P}_{d,h}, P_{d,h})$$

is zero, where $S_i(\cdot, \cdot)$ is the score of the forecasts of model $i = 1, 2$. In the context of probabilistic or ensemble forecasts, any strictly proper scoring rule can be used, for example the pinball loss. Given the loss differential series, we calculate the statistic

$$\mathrm{DM} = \sqrt{T}\, \frac{\hat{\mu}(\delta_{d,h})}{\hat{\sigma}(\delta_{d,h})},$$

where $\hat{\mu}(\delta_{d,h})$ and $\hat{\sigma}(\delta_{d,h})$ are the sample mean and standard deviation of $\delta_{d,h}$, and $T$ represents the length of the out-of-sample test period. The null hypothesis of equal predictive accuracy, i.e., equal expected loss, corresponds to $\mathbb{E}(\delta_{d,h}) = 0$, in which case, assuming covariance stationarity of $\delta_{d,h}$, the DM statistic is asymptotically standard normal, and one- or two-sided asymptotic tail probabilities are easily computed. Note that the DM test compares the forecasts of two models, not the models themselves.
In day-ahead power markets, the forecasts for all hours of the next day are made simultaneously with the same information set, implying that forecast errors for a given day usually exhibit high serial correlation. Therefore, it is recommended to run the DM tests separately for every load period considered, for example for each hour of the day [18]. It is also worth mentioning that probabilistic forecast performance can be improved by exploiting approaches related to regime-switching models, see, e.g., [19,20].
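The pinball loss and the DM statistic above can be sketched directly; the two competing quantile forecasters below are synthetic (one correctly calibrated, one deliberately miscalibrated), so the sign and magnitude of the statistic are expected, not tuned.

```python
import numpy as np
from statistics import NormalDist

def pinball(q_pred, y, alpha):
    """Pinball loss for an alpha-quantile forecast q_pred against realised prices y."""
    return np.where(y < q_pred, (1 - alpha) * (q_pred - y), alpha * (y - q_pred))

def dm_stat(loss1, loss2):
    """DM statistic: standardised mean of the loss differential; ~N(0,1) under H0."""
    d = loss1 - loss2
    return np.sqrt(len(d)) * d.mean() / d.std(ddof=1)

rng = np.random.default_rng(7)
y = rng.normal(40, 5, 2000)                # realised prices
alpha = 0.95
good_q = np.full_like(y, 40 + 5 * NormalDist().inv_cdf(alpha))  # true 95% quantile
bad_q = np.full_like(y, 40.0)                                   # miscalibrated (median)
stat = dm_stat(pinball(good_q, y, alpha), pinball(bad_q, y, alpha))
print(stat)   # strongly negative: the first forecaster has lower expected loss
```

In applications, this comparison would be run separately per hour of the day, as recommended above.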

Models
We now present the theory behind some possible models for forecasting the energy price. We will consider two statistical models and one from the computational intelligence group.

(S)ARIMA Models
Auto Regressive Moving Average (ARMA) models represent an important class of statistical models for analyzing and forecasting time series data. The autoregressive component uses the dependencies between an observation and a given number of lagged observations; the moving average component uses the dependency between an observation and the residual errors of a moving average model applied to lagged observations. This type of model relates the signal to its own past and does not explicitly use information from other, possibly related, time series. An ARMA(p, q) model is defined as

$$\phi(B)\, x_t = \theta(B)\, \epsilon_t,$$

where $B$ denotes the backward shift operator, i.e., $B^h x_t = x_{t-h}$, $\phi(B)$ is the polynomial of the autoregressive component, $\theta(B)$ is the polynomial of the moving average component, and $\epsilon_t$ denotes a white noise $WN(0, \sigma^2)$. Thus, $p$ and $q$ are the orders of the two polynomials. The observed time series is considered as a realization of a stochastic process whose parameters are the coefficients of the polynomials in the operator $B$, which determine the persistence properties (together with the variance of the white noise). This kind of model assumes that the time series is stationary, i.e., the mean and covariance of the time series shall be time independent. Usually, the series has to be transformed into stationary form by differencing. A model that explicitly includes differencing is a generalization of the ARMA model to non-stationary time series: the "Auto Regressive Integrated Moving Average" (ARIMA) model. The equation of an ARIMA(p, d, q) is given by

$$\phi(B)\, (1 - B)^d x_t = \theta(B)\, \epsilon_t,$$

where $(1 - B)\, x_t = x_t - x_{t-1}$ is the lag-1 differencing operator, a particular case of the general lag-$h$ differencing operator given by $(1 - B^h)\, x_t = x_t - x_{t-h}$. If $d = 0$, ARIMA(p, 0, q) ≡ ARMA(p, q); i.e., ARIMA processes reduce to ARMA processes when differenced finitely many times.
Seasonal fluctuations are indeed caused by changing climatic conditions that influence demand and supply. For this reason, we introduce a model that can handle this behavior: the Seasonal Autoregressive Integrated Moving Average (SARIMA) process. The notation for a SARIMA model with both seasonal and non-seasonal parameters is SARIMA(p, d, q) × (P, D, Q)_s. Here, (p, d, q) indicates the order of the non-seasonal part, while (P, D, Q)_s is the seasonal part. The parameter $s$ represents the number of observations in the seasonal pattern; e.g., for monthly data, we would have $s = 12$, for quarterly observations $s = 4$, etc. The SARIMA model is defined by the following formula:

$$\Phi(B^s)\, \phi(B)\, (1 - B^s)^D (1 - B)^d x_t = \Theta(B^s)\, \theta(B)\, \epsilon_t,$$

where $\Phi$ and $\Theta$ are the seasonal polynomials, of orders $P$ and $Q$, respectively. Any SARIMA model can be converted into an ordinary ARMA model in the variable $\tilde{x}_t = (1 - B^s)^D (1 - B)^d x_t$. Consequently, the estimation of the parameters of ARIMA and SARIMA models is analogous to that of ARMA processes.

Time Series Analysis in the Simple Case of an ARMA Model
In what follows, we recall the basic steps of time series analysis via the ARMA model; we refer to [21] for more details. This approach is also known as the Box-Jenkins method, after its authors, George Box and Gwilym Jenkins. The main methodological assumption here is to consider data as drivers of the model, and not vice versa, so that the time series can "speak for itself". The ARMA scheme can be summarized as follows:

1. Preliminary analysis: it is necessary to know the main features of the series, such as, e.g., its stationarity. To this aim, so-called "unit root tests", such as the Dickey-Fuller test, are often used, see [22].

2. Order selection or identification: it is important to select appropriate values for the orders p and q. One can start by analyzing the total Autocorrelation Function (ACF) and the Partial Autocorrelation Function (PACF), and then make some initial guesses based on their characteristics, see [21]. More reliable results can then be obtained using various "Information Criteria" (IC), such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). ICs are indexes trying to find a balance between goodness of fit and the number of parameters: a "penalty factor" is included to discourage fitting models with too many parameters. Hence, the preferred model is the one that minimizes the specific IC considered.

3. Estimation of the coefficients: once p and q are known, the coefficients of the polynomials can be estimated, e.g., by least squares regression or with a maximum likelihood method. In most cases, this problem is solved with numerical methods.

4. Diagnostic check: verify whether the residuals follow a random process. If the fitted model is suitable, the rescaled residuals should have properties similar to a white noise $WN(0, 1)$ sequence. One can inspect the sample autocorrelation function (ACF) of the residuals and perform tests for randomness, e.g., the Ljung-Box test, see, e.g., [21] (Chapter 1).

If the results are not satisfactory, the procedure described above can be restarted with a different order selection. Once the model has successfully passed the verification phase, it can be used for forecasting.

Seasonality and Its Decomposition
There are different ways to deal with seasonality. One option consists in identifying what kind of seasonality is present and checking whether it can be included in the model, e.g., by using a SARIMA model. Otherwise, one can remove the seasonality altogether.
In order to deseasonalize our series, we will use the MATLAB function deseasonalize written by Rafał Weron, see [13], whose mathematical basis is described in what follows. In particular, this function returns the deseasonalized data, the Short-Term Seasonal Component (STSC), and the Long-Term Seasonal Component (LTSC) obtained from the original data series. It also creates the periodograms of the original and deseasonalized data. In more detail, spectral analysis involves identifying possible cyclical patterns in the data, so as to break down a seasonal time series into a few underlying periodic (sine and cosine) functions of certain wavelengths. It is known that the variability of many physical phenomena depends on frequency; therefore, information regarding frequency dependence could lead to a better knowledge of the true mechanisms. Spectral analysis and its basic tools, such as the periodogram, can be useful in this direction. Let us recall that, for an observation vector $x_1, \ldots, x_n$, the associated periodogram (the sample analog of the spectral density) is defined as:

$$I(\omega_k) = \frac{1}{n} \left| \sum_{t=1}^{n} x_t\, e^{-i \omega_k t} \right|^2,$$

where $\omega_k = 2\pi k / n$, $k = 1, \ldots, [n/2]$, are the Fourier frequencies expressed in radians per unit time, and $[x]$ denotes the largest integer less than or equal to $x$; see, e.g., [23] for further details. The deseasonalize function implements a simple but quite efficient method for eliminating the short-term seasonal component. The idea is to arrange the data into a matrix with rows of length $T$ and then take the average or median of the data in every column: for example, rows of seven elements for a weekly period in average daily data. The resulting row vector, again of length $T$, constitutes the estimate of the seasonal component, and it is the part subtracted from the original series.
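The row-matrix median idea behind the STSC removal can be sketched as a small illustrative reimplementation (not Weron's original MATLAB code); the weekly pattern and noise level below are synthetic.

```python
import numpy as np

def remove_stsc(x, period=7):
    """Reshape into rows of length `period`, take column medians, subtract them."""
    n_rows = len(x) // period
    m = x[: n_rows * period].reshape(n_rows, period)
    stsc = np.median(m, axis=0)          # estimated seasonal shape, length = period
    return (m - stsc).ravel(), stsc

rng = np.random.default_rng(10)
weekly = np.array([0.0, 1, 2, 3, 4, -4, -6])       # synthetic weekly pattern (sums to zero)
x = np.tile(weekly, 100) + 0.1 * rng.standard_normal(700)
deseason, stsc = remove_stsc(x, period=7)
print(np.round(stsc, 1))                           # recovered weekly shape
```

The median (rather than the mean) makes the estimate robust to the occasional price spikes typical of electricity data.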

The Wavelet Decomposition for the Long-Term Seasonal Component
After removing the weekly or daily seasonality from the data, one is often faced with the annual cycle. Even if a sine function is often a natural and good first approximation of the annual cycle, there are markets where it turns out not to be the right choice, as in the case of the German market. Indeed, the latter has no recognizable annual seasonality, and one can notice that spot prices behave similarly across the year, with peaks occurring in both winter and summer, as can be seen in Figure 3. One possible option is to use a wavelet decomposition, as suggested by Weron in [13].
A wavelet family consists of pairs of "father" (ϕ) and "mother" (ψ) wavelets. The former captures the "low-frequency" smooth components, those requiring wavelets with the greatest support, while the latter captures the "higher-frequency" detail components. In other words, father wavelets are used for trend or cycle components, and mother wavelets for any deviation from the trend, see [13,24]. More specifically, a wavelet decomposition of a signal f(t) uses one father wavelet and a sequence of mother wavelets:

f(t) = S_J + D_J + D_{J−1} + · · · + D_1,     (16)

where S_J = ∑_k s_{J,k} ϕ_{J,k}(t) and D_j = ∑_k d_{j,k} ψ_{j,k}(t). The coefficients s_{J,k}, d_{J,k}, and d_{J−1,k}, . . . , d_{1,k} represent the wavelet transform coefficients that quantify the contribution of the corresponding wavelet function to the approximating sum, while ϕ_{J,k}(t) and ψ_{j,k}(t) are the approximating father and mother wavelet functions, respectively. Once the signal has been decomposed as in (16), the procedure can be inverted to obtain an approximation of the original signal. At the coarsest scale, f(t) can be estimated by S_J alone; for a higher degree of refinement, the signal can be estimated by S_{J−1} = S_J + D_J. At each step, we obtain a better estimate of the original signal by adding a mother-wavelet component D_j at a lower scale j = J − 1, J − 2, . . . . The reconstruction process can be stopped at any point, in particular once the desired accuracy is reached. The resulting signal can be seen as a de-noised (or filtered, or smoothed) time series. Figure 3 shows both the deseasonalization results and the periodogram.
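The smooth/detail split S_J, D_j can be illustrated with a hand-rolled one-dimensional Haar transform, the simplest wavelet family. This is a minimal sketch under our own naming (haar_step, haar_inverse, toy signal), not the decomposition actually used in the paper.

```python
import math

# Each Haar level halves the signal into a smooth part (father-wavelet
# coefficients) and a detail part (mother-wavelet coefficients).

def haar_step(x):
    s = [(x[2 * i] + x[2 * i + 1]) / math.sqrt(2) for i in range(len(x) // 2)]
    d = [(x[2 * i] - x[2 * i + 1]) / math.sqrt(2) for i in range(len(x) // 2)]
    return s, d

def haar_inverse(s, d):
    x = []
    for si, di in zip(s, d):
        x.append((si + di) / math.sqrt(2))
        x.append((si - di) / math.sqrt(2))
    return x

signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
s1, d1 = haar_step(signal)   # finest level: smooth + detail
s2, d2 = haar_step(s1)       # coarser level: the "trend" part S_J
# Adding the detail components back recovers the signal exactly,
# mirroring the stepwise reconstruction S_{J-1} = S_J + D_J described above.
rebuilt = haar_inverse(haar_inverse(s2, d2), d1)
```

Truncating the reconstruction (e.g., zeroing d1) yields the smoothed, de-noised series mentioned in the text.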

Expert Models
The basic structure of an expert model is an autoregressive model, but other exogenous or input variables can be, and, in fact, are, also introduced. The latter is due to the fact that electricity prices are not only connected to their own past, but they are also influenced by the current and past values of different exogenous variables, such as energy loads and weather changes.

The ARX MODEL
Within this expert model, the centred log-price on day d and hour h is determined by the autoregressive relation (17), where we assume the error terms ε_{d,h} to be independent and identically distributed normal random variables. We denote this autoregressive benchmark by ARX in order to underline that, in (17), the (zonal) load forecast is used as an exogenous variable. In contrast to naive models, which did not require parameter estimation, we now have to estimate the parameters characterizing the model.

The mARX Model
It can be advantageous to use not only different parameter sets, but also different model structures for each day of the week, see [18]. An example is the multi-day ARX model, or mARX approach, defined as follows, where I ≡ {0, Sat, Sun, Mon}, D_0 ≡ 1, and the term D_Mon P_{d−3,h} captures the autoregressive effect of Friday's prices on Monday's prices for the same hour. Notice that this structure is, to a certain extent, similar to periodic autoregressive models (i.e., PAR, PARMA). Both autoregressive models, ARX and mARX, can be estimated by least squares (LS) methods.
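The least-squares estimation of such autoregressive models can be sketched on the simplest possible case, a noise-free AR(1) with a closed-form LS solution; exogenous terms and day-of-week dummies would simply add columns to the regression. The function name fit_ar1 and the toy series are ours, not the paper's code.

```python
# Closed-form least-squares estimate of phi in p_t = phi * p_{t-1}:
# phi_hat = sum(p_t * p_{t-1}) / sum(p_{t-1}^2).

def fit_ar1(prices):
    num = sum(prices[t] * prices[t - 1] for t in range(1, len(prices)))
    den = sum(prices[t - 1] ** 2 for t in range(1, len(prices)))
    return num / den

# A noise-free series generated with phi = 0.8 is recovered exactly.
series = [100.0]
for _ in range(20):
    series.append(0.8 * series[-1])
phi = fit_ar1(series)
```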

Artificial Neural Networks (ANNs)
The basic idea behind ANNs methods is to build algorithms that gradually learn from input data. They are mainly inspired by the structure and functioning of the human brain, which justifies the name: Artificial Neural Networks, or simply NNs. A remarkable property of NNs, which is also the main point distinguishing them from statistical methods, relies on the fact that they are trained from data through a "learning process".
Indeed, a NN is based on a structure of interconnected nodes, called artificial neurons and organized in layers. As happens in a human brain via synapses, such components can transmit signals to other neurons. To be more precise, there is an input layer of source nodes that projects onto an output layer of neurons, with possible intermediate steps given by one or more hidden layers, whose function is to mediate between the external input and the network output, see Figure 4. The more hidden layers we consider, the higher the order of the statistics that become accessible, see, e.g., [25] for details.
Many different types of NNs have been considered over the years, each with its own properties and possible applications. A key distinction is made between NNs whose connections form "cycles" and those whose connections are "acyclic". NNs with cycles are called recursive, feedback, or recurrent neural networks, while NNs without cycles are called feedforward neural networks (FNNs). However, both of them have three basic components in common: a set of synapses, each characterized by a weight representing the relevance of the connection it realizes between neurons; an adder summing the input signals; and an activation function, whose role is both to limit the amplitude of the neurons' spikes and to introduce possible nonlinearities into the neurons' output. In mathematical terms, we can describe a neuron k by writing

y_k = ϕ( ∑_j w_{k,j} x_j + b_k ),

where x_j are the input signals, w_{k,j} are the synaptic weights of neuron k, b_k is its bias, ϕ is the activation function, which maps a node's inputs to its corresponding output, and y_k is the output signal of the neuron. There exist different activation functions, such as, e.g., the logistic sigmoid function σ(x) = 1/(1 + e^{−x}), or the tanh function tanh(x) = (e^x − e^{−x})/(e^x + e^{−x}). These two functions are linked by the linear transformation

tanh(x) = 2σ(2x) − 1,

implying that any function computed by a neural network with a hidden layer of tanh units can also be computed by another network with logistic sigmoid units, and vice-versa. Weights are calibrated using an optimization algorithm minimizing a given loss function. The standard choice in this context is the Stochastic Gradient Descent (SGD) optimizer, along with the Root Mean Squared Error (RMSE) as loss function. The basic idea of gradient descent consists in computing the derivative of the loss function w.r.t. each of the network weights, then adjusting the weights along the negative slope direction. The "back-propagation" technique provides an efficient method for computing such a gradient, see, e.g., [26,27].
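A single neuron and the tanh/sigmoid relation above can be sketched as follows (the helper names sigmoid and neuron, and the numerical weights, are illustrative choices of ours):

```python
import math

# One artificial neuron y_k = phi(sum_j w_j x_j + b), plus a numerical
# check of the identity tanh(x) = 2*sigmoid(2x) - 1.

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def neuron(x, w, b, phi):
    return phi(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Forward pass with tanh activation: output is bounded in (-1, 1).
y = neuron([1.0, -2.0, 0.5], [0.3, 0.1, -0.4], 0.2, math.tanh)

# The linear transformation linking the two activation functions:
same = 2.0 * sigmoid(2.0 * 0.7) - 1.0 - math.tanh(0.7)
```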
Once all of the data points have passed through the network, we say that an epoch is complete, i.e., an epoch refers to a single pass of the entire data set through the network during training. This procedure is repeated several times: this is what "training the network" means. At the end of each epoch, the loss is evaluated, and the gradient of this loss with respect to the weights can be computed. The path toward the minimized loss proceeds in several steps, whose magnitude depends on the learning rate, typically between 0.01 and 0.0001. Therefore, the new weight is the old weight minus the learning rate times the gradient.
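The update rule "new weight = old weight − learning rate × gradient" can be seen in action on a toy quadratic loss (our own illustration, not the paper's training code):

```python
# Gradient descent on L(w) = (w - 3)^2, whose gradient is 2*(w - 3).
# The iterates converge to the minimizer w* = 3.

w = 0.0
lr = 0.1                        # learning rate
for epoch in range(100):
    grad = 2.0 * (w - 3.0)      # dL/dw at the current weight
    w = w - lr * grad           # the update rule described above
```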
In summary, the workflow for the general NN design process comprises six primary steps: creating the network, configuring the network, initializing the structure, training the network, validating the network (post-training analysis), and using the network, see [25]. For our data analysis, we will use a special category of Recurrent Neural Network (RNN), called Long Short Term Memory (LSTM).

Recurrent and Long-Short Term Memory Networks
RNNs allow for cyclic connections, which are useful when time series are involved and we do not want to lose information about past relationships between data. In fact, an RNN with enough hidden units can approximate any measurable sequence-to-sequence mapping to any degree of accuracy, see, e.g., [28] for details. The cyclic connections allow the network to remember earlier inputs: the output at time t is connected not only to the input at time t, but also to recursive signals from before that time. A problem arises when the gradient becomes too small, preventing the weights from changing their values and potentially inhibiting further training of the neural network, see [29]. For this reason, plain RNNs are not suitable for time series predictions when, e.g., learning of dependencies over long distances or long-term storage of context is required. This problem is solved by a variant of the RNN: the Long Short-Term Memory (LSTM) architecture [29].
We will now briefly review how LSTM-based architectures work, see, e.g., [29,30], Figure 5 for more details. In what follows, σ and tanh from (20) are used as activation functions, with other choices being possible according to specific purposes.

• We denote the input at time t by X_t and the hidden state from the previous time step by S_{t−1}; both enter the LSTM block, whose task is to compute the new hidden state S_t.
• First, it must be decided which information from the cell state should be discarded. This is decided by the "forget gate":

f_t = σ( W_f · [S_{t−1}, X_t] + b_f ).

• Then it must be determined which new information should be saved in the cell state. This part consists of two steps: firstly, the "input gate" layer i_t determines which values are updated; secondly, a tanh layer creates a vector of new candidate values C̃_t. These two can be written as

i_t = σ( W_i · [S_{t−1}, X_t] + b_i ),     C̃_t = tanh( W_C · [S_{t−1}, X_t] + b_C ).

• We obtain the new cell state C_t by updating the old cell state C_{t−1} as

C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t.

• Finally, the chosen output will be based on the cell state, but in a filtered version. The output gate o_t determines which parts of the cell state should be emitted; the cell state then passes through a tanh layer and is multiplied by the output gate:

o_t = σ( W_o · [S_{t−1}, X_t] + b_o ),     S_t = o_t ⊙ tanh(C_t).
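The gate equations above can be sketched as one scalar LSTM step. All weights below are illustrative numbers of our own choosing, not trained values, and the function lstm_step is our naming.

```python
import math

# One scalar LSTM step following the gate equations: forget gate f,
# input gate i, candidate c_tilde, cell state c, output gate o, hidden s.

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def lstm_step(x_t, s_prev, c_prev, W):
    f = sigmoid(W['f'][0] * x_t + W['f'][1] * s_prev + W['f'][2])   # forget gate
    i = sigmoid(W['i'][0] * x_t + W['i'][1] * s_prev + W['i'][2])   # input gate
    c_tilde = math.tanh(W['c'][0] * x_t + W['c'][1] * s_prev + W['c'][2])
    c = f * c_prev + i * c_tilde                                    # new cell state
    o = sigmoid(W['o'][0] * x_t + W['o'][1] * s_prev + W['o'][2])   # output gate
    s = o * math.tanh(c)                                            # new hidden state
    return s, c

W = {'f': (0.5, 0.1, 0.0), 'i': (0.6, -0.2, 0.1),
     'c': (1.0, 0.3, 0.0), 'o': (0.4, 0.2, -0.1)}
s, c = lstm_step(1.0, 0.0, 0.0, W)   # first step: zero initial states
```

Because the gates are sigmoids in (0, 1) and the candidate is a tanh in (−1, 1), the hidden state stays bounded while the cell state carries long-range information forward.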

Data Analysis
Our analysis is based on three years of hourly energy price data for the German electricity market, i.e., from 01.01.2016 to 31.12.2018. We consider the daily average for each day, for a total of 1096 observations. We use the historical observations, our endogenous variable, as input, aiming at forecasting the related future values as output; we are thus considering a regression-type problem, since we want to create a multilevel forecast of a numerical quantity. Our time series is consistent, because the observations are collected and structured uniformly over time, e.g., we have a seasonal pattern.
First of all, it is important to get a general idea of the behavior of the data. As an example, we plot the data for the year 2016 and, additionally, their monthly average as well as two weeks in May. In Figure 6b, we can clearly see the weekly seasonality and the difference between Mon-Fri and the weekend; this illustrates the dependence of the price on external factors, like business activities. There are some outliers and, on some days, even negative average prices. For more details on how peaks are handled in the energy market, see, e.g., [31].

From the Probability Density Function to a Possible Realization of the Stochastic Process
There are some interesting questions that we would like to answer, for example: whether it is possible to estimate a stochastic process that obeys a certain PDF; whether we can create a kind of forecast from it; and whether it could be a suitable starting point for more structured forecasts. This approach could give a user better knowledge of the probable future data over a mid-term period. We start by describing the general idea from a mathematical point of view, and then we see how to apply it. We introduce the rejection method and one of its improvements; both generate random variables with a given non-uniform distribution, for more details, see, e.g., [32].

The Rejection Method
The rejection method generates a random variable with a given PDF f by using two independent U(0, 1) generators. The idea of John von Neumann is to construct a box [a, b] × [0, c] containing the graph of f. A random point (U, Y) is then drawn uniformly in the box, and U is accepted if the point lies below the graph. This simple rejection algorithm, which can be generalized to d-dimensional PDFs, reads:
1. Generate U uniformly on [a, b] and Y uniformly on [0, c].
2. If Y ≤ f(U), then accept X = U; or else reject U and return to step 1.
By the Fundamental Theorem of Simulation, the random variable X generated by this rejection algorithm has the PDF f.
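The two steps above can be sketched directly in Python. The target density f(x) = 2x on [0, 1] (so the bounding box is [0, 1] × [0, 2]) is our own toy example:

```python
import random

# Von Neumann rejection sampling: draw (U, Y) uniformly in the box
# [a, b] x [0, c] and accept U whenever the point lies below f.

def rejection_sample(f, a, b, c, rng):
    while True:
        u = rng.uniform(a, b)
        y = rng.uniform(0.0, c)
        if y <= f(u):            # point under the graph: accept
            return u

rng = random.Random(42)
samples = [rejection_sample(lambda x: 2.0 * x, 0.0, 1.0, 2.0, rng)
           for _ in range(1000)]
mean = sum(samples) / len(samples)   # should be close to E[X] = 2/3
```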

Marsaglia's Ziggurat Method
Marsaglia's Ziggurat method (Figure 7) is a highly efficient rejection method [33], implemented, e.g., in the MATLAB function randn. It is based on an underlying source of uniformly distributed random numbers and, although formulated for monotonically decreasing densities, it can also be applied to symmetric unimodal distributions, such as the normal one: an additional random sign ± is generated for the value that the Ziggurat method determines for the distribution restricted to [0, ∞). Instead of covering the graph of a given PDF f with a box, the idea now is to use a "ziggurat" of rectangles, a cap, and a tail, which all have the same area. These sections are selected so that it is easy to choose uniform points and to determine whether to accept or reject them. For the sake of simplicity, we assume that f has support in [0, ∞) and decreases monotonically. For a given number M, we ask for nodes 0 = x_0 < · · · < x_M < ∞, such that the ziggurat rectangles Z_i and the tail Z_0 all have the same area. The Ziggurat method then reads:

1. Generate a random layer index i ∈ {0, 1, . . . , M} and a uniform random number U.
2. If i ≥ 1, then set X = U x_i and accept it if X < x_{i−1} (an explicit comparison with f deciding the remaining cases); or else generate a random value X ∈ [x_M, ∞) from the tail.
It is important to realize that the resulting distribution is exact, even though the ziggurat step function is only an approximation of the probability density function.

The Function Randn and Its Possible Application
In order to obtain similar data, let us consider the summer season (01/06-31/08) of the average daily prices for the two years 2016 and 2017. A natural question arises: is it possible to create a probable realization for the summer of 2018 (or only for part of it), taking the probability density functions of the previous data into account? Before continuing, we first standardize the data, so that they are approximately standard normally distributed. In Figure 8a, we can observe the behavior of our data for this particular period: the years 2016 and 2017 are very similar, while the 2018 data are much higher on average.
Afterwards, we use the MATLAB function randn, which implements the Ziggurat algorithm, with the option X = randn(___,'like',p), returning an array of random numbers like p. X will be our possible realization for summer 2018. For example, if we generate 1000 paths and evaluate the RMSE, we obtain the histogram in Figure 9, where we can check that the values lie mainly in the range [1.35, 1.46]; as an example, we could obtain a realization with RMSE = 1.33. It is important to see what happens when we perform the inverse standardization procedure; Figures 9 and 10 show the results. We conclude that such a big difference could be due to the fact that summer 2018 was quite different from the previous two summers. Additionally, the data densities are not completely unimodal and symmetric, so the algorithm may not work properly. Therefore, one can say that this first approach gives better results if the situation is regular over time and always follows the same "trend", or if one has access to additional information and variables that can improve performance. Indeed, applying the same procedure, but only considering 2016 to provide a forecast for 2017, we obtain Figure 11. We can also note that no possible realization ever reaches an "outlier" or a particularly high spike.
Finally, our opinion is that this approach cannot be very useful from a practical point of view. In fact, even if it can be used to get a general idea of the medium and long term, it does not provide an adequate level of accuracy for the required forecasts.

Example with SARIMA Model
In this section, we present the results of a forecast obtained by implementing a SARIMA model in Matlab: we specify the model, estimate its parameters, and use the function forecast to predict the average prices for the next few days. We try to predict two weeks of the year 2018, and we use three different training periods: one year (2017), the spring/summer season from April to September 2017, and the autumn/winter period from October 2016 to March 2017. Table 1 shows the parameters used for each analyzed period. With s = 7, we refer to the fact that we implement a weekly seasonality, as our kind of data suggests. In addition, a test can be performed to check the fit of the model, e.g., the well-known "Ljung-Box Q-Test", for more details, see [34]. We test it for 20 lags with a significance level of 0.05. For all of the tested models, we do not have to discard the null hypothesis, namely that the correlations in the population from which the sample is drawn are 0, so that all of the correlations observed in the data result from the randomness of the sampling procedure; we can therefore consider them good and reasonable models for our data. Figure 12 shows the results of the forecasts and their confidence intervals at 95%. In Table 2, we see the values of the RMSE: training on a shorter and more specific time period yields a better outcome. The results may not be very accurate, but let us note that we are considering a daily average of the price; therefore, it is more important that the general trend shows similar behavior. In addition, we can observe that fluctuations occur over certain periods of time, but they are not as regular as for other types of data, which makes these data more difficult to analyze.

ARIMA Model
Because our series are quite long, it may not be sufficient to model the weekly seasonality alone; it may be necessary to include other components. As in the SARIMA example, we start by modeling 2017 so as to provide a forecast for the first two weeks of 2018. Subsequently, we show the results referring to only one period of the year (spring/summer; autumn/winter).

• We deseasonalize the series using the Matlab function deseasonalize, see Section 4.2, which considers both a short-term and a long-term seasonal component.
• We use an Augmented Dickey-Fuller test to check whether our deseasonalized series is stationary; it tests the null hypothesis that a unit root is present in the time series sample. In Matlab, we can use the function adftest, which returns a logical value with the rejection decision for a unit root in a univariate time series: for example, the result adf = 0 indicates that the test does not reject the null hypothesis of a unit root against the trend-stationary alternative. In our case, the series is not stationary, so we differentiate it once to make it stationary, which is equivalent to setting d = 1.
• Our goal now is to find a suitable ARIMA(p, d, q) model for estimating the series. To guess a plausible order of the parameters, we consider the autocorrelation and partial autocorrelation functions, as proposed in the Box-Jenkins procedure.
• We estimate different types of models, then choose the one minimizing the AIC and BIC information criteria. In particular, we end up choosing ARIMA(1,1,2).
• Subsequently, we calibrate our parameters in order to minimize, in the L2 norm, the difference between the Probability Density Function (PDF) of the data and that of the forecast computed from the estimated model. The PDF is estimated by the Matlab function ksdensity, which uses a non-parametric method called kernel density estimation. We are interested in estimating the form of the density function f from a sample of data (x_1, . . . , x_N); its kernel density estimator is

f̂_h(x) = (1/(N h)) ∑_{i=1}^{N} K( (x − x_i) / h ),

where K is the kernel, h > 0 denotes a smoothing parameter called the bandwidth, and N denotes the number of observations. By default, K is set to the normal density function.
• The optimal prediction results from the solution of the problem

min_{φ_1, . . . , φ_p, θ_1, . . . , θ_q} ‖ PDF(forecast)_{φ_1, . . . , φ_p, θ_1, . . . , θ_q} − PDF(data) ‖_2,

where φ_1, . . . , φ_p, θ_1, . . . , θ_q are the parameters of the chosen model, allowed to vary in a compact interval of R before solving the problem, and PDF(forecast) denotes the density function estimated from the obtained forecasts. In order to solve this minimization problem, we use the Matlab function fminsearch, which determines the minimum of a multi-variable function with a derivative-free method; for the point forecasts, we use the Matlab function arma_forecast, see [35]. In particular, fminsearch uses the Nelder-Mead simplex algorithm, a direct search method that does not use numerical or analytical gradients, whose outcome depends on the given initial values; we use the parameters coming from the first estimate of the ARIMA model. Let us note that such an approach differs from the one used by Ziel and Steinert, see [9], where the optimal parameters are obtained by minimizing the BIC criterion using the lasso technique. The general idea is that, for a linear model Y = βX + ε, the LASSO estimator is given by

β̂ = argmin_β { ‖Y − βX‖_2² + λ ‖β‖_1 },

where λ ≥ 0 is the penalty term.
• We plot the density functions and see the results in Figure 13; in Table 3, one can compare the error of the difference before and after the optimization.
• Finally, we implement the idea described in Section 5.1: in view of the Probability Density Function (PDF) of the point estimates, we can guess a possible realization of the future period, for example for the first three weeks of January 2018; a plausible result can be observed in Figure 14.
Now, we follow the same steps (stationarity test, identification of the model, calibration of the parameters) to see whether we can improve the RMSE by reducing the number of observations and choosing a specific time period for training and testing (see Table 4). Therefore, we first use the autumn/winter period and then the spring/summer period (see Figures 15 and 16). Let us note that the values of the RMSE are now generally not as high as in the case considered in Section 5.1; this is due to the fact that we use a more structured procedure in the preparation of our forecast.
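The kernel density estimator used in the calibration above can be sketched in plain Python. This is an illustrative analogue of MATLAB's ksdensity with a normal kernel, under our own naming (kde) and toy data:

```python
import math

# Gaussian kernel density estimate:
# f_hat(x) = (1/(N*h)) * sum_i K((x - x_i)/h), with K the standard
# normal density and h the bandwidth.

def kde(x, data, h):
    n = len(data)
    k = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return sum(k((x - xi) / h) for xi in data) / (n * h)

data = [1.0, 2.0, 2.5, 3.0, 4.0]
# Sanity check: the estimated density integrates to ~1 (Riemann sum
# on a grid wide enough to cover the effective support).
grid = [i * 0.05 - 5.0 for i in range(400)]
area = sum(kde(x, data, 0.5) * 0.05 for x in grid)
```

In the paper's procedure, the L2 distance between two such estimated densities (data vs. forecast) is what the Nelder-Mead search minimizes over the ARIMA parameters.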
We are aware that this type of approach may not be appropriate if we want to achieve a high degree of accuracy; there are already a lot of other methods for this purpose. Here, we try to imagine a plausible trend for future data within a medium-term period. Such an approach in combination with ARIMA leads to better results when the behavior of the series is quite regular, as already mentioned and as it is the case when dealing with statistical models.

Neural Networks
For the same data, we try to use an ANN to compute the forecast. In particular, we implement in MATLAB the LSTM architecture from Section 4.4. The selection of a suitable representation and preprocessing of the input data plays a central role in any machine learning task; however, neural networks tend to be relatively robust to the choice of input representation, see [26]. The only requirement for the input representation of a NN is that it is complete, i.e., that it contains all of the information needed to successfully predict the output. Although irrelevant inputs are not a major problem for NNs, a large input space could lead to too many input weights and poor generalization.
A common procedure is the standardization of the components of the input vectors. This means that, first, the mean value and the standard deviation are calculated with the classical estimators

μ̂ = (1/N) ∑_{i=1}^{N} x_i,     σ̂² = (1/N) ∑_{i=1}^{N} (x_i − μ̂)²,

and then the standardized input vector x̂ is computed as x̂ = (x − μ̂)/σ̂.
This method does not change the training set information, but it does improve the performance by moving the input values to a range that is more appropriate for the standard activation functions. Standardizing the input values can have a great impact on the performances [26].
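The standardization step can be sketched as follows (the function name standardize and the toy price vector are our own illustration):

```python
import math

# Standardize an input vector: x_hat = (x - mu) / sigma, using the
# classical estimators for the mean and the standard deviation.

def standardize(x):
    mu = sum(x) / len(x)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return [(v - mu) / sigma for v in x]

prices = [30.0, 45.0, 28.0, 52.0, 40.0]
z = standardize(prices)
m = sum(z) / len(z)                          # sample mean of z: ~0
s = math.sqrt(sum(v * v for v in z) / len(z))  # sample std of z: ~1
```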
Once we have standardized, we train the network exploiting specific Matlab functions. In particular, we use the function trainNetwork, specifying the training set, the lstmLayers, and appropriate options for the training. With the function trainingOptions, which is part of the Matlab Deep Learning Toolbox, we can set all of the parameters that we need, see Table 5. It is worth mentioning that there is no specific procedure for selecting these values: contrary to what we saw for the (S)ARIMA example, we do not have any criterion or index that we can exploit to justify a choice, so here the decision is made after several attempts. There are two main problems to pay attention to. The first one is the so-called over-fitting, which implies that it is fundamental to prescribe an appropriate number of epochs for the training phase; for example, we can stop once we see that the loss function is stable. The second one is the computational complexity of the network: a very high number of hidden layers could be complex to handle from a computational point of view.
In Table 5, we also see the duration of the training. The training time is another difference with respect to the statistical methods, which are more immediate: here the elapsed time is about 1 min., but, for larger data sets, this value can grow to hours. We know that the purpose of the SGD during training is to minimize the loss between the actual outputs and the expected outputs given by our training targets. The loss minimization is progressive: the training process begins with randomly set weights, and these weights are then updated incrementally as we get closer to the minimum. The step size depends on the learning rate. During training, once the loss has been computed for our inputs, the gradient of that loss is computed with respect to every weight in the model. We can observe this type of process in Figure 17a: the loss function, the RMSE, is minimized epoch by epoch.
Having obtained the gradient values, we multiply them by the learning rate, which is a small number, usually between 0.01 and 0.0001; whatever its actual value, the gradient will be smaller after we multiply it by the learning rate. If we set the learning rate on the higher part of this range, there is a risk of overshooting: we take a "too big" step toward the minimum of the loss function, exceed it, and miss it. We can avoid this by setting the learning rate to a value on the lower side of the range; our steps will then be very small, so it will take more time to reach the point of loss minimization. Overall, choosing between a higher and a lower learning rate leaves us with this kind of compromise. All of these starting parameters have to be tested, choosing the most suitable ones for the problem at hand; it is impossible to determine them in advance. In contrast to the example with statistical methods, where we had information criteria that could lead us to optimal parameters, in this case we can only observe the final RMSE and the course of the training process. For example, if we see high peaks or irregularities until the end of the training, it might be necessary to add some epochs; conversely, if the loss function suddenly becomes very low and remains constant, we can decide to reduce the number of epochs or decrease the learning rate to avoid over-fitting phenomena.
In the following figures, we can see the results: the training procedure, a comparison between forecasts and original data, and the comparison between densities. As done in the previous sections, we can repeat the same procedure using the spring/summer period, Figure 18, and then the autumn/winter period, Figure 19 and Table 6. We can observe that we obtain good results and that shortening the period both simplifies the calculations and yields more precision for a given season, as in the previous analyses. Figure 19. Summary of results of the LSTM network on the autumn/winter period for the two weeks forecast.

Hybrid Approach
In the literature, there are other important approaches used to forecast the price of electricity, namely hybrid methods, realized by combining other models so as to capture the different patterns characterizing the time series of interest. A common example of these hybrid models is a combination of a statistical model, such as an ARIMA one, and an LSTM-based model; for more details, we refer the interested reader to [36]. To proceed further, the simplest approach is to average between them, finding suitable weights w_i, with ∑_i w_i = 1, so that the hybrid forecast is the weighted sum of the individual forecasts. This procedure is quite common and allows one to easily decide which component needs to be emphasized. In Table 7, we specify the weights for each component, while, in Table 8, we can see the final comparison between the RMSEs, noticing that the hybrid approach can be a valid tool for obtaining a lower RMSE. In Figure 20, we summarize the results of the hybrid approach for each training period. In the following, we test our idea to capture the dynamics of the possible underlying stochastic process, and Figure 21 shows the results. We try to obtain good approximations of the investigated processes on the basis of the examined data, while assuming that their evolution can be well shaped by a certain (previously specified) type of stochastic process. The estimated densities are not precise, so our approach does not yield better results, as we can see from the figure; from this point of view, a hybrid approach does not yield any improvement.
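The weighted combination can be sketched as follows. The weights and the toy forecasts below are illustrative (not the fitted values of Table 7); they show how averaging two models with opposite biases can reduce the RMSE.

```python
import math

# Hybrid forecast as a convex combination w1*f1 + (1-w1)*f2, w1+w2 = 1.

def combine(f1, f2, w1):
    return [w1 * a + (1.0 - w1) * b for a, b in zip(f1, f2)]

def rmse(pred, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred))

actual = [50.0, 52.0, 49.0, 55.0]
arima_fc = [48.0, 53.0, 47.0, 58.0]   # one model overshoots...
lstm_fc = [52.0, 51.0, 51.0, 52.0]    # ...the other undershoots
hybrid = combine(arima_fc, lstm_fc, 0.5)
```

Here the two error patterns cancel, so the hybrid RMSE is strictly below both individual RMSEs; in practice, the weights would be tuned on a validation period.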

Conclusions
In this paper, we gave a general overview of energy price forecasting, highlighting its most relevant quantitative aspects and using two of the most frequently exploited approaches in order to analyze the corresponding behavior over time. We also included applications based on real time series. In particular, we first focused our attention on (S)ARIMA models, which are widely implemented as a typical starting point for most time series analyses. Even if the obtained results have been shown to be quite adequate, better ones can be obtained by using Neural Networks (NNs) based alternatives. Specifically, we considered LSTM architectures, which are among the most popular within the NNs-oriented community because of the highly accurate results they provide, see, e.g., [37]. These improvements come at the price of both a more complex computational structure and a higher data processing time.
In general, one could say that NNs-based results are often better than those obtained with an ARIMA model, cf. Table 8. This is because these networks allow us to obtain and store more information, even for more complex seasonal influences. It is also worth remembering that ARIMA works better in very regular situations. Both are valuable models but, in this case, with this kind of data and context, LSTM proves a suitable tool, not only for our probabilistic approach, but also for point estimation over short- and medium-term forecast horizons. With the hybrid approach, as in Section 5.5, we have shown that we can balance our results, also achieving significant improvements w.r.t. the accuracy of the two-week point forecast.
Let us finally emphasize that our results regarding the prediction of the (random) electricity price behavior (expressed in the form of a stochastic process) are based on an innovative probabilistic proposal, which gives us remarkable accuracy results. However, one must not forget that these models are developed to answer specific questions. Therefore, they have to be carefully redesigned according to the constraints of the particular problem that a company intends to solve. This fact reflects the need to include more information, such as temperature or another type of exogenous variable, in order to better describe some spikes or other phenomena. We could also consider working with hourly data in order to achieve a higher degree of prediction accuracy.
Author Contributions: All authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Data Availability Statement: Not applicable
Acknowledgments: Emma Viviani would like to underline that the material developed within the present paper has been the result of her Erasmus+ experience she spent exploiting the Erasmus+ agreement between the University of Wuppertal and the Dept. of Computer Science of the University of Verona, collaborating with Matthias Ehrhardt, under the supervision of Luca Di Persio.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: