Analysis and Modeling for Short-to Medium-Term Load Forecasting Using a Hybrid Manifold Learning Principal Component Model and Comparison with Classical Statistical Models (SARIMAX, Exponential Smoothing) and Artiﬁcial Intelligence Models (ANN, SVM): The Case of Greek Electricity Market

: In this work we propose a new hybrid model, a combination of the manifold learning Principal Components (PC) technique and the traditional multiple regression (PC-regression), for short and medium-term forecasting of daily, aggregated, day-ahead, electricity system-wide load in the Greek Electricity Market for the period 2004–2014. PC-regression is shown to effectively capture the intraday, intraweek and annual patterns of load


Introduction
The unique characteristic of electricity is that it cannot be stored.In order to achieve frequency stability energy production must be equal to the instantaneously consumed production.Considering Energies 2016, 9, 635 2 of 40 also the physical constraints that impose limitations (congestion) on the power transferred in transmission lines we can realize why competitive pricing is not easy to implement in real-time markets.
Today energy markets' products are defined in terms of delivering a predetermined amount of power over a specified period of time.These markets are usually called spot markets where the prices (spots) are determined within one hour or half an hour time periods (e.g., Australia).Spot prices emerge either from auctions which take place in the so-called market pool, where retailers and generators' representatives make offers and bids or from trading on an exchange platform either in the day-ahead or in the real-time market [1][2][3].The market clearing price is therefore determined by the most expensive unit dispatched in the abovementioned mechanisms over the respective trading period.The key factors that influence spot prices are mainly the demand or load as well as the ability to respond to this demand by the available generating units.Therefore, possible errors in load forecasting could have significant cost implications for the market participants.More specifically an underestimated predicted load could lead to unavailability the required reserve margin which in turn could lead to high costs from peak units.On the other hand, load overestimations would cause the problem of excess supply management pushing spot prices downwards.
Load prediction is a complex procedure because of the nature of the influencing factors-weather factors, seasonal factors and social-economic factors [4].Weather factors include temperature, relative humidity, wind speed, dew point, etc. Seasonal factors include climate variation during a year while social-economic factors are depicted through periodicities inside the time-series of the load as well as trends through years.
An electricity utility also can use forecasts in making important decisions related to purchasing and generating electric power, load switching and investing in infrastructure development.Also, energy suppliers, financial institutions and other "players" in the electric energy generation and distribution markets can benefit from reliable load forecasting.
Load is a variable that is affected by a large number of factors whose influences are "imprinted" in its dynamic evolution.Its historical past values, weather data, the clustering of customers according to their consumption profiles, the number of types of electric appliances in a given region as well as consumer age, economic and demographic data and their evolution in the future, are some of the crucial factors taken into account in medium and long-term load forecasting.
Also, the time of the year, the day of the week and the hour of the day are time factors that must be included in load forecasting.The consumption of electricity for example in Mondays, Fridays, or holidays etc. is different.Load is also strongly influenced by weather conditions.In a survey performed in 2001 bu Hippert et al. [5], in out of 22 research papers, 13 only considered temperature, indicating the significance of this meteorological parameter in load forecasting.In this work, we have included the average temperature in the country, as a predictor, in the seasonal Auto-Regressive Integrated Moving Average (ARIMA) model for load forecasting.
Load forecasting is also a necessary tool for Transmission System Operators (TSOs) since it is used for different purposes and on different time scales.Short-term forecasts (one hour-one week) are useful for dispatchers to schedule short-term maintenance, unit commitment, fuel allocation, and cross-border trade, but also to operation engineers for network feature analysis such as optimal power flow, etc. Medium term forecasts (one week-one month) are used by TSOs for planning and operation of the power system while long-term forecasts (one month-years) are required for capacity planning and maintenance scheduling.
In this sense, load forecasting is of crucial importance for the operation and management of power systems and thus has been a major field of research in energy markets.There exist many statistical methods which are implemented to predict the behavior of electricity loads, with varying success under different market conditions [6].In this approach, the load pattern is treated as a time series signal, where various time series techniques are applied.The most common approach is the Box-Jenkins' Auto-Regressive Integrated Moving Average (ARIMA) [7] model and its generalized form Seasonal ARIMA with eXogenous parameters (SARIMAX) [8,9].Models utilized in electricity Energies 2016, 9, 635 3 of 40 load forecasting also include reg-ARIMA, a regressive ARIMA model, Principal Component Analysis and Holt-Winters exponential smoothing [10][11][12].All of the abovementioned methods are structured in such a way in order to deal with the double seasonality (intraday and intraweek cycles) inherent in load data.
Neural Networks (NNs) and Artificial Neural Networks (ANNs) [13,14] are considered to be other more advanced forecasting methods.ANNs are useful for multivariate modeling but have not been reported to be so effective in univariate short-term prediction.However, the complexity of the latter and their questionable performance has relegated NNs and ANNs to being perhaps considered the last resort solution to any forecasting problem [15].State space and Kalman filtering technologies have thus far proved to be one of the most complex, yet reliable methods in time series forecasting [16,17].A number of different load forecasting methods have been developed over the last few decades, as described below.
Figure 1 shows the spectrum of all methods categorized according to their short-term and medium to long-term usage, based on a literature review by the authors of this work.Other authors categorize the methods in two groups: classical mathematical statistical and artificial intelligence.
Energies 2016, 9, 635 3 of 40 structured in such a way in order to deal with the double seasonality (intraday and intraweek cycles) inherent in load data.
Neural Networks (NNs) and Artificial Neural Networks (ANNs) [13,14] are considered to be other more advanced forecasting methods.ANNs are useful for multivariate modeling but have not been reported to be so effective in univariate short-term prediction.However, the complexity of the latter and their questionable performance has relegated NNs and ANNs to being perhaps considered the last resort solution to any forecasting problem [15].State space and Kalman filtering technologies have thus far proved to be one of the most complex, yet reliable methods in time series forecasting [16,17].A number of different load forecasting methods have been developed over the last few decades, as described below.
Figure 1 shows the spectrum of all methods categorized according to their short-term and medium to long-term usage, based on a literature review by the authors of this work.Other authors categorize the methods in two groups: classical mathematical statistical and artificial intelligence.In order for this work to be as self-contained as possible with respect to the references we have decided to provide a short description of the most widely used methods in Figure 1.Further survey and review papers are provided for example in Heiko et al. [18], Kyriakides and Polycarpou) [19], Feinberg and Genethliou [20], Tzafestas and Tzafesta [21] and Hippert et al. [5] and a very recent one by Martinez-Alvarez et al. [22].The group of models on the right part of Figure 1 contains new forecasting approaches that cannot be classified into any of the other two groups [22].A description of manifold learning Principal Component Analysis (PCA) technique is given in Sections 4.3 and 5.2 below.Basic definitions of statistical tests and a short description of ETS models are given in Appendices A and B respectively while in Appendix C methods for Support Vector Machine (SVM) and ANN are also described.
Trend analysis uses historical values of electricity load and projects them into the future.The advantage of this method is that it generates only one result, future load, but provides no information on why the load evolves the way it does.The end-use method estimates electricity load, directly, by using records on end use and end users, like appliances, the customer usage profile, their age, type and sizes of buildings, etc.Therefore, end-use models explain electricity load as a function of how many appliances are in the market (Gellings [23]).In order for this work to be as self-contained as possible with respect to the references we have decided to provide a short description of the most widely used methods in Figure 1.Further survey and review papers are provided for example in Heiko et al. [18], Kyriakides and Polycarpou) [19], Feinberg and Genethliou [20], Tzafestas and Tzafesta [21] and Hippert et al. [5] and a very recent one by Martinez-Alvarez et al. [22].The group of models on the right part of Figure 1 contains new forecasting approaches that cannot be classified into any of the other two groups [22].A description of manifold learning Principal Component Analysis (PCA) technique is given in Sections 4.3 and 5.2 below.Basic definitions of statistical tests and a short description of ETS models are given in Appendices A and B respectively while in Appendix C methods for Support Vector Machine (SVM) and ANN are also described.
Trend analysis uses historical values of electricity load and projects them into the future.The advantage of this method is that it generates only one result, future load, but provides no information on why the load evolves the way it does.The end-use method estimates electricity load, directly, by using records on end use and end users, like appliances, the customer usage profile, their age, type and sizes of buildings, etc.Therefore, end-use models explain electricity load as a function of how many appliances are in the market (Gellings [23]).
Combining economic theory and statistical analysis, the econometric models have become very important tools in forecasting load, by regressing various factors that influence consumption on load (the response variable).They provide detailed information on future levels of load, explain why future load increases and how electricity load is influenced by all the factors.The works of Genethliou and Feinberg [20], Fu Nguyen [24], and Li Yingying.;Niu Dongxiao [25], describe in depth both their theoretical structure and how they are applied in various markets.Feinberg et al., [26,27] developed statistical models that learn the load model parameters by using past data values.The models simplify the work in medium-term forecasting, enhance the accuracy of predictions and use a number of variables like the actual load, day of the week, hour of the day, weather data (temperature, humidity), etc.
The similar-day approach is based on searching historical data for days within one, two or three years with features that are similar to the day we want to forecast.The load of a similar day is taken as a forecast and we can also take into consideration similar characteristics of weather, day of the week and the date.
The technique that is the most widely used is regression analysis modeling and forecasting.We regress load on a number of factors like weather, day type, category of customers, etc. Holidays, stochastic effects as average loads as well as exogenous variables as weather, are incorporated in this type of models.The works of Hyde et al. [28], Haida et al. [29], Ruzic et al. [30] and Charytoniuk et al. [31], provide various applications of this kind of models to load prediction.Engle et al. [32,33] applied a number of regression models for forecasting the next day peak value of load.
Stochastic time series load forecasting methods detect and explore internal structure in load series like autocorrelation, trend or seasonal variation.If someone assumes that the load is a linear combination of previous loads, then the autoregressive (AR) model can be fitted to load data.If the current value of the time series is expressed linearly in terms of its values at previous periods and in terms of previous values of white noise, then the ARMA model results.This model as well as the multiplicative or seasonal Autoregressive Integrated Moving Average (ARIMA, for non-stationary series), are the models that we will use in this work and are described in Section 4. Chen et al. [34] have applied an adaptive ARMA model for load prediction, where the resulting available prediction errors are used to update the model.This adaptive model outperformed typical ARMA models.Now, in case that the series is non-stationary, then a transformation is needed to make time series stationary.Differentiating the series is an appropriate form of transformation.If we just need to differentiate the data once, i.e., the series is integrated of order 1, the general form of ARIMA (p,1,q) where p and q are the number or AR and MA parameters in the model (see Section 4 for more details).Juberias et al. [35] have created a real time load forecasting ARIMA model, incorporating meteorological variables as predictors or explanatory variables.
References [36][37][38][39][40] refer to application of hybrid wavelet and ANN [36], or wavelet and Kalman filter [37] models to short-term load forecasting STLF, and a Kalman filter with a moving window weather and load model, for load forecasting is presented by Al-Hamadi et al. [38].In the Greek electricity market Pappas et al. [39] applied an ARMA model to forecast electricity demand, while an ARIMA combined with a lifting scheme for STLF was used in [40].
Artificial Neural Networks (ANN) have been used widely in electricity load forecasting since 1990 (Peng et al. [41]).They are, in essence, non-linear models that have excellent non-linear curve fitting capabilities.For a survey of ANN models and their application in electricity load forecasting, see the works of Martinez-Alvarez et al. [22], Metaxiotis [42] and Czernichow et al. [43].Bakirtzis et al. [44] developed an ANN based on short-term load forecasting model for the Public Power Corporation of Greece.They used a fully connected three-layer feed forward ANN and back propagation algorithm for training.Historical hourly load data, temperature and the day of the week, were the input variables.
The model can forecast load profiles from one to seven days.Papalexopoulos et al. [45] developed and applied a multi-layered feed forward ANN for short-term system load forecasting.Season related inputs, weather related inputs and historical loads are the inputs to ANN.Advances in the field of artificial intelligence have resulted in the new technique of expert systems, a computer algorithm having the ability to "think", "explain" and update its knowledge as new information comes in (e.g., new information extracted from an expert in load forecasting).Expert systems are frequently combined with other methods to form hybrid models (Dash et al. [46,47], Kim et al. [48], Mohamad et al. [49]).
References  cover the period 2000-2015 in which the work of Maier et al. [50] is one of the first ones related to ANN application in forecasting water resources, while during the same period we note also the creation of hybrid methods combining the strengths of various techniques that still remain popular until today.A combination of ANN and fuzzy logic approaches to predict electricity prices is presented in [51,52].Similarly, Amjady [53] applied a feed-forward structure and three layers in the NN and a fuzzy logic model to the Spanish electricity price, outperforming an ARIMA model.Taylor [54] tested six models of various combinations (ARIMA, Exponential Smoothing, ANN, PCA and typical regression) for forecasting the demand in England and Wales.Typical applications of ANN on STLF are also presented in [55][56][57][58].The book of Zurada [59] also serves as a good introduction to ANN.A wrapper method for feature selection in ANN was introduced by Xiao et al. [60] and Neupane et al. [61], a method also adopted by Kang et al. [62].A Gray model ANN and a correlation based feature selection with ANN are given in [63,64].The Extreme Learning Machine (ELM), a feed-forward NN was used in the works [65,66] and a genetic algorithm, GA, and improved BP-ANN was used in [67].Cecati et al. [68] developed a "super-hybrid" model, consisting of Support Vector Regression (SVR), ELM, decay Radial Basis function RBF-NN, and second order and error correction, to forecast the load in New England (USA).A wavelet transform combined with and artificial bee colony algorithm and ELM approach was also used by Li et al. [69] for load prediction in New England.Jin et al. [70] applied Self Organizing Maps (SOM), a method used initially in discovering patterns in data to group the data in an initial stage, and then used ANN with an improved forecasting performance.One of the most recent and powerful methods is SVM, originated in Vapnik's (Vane [71]) statistical learning theory (see also Cortes and Vapnik [72], and Vapnik [73,74]).SVM perform a nonlinear (kernel functions) mapping of the time series into a high dimensional (feature) space (a process which is the opposite of the ANN process).Chen et al. [75] provides an updated list of SVM and its extensions applied to load forecasting.SVM use some linear functions to create linear decision boundaries in the new space.SVM have been applied in load forecasting in different ways as in Mohandas [76], and Li and Fang [77], who blend wavelet and SVM methods.
References [76-94] on SVM and its various variations-extensions for performance improvement cover the period 2003-2016.Traditional SVM has some shortcomings, for example SVM cannot determine the input variables effectively and reasonably and it is characterized by slow convergence speed and poor forecasting results.Suykens and Vandewalle [78] proposed the least square support vector machine (LSSVM), as an improved SVM model.Hong [79] analyzed the suitability of SVM to forecast the electric load for the Taiwanese market, as Guo et al. [80] did for the Chinese market.To capture better the spikes in prices, Zhao et al. [81] adopted a data mining framework based on both SVM and probability classifiers.In order to improve the forecasting quality and reduce the convergence time, Wang et al. [82] combined rough sets techniques (RS) on the data and then used a hybrid model formed by SVM and simulated annealing algorithms (SAA).A combination of Particle Swarm Optimization (PSO) and data mining methods in a SVM was used by Qiu [83] with improved forecasting results.In general, various optimization algorithms are extensively used in LSSVM to improve its searching performance, such as Fruit Fly Optimization (FFO) [85,88], Particle Swarm Optimization (PSO) [93], and Global Harmony Search Algorithm [75].
The most recent developments on load forecasting are reviewed by Hong and Fan [95].An interesting big data approach on load forecasting is examined by Wang [96], while Weron's recent work is mostly focused on improving load forecast accuracy combining sister forecasts [97].New methods on machine learning are covered in [98,99].A regional case study is performed by Saraereh [100] using spatial techniques.Kim and Kim examine the error measurements for intermittent forecasts [101].
The structure of this paper is as follows: Section 2 includes a short description of the Greek electricity market with emphasis on the dynamic evolution of load during the 2004-2014 period.In Section 3 we quote the load series data performing all tests which are required for further processing by our models (stationarity tests, unit-root tests, etc.).Section 4 is the suggested methodology; it encompasses the mathematical formulation of the proposed models (SARIMAX, Exponential Smoothing and PC Regression analysis).Section 5 presents the comparison between the proposed hybrid model and the classical forecasting methods as well as the more recent ANN and SVM approaches.A number of forecasting quality measures Mean Absolute Percentage Error (MAPE), Root Mean Squared Error (RMSE), etc. are evaluated and strengths and weaknesses of each model are noted.Section 6 (Conclusions) summarizes and proposes the next steps.A short description of the ANN and SVM methods is given in Appendix C.

A Short Description of the Greek Electricity Market
Greece's liberalized electricity market was established according to the European Directive 96/92/EC and consists of two separate markets/mechanisms: (1) the Wholesale Energy and Ancillary Services Market; (2) The Capacity Assurance Mechanism.
The wholesale electricity market is a day ahead mandatory pool which is subject to inter-zonal transmission constraints, unit technical constraints and reserve requirements.More specifically, based on forecasted demand, generators' offers, suppliers' bids, power stations' availabilities, unpriced or must-run production (e.g., hydro power mandatory generation, cogeneration and RES outputs), schedules for interconnection as well as a number of transmission system's and power station's technical constraints, an optimization process is followed in order to dispatch the power plant with the lower cost, both for energy and ancillary services.In this pool, market "agents" participating in the Energy component of the day-ahead (DA) market submit offers (bids) on a daily basis.The bids are in the form of a 10-step stepwise increasing (decreasing) function of pairs of prices (€/MWh) and quantities (MWh) for each of the 24 h period of the next day.A single price and quantity pair for each category of reserve energy (primary, secondary and tertiary) is also submitted by generators.Deadline for offer submission is at 12.00 pm ("gate" closure time).
The Operator of Electricity Market in Greece (LAGIE) [102] is responsible for the solution of the so-called day ahead (optimization) problem.This problem is formulated as a security constrained unit commitment problem, and its solution is considered to be the optimum state of the system at which the social welfare is maximized for all 24 h of the next day simultaneously.This is possible through matching the energy to be absorbed with the energy injected into the system, i.e., matching supply and demand (according to each unit's separate offers).In the abovementioned optimization problem besides the objective function there are also a number of constraints.These are the transmission system constraints the technical constraints of the generating units and the energy reserves requirements.The DA solution, therefore, determines the way of operation of each unit for each hour (dispatch period) of the dispatch day as well as the clearing price of the DA market's components (energy and reserves).
The ultimate result of the DA solution is the determination of the System Marginal Price (SMP, which is actually the hourly clearing price).At this price load representatives buy the absorbed energy for their customers while generators are paid for their energy injected into the system.The Real-Time Dispatch (RTD) mechanism refers to adjusting the day-ahead schedule taking into consideration data regarding availability and demand as well as security constraints.The dispatch scheduling (DS) is used in time period between day-ahead scheduling (DAS) and RTD where the producers have the right to change their declarations whenever a problem has been occurred regarding the availability of their units.Any deviations from the day-ahead schedule are managed via the imbalances settlement (IS) operation of the market.During the IS stage an Ex-Post Imbalance Price (EXPIP) is produced after the dispatch day which is based on the actual demand, unit availability and RES production.The capacity assurance mechanism is a procedure where each load representative is assigned a capacity adequacy obligation and each producer issues capacity availability tickets for its capacity.Actually this mechanism is facing any adequacies in capacity and is in place for the partial recovery of capital costs.
The most expensive unit dispatched determines the uniform pricing in the day-ahead market.In case of congestion problems and as a motive for driving new capacity investment, zonal pricing is a solution, but at the moment this approach has not been activated.Physical delivery transactions are bounded within the pool although market agents may be entering into bilateral financial contracts that are not currently in existence.The offers of the generators are capped by an upper price level of 150 €/MWh.

Data Description and Preparation
This section is devoted to the description and preprocessing of the data available.The main time series we focus on is the system load demand in the Greek electricity market.This is defined as the system's net load; more specifically the system's net load is calculated as the sum demanded by the system (total generation including all types of generation units, plus system losses, plus the net difference of imports & exports, minus the internal consumption of the generating units).In this article we use the terms system load and load interchangeably.
The raw data are provided by Independent Power Transmission Operator (IPTO or ADMIE) (http://www.admie.gr)and include the hourly and daily average load, for the 10 year period from 1 January 2004 to 31 December 2014, as well as average external temperature in Greece, for the same time period.
In this work, we aim to capture the effect of seasonal events on the evolution of the load demand [4,103].For this reason and as bibliography suggests, we also include as exogenous variables three indicator vectors that use Boolean logic notation to point states like working days, weekend and holidays.The study is conducted on the Greek electricity market, therefore we consider the Greek holidays (Christmas, Orthodox Easter, Greek national holidays).The abovementioned vectors and the temperature are imported as exogenous variables to the SARIMA model for building our final SARIMAX model (see Section 4).
Figure 2 shows the evolution of load demand in Greece for the time period 2004-2014.We also plot the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) of the load for the period 2004-2014.From this plot we observe strong, persistent 7-day dependence and that both the ACF and PACF decay slowly, indicating a strong autocorrelated or serially correlated time series.This is a strong indication of seasonality in our load series and also an implication of non-stationarity.The PACF also shows the seasonality with spikes at lags 8, 15, 22 and so on.Table 1 provides the summary or descriptive statistics of the load and temperature time series.The variation of load with average daily temperature is depicted in Figure 5.The variation of load with average daily temperature is depicted in Figure 5.The variation of load with average daily temperature is depicted in Figure 5.The variation of load with average daily temperature is depicted in Figure 5.The graph shows a parabolic shape indicating increased consumption at high and low temperatures [33].It suggests also that forecasting future load (demand), requires the knowledge of load growth profile, relative to a certain reference i.e., the current year.The polynomial function of load versus temperature shown on the graph seems a reasonable approximation for load forecasting.Due to quadratic and strong correlation between temperature and load, we have included temperature as an exogenous-explanatory variable in the modeling in order to enhance model's predictive power.More specifically, we have considered the temperature deviation from comfortable living conditions temperature in Athens, for the current and previous day T 2 Ath (t), T 2 Ath (t − 1) respectively (Tsekouras [104]): where This expression for T deviation capture the nonlinear (parabolic) relation of temperature and system load shown in Figure 5.
The same plot is expected if instead of the daily average temperature we use daily maximum or minimum temperature.Figure 6 gives the histogram of the load as well as its summary statistics.
Energies 2016, 9, 635 10 of 40 The graph shows a parabolic shape indicating increased consumption at high and low temperatures [33].It suggests also that forecasting future load (demand), requires the knowledge of load growth profile, relative to a certain reference i.e., the current year.The polynomial function of load versus temperature shown on the graph seems a reasonable approximation for load forecasting.Due to quadratic and strong correlation between temperature and load, we have included temperature as an exogenous-explanatory variable in the modeling in order to enhance model's predictive power.More specifically, we have considered the temperature deviation from comfortable living conditions temperature in Athens, for the current and previous day , 1 respectively (Tsekouras [104]): where 18 C , 25 C .This expression for capture the nonlinear (parabolic) relation of temperature and system load shown in Figure 5.
The same plot is expected if instead of the daily average temperature we use daily maximum or minimum temperature.Figure 6 gives the histogram of the load as well as its summary statistics.Before proceeding to formulate a model, we apply various statistical tests in order to examine some core properties of our data, namely stationarity, normality, serial correlation, etc.The Jarque and Bera (JB) algorithm [105] assumes the null hypothesis that the sample load series has a normal distribution.The test as we see rejected the null hypothesis at the 1% significance level, as it is shown in the analysis (p = 0.000, JB Stat significant).From the descriptive statistics of the above figure the positive skewness and kurtosis indicate that load distribution has an asymmetric distribution with a longer tail on the right side, fat right tail.Thus load distribution is clearly non-normally distributed as it is indicated by its skewness (different but still close to zero), and excess kurtosis (>3) since normal distribution has kurtosis almost equal to three (3).That means that our data need to be transformed in such way in order to have a distribution as close to normal as possible.
The deviation of load distribution from normality is also apparent in the Q-Q plot, Figure 7.We see that the empirical quantiles (blue curve) versus the quantiles of the normal distribution (red line) do not coincide, even slightly and exhibit extreme right tails: Before proceeding to formulate a model, we apply various statistical tests in order to examine some core properties of our data, namely stationarity, normality, serial correlation, etc.The Jarque and Bera (JB) algorithm [105] assumes the null hypothesis that the sample load series has a normal distribution.The test as we see rejected the null hypothesis at the 1% significance level, as it is shown in the analysis (p = 0.000, JB Stat significant).From the descriptive statistics of the above figure the positive skewness and kurtosis indicate that load distribution has an asymmetric distribution with a longer tail on the right side, fat right tail.Thus load distribution is clearly non-normally distributed as it is indicated by its skewness (different but still close to zero), and excess kurtosis (>3) since normal distribution has kurtosis almost equal to three (3).That means that our data need to be transformed in such way in order to have a distribution as close to normal as possible.
The deviation of load distribution from normality is also apparent in the Q-Q plot, Figure 7.We see that the empirical quantiles (blue curve) versus the quantiles of the normal distribution (red line) do not coincide, even slightly and exhibit extreme right tails: In this section we also identify the cyclical patterns of load for the period 2004 to 2014.This can be done by decomposing our signals which contain cyclical components in its sinusoidal functions, expressed in terms of frequency i.e., cycles per unit time, symbolized by ω.The period, as we know, is the reciprocal of ω i.e., T = 1/ω.Since our signal is expressed in daily terms, a weekly cycle is obviously 1/0.4128 = 7 days or 1/52.14 = 0.0192 years expressed annually.A typical tool for revealing the harmonics or spectral components is the periodogram.Let our set of signal measurements or the load time series is {xi, i=1,…n}, then the periodogram is given as follows: where ω 2 ⁄ are the Fourier frequencies, given in radians per unit time.In Figure 8 the periodogram of daily average system load is shown, for the same period as before.Obviously the signal exhibits the same periods and its harmonics as described above.The periodic harmonics identified in this graph correspond to 7-day period (peak), 3.5-day and 2.33-day (bottom part) which in turn correspond to frequencies ωk = 0.1428, 0.285 and 0.428 respectively (upper part of the figure).The existence of harmonics i.e., multiples of the 7-day periodic frequency reveal that the data is not purely sinusoidal.We treat load as a stochastic process and before proceeding further in our analysis, it is necessary to ensure that conditions of stability and stationarity are satisfied.For this purpose, we will utilize the necessary tests proposed by bibliography [106], Augmented-Dickey-Fuller (ADF) In this section we also identify the cyclical patterns of load for the period 2004 to 2014.This can be done by decomposing our signals which contain cyclical components in its sinusoidal functions, expressed in terms of frequency i.e., cycles per unit time, symbolized by ω.The period, as we know, is the reciprocal of ω i.e., T = 1/ω.Since our signal is expressed in daily terms, a weekly cycle is obviously 1/0.4128 = 7 days or 1/52.14 = 0.0192 years expressed annually.A typical tool for revealing the harmonics or spectral components is the periodogram.Let our set of signal measurements or the load time series is {x i , i = 1, . . .n}, then the periodogram is given as follows: where ω k = 2π(k/n) are the Fourier frequencies, given in radians per unit time.In Figure 8 the periodogram of daily average system load is shown, for the same period as before.Obviously the signal exhibits the same periods and its harmonics as described above.The periodic harmonics identified in this graph correspond to 7-day period (peak), 3.5-day and 2.33-day (bottom part) which in turn correspond to frequencies ω k = 0.1428, 0.285 and 0.428 respectively (upper part of the figure).
The existence of harmonics i.e., multiples of the 7-day periodic frequency reveal that the data is not purely sinusoidal.In this section we also identify the cyclical patterns of load for the period 2004 to 2014.This can be done by decomposing our signals which contain cyclical components in its sinusoidal functions, expressed in terms of frequency i.e., cycles per unit time, symbolized by ω.The period, as we know, is the reciprocal of ω i.e., T = 1/ω.Since our signal is expressed in daily terms, a weekly cycle is obviously 1/0.4128 = 7 days or 1/52.14 = 0.0192 years expressed annually.A typical tool for revealing the harmonics or spectral components is the periodogram.Let our set of signal measurements or the load time series is {xi, i=1,…n}, then the periodogram is given as follows: where ω 2 ⁄ are the Fourier frequencies, given in radians per unit time.In Figure 8 the periodogram of daily average system load is shown, for the same period as before.Obviously the signal exhibits the same periods and its harmonics as described above.The periodic harmonics identified in this graph correspond to 7-day period (peak), 3.5-day and 2.33-day (bottom part) which in turn correspond to frequencies ωk = 0.1428, 0.285 and 0.428 respectively (upper part of the figure).
The existence of harmonics i.e., multiples of the 7-day periodic frequency reveal that the data is not purely sinusoidal.We treat load as a stochastic process and before proceeding further in our analysis, it is necessary to ensure that conditions of stability and stationarity are satisfied.For this purpose, we will utilize the necessary tests proposed by bibliography [106], Augmented-Dickey-Fuller (ADF) We treat load as a stochastic process and before proceeding further in our analysis, it is necessary to ensure that conditions of stability and stationarity are satisfied.For this purpose, we will utilize the necessary tests proposed by bibliography [106], Augmented-Dickey-Fuller (ADF) [107] test for examination of the unit-root hypothesis and Kwiatkowski-Phillips-Schmidt-Shin (KPSS) [108] test for stationarity.Applying the ADF test we acquire the information that the series is stable; therefore null hypothesis is rejected.Additionally performing the KPSS test to the load series, we also obtain that the null hypothesis is rejected hence our time series is non-stationary.Even after applying 1st difference, although this forces the time series to become stationary, as the KPSS test applied on the 1st difference also confirms, the ACF and PACF still suggest that a seasonality pattern is present, at every 7 lags.Therefore, we will proceed and filter our data through with a 1st and 7th order differencing filter.The above tests ensure that the stability and stationarity requirements are met [8].The aforementioned preliminary tests, as shown in Table 2, indicate that our time series has to be differenced before we fit it in a regressive model.The slowly decaying positive ACF in Figure 2, after 1st differencing, becomes a slowly decaying but still an oscillatory behavior with 7 lags period, a fact suggesting the need to fit a seasonal (multiplicative) ARMA model to our data.The damping of this oscillation is faster as the ACF value is away for the value of 1.Therefore, in order to capture the 7 lags-period seasonality in load time series, we must also apply, in addition to 1st difference operator (1−B), the operator (1−B 7 ), on x t (load).The needed transformed series is given by: i.e., the nonseasonal first difference times the seasonal (of period 7 days) difference times Load, where B s x t = x t−s , the lag operator.In Figure 9 we show the ACF and PACF of the transformed series.
Energies 2016, 9, 635 12 of 40 [107] test for examination of the unit-root hypothesis and Kwiatkowski-Phillips-Schmidt-Shin (KPSS) [108] test for stationarity.Applying the ADF test we acquire the information that the series is stable; therefore null hypothesis is rejected.Additionally performing the KPSS test to the load series, we also obtain that the null hypothesis is rejected hence our time series is non-stationary.Even after applying 1st difference, although this forces the time series to become stationary, as the KPSS test applied on the 1st difference also confirms, the ACF and PACF still suggest that a seasonality pattern is present, at every 7 lags.Therefore, we will proceed and filter our data through with a 1st and 7th order differencing filter.The above tests ensure that the stability and stationarity requirements are met [8].The aforementioned preliminary tests indicate that our time series has to be differenced before we fit it in a regressive model.The slowly decaying positive ACF in Figure 2, after 1st differencing, becomes a slowly decaying but still an oscillatory behavior with 7 lags period, a fact suggesting the need to fit a seasonal (multiplicative) ARMA model to our data.The damping of this oscillation is faster as the ACF value is away for the value of 1.Therefore, in order to capture the 7 lags-period seasonality in load time series, we must also apply, in addition to 1st difference operator (1−B), the operator (1−B 7 ), on (load).The needed transformed series is given by: i.e., the nonseasonal first difference times the seasonal (of period 7 days) difference times Load, where , the lag operator.In Figure 9 we show the ACF and PACF of the transformed series.

The Seasonal ARIMA and ARIMAX Models
The seasonal (or multiplicative) ARIMA model (SARIMA) is an extension of the Autoregressive Integrated Moving Average (ARIMA) [8,9] model, when the series contains both seasonal and non-seasonal behavior, as our time series load does.This behavior of load makes the typical ARIMA model inefficient to be used.This is because it may not be able to capture the behavior along the seasonal part of the load series and therefore mislead to a wrong selection for non-seasonal component.
Let d and D are nonnegative integers, the series {x t } is a seasonal ARIMA (p, d, q) (P, D, Q) S process with seasonality or period s if the differenced series: is a casual ARMA process (Brockwell and Davis, [9]), defined by: where B is the difference or lag operator, defined in Section 3, and φ(x), Φ(x), θ(x) and Θ(x) polynomials defines as follows: In the ARIMA form (p, d, q) (P, D, Q) S the term (p, d, q) is the non-seasonal and (P, D, Q) S is the seasonal part.We note here that the process y t is casual if and only if φ(x) = 0 and Φ(x) = 0 for |x| ≤ 1 while in applications d is rarely more than 1, D = 1 and P and Q are typically less than 3 (Brockwell & Davis, 1996) [9].Also, if the fitted model is appropriate, the rescaled residuals should have properties similar to those of a white noise ε t driving the ARMA process.All the diagnostic tests that follow, on the residuals, after fitting the model, are based on the expected properties of residuals under the assumption that the fitted model is correct and that {ε t } ∼ W N(0, σ 2 ).
In order to incorporate eXogenous variables in the SARIMA model, to obtain a SARIMAX model the previous model is modified as follows: where y t as previously is the univariate response series, in our case load X t is the row t of X, which is the matrix of predictors or explanatory variables, β is a vector with the regression coefficients corresponding to predictors, c is the regression model intercept, and ε t is a white noise innovation process.The ϕ p , Φ P , θ q and Θ Q are as previously defined i.e., the lag operator polynomials, seasonal and nonseasonal: Energies 2016, 9, 635 14 of 40

A Short Description of Exponential Smoothing (ES) and Error-Trend-Seasonal (ETS) Forecasting Methods
The exponential smoothing approaches date back to the 1950s.However, procedures for model selection were not developed until relatively during our days.All exponential smoothing methods (including non-linear methods) have been shown to be optimal forecasts from innovations state space models (Hyndman et al. [109,110]).Pegels [111] was the first to make a taxonomy of the exponential smoothing methods.Hyndman et al. [109] modified the work of Gardner [112] who had previously extended the work of Pegels' [111] classification table.Finally, Taylor [113] provided the following Table 3 with all fifteen methods of exponential smoothing.
However, a subset of these methods is better known with other names.For instance, the well-known simple exponential smoothing (or SES) method corresponds to cell (N, N).Similarly, cell (A, N) is Holt's linear method, and the damped trend method is given by cell (A d , N).Also, the additive Holt-Winters' method is given by cell (A, A) while and the multiplicative Holt-Winters' method is given by cell (A, M).As it is described in Weron [114] and Taylor [115], exponential smoothing is the traditional model used in forecasting load data series.Due to its satisfactory results it has been adopted by lots of utilities.However, it is pointed out in the above paper that seasonal multiplicative smoothing, SMS, or a variant of it Holt-Winter's SMS are very useful in capturing the seasonal (periodicity) behavior.Centra [116], has applied a Holt-Winter's SMS model to forecast hourly electricity load.
The given time series Y may be decomposed into a trend (T), seasonal (S) and error (E) components.The trend reflects the long-term movement of Y, the seasonal corresponds to a pattern with known periodicity and the error is the uncertain, unpredictable constituent of the series.
In Let now the forecasted trend T h periods out, by these models, be decomposed into a level term (l) and a growth term (b) in a number of ways.Let also 0 < ϕ < 1 be the damping parameter.The five different trend types, according to different assumptions about the growth term.
where ϕ h ≡ ∑ h s=1 ϕ s .Let also s represent the included seasonal terms, we can define the general p-dimensional state vector: A nonlinear dynamic model representation of the ES equations using state space model with a common error part (Ord [117]): where h, k continuous scalar functions, f and g continuous functions with continuous derivatives, R p → R P and ε t ∼ iid W N(0, σ 2 ), independent past realizations of y and x. θ is a parameter set.Intuitively, y t reflects the way the various state variables (l t−1 , b t−1 , s t−m ) are combined to express ŷt = h(x t−1 , θ) and ε t .With additive errors we have k ≡ 1, so: with multiplicative errors, k ≡ h, so: For the ETS models we applied in this work, we consider the updating smoothing equations as being weighted average of a term depending on the current prediction error (and prior states), and one depending on the prior states.The general form of the resulting state equations is: where P t ≡ P(x t−1 , ε t ) R t and T t are functions of the forecasting error and lagged states and ) is a function of the lagged states.ϕ 1 , ϕ 2 are the damping parameters for linear trend and multiplicative trend models respectively (they are = 1, in the absence of damping).The exact form of (10-12) depends on the specific ETS specifications.Hydman et al. [118], list all the 30 possible specifications (see Tables 2.2, 2.3, pages 21-22 in this reference).

Simple Exponential Smoothing (A,N,N)
By inserting x t = l t and h(x t−1 , θ) = l t−1 , P t = l t−1 + ε t and Q t = l t−1 in Equations ( 8) and ( 9) the ETS representation of this (A,N,N) model, in the form of (10-12) is: Similarly, the ETS expressions for the models Holt's Method with multiplicative Errors (M,A,N) and Holt-Winters method with multiplicative errors and seasonal (M,A,M), and fully multiplicative method with damping (M,M d ,M), are given in the Appendix B.

Manifold Learning and Principal Component Analysis in Electricity Markets
The Need for a Low Dimensional Presentation of Electricity Load Series The work needed to analytically model the dynamic evolution of a load curve is daunting since the load series is a relatively "high-dimensional" quality.Each load value at a point in time (say an each hour of the day) on the load curve essentially contains one dimension of uncertainty, due to the concurrent influence of both fundamental and random disturbance factors.Therefore, it seems natural to reduce the dimensionality of modeling a load curve and identify the major random factors that affect its dynamics.
Principal Component Analysis (PCA) is a linear manifold learning technique, mainly suited for extracting the linear factors of a data or time series.In the case of hourly day-ahead electricity load series, the assumption that a low-dimensional structure capturing the longest amount of randomness in the load dynamics seems to be reasonable.Furthermore, while the electricity delivered in the next 24 h corresponds to different commodities, the different 24 prices all result from equilibrating the fundamental supply and demand (load).The load demand and supply conditions in all 24 h suggests for a possible linear or non-linear representation of both the 24-dimensional load and price series in a manifold or space of lower dimension.
In this work we apply the PCA method of manifold learning, in order to analyze the "hidden" structures and features of our high-dimensional load data.Other more advanced method in manifold learning, like Locally Linear Embedding (LLE) or Isometric Feature Mapping (ISOMAP), to capture intrinsic non-linear characteristics in load data, could be used, but this is left for our next work.
As it is observed, the load demand series is highly correlated, hinting that the high dimension of the original data can be reduced to a minimum realization state space.To achieve the necessary dimension reduction we seek data structures of the fundamental dimension hidden in the original observations.For this reason we apply PCA (Jolliffe, [119]) and seek data sets that maintain the variance of the initial series, allowing only an insignificant portion of the given information to be lost.The PCA provides a new data set of drastically reduced dimension, to which a regression model is applied.This strategy provides a reliable forecasting method, with variables reduced to the minimum necessary dimension, allowing for faster calculations and reduced computational power.
In a multiple regression, one of main tasks is to determine the model input variables that affect the output variables significantly.The choice of input variables is generally based on a priori knowledge of casual variables, inspections of time series plots, and statistical analysis of potential inputs and outputs.PCA is a technique widely used for reducing the number of input variables when we have huge volume of information and we want to have a better interpretation of variables (Çamdevýren et al. [120], Manera et al. [121]).The PCA approach introduces a few combinations for model input in comparison with the trial and error process.Given a set of centered input vectors h 1 , h 2 , . . ., h m (i.e., hour 1 (h 1 ), hour 2 (h 2 ), . . ., hour 24 (h 24 )) and ∑ m t=1 h t = 0. Then the covariance matrix of vector is given by: The principal components (PCs) are computed by solving the eigenvalue problem of covariance matrix C: where λ i is one of the eigenvalues of C and u i is the corresponding eigenvector.Based on the estimated u i , the components of z t (i) are then calculated as the orthogonal transforms of h t : The new components z t (i) are called principal components.By using only the first several eigenvectors sorted in descending order of the eigenvalues, the number of principal components in z t can be reduced, so PCA has the dimensional reduction characteristic.The principal components of PCA have the following properties: z t (i) are linear combinations of the original variables, uncorrelated and have sequentially maximum variances (Jolliffe [119]).The calculation variance contribution rate is: The cumulative variance contribution rate is: The number of the selected principal components is based on the cumulative variance contribution rate, which as a rule is over 85%-90%.
Table 4 gives the four (4) largest eigenvalues λ i or variance of the Covariance matrix C, the corresponding variance contribution rate (or % of variance explained) V i and the cumulative variance contribution.Figure 10 provides the patterns of the first four eigenvectors.components (8.79% and 3.65%, respectively) capture some specific aspects of system load, basically daily effects and environmental factors.
Looking at the middle part of Figure 4, we observe that the mean average system load curve exhibits a significant difference in its pattern between night and day.In Figure 10 we show the four eigenvectors associated with the first four eigenvalues 1, … ,4 .
The 24 coefficients associated with , mimic very closely the pattern of the average system load curve, confirming the explanatory power of this component.By observing the values of the first PC 1 in Figure 11 (a zoom), they are positive and almost constant during the first part of the week, while reduced during Saturdays and finally negative during Sundays.This is actually the general shape of the aggregate system load daily pattern, i.e., more "active" during working days and "less" active on week-ends and holidays.These features are captured well by PC1.The differences in load values between working days and week-ends within each week, are captured by the coefficient of the second component .This component helps define the shape of the system load curve during working days.We see that the first principal component explains the largest part of the total variation (83.82% on average), which represents the system load pattern within a typical week.The second and third components (8.79% and 3.65%, respectively) capture some specific aspects of system load, basically daily effects and environmental factors.
Looking at the middle part of Figure 4, we observe that the mean average system load curve exhibits a significant difference in its pattern between night and day.In Figure 10 we show the four eigenvectors u i associated with the first four eigenvalues λ i (i = 1, . . ., 4).
The 24 coefficients associated with u 1 , mimic very closely the pattern of the average system load curve, confirming the explanatory power of this component.By observing the values of the first PC (Z t (1)) in Figure 11 (a zoom), they are positive and almost constant during the first part of the week, while reduced during Saturdays and finally negative during Sundays.This is actually the general shape of the aggregate system load daily pattern, i.e., more "active" during working days and "less" active on week-ends and holidays.These features are captured well by PC1.The differences in load values between working days and week-ends within each week, are captured by the coefficient of the second component u 2 .This component helps define the shape of the system load curve during working days.
The third component PC3 explains on average, as we already said, only a small proportion of the variance in system load and it probably accounts for the effects of weather conditions and environmental factors.The length of the day during winter is shorter so the electricity consumption for heating and lighting is very high during the late afternoon.On the contrary, consumption is lower in late afternoon in summer, due to the fact that daylight is intense and the mean temperature is not as high as in the middle of the day to justify a massive use of air conditioning.We then perform a typical regression where the response variable is the daily average system load and predictors the first three PCs, following the same rationale as in [68]: As new information flows by (the set of new values of the system load for each hour of the new day), a new set of PCs is extracted from the updated covariance matrix C, and a new regression is performed.
Figure 12 shows the factor loading structure, the "projection" of the original matrix of data on the low dimensional manifold spanned by the three or four eigenvectors.It shows the loading of each variable h 1 , . . ., h 24 (each hour in the day) on the 1st, 2nd, 3rd and 4th PC.PC1 explains the 83.83% of the total variability (variance) of the variables, while PC2 explains only the 8.79%, etc.We observe the coefficients of the loadings in Figure 12, while a 3D picture of the first four principal components or 'scores' is given in Figure 13.The distribution of points (dots) along the three individual axes (PC's) of this 3D 'manifold' is very clear.The variances of the distribution of points on the axes correspond to the magnitude of the three largest eigenvalues of the covariance matrix.A piece of very interesting information results from comparing the profiles of structure of the first four eigenvectors in Figure 12 with the profiles of variance, mean and kurtosis of h 1 to h 24 , of Figure 4.The similarity of the profiles suggests that the four loadings provide information about the variance (corresponding to PC1), skewness (PC2), mean (PC3) and kurtosis (PC4) contained in our data set.The capacity of PCA to reproduce the "same" qualitative features of the original 24D manifold with a low dimensional (3D or 4D) manifold is clearly shown.
piece of very interesting information results from comparing the profiles of structure of the first four eigenvectors in Figure 12 with the profiles of variance, mean and kurtosis of to , of Figure 4.The similarity of the profiles suggests that the four loadings provide information about the variance (corresponding to PC1), skewness (PC2), mean (PC3) and kurtosis (PC4) contained in our data set.The capacity of PCA to reproduce the "same" qualitative features of the original 24D manifold with a low dimensional (3D or 4D) manifold is clearly shown.

Application of SARIMAX Model and Results
For the discussed load series (first difference of load), we extracted the best performing model following the Box-Jenkins methodology, from a huge set of 1521 possible models.The selected estimation sample includes observations starting from 1 January 2004 for each model, since we observed that shorter samples produce less efficient models.
Next we examine carefully the ACF and PACF of the filtered series in Figure 9.More specifically we observe the lags that are multiples of seven (s = 7) in order to determine the Q and P parameters from ACF and PACF, respectively.From the PACF we observe that lags at 7, 14, 21, 28, 35 and 42 are outside the bounds suggesting that P should be ranged between 1 and 6.Moreover, from the ACF we observe that the lag at 7 and possibly at 14 and 21 are outside the bounds suggesting Q = 1 to 3. The orders of p and q are then selected by observing the non-seasonal lags observed before lag 7, where we conclude at p = 1 to 4 (lags at 1 to 4) and q = 1 to 2. In other words a preliminary SARIMA model to start with, could have the following form: We estimated all of the SARIMAX models developed in this work using the maximum likelihood standard procedure.The Schwartz Bayesian criterion or BIC and Akaike information

Application of SARIMAX Model and Results
For the discussed load series (first difference of load), we extracted the best performing model following the Box-Jenkins methodology, from a huge set of 1521 possible models.The selected estimation sample includes observations starting from 1 January 2004 for each model, since we observed that shorter samples produce less efficient models.
Next we examine carefully the ACF and PACF of the filtered series in Figure 9.More specifically we observe the lags that are multiples of seven (s = 7) in order to determine the Q and P parameters from ACF and PACF, respectively.From the PACF we observe that lags at 7, 14, 21, 28, 35 and 42 are outside the bounds suggesting that P should be ranged between 1 and 6.Moreover, from the ACF we observe that the lag at 7 and possibly at 14 and 21 are outside the bounds suggesting Q = 1 to 3. The orders of p and q are then selected by observing the non-seasonal lags observed before lag 7, where we conclude at p = 1 to 4 (lags at 1 to 4) and q = 1 to 2. In other words a preliminary SARIMA model to start with, could have the following form: SARIMAX (1 to 4, 1, 1 to 2)(1 to 6, 1, 1 to 3) 7 We estimated all of the SARIMAX models developed in this work using the maximum likelihood standard procedure.The Schwartz Bayesian criterion or BIC and Akaike information criterion, AIC, were used to select the models with the requirements that all regression coefficients (see Table 5) were significant (at the 1% level).Table 6 provides the values of the above two criteria, which must be the minimum possible ones for a model to be a good candidate for selection.The SARIMAX method involves the selection of the exogenous variables.Many research findings indicate that temperature is a crucial factor for load forecasting and more importantly, the deviation of temperature from the heating and cooling indices, 25 • C and 18 • C, respectively (Hor et al. [122], Tsekouras et al. [104]).In our case, as observed in Figure 5, there is a profound quadratic relation between temperature and the load curve.Thus, we conclude in a set of exogenous variables comprised of the deviation of temperature from the mentioned indices for the given day, and the deviation for the past two days, to account for the 'social inertia', a force that compels masses to react with a certain delay to events.Additionally, we take into consideration a couple of other social parameters that drive the demand of the load, working days and holidays.
The estimation output of the above SARIMAX model is given in the following Table 5.The estimation is based on a period of 10 years, 2004-2013.Table 5 shows the coefficient estimates all having the expected sign, the standard error, the t-statistics and the coefficient of determination R 2 .All coefficients are statistically significant at 1% level (p-values = 0.000).
Observing the fitted model we enhance our intuition on the relation between temperature and load.For example, the deviation of temperature to the heating index is weighted 7.569.Therefore, we expect for a working day with average temperature 30 • C that the load should increase roughly by 7.569(25 − 30) 2 = 189.2MW from its average value 5943.5 MW and it should equal 6132.7 MW.As it appears, given our data set, this expectation is reliable since in the case of the 8th of April 2014, the temperature is 29.9 • C and the load equals 6133.9MW.
The evaluation of the model is provided in the following table.We observe almost excellent descriptive statistics' results, with a coefficient of determination R 2 = 96.1% indicating that our model explains 96.1% of the data variability as well as an almost 'ideal' Durbin-Watson stat DW = 1.992 (see Appendix A), indicating how close are the residuals of the fitting to the white noise distribution.
To measure the fit of each model to the given data, we use the Theil's Inequality Coefficient (TIC) (see Appendix A).A small TIC indicates a good fit (zero indicates a perfect fit).The average TIC of our models is 0.020, showing a very good fit to the load series.Figure 14 shows the fitted in-sample SARIMAX model with the load series for the period 2004-2014.TIC of our models is 0.020, showing a very good fit to the load series.Figure 14 shows the fitted in-sample SARIMAX model with the load series for the period 2004-2014.The in-sample forecasting quality measures indicate a reliable predictive model, with an average fitting MAPE 2.91% (see Appendix A for the definition of MAPE).Performing the Ljung-Box Q-test (LBQ test) [123] and ARCH test on the residuals, we find that indeed they are not auto-correlated nor serially correlated, therefore they are independent.However, the residual series does not pass the Jarque-Bera test which indicates that they are not normally distributed.These results are shown in Table 7.The in-sample forecasting quality measures indicate a reliable predictive model, with an average fitting MAPE 2.91% (see Appendix A for the definition of MAPE).Performing the Ljung-Box Q-test (LBQ test) [123] and ARCH test on the residuals, we find that indeed they are not auto-correlated nor serially correlated, therefore they are independent.However, the residual series does not pass the Jarque-Bera test which indicates that they are not normally distributed.These results are shown in Table 7.The above models were selected after optimizing the model selection with the Akaike Information and BIC Schwarz criteria.As suggested by [8,9], we seek the most parsimonious model, in other words we seek to limit the number of parameters to the absolute necessary while at the same time the error remains bounded.The following Table 8 includes some of the best models under this optimization method.The above models were selected after optimizing the model selection with the Akaike Information and BIC Schwarz criteria.As suggested by [8,9], we seek the most parsimonious model, in other words we seek to limit the number of parameters to the absolute necessary while at the same time the error remains bounded.The following Table 8 includes some of the best models under this optimization method.The above models were selected after optimizing the model selection with the Akaike Information and BIC Schwarz criteria.As suggested by [8,9], we seek the most parsimonious model, in other words we seek to limit the number of parameters to the absolute necessary while at the same time the error remains bounded.The following Table 8 includes some of the best models under this optimization method.

Principal Components Regression Forecasting
As we have already mentioned in Section 4.3, the principal components method performs dimension reduction, transforming a set of correlated data to a set of orthogonal and independent variables.In our case, we will use the hourly electricity system load series and classify our data to 24 components, each for the respective hour of the day.Performing PCA, we find that out of the 24 classes only three are enough to account for the 96% of the variance of the original data (Table 4).It is evident that with this method we manage to decompose the data set with not significant data information loss.
A regressive model is built with response variable the system load and predictors the first three PCs, which we use in order to perform forecasts.The regression estimation process is significantly accelerated by using only three (transformed) predictor variables as well as by selecting a suitable training set.We have observed that a training set of a rolling window of 1 year (365 observations) is enough to produce satisfactory results.Evaluating the out-of-sample forecasting performance, we find that the principal components regression model produces an average MAPE 2.93 %, for all individual forecasting periods examined, using the three models.For example, Figures 17 and 18 show the forecasting results for May and December 2014, respectively.

Principal Components Regression Forecasting
As we have already mentioned in Section 3.4, the principal components method performs dimension reduction, transforming a set of correlated data to a set of orthogonal and independent variables.In our case, we will use the hourly electricity system load series and classify our data to 24 components, each for the respective hour of the day.Performing PCA, we find that out of the 24 classes only three are enough to account for the 96% of the variance of the original data (Table 4).It is evident that with this method we manage to decompose the data set with not significant data information loss.
A regressive model is built with response variable the system load and predictors the first three PCs, which we use in order to perform forecasts.The regression estimation process is significantly accelerated by using only three (transformed) predictor variables as well as by selecting a suitable training set.We have observed that a training set of a rolling window of 1 year (365 observations) is enough to produce satisfactory results.Evaluating the out-of-sample forecasting performance, we find that the principal components regression model produces an average MAPE 2.93 %, for all individual forecasting periods examined, using the three models.For example, Figures 17 and 18 show the forecasting results for May and December 2014, respectively.The forecasting results of PC regression are similar to these in works of Taylor [11,124], although the data used in there are hourly observations.It was found that the MARE of SARIMAX for 24 h forecasting horizon was slightly better than PC regression's forecasting.It is also worth mentioning that the number of PCs that explains the largest part of the total variability of hourly data in a number of European Countries is 5 for Italy, 6 for Norway, 7 for Spain and 6 for Sweden [11], compared to 4 for the Greek market.This information enhances the view that PCA is a powerful tool in revealing the degree of complexity of a high dimensional system, as an energy market is.

Holt-Winters' Triple Exponential Smoothing Forecasting
Triple exponential smoothing is an adaptive and robust method mainly utilized mainly for short-term forecasting.It allows adjustment of the models' parameters to incorporate seasonality and trending features.The Holt-Winters' method major advantage is that only a few observations are sufficient to produce a good training set.In our case, we have used a training set of equal length to that of the forecast, 31 observations.Using the log-likelihood optimization approach, we produced Multiplicative Error, No Trend, Multiplicative Seasonality (M,N,M) models for each forecast period.The average MAPE of the Holt-Winters' method is 5.12%, indicating a reliable forecast given the significantly small estimation set.For each estimation period the model is realized as follows (please refer to Table 9):

Model Comparison
Each of the three forecasting models discussed represents a statistical prediction approach for medium-term forecasting of data that exhibit profound seasonality.Each model has advantages and disadvantages, compelling us to accept a trade-off, whether it is reduced accuracy for a small estimation set or vice versa.After multiple applications of each model to different lengths of our original data set, we constructed the following table (Table 12) containing the average errors for each method.Figures 21 and 22 show a comparison of our models with the actual load, for both May 2014 and December 2014 respectively.

Model Comparison
Each of the three forecasting models discussed represents a statistical prediction approach for medium-term forecasting of data that exhibit profound seasonality.Each model has advantages and disadvantages, compelling us to accept a trade-off, whether it is reduced accuracy for a small estimation set or vice versa.After multiple applications of each model to different lengths of our original data set, we constructed the following table (Table 10) containing the average errors for each method.Figures 21 and 22 show a comparison of our models with the actual load, for both May 2014 and December 2014 respectively.It is evident that the model with the optimum combination of prediction accuracy and simplicity is the principal components regression, while SARIMAX is the second best performing and ETS is the worst performing.Surprisingly, the Holt-Winters' method provides a good direction prediction movement, despite its small estimation set.This happens because the method manages to track the movement of the seasonality in the load despite its poor accuracy in predicting the next value of the load.The SARIMAX model produces a very reliable in-sample-fit and shows larger record of the most successful forecasting of "extreme" values or spikes load, while at the same time contains the average error to satisfactory levels.Overall, we find the principal component regression to be the most reliable method, due to its simplicity, limited need of computational power and very good prediction accuracy.The SARIMAX model is also very good and robust with a forecasting ability comparable to PC regression, while the ETS have been found to perform poorly in forecasting daily data.This behavior is completely different from its impressive prediction performance when it is fitted to hourly data [11].

Comparison with Machine Learning Techniques
In this section we apply the two most widely used machine learning approaches to short term load forecasting, namely artificial neural networks (ANN) and support vector machines (SVM) that will serve as benchmarks toward a comparison with our proposed PCA-regression model.The designed models are compared with these approaches.Other methods include fuzzy systems and knowledge based expert systems (see the literature review in the Introduction section).Such approaches are not recommended for precision forecasting and generally are based on the experts' view, which can be influenced by various random parameters.Hybrid methods of the above techniques are also implemented for forecasting purposes, however we will not discuss them in this work.
For our ANN we use a 2-layer network as shown in Figure 23, with a Levenberg-Marquardt optimization algorithm and one training epoch between updates.The training set is the full set of previous observations beginning from 1 January 2004.The algorithm does not converge within 1000 iterations, bringing up the problem of the seemingly endless training.
For the SVM (support vector regression-SVR) approach we use linear kernel functions and chose to encode the following labels (equivalent to exogenous parameters in our regression models): It is evident that the model with the optimum combination of prediction accuracy and simplicity is the principal components regression, while SARIMAX is the second best performing and ETS is the worst performing.Surprisingly, the Holt-Winters' method provides a good direction prediction movement, despite its small estimation set.This happens because the method manages to track the movement of the seasonality in the load despite its poor accuracy in predicting the next value of the load.The SARIMAX model produces a very reliable in-sample-fit and shows larger record of the most successful forecasting of "extreme" values or spikes load, while at the same time contains the average error to satisfactory levels.Overall, we find the principal component regression to be the most reliable method, due to its simplicity, limited need of computational power and very good prediction accuracy.The SARIMAX model is also very good and robust with a forecasting ability comparable to PC regression, while the ETS have been found to perform poorly in forecasting daily data.This behavior is completely different from its impressive prediction performance when it is fitted to hourly data [11].

Comparison with Machine Learning Techniques
In this section we apply the two most widely used machine learning approaches to short term load forecasting, namely artificial neural networks (ANN) and support vector machines (SVM) that will serve as benchmarks toward a comparison with our proposed PCA-regression model.The designed models are compared with these approaches.Other methods include fuzzy systems and knowledge based expert systems (see the literature review in the Introduction section).Such approaches are not recommended for precision forecasting and generally are based on the experts' view, which can be influenced by various random parameters.Hybrid methods of the above techniques are also implemented for forecasting purposes, however we will not discuss them in this work.
For our ANN we use a 2-layer network as shown in Figure 23, with a Levenberg-Marquardt optimization algorithm and one training epoch between updates.The training set is the full set of previous observations beginning from 1 January 2004.The algorithm does not converge within 1000 iterations, bringing up the problem of the seemingly endless training.For the SVM (support vector regression-SVR) approach we use linear kernel functions and chose to encode the following labels (equivalent to exogenous parameters in our regression models): (i) day of the week; (ii) temperature status (Boolean vector signaling high/low); (iii) holiday.For the training set, we chose the observations of the last year.
The proposed SVR structure is actually the model proposed by Chen et al. [75] and is commonly applied for the purposes of short term load forecast.To implement it we used the LibSVM library for MATLAB (see Chang C.C; Lin C. J, [125]).Convergence is achieved astonishingly quickly (within 2-3 iterations), however the accuracy of the forecast, given the 30 step window is reduced compared to the proposed regression models.The aforementioned techniques were applied to our data and comparative results are presented in Table 11 and Figures 23-27.Machine learning techniques are the basis of the most recent developments in short term load forecasting.Those methods provide reliable results for a very short forecasting period and are mostly implemented in online algorithms.However, in cases where the forecasting window is more than a couple of iterations, machine learning algorithms can provide dubious forecasts.This is shown more clearly in Figures 26 and 27, where for the December month the error is well above the acceptable levels (7% and 10%), while the same methods for May produce more reliable results.Such methods are not robust in terms of the optimization algorithm that is applied, meaning that for different sets different algorithms may be more appropriate.SARIMAX and PC-regression do not experience these drawbacks since in their implementation the parameters remain unchanged, thus they are more easily adopted and applicable on a dataset.
From Table 13 and the figures above it is evident that our proposed model has shown better forecasting performance for the load data of the GEM.We conclude that we can produce better forecasts either by selecting the most appropriate parameters that shape the dynamics of the load  Machine learning techniques are the basis of the most recent developments in short term load forecasting.Those methods provide reliable results for a very short forecasting period and are mostly implemented in online algorithms.However, in cases where the forecasting window is more than a couple of iterations, machine learning algorithms can provide dubious forecasts.This is shown more clearly in Figures 26 and 27, where for the December month the error is well above the acceptable levels (7% and 10%), while the same methods for May produce more reliable results.Such methods are not robust in terms of the optimization algorithm that is applied, meaning that for different sets different algorithms may be more appropriate.SARIMAX and PC-regression do not experience these drawbacks since in their implementation the parameters remain unchanged, thus they are more easily adopted and applicable on a dataset.
From Table 13 and the figures above it is evident that our proposed model has shown better forecasting performance for the load data of the GEM.We conclude that we can produce better forecasts either by selecting the most appropriate parameters that shape the dynamics of the load Machine learning techniques are the basis of the most recent developments in short term load forecasting.Those methods provide reliable results for a very short forecasting period and are mostly implemented in online algorithms.However, in cases where the forecasting window is more than a couple of iterations, machine learning algorithms can provide dubious forecasts.This is shown more clearly in Figures 26 and 27, where for the December month the error is well above the acceptable levels (7% and 10%), while the same methods for May produce more reliable results.Such methods are not robust in terms of the optimization algorithm that is applied, meaning that for different sets different algorithms may be more appropriate.SARIMAX and PC-regression do not experience these drawbacks since in their implementation the parameters remain unchanged, thus they are more easily adopted and applicable on a dataset.
From Table 11 and the figures above it is evident that our proposed model has shown better forecasting performance for the load data of the GEM.We conclude that we can produce better forecasts either by selecting the most appropriate parameters that shape the dynamics of the load series and then specify a SARIMAX model, or by applying the powerful non-parametric PC regression method that allows dimension reduction or data compression with very little loss of useful information.One crucial inherent property and advantage of PCA-regression is that this method is at the same time a very effective noise reduction technique, therefore "facilitating" and improving a lot the regression process.

Conclusions
There is an increasing emphasis on accurate load forecasting, as electricity markets are transformed into complex structures, facilitating energy commodities and financial-like instruments.In this work, we examined the Greek electricity market and attempted to build statistical models that limit the prediction error and provide valuable information to all market participants, from power generators to energy traders.We have built a hybrid model, in a similar way as in the work of Hong et al. [12] in which their hybrid PCA model outperformed an ANN model.The principal component analysis (PCA) approach is a linear tool of the manifold learning field applied on the load data.We considered the given series as a projection-measurement (or realization) taken from a high-dimensional dynamical object, the "load manifold", since the underlying dynamics has a large number of degrees of freedom.This is equivalent to say that load if influenced by a very large number of factors in the electricity market.After applying PCA on load series we obtained a new low dimensional object, a 3D manifold on which the underlying dynamics of load is now influenced by only three factors or principal components (PCs-linear combinations of all the original variables, in our case the 24 hourly time series), that in total explain or capture 96% of the total variance of load.These PCs are independent, uncorrelated (since they are orthogonal to each other as they are extracted from PCA "filter") and "ideal" predictors in a multiple regression model, where the response variable is the load series.We have compared our model with a seasonal ARIMA model with eXogenous parameters, (SARIMAX), the Holt-Winters' exponential smoothing method and their improved extractions (ETS) as well as with ANN and SVM models.To the best of our knowledge, there has not been a study in the Greek electricity market involving PCA or the augmented ARMA model, SARIMAX.We have also chosen the ANN models that have been used extensively in the GEM (e.g., [14,44,45]), with known pros and cons.SARIMAX models however have not been used (although ARMA and ARIMA ones have been applied) [126].The results indicate that the PC-regression method produces forecasts superior to the ones by the conventional, Holt-Winters', ETS, ANN and SVM models (when applied to daily, day-ahead, aggregated load data).Tables 10 and 11 compare the forecasts of the considered models.Focusing on MAPE which is the "benchmark measure" (a crucial measure of quality and reliability of the predictions and in estimating the financial consequences that result from the supply and demand interaction) in the forecasting "business", we conclude that the hybrid PC-regression model outperforms all other models, with SARIMAX the next best performing.We however point out here that the abovementioned superiority of the PC-regression model in forecasting found in this work, has to be considered in conjunction with the type of the data on which it has been applied, i.e., daily, day-ahead, aggregated system-wide load data.
It is our intention, in the near future, to examine the application of our model also on hourly data.Moreover, as a future work, we plan to examine other novel techniques of manifold learning, in place of PC, as the Locally Linear Embedding (LLE) and Isometric Feature Mapping (Isomap) on load as well as price daily data, methods that have not been used so far in electricity load and price forecasting.
where ξ i is the upper limit of the training error and ξ i * the lower one.C parameter is a constant which determines the tradeoff between the model complexity and the approximation error.Equation ( 4) is subjected to the following constraints: where ai and ai * are the Lagrange multipliers and K(xi, xj) represents the Kernel function.Kernel functions are used in order to reduce the computational burden since φ(x) has not to be explicitly computed.By maximizing the Lagrange multipliers the Karush-Kuhn-Tucker (KKT) conditions for regression are applied [127]; Maximization of: The former optimization problem as described by equations 4 and 5 can be solved with the help of Lagrangian multipliers: Thus: where a i and a i * are the Lagrange multipliers and K(x i , x j ) represents the Kernel function.Kernel functions are used in order to reduce the computational burden since φ(x) has not to be explicitly computed.By maximizing the Lagrange multipliers the Karush-Kuhn-Tucker (KKT) conditions for regression are applied [127]; Maximization of: The performance of the SVM regression model is strongly dependent on the selection of the parameter C, ε as well as the chosen kernel function.For our SVM models we have used LibSVM program.

C2. Artificial Neural Networks
Neural networks aim to mimic brain activity and approach solution to a problem by applying a laws that the network has ''learned" as a result of a training process [74].Much like a biological brain, the ANN is a set of individual operators, called neurons that co-operate to accomplish a task.Neurons are structured in layers and typically an ANN has an input layer, a number of hidden layers and an output layer as shown in Figure A2.The number of neurons in each layer can vary and usually is selected by an optimization algorithm.Each neuron has one input and its output is connected with all neurons of the next layer, creating the connections of the network.There are various neuron types, most common being the nonlinear sigmoid operator:

Figure 1 .
Figure 1.Available methods applied in each case of load forecasting, based on forecasting horizon (short-term, medium-term, or long-term).

Figure 1 .
Figure 1.Available methods applied in each case of load forecasting, based on forecasting horizon (short-term, medium-term, or long-term).

Figure 3
depicts the evolution of load for each hour of the day for the year 2013, while in Figure4we show the fluctuation of the mean, variance, kyrtosis and skewness of load for each hour of the day and for the period 2004-2014.

Figure 2 .
Figure 2. Daily average (of 24 h) load in Greek electricity market, 2004-2014, with its autocorrelation and partial autocorrelation functions.

Figure 2 .
Figure 2. Daily average (of 24 h) load in Greek electricity market, 2004-2014, with its autocorrelation and partial autocorrelation functions.

Figure 3 .
Figure 3. Evolution of load series for each hour of the day, for the year 2013.

Figure 4 .
Figure 4. Profile of variance, meankurtosis and Skewness of load, for hours 1 to 24, for the period 2004-2014.

Figure 5 .
Figure 5.The parabolic variation of average temperature in Greece with load.

Figure 3 . 40 Figure 3 .
Figure 3. Evolution of load series for each hour of the day, for the year 2013.

Figure 4 .
Figure 4. Profile of variance, meankurtosis and Skewness of load, for hours 1 to 24, for the period 2004-2014.

Figure 5 .
Figure 5.The parabolic variation of average temperature in Greece with load.

Figure 4 .
Figure 4. Profile of variance, meankurtosis and Skewness of load, for hours 1 to 24, for the period 2004-2014.

Energies 2016, 9 , 635 9 of 40 Figure 3 .
Figure 3. Evolution of load series for each hour of the day, for the year 2013.

Figure 4 .
Figure 4. Profile of variance, meankurtosis and Skewness of load, for hours 1 to 24, for the period 2004-2014.

Figure 5 .
Figure 5.The parabolic variation of average temperature in Greece with load.

Figure 5 .
Figure 5.The parabolic variation of average temperature in Greece with load.

Figure 6 .
Figure 6.Histogram of load time series, summary statistics and Jarque-Bera test results.

Figure 7 .
Figure 7.The Q-Q plot of load series, indicating a deviation from normality at right tail.

Figure 8 .
Figure 8. Periodogram of daily average load in the frequency domain.

Figure 7 .
Figure 7.The Q-Q plot of load series, indicating a deviation from normality at right tail.

Energies 2016, 9 , 635 11 of 40 Figure 7 .
Figure 7.The Q-Q plot of load series, indicating a deviation from normality at right tail.

Figure 8 .
Figure 8. Periodogram of daily average load in the frequency domain.

Figure 8 .
Figure 8. Periodogram of daily average load in the frequency domain.
a purely additive model we have: Y = T + S + E or Y = S + T, while in a pure multiplicative model: Y = T•S•E or Y = T•E.Hybrid models are: Y = (T•S) + E or Y = (T + S)(1 + E).

Figure 11 .
Figure 11.Principal component analysis of the load series, for the time period 2004-2014.

Figure 11 .
Figure 11.Principal component analysis of the load series, for the time period 2004-2014.

Figure 11 .
Figure 11.Principal component analysis of the load series, for the time period 2004-2014.

Figure 13 . 3 -
Figure 13.3-Dimensional manifold after the PCA after dimension reduction of the original 24-dimensional space.

Figure 13 . 3 -
Figure 13.3-Dimensional manifold after the PCA after dimension reduction of the original 24-dimensional space.

Figure 17 .
Principal component regression forecast for May 2014.

Figure 21 .
Figure 21.Forecast accuracy comparison for May 2014 for the Greek electricity market.

Figure 22 .
Figure 22.Forecast accuracy comparison for December 2014 for the Greek electricity market.

Figure 23 .
Figure 23.General structure of the ANN we apply.

Figure 23 .
Figure 23.General structure of the ANN we apply.

Figure 23 .
Figure 23.General structure of the ANN we apply.

1 e
−ax+b + c (C10) An ANN is functional only after the training process is complete, thus the parameters of each neuron are estimated.There exist numerous training algorithms, some benchmark methods being Hebbian training and error back-propagation.Energies 2016, 9, 635 35 of 40Neurons are structured in layers and typically an ANN has an input layer, a number of hidden layers and an output layer as shown in FigureA2.The number of neurons in each layer can vary and usually is selected by an optimization algorithm.Each neuron has one input and its output is connected with all neurons of the next layer, creating the connections of the network.There are various neuron types, most common being the nonlinear sigmoid operator:1 (C10)An ANN is functional only after the training process is complete, thus the parameters of each neuron are estimated.There exist numerous training algorithms, some benchmark methods being Hebbian training and error back-propagation.

Figure A2 .
Figure A2.The general structure of a neural network, showing the inputs, input layer, hidden layers and output layer.

Figure A2 .
Figure A2.The general structure of a neural network, showing the inputs, input layer, hidden layers and output layer.

Table 1 .
Descriptive statistics of Load and predictor mean temperature in Greece.

Table 1 .
Descriptive statistics of Load and predictor mean temperature in Greece.

Table 2 .
Statistical tests applied to the original load series.

Table 2 .
Statistical tests applied to the original load series.

Table 5 .
SARIMAX model parameters for estimation set in the time period 2004-2013.

Table 6 .
Evaluation of the SARIMAX model for estimation set in the period 2004-2013.

Table 7 .
Results from the tests on residuals generated from SARIMAX models.
Figures 15 and 16 show the SARIMAX's model forecasts for May 2014 and December 2014 respectively.

Table 8 .
A comparative list of the prediction efficiency and parameters selection of the SARIMAX models fitted on load series.

Table 9 .
Holt-Winter's model and evaluation parameters for May and December 2014.The Exponential smoothing forecasts for May 2014 and December 2014 are shown in Figures 19 and 20 respectively.

Table 12 .
Average errors per month for the time period 2004-2014 & length of estimation set for each method.

Table 10 .
Average errors per month for the time period 2004-2014 & length of estimation set for each method.

Table 12 .
Average errors per month for the time period 2004-2014 & length of estimation set for each method.
The aforementioned techniques were applied to our data and comparative results are presented in Table13and Figures23-27.

Table 13 .
Model comparison for May and December 2014.

Table 11 .
Model comparison for May and December 2014.
the proposed regression models.The aforementioned techniques were applied to our data and comparative results are presented in Table13and Figures23-27.

Table 13 .
Model comparison for May and December 2014.