Benchmark Comparison of Analytical, Data-Based and Hybrid Models for Multi-Step Short-Term Photovoltaic Power Generation Forecasting

Abstract: Accurately forecasting power generation in photovoltaic (PV) installations is a challenging task, due to the volatile and highly intermittent nature of solar-based renewable energy sources. In recent years, several PV power generation forecasting models have been proposed in the relevant literature. However, there is no consensus regarding which models perform better in which cases. Moreover, the literature lacks works presenting detailed experimental evaluations of different types of models on the same data and under the same forecasting conditions. This paper attempts to fill this gap by presenting a comprehensive benchmarking framework for several analytical, data-based and hybrid models for multi-step short-term PV power generation forecasting. All models were evaluated on the same real PV power generation data, gathered from a small-scale pilot site in Thessaloniki, Greece. The models predicted PV power generation on multiple horizons, namely 15 min, 30 min, 60 min, 120 min and 180 min ahead of time. Based on the analysis of the experimental results, we identify the cases in which specific models (or types of models) perform better than others, and explain the rationale behind those model performances.


Introduction
Photovoltaic (PV) power generation is constantly gaining ground as a renewable energy source (RES) within the energy market. In 2018, a capacity of over 500 GW, providing around 600 TWh (roughly 2.5% of the global electricity production), was documented [1]. By 2019, PV capacity is estimated to reach 650 GW, providing 4% of the global production [2]. Additionally, future scenarios for the penetration of RES systems in the market are even more optimistic, with some countries aiming to reach 100% [3] in the next decades, towards complete decarbonization. Therefore, it is evident that PV systems are expected to be a key player in this rapidly evolving energy landscape.
Nevertheless, PV production is volatile and intermittent, due to its direct dependency on weather conditions. This introduces considerable uncertainty to the system operation, which translates into significant risks to the stability and reliability of both the transmission and distribution networks [4]. The challenge is further exacerbated by the distribution of PV penetration. Several small and medium installations are appearing around the world, making such PV plants the most commonly accessed RES-based distributed energy resource (DER) [5]. This popularity creates the challenge of efficiently managing PV power generation. An unexpected shortage or excess can lead to a severe imbalance between supply and demand, requiring mitigation actions from the system operator in order to avoid penalties or more severe consequences for the network operation. From a financial perspective, other market entities, like aggregators and flexibility traders, have also invested in the optimal management of such resources in order to maximise their profits through more efficient market participation.
An important factor in addressing these challenges is the ability to forecast the power generated by PV systems as accurately as possible. Considerable effort has been invested in this direction, as indicated by the relevant literature. Depending on the application, the time horizon for forecasting PV power generation varies from a few minutes (short-term) to days (long-term). The former is usually employed for improving control schemes as well as participation in intra-day and ancillary markets, while the latter is applied mainly for maintenance and planning [6]. In both cases, PV power generation forecasting models can be classified into three categories:
• Analytical: These methods do not require any prior knowledge regarding the power generation measurements. They deliver the required results using well-known analytical equations that incorporate the technical characteristics of the PV installation along with weather forecasts derived from typical numerical weather prediction (NWP) models.
• Data-based: These models are data-driven, meaning that they depend solely on the historical PV power generation data, without any knowledge regarding the PV system itself. The basis of these models is the discovery of patterns and relations within the provided data. This category includes statistical time series models (e.g., the autoregressive integrated moving average (ARIMA) model), traditional machine learning (ML) models (e.g., artificial neural networks (ANNs)) and deep learning (DL) models.
• Hybrid: These models attempt to combine the best characteristics of the other two categories in order to achieve higher forecasting accuracy. Different data-based models merged into one, data-based models on top of analytical models, and data-based models using NWP techniques are some of the combinations identified in the literature. Hybrid models appear to hold considerable potential for delivering the most accurate forecasting results.
In each of the above categories, quite interesting results have been presented over the last few years, with forecasting errors reaching below 1% [7,8]. Nevertheless, in most cases, those results are limited and fragmented, due to the lack of a common evaluation framework. Some of the factors limiting their scalability and replicability include: (a) forecasting over clear sky scenarios only, (b) limited amount of data, (c) presentation of results over very specific time frames, (d) inclusion of non-productive time slots (i.e., night hours) in the error metrics calculations, (e) elucidation of results from different locations, datasets and PV plant sizes [6,9]. Due to such factors, it has become quite difficult to thoroughly evaluate the predictive ability of a specific model. Therefore, a comprehensive benchmarking framework that takes into account a variety of factors during the evaluation of a multitude of PV power generation forecasting models from different categories is of great significance. On top of that, it is important to critically compare different types of methods in order to identify the objectively strong points of each type. To the best of our knowledge, very few efforts have been invested in this direction (e.g., [10]) and even in those, the outcomes were not fully aligned with the rest of the literature [9]. This paper presents a comprehensive benchmarking framework for several analytical, data-based and hybrid models for multi-step short-term PV power generation forecasting. All models were evaluated on the same real-world power data and forecasting conditions (i.e., forecasting objectives, horizons, evaluation metrics, etc.). The main contributions of the work presented in this paper include:

1. A comprehensive benchmark comparison between analytical, data-driven and hybrid direct PV power generation forecasting models.
acceptable accuracies in cases where the geographical areas of interest are of the order of tens of square kilometers [29]. Finally, the analytical direct and indirect models are proven superior in long-term forecasting scenarios (i.e., from one day to few months) and for large PV installations (order of magnitude MW), but they present inferior performance in short-term forecasting scenarios and for small PV arrays (order of magnitude kW) [30,31].

Data-Based Models
The poor performance of analytical models in short-term forecasting scenarios led researchers to investigate the potential of data-based models. These models depend solely on the available PV power data when they predict PV power generation directly, while they process both PV power and weather data when they predict PV power generation indirectly [6]. This category includes the typical statistical time series models (e.g., the autoregressive integrated moving average (ARIMA) model), the traditional machine learning models (e.g., support vector machines (SVM)), and the deep learning models (e.g., deep neural networks (DNN)). A comparative analysis of these data-based approaches is difficult, since each published work presenting such a model uses a completely different evaluation framework (hour-ahead versus day-ahead forecasts, small versus large PV plants, etc.).
In the case of day-ahead PV power generation forecasting for small-scale PV installations, the majority of published works use the day-ahead prediction of a weather variable (generated by a typical NWP model) to feed a data-based model. This accounts for the non-linearities of the generated PV power under different weather conditions [32]. In these works, the reported accuracy of the models varies significantly, since there are multiple variables that may impact accuracy. Therefore, it can be stated that no particular model published in the PV power generation forecasting literature is proven to be consistently superior to the others [33]. Some data-based models proven to yield acceptable accuracy for day-ahead PV power generation forecasting are Extreme Learning Machines (ELM) [34] and Self-Organizing Maps (SOM) [35].
As already mentioned, the data-based models include both direct and indirect approaches. The indirect approaches combine historical measurements of PV output power and weather variables (e.g., irradiance, temperature and humidity) in order to build a model that produces highly accurate predictions. For example, Das et al. [36] proposed a support vector regression (SVR) model for hourly and day-ahead PV power generation forecasting. Though the results were promising, only sunny days were used for demonstration, so the problem of forecasting on cloudy and rainy days was not covered. In cases where limited historical PV power data exist, iterative multi-task learning can be utilized by sharing the PV information from multiple similar solar panels [37]. Moreover, the importance of integrating weather information into data-driven models is highlighted by De Giorgi et al. [38], who proposed an Elman neural network model for direct day-ahead PV power generation forecasting. The outcome of this work is a significantly improved prediction accuracy when both temperature and irradiance historical measurements are added to the input vectors of the network. Finally, weather data can also be used for data pre-processing tasks instead of being directly integrated into the data-based prediction model [39]. For example, Yang et al. [40] divided the historical PV power data into weather-based subsets for sunny, cloudy and rainy days.

Statistical Time Series Models
Statistical time series models are the first data-based models employed for direct PV power generation forecasting. Some of the first models used were based on the linear regression model [41–43], the ARIMA model [44,45] and its variants [44,46]. In many studies, these models (along with the naive persistence model) are used for benchmarking purposes [41,44,47–50]. Additionally, such statistical models with several input variables are used to estimate the correlation between the PV power generation and weather variables [48,51]. However, these models are linear with respect to both their regressors and parameters, which results in poor performance due to the fact that the PV power generation process is, in general, a nonlinear phenomenon [42].

Traditional Machine Learning Models
The second type of data-based PV power generation forecasting models is the traditional machine learning models [52], namely k-nearest neighbors (kNN) [33], support vector machines (SVM) [14,49,53] and artificial neural networks (ANN). kNN models appear to yield acceptable performance [53]. For example, Fernandez-Jimenez et al. [47] proposed kNN and weighted kNN models for direct PV power generation forecasting with quite accurate results. On the other hand, SVM models present mediocre results in terms of forecasting accuracy, even in case of very short-term direct forecasting (i.e., up to 30 min ahead). Shi et al. [14] presented an SVM-based PV power generation forecasting model that approximately estimated PV power generation using day-ahead weather predictions.
ANNs have grown in popularity due to their ability to accurately represent the highly nonlinear mapping between PV power generation and its related variables [54]. Fernandez-Jimenez et al. [47] proposed five different ANN architectures, which achieved superior performance compared to ARIMA, kNN and adaptive neuro-fuzzy inference systems (ANFIS). Chen et al. [35] used radial basis function networks (RBFN) to forecast the day-ahead PV power generation, having initially clustered the predictions of the weather variables. This model presented mediocre forecasting accuracy on cloudy and rainy days. Similarly, Sideratos and Hatziargyriou [55] proposed an RBFN-based PV power generation forecasting model demonstrating high accuracy in long-term forecasting scenarios (e.g., 24 h forecasting horizons) and sunny periods. However, a critical limitation of ANNs is that they require large amounts of data (and, subsequently, long training times) in order to achieve high forecasting accuracy [56].

Deep Learning Models
Deep learning (DL) is a sub-field of machine learning, which includes complex ANN architectures that automatically identify and extract useful features from raw data. Deep learning models have been extensively used for time series forecasting tasks [48,56–58], due to their ability to learn complex relationships from the data and use them to provide accurate forecasting results. There are (roughly) three main categories of deep learning models used for time series forecasting: deep neural networks (DNN), convolutional neural networks (CNN) and recurrent neural networks (RNN). Among RNN architectures, the long short-term memory (LSTM) network is the most widely used for time series forecasting. Recently, several DL architectures have been proposed in the PV power generation forecasting literature. For example, Qing and Niu [43] proposed an LSTM architecture to predict the hour-ahead solar irradiance, which is then used for estimating the PV power generation. This model was claimed to yield 18% higher forecasting accuracy compared to traditional ANNs.
Ouyang et al. [59] proposed an RNN-based PV power generation forecasting model, which was combined with clustering algorithms. The model exhibited good forecasting results on sunny days. Additionally, Ghimire et al. [60] introduced an indirect PV power generation forecasting approach in which a CNN extracts features of solar irradiation, which in turn are used by an LSTM to predict the hour-ahead irradiance. Kim and Lee [61] proposed an LSTM model with multiple inputs that include meteorological factors, seasonal factors and preceding power output information in order to predict PV power generation in the peak zone. Vidisha De et al. [62] proposed an LSTM-based model that yielded a small forecasting error (i.e., approximately 5%) even though it was trained with limited data. Several other LSTM-based power generation forecasting models have been proposed in the relevant literature [42,44,48,49,57]. These models present superior forecasting performance compared to conventional models like ARIMA and DNN, especially in the case of short-term PV power generation forecasting. However, these models have limitations, such as the long training times and the requirement for large amounts of available data in order to produce highly accurate predictions [49].

Hybrid Models
Apart from the analytical and data-based models, there are also other PV power generation forecasting models that attempt to combine the best characteristics of these categories in order to achieve higher forecasting accuracy. These are the hybrid models. The hybrid models combine characteristics either from models of the same category (i.e., multiple analytical or multiple data-based models) or from models of different categories (i.e., analytical and data-based models). The hybrid approaches make up only 6% of the published PV power generation forecasting models [9]. In this context, Bracale et al. [63] proposed a probabilistic direct forecasting model based on Bayesian inference and Monte Carlo simulation. The model used an analytical function in order to connect the hourly sky clearness index to the maximum power point production of a PV plant. Despite its ability to identify the probability distribution function of power generation, the model underperformed in the one-step-ahead prediction case. In general, unstable meteorological conditions usually result in inferior performance of the hybrid analytical-data-based models [38,56]. Another hybrid approach for day-ahead direct PV power generation forecasting was proposed by Mosaico and Saviozzi [54]. The authors proposed a decision system that selects between an analytical and an ANN model based on the current cloud coverage percentage. The model presents acceptable performance on clear-sky days and poor performance on cloudy days. Additionally, Luyao et al. [64] proposed a hybrid PV power generation forecasting model that combines three ANN architectures with genetic algorithms (GA). Finally, Wang et al. [8] presented a hybrid model that fused a CNN with an LSTM architecture.

Materials & Methods
In this section, the real-world power generation data used for the evaluation of the PV power generation forecasting models are presented. Additionally, a short mathematical description of each of the nine PV power generation forecasting models is provided. Finally, the configuration parameters of the overall experimental framework are presented.

Field Data
This subsection presents the real-world power generation data used for the evaluation of the PV power generation forecasting models and describes the preprocessing methods applied to them.

Data Description
The dataset used in this study was collected from a real-world small-scale PV installation, located on the roof of a building that emulates a two-floor family house. This building is one of the official European Commission Digital Innovation Hubs (DIH) and is located within the campus of the Centre for Research and Technology Hellas (CERTH), 6 km away from the metropolitan area of Thessaloniki, Greece. This "smart house" is part of the research and experimental infrastructure of CERTH. Its current installation consists of 58 CIS (copper, indium, and selenium) solar panels with 165 Wp nominal power each. The solar panels are divided into 9 strings that form 9.57 kWp in total, and they are facing 255° south-west with a tilt angle of 18° (Figure 1). The PV installation is briefly shaded during early morning hours by a hill located to the northeast of the building. Finally, according to the Köppen Climate Classification Map (https://www.plantmaps.com/koppen-climateclassification-map-europe.php), the climate is classified as Cold Semi-Arid (BSk). The exact latitude and longitude of the installation are 40.566501 and 22.998864, respectively.
The dataset contains the power generation values of the above PV installation for each 15-min interval of a total period of 11 months, namely from 24 March 2019 to 29 February 2020. This is a total period of 343 days. However, 38 days in this period had no available data. Hence, the total number of days with available PV power generation data is 305. Based on the data granularity (i.e., one value per 15-min interval, or 96 values per day) and the total time period covered by the data (i.e., 305 days), the dataset contains 305 × 96 = 29,280 power generation values, organised into time series.
Apart from the dataset of the PV power generation values, a dataset of weather data has also been assembled. In particular, measurements for three weather variables, namely temperature, wind speed and cloud coverage, have been collected for the location of the aforementioned PV installation. This data was collected from the online weather data aggregation service Weatherbit (https://www.weatherbit.io/), which provides weather information in 15-min intervals (same as the granularity of the PV power generation dataset) for locations anywhere on the globe. The total period covered by this data is the same as the period covered by the PV power generation dataset. Again, the data is organised into time series. A complementary source of weather information, namely the weather data aggregation service Darksky (https://darksky.net/dev), was used specifically for cloud coverage data. Predictions are also given in time series format, in 15-min intervals. Finally, it should be mentioned that the above weather services have been used in order to collect both actual and forecasted values of the weather variables. The forecasted values are generated using typical NWP models.

Data Segmentation
As identified in similar works found in the literature (e.g., [36,40]), it is considered good practice to divide the available PV power generation data into periods with stable weather conditions (e.g., sunny days period and cloudy days period), and build different forecasting models for each period. This approach was followed in the present study. In particular, the PV power generation dataset was initially split into spring, summer, autumn and winter periods. Within each period, the data were re-divided based on the corresponding cloud coverage values. Specifically, the days were separated into high and low cloud coverage days based on whether the average cloud coverage of the day exceeded a specific cloud coverage threshold. This threshold was set to 10% after experimentation. Finally, it should be mentioned that most of the PV power generation measurements from the time intervals before 05:00 UTC and after 17:00 UTC were zero and therefore they were discarded.
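The cloud-coverage split described above can be sketched in a few lines. This is only an illustration: the function name `split_by_cloud_coverage`, the dictionary layout and the sample values are hypothetical, and only the 10% threshold and the "average exceeds threshold" rule come from the text.

```python
from statistics import mean

CLOUD_THRESHOLD = 10.0  # percent, the threshold reported in the text


def split_by_cloud_coverage(days, threshold=CLOUD_THRESHOLD):
    """Split days into low and high cloud coverage groups.

    `days` maps a day label to the cloud coverage values (in percent)
    observed during that day. This layout is an assumption for illustration.
    """
    low, high = [], []
    for label, coverage in days.items():
        # A day counts as "high" when its average coverage exceeds the threshold.
        (high if mean(coverage) > threshold else low).append(label)
    return low, high


# Made-up example values, not taken from the dataset.
days = {
    "2019-07-01": [0.0, 5.0, 10.0],    # average 5.0
    "2019-07-02": [40.0, 60.0, 80.0],  # average 60.0
}
low, high = split_by_cloud_coverage(days)
```

A separate forecasting model would then be trained on each of the two groups within every seasonal period.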

Data Transformation for Supervised Learning
The data-based and hybrid models presented in this work are trained in a supervised-learning way. This means that in order to train these models, first a set of training samples of the form $\{(z_1, y_1), \dots, (z_N, y_N)\}$ is required, where $z_j \in \mathbb{R}^d$ and $y_j \in \mathbb{R}$. The $z_j$ vectors and the $y_j$ values should then be applied to the input and output of the models, respectively. However, as mentioned above, the data is organised as a set of time series $x^i$ of size $n$ each. In order to transform a time series of data into a set of training samples appropriate for training a model in a supervised way, a window of fixed size $p$ and a forecasting horizon $h$ should be selected. Then, the window passes over the time series one step at a time and matches the time series values it covers to a training sample. This transformation technique is called sliding window. Having a fixed window length assists in the creation of input-output pairs. In particular, the first step is to select the values $x^i_0, \dots, x^i_{p-1}$ as the first training vector $z_1$ and the value $x^i_{p-1+h}$ as the first training output $y_1$. Next, the values $x^i_1, \dots, x^i_p$ form the second training vector $z_2$ and the value $x^i_{p+h}$ the second training output $y_2$, and so on. In this way, a set of $n - p - h + 1$ training samples is generated from a time series of size $n$. For a set of $m$ time series of size $n$, the number of generated training examples is $m \times (n - p - h + 1)$.
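The sliding window transformation can be illustrated with a minimal sketch. The function name `sliding_window` and the toy series are assumptions made for illustration; the indexing follows the description above (window size p, horizon h, target taken h steps after the last value of the window).

```python
def sliding_window(series, p, h):
    """Turn one time series into (input window, target) pairs.

    p is the window size and h the forecasting horizon, as in the text.
    """
    samples = []
    # The last usable window starts at index n - p - h.
    for start in range(len(series) - p - h + 1):
        z = series[start:start + p]    # x_start, ..., x_{start+p-1}
        y = series[start + p - 1 + h]  # value h steps after the window end
        samples.append((z, y))
    return samples


series = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]  # toy series with n = 10
samples = sliding_window(series, p=4, h=2)
```

With n = 10, p = 4 and h = 2, the sketch yields n − p − h + 1 = 5 samples, the first being the window [0, 1, 2, 3] paired with the target 5.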

PV Power Generation Forecasting Models
The objective of this work is to present a comprehensive evaluation framework of several analytical, data-based and hybrid models for multi-step short-term PV power generation forecasting. Extending previous research findings [8,10], this paper aims to evaluate a wider range of models over the same dataset and forecasting parameters, towards presenting a more holistic overview over their performance on multi-step short-term PV power generation forecasting, as presented in Figure 2. In this subsection, a short mathematical description of each model evaluated in this study is provided.

Analytical Model
The first PV power generation forecasting model used in this study is a physical model. As explained, the physical models emulate both the electrical and thermal properties of the PV cell and demonstrate highly accurate forecasting performance. The physical model was implemented using the open source software PVLIB [65]. PVLIB is widely used in the PV power generation forecasting literature, as it implements a simple electrical model that exhibits forecasting performance equivalent to a higher-order model, and it also incorporates the Sandia thermal model [66]. The model requires the accurate definition of the following variables:
• PV configuration: PV construction details, such as the type and number of modules, the type of inverter, and the installation's tilt and azimuth angles, should be defined.
• Weather data: Cloud coverage and temperature forecasts should also be provided as input to the physical model.

Statistical Models
As already mentioned, the statistical models are the first data-based models used for direct PV power generation forecasting. In this section, the details of the statistical models used in this study, namely the persistence (or random walk) model and the ARIMA model, are provided.

Persistence model
In every time series forecasting task, it is useful to have a simplistic model, like the persistence model (also referred to in the forecasting literature as random walk or naive model), as a basic benchmark. In the persistence model, the forecasted value of the dependent variable is equal to the current value of the variable, regardless of the forecasting horizon. The prediction equation of the persistence model is as follows:
$$\hat{x}_{t+h} = x_t$$
where $x_t$ is the current value of the variable, $\hat{x}_{t+h}$ is its forecasted value and $h$ is the forecasting horizon. If the forecasting accuracy of a new model is not higher than the accuracy of the persistence model, then the new model cannot be considered useful.
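As a quick illustration of the persistence rule (the forecast for any horizon equals the current value), the sketch below pairs each forecast with the value that is actually observed h steps later. The helper names, the made-up power values and the use of the mean absolute error as the score are assumptions for illustration, not the evaluation setup of this study.

```python
def persistence_forecast(series, h):
    """Return (forecast, actual) pairs for horizon h, with forecast = current value."""
    return [(series[t], series[t + h]) for t in range(len(series) - h)]


def mean_absolute_error(pairs):
    """Average absolute difference between forecasts and actual values."""
    return sum(abs(f - a) for f, a in pairs) / len(pairs)


power = [0.0, 1.0, 3.0, 6.0, 5.0, 2.0]  # made-up values in kW
mae_1 = mean_absolute_error(persistence_forecast(power, h=1))
mae_2 = mean_absolute_error(persistence_forecast(power, h=2))
```

On this toy series the error grows with the horizon, which is the typical behaviour a candidate model has to beat at every horizon.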

Autoregressive integrated moving average
The ARIMA model is one of the most widely used statistical models for time series forecasting tasks in general, and for PV power generation forecasting in particular. The method was popularised by the work of Box and Jenkins [67] in the 1970s. In short, an ARIMA(p, d, q) model is described by the following equation:
$$\left(1 - \sum_{j=1}^{p} \phi_j L^j\right)(1 - L)^d x_t = \left(1 + \sum_{j=1}^{q} \theta_j L^j\right)\varepsilon_t$$
where $p$ is the autoregressive order, $q$ is the moving average order, $d$ is the order of differencing (i.e., how many times to apply the first differences method in order to make a time series stationary), $\phi_j$ are the autoregressive parameters of the model, $\theta_j$ are the moving average parameters of the model, $L^j$ is the lag operator (i.e., $L^j x_t = x_{t-j}$) and $\varepsilon_t$ is white noise with zero mean and constant variance. The parameters of the ARIMA model are generally estimated using either the non-linear least squares method or the maximum likelihood estimation method. When the ARIMA model does not include the moving average component, its autoregressive parameters can be estimated using the ordinary least squares (OLS) method.
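To make the differencing step and the OLS remark concrete, the following sketch handles the simplest case only: an AR(1) model without intercept and without a moving average part. The function names and the toy series are hypothetical, and this closed-form estimate is not the estimation procedure used in this study.

```python
def difference(series, d=1):
    """Apply first differences d times (the 'I' part of ARIMA)."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series


def fit_ar1_ols(series):
    """Estimate phi in x_t = phi * x_{t-1} + e_t by ordinary least squares."""
    prev, curr = series[:-1], series[1:]
    return sum(p * c for p, c in zip(prev, curr)) / sum(p * p for p in prev)


def forecast_ar1(last_value, phi, h):
    """Iterate the AR(1) recursion h steps ahead with future noise set to zero."""
    return last_value * phi ** h


series = [8.0, 4.0, 2.0, 1.0, 0.5]  # made-up, exactly geometric series
phi = fit_ar1_ols(series)
two_steps_ahead = forecast_ar1(series[-1], phi, h=2)
```

For higher orders p, the same idea becomes a linear regression of the current value on its p previous values.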

Traditional Machine Learning Models
This subsection provides the details of the traditional machine learning models used in this study, namely the support vector regression (SVR) and the gradient boosted trees (GBT).

Support vector regression
SVR is the version of the support vector machine (SVM) model for regression problems. Considering that $z_j \in \mathbb{R}^d$ is the input vector of the SVR model, its prediction is given by the following equation:
$$\hat{y}_j = w \cdot z_j + b$$
where $\hat{y}_j$ is the prediction for the input vector $z_j$, $w$ is the parameter vector of the SVR model, $b$ is the bias of the SVR model and $\cdot$ denotes the dot product. Given a set of training samples $\{(z_1, y_1), \dots, (z_n, y_n)\}$, the training process of the SVR model (i.e., the process of estimating its parameter vector $w$) can be expressed by the following optimisation problem:
$$\min_{w, b} \; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad |y_j - (w \cdot z_j + b)| \le \varepsilon, \quad j = 1, \dots, n$$
where $\varepsilon$ is a hyperparameter of the SVR model that serves as a threshold. In particular, all predictions have to be within an $\varepsilon$ range of the true values. In addition, slack variables may be introduced to the problem in order to allow prediction errors to fall outside the $\varepsilon$ range boundaries. The above optimisation problem is usually solved using quadratic programming methods like the method of Lagrange multipliers.
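The role of the ε threshold and of the slack variables can be illustrated with a small sketch of a linear SVR prediction and the corresponding ε-insensitive error. The weights, bias and inputs below are made-up values, and the function names are hypothetical; no training is performed here.

```python
def svr_predict(w, b, z):
    """Linear SVR prediction: dot product of w and z plus the bias b."""
    return sum(wi * zi for wi, zi in zip(w, z)) + b


def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the epsilon tube, linear outside (the slack needed for this sample)."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


w, b = [0.5, 0.25], 1.0  # made-up parameters
z = [2.0, 4.0]           # made-up input window
y_hat = svr_predict(w, b, z)

inside = epsilon_insensitive_loss(3.3, y_hat, epsilon=0.5)   # within the tube
outside = epsilon_insensitive_loss(4.0, y_hat, epsilon=0.5)  # outside the tube
```

A sample whose error stays within ε contributes nothing to the training objective, while a sample outside the tube needs a slack equal to the amount by which it exceeds ε.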

Gradient boosted trees
GBT is a model based on gradient boosting, a technique used for both regression and classification problems. A gradient-boosting-based model produces predictions as ensembles of multiple predictions generated by weak prediction models called weak learners. The weak learners are trained sequentially, each one correcting the errors made by its predecessor. In the case of GBT, the weak learners are decision trees. GBT aims to minimise an objective function that combines a convex loss function and a penalty term for model complexity. The training process proceeds iteratively, adding new trees that predict the residuals or errors of prior trees, which are then combined with the previous trees to make the final prediction. The simplified form of the objective function for the new tree $f_t$ is [68]:
$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(z_i) + \frac{1}{2} h_i f_t^2(z_i) \right] + \Omega(f_t)$$
where $g_i$ and $h_i$ are the first and second order gradient statistics of the loss function $l$, which are defined as follows:
$$g_i = \partial_{\hat{y}_i^{(t-1)}} \, l\left(y_i, \hat{y}_i^{(t-1)}\right), \qquad h_i = \partial^2_{\hat{y}_i^{(t-1)}} \, l\left(y_i, \hat{y}_i^{(t-1)}\right)$$
with $\hat{y}_i^{(t-1)}$ denoting the prediction for the $i$-th training sample after the first $t-1$ trees. The second term of the objective function, $\Omega(f_t)$, is a regularization term in charge of seeking the appropriate final weights to avoid overfitting.
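The gradient statistics can be made concrete for one specific case. The sketch below assumes a squared-error loss l = 0.5 (y − ŷ)² and an L2 penalty 0.5 λ w² on the weight w of a single leaf; both choices, as well as the function names and the numbers, are assumptions for illustration and are not stated in the text.

```python
def squared_error_gradients(y_true, y_pred):
    """First and second order gradient statistics of l = 0.5 * (y - y_hat)^2."""
    g = [p - y for y, p in zip(y_true, y_pred)]  # dl/dy_hat
    h = [1.0 for _ in y_true]                    # d2l/dy_hat2
    return g, h


def optimal_leaf_weight(g, h, lam):
    """Weight w minimising sum(g*w + 0.5*h*w^2) + 0.5*lam*w^2 over one leaf."""
    return -sum(g) / (sum(h) + lam)


y_true = [3.0, 5.0, 4.0]  # made-up targets falling into one leaf
y_pred = [2.0, 2.0, 2.0]  # predictions of the trees built so far
g, h = squared_error_gradients(y_true, y_pred)

w_plain = optimal_leaf_weight(g, h, lam=0.0)   # equals the mean residual
w_shrunk = optimal_leaf_weight(g, h, lam=1.0)  # pulled towards zero by the penalty
```

Without the penalty, the new leaf simply predicts the mean residual of its samples; the regularization term shrinks that correction, which is how it limits overfitting in this simplified setting.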

Deep Learning Models
This section provides the details of the deep learning models used in this study, namely a DNN architecture and an LSTM network architecture.

Deep neural networks
A DNN is a typical feed-forward ANN, which consists of at least four layers of nodes, namely the input layer, the output layer and at least two hidden layers. Except for the input layer, all other layers contain neurons with arbitrary activation functions. These activation functions may be either linear or nonlinear, but in the majority of cases they are nonlinear (e.g., the hyperbolic tangent function, the logistic function, the rectified linear unit (ReLU), etc.). The input layer of the DNN just receives the input vectors. DNNs are trained using the backpropagation technique in a supervised learning way. Finally, DNNs are considered universal function approximators [69], and therefore they can be used for regression tasks. Moreover, as classification can be considered a special case of regression in which the target variable is categorical, DNNs can also be used for classification tasks.

Long short-term memory networks
An LSTM network [70] is an RNN architecture that copes with the vanishing gradient problem (error gradients tend to become very small, or even vanish, in very deep neural network models, which prevents the weights from changing and thus the models from being trained) by allowing gradients to back-propagate unchanged through the network (however, an LSTM can still suffer from the exploding gradient problem). A common LSTM architecture consists of a cell, which is the memory of the LSTM unit, and three gates that control the information flow inside the LSTM. In particular, the input gate controls the extent to which new values flow into the cell, the forget gate controls the extent to which a value remains in the cell, and the output gate controls the extent to which the values in the cell are used to compute the output of the LSTM. An overview of a typical LSTM unit is presented in Figure 3. A forward pass of information (i.e., of a vector of values $x_t \in \mathbb{R}^d$) through an LSTM network is described by the following equations:
$$f_t = \sigma_g(W_f x_t + U_f h_{t-1} + b_f)$$
$$i_t = \sigma_g(W_i x_t + U_i h_{t-1} + b_i)$$
$$o_t = \sigma_g(W_o x_t + U_o h_{t-1} + b_o)$$
$$c_t = f_t \circ c_{t-1} + i_t \circ \sigma_c(W_c x_t + U_c h_{t-1} + b_c)$$
$$h_t = o_t \circ \sigma_h(c_t)$$
where $x_t \in \mathbb{R}^d$ is the input vector of the LSTM network, $f_t, i_t, o_t \in \mathbb{R}^h$ are the activation vectors of the forget, input and output gates of the LSTM units, respectively, $c_t \in \mathbb{R}^h$ is the state vector of the cells of the LSTM units, and $h_t \in \mathbb{R}^h$ is the hidden state vector (or activation vector) of the LSTM units. Additionally, $W_q \in \mathbb{R}^{h \times d}$ is the weight matrix of the input connections between the input vector $x_t$ and an LSTM element $q$, where $q$ can be either the input gate $i$, the forget gate $f$, the output gate $o$ or the cell $c$. Moreover, $U_q \in \mathbb{R}^{h \times h}$ is the weight matrix of the recurrent connections between the hidden state vector $h_t$ (or, more accurately, $h_{t-1}$) and an LSTM element $q$. Finally, $b_q \in \mathbb{R}^h$ is the bias vector of an LSTM element $q$. The $\sigma$ functions are activation functions. In particular, $\sigma_g$ is the sigmoid activation function of the input, forget and output gates, $\sigma_c$ is the hyperbolic tangent activation function of the cell and $\sigma_h$ is the hyperbolic tangent activation function of the LSTM unit. The symbol $\circ$ represents the Hadamard product (or element-wise product).
The initial state vectors $c_0$ and $h_0$ are usually set equal to the zero vector $\mathbf{0} = [0, \ldots, 0]^T \in \mathbb{R}^h$. The training process of an LSTM network lies in the estimation of the values of the $W_q$ and $U_q$ matrices and the $b_q$ vectors for all $h$ units of the network, and it is usually performed using the backpropagation through time (BPTT) algorithm [71]. Unlike typical RNNs, the training process of an LSTM network does not suffer from the vanishing gradient problem, because while the error values are back-propagated to the input they remain unchanged inside the cells of the LSTM units and they do not exponentially degrade. In addition, as understood from the previous description, the cell of an LSTM unit decides what to store and what to discard using element-wise operations of sigmoids, which are differentiable and therefore suitable for backpropagation.
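The forward pass described in this section can be illustrated with a minimal NumPy sketch. It assumes the standard LSTM formulation (sigmoid gates, hyperbolic tangent cell and output activations, zero initial states). The function names, the dictionary layout of the parameters and the random initialisation are illustrative assumptions, not the implementation used in this study, which relied on TensorFlow and Keras.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, used here as the gate activation."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One forward step of an LSTM layer with h units and d inputs.

    W[q] has shape (h, d), U[q] has shape (h, h) and b[q] has shape (h,),
    for q in {"f", "i", "o", "c"} (forget gate, input gate, output gate, cell).
    """
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])      # forget gate
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])      # input gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])      # output gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde  # element-wise (Hadamard) products
    h_t = o_t * np.tanh(c_t)            # hidden state of the units
    return h_t, c_t

def lstm_forward(xs, W, U, b):
    """Run a sequence xs of shape (T, d) through the layer, starting from zero states."""
    n_units = b["f"].shape[0]
    h_t = np.zeros(n_units)
    c_t = np.zeros(n_units)
    for x_t in xs:
        h_t, c_t = lstm_step(x_t, h_t, c_t, W, U, b)
    return h_t, c_t

def init_params(d, h, seed=0):
    """Hypothetical small random initialisation, for illustration only."""
    rng = np.random.default_rng(seed)
    W = {q: rng.normal(scale=0.1, size=(h, d)) for q in "fioc"}
    U = {q: rng.normal(scale=0.1, size=(h, h)) for q in "fioc"}
    b = {q: np.zeros(h) for q in "fioc"}
    return W, U, b
```

Because the output gate lies in (0, 1) and the hyperbolic tangent in (-1, 1), every component of the hidden state stays strictly inside (-1, 1), whatever the input scale.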

Hybrid Models
As presented in Section 2.3, there are a lot of different approaches for combining methods in order to build a hybrid model with increased forecasting performance. This section provides the details of the two hybrid models used in this study, namely a data-based model that utilises NWP and a combination of the presented analytical with a data-based model.

Hybrid GBT Model: NWP-Enriched GBT Model
This model extends the GBT model described in Section 3.2.3 by fusing historical NWP data into it. As already mentioned, cloud coverage is the weather variable that predominantly affects PV power generation, and therefore cloud coverage data is utilized by this model. In particular, the historical cloud coverage data is initially organized as a set of time series, and is then transformed into a set of training samples as described in Section 3.1.3. The main difference here is the fact that the existing

AI-Corrected NWP for Enriched Analytical PV Forecast
The AI-Corrected NWP for Enriched Analytical PV Forecast (AI-PVF) model is a combination of the analytical model described in Section 3.2.1 and an error correction method based on data obtained from the PV plant. The error is divided into clear sky error and cloud coverage error, a separation routinely found in the literature [14,40]. In the context of this study, clear sky error is associated with inaccuracies of the PV parameters, such as solar angles, installation angles and PV module/inverter types. This error exists at all times and it can be isolated on clear sky days.
On the other hand, the cloud coverage error exists only on cloudy days and it essentially represents the error introduced by inaccurate weather forecasts. Weather stochasticity makes it impossible to forecast cloud coverage with satisfactorily high accuracy. By investigating the power generation data of the PV plant, it was found that the weather forecast errors follow specific patterns at each time of the day. When those patterns are taken into consideration, the accuracy of the initial weather forecast is improved locally, resulting in a more accurate prediction of the PV power output.
In both cases, the error is corrected using the extremely randomised trees regression (EXTRA trees) [72] model. EXTRA trees is a computationally efficient ensemble method that builds unpruned trees with the classical top-down process. Two characteristics distinguish it from other tree-based ensemble methods: (a) node splitting is done by choosing cut-points completely at random and (b) the algorithm uses the whole learning sample to grow the trees and not just a bootstrap replica. Concerning the feature extraction process, the PV power output derived from the analytical model along with NWP data and the solar azimuth/elevation angles are fed as features into the regression model. The actual PV power output is the model's target variable. Thus, the models essentially learn the error patterns that occur under specific weather conditions (NWP forecasts) at each time of the day (sun angles). The AI-PVF method is thoroughly analysed in [73].
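The error-correction idea can be sketched with scikit-learn's `ExtraTreesRegressor`. This is a minimal illustration under stated assumptions, not the AI-PVF implementation of [73]: the data below is synthetic, the helper names are hypothetical, a single cloud coverage column stands in for the NWP data, and the number of trees is arbitrary. The features and the target follow the description above.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def build_features(p_analytical, cloud_cover, azimuth, elevation):
    """Stack analytical output, NWP cloud coverage and sun angles into an (n, 4) matrix."""
    return np.column_stack([p_analytical, cloud_cover, azimuth, elevation])

def fit_error_corrector(X, p_actual, seed=0):
    """Fit extremely randomised trees on the actual PV power output.

    bootstrap=False (the scikit-learn default) grows every tree on the whole
    learning sample, matching characteristic (b) in the text.
    """
    model = ExtraTreesRegressor(n_estimators=50, bootstrap=False, random_state=seed)
    model.fit(X, p_actual)
    return model

# Synthetic stand-in data (not measurements from the pilot site).
rng = np.random.default_rng(0)
n = 400
elevation = rng.uniform(5, 70, n)        # solar elevation, degrees
azimuth = rng.uniform(90, 270, n)        # solar azimuth, degrees
cloud_cover = rng.uniform(0, 100, n)     # forecast cloud coverage, percent
p_analytical = 5.0 * np.sin(np.radians(elevation))  # toy analytical output, kW
p_actual = p_analytical * (1 - 0.6 * cloud_cover / 100) + rng.normal(0, 0.05, n)

X = build_features(p_analytical, cloud_cover, azimuth, elevation)
model = fit_error_corrector(X[:300], p_actual[:300])
p_corrected = model.predict(X[300:])

mae_raw = np.mean(np.abs(p_actual[300:] - p_analytical[300:]))
mae_corrected = np.mean(np.abs(p_actual[300:] - p_corrected))
```

On this toy data the corrected forecast has a lower MAE than the raw analytical output, because the regressor learns the systematic cloud-related bias that was built into the synthetic target.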

General Experimental Settings
This paper presents a comprehensive evaluation framework of different types of PV power generation forecasting models. In this section, the configuration details of this framework are provided. The main objective of the framework is to forecast the value of the PV power generation at multiple forecasting horizons using all the aforementioned models, and then compare the models in terms of forecasting accuracy. Five different forecasting horizons were evaluated, namely 1, 2, 4, 8 and 12 steps ahead. Given the 15-minute data granularity, these steps correspond to 15 min, 30 min, 1 h, 2 h and 3 h ahead of time. After the data has been transformed into a supervised-learning-compatible form (see Section 3.1.3) with p = 3 and h = {1, 2, 4, 8, 12} according to the chosen forecasting horizons, it is split into training, validation and test sets. In particular, 80% of the data samples were used for training, 10% for validation and 10% for testing. Finally, since the problem investigated is essentially a multi-step time series forecasting problem, an appropriate forecasting strategy was selected, namely the direct strategy [74,75]. In this strategy, each step is forecast independently of the others. This means that if forecasts are to be computed for k steps ahead in total, then k different models must be built. Hence, this strategy can lead to higher training times. It should be noted that, in all data-based models, only the PV power generation values were used as inputs. Cloud coverage was only utilised for splitting the original dataset into different sub-datasets according to the forecasting period (e.g., spring).
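The data preparation for the direct strategy can be sketched as follows. The sketch assumes that the supervised transformation of Section 3.1.3 is a plain sliding window over the last p values, and that the 80/10/10 split is chronological; neither detail is confirmed by the text above, and all function names are hypothetical.

```python
import numpy as np

HORIZONS = (1, 2, 4, 8, 12)  # steps ahead, i.e. 15 min to 3 h at 15-minute granularity
P_LAGS = 3                   # number of past values used as inputs (p = 3)

def make_supervised(series, p, h):
    """Return (X, y): each row of X holds p consecutive values,
    and y is the value observed h steps after the last of them."""
    series = np.asarray(series, dtype=float)
    n = len(series) - p - h + 1
    if n <= 0:
        raise ValueError("series too short for the requested p and h")
    X = np.stack([series[i:i + p] for i in range(n)])
    y = series[p + h - 1:p + h - 1 + n]
    return X, y

def chronological_split(X, y, train=0.8, val=0.1):
    """Split samples in time order into training, validation and test sets."""
    n = len(X)
    i_train = int(n * train)
    i_val = int(n * (train + val))
    return ((X[:i_train], y[:i_train]),
            (X[i_train:i_val], y[i_train:i_val]),
            (X[i_val:], y[i_val:]))

def direct_datasets(series, p=P_LAGS, horizons=HORIZONS):
    """Direct strategy: one independent dataset (and hence one model) per horizon."""
    return {h: make_supervised(series, p, h) for h in horizons}
```

With five horizons, `direct_datasets` returns five datasets, so five separate models are trained per data partition, which is the source of the higher training times mentioned above.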

Model Configuration and Hyperparameter Tuning
As mentioned above, 10% of the available data was used for validation of the models, namely for hyperparameter tuning. This process is required by the data-based and the hybrid models. Regarding the ARIMA, SVR and GBT models, the same hyperparameters have been selected for all data partitions. In particular, for all data partitions and forecasting horizons ARIMA(3, 1, 0) models were implemented (the term 'models' here refers to different instances of the ARIMA model based on the different data used for its training). Additionally, the hyperparameters of the SVR models were C = 1, degree = 3, ε = 0.1, γ = 1/(number of features) and kernel = radial basis function. Moreover, the optimal hyperparameters of the GBT model were max_depth = 8 for the maximum depth of the decision trees and n_estimators = 40 for the number of trees used by the GBT. The hyperparameter tuning process for these models was performed using the grid search method. Regarding the AI-PVF approach, which utilises EXTRA trees for the error correction process, grid search was also conducted in order to find the optimal model hyperparameters. Through this process, the best parameters found were: max_depth = 12, min_samples_split = 9 and n_estimators = 150.
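A grid search against a fixed validation set can be sketched as below for the EXTRA trees case. The candidate grid, the use of MAE as the selection criterion and the synthetic data are assumptions made for illustration; the text reports only the best values found, not the search space or the selection metric.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import ParameterGrid

def grid_search_holdout(X_train, y_train, X_val, y_val, grid, seed=0):
    """Fit one model per grid point and keep the one with the lowest validation MAE."""
    best_params, best_mae = None, np.inf
    for params in ParameterGrid(grid):
        model = ExtraTreesRegressor(random_state=seed, **params)
        model.fit(X_train, y_train)
        mae = float(np.mean(np.abs(y_val - model.predict(X_val))))
        if mae < best_mae:
            best_params, best_mae = params, mae
    return best_params, best_mae

# Hypothetical grid: it contains the reported best values plus one alternative each.
GRID = {
    "max_depth": [6, 12],
    "min_samples_split": [2, 9],
    "n_estimators": [50, 150],
}

# Synthetic data standing in for the training and validation sets.
rng = np.random.default_rng(0)
X_all = rng.uniform(0, 1, size=(250, 4))
y_all = X_all[:, 0] * (1 - 0.5 * X_all[:, 1]) + rng.normal(0, 0.02, 250)

best_params, best_mae = grid_search_holdout(
    X_all[:200], y_all[:200], X_all[200:], y_all[200:], GRID
)
```

The selected combination depends on the data, so the result on this synthetic set is not expected to reproduce the values reported above.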
In contrast with the above models, for each data partition and forecasting horizon different DNN and LSTM architectures, with different structure (e.g., number of neurons per hidden layer) and hyperparameters, were implemented. These configurations are presented in Table 1 ($N_u$ stands for number of units, where units refer to either input units or computational neurons) for the DNNs and in Table 2 for the LSTMs. These configurations were estimated using the grid search method. The PV power generation forecasting models presented in this work were implemented using well-known Python statistical, machine learning and deep learning libraries. In particular, as already mentioned, the analytical model was implemented using PVLIB. The statistical models were implemented using the statsmodels library [76], SVR and AI-PVF using the scikit-learn library [77], the GBT models (i.e., both GBT and HGBT) using the XGBoost library [68], and the neural network architectures (i.e., DNN and LSTM) using the TensorFlow [78] and Keras [79] libraries.

Forecasting Evaluation Metrics
In order to evaluate the accuracy of the presented PV power generation forecasting models, three main error metrics were used, namely the mean absolute error (MAE), the mean absolute percentage error (MAPE) and the root mean squared error (RMSE). These metrics are defined by the following equations:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$
$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$

where $y_i$ is the actual PV power generation value, $\hat{y}_i$ is the predicted value and $n$ is the total number of predictions. These metrics are widely used for evaluating the accuracy of PV power generation forecasting models. MAE and RMSE are expressed in the units of the predicted variable, in this case kilowatts (kW). On the other hand, MAPE is expressed as a percentage and so it is more easily interpretable.
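The three metrics can be computed with a few lines of NumPy, as in the sketch below. Skipping zero actual values in MAPE is an assumption made here to avoid division by zero (for instance at night); the text does not state how such samples were handled for this metric.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error, in the units of the predicted variable."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the predicted variable."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent.

    Samples whose actual value is zero are skipped (assumption, see above).
    """
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mask = y_true != 0
    return float(100.0 * np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])))

# Small worked example with made-up values (kW).
y_true = [2.0, 4.0, 5.0]
y_pred = [1.0, 5.0, 5.0]
example = (mae(y_true, y_pred), mape(y_true, y_pred), rmse(y_true, y_pred))
```

For these made-up values the absolute errors are 1, 1 and 0, so MAE is 2/3 kW, MAPE is (50% + 25% + 0%)/3 = 25% and RMSE is the square root of 2/3, about 0.816 kW.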
In addition to these well-known metrics, a new metric, namely the weighted relative squared error (WRSE), is introduced. This metric is defined by the following equation: where $y_i$ is the actual PV power generation value, $\hat{y}_i$ is the predicted value and $h$ is the forecasting horizon. WRSE expresses the relative forecasting error in terms of the magnitude of the evaluated PV power generation. It also takes into account the direction of the error and provides a uniform weighting for all errors. Finally, the metric disregards zero PV power generation values.

Results
The forecasting accuracy results of all PV power generation forecasting models studied in this work, for all forecasting horizons and data partitions, are presented in Tables 3-6 for the spring, summer, autumn and winter period, respectively. In these tables, the models demonstrating the highest performance (in terms of MAPE) in each case are highlighted with their respective performance metrics in bold letters.

Results According to Season
With regard to the spring season, in the sunny days subperiod, the AI-PVF model consistently presents the best forecasting accuracy across all forecasting horizons. On the contrary, in the cloudy days subperiod, there is no single model yielding the best performance across all forecasting horizons. In particular, the LSTM model has the best performance for 1 step ahead, the DNN model for 2, the HGBT model for 4 and the AI-PVF model for 8 and 12. It is important to highlight that, as expected [57], the forecasting errors in the cloudy days subperiod are much higher than the corresponding errors of the sunny days subperiod. For example, the MAPE value of the best performing model (i.e., AI-PVF) forecasting for 1 step ahead in the sunny days subperiod is 2.898%, while the corresponding value of the best performing model (i.e., LSTM) in the cloudy days subperiod is 22.582%. This is also demonstrated by the order of magnitude of the errors, which in the sunny days subperiod is at the level of $10^1$ at most, while in the cloudy days subperiod it is at the level of $10^2$.
Regarding the summer season, in the sunny days subperiod, the LSTM model presents the best forecasting accuracy for 1, 2 and 4 steps ahead, while the analytical model has the best accuracy for 8 and 12 steps ahead. However, this is not an outcome that can or should be generalised, because according to both the available literature (e.g., [9,38]) and the hands-on experience of the authors, the plain analytical model almost never outperforms data-based or hybrid models when tested over an extended period of time. The reason for this result may be the small size of the testing set, which results from the limit of 10% maximum cloud coverage and from the size of the forecasting horizons, i.e., the longer the forecasting horizon, the smaller the testing set becomes. Thus, the authors cannot support the statement that the analytical model outperforms any of the other forecasting models (and more importantly, the LSTM and the AI-PVF models). Therefore, this particular finding requires additional experimentation in order to reach a definite conclusion. In the cloudy days subperiod, the AI-PVF model presents the best forecasting accuracy across all forecasting horizons. Notably, in some cases, the forecasting error of the AI-PVF model is one or two orders of magnitude smaller compared to the other models (e.g., for 8 steps ahead). Additionally, as in the case of the spring period, the forecasting errors in the cloudy days subperiod are much higher than the corresponding errors of the sunny days subperiod.
With respect to the autumn season, in the sunny days subperiod, the analytical model yields the best forecasting accuracy for 1 and 2 steps ahead, the GBT model for 4 steps ahead and the AI-PVF model for 8 and 12 steps ahead. On the other hand, in the cloudy days subperiod, the persistence model yields the best forecasting accuracy for 1 and 2 steps ahead and the analytical model for 4, 8 and 12 steps ahead. The fact that the naive persistence model outperforms all the other advanced models for two forecasting horizons is indicative of the inability of the advanced models to cope with the variations in the PV power generation variable introduced by the harsh autumn weather. Additionally, the analytical model that incorporates weather information (in the form of NWP predictions) presents the best forecasting accuracy for the forecasting scenarios of 4, 8 and 12 steps ahead. In this season, the days in the sunny days subperiod when the cloud coverage is below 10% are very few, which leads to small available datasets for training the data-based models and the HGBT model. This fact can explain the very low forecasting performance of both the data-based models and the HGBT model, and especially that of the traditional statistical model ARIMA and the traditional machine learning model SVR. The AI-PVF model is more robust in this case due to its error-correction capabilities. On the other hand, the high cloud coverage values during the autumn's cloudy days lead to an intermittent pattern for the PV power generation variable, which cannot easily be captured even by the complex nonlinear DL models. The integration of the PV installation's characteristics along with weather information in the DL models may help them to better capture the complex distribution of the PV power generation variable during the autumn's cloudy days.
Finally, in the sunny days subperiod of the winter season, the GBT model presents the best forecasting accuracy for 1 step ahead, the HGBT model for 2 steps ahead and the analytical model for 4, 8 and 12 steps ahead. On the other hand, in the cloudy days subperiod, the GBT model is still the best performing model for 1-step ahead forecasting, while the analytical model is the best for the other forecasting horizons. In this season, the problems reported for the autumn season have been greatly amplified. In particular, for the sunny days subperiod, the forecasting error (in terms of MAPE) of most of the models becomes greater than 50% after the forecasting horizon of 1 step ahead. Only the analytical model and the AI-PVF model, with its error-correction capabilities, maintain their performance at relatively acceptable levels. The problem escalates in the cloudy days subperiod, when most of the models yield a forecasting error greater than 100% after the 1-step ahead forecasting scenario. This finding is consistent across all data-based models (statistical, ML or DL), which supports the argument that, in cases with a highly intermittent pattern of the PV power generation variable, the data-based models that utilize only past values of the variable in order to predict its future values cannot be considered accurate and cannot be used for forecasting tasks in real RES systems. One way to mitigate this behaviour is to integrate the PV installation's characteristics along with weather information into the models, as in the case of the analytical model.
In order to examine whether the forecasting accuracy results of the several models differ from each other in a statistically significant way, we applied the Kruskal-Wallis statistical test to the residuals of the best performing models from each model category (i.e., analytical, data-based and hybrid) for all data partitions and forecasting horizons. This non-parametric test is used for assessing whether two or more independent samples of equal or different sample sizes are drawn from the same distribution (null hypothesis). The results indicate that, in the majority of cases, the null hypothesis can be rejected. Hence, the forecasting accuracy of the best performing model in each case is different from the accuracy of the best performing models of the remaining categories in a statistically significant way. Such a result is illustrated in Table 7, which contains the results of the Kruskal-Wallis statistical test for the best performing models for the sunny days subperiod of the summer season and the cloudy days subperiod of the winter season for all forecasting horizons. In all cases apart from one, the null hypothesis is rejected (highlighted in green). The best performing model in each case is highlighted in bold letters.
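A comparison of residuals with the Kruskal-Wallis test can be sketched using `scipy.stats.kruskal`. The significance level of 0.05 and the synthetic residuals are assumptions for illustration; the text does not state the level used in this study.

```python
import numpy as np
from scipy.stats import kruskal

def residuals_differ(*residual_groups, alpha=0.05):
    """Kruskal-Wallis H-test on two or more independent groups of residuals.

    Returns the H statistic, the p-value and whether the null hypothesis
    (all groups come from the same distribution) is rejected at level alpha.
    """
    statistic, p_value = kruskal(*residual_groups)
    return float(statistic), float(p_value), bool(p_value < alpha)

# Synthetic residuals standing in for one model per category (kW).
rng = np.random.default_rng(0)
res_analytical = rng.normal(0.0, 0.05, 200)
res_data_based = rng.normal(0.3, 0.05, 200)
res_hybrid = rng.normal(-0.3, 0.05, 200)

stat, p_value, reject = residuals_differ(res_analytical, res_data_based, res_hybrid)
```

The groups may have different sizes, which matters here because the test sets shrink as the forecasting horizon grows.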

Generalized Results
Moving forward to generalize some of the aforementioned findings, it is observed that the AI-PVF model consistently outperforms all others in the spring sunny days and summer cloudy days subperiods. Although it is not very clear why this occurs in these two particular scenarios, there are similarities that could potentially lead to a reasonable justification. Both subperiods have similar ambient conditions in terms of external temperature and humidity. In that regard, and taking into account the high dependency of PV power generation on temperature, it is reasonable to assume that the hybrid model, which takes both physical characteristics and historical performance into account, presents the best results in close to optimal temperature conditions (neither too hot nor too cold).
Another interesting finding is that the data-based models outperform the analytical and the hybrid models in very short-term forecasting scenarios, namely for 1 step ahead. From the second step ahead, the integration of weather information into the models (i.e., hybrid approaches) seems to improve the overall performance. As reported in the relevant literature [44,49], the data-based models are very precise in very short-term PV power generation forecasting scenarios.
Regarding the long-term forecasting scenarios (i.e., 12 steps or 3 hours ahead), no generalized finding can be fully justified, given the incidental nature of the best performance of the analytical model in the summer sunny days subperiod for 8 and 12 steps ahead forecasts. Nonetheless, given the consistent best performance of the AI-PVF model in the similar spring sunny days subperiod and the relatively similar performance during the summer sunny days, the authors support, with reservation, that the AI-PVF model exhibits overall good behaviour. However, this does not apply to PV installations and systems that have existed for long periods of time, as it does not take into account the material degradation and the physical corruption of the various physical components. As such, for old PV installations this finding might not be applicable. However, the AI-PVF model takes into account this factor and also seems to have good performance in long-term scenarios.
An almost expected outcome is that out of all data-based models, the LSTM architectures present the best and most consistent forecasting results for forecasting horizons of 2 steps-ahead onwards in all examined scenarios. This result can be attributed to the ability of the LSTM models to capture the complex nonlinear long-term dependencies in the PV power generation time series, and exploit them in order to produce accurate predictions [48,62]. What is not so expected concerning the behaviour of the data-based models is that the integration of weather data does not improve their accuracy [42]. For example, the GBT and the HGBT models present quite close performance across all forecasting horizons and for most of the data partitions.
Another interesting finding is the abrupt decrease of the forecasting accuracy in the sunny days subperiod of the autumn season. In contrast, in the cloudy days subperiod there is a stable high forecasting error. This result is frequently discussed in the literature as a barrier towards accurate PV power generation forecasting [39,44,56,57]. Additionally, another interesting outcome refers to the winter season. In particular, for all forecasting horizons and models, the forecasting errors in the cloudy days subperiod are approximately twice the forecasting errors of the sunny days subperiod.
From the perspective of the several weather-based data partitions, a first finding is that the models present smaller forecasting errors on cloudy days in summer than on cloudy days in spring across all forecasting horizons. This behaviour can be justified by the fact that, for the examined test data, cloudy days in spring are more frequent and have higher cloud coverage than the respective ones in summer. The average cloud coverage and the ratio of days with more than 10% cloud coverage are 48.3% and 1.73% in spring, whereas in summer they are 43.2% and 1.23%, respectively.
Another interesting result is that on spring cloudy days and on summer clear sky days a different model has the best performance for each forecasting horizon. This is a rather difficult finding to explain. However, it seems that there are some similar patterns. Up to 2 steps ahead, it is evident that the data-based models present the best results. However, from 4 steps onwards, it appears that hybrid models (i.e., models that take into account weather forecasts) start outperforming the data-based models. Finally, from 8 steps ahead and beyond, the analytical model and the hybrid ones present better results compared to the data-based models.
Out of all the scenarios examined, the best forecasting accuracy achieved per metric is observed on sunny days (expected), but not always in the same season (not expected), as shown below: Based on the above findings, it would appear that the examined models provide more accurate results when predicting long-term rather than short-term time horizons. When examining the cloudy days, however, the results are slightly different: It is evident that for cloudy days the best performance in all metrics is for long-term predictions and through the AI-PVF model. Interestingly enough, there is an opposite pattern between clear sky days and cloudy days. In the former, the absolute error metrics (i.e., MAE and RMSE) have their best results at fewer than 12 steps ahead, with the relative ones (i.e., MAPE and WRSE) having their best results at 12 steps ahead. The exact opposite is observed for the cloudy days. Optimal results are obtained for the relative metrics at 8 steps ahead, whereas for the absolute ones at 12 steps ahead.
Throughout the present study, the need for further investigation of how weather data affect the performance of PV power generation forecasting models is evident. Although errors for clear sky days remain quite low (i.e., for all examined horizons and models, the maximum average errors are below 0.17 kWh, 10.5%, 0.2 kWh and 1% for MAE, MAPE, RMSE and WRSE, respectively), this is not the case for cloudy days (i.e., for all examined horizons and models, the maximum average errors are below 0.35 kWh, 199%, 0.4 kWh and 52% for MAE, MAPE, RMSE and WRSE, respectively). This effect of the weather conditions on the accuracy of the PV power generation forecasting models is highlighted in the boxplots presented in Figures 4-7, in which the distributions of the residuals of the several forecasting models for the two most extreme cases in terms of weather conditions are presented. In particular, Figure 4 presents the residuals' distributions of all forecasting models in the sunny days subperiod of the summer season for 1 step ahead, while Figure 5 presents the corresponding residuals' distributions of the models for the same forecasting horizon in the cloudy days subperiod of the winter season. It is evident, from both the height of the boxes and the range of the outliers, that most of the models face difficulties when trying to predict the PV power generation under severe weather conditions. The same result applies for larger forecasting horizons, as shown in Figures 6 and 7, which present the residuals' distributions for 12 steps ahead in the sunny days subperiod of the summer season and the cloudy days subperiod of the winter season, respectively. Hence, it is apparent that there is a great need for more accurate models under diverse weather conditions. Hybrid or DL models that integrate both the PV installation's characteristics and NWP may hold the key for accurate and generic PV power generation forecasting.
Finally, the results provided by different evaluation metrics are not consistent. There are cases where MAE and RMSE are reduced, while MAPE and WRSE are increased, and vice versa. This highlights the need to identify the exact evaluation metric under which such studies need to be performed in order to present meaningful and comparable results. The most troubling part is that even though more consistent results across all metrics would be expected in clear sky conditions, the most ambiguous ones have instead been observed there. This could be due to the smaller errors observed compared to those of cloudy days. In particular, RMSE seems to be the metric that deviates the most from the other three, which suggests that this metric may not be suitable for such evaluation frameworks.

Conclusions
This paper presents a comprehensive evaluation framework for the comparison of different PV power generation forecasting models under the same forecasting conditions. In particular, a dataset has been assembled, containing PV power generation values from a real-world PV installation, along with weather data gathered from a well-known online weather data aggregation service. The experiments have specific characteristics in terms of the objectives, forecasting horizons, model configuration processes and evaluation metrics. More importantly, the authors designed and implemented a set of 9 PV power generation forecasting models from the three different categories identified in the relevant literature, namely analytical, data-based and hybrid. Specifically, one analytical, six data-based and two hybrid models were designed, implemented and evaluated. The extracted findings are considered useful both for researchers who design new PV power generation forecasting models and for managers of PV installations who want to employ the best forecasting models for each situation in order to optimize the overall power generation and delivery pipeline. Future directions of our research include the evaluation of the models on bigger datasets (e.g., from larger PV installations), the design and implementation of new forecasting models, and the integration of the implemented models into more generic PV power management pipelines.

Conflicts of Interest:
The authors declare no conflict of interest.