A Comparative Study of Machine Learning-Based Methods for Global Horizontal Irradiance Forecasting

: The proliferation of photovoltaic (PV) power generation in power distribution grids induces increasing safety and service quality concerns for grid operators. The inherent variability, essentially due to meteorological conditions, of PV power generation affects the power grid reliability. In order to develop efﬁcient monitoring and control schemes for distribution grids, reliable forecasting of the solar resource at several time horizons that are related to regulation, scheduling, dispatching, and unit commitment, is necessary. PV power generation forecasting can result from forecasting global horizontal irradiance (GHI), which is the total amount of shortwave radiation received from above by a surface horizontal to the ground. A comparative study of machine learning methods is given in this paper, with a focus on the most widely used: Gaussian process regression (GPR), support vector regression (SVR), and artiﬁcial neural networks (ANN). Two years of GHI data with a time step of 10min are used to train the models and forecast GHI at varying time horizons, ranging from 10min to 4h. Persistence on the clear-sky index, also known as scaled persistence model, is included in this paper as a reference model. Three criteria are used for in-depth performance estimation: normalized root mean square error (nRMSE), dynamic mean absolute error (DMAE) and coverage width-based criterion (CWC). Results conﬁrm that machine learning-based methods outperform the scaled persistence model. The best-performing machine learning-based methods included in this comparative study are the long short-term memory (LSTM) neural network and the GPR model using a rational quadratic kernel with automatic relevance determination.


Introduction
In the past few years, the higher penetration of renewable energy sources, in particular solar photovoltaics, into power grids, has brought new challenges for grid operators [1][2][3][4][5]. As electricity is not easy to store, supply and demand have to be balanced at all times by grid operators. Nonetheless, due to the intermittent nature of the solar resource, the deployment of photovoltaic (PV) power generation makes the power grid balance more complex to ensure using standard tools [6]. Indeed, the proliferation of PV power generation in power distribution grids brings out constraints, in particular voltage constraints, mainly observed on the medium-voltage power distribution grid, on the low-voltage one. Ergo, evolution of the power distribution grid and smart management tools are necessary to alleviate these constraints. It seems necessary to develop new tools that must help to improve the grid observability and management to go along with the grid's evolution. As a result, the development of predictive management strategies is a stepping stone to more efficient real-time monitoring and optimization of grid operation [6][7][8]. Therefore, tools that allow argued that artificial intelligence-based techniques without any specific preprocessing of data outperform classical approaches with preprocessed data. Finally, the development of models free from using clear-sky models or other preprocessing steps implies that all errors come solely from the forecasting method. As a result, in this paper, GHI data without any specific preprocessing step will be used.
Machine learning methods have been increasingly used in recent years. In [31], the authors applied deep recurrent neural networks for solar irradiance forecasting and showed that these networks outperform SVR models and feedforward neural networks. In [23], long short-term memory (LSTM) neural networks were used for multi-step ahead forecasting of GHI. In [32], the authors developed a hybrid model using an autoregressive model and an ANN model for forecasting hourly solar radiation in the Mediterranean area. One-hour ahead solar irradiance were predicted using support vector machines (SVMs) in [33]. In [34], a deep convolutional neural network (CNN) model has been developed for hourly GHI forecasting based only on sky images without numerical measurements and extra feature engineering. A hybrid CNN with a LSTM neural network for forecasting halfhourly solar radiation has been proposed in [35]. This hybrid model has been compared to other deep learning models and results show that the hybrid model outperformed its counterparts. The potential of GPR for GHI forecasting has been investigated in [24,25]. In [24], the authors have made a comparative study of online GPR and online sparse GPR models based on simple kernels or combined kernels defined as sums or products of simple kernels and the results have shown the superiority of quasiperiodic kernels-based GPR models over the classic persistence model as well as simple kernels-based GPR models.
Based on their proven good performance, popularity and potential in providing accurate forecasts, it yields that machine learning methods such as ANN, SVR and GPR are well-suited for GHI forecasting. The present paper focuses on the development and comparison of intrahour and intraday machine learning-based GHI forecasting models. Even though several such comparative analyses exist in the literature (Table 1 offers a comparison between the work presented in this paper and recent comparative studies), several questions still remain to be answered. • Most studies use databases having at least a 1-hour time step, which leads to the impossibility of intrahour forecasting and to significant simplification of GHI dynamics, as illustrated in Figure 1.

•
The methods themselves are not always optimized and used to their fullest extent. For example, when using GPR, the default kernel (i.e., the squared exponential) is usually used [13,36]; however, a kernel tailored to the application at hand has a significant influence on results [24]. • Regarding input data: some studies use only endogenous data, while others use additional data, which prevents from making fair comparisons between methods. • Some authors forecast GHI directly, others the clear-sky index (using clear-sky models as pre-and postprocessing steps) or even PV power generation.
As a result, despite the extent of research on GHI forecasting in the scientific literature, a thorough comparative study of machine learning-based methods using endogenous data only, without any preprocessing step, for intrahour GHI forecasting is, to the best of the authors' knowledge, inexistent.  The main contributions of the present paper are threefold.
• The models are developed using a two-year GHI database with a 10 min time step. As can be seen in Table 1, in previous machine learning studies the time step is usually 1 h. However, such a time step leads to significant simplification of GHI dynamics: as can be noticed in Figure 1, GHI data sampled with 10 min time step exhibit more fluctuations and are thus more difficult to forecast. • Contrary to developing a specific model for each forecast horizon, as shown in some studies in the literature [13,28], we made the choice of multi-horizon forecasting models. Developing a specific model for each forecast horizon can be computationally demanding when many horizons are considered and it would be more practical to use a multi-horizon forecasting model when trying to run the algorithms in situ to produce real-time forecasts at various horizons, especially when intrahour forecasts are needed. Therefore, in this paper, the models are developed for multi-step ahead GHI forecasting and once the training phase is over, the models are used to forecast GHI for all horizons. • Besides, many authors generally choose classical performance criteria (nRMSE, MAE, MBE, MAPE) for their models' evaluation. In the present paper, two criteria are used in addition to the nRMSE: DMAE, that accounts for temporal distortion error and absolute magnitude error simultaneously; and CWC, that assesses the quality of prediction intervals. These criteria provide more detailed and comprehensive information about the models' performance, and allow an in-depth analysis of their forecasts.
In this paper, the same data are used for each model training and tests are performed on the same dataset to make a fair comparison and a thorough analysis of the results. When using GPR, kernels with automatic relevance determination (ARD), such as the squared exponential kernel with ARD (k SE-ARD ) and the rational quadratic kernel with ARD (k RQ-ARD ), are chosen to account for the relevance of each input dimension in the underlying function modelling. It is a good option, when using GPR for GHI forecasting, as the underlying function has a multi-dimensional input variable, to use ARD kernels that implicitly determine the relevance of each input dimension [37]. That is why we have decided for such kernels. Nonetheless, the flexibility of ARD kernels means that they can be relatively slow to learn and, as a result, authors generally choose simple kernels with isotropic correlation length parameter [13]. The ANN models developed in this paper are based on MLP (multilayer perceptron) and LSTM neural networks. These artificial neural networks are widely used for time series forecasting.
The rest of the paper is organized as follows: in Section 2, the data used to develop and validate the models included in the comparative study are described. Section 3 provides a description of the scaled persistence model and the machine learning methods (GPR, SVR and ANN) used to forecast GHI and presents the models' structure. The forecasting results as well as the criteria used to assess the models' performance (i.e., forecasting accuracy) are presented and discussed in Section 4. The paper ends with a conclusion.

Data Description
The dataset is derived from measurements taken at PROMES-CNRS laboratory, located in Font-Romeu-Odeillo-Via (southern France) in the Occitania region. Located at high altitude, the site is characterized by a relatively dry mountain climate, with hot summers and icy winters. Besides, there is also occasional severe thunderstorms that fall on the region. GHI data are collected using a pyranometer KIPP and ZONEN CM5.
The dataset used in this paper covered the years 2018 and 2019. The data were sampled with a time step ∆t = 10 min, resulting in around 26,600 points per year (after nights were removed). Data of year 2018 were used for training the models and 2019's data were used for the test phase. The aim behind such a partition was to train the models with data providing exhaustive information (this enabled the models to learn all possible patterns in one-year GHI data) and to assess their ability to provide GHI forecasts in all atmospheric conditions, whatever the season. The min-max normalization was used for data preprocessing to facilitate the learning process of the models. Figure 2 depicts a map of data covering the two years. Additionally, it can be noted that there were some missing values in the database, amounting to 0.64% of the whole database. As the missing values represented an insignificant part of the data, there was no need for a sophisticated way to handle them. They were removed from the dataset. Moreover, the models trained with data after the removal of missing values were often robust.

Forecasting Methods Included in the Comparative Study
This section presents the scaled persistence model and the three different machine learning-based methods included in the comparative study: Gaussian process regression (GPR), support vector regression (SVR) and artificial neural networks (ANNs). For the machine learning-based methods' description, we considered a training set D = {(x i , y i ), 1 i n}, where x i ∈ R D is a vector of past GHI values, y i ∈ R is the corresponding onestep ahead GHI forecasting value and n is the number of training samples. Let us aggregate all input vectors x i in a matrix X ∈ R n×D and note y ∈ R n the vector of corresponding outputs y i . The models were implemented on a machine having Intel ® Xeon ® CPU E7-4890 v2@2.80 GHz as the CPU. The GPR models were implemented using the Gaussian Processes for Machine Learning Matlab Toolbox [38], the LS-SVR model was implemented using the LS-SVM Matlab Toolbox [39] and the ANN models were implemented using the Deep Leaning Matlab Toolbox.
This paper focuses on the multi-step ahead forecasting of GHI, which was achieved by iterating one-step ahead forecasting: this strategy involved using the estimate of the output of the current forecasting as well as previous outputs (up to the lag D) as the input to forecast GHI at the next time step and repeating this procedure until the forecast at the desired forecast horizon was obtained. During the training phase, each sample was used. Throughout the testing phase, the models' parameters were updated each time step to forecast the one-step ahead GHI and successive one-step ahead forecasts were performed to reach the desired forecast horizon.

Scaled Persistence Model
The scaled persistence model, used in this study as a reference model, is given by: where GHI is the measured global horizontal irradiance, GHI clsk is the clear-sky model's output, ∆t refers to the time step (10 min) and k∆t is the forecast horizon. Clear-sky models estimate ground level irradiance under a clear sky as a function of the solar elevation angle, site altitude, aerosol concentration, water vapour and various atmospheric parameters. The scientific literature exhibits a wide variety of clear-sky models that are reviewed in [40]. The approach proposed in [41] for clear-sky DNI models, which combines Ineichen and Perez's clear-sky model [42] with a persistence of atmospheric turbidity, is modified to obtain a clear-sky GHI model. The clear-sky global horizontal irradiance is given by: where I 0 is the extraterrestrial irradiance, z is the solar zenith angle, AM is the optical air mass, T L is the Linke turbidity coefficient and f h1 , f h2 , c g1 , and c g2 are parameters.
The Linke turbidity coefficient is defined as the number of dry atmospheres necessary to produce the attenuation of the extraterrestrial irradiance produced by the atmosphere.
The parameters f h1 , f h2 , c g1 and c g2 are respectively given by: (3) where h is the elevation. The optical air mass (AM) is calculated by Kasten and Young's formula [43]:

Gaussian Processes
A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution [38]. As a Gaussian distribution is determined by a mean vector and a covariance matrix, in a similar manner a GP is also fully defined by a mean function and a covariance function. A function f distributed as a Gaussian process is denoted as f (x) ∼ GP (µ(x), k(x, x )), where x and x are arbitrary input variables, ) T is the covariance function, also called kernel.

Kernels Used in This Study
Kernels contain the assumptions about the underlying function to be learned and define the similarity between the function values at different input variables x and x [38,44]. It yields that the choice of kernel has a crucial impact on the performance of a GPR model [24]. Any function can be a covariance function as long as the covariance matrix obtained is positive semi-definite.
The kernel is a crucial part of GPR models and should be carefully chosen. For GHI forecasting by GPR, many authors have chosen kernels with isotropic correlation length parameter [13,36]. However, as the underlying function has a multi-dimensional input variable and doesn't have the same variation level along each input dimension, it would be better to use kernels that define a specific and suitable length-scale for each input dimension. That is why Automatic Relevance Determination (ARD) kernels, which implicitly determine the relevance of each input dimension, are used to model functions that have multi-dimensional input [37].
Furthermore, additive structures can be developed for ARD kernels to capture the correlations between the input dimensions in order to improve the generalization ability of models for long-term horizons [45]. ARD kernels used in this work are described below. It is worth noting that several ARD kernels have been tested to identify the best-performing ones. • The SE-ARD kernel k SE-ARD , expressed as a product of squared exponential (SE) kernels over input dimensions each having a different length-scale, is given by: where σ > 0 is the amplitude and 1 , 2 , . . . , D > 0 are the length-scales which control the function's variation along each input dimension. This kernel is a covariance function commonly chosen as default in Gaussian processes applications, because it has relatively few and easy-interpretable parameters to estimate. Moreover, it can be seen as an universal kernel, capable of learning any continuous function given enough data, under some conditions that are investigated in [46]. • The RQ-ARD (rational quadratic) kernel k RQ-ARD is given by: where α > 0 characterizes the relative weighting of large-scale and small-scale variations. The RQ kernel can be seen as an infinite sum of SE kernels with different characteristic length-scales [38] and models functions that vary smoothly across many length-scales. Analogously, the RQ-ARD kernel, which allows the modelling of functions that exhibit multi-scale variations along each input dimension, can be seen as an infinite sum of SE-ARD kernels.

Gaussian Process Regression
Consider the standard regression model: where f is the regression function and ε ∼ N (0, σ 2 ε ) is an independent identically distributed Gaussian noise.
GPR is a Bayesian non-parametric approach [38] towards regression problems, which consists in approximating f (x) ∼ GP (µ(x), k(x, x )) using a training set of n observations D. Consider X * a matrix of test input vectors. Assuming f (x) ∼ GP (µ(x), k(x, x )) and ε ∼ N (0, σ 2 ε ), the training observations y and the function values f * for the test inputs X * follow a joint Gaussian distribution. Thus, the posterior predictive density is also Gaussian [38]: where the mean prediction µ * and the predictive variance σ 2 * are respectively given by: and: In the event of a single test input vector x * , the mean prediction (Equation (12)) becomes: where Thus the mean prediction µ * (Equation (14)) can be written as a linear combination of kernel functions, each one centred on a training point: where α = (K + σ 2 ε I) −1 (y − µ(X)). The coefficients α i , also called parameters, are updated each time a new observation is added to the observation set contrary to the parameters of the kernel, referred to as hyperparameters, which do not change once training is over. Graphical views of GHI forecasting by the k SE-ARD -based and k RQ-ARD -based GPR models are depicted by Figures 3 and 4.

Training a GPR Model
The hyperparameters θ of a GPR model, which group the parameters involved in the kernel and the noise variance, have to be inferred from the training data. In practice, they are commonly estimated by maximizing the log marginal likelihood given by: Unfortunately, the log marginal likelihood function L(θ) is usually non-convex with respect to the hyperparameters. The problem space can have many local optima and the optimized hyperparameters, which depend on their initialization, may not be the global optimum [37,47]. A classical approach to overcome this drawback is to use various starting points randomly selected from a suitable prior distribution. A study on the impact of prior distributions of the initial hyperparameters in GPR models has been conducted in [48] and the authors have arrived at the conclusion that simple priors such as the uniform distribution in an appropriate range may be sufficient in GPR modelling in terms of predictability.
For the initialization of the hyperparameters θ, the following choices have been made: the hyperparameter σ 2 was chosen to be equal to the variance of the training data and the remaining hyperparameters are randomly drawn from an uniform distribution θ i ∼ Uniform(0, 1).

Support Vector Machines
Support vector machines (SVM), which have been developed in the past few decades [49][50][51][52][53], are learning systems that map inputs into a high dimensional feature space where linear separation or regression becomes much easier. The SVM algorithm, which is a nonlinear generalization of the method developed in [54], is based on a learning algorithm from optimization theory grounded in the statistical learning theory framework [49,55]. Firstly, SVM were developed for pattern recognition [52], but they also exhibit good performance in time series modelling and have been successfully applied to regression problems and time series forecasting [51,53].

Support Vector Regression
Support vector regression (SVR) maps the input X into a high dimensional feature space F through a nonlinear mapping function φ in order to do a linear regression in the feature space. We construct a linear regression function f in the feature space F: where w and b are obtained by solving a convex optimization problem.
In [49], SVR is first described as ε-SV regression, which aims to find function f that is as flat as possible and concomitantly has at most ε error from the targets y i for all training data. This problem can be formulated as a convex optimization problem [53]. Furthermore, slack variables ξ i , ξ * i are introduced to deal with unfeasible constraints of the optimization problem [49].
Equations (18) and (19) describe the convex optimization problem as follows: subject to, ∀i: where C > 0 determines the trade-off between the flatness of f and the amount up to which deviations larger than ε are tolerated. It is shown in [53] that the optimization problem can be solved more easily in its dual formulation using Lagrange multipliers. The dual form of this problem is: subject to, ∀i: where α i , α * i are Lagrange multipliers and k is a kernel function, defined as: The weight w can be written as: Thus the regression function is given by: The learning algorithms of support vector regression need to solve the optimization problem (Equations (18) and (19)), which becomes more complex and requires much computing time when a large dataset is considered. In order to reduce the algorithmic complexity, an equivalent least squares support vector regression (LS-SVR) is introduced [56]. In LS-SVR, the inequality constrains are replaced by equality ones, significantly reducing computation time.

Least Squares Support Vector Regression
The optimization problem for LS-SVR can be described as: where J is the loss function and γ is the adjustable constant, subject to, ∀i: where φ is the nonlinear mapping function in kernel space, b is the bias term and e i is the error variable. The Lagrangian function of the optimization problem is: where α i is Lagrange multiplier. The partial derivatives of L are: After eliminating w and e i from Equation (28), we can write the problem in matrix form: where: Then the function estimation of LS-SVR is: Among the kernels that can be used in support vector regression (linear, polynomial and radial basis function kernels), we have decided for the radial basis function (RBF) kernel to model and forecast GHI ( Figure 5). This kernel, which is more flexible than the linear and polynomial kernels, is capable of representing complex relationship between data and is well-suited for nonlinear modelling. The RBF kernel is given by: where δ is the length-scale. Herein, δ and γ are estimated using the gridsearch method.

Artificial Neural Networks
Artificial neural networks are nonlinear data-driven self-adaptive approaches widely used for time series forecasting [57], in particular for solar resource forecasting [13,26,58,59]. Indeed, ANN, which are inspired by the neural structure of the human brain, are powerful tools for regression. They can process problems involving nonlinear and complex data even if these data are imprecise and noisy. An artificial neural network is fully defined by its architecture (or topology), its parameters (named weights, or synaptic weights, and biases) and the algorithm used for training.
While feedforward neural networks are the most widely used for time series forecasting tasks, recurrent neural networks are also extensively used now due to their ability to learn dependencies between time steps of data [60]. Convolutional neural networks, often used for image classification and computer vision problems, have the ability to learn and automatically extract features from raw input data [60]. Thus, they can also be applied to time-series forecasting tasks. The capabilities of these ANN types can be combined to build hybrid models, such as CNN associated with long short-term memory neural networks that seek to harness the ability of each model type [35].
In the present work, a multilayer perceptron, i.e., a feedforward neural network, and a LSTM neural network will be used to forecast GHI. Both networks are described in the sequel. MLP and LSTM neural network have been broadly used for time-series forecasting and have proven good performances [57,61]. Despite their enhanced feature learning ability, convolutional neural networks cannot learn temporal dependencies in data. Nevertheless, the use of CNN and hybrid models seems to be promising in the application of ANN to time-series forecasting. As this paper doesn't aim to provide an exhaustive study about ANN, CNN and hybrid models are not considered for GHI forecasting.

Multilayer Perceptron
The multilayer perceptron is a feedforward artificial neural network historically used for time series forecasting. In a MLP with hidden layers, there is no feedback and the information flows through connecting pathways, in only one direction, from the input layer to the output layer. For the input layer, h 0 (x) = x, where x is an input vector. For the network's hidden layers, with L = 1, . . . , : and: where b L is the bias vector of layer L, W L is the weight matrix of layer L and φ is the hidden neurons' activation function. The sigmoid function was historically the most widely used activation function in hidden layers since it is differentiable and enables to keep values in the interval [0, 1]. However, using this function is problematic since its gradient is very close to 0 when |x| is not close to 0. Because this can cause trouble during training, in particular with deep architectures, the sigmoid function was supplanted by the rectified linear unit (or ReLU). As an interesting feature, the ReLU function, which is not differentiable in 0, has a sparsification effect. It is used in our network. So: For the network's output layer, with L = + 1: and: where b +1 is the bias vector of layer + 1, W +1 is the weight matrix of layer + 1 and ψ is the output neurons' activation function (which is linear). Figure 6 shows a graphical view of the MLP model's architecture for GHI forecasting.

Long Short-Term Memory
LSTM networks, introduced by [62], are recurrent neural networks developed to overcome the gradient vanishing problem that occurs when training traditional recurrent neural networks [63,64]. LSTM networks are capable of learning sequences of observations. This may make them neural networks well suited for forecasting. Initially, an LSTM unit was composed of a cell, an input gate, and an output gate. Later on, the forget gate has been introduced to allow the LSTM to reset its own state [65]. Let us denote by i, f , g, o the input gate, forget gate, input node, and output gate respectively. These gates and nodes are shortly described below. The input node, the input gate and the forget gate are respectively defined as follows: where x k is the current input vector, h k−1 is the hidden state at the previous time step, W gx , W ix , W f x are input weights, W gh , W ih , W f h are recurrent weights, b g , b i , b f are biases, σ is the sigmoid function, and φ is the hyperbolic tangent function. At the heart of each memory cell, there is a node with linear activation. This internal state has a self-connected recurrent edge, often called the constant error carousel, with the following unit weight: where is the Hadamard product. The output gate is described by: and: Figure 7 shows a graphical view of the LSTM model's architecture for GHI forecasting.

Training Algorithm
Adam (for adaptive moment estimation), a well-known adaptive learning rate optimization algorithm [66], has been used to train both the MLP and LSTM neural networks. Adam performs first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. It is computationally efficient, has very little memory requirements and is well suited for large problems in terms of data or parameters. It is also well suited for problems with very noisy or sparse gradients. Finally, the hyperparameters have intuitive interpretations and tuning is rather simple [66].
Dropout [67] works by probabilistically removing, or "dropping out", inputs to a layer, which may be input variables in the data sample or activations from a previous layer. It has the effect of simulating a large number of networks with different topologies and, in turn, making neurons generally more robust to the inputs. Such a mechanism improves the generalization ability of deep artificial neural networks significantly, even though its computational cost is low, and reduces overfitting (which leads to improved generalization ability). That is why dropout is nowadays the most popular regularization method. Its main drawback is that training is slowed down and, usually, the number of epochs is increased.
For the MLP and LSTM models, we tried our best to optimize the results, and after testing many architectures we obtained the topologies depicted in Figures 6 and 7.

Results
This section presents the results and is divided into two parts. The first part presents the criteria used to assess the models' performance: the normalized root mean square error, the dynamic mean absolute error and the coverage width-based criterion. A discussion about the results based on the obtained criteria values and examples of forecasts shown in Figures 8-10 is conducted in the second part.

Performance Criteria
The different models developed in this paper were compared using the popular root mean square error normalized by the mean of data (nRMSE): where y * ∈ R n * ×1 is the test data andŷ * ∈ R n * ×1 is the forecast.
The nRMSE is a well-known criterion in the scientific literature and assesses the error between the forecasts and test data without giving an adequate quantification of the failures due to mismatches in the time index. Indeed, this temporal error can lead to the misestimation of some events, such as the production peak or ramps. That is why the dynamic mean absolute error (DMAE) [68], which accounts for temporal error (misalignments) and absolute magnitude error simultaneously, is also used to evaluate the models' performance. The DMAE error is a trade-off between the temporal distorsion index and the mean absolute error. It involves two parameters: λ that determines the weight of the MAE for the considered temporal distortion and c that controls the allowed temporal distorsion. Herein, λ = 0.1 and c = 5%. For more details, the interested reader is referred to [68].
As the confidence intervals are calculated and displayed for all models, the quality of prediction intervals is assessed. Criteria such as the prediction interval coverage probability (PICP) or the prediction interval normalized average width (PINAW) can be used for the assessment of prediction intervals [69]. The PICP evaluates whether the target values lie within the constructed prediction intervals limits and is defined by [69]: where n * is the number of test data and i is a Boolean variable, which is equal to 1 if the target value y * i lie within the prediction interval limits and 0 otherwise. Mathematically, i is defined as follows: where y * i is a test datum and L i and U i are respectively the lower and upper bounds of the prediction interval. The PINAW assesses the width of prediction intervals and is defined by [69]: where R used for normalization is the difference between the maximum and minimum of the test data. The assessment of prediction intervals based only on PICP criterion or PINAW index is inaccurate, as forecast with a large prediction interval can lead to a high PICP. However, such a forecast is not as well as another one that has a narrow prediction interval (small PINAW value) and a high PICP. That is why the combinational index of these two criteria proposed in [70], named coverage width-based criterion (CWC) that carries out information about the width and coverage probability of prediction intervals is interesting to evaluate the sharpness of the forecast. This index is defined by: where µ is the nominal confidence level associated with the prediction intervals, and η is a parameter that magnifies the difference between PICP and µ. Herein, η = 10 and µ = 0.95. The function γ is defined by: In this paper, besides the nRMSE and DMAE used to evaluate the models' forecasting performance, the CWC was used to assess the quality of models' prediction intervals. GPR naturally provides a predictive distribution, from which confidence intervals are obtained. However, the other considered methods-LS-SVR and ANN-do not. For LS-SVR, the method developed in [71] for construction of confidence intervals based on the central limit theorem for linear smoothers combined with bias correction and variance estimation is used to obtain the confidence intervals. Concerning the ANN models, the bootstrap method [72,73] that allows training neural networks with different parameter initializations in order to estimate the model uncertainty is used to quantify the uncertainty associated with the forecasts (here 30 repetitions were used for both the MLP and LSTM models).

Forecasting Results
The models' forecasting skill was assessed by comparing the forecasts over different time horizons ranging from 10 min to 4 h. The aforementioned criteria were calculated for each horizon and for each model. Numerical results for the considered forecast horizons are stored in Tables 2-4. As can be seen in Table 2, at the lowest forecast horizon (10 min), all the models developed in this study gave good and similar results, with a slight advantage to the ANN models, except the LS-SVR model which gave the worst result. As the forecast horizon increased, the superiority of the machine learning models became clear. They surpassed the scaled persistence model, whose performance degraded quickly as the forecast horizon increased. When looking at the nRMSE values of the machine learning models for all horizons, it can be noticed that the performances of these models were very similar, but the LSTM model and the GPR model based on k RQ-ARD kernel outperformed the others as the forecast horizon increased. Table 2 also shows that, at the lowest horizon (10 min), the MLP model gave the best results but gave the worse results at the highest horizon (4 h).     The values of DMAE for all models are stored in Table 3. Based on this criterion, the LS-SVR model and the scaled persistence model gave the worst results for all horizons compared to the other models. This criterion combined the temporal distorsion between the forecasts and test data with the absolute magnitude error. For lower horizons, the scaled persistence model gave good results according to its nRMSE values ( Table 2) and its performance at the lowest horizon (10 min) was practically similar to the performance of ANN models (nRMSE = 20.20%) and slightly better than the performance of GPR models. However, there were some time lags between the forecasts given by the scaled persistence model and the test data (Figures 8-10), which are reflected in its DMAE values. Table 3 shows that, at the lowest horizon, the MLP model gave the best results, but for all horizons, the best performing models were the LSTM model and k RQ-ARD -based GPR model. Table 4 stores the CWC values for the LS-SVR, ANN, and GPR models. All these models gave high coverage probability at the lowest horizon (10 min), with an advantage to the GPR model based on k RQ-ARD kernel which, besides, had narrow confidence intervals as demonstrated by its CWC value. When the forecast horizon increased, the quality and accuracy of forecasts degraded for each model. The rapid augmentation of the CWC values indicated these effects. At the highest forecast horizon (4 h), the ANN models gave high PICP values, that is why they had small CWC values, but in terms of quality and accuracy of forecasts, when looking at the nRMSE values, the GPR model based on k RQ-ARD kernel (nRMSE 0.40) was better than the MLP model (nRMSE 0.44). The ANN models also had large confidence intervals at the highest horizon, which were reflected by their forecasts displayed in Figure 10. For each forecast horizon, the LS-SVR model gave the worst CWC value compared to the other models. As a result, by examining its confidence intervals combined with its nRMSE values, we can conclude that it was difficult for this model to follow its counterparts' performances.
Some examples of 10 min, 1 h and 4 h forecasts are displayed in Figures 8-10. A period of 6 days in the whole test database (from 28 November 2019 to 3 December 2019), having both clear and cloudy days, was chosen. At the lowest forecast horizon, the scaled persistence model was excellent, but when the horizon increased, its performance degraded quickly. Looking at Figure 8 for the lowest forecast horizon, all the models followed the trends in the test data quite well, except the LS-SVR model that did not predict well the patterns in overcast days. When the horizon increased, the performances of all forecasting models degraded. However, the figures show that the LSTM model and the GPR model based on k RQ-ARD kernel outperformed the other models. That is especially evident in Figure 10, at the highest forecast horizon, where despite the deterioration of their performances, these two models followed pretty well the trends in test data. Figure 10 also shows that while the LSTM model gave better results than others during cloudy days, during clear days the GPR model based on k RQ-ARD kernel prevailed. The MLP model and the GPR model based on k SE-ARD did not follow well the trend of days when the forecast horizon increased. The LS-SVR model converged to a smooth signal at the highest forecast horizon. This was not surprising because its kernel was suitable for smooth functions' representation and could not correctly model both large-scale and small-scale variations in the GHI signal. Among the ANN models, the LSTM model, which could learn long-term dependencies in the data, outperformed the MLP model. The large confidence intervals, when the forecast horizon got longer, demonstrated the models' performance degradation.
When examining the models' performances based on the obtained criteria values and examples of forecasts shown in Figures 8-10, it yielded that the two best models for GHI forecasting were the GPR model based on k RQ-ARD kernel and the LSTM model. Indeed, these two models outperformed their counterparts considered in this study. However, as the LSTM model and the k RQ-ARD -based GPR model gave similar results, it was hardly easy to distinguish between them through this work. As the GHI data used in this paper exhibited more clear days than overcast days, one would tend to choose the GPR model based on k RQ-ARD kernel, which outperformed the LSTM model during clear days and gave less good results than the LSTM model during cloudy days, for GHI forecasting for our location (i.e., Font-Romeu-Odeillo-Via, in the Occitania region (southern France)). Table 5 stores the models' training time, the number of parameters involved in each model, and the time needed to perform one forecast for each model. Although many parameters were involved in ANN models, they were fast to train and their forecasts were almost instantaneous. Concerning GPR models, they were very time-consuming in the training phase and needed more time to provide forecasts. Indeed, the generation of a forecast by a GPR model required an inversion of a matrix as big as the training dataset is large. However, there are many sparse approaches in the literature [74] that can be used to reduce time complexity and memory requirement while keeping the same performances for an in situ implementation.

Conclusions
This paper aims to shed light on the use of machine learning to forecast global horizontal irradiance (from which photovoltaic power generation can be inferred) using endogenous data. To do so, a comparative study of popular machine learning methods (artificial neural networks, support vector regression and Gaussian process regression) for GHI forecasting is conducted at different time horizons, covering intrahour and intraday cases. The models are developed using a 2-year GHI database with a 10 min time step. Usually, in machine learning studies, the time step is 1 h, which leads to significant simplification of GHI dynamics. In addition, contrary to developing a specific model for each forecast horizon, as shown in many studies in the literature and which can be computationally demanding when many horizons are considered, we made the choice of multi-horizon forecasting models. The scaled persistence model is also included in the comparative study as a reference model. Two criteria, the dynamic mean absolute error (DMAE) and the coverage width-based criterion (CWC), are used in addition to the popular root mean square error normalized by the mean of data (nRMSE) in order to conduct an in-depth analysis of the models' performance.
The results show that all the machine learning-based models outperform the scaled persistence model and among these models, the long short-term memory (LSTM) model and the k RQ-ARD -based GPR model outperform their counterparts. However, it is difficult to distinguish between these two models because their performances are very similar. The conclusion arising from the examination of the forecasts' temporal evolution is that the LSTM model provides good forecasts for cloudy days, contrary to the case of clear-sky days where the k RQ-ARD -based GPR model outperforms the LSTM model. Funding: This research was funded by the French agency for ecological transition (ADEME) and the Occitania Region (France).

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.

Acknowledgments:
The authors would like to thank the French agency for ecological transition and the Occitania Region for their financial support. They also thank ENEDIS, the French distribution grid operator, and all the academic and industrial entities involved in the Smart Occitania project for their contribution to this work.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The