Multi-Horizon Forecasting of Global Horizontal Irradiance Using Online Gaussian Process Regression: A Kernel Study

Abstract: In the present paper, global horizontal irradiance (GHI) is modelled and forecasted at time horizons ranging from 30 min to 48 h, thus covering intrahour, intraday and intraweek cases, using online Gaussian process regression (OGPR) and online sparse Gaussian process regression (OSGPR). The covariance function, also known as the kernel, is a key element that deeply influences forecasting accuracy. As a consequence, a comparative study of OGPR and OSGPR models based on simple kernels or combined kernels defined as sums or products of simple kernels has been carried out. The classic persistence model is included in the comparative study. Thanks to two datasets composed of GHI measurements (45 days), we have been able to show that OGPR models based on quasiperiodic kernels outperform the persistence model as well as OGPR models based on simple kernels, including the squared exponential kernel, which is widely used for GHI forecasting. Indeed, although all OGPR models give good results when the forecast horizon is short-term, when the horizon increases, the superiority of quasiperiodic kernels becomes apparent. A simple online sparse GPR (OSGPR) approach has also been assessed. This approach gives less precise results than standard GPR, but the training computation time is decreased to a great extent. Even though the lack of data hinders the training process, the results still show the superiority of GPR models based on quasiperiodic kernels for GHI forecasting.


Introduction
Over the past few decades, the penetration of distributed generation in the power grid has been gaining momentum due to environmental concerns and increases in global power demand. Therefore, being able to handle an increased share of fluctuating power generation from intermittent renewable energy sources, in particular solar photovoltaics, has become a critical challenge for power grid operators. In fact, the decentralisation of power generation has made compliance with voltage constraints, current levels and voltage drop gradients increasingly difficult and has spawned a multitude of stability, quality and safety issues. The intermittency of solar power is poorly accounted for by conventional planning strategies for the daily operation of power grids and complicates voltage regulation [1]. Therefore, grid stability has become dependent upon the successful compensation of supply/demand variation. As a result, perceiving changes in the state of the sky, and hence in GHI, has become essential. To achieve stationarity, the GHI time series is commonly preprocessed in one of two ways:

• Using the clearness index, defined as the ratio of irradiance at ground level to extraterrestrial irradiance on a horizontal plane (see [2,3], as well as references therein).

• Using the clear-sky index, defined as the ratio of irradiance at ground level to clear-sky irradiance on a horizontal plane, for which a clear-sky model is needed (see [4,5], as well as references therein).
Another approach is to model and forecast raw GHI time series, without any specific preprocessing. Indeed, the transformations necessary to achieve stationarity may hinder the performance of forecasting algorithms, since the resulting time series exhibit very low correlation [12]. Incidentally, it is shown in [13] that artificial intelligence-based models using raw GHI data outperform conventional approaches based on forecasting the clearness index and are able to capture the periodic component of this time series as well as its stochastic variation. Not having to rely on clear-sky or radiative transfer models also implies that all errors come from the forecasting method itself. This is especially relevant when using clear-sky models based on mean values, as those can lead to significant errors [14].
In the last few years, GHI and photovoltaic power generation forecasting with machine learning techniques have gained in popularity. Artificial neural networks are the most commonly used tools; for further information, see the works presented in [15][16][17] or the review in [18] for a summary of existing approaches. Next are support vector machines (SVMs), which are called support vector regression (SVR) in the context of regression; more information can be found in [19] or the review in [20]. Other machine learning techniques, such as nearest neighbor, regression tree, random forest or gradient boosting, have also been tested and compared to classical time-series forecasting [21][22][23].
The use of Gaussian process regression (GPR) for GHI forecasting, however, has remained limited. In [21,23], GPR was used to obtain short-term forecasts of the clear-sky index. A more recent study [24] addressed daily and monthly forecasting of GHI. In a spatio-temporal context, some studies based on GPR have been carried out to obtain short-term forecasts of the clear-sky index [25][26][27] (it should be noted that, in these papers, the spatio-temporal covariance was obtained by fitting empirical variograms instead of using spatio-temporal kernels). It must be stressed that all of these studies forecast the clear-sky index and, in addition, select no optimal kernel: as a default choice, the authors used only the squared exponential kernel. In this work, GPR is used to model and forecast raw GHI data at forecast horizons ranging from 30 min to 48 h, thus covering intrahour, intraday and intraweek cases. A comparison between simple kernels and combined kernels is made, demonstrating that an appropriate choice is key to obtaining the best forecasting performance, especially at longer forecast horizons. It will be shown that, when an appropriate kernel is chosen, GPR models are sufficiently flexible to model and forecast GHI by jointly capturing short-term variability (mainly due to atmospheric disturbances) and long-term trends (mainly due to solar geometry). As a consequence, in addition to the aforementioned benefits of not using preprocessing, the proposed models can be applied without modification in different locations, since any difference will be learned directly from the data.
The paper is organized as follows: Gaussian processes as well as commonly used covariance functions and ways of combining these functions are defined in Section 2. In Section 3, the principles of Gaussian process regression are introduced, including both online and online sparse models, as well as the estimation of their parameters and hyperparameters. The datasets used to train and validate the models included in the comparative study, derived from GHI measurements taken at the laboratory PROMES-CNRS in Perpignan (southern France), are described in Section 4. In Section 5, the forecasting accuracy of the developed OGPR and OSGPR models is evaluated. The paper ends with a conclusion. The hyperparameters' estimated values and detailed numerical results can be found in Appendices A and B, respectively.

Gaussian Processes
In this section, Gaussian processes are defined. In addition, commonly used covariance functions and ways of combining these functions are presented.

Definition
A Gaussian process (GP) can be defined as a collection of random variables, any finite number of which have a joint Gaussian distribution (see Definition 2.1 in [28]). A GP defines a prior over functions, which can be converted into a posterior over functions once some data have been observed. Let us write f(x) ∼ GP(µ(x), k(x, x′)), where µ is the mean function, k is the covariance function and x and x′ are arbitrary input variables. In mathematics, GPs have been studied extensively, and their existence was first proven by the Daniell-Kolmogorov theorem (better known as Kolmogorov's existence/extension/consistency theorem; see [29]). In the field of machine learning, GPs first appeared in [30], as the limiting case of Bayesian inference performed on artificial neural networks with an infinite number of hidden neurons. However, GPs appeared much earlier in the geostatistics research community, where they are known as kriging methods [31,32].

Covariance Functions
A covariance function, also known as a kernel, evaluates the similarity between two points and encodes the assumptions about the function to be learned. This initial belief could concern the smoothness of the function or whether the function is periodic. Any function can be used as a kernel as long as the resulting covariance matrix is positive semi-definite. For scalar-valued inputs x and x′, the commonly used kernels (k WN, k SE, k RQ, k M ν and k Per) are briefly described below. An exhaustive list of kernels can be found in [28].

• The white noise kernel k WN is given by

k WN (x, x′) = σ² δ(x, x′),

where δ is the Kronecker delta and σ² is the variance. In all the considered kernels, σ² plays the role of a scale factor.

• The squared exponential kernel k SE is given by

k SE (x, x′) = σ² exp(-(x - x′)² / (2ℓ²)),

where σ² is the variance and ℓ is the correlation length or characteristic length-scale, sometimes called the "range" in geostatistics. ℓ controls how quickly functions sampled from the GP oscillate: if ℓ is large, even points that are far apart have meaningful correlation, so GP functions oscillate slowly, and vice versa.

• The rational quadratic kernel k RQ is given by

k RQ (x, x′) = σ² (1 + (x - x′)² / (2αℓ²))^(-α),

where σ² is the variance and ℓ is the correlation length. In this kernel, α defines the relative weighting of large-scale and small-scale variations. k RQ can be seen as a scale mixture of squared exponential kernels with different correlation lengths distributed according to a Gamma distribution [28].

• The Matérn class of kernels k M ν is given by

k M ν (x, x′) = σ² (2^(1-ν) / Γ(ν)) (√(2ν) |x - x′| / ℓ)^ν B ν (√(2ν) |x - x′| / ℓ),

where σ² is the variance, Γ is the standard Gamma function, ℓ is the correlation length and B ν is the modified Bessel function of the second kind of order ν. ν controls the degree of regularity (differentiability) of the resultant GP. If ν = 1/2, the exponential kernel k E = k M 1/2 is obtained:

k E (x, x′) = σ² exp(-|x - x′| / ℓ).

The corresponding GP is continuous but not differentiable; this kernel corresponds to the continuous version of a classical discrete autoregressive model AR(1), also known as the Ornstein-Uhlenbeck process. As discussed in [28], when ν = p - 1/2, with p ∈ N, the continuous version of the classical AR(p) model is obtained. Two other kernels of the Matérn class, k M 3/2 and k M 5/2, are widely exploited in the scientific literature. These kernels correspond to the cases ν = 3/2 and ν = 5/2, respectively. They are given by

k M 3/2 (x, x′) = σ² (1 + √3 |x - x′| / ℓ) exp(-√3 |x - x′| / ℓ)

and:

k M 5/2 (x, x′) = σ² (1 + √5 |x - x′| / ℓ + 5 (x - x′)² / (3ℓ²)) exp(-√5 |x - x′| / ℓ).

• The periodic kernel k Per is given by

k Per (x, x′) = σ² exp(-(2/ℓ²) sin²(π (x - x′) / P)),

where σ² is the variance and ℓ is the correlation length. k Per assumes a globally periodic structure (of period P) in the function to be learned. A larger P causes a slower oscillation, while a smaller P causes a faster oscillation.
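As an illustration, the simple kernels above can be written down directly. The following is a minimal sketch in Python; the argument names (sigma2, ell, alpha, period) are ours, not the paper's.

```python
import numpy as np

def k_wn(x, xp, sigma2=1.0):
    """White noise kernel: sigma^2 when x == x', 0 otherwise."""
    return sigma2 * (x == xp)

def k_se(x, xp, sigma2=1.0, ell=1.0):
    """Squared exponential kernel."""
    return sigma2 * np.exp(-(x - xp) ** 2 / (2.0 * ell ** 2))

def k_rq(x, xp, sigma2=1.0, ell=1.0, alpha=1.0):
    """Rational quadratic kernel: a scale mixture of SE kernels."""
    return sigma2 * (1.0 + (x - xp) ** 2 / (2.0 * alpha * ell ** 2)) ** (-alpha)

def k_m32(x, xp, sigma2=1.0, ell=1.0):
    """Matern kernel with nu = 3/2."""
    r = np.abs(x - xp) / ell
    return sigma2 * (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

def k_per(x, xp, sigma2=1.0, ell=1.0, period=1.0):
    """Periodic kernel of period `period`."""
    s = np.sin(np.pi * np.abs(x - xp) / period)
    return sigma2 * np.exp(-2.0 * s ** 2 / ell ** 2)
```

Each function returns σ² at x = x′ and encodes a different decay of similarity with distance, as described above.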

Kernel Composition
Several kernels can be combined to define efficient GPs. The only thing to keep in mind is that the resulting covariance matrix must remain positive semi-definite. Two ways of combining kernels while keeping the positive semi-definite property are addition and multiplication. As a result, a quasiperiodic GP can be modelled by multiplying a periodic kernel by a non-periodic kernel. This gives us a way to transform a global periodic structure into a local periodic structure.
For example, in [28], the product of k SE and k Per was used to model monthly average atmospheric CO 2 concentrations.
In this paper, we were inspired by [33] to construct a wide variety of kernels to model global horizontal irradiance by adding or multiplying two kernels from the kernel families presented in Section 2.2.
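The closure of kernels under addition and multiplication can be sketched as follows; k_sum and k_prod are illustrative helper names, and the quasiperiodic kernel k Per×SE is obtained as a product.

```python
import numpy as np

def k_se(x, xp, sigma2=1.0, ell=1.0):
    """Squared exponential kernel."""
    return sigma2 * np.exp(-(x - xp) ** 2 / (2.0 * ell ** 2))

def k_per(x, xp, sigma2=1.0, ell=1.0, period=1.0):
    """Periodic kernel of period `period`."""
    s = np.sin(np.pi * np.abs(x - xp) / period)
    return sigma2 * np.exp(-2.0 * s ** 2 / ell ** 2)

def k_sum(k1, k2):
    """The sum of two kernels is a valid kernel."""
    return lambda x, xp: k1(x, xp) + k2(x, xp)

def k_prod(k1, k2):
    """The product of two kernels is a valid kernel."""
    return lambda x, xp: k1(x, xp) * k2(x, xp)

# Quasiperiodic kernel k_Per x k_SE: a globally periodic structure whose
# correlation decays with distance, i.e. local periodicity.
k_per_x_se = k_prod(k_per, k_se)
```

At a lag of one full period, the periodic factor returns to its maximum, so only the SE decay remains; this is the "local periodicity" mentioned above.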

Gaussian Process Regression
In this section, standard Gaussian process regression (GPR), online Gaussian process regression (OGPR) and online sparse Gaussian process regression (OSGPR) are presented. The section ends with the training of a GPR model.

Standard Gaussian Process Regression
Consider the standard regression model with additive noise, formulated as follows:

y = f(x) + ε, (9)

where y is the observed value, f is the regression function, x ∈ R^(D×1) is the input vector (in most cases, 1 ≤ D ≤ 3; in this paper, x is the time, with a dimension of 1) and ε ∼ N(0, σ²ε) is independent and identically distributed Gaussian noise. GPR can be defined as a Bayesian nonparametric regression which assumes a GP prior over the regression functions [28]. Basically, this consists of approximating f by a GP with mean function µ and covariance function k. In the following, all the input vectors x i are merged into a matrix X ∈ R^(n×D), and all corresponding outputs y i are merged into a vector y ∈ R^(n×1), so that the training set can be written as (X, y). From Equation (9), one can see that y ∼ N(µ(X), K + σ²ε I), where K = k(X, X) ∈ R^(n×n). The joint distribution of the observed data y and the latent noise-free function on the test points f* = f(X*) is given in this setting by

[y; f*] ∼ N([µ(X); µ(X*)], [K + σ²ε I, K*; K*^T, K**]),

where K* = k(X, X*) ∈ R^(n×n*) and K** = k(X*, X*) ∈ R^(n*×n*). The posterior predictive density is also Gaussian:

f* | X, y, X* ∼ N(µ*, Σ*),

where

µ* = µ(X*) + K*^T (K + σ²ε I)^(-1) (y - µ(X)) (12)

and:

Σ* = K** - K*^T (K + σ²ε I)^(-1) K*. (13)

Let us examine what happens for a single test point x* and let k* be the vector of covariances between this test point and the n training points:

k* = [k(x 1, x*), …, k(x n, x*)]^T.

Assuming, for simplicity, a zero mean function, Equation (12) becomes

µ* = k*^T (K + σ²ε I)^(-1) y.

Thus, µ* can be formulated as a linear combination of kernels, each one centered on a training point:

µ* = Σ_{i=1}^{n} α i k(x i, x*), with α = [α 1 … α n]^T = (K + σ²ε I)^(-1) y. (16)

Each time a new observation is added to the observation set, the coefficients α i, referred to as parameters in the following, are updated. By contrast, the parameters of the kernel, referred to as hyperparameters, are not updated once the training is completed (see Section 3.4 and Appendix A).
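The posterior equations above can be sketched numerically. This toy example, assuming a zero prior mean and the k SE kernel, computes the posterior mean and covariance of Equations (12) and (13) with NumPy (function and argument names are ours).

```python
import numpy as np

def k_se(X, Xp, sigma2=1.0, ell=1.0):
    """Squared exponential kernel matrix for 1-D inputs."""
    d = X[:, None] - Xp[None, :]
    return sigma2 * np.exp(-d ** 2 / (2.0 * ell ** 2))

def gpr_predict(X, y, X_star, noise_var=1e-2, **kern):
    """Posterior mean and covariance (Equations (12) and (13), zero prior mean)."""
    K = k_se(X, X, **kern) + noise_var * np.eye(len(X))
    K_star = k_se(X, X_star, **kern)
    K_ss = k_se(X_star, X_star, **kern)
    alpha = np.linalg.solve(K, y)           # alpha = (K + noise I)^-1 y, cf. Equation (16)
    mu_star = K_star.T @ alpha              # posterior mean, Equation (12)
    Sigma_star = K_ss - K_star.T @ np.linalg.solve(K, K_star)  # Equation (13)
    return mu_star, Sigma_star

# Toy data: one period of a sine, standing in for a daily GHI cycle.
X = np.linspace(0.0, 1.0, 20)
y = np.sin(2.0 * np.pi * X)
mu, Sigma = gpr_predict(X, y, X, ell=0.2)
```

At the training points, the posterior mean closely reproduces the data, with small posterior variances on the diagonal of Σ*.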

Online Gaussian Process Regression
When using standard GPR, it is difficult to incorporate a new training point or a new set of training points. In case a forecasting algorithm is run in situ, there is no fixed dataset D = {(x i, y i), 1 ≤ i ≤ n}; rather, one or a few observations are collected at a time and, as a result, the total amount of information available grows perpetually and incrementally. Besides, updates are often required every hour or minute, or even every second. In this case, using standard GPR would incur prohibitive computational overheads: each time we wish to update our beliefs in the light of a new observation, we would be forced to invert the matrix K + σ²ε I, with a complexity of O(n³). A solution to this is to use online Gaussian process regression (OGPR) [34][35][36][37][38]. Let us suppose that the distribution of the GP given the first n training points (X, y) is known. For simplicity, let us consider µ = 0 (without loss of generality). Suppose that n+ new observations are regrouped in the matrix X+, and let

A = K + σ²ε I, B = k(X, X+), C = k(X+, X+) + σ²ε I.

This results in the following covariance matrix for the extended training set:

[A, B; B^T, C].

Using the block matrix inversion theorem, we obtain

[A, B; B^T, C]^(-1) = [A^(-1) + A^(-1) B S^(-1) B^T A^(-1), -A^(-1) B S^(-1); -S^(-1) B^T A^(-1), S^(-1)],

where

S = C - B^T A^(-1) B

is the Schur complement of A. Now, inverting the (n + n+) × (n + n+) matrix only requires the inverse of A, which is known, and the inversion of the n+ × n+ matrix S. Thus, the computational cost is O(n+³ + n² n+) rather than O((n + n+)³) when performing direct matrix inversion.
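The block matrix inversion step can be sketched as follows: given A^(-1), only the small n+ × n+ Schur complement is inverted (function name and layout are ours).

```python
import numpy as np

def block_inverse_update(A_inv, B, C):
    """Inverse of [[A, B], [B^T, C]] given A^-1, via the Schur complement S."""
    S = C - B.T @ A_inv @ B                 # n+ x n+ Schur complement of A
    S_inv = np.linalg.inv(S)                # only a small matrix is inverted
    top_left = A_inv + A_inv @ B @ S_inv @ B.T @ A_inv
    top_right = -A_inv @ B @ S_inv
    return np.block([[top_left, top_right], [top_right.T, S_inv]])

# Check on a random symmetric positive definite "covariance" matrix.
rng = np.random.default_rng(0)
R = rng.normal(size=(6, 6))
M = R @ R.T + 6.0 * np.eye(6)               # extended (n + n+) covariance matrix
A_inv = np.linalg.inv(M[:4, :4])            # known inverse for the first n = 4 points
updated = block_inverse_update(A_inv, M[:4, 4:], M[4:, 4:])
```

The result agrees with a direct inversion of the full matrix, while only a 2 × 2 block has actually been inverted.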

Online Sparse Gaussian Process Regression
As presented in Section 3.1, a challenge of using standard GPR is its computational complexity of O(n³) during training [28], with n being the number of training points. This principally comes from the need to invert the matrix K + σ²ε I in Equations (12) and (13). The simplest and most obvious approach to reducing computational complexity in Gaussian process regression is to ignore a part of the available training data: a forecast is then generated using only a subset of the available data during training, also known as the active set or inducing set. This leads to another version of OGPR: the so-called online sparse Gaussian process regression (OSGPR). OSGPR includes both the deterministic training conditional (DTC) [39] and the fully independent training conditional (FITC) [40] approximation methods. Here, the computational complexity is reduced to O(n s² n), where n s is the size of the active set (usually n s ≪ n). More information can be found in [38][39][40][41][42] and in the reviews presented in [34,43].
The choice of a procedure for selecting the active set from the total of n training points constitutes its own research field [38][39][40][41]. Sophisticated choices are based on various information criteria that quantify the pertinence of adding each point to the active set. However, a simple way is to choose the n s points randomly (a technique known as the subset of data), which, in some situations, may be as efficient as more complex choices [44].
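A minimal sketch of the subset-of-data selection (the function name and argument layout are ours):

```python
import numpy as np

def subset_of_data(X, y, sparsity, rng=None):
    """Randomly keep a fraction (1 - sparsity) of the training points."""
    rng = rng or np.random.default_rng()
    n = len(X)
    n_s = max(1, int(round((1.0 - sparsity) * n)))
    idx = np.sort(rng.choice(n, size=n_s, replace=False))  # keep time ordering
    return X[idx], y[idx]
```

For example, a sparsity level of 0.7 keeps 30% of the training points, which is the regime explored in Section 5.4.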

Training a GPR Model
As previously mentioned in the paper, there is freedom when choosing the kernel of a GPR model. The model hyperparameters, denoted as θ = [θ 1 … θ m]^T below, group those of the chosen kernel and the noise variance. As an example, the squared exponential kernel k SE (widely used for GHI forecasting), defined by Equation (2), has two hyperparameters, σ and ℓ (see Section 2.2), to which the noise variance σ²ε is added. Thus, θ = [σ ℓ σ²ε]^T. The hyperparameters of all the kernels included in this comparative study are given in Appendix A (see Equations (A1)-(A16)). These hyperparameters are estimated from the data. To do so, the probability of the data given the aforesaid hyperparameters is computed. Assuming a GPR model with Gaussian noise (9), the log marginal likelihood is given by [28]

L(θ) = log p(y | X, θ) = -(1/2) y^T (K + σ²ε I)^(-1) y - (1/2) log |K + σ²ε I| - (n/2) log 2π. (21)

Thus, the estimation of θ is usually achieved by maximising the log marginal likelihood (21). If the gradient of L(θ) is known, a gradient-based algorithm, such as a conjugate gradient method (as proposed in [45]), can be used to accelerate the maximisation process. More information about using the gradient and Hessian of Equation (21) to speed up the learning phase of the GPR algorithm can be found in [28,46,47].
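The log marginal likelihood (21) can be sketched for the k SE kernel as follows; a Cholesky factorisation is used for numerical stability, and the log-parameterisation of θ is our choice, not the paper's.

```python
import numpy as np

def log_marginal_likelihood(theta, X, y):
    """Equation (21) for the k_SE kernel; theta = (log sigma, log ell, log sigma_eps)."""
    sigma, ell, sigma_eps = np.exp(theta)
    d = X[:, None] - X[None, :]
    K = sigma ** 2 * np.exp(-d ** 2 / (2.0 * ell ** 2)) + sigma_eps ** 2 * np.eye(len(X))
    L = np.linalg.cholesky(K)               # K = L L^T, numerically stable
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + noise I)^-1 y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))    # -(1/2) log |K + noise I|
            - 0.5 * len(X) * np.log(2.0 * np.pi))
```

Maximising this quantity (e.g. by minimising its negative with a gradient-based optimiser) yields the hyperparameter estimates.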

Data Description
The data used in this comparative study originate from a rotating shadowband irradiometer (RSI), whose uncertainty range is about ±5%, housed at the laboratory PROMES-CNRS in Perpignan, roughly 20 km west of the Mediterranean Sea. This device consists of two horizontally leveled silicon photodiode radiation detectors, situated in the center of a spherically curved shadowband. The region is characterised by strong winds resulting in an often clear sky, in addition to mild winters and hot and dry summers.
The comparative study enables the evaluation of GPR models through two GHI datasets, each sampled at a 30 min rate. The first one, hereafter referred to as the "summer" dataset, covers a period of 45 days, from 5 June to 18 July 2015. The second one, hereafter referred to as the "winter" dataset, also covers a period of 45 days, from 1 November to 15 December 2015.
The motivation for using two datasets instead of one larger dataset is twofold. On the one hand, clear-sky days are frequent during summertime, while in winter more complicated situations occur (overcast, broken clouds, etc.); as will be seen in the forecasting results (see Section 5.3), GHI is thus easier to predict in summer, while the winter dataset highlights the models' robustness in complex atmospheric conditions. On the other hand, this choice shows that it is not necessary to gather a huge dataset to obtain good forecasting results. Unless specified otherwise, the first 30 days and the last 15 days of each GHI dataset are used for training (see Section 5.1) and to evaluate the models' generalisation ability, respectively. Finally, the forecast horizons considered in this comparative study range from 30 min to 48 h, thus covering intrahour, intraday and intraweek cases.

Modeling and Forecasting Results
In this section, the modeling and forecasting results obtained using the OGPR and OSGPR models included in the comparative study, as well as the evaluation metrics used, are presented. The methods of initialising and estimating the hyperparameters are also explained.

Hyperparameter Initialisation and Estimation
To model GHI using Gaussian process regression, we first have to identify the most suitable kernel. In this comparative study, kernel identification is inspired by the procedure described in [33]. The main idea is to construct a wide variety of kernel structures by adding or multiplying simple kernels from the kernel families discussed in Section 2.2. While all the possible combinations of simple kernels have been evaluated, only the following kernels and combinations have been included in the comparative study.
As a matter of fact, results emanating from other combinations of non-periodic kernels are not presented here because they demonstrate a behaviour similar to that of simple kernels.
The maximisation of the log marginal likelihood (21) allows the estimation of the kernel hyperparameters from GHI training data. Convergence to the global maximum cannot be guaranteed, since L(θ) is non-convex with respect to θ. Several techniques can be used to mitigate this issue, the most classical of which is to use multiple starting points randomly selected from a specific prior distribution, for instance θ i ∼ Uniform(0, 1). In [48], the authors incorporate various types of priors for the hyperparameters' initial values, then examine the impact that these priors have on the GPR models' predictive performance. Results show that the hyperparameters' estimates are insensitive to the prior distributions when using k SE, as opposed to k Per, whose hyperparameters' estimates vary with the priors. This leads to the conclusion that prior distributions have a great influence on the hyperparameters' estimates in the case of a periodic kernel.
As mentioned in Section 4, the two GHI datasets (i.e., the summer and winter datasets) are split into a training subset and a testing subset that cover periods of 30 days and 15 days, respectively. As for the initial values of the hyperparameters θ, we have made the following choices.

• The correlation length ℓ has been chosen equal to the training data's standard deviation.

• In case a periodic kernel is involved in the regression process, an initial value equal to one day has been chosen for the period P, in order to avoid issues that arise while estimating it.

• The initial values of the remaining hyperparameters, if any, are randomly drawn from a uniform distribution.
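The initialisation and multi-start estimation strategy described above can be sketched as follows; the function layout is illustrative, and neg_lml stands for the negative log marginal likelihood of the chosen kernel.

```python
import numpy as np
from scipy.optimize import minimize

def fit_hyperparameters(neg_lml, y_train, n_other, n_restarts=5, rng=None):
    """Multi-start maximisation of the log marginal likelihood.

    The first hyperparameter (the correlation length) starts at the training
    data's standard deviation; the remaining ones are drawn from U(0, 1).
    `neg_lml` maps a hyperparameter vector to the negative log marginal
    likelihood (names and parameter layout here are our assumptions).
    """
    rng = rng or np.random.default_rng()
    best = None
    for _ in range(n_restarts):
        theta0 = np.concatenate([[np.std(y_train)],
                                 rng.uniform(0.0, 1.0, n_other)])
        res = minimize(neg_lml, theta0, method="L-BFGS-B")
        if best is None or res.fun < best.fun:
            best = res          # keep the restart with the highest likelihood
    return best.x
```

Using several restarts mitigates the non-convexity of L(θ) discussed above, at the price of a proportional increase in training time.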

Evaluation of Models' Accuracy
Two evaluation metrics are considered in this comparative study. The first one is the normalized root mean square error (nRMSE):

nRMSE = sqrt( (1/n*) Σ_{i=1}^{n*} (y forecast,i - y test,i)² ) / ⟨y test⟩.

The second evaluation metric is the correlation coefficient (R):

R = Σ_{i=1}^{n*} (y test,i - ⟨y test⟩)(y forecast,i - ⟨y forecast⟩) / sqrt( Σ_{i=1}^{n*} (y test,i - ⟨y test⟩)² Σ_{i=1}^{n*} (y forecast,i - ⟨y forecast⟩)² ),

where ⟨·⟩ is the arithmetic mean and y test ∈ R^(n*×1) and y forecast ∈ R^(n*×1) are the test data and the forecasts given by the GPR models, respectively.
These evaluation metrics enable performance (generalisation) assessment during testing, which is a challenging task when time horizons and time scales vary, as highlighted in [21].
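These two metrics can be sketched directly; note that normalising the RMSE by the mean of the test data is our assumption about the convention used.

```python
import numpy as np

def nrmse(y_test, y_forecast):
    """Root mean square error normalised by the mean of the test data
    (the normalisation convention is an assumption)."""
    rmse = np.sqrt(np.mean((y_forecast - y_test) ** 2))
    return rmse / np.mean(y_test)

def corr_coeff(y_test, y_forecast):
    """Pearson correlation coefficient R."""
    yt = y_test - np.mean(y_test)
    yf = y_forecast - np.mean(y_forecast)
    return np.sum(yt * yf) / np.sqrt(np.sum(yt ** 2) * np.sum(yf ** 2))
```

A perfect forecast yields nRMSE = 0 and R = 1; R is invariant to affine rescalings of the forecast, which is why both metrics are reported.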

Forecasting Results Using OGPR
Forecasts of GHI obtained at different time horizons are compared to enable the assessment of the models' global performance. The persistence model is used as a reference. For all the OGPR models catalogued in Section 5.1, the nRMSE vs. the forecast horizon is displayed in Figure 1. Further numerical results for the summer dataset (Tables A2 and A3) and the winter dataset (Tables A4 and A5) are presented in Appendix B. Broadly speaking, there are three classes of models, each possessing a different performance level. The worst performance is exhibited by the persistence model, and improved performance is witnessed when using OGPR models based on simple kernels; lastly, considerably better performance is exhibited by OGPR models based on quasiperiodic kernels, particularly considering higher forecast horizons.
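For reference, the persistence model can be sketched as follows; plain persistence (the forecast at t + h equals the observation at t) is assumed here for illustration, although clear-sky-adjusted variants also exist.

```python
import numpy as np

def persistence_forecast(y, horizon_steps):
    """Naive persistence: the forecast at time t + h is the observation at t.
    The first `horizon_steps` entries have no past observation (NaN)."""
    return np.concatenate([np.full(horizon_steps, np.nan), y[:-horizon_steps]])
```

Since the forecast is simply a shifted copy of the series, its error grows quickly with the horizon, which is why persistence degrades faster than the OGPR models in Figure 1.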
For both datasets, even at the lowest forecast horizon (30 min), OGPR models based on simple kernels give forecasts comparable to those given by the persistence model (nRMSE ≈ 0.22 in summer and nRMSE ≈ 0.30 in winter), while OGPR models based on quasiperiodic kernels already give better forecasts (nRMSE ≈ 0.18 in summer and nRMSE ≈ 0.21 in winter). As the forecast horizon increases, the persistence model's performance degrades more rapidly than that of the OGPR models. At the highest forecast horizon (5 h), the persistence model gives nRMSE ≈ 1.11 in summer and nRMSE ≈ 1.50 in winter; OGPR models based on simple kernels give nRMSE ≈ 0.55 in summer and nRMSE ≈ 0.70 in winter; and OGPR models based on quasiperiodic kernels give nRMSE ≈ 0.30 in summer and nRMSE ≈ 0.40 in winter.
Regarding OGPR models based on simple kernels, no single best-performing model has been found. Depending on the forecast horizon, k M 3/2, k M 5/2, k SE and k RQ alternately give the best forecasts, while k E does not manage to perform competitively. An interesting observation is that k SE (often used to forecast GHI; see [21,23]) does not ensure the best results among the simple kernels.
Because OGPR models based on a periodic kernel produce a periodic signal, such models are similar to clear-sky GHI models whose parameters have been fitted to the data. As a consequence, these models simply reproduce the same bell-shaped curve over and over and yield practically the same nRMSE regardless of the forecast horizon. Although these OGPR models can produce good forecasts on clear-sky days, they are unable to predict the atmospheric-disturbance-induced variability in GHI data.
OGPR models based on quasiperiodic kernels (which combine a periodic kernel with a non-periodic kernel), however, retain the advantage of the periodic kernel while still managing to predict rapid changes in GHI during the day. Among quasiperiodic kernels, k Per×RQ surpasses the other kernels for the summer dataset, with k Per+M 3/2, k Per+M 5/2 and k Per+E all coming in a close second (see Table A2); for the winter dataset, however, there is no clear best-performing kernel, as those four kernels alternately take first place (see Table A4).
An in-depth analysis of the temporal evolution of GHI during the models' training and testing phases sheds more light on their performance. Once more, the persistence model serves as a reference. Three OGPR models are selected: the classic k SE -based OGPR model and two of the best-performing models based on quasiperiodic kernels, i.e., the k Per×RQ -based OGPR model and the k Per+SE -based OGPR model. Here, a dataset of nine days (selected from the summer dataset) is used, split into a seven-day training subset and a two-day testing subset. The corresponding results, for forecast horizons of 30 min, 4 h and 48 h, are displayed in Figures 2-4, respectively. Recall that, while each data sample is used during training, during testing, a new observation is added to the observation set only at each whole multiple of the forecast horizon. This means that, first, for all three figures, the training results are identical; second, the coefficients α i in Equation (16) are updated every 30 min in Figure 2 and every 4 h in Figure 3, while for the 48-h forecast horizon, no update occurs.
An inspection of the seven-day training phase reveals that the data are well-fitted by every OGPR model. Signals generated by both k SE and k Per+SE are quite smooth and show few differences, as opposed to k Per×RQ, whose capability for generating more irregular signals allows it to follow the temporal evolution of GHI more closely in the case of atmospheric disturbances.

A study of the two-day testing phase reveals the following: all models perform very well when a new observation is added to the observation set every 30 min (Figure 2), although OGPR models based on quasiperiodic kernels, especially k Per×RQ, perform slightly better. The performance gap becomes more apparent when the forecast horizon increases. Thus, when a new observation is added to the observation set every 4 h, the k SE -based OGPR model struggles to predict changes in GHI accurately, as conveyed by the substantial confidence interval between observations (Figure 3a). As soon as a new observation is made, the model fits to the data, but in the absence of an update, it converges to the constant mean value learned during training (around 280 W m−2). In Figure 4a, this behaviour is more obvious: no update is made throughout the entire two-day testing period, and without additional information, the OGPR model simply makes an "educated guess" consisting of this constant mean value associated with a large confidence interval. Quasiperiodic kernels, however, do possess additional information on GHI, namely that it has a daily pattern. As with OGPR models based on simple kernels, when a new observation is added to the observation set, they fit to the data (see Figure 3b,d).
In the opposite case, OGPR models based on quasiperiodic kernels reproduce the periodic signal learned during training (see Figure 4b,d), giving a result that is distinctly more faithful to the desired behaviour than a constant mean value. This explains why the performance of OGPR models based on quasiperiodic kernels degrades more slowly when the forecast horizon increases, as seen in Figure 1. Based on the results shown in Figure 3b,d, k Per×RQ is the best choice among quasiperiodic kernels, as it permits sharper and more drastic changes in the modelled signal.

The nRMSE gain relative to the persistence model is presented in Table 1 for OGPR models based on the classical k SE and the three best-performing quasiperiodic kernels (winter dataset). Table 1. Performance comparison, in terms of nRMSE, between the persistence model and OGPR models based on different kernels: the classical k SE and the three best-performing quasiperiodic kernels (winter dataset). The forecast horizon ranges between 30 min and 5 h. The best results are highlighted in green.

Forecasting Results Using OSGPR
The impact of training data sparsity on forecasting accuracy (generalisation) is assessed in this section of the paper. The subset of data technique is used; this technique simply consists of ignoring a part of the data available for training. Lowering the quantity of data used to train the models reduces computation time during training and also during testing, since the number of parameters is reduced (see Equation (16)). In addition, this allows us to evaluate the models' ability to handle missing data in the training sequence (i.e., their generalisation ability in the case of missing data in that sequence).
Thirty minute, 4 h and 48 h forecasts given by both the k SE -based OSGPR model and the k Per×RQ -based OSGPR model are displayed in Figure 5. The dataset comprises the same nine days of GHI measurements that were considered with the OGPR models (see Section 5.3). Here, however, only 20% of the available data, randomly selected from the summer dataset, have been used during training.
With this low number of training points, the OSGPR model based on k SE does not provide acceptable results. Compare, for example, the third and fourth days of training: when given sufficient data, a good fit is obtained, but if the only training points are at the beginning and the end of the day (as during the fourth day), the OSGPR model based on k SE simply makes the best inference, which, in this case, is not sufficient. In contrast, the OSGPR model based on k Per×RQ still manages to give an acceptable fit. The key here is that even a low number of training points is enough for the models based on quasiperiodic kernels to learn the daily periodic behavior of GHI.
As expected, the forecast results are good at the 30 min horizon, with an advantage for the k SE -based OSGPR model. However, as with OGPR models, when the forecast horizon increases, the superiority of the quasiperiodic kernel shows again. It should be noted that, with a low number of data, the periodic behaviour is not learned as well as before: compare Figure 5f, where the two testing days are not bell-shaped but rather resemble the last four training days, to Figure 4d. Nonetheless, this example demonstrates the usefulness of Gaussian process regression models, even in the case of (severely) missing data.

Figures 6 and 7 show nRMSE vs. training data sparsity for the k SE -based OSGPR model and the k Per×RQ -based OSGPR model, respectively. Figure 8 shows the computation time vs. training data sparsity for the latter model. To avoid drawing conclusions from a particular realisation (keep in mind that training data are randomly selected), 100 Monte Carlo runs have been conducted; hence the use of box plots (the Tukey box plots display the median, the 0.25- and 0.75-quartiles, and whiskers corresponding to ±1.5 times the interquartile range) in Figures 6-8. As can be seen, while the training nRMSE decreases steadily with data sparsity, the testing nRMSE remains essentially constant up to a sparsity threshold, and its dispersion also quickly vanishes. In this study, this threshold seems to be around 70%. As a consequence, using only 30% of the available training data appears to be sufficient to achieve the same nRMSE results as those obtained when using the whole training dataset. The gain in computation time is significant, as it falls from a median of 175 s at a sparsity level of 0% to a median of 17 s at a sparsity level of 70%, which amounts to around a 90% reduction.

Conclusions
The present paper aimed to design Gaussian process regression models for GHI forecasting at various time horizons (intrahour, intraday and intraweek). In existing research, only the squared exponential kernel k_SE was used as the default choice (see [21,23]); no due attention was paid to kernel selection and its impact on model performance, which this paper has investigated thoroughly and shown to be pertinent. A comparative study was carried out on several GPR models, ranging from rather simple models to more complex versions based on quasiperiodic kernels, i.e., the sum or product of a simple kernel and a periodic kernel. Two separate 45-day datasets have been used. The conclusions are that GPR models predictably outperform the persistence model, and that quasiperiodic-kernel-based GPR models are particularly apt at modelling and forecasting GHI. For short forecast horizons, all OGPR models perform satisfactorily. For longer forecast horizons, however, the picture changes entirely, as quasiperiodic kernels prove much more suitable for the application at hand. The proposed interpretation is that GHI is better modelled by an omnipresent periodic component representing its global structure plus a random latent component explaining rapid variations due to atmospheric disturbances. The periodic component in the proposed quasiperiodic kernels opens the door to GHI forecasting at longer horizons (several hours to a few days) by minimising performance degradation. Accurate GHI forecasting from intrahour to intraweek horizons thus becomes a real possibility, provided that the proper kernel is fashioned.
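The contrast drawn above between k_SE and a quasiperiodic kernel at longer horizons can be illustrated on synthetic data. The sketch below (an illustration under assumed settings, not the paper's experiment) trains both kernels on three "days" of a daily-periodic signal and scores a day-ahead forecast; scikit-learn's RBF plays the role of k_SE.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, RationalQuadratic

rng = np.random.default_rng(1)

# Synthetic daily-periodic signal: train on 3 "days", forecast the 4th.
t = np.linspace(0, 4, 192)[:, None]
y = np.maximum(0, np.sin(np.pi * (t.ravel() % 1.0))) + 0.05 * rng.standard_normal(192)
train, test = t.ravel() < 3.0, t.ravel() >= 3.0

kernels = {
    "k_SE": RBF(length_scale=0.1),
    "k_PerxRQ": ExpSineSquared(length_scale=1.0, periodicity=1.0)
                * RationalQuadratic(length_scale=1.0),
}

errors = {}
for name, kernel in kernels.items():
    gpr = GaussianProcessRegressor(kernel=kernel, alpha=1e-3, normalize_y=True)
    gpr.fit(t[train], y[train])
    # RMSE of the day-ahead extrapolation (the unseen fourth day).
    errors[name] = np.sqrt(np.mean((y[test] - gpr.predict(t[test])) ** 2))
    print(f"{name}: day-ahead RMSE = {errors[name]:.3f}")
```

Away from the training data, the stationary k_SE posterior reverts towards the mean, whereas the periodic factor lets the quasiperiodic model carry the daily structure into the forecast window, which is the mechanism the conclusion above appeals to.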
A simple sparse GPR approach, the subset-of-data technique, has also been assessed; it consists of using only part of the available data for training. This approach gives less precise results than standard GPR, but the training computation time is reduced to a great extent: this study indicates that comparable performance (in terms of nRMSE) is obtained when using only 30% of the available training data, which leads to a 90% decrease in computation time. Additionally, even though the lack of data hinders the training process, the results still show the superiority of GPR models based on quasiperiodic kernels.
Future work will focus on some of the best-performing quasiperiodic kernels that have been put forward in this paper and assess their performance using larger GHI datasets.

Acknowledgments:
The authors would like to thank ADEME for its financial support. The authors also thank all the Smart Occitania consortium members, in particular ENEDIS, the French distribution system operator, for their contribution to this work.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

GHI   Global horizontal irradiance
GP    Gaussian process
GPR   Gaussian process regression
OGPR  Online Gaussian process regression
OSGPR Online sparse Gaussian process regression
nRMSE Normalized root mean square error

Appendix A. Kernel Hyperparameters
In this appendix, the mathematical expressions of the simple and quasiperiodic kernels considered in Section 5 are given. Table A1 lists the estimated values of the model hyperparameters for all OGPR models, for both the summer and winter datasets. As stated in Section 3.4, these hyperparameters are denoted by θ = [θ_1 … θ_m]^T.
• Products of simple kernels:
• Sums of simple kernels:
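For reference, the standard forms of the simple kernels that such products and sums combine are, following the usual Gaussian-process conventions (with τ = t − t′, signal variance σ², length scale ℓ, rational-quadratic shape α and period p):

\[
k_{\mathrm{SE}}(\tau) = \sigma^2 \exp\!\left(-\frac{\tau^2}{2\ell^2}\right), \qquad
k_{\mathrm{RQ}}(\tau) = \sigma^2 \left(1 + \frac{\tau^2}{2\alpha\ell^2}\right)^{-\alpha}, \qquad
k_{\mathrm{Per}}(\tau) = \sigma^2 \exp\!\left(-\frac{2\sin^2(\pi\tau/p)}{\ell^2}\right),
\]

a quasiperiodic kernel being, e.g., \(k_{\mathrm{Per}\times\mathrm{RQ}}(\tau) = k_{\mathrm{Per}}(\tau)\, k_{\mathrm{RQ}}(\tau)\). These are the textbook definitions; the paper's own parameterisation may differ in detail.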

Appendix B. Detailed Numerical Results
In this appendix, detailed numerical results are presented. The values of nRMSE and the correlation coefficient R can be found in Tables A2 and A3 for the summer dataset and in Tables A4 and A5 for the winter dataset.

Table A2. Numerical values of nRMSE obtained when performing forecasts with persistence and OGPR models (summer dataset). The forecast horizon ranges between 30 min and 5 h. The best results are highlighted in green.

Table A4. Numerical values of nRMSE obtained when performing forecasts with persistence and OGPR models (winter dataset). The forecast horizon ranges between 30 min and 5 h. The best results are highlighted in green.