Article

Hybrid Linear and Support Vector Quantile Regression for Short-Term Probabilistic Forecasting of Solar PV Power

by Roberto P. Caldas 1,*, Albert C. G. Melo 2 and Djalma M. Falcão 3
1 Alberto Luiz Coimbra Institute for Graduate Studies and Research in Engineering (COPPE), Federal University of Rio de Janeiro, Rio de Janeiro 21941-901, Brazil
2 Mathematics and Statistics Institute, Rio de Janeiro State University (UERJ), Rio de Janeiro 20550-900, Brazil
3 Department of Electrical Engineering, Alberto Luiz Coimbra Institute for Graduate Studies and Research in Engineering (COPPE), Federal University of Rio de Janeiro, Rio de Janeiro 21941-901, Brazil
* Author to whom correspondence should be addressed.
Energies 2026, 19(2), 569; https://doi.org/10.3390/en19020569
Submission received: 8 December 2025 / Revised: 14 January 2026 / Accepted: 17 January 2026 / Published: 22 January 2026
(This article belongs to the Section F5: Artificial Intelligence and Smart Energy)

Abstract

The increasing penetration of solar photovoltaic (PV) generation into power systems poses significant operational and planning challenges because the high variability in solar irradiance makes PV power forecasting difficult, particularly in the short term. These fluctuations originate from atmospheric dynamics that are only partially captured by numerical weather prediction (NWP) models. In this context, probabilistic forecasting has emerged as a state-of-the-art approach, providing central estimates together with a quantification of uncertainty for decision-making under risk. This work proposes a novel hybrid methodology for day-ahead, hourly-resolution point and probabilistic PV power forecasting. The approach integrates a multiple linear regression (LM) model that predicts global tilted irradiance (GTI) from NWP-derived variables, support vector quantile regression (SVQR) applied to the LM residuals to correct systematic errors and derive GTI quantile forecasts, and a linear mapping from GTI quantiles to PV power quantiles. Robust data preprocessing procedures, including outlier filtering, smoothing, gap filling, and clustering, ensured data consistency. The hybrid model was applied to a 960 kWp PV plant in southern Italy and outperformed benchmarks in terms of interval coverage and sharpness while maintaining accurate central estimates. The results confirm the effectiveness of hybrid risk-informed modeling in capturing forecast uncertainty and supporting reliable, data-driven operational planning in renewable energy systems.

1. Introduction

1.1. Background

Rising global temperatures, now exceeding 1.7 °C above pre-industrial levels in many parts of the world, are primarily the result of human activities that release greenhouse gases. Among these activities, energy production and consumption remain the dominant contributors, accounting for more than three-quarters [1] of total greenhouse gas (GHG) emissions globally. Adopted in 2015 at COP21, the Paris Agreement called for global efforts to limit the increase in average temperature to well below 2 °C relative to pre-industrial levels. It brought together 195 nations under shared mitigation and adaptation goals, emphasizing the pivotal role of the energy sector and the urgent need for the expansion of renewable and low-carbon technologies supported by coordinated policies, innovation, and financial mechanisms.
In this context, solar photovoltaic (PV) and wind power have emerged as key trends in the energy transition. According to the IEA [2], the share of renewable technologies in global electricity generation was 32% in 2024, and the main contributors were hydropower (14%), wind (8%), and solar PV (7%). Renewables are expected to surpass coal at the end of 2025 to become the largest source of electricity generation globally. The share of renewables in global electricity generation is projected to rise to 43% by 2030, while the share of intermittent renewable energy sources is set to almost double to 27%. Growth in utility-scale and distributed solar PV is expected to more than double and to represent nearly 80% of worldwide renewable electricity capacity expansion. Low module costs, relatively efficient permitting processes, and broad social acceptance drive the acceleration in solar PV adoption.
Despite significant progress, the intermittency of wind and solar generation—resulting from the hourly, daily, and seasonal variability in wind speed and solar irradiance—still poses challenges to power system integration due to the dispersion and limited predictability of these renewable energy sources. Accurate forecasting is essential to ensuring system stability, reliability, and economic efficiency, thereby driving the development of methods that can represent uncertainties across different planning horizons [3,4].
In this context, short-term solar photovoltaic (SPV) power forecasting models play a key role, including those with hourly resolution and horizons of up to one day ahead—which are the focus of this work. The formulation of such models requires an accurate characterization of ground-level irradiance [5], commonly quantified as global horizontal irradiance (GHI) or global tilted irradiance (GTI) on the plane of the photovoltaic (PV) modules. Solar irradiance at the surface is governed by meteorological and geographical determinants that influence the absorption, scattering, and reflection of solar radiation [6]. According to standard definitions [7,8,9], GHI comprises two principal components: direct normal irradiance (DNI), measured on a surface perpendicular to the sun’s rays, and diffuse horizontal irradiance (DHI), which accounts for sky-scattered radiation [9,10].
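For reference, the two components are tied to GHI by the standard closure relation, in which the direct beam is projected onto the horizontal plane. The symbol θ_z for the solar zenith angle is introduced here for illustration only and is not part of the original notation:

```latex
\mathrm{GHI} = \mathrm{DHI} + \mathrm{DNI}\,\cos\theta_z
```

As a purely hypothetical example, DNI = 800 W/m², DHI = 100 W/m², and θ_z = 60° give GHI = 100 + 800 × 0.5 = 500 W/m².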
Recent forecasting methodologies for solar irradiance rely on meteorological inputs from numerical weather prediction (NWP) models, and their combination with local measurement data is crucial for reliable short- and medium-term estimates [11]. These models simulate atmospheric dynamics from observed initial states. Among them, the European Centre for Medium-Range Weather Forecasts (ECMWF) is recognized for the high precision of its global model and the products of the Copernicus Atmosphere Monitoring Service (CAMS) [12], which provide estimates of GHI and its main components, as well as atmospheric composition. In this work, CAMS irradiance estimates are extensively used as the basis for modeling and evaluation.

1.2. Review of Solar PV Forecasting Methods

Relevant reviews of the literature can help to trace the historical progression of solar irradiance and photovoltaic (PV) power forecasting.
Early work relied mainly on physical and empirical models—such as persistence models and cloud motion vectors from satellite images—which offer physically based descriptions of the solar resource [13,14]. As continuous measurement datasets became available, statistical time-series approaches—notably AR, ARMA, ARIMA, and linear regression-based models—emerged as the primary tools for short-term forecasting and remained the standard baseline for several decades [13,14,15].
Two methods became increasingly prominent in solar forecasting research, in parallel with the increasing availability and scope of data from numerical weather prediction (NWP) [13] and advances in machine learning techniques. Artificial Neural Networks (ANNs) are widely used due to their ability to approximate nonlinear mappings [15,16], as well as Support Vector Machines (SVMs/SVR)—grounded in Statistical Learning Theory—due to their kernel-based flexibility, convex optimization, and robustness when handling high-dimensional datasets [16,17,18,19,20,21,22,23,24,25,26]. With their introduction, SVMs/SVR and ANNs demonstrated superior performance under cloud-driven and highly variable conditions, which significantly improved point-forecast accuracy. ANN deep learning architectures, designed to learn from vast amounts of input data, benefit from refined parameter analysis methods and provide enhanced forecasting capabilities [27,28].
As these techniques matured, hybrid methods, which combine two or more different methodologies (physical, statistical, and machine learning), became increasingly relevant for improving forecast accuracy by leveraging the strengths of each approach [5,27,28,29]. Recent reviews also note that hybrid forecasting systems tend to be modular and structured in functional stages [30], such as those intended for noise treatment, input data decomposition, and residual or error modeling [31,32,33].
Across these developments, state-of-the-art reviews have consistently identified the growing importance of probabilistic forecasting in obtaining operationally meaningful uncertainty information for grid integration [34,35]. The field has increasingly adopted probabilistic modeling approaches that provide forecasts based on quantiles and prediction intervals [16,34,35].
Quantile regression [27,36] (QR) based on the asymmetric pinball loss function has thus become a well-established reference technique for probabilistic solar forecasting, due to its properties such as robustness to asymmetries and extreme values and because it does not make assumptions about data distributions—characteristics that are highly desirable in forecasting irradiance and PV power. At the same time, support vector quantile regression (SVQR) [37], introduced in 2004 [38], combines the main attributes of QR with the key features of the SVM framework, including kernel-based flexibility, convex optimization, and robustness in high-dimensional settings. SVQR can handle nonlinear and heterogeneous structural problems in industrial, economic, and other systems and has emerged more recently as a promising alternative to solar power forecasting methodologies [39,40].
Focusing now on approaches that combine a hybrid structure, probabilistic forecasting, and a module dedicated to explicit residual modeling (to produce probabilistic forecasts, to improve forecasts, or both), the literature reports heterogeneous methodological strategies, including parametric distributions (Gaussian and Laplacian) [33,40], extreme region modeling [41], residual bootstrapping [42], generative models based on residual diffusion [43], and empirical quantiles derived from historical simulation [44]. To the best of our knowledge, no previous work combines explicit modeling of the residual structure with direct estimation of conditional quantiles in a convex, kernel-based formulation that makes no assumptions about data distributions, such as that offered by support vector quantile regression (SVQR). This gap motivated the framework developed in the present work.

1.3. Objectives and Contributions

The objective of this work is to develop and evaluate a three-stage hybrid probabilistic framework for short-term photovoltaic (PV) power forecasting, designed to jointly address point prediction accuracy and predictive uncertainty. The architecture integrates: (i) a multiple linear regression stage for computing point estimates of GTI; (ii) a support vector quantile regression (SVQR) module for explicit modeling of the residual component and estimation of conditional quantiles; and (iii) a final conversion stage that maps GTI quantiles into PV power quantiles. The main contributions and novelties of this work can be summarized as follows:
  • A novel hybrid probabilistic forecasting architecture is proposed, in which SVQR is not used as a standalone model, but rather as a residual-learning component that simultaneously refines point GTI forecasts and produces probabilistic quantile estimates. This dual role of SVQR within a multi-stage framework distinguishes the proposed approach from existing probabilistic solar forecasting methods.
  • An explicit residual-based uncertainty modeling strategy is introduced, enabling a more accurate characterization of conditional GTI uncertainty, particularly under complex and non-linear atmospheric conditions that are not fully captured by linear regression models.
  • Improved probabilistic forecasts of both GTI and PV power are achieved, with enhanced coverage and sharpness of predictive quantiles, including a more consistent representation of upper-tail events. This feature is especially relevant for risk-aware operation, reserve allocation, and planning and operation of power systems with high penetration of solar generation.
  • A comprehensive benchmarking framework is employed, including point and probabilistic models as well as ensemble-based baselines, allowing assessment of the proposed method against state-of-the-art alternatives.

1.4. Structure of the Paper

The remainder of this manuscript is organized as follows. Section 2 provides the technical background, reviews the forecasting approaches, defines the error metrics, and presents the proposed three-stage hybrid framework, describing the LM1 irradiance model, the SVQR residual-quantile module and the LM2 power-conversion stage. Section 3 outlines the case study, including the data sources, preprocessing procedures and variable-selection strategy. Section 4 reports and discusses the results for point and probabilistic forecasts. Section 5 concludes the study and indicates directions for future developments.

2. Methods

The methods adopted in this work represent complementary modeling strategies. Linear regression models (LMs) [45] and quantile regression (QR) [37,46,47,48] form the statistical baseline for estimating mean responses and conditional quantiles. Support vector regression (SVR) [17,18,19,20] and support vector quantile regression (SVQR) [49,50,51,52] extend this baseline through maximum-margin learning and kernel functions, enabling the representation of nonlinear and asymmetric data structures.
This combination provides both comparability with conventional benchmarks and access to advanced probabilistic modeling. Model performance was assessed using both point and probabilistic forecasting performance metrics to characterize accuracy, coverage, sharpness, and tail behavior.
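To make two of these properties concrete, the sketch below computes empirical interval coverage and mean interval width, a common proxy for sharpness. These are widely used definitions and may differ from the exact metrics defined later in the paper; the function names and sample values are hypothetical.

```python
def interval_coverage(y_true, lower, upper):
    """Fraction of observations that fall inside [lower, upper]."""
    inside = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return inside / len(y_true)


def mean_interval_width(lower, upper):
    """Average width of the prediction intervals; narrower means sharper."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)


# Hypothetical normalized GTI observations with 5% and 95% quantile forecasts.
y_obs = [0.10, 0.45, 0.80, 0.30]
q05 = [0.05, 0.30, 0.60, 0.35]
q95 = [0.20, 0.60, 0.90, 0.50]

coverage = interval_coverage(y_obs, q05, q95)
sharpness = mean_interval_width(q05, q95)
```

For a nominal 90% interval, a well-calibrated model would show coverage close to 0.90 over a long test period; among calibrated models, the one with the smaller mean width is preferred.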

2.1. Linear and Multiple Linear Regression Models (LMs)

The linear regression model establishes a mathematical function f that describes the relationship between a single variable y (explained or dependent variable) and one or more variables x1, x2, …, xp (explanatory or independent variables), that is, y = f(x1, x2, …, xp) + ε, where ε represents the non-systematic variation, i.e., the unobservable random component or random error.
Once the function f is obtained, it is possible to describe or predict the behavior of the explained variable based on the values that the explanatory variables may assume.
The function f is unknown and must be inferred from the observations of the variables y, x1, x2, …, xp. This inference process is typically referred to as regression analysis. A class of functions f of interest is one that establishes a linear relationship between the two types of variables.
Linear regression analysis can be simple or multiple, depending on the number of explanatory variables [45]:
(i) Simple Linear Regression: One dependent variable y is explained by one independent variable x, through the relationship
$$y = \beta_0 + \beta_1 x + \varepsilon \tag{1}$$
(ii) Multiple Linear Regression: One dependent variable y is explained by at least two independent variables x1, …, xp:
$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon, \quad p \ge 2 \tag{2}$$
In simple or multiple linear regression, the values of the regression coefficients β0, β1, …, βp are estimated from a random sample of size n from the population, corresponding to the tuples of values (yi, xi,1, …, xi,p), i = 1, …, n. This is carried out by formulating an optimization model that minimizes the sum of the squared residuals (the observable differences between the actual values yi and the values ŷi estimated by the model), and the solution is obtained analytically using the ordinary least squares (OLS) method.
The LM specified in (1) and (2) relies on a specific set of assumptions detailed in [45]. As a consequence of these assumptions, all conditional distributions (y|x) have the same variance, and the means of the conditional distributions, i.e., E(y|x), lie on the regression line in the case of simple linear regression. This means that OLS estimates the average value of y given the observed explanatory variables.
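The analytical OLS solution for the simple case (1) can be sketched in a few lines. This is a minimal illustration using the textbook closed form; the function name and the sample data are hypothetical and unrelated to the case study.

```python
def fit_simple_ols(x, y):
    """Closed-form OLS estimates (beta0, beta1) for y = beta0 + beta1 * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    beta1 = sxy / sxx
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1


# Hypothetical sample that roughly follows y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.1, 2.9, 5.2, 6.8]

beta0, beta1 = fit_simple_ols(x, y)
residuals = [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]
```

With an intercept in the model, the OLS residuals sum to zero, which is a quick sanity check on any implementation.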

2.2. Quantile Regression Model (QR)

Linear quantile regression (QR) was initially introduced in 1978 [37] and, unlike OLS, which estimates E(y|x), estimates the effect at conditional quantiles of y given x (y and x are random variables with observations yi and xi) [37,46]. Thus, QR describes how different parts of the conditional distribution of y vary with x, capturing distributional features that are not observable through mean regression alone [47].
Let Q(τ) be defined as the value for which the probability of obtaining values of y below Q(τ) is τ. In quantile regression, Q(τ), 0 < τ < 1, is expressed as a linear combination of known regressors (explanatory variables) and unknown coefficients, exactly as the mean is modeled in (multiple) linear regression. Thus, the τ-quantile is modeled as [48]
$$Q(\tau) = \beta_0(\tau) + \beta_1(\tau)\,x_1 + \cdots + \beta_p(\tau)\,x_p \tag{3}$$
where xj, j = 1, …, p, are the known explanatory variables and βj(τ), j = 0, …, p, are the unknown coefficients, which depend on τ and are determined from the observations (yi, xi,1, …, xi,p), i = 1, …, n.
The sample τ-quantile can be found by solving the following minimization problem:
$$\min_{q} \sum_{i=1}^{n} \rho_{\tau}(y_i - q) \tag{4}$$
where $\rho_{\tau}$ is the check function [48], also called the pinball loss function [52], given by
$$\rho_{\tau}(u) = \begin{cases} \tau u, & u \ge 0 \\ (\tau - 1)\,u, & u < 0 \end{cases} \tag{5}$$
Finally, estimates of the vector of unknown coefficients, $\hat{\beta}(\tau)$, are obtained by replacing q in (4) with (3) and solving the following linear programming problem:
$$\hat{\beta}(\tau) = \arg\min_{\beta} \sum_{i=1}^{n} \rho_{\tau}\left[ y_i - \left(\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p}\right) \right] \tag{6}$$
Observe that for τ = 0.5, this formulation reduces to the least absolute deviations (LAD) estimator.
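The link between the pinball loss and the sample quantile can be checked numerically. The sketch below is a brute-force illustration, not an efficient solver: it relies on the fact that the piecewise-linear objective attains its minimum at one of the observations, so it simply evaluates every observation as a candidate. The function names and data are hypothetical.

```python
def pinball_loss(u, tau):
    """Check (pinball) loss for a residual u at quantile level tau."""
    return tau * u if u >= 0 else (tau - 1.0) * u


def sample_quantile(y, tau):
    """Observation that minimizes the summed pinball loss over the sample."""
    def total_loss(q):
        return sum(pinball_loss(yi - q, tau) for yi in y)
    return min(sorted(y), key=total_loss)


# Hypothetical sample with one large value.
y = [1.0, 2.0, 3.0, 4.0, 10.0]

median = sample_quantile(y, 0.5)
q90 = sample_quantile(y, 0.9)
```

For τ = 0.5 the loss is symmetric and the minimizer is the sample median; for τ = 0.9 under-prediction is penalized nine times more heavily than over-prediction, which pushes the minimizer toward the upper tail.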

2.3. Support Vector Regression (SVR)

Support vector machines (SVMs), initially proposed in the early 1990s, were developed within the framework of Statistical Learning Theory to address pattern classification problems [17]. Subsequently, the method was extended to regression problems, leading to the development of support vector regression (SVR) [19]. A distinctive feature of SVR is that the extension from linear to nonlinear regression is direct via the kernel trick, which adds significant flexibility in modeling both linear and nonlinear relationships [19,20].
In traditional regression, the model parameters are estimated by fitting a hyperplane to the training data, typically using the ordinary least squares (OLS) criterion. The prediction of the dependent variable y in support vector regression (SVR) is also based on fitting a regression function. However, the hyperplane is obtained within the ε-insensitive loss framework, as expressed in the model [19,20]:
$$f(x) = w^{T}x + b \tag{7}$$
where the weights w represent the coefficients β1, β2, …, βp, and the intercept term b corresponds to the coefficient β0.
Two basic assumptions underlie this formulation: (i) all observations are assumed to lie within a hypertube of width 2ε, with ε > 0, and (ii) there exists a central hyperplane within this hypertube that models all observations. Geometrically (see Figure 1), this hyperplane can be interpreted as lying at the center of an ε-insensitive hypertube that bounds the variation in the observations.
The problem involves optimally fitting the regression hypertube to the data, ensuring that most observations fall within it while maximizing its width. This positioning involves defining the slope and intercept of the model so that the points lie close to the hypertube boundaries (see Figure 1b). This formulation leads to a quadratic programming problem [18,20]:
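$$\begin{aligned} \min_{w,\,b}\quad & \frac{1}{2}\,w^{T}w \\ \text{s.t.}\quad & y_i - (w^{T}x_i + b) \le \varepsilon \\ & (w^{T}x_i + b) - y_i \le \varepsilon \end{aligned} \tag{8}$$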
In practice, it is not always possible for all observations to lie entirely within the ε-hypertube since this would require substantial values of ε, which would compromise the model’s generalization capability (see Figure 2a). To address this limitation, slack variables (ξ) are introduced to quantify the deviation of observations from the hypertube boundaries, as illustrated in Figure 2b and Equation (9) [18,20].
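$$\begin{aligned} \min_{w,\,b,\,\xi^{-},\,\xi^{+}}\quad & \frac{1}{2}\,w^{T}w + C\sum_{i=1}^{n}\left(\xi_i^{+} + \xi_i^{-}\right) \\ \text{s.t.}\quad & y_i - (w^{T}x_i + b) \le \xi_i^{+} + \varepsilon \\ & (w^{T}x_i + b) - y_i \le \xi_i^{-} + \varepsilon \\ & \xi_i^{-},\ \xi_i^{+} \ge 0 \end{aligned} \tag{9}$$
$$\xi^{+} = \begin{cases} 0, & \text{if } y - f(x) \le \varepsilon \\ |y - f(x)| - \varepsilon, & \text{otherwise} \end{cases}$$
$$\xi^{-} = \begin{cases} 0, & \text{if } f(x) - y \le \varepsilon \\ |f(x) - y| - \varepsilon, & \text{otherwise} \end{cases}$$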
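The two slack definitions amount to measuring how far an observation lies outside the ε-hypertube, on either side. A minimal sketch, with a hypothetical function name and values:

```python
def slack_variables(y, f_x, eps):
    """Return (xi_plus, xi_minus) for one observation y and prediction f_x."""
    xi_plus = max(0.0, (y - f_x) - eps)    # observation above the tube
    xi_minus = max(0.0, (f_x - y) - eps)   # observation below the tube
    return xi_plus, xi_minus


above = slack_variables(y=1.0, f_x=0.7, eps=0.1)    # 0.3 above, 0.2 outside
inside = slack_variables(y=1.0, f_x=0.95, eps=0.1)  # within the tube
below = slack_variables(y=0.5, f_x=1.0, eps=0.1)    # 0.5 below, 0.4 outside
```

At most one of the two slacks is nonzero for any observation, and both are zero for points inside the tube, which is why such points do not contribute to the penalty term.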
In the optimization problem (9), the constant C is positive and controls the trade-off between maximizing the margin within the ε-hypertube and minimizing the slack variables, thus acting as a mechanism to prevent overfitting. The solution to the optimization problem (9) is also generally obtained in its dual form, as expressed in [18,20]:
$$\begin{aligned} \max_{\lambda_i^{+},\,\lambda_i^{-}}\quad & -\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(\lambda_i^{+}-\lambda_i^{-}\right)\left(\lambda_j^{+}-\lambda_j^{-}\right)x_i^{T}x_j + \sum_{i=1}^{n} y_i\left(\lambda_i^{+}-\lambda_i^{-}\right) - \varepsilon\sum_{i=1}^{n}\left(\lambda_i^{+}+\lambda_i^{-}\right) \\ \text{s.t.}\quad & \sum_{i=1}^{n}\left(\lambda_i^{+}-\lambda_i^{-}\right) = 0 \\ & 0 \le \lambda_i^{+},\ \lambda_i^{-} \le C, \quad i = 1, \ldots, n \end{aligned} \tag{10}$$
where λ denotes the Lagrange multipliers associated with the constraints of the optimization problem (9). The problem in (10) is solved using quadratic programming techniques, formulating the associated Lagrangian function to transform the problem into a system with only equality constraints and subsequently applying the Karush–Kuhn–Tucker (KKT) optimality conditions. At the optimal solution, the relationships between the primal problem (9) and the dual problem (10) are established by the following equations [18,20]:
$$w^{*} = \sum_{j=1}^{n}\left(\lambda_j^{+*}-\lambda_j^{-*}\right)x_j \tag{11}$$
$$b^{*} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - w^{*T}x_i\right) = \frac{1}{n}\sum_{i=1}^{n}\left[y_i - \sum_{j=1}^{n}\left(\lambda_j^{+*}-\lambda_j^{-*}\right)x_j^{T}x_i\right] \tag{12}$$
At the optimal solution, the points for which the difference between $\lambda_j^{+*}$ and $\lambda_j^{-*}$ is nonzero, i.e., $\lambda_j^{+*}-\lambda_j^{-*} \neq 0$, correspond to the support vectors. The predictive equation for a given input x0 is obtained by substituting (11) and (12) into (7), as expressed in [18,20]:
$$\hat{y} = f(x_0) = w^{T}x_0 + b = \sum_{i=1}^{n_{sv}}\left(\lambda_i^{+*}-\lambda_i^{-*}\right)x_i^{T}x_0 + b^{*} \tag{13}$$
where $n_{sv}$ is the number of support vectors, i.e., the number of observations for which $\lambda_j^{+*}-\lambda_j^{-*} \neq 0$.

The Kernel Trick

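A particularly attractive feature of support vector regression is the ease with which the linear model can be extended to the nonlinear case. This transformation involves replacing the inner product $x_j^{T}x_i$ in the linear regression Equations (7) and (13), as well as in optimization problem (9), with a kernel function (kernel trick), which maps the input space into a higher-dimensional feature space. In general, the kernel function can be written as [18,20]
$$K(x_i, x_j) = \phi(x_i)^{T}\phi(x_j) \tag{14}$$
By substituting (14) into the optimization problem in its dual form, (10) can be rewritten as [18,20]
$$\begin{aligned} \max_{\lambda_i^{+},\,\lambda_i^{-}}\quad & -\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(\lambda_i^{+}-\lambda_i^{-}\right)\left(\lambda_j^{+}-\lambda_j^{-}\right)K(x_i, x_j) + \sum_{i=1}^{n} y_i\left(\lambda_i^{+}-\lambda_i^{-}\right) - \varepsilon\sum_{i=1}^{n}\left(\lambda_i^{+}+\lambda_i^{-}\right) \\ \text{s.t.}\quad & \sum_{i=1}^{n}\left(\lambda_i^{+}-\lambda_i^{-}\right) = 0 \\ & 0 \le \lambda_i^{+},\ \lambda_i^{-} \le C, \quad i = 1, \ldots, n \end{aligned} \tag{15}$$
Accordingly, $w^{*}$ and $b^{*}$ can be expressed as [18,20]
$$\begin{aligned} w^{*} &= \sum_{j=1}^{n}\left(\lambda_j^{+*}-\lambda_j^{-*}\right)\phi(x_j) \\ b^{*} &= \frac{1}{n}\sum_{i=1}^{n}\left(y_i - w^{*T}\phi(x_i)\right) = \frac{1}{n}\sum_{i=1}^{n}\left[y_i - \sum_{j=1}^{n}\left(\lambda_j^{+*}-\lambda_j^{-*}\right)\phi(x_j)^{T}\phi(x_i)\right] \end{aligned} \tag{16}$$
or
$$b^{*} = \frac{1}{n}\sum_{i=1}^{n}\left[y_i - \sum_{j=1}^{n}\left(\lambda_j^{+*}-\lambda_j^{-*}\right)K(x_i, x_j)\right] \tag{17}$$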
Finally, the regression equation for a given input x0, taking into account the kernel function, is given by the equation [18,20]
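$$\begin{aligned} \hat{y} = f(x_0) &= w^{T}\phi(x_0) + b = \sum_{i=1}^{n_{sv}}\left(\lambda_i^{+*}-\lambda_i^{-*}\right)\phi(x_i)^{T}\phi(x_0) + b^{*} \\ \text{or}\quad \hat{y} = f(x_0) &= \sum_{i=1}^{n_{sv}}\left(\lambda_i^{+*}-\lambda_i^{-*}\right)K(x_i, x_0) + b^{*} \end{aligned} \tag{18}$$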
It can be observed from (18) that the SVR method constructs the regression equation as a weighted sum of kernel functions. Thus, without loss of generality, (18) can be rewritten as [18,20]
$$\hat{y} = f(x_0) = \sum_{i=1}^{n_{sv}+1}\alpha_i\,K(x_i, x_0) \tag{19}$$
where $\alpha_i$ represents the differences $\lambda_i^{+*}-\lambda_i^{-*}$, together with the intercept term $b^{*}$.
A relevant feature of the SVR approach with kernel functions, which adds flexibility, is that the mapping function does not require explicit computation; instead, it only requires the use of an appropriate kernel function. For example, in the specific case of linear regression, the kernel function in (14) can be expressed as [18,20]
$$K(x_i, x_j) = x_i^{T}x_j \tag{20}$$
On the other hand, when a nonlinear kernel function is employed, for instance the radial kernel in (21), a forecast can be obtained through nonlinear regression [18,20]:
$$K(x_i, x_j) = \exp\left(-\gamma\,\lVert x_i - x_j\rVert^{2}\right) \tag{21}$$
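The kernel form of the predictor is a weighted sum of kernel evaluations between the new input and the support vectors, plus an intercept. The sketch below evaluates it for the linear and radial kernels. The support vectors, coefficients (standing for the differences between the two multipliers), intercept, and γ value are hypothetical numbers chosen for illustration, not the output of a fitted model.

```python
import math


def linear_kernel(xi, xj):
    """Inner product of two input vectors."""
    return sum(a * b for a, b in zip(xi, xj))


def rbf_kernel(xi, xj, gamma):
    """Radial kernel exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * sq_dist)


def kernel_predict(x0, support_vectors, coefs, intercept, kernel):
    """Weighted sum of kernel values over the support vectors, plus intercept."""
    weighted = sum(c * kernel(sv, x0) for sv, c in zip(support_vectors, coefs))
    return weighted + intercept


# Hypothetical fitted quantities.
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
coefs = [0.5, -0.25]
intercept = 0.1
x0 = [0.0, 0.0]

y_lin = kernel_predict(x0, support_vectors, coefs, intercept, linear_kernel)
y_rbf = kernel_predict(
    x0, support_vectors, coefs, intercept,
    lambda a, b: rbf_kernel(a, b, gamma=1.0),
)
```

Only the kernel function changes between the two calls; the mapping into the feature space is never computed explicitly.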
In summary, support vector regression estimates a regression function by means of the ε-insensitive loss and an optimization framework that depends only on a subset of training points (support vectors). This leads to models that can incorporate nonlinear relationships through kernel functions and are less sensitive to extreme residuals than ordinary least squares. These properties provide a basis for extending the method to conditional quantiles in support vector quantile regression (SVQR), where the same structure is used to estimate different quantile levels.

2.4. Support Vector Quantile Regression (SVQR)

Whereas support vector regression (SVR) estimates the conditional expectation of the dependent variable, support vector quantile regression (SVQR) extends this framework to estimate conditional quantiles. The essential modification, similarly to the linear quantile regression (QR), consists of employing the asymmetric pinball loss function—also known as the tilted absolute value loss or check function [48], the definition of which is repeated here [52]:
$$\rho_{\tau}(u) = \begin{cases} \tau u, & u \ge 0 \\ (\tau - 1)\,u, & u < 0 \end{cases} \tag{22}$$
where τ ∈ (0, 1) denotes the quantile level. This modification shifts the regression function according to each chosen quantile, enabling the direct estimation of conditional quantiles within the maximum-margin learning paradigm. Figure 3 illustrates the shape of the pinball loss function.
Thus, the primal SVR problem in (9) is reformulated for quantile estimation to incorporate the estimation of conditional quantiles through the pinball loss, as described in [52]:
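$$\begin{aligned} \min_{w,\,b,\,\xi^{-},\,\xi^{+}}\quad & \frac{1}{2}\,w^{T}w + C\sum_{i=1}^{n}\left[\tau\,\xi_i^{+} + (1-\tau)\,\xi_i^{-}\right] \\ \text{subject to}\quad & y_i - (w^{T}x_i + b) \le \xi_i^{+} + \varepsilon \\ & (w^{T}x_i + b) - y_i \le \xi_i^{-} + \varepsilon \\ & \xi_i^{-},\ \xi_i^{+} \ge 0 \end{aligned} \tag{23}$$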
In the dual Formulation (15), the applicable constraints are also modified to reflect the asymmetric treatment of residuals imposed by the quantile level τ, leading to the revised conditions [52]:
$$0 \le \lambda_i^{+*} \le C\tau, \qquad 0 \le \lambda_i^{-*} \le C(1-\tau). \tag{24}$$
The remaining Formulations (16)–(21) can remain unchanged once these modifications are introduced.
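The effect of the quantile level on the dual box constraints can be seen by tabulating the two upper bounds for a few values of τ. This is only an arithmetic illustration of the revised conditions; the function name and the value of C are hypothetical.

```python
def svqr_dual_bounds(C, tau):
    """Upper bounds (C * tau, C * (1 - tau)) on the two dual multipliers."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    return C * tau, C * (1.0 - tau)


C = 10.0  # hypothetical regularization constant
bounds = {tau: svqr_dual_bounds(C, tau) for tau in (0.05, 0.50, 0.95)}
```

At τ = 0.5 both bounds are equal and the symmetric treatment of positive and negative residuals is recovered up to a rescaling of C; for high quantiles the multiplier tied to observations above the regression function is allowed to grow much larger than the one tied to observations below it.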
From a theoretical standpoint, both the statistical consistency and the preservation of optimal learning rates for support vector quantile regression with the pinball loss function can be demonstrated [49,50]. These results consolidate SVQR as a natural extension of SVR within the maximum-margin learning framework.
In practice, these advances have been incorporated into the liquidSVM package [51], which was employed in this work. The qtSVM solver implemented in this package applies the quantile formulation of the SVM and provides an optimized workflow through automatic hyperparameter selection, parallel computation, and simultaneous estimation of multiple quantiles. This connection between theoretical properties and practical implementation reinforces the applicability of SVQR to real-world probabilistic forecasting tasks, particularly those related to renewable energy generation.

2.5. Outline of the Proposed Forecasting Methodology

2.5.1. Forecast Model Structure

The proposed forecasting model, denoted propMod_LM1-SVQR-LM2 and outlined in the forecasting module presented in Figure 4, is organized into three sequential stages that integrate linear statistical models and machine learning techniques to capture both the central tendency and the uncertainty of short-term (24 h horizon) photovoltaic (PV) generation. The 24 h horizon was selected to match the operational needs of day-ahead scheduling.
In the first stage, a multiple linear regression model (LM1) estimates the global tilted irradiance (GTI) using numerical weather prediction (NWP) variables, including GHI, DNI, and DHI. This model produces point forecasts based on dominant linear relationships but, naturally, does not adequately capture nonlinear effects from clouds and atmospheric interactions. To address these limitations, the second stage models the LM1 residuals through support vector quantile regression (SVQR) trained on multiple quantiles (e.g., τ = 0.05; 0.10; 0.50; 0.90; 0.95) using additional variables selected from a set encompassing temperature, cloud cover, humidity, wind, and solar angles, thereby constructing predictive distributions of GTI that incorporate uncertainty.
In the third stage, the corrected quantile forecasts of GTI are transformed into probabilistic forecasts of PV power using a second linear model (LM2), which exploits the strong historical relationship between irradiance and generation after outlier removal. This model operates as a conversion step, transferring irradiance-related uncertainty to power while maintaining coherence with the physical behavior of the photovoltaic system.
Finally, the proposed forecast approach encompasses both point and probabilistic forecasts of global tilted irradiance (GTI) and photovoltaic power (P). A set of reference models was also implemented for comparative purposes.
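The data flow through the three stages can be sketched as follows. This is a deliberately simplified illustration and not the authors' implementation: LM1 and LM2 are reduced to one-predictor OLS fits, and the SVQR stage is replaced by an unconditional empirical quantile of the LM1 residuals, which ignores the additional covariates that SVQR conditions on. All names and numbers are hypothetical.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_mean - b * x_mean, b


def empirical_quantile(values, tau):
    """Order-statistic quantile; a simple stand-in for the SVQR stage."""
    ordered = sorted(values)
    rank = max(1, math.ceil(tau * len(ordered)))
    return ordered[rank - 1]


# Hypothetical training data (normalized irradiance, power in kW).
ghi_nwp = [0.10, 0.30, 0.50, 0.70, 0.90]     # NWP predictor
gti_obs = [0.12, 0.28, 0.55, 0.66, 0.95]     # measured GTI
p_obs = [100.0, 250.0, 500.0, 600.0, 870.0]  # measured PV power

# Stage 1 (LM1): point forecast of GTI from the NWP predictor.
a1, b1 = fit_line(ghi_nwp, gti_obs)
residuals = [g - (a1 + b1 * x) for x, g in zip(ghi_nwp, gti_obs)]

# Stage 3 (LM2): linear conversion from GTI to PV power.
a2, b2 = fit_line(gti_obs, p_obs)


def forecast_power_quantiles(ghi_new, taus):
    """Map one NWP input to PV power quantiles through the three stages."""
    gti_point = a1 + b1 * ghi_new
    forecasts = {}
    for tau in taus:
        # Stage 2 (stand-in): shift the point forecast by a residual quantile.
        gti_q = gti_point + empirical_quantile(residuals, tau)
        forecasts[tau] = a2 + b2 * gti_q
    return forecasts


power_q = forecast_power_quantiles(0.60, taus=(0.05, 0.50, 0.95))
```

Because the final mapping is linear with a positive slope, the ordering of the GTI quantiles carries over to the power quantiles, so the resulting quantiles do not cross in this simplified setting.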

2.5.2. Previous Data Processing Module

As depicted in Figure 5, the approach includes a module previously developed by the authors of [53] for the treatment of observed global tilted irradiance (GTI) and photovoltaic power (P) data. In this module, P and GTI measurements are analyzed jointly by employing statistical techniques [54,55], data mining algorithms [56], and reanalysis data [57]. The purpose is to correct discrepant values (outliers), substitute erroneous values (bad data), and fill data gaps (missing data).

2.5.3. Input Variables

As mentioned before, short-term photovoltaic (PV) power forecasts primarily depend on global horizontal irradiance (GHI) and global tilted irradiance (GTI), both of which are influenced by meteorological and geographical variables. Accordingly, modern methodologies incorporate not only direct observations but also variables derived from NWP models. These datasets include, in addition to irradiance components (global, direct, and diffuse), atmospheric parameters related to clouds, aerosols, ozone, water vapor, and albedo [58]. However, the inclusion of redundant or irrelevant variables may compromise the generalization capability of machine learning algorithms [59], making careful predictor selection essential [60,61].
In this work, multiple NWP sources were considered: CAMS [62] (also used as a benchmark), MERRA-2/SoDa [58,63], WWOnline [64], and NOAA [65], as well as two reference models—the naive persistence model (GTI observed with a 24 h lag) and CAMS GTI forecasts. The selected variables included GHI, BHI, DHI, DNI, and their clear-sky counterparts from CAMS; surface pressure, precipitation, humidity, shortwave irradiation, temperature, and wind from MERRA-2/SoDa; cloud cover, instantaneous and daily temperatures, dew point, visibility, precipitation, pressure, and UV index from WWOnline; and astronomical descriptors such as solar angles and true solar time from NOAA. Derived variables such as CAMS’s GTI (from GHI and its components), BHI_GTI, DHI_GTI, clear-sky index, and duration of nonzero GHI were also constructed. To reduce redundancies, the Least Absolute Shrinkage and Selection Operator (LASSO) method [66,67] was applied, yielding representative subsets employed in LM1 and SVQR. LASSO is a widely recognized method for identifying relevant subsets of predictors by penalizing multiple regression coefficients. The process results in a reduced and more parsimonious list by enabling a ranking of the relative contributions of each variable and automatically eliminating those with a minor impact.
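In its common textbook form, LASSO solves a penalized least-squares problem. The notation below (λ for the penalty weight, and the 1/(2n) scaling) is an assumption made for illustration; the exact variant and tuning used in this work are not stated in this section:

```latex
\hat{\beta}^{\mathrm{LASSO}} = \arg\min_{\beta_0,\,\beta}\;
\frac{1}{2n}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{i,j}\Big)^{2}
+ \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert, \qquad \lambda \ge 0
```

Larger values of λ drive more coefficients exactly to zero, which is what produces the reduced predictor list described above.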
Finally, only the measured historical series of GTI and P were preprocessed. Outliers were corrected by replacing them with median values obtained from self-organizing map (SOM) clusters [56]. Short gaps were filled using LOESS [55], and longer gaps were replaced with clear-sky model estimates. After processing, the series were normalized using min–max scaling to enhance model convergence.
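As a rough illustration of the min–max step, the Python sketch below learns the scaling bounds from one reference series and reuses them for scaling and back-transformation. The function names and the sample values are hypothetical, and learning the bounds from the training series only is an assumption about how leakage would be avoided, not a description of the authors' routine.

```python
def fit_min_max(values):
    """Return the (lo, hi) bounds of a reference series."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("a constant series cannot be min-max scaled")
    return lo, hi


def scale(values, bounds):
    """Map values to [0, 1] using previously fitted bounds."""
    lo, hi = bounds
    return [(v - lo) / (hi - lo) for v in values]


def unscale(scaled, bounds):
    """Invert scale() to recover physical units."""
    lo, hi = bounds
    return [s * (hi - lo) + lo for s in scaled]


train_gti = [0.0, 250.0, 500.0, 1000.0]  # hypothetical values in W/m^2
bounds = fit_min_max(train_gti)
scaled = scale(train_gti, bounds)
```

Values outside the fitted bounds would map outside [0, 1] when the same bounds are applied to new data, which is expected behavior for this kind of scaling.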

2.5.4. Training and Validation

To analyze forecasting model performance adequately, the predictors must be trained and tested on separate data. When a dataset large enough for a fixed split is not available, the evaluation can employ k-fold cross-validation (CV). First, the data are divided into k folds; second, the models are trained on all data except one held-out fold; third, the held-out fold is used to evaluate the model’s performance. The process is repeated until every fold has been used once for evaluation.
Although CV is a commonly employed tool for evaluating model performance, simply splitting a date-ordered dataset that does not span enough years could leave a seasonal bias between the training and test data in each split. To prevent this, the folds can be built so that training and test data remain balanced across the calendar rather than concentrated around specific dates.
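One simple way to keep folds balanced across the year is to assign whole days to folds in round-robin order, so that every fold samples every month. The Python sketch below illustrates that idea only; the round-robin rule and the function name are assumptions, not the procedure reported by the authors.

```python
from datetime import date, timedelta


def interleaved_day_folds(days, k):
    """Map each day to a fold index by round-robin over the sorted days."""
    return {d: i % k for i, d in enumerate(sorted(days))}


# One year of days matching the training/validation period of the case study.
start = date(2012, 4, 1)
days = [start + timedelta(days=n) for n in range(365)]
folds = interleaved_day_folds(days, k=10)

# Number of distinct calendar months represented in each fold.
months_per_fold = [
    len({d.month for d, f in folds.items() if f == i}) for i in range(10)
]
```

Assigning folds by day rather than by hour keeps all hourly records of a day together, which avoids placing strongly autocorrelated neighboring hours on both sides of a split.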
It should be noted that the results of a support vector regression (SVR) model are sensitive to the definition of its hyperparameters, a set formed by the constant C, the width of the hypertube ε, the kernel function, and its corresponding parameters.
Identifying the best configuration involves a grid search over candidate values for each hyperparameter. The error is evaluated for each configuration, and the configuration that yields the lowest error is selected. In this work, the error of each configuration was estimated using k-fold cross-validation [68], which consists of randomly dividing the dataset into k subsets of equal size.
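The combination of grid search and k-fold cross-validation can be sketched as follows. To stay short and dependency-free, the sketch tunes the penalty of a one-parameter ridge model as a stand-in for the SVR hyperparameters (C, ε, and the kernel parameters); the model, the grid, and the toy data are assumptions for illustration.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Randomly split the indices 0..n-1 into k folds of (near) equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_ridge_slope(xs, ys, lam):
    """Closed-form slope of a ridge regression through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cv_mae(xs, ys, lam, folds):
    """Mean absolute error over all held-out folds for one penalty value."""
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        w = fit_ridge_slope([xs[i] for i in train], [ys[i] for i in train], lam)
        errors += [abs(w * xs[i] - ys[i]) for i in held_out]
    return sum(errors) / len(errors)


def grid_search(xs, ys, grid, k=5):
    """Return the grid value with the lowest cross-validated MAE and all scores."""
    folds = k_fold_indices(len(xs), k)
    scores = {lam: cv_mae(xs, ys, lam, folds) for lam in grid}
    return min(scores, key=scores.get), scores


xs = [float(i) for i in range(1, 21)]
ys = [2.0 * x for x in xs]  # noiseless toy data
best_lam, scores = grid_search(xs, ys, grid=[0.0, 1.0, 10.0, 100.0])
```

The same folds are reused for every grid value so that configurations are compared on identical splits.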

2.5.5. Error Metrics

Point Error Metrics
Point error metrics play a crucial role in evaluating forecasts, even in studies focused on probabilistic approaches, as the central quantile (median) of a predictive distribution corresponds to a point forecast. Two widely used measures were adopted in this work: the mean absolute error (MAE) and the root mean square error (RMSE).
MAE provides a direct measure of accuracy; however, it is scale-dependent and may mask significant individual errors. RMSE, in turn, is more sensitive to outliers because it includes squared deviations, but it offers strong statistical interpretability by reflecting the variance and standard deviation of errors.
Although both metrics are scale-dependent, their application in this work is appropriate as comparisons are always made on a homogeneous basis. In addition to quantifying point accuracy, MAE and RMSE provide a natural link with probabilistic evaluation by serving as a reference for the performance of the central quantile relative to predictive distributions.
MAE and RMSE can be expressed by [69]
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i - y_i\right|,$$
and
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^{2}}$$
where $y_i$ denotes the observed value at instance $i$, $\hat{y}_i$ is the corresponding point forecast, and $N$ is the total number of observations. Both metrics summarize the average magnitude of forecast errors and are computed over the same evaluation sample used for the probabilistic models.
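The two point metrics translate directly into code. The following Python sketch uses hypothetical values in which one forecast is badly off, to show how RMSE reacts more strongly than MAE to a single large error.

```python
import math


def mae(forecast, observed):
    """Mean absolute error between paired forecasts and observations."""
    return sum(abs(f - o) for f, o in zip(forecast, observed)) / len(observed)


def rmse(forecast, observed):
    """Root mean square error between paired forecasts and observations."""
    return math.sqrt(
        sum((f - o) ** 2 for f, o in zip(forecast, observed)) / len(observed)
    )


observed = [100.0, 200.0, 300.0, 400.0]  # hypothetical values
forecast = [110.0, 190.0, 300.0, 440.0]  # the last forecast has a large error
point_mae = mae(forecast, observed)
point_rmse = rmse(forecast, observed)
```

Here the single 40-unit error raises RMSE well above MAE, which is the sensitivity to large deviations described in the text.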
Probabilistic Error Metrics
Probabilistic error metrics are crucial for evaluating forecasts beyond the central point and encompass properties such as coverage, sharpness, and tail performance [70].
Reliability measures the consistency between predicted probabilities and observed empirical frequencies and is usually quantified using the Prediction Interval Coverage Probability (PICP), which compares the empirical coverage rate of intervals with their nominal level.
Sharpness, in turn, expresses the degree of concentration of predictive distributions and is quantified using the Prediction Interval Normalized Average Width (PINAW), which calculates the normalized average width of the intervals; smaller values indicate greater informational power, provided that adequate coverage is maintained.
To simultaneously integrate reliability and sharpness, the Continuous Ranked Probability Score (CRPS) is employed. The CRPS is a proper scoring rule that measures the distance between the predictive distribution and the observation, serving as a probabilistic generalization of the absolute error.
In addition, under critical operational conditions, it becomes relevant to evaluate behavior at the extremes. We utilized the established Mean Excess Function [71] to measure the average severity of violations above the 95th percentile (p95), here referred to as UTailmean. Additionally, we introduced a variation of this metric, termed Median Excess (UTailmed). Unlike the traditional approach, which averages instance-by-instance deviations, the proposed UTailmed calculates the difference between the medians of the observed and predicted distributions within the tail. This comparison offers a specific assessment of central tendency behavior during extreme events, a perspective not typically covered by standard error metrics.
These metrics can be expressed by [70]
$$\mathrm{PICP} = \frac{1}{N}\sum_{i=1}^{N} I\left(y_i \in [L_i, U_i]\right)$$
where $L_i$ and $U_i$ denote the lower and upper bounds of the prediction interval, respectively, and $I\left(y_i \in [L_i, U_i]\right)$ is an indicator function that takes the value 1 when $y_i \in [L_i, U_i]$ and 0 otherwise.
$$\mathrm{PINAW} = \frac{1}{N}\sum_{i=1}^{N}\frac{U_i - L_i}{R}$$
where $R$ is a normalization factor (for example, the range of observed PV power values in the test set). Lower PINAW values indicate greater sharpness, and it is desirable to minimize them while maintaining the PICP close to the nominal coverage level. Again, $U_i$ and $L_i$ represent the upper and lower bounds of the prediction interval.
$$\mathrm{CRPS}\left(\hat{F}, x_0\right) = \int_{-\infty}^{\infty}\left(\hat{F}(x) - I\left(x \ge x_0\right)\right)^{2}\,dx$$
where $\hat{F}$ is the predictive cumulative distribution function, $x_0$ is the observed value, and $I\left(x \ge x_0\right)$ is an indicator function that takes the value 1 when $x \ge x_0$ and 0 otherwise. Lower CRPS values indicate better forecast quality. The physical interpretation of CRPS is associated with the average distance between the predicted probability distribution and the actually observed value.
$$\mathrm{UTail}_{\mathrm{mean}} = \mathrm{mean}\left(y_i - \hat{q}_{0.95,i} \;\middle|\; y_i > \hat{q}_{0.95,i}\right)$$
$$\mathrm{UTail}_{\mathrm{median}} = \mathrm{median}\left(y_i \;\middle|\; y_i > \hat{q}_{0.95,i}\right) - \mathrm{median}\left(\hat{q}_{0.95,i} \;\middle|\; y_i > \hat{q}_{0.95,i}\right)$$
where $\hat{q}_{0.95,i}$ represents the predicted 95th-percentile value for instance $i$.
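The interval and tail metrics above can be computed in a few lines. The Python sketch below is an illustration with hypothetical data; returning NaN when no observation exceeds the predicted 95th percentile is an assumption, since the text does not say how an empty tail is handled.

```python
from statistics import mean, median


def picp(obs, lower, upper):
    """Share of observations falling inside their prediction interval."""
    return sum(l <= y <= u for y, l, u in zip(obs, lower, upper)) / len(obs)


def pinaw(lower, upper, value_range):
    """Average interval width normalized by value_range (R in the text)."""
    return sum(u - l for l, u in zip(lower, upper)) / (len(lower) * value_range)


def utail_mean(obs, q95):
    """Mean exceedance over the predicted 95th percentile."""
    excess = [y - q for y, q in zip(obs, q95) if y > q]
    return mean(excess) if excess else float("nan")


def utail_median(obs, q95):
    """Median of tail observations minus median of their predicted quantiles."""
    tail = [(y, q) for y, q in zip(obs, q95) if y > q]
    if not tail:
        return float("nan")
    return median(y for y, _ in tail) - median(q for _, q in tail)


obs = [10.0, 50.0, 90.0, 120.0]    # hypothetical observations
lower = [0.0, 40.0, 60.0, 80.0]    # hypothetical lower bounds
upper = [20.0, 60.0, 80.0, 100.0]  # hypothetical upper bounds (used as q95)
```

With these values, two of the four observations lie inside their intervals and two exceed the upper bound, so the example exercises both the coverage and the tail calculations.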
In this work, PICP, PINAW, CRPS, and UTail were used to assess the reliability, sharpness, and statistical robustness of the probabilistic forecasts, providing a consistent basis for comparing the benchmark model with the proposed hybrid approach.
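For a predictive distribution represented by a finite set of equally weighted values, the CRPS integral has a closed form: the mean absolute distance to the observation minus half the mean absolute distance between members. The Python sketch below uses that identity. Representing the forecast as an equally weighted sample is an assumption; a forecast given as a few quantiles would first need to be turned into a full distribution, which this section does not specify.

```python
def crps_ensemble(members, observation):
    """CRPS of the empirical CDF of `members` against a single observation."""
    m = len(members)
    # Mean absolute distance between the members and the observation.
    term_obs = sum(abs(x - observation) for x in members) / m
    # Half the mean absolute distance between all pairs of members.
    term_spread = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return term_obs - term_spread


single = crps_ensemble([5.0], 3.0)      # a point forecast
pair = crps_ensemble([0.0, 2.0], 1.0)   # a two-member forecast
```

With a single member the score reduces to the absolute error, which matches the description of CRPS as a probabilistic generalization of the absolute error.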

3. Case Study

Data from a 960 kWp photovoltaic system [72] installed on the University of Salento campus in Monteroni di Lecce (Italy) as part of the European BEAMS project were used in this work. The dataset provides 640 days of concurrent hourly measurements of AC power output, plane-of-array irradiance, ambient temperature, and module temperature. The PV array system, mounted on a metal parking structure, consists of two sections with different tilt angles (3° and 15°), both oriented to the southeast (azimuth −10°), totaling 3000 modules with an effective area of approximately 4700 m². The plant is equipped with grid-connected inverters, GTI sensors, ambient and module temperature meters, and a SCADA system for continuous acquisition of GTI and power (P). This dataset is openly accessible, accompanied by detailed documentation, and provides a transparent, well-characterized basis for assessing the hybrid probabilistic forecasting framework proposed in this work.
The preprocessing involved multiple steps to ensure data consistency and quality and followed the approach described in Section 2.5.2. Initially, temporal realignment was performed based on reference profiles—specifically, adjustments to sunrise and sunset times—using GTI data from the SARAH product [73], calibrated and made available through the Renewables.ninja platform [74]. Next, a preliminary linear regression analysis was performed between GTI and P to identify and remove invalid measurements. Data gaps were filled using reanalysis data (for gaps > 4 h) or LOESS smoothing [55] (for gaps < 4 h). Outliers were detected using hourly boxplots and refined using self-organizing maps (SOMs), more specifically, the Kohonen map [56], with anomalous values replaced by median cluster profiles. Finally, all variables were normalized to the [0, 1] range, with separate processing for the training and test datasets to prevent information leakage.
The computational simulation utilized hourly GTI and P series for the periods from 1 April 2012 to 31 March 2013 (training/validation, 3906 records) and from 1 April 2013 to 31 December 2013 (testing, 3417 records), considering only time steps with nonzero irradiance. A complete 12-month cycle was chosen for the training/validation stage to preserve the seasonal structure of the data, ensuring that annual variability patterns were incorporated into the modeling process. Forecasts were performed using 24 h ahead rolling windows, covering the entire test interval. The hybrid model was calibrated using 10-fold cross-validation, which yielded the best performance.
For comparison, a set of benchmarks encompassing both point and probabilistic forecasts of GTI and P was implemented. All procedures were developed in R (version 4.5) [75] on Windows 11 on a notebook equipped with an Intel i7-12700H processor and 32 GB RAM. The liquidSVM package [76] was used for SVR/SVQR modeling with the default Gaussian kernel (“GAUSS_RBF”), as preliminary tests with the Poisson kernel yielded negligible differences in results. The base lm function and the rq function from the quantreg package [77] were employed for the LM and QR models, respectively. Customized routines were also developed for data preprocessing and computation of probabilistic metrics (PICP, PINAW, CRPS, and UTail).

4. Results and Discussion

4.1. Processing of Historical Measured GTI and P Data

The preparation process for the global tilted irradiance (GTI) and photovoltaic power (P) measured historical data was structured into multiple stages to ensure statistical consistency, physical plausibility, and preservation of the original temporal structure of the series (see Section 2.5.2 and Figure 5 [53]).
The main associated figures are described below.
  • Temporal Alignment and GTI–P Relationship: Initially, the hourly measured profiles were compared to reanalysis references, which identified significant temporal shifts in about 15% of the daily series. These records were realigned until the best fit was achieved, and a simple linear regression model was then fitted to characterize the functional relationship between GTI and P.
  • Filtering of Overestimations: GTI and P outliers were detected using hourly boxplots and replaced with the upper fence limit values. This correction affected 2.7% of irradiance observations and 2.9% of power records, reducing the impact of overestimations.
  • Gap Filling: About 27% of days contained missing data during the daytime period. Gaps longer than 4 h were filled using reanalysis data, while shorter gaps were treated with LOESS smoothing at a 99.5% confidence level. In total, 3.1% of records were adjusted, including power corrections via the linear model.
  • Profile Clustering: A total of 639 daily irradiance profiles were identified and organized into 25 groups using Kohonen Self-Organizing Maps. Within each cluster, hourly outliers were replaced with median values, correcting about 1.9% of records.
  • New Linear Model and Results: After the corrections, a new linear regression model was estimated between GTI and P using the processed data. The fit showed improved statistical coherence, with 3.6% of power records corrected as outliers. The final GTI and P profiles exhibited greater regularity and adherence to physical conditions.
  • Evaluation of Data Processing: Comparisons between the original and filtered series, in terms of temporal profiles, cumulative distributions, and autocorrelation functions, confirmed the effectiveness of the process. The treatment filled gaps and mitigated inconsistencies while preserving the global statistical properties and intrinsic temporal dependencies, ensuring the integrity of the data used for modeling.
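The overestimation filter described in the list above can be sketched in Python as follows. The sketch assumes the usual Tukey definition of the upper fence (third quartile plus 1.5 times the interquartile range) and a particular quartile interpolation; neither is stated in the text, and the function names and sample values are hypothetical.

```python
from statistics import quantiles


def upper_fence(values, k=1.5):
    """Tukey upper fence: Q3 + k * IQR (k = 1.5 is an assumed convention)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


def cap_overestimations(values_by_hour, k=1.5):
    """Replace values above the hourly upper fence with the fence itself."""
    capped = {}
    for hour, values in values_by_hour.items():
        fence = upper_fence(values, k)
        capped[hour] = [min(v, fence) for v in values]
    return capped


# Hypothetical observations for a single hour of the day (key = hour).
sample = {12: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0]}
capped = cap_overestimations(sample)
```

Grouping by hour of day before computing the fence keeps midday values from being compared with early-morning ones, which is the purpose of using hourly boxplots.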

4.2. Benchmarks Considered

For comparative purposes, a set of reference models was implemented to cover both point and probabilistic forecasts of global tilted irradiance (GTI) and photovoltaic power (P). These benchmarks range from simple point forecasting approaches to more sophisticated models based on machine learning and hybrid strategies, thereby establishing multiple baselines for evaluating the proposed hybrid model.
The following benchmark models were considered:
  • B1_Naive_point_GTI: point forecast for GTI, defined by the 24 h time shift in the observed series, assuming an exact repetition of the previous day’s irradiance profile;
  • B2_CAMS_point_GTI: point forecast of GTI derived from NWP irradiance data provided by CAMS;
  • B3_MLR_point_GTI: multiple linear regression model with explanatory variables selected from basic irradiance components (GHI, BHI, DHI, BNI) and clear-sky data, assessing direct linear relationships with measured GTI and corresponding to the first linear modeling stage (LM1) of the proposed hybrid model for GTI point forecasting;
  • B4_QR_point_GTI: quantile regression (QR) model for estimating the central quantile (median) of the conditional GTI distribution, obtained using the linear QR formulation, and used here as a reference point forecast for GTI;
  • B5_SVR_point_GTI: support vector regression model with a radial kernel function, used as a nonlinear benchmark for GTI point regression;
  • B6_Ensemble_point_GTI: ensemble point forecast based on the arithmetic mean of the forecasts provided by models B2, B3, B4, and B5, aiming to evaluate the performance gains from combining distinct forecasting approaches;
  • B7_QR_probabilistic_GTI: quantile regression (QR) model for generating probabilistic forecasts of GTI, corresponding to the 0.05, 0.10, 0.90, and 0.95 quantiles, provided by the same model as B4;
  • B8_Hybrid_QR_and_LM_probabilistic_P: probabilistic forecasts of P, corresponding to the 0.05, 0.10, 0.90, and 0.95 quantiles, generated by the same QR model as B4, coupled with the linear model LM2, which represents the final stage of the proposed hybrid framework.
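Two of the benchmarks in the list above are simple enough to sketch directly: the 24 h persistence forecast (B1) and the arithmetic-mean ensemble (B6). The Python code below is an illustration with hypothetical function names and data, not the authors' implementation.

```python
def naive_persistence(series, lag=24):
    """Forecast for hour t is the observation at hour t - lag (None if unavailable)."""
    return [series[t - lag] if t >= lag else None for t in range(len(series))]


def ensemble_mean(*forecasts):
    """Hour-by-hour arithmetic mean of several equally long forecast series."""
    return [sum(values) / len(values) for values in zip(*forecasts)]


hourly = [float(h) for h in range(48)]  # hypothetical two-day hourly series
persisted = naive_persistence(hourly)
combined = ensemble_mean([1.0, 2.0], [3.0, 4.0])
```

The first 24 entries of the persistence forecast are undefined because no observation from the previous day exists for them.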

4.3. Selection of Explanatory Variables

The selection of explanatory variables is a critical step for the performance of both the proposed hybrid model and the reference (benchmark) models as it defines the predictor base that underpins the model’s generalization capability. In this work, variable selection was based on LASSO. This process involved identifying relevant subsets of predictors by penalizing multiple regression coefficients, ranking the relative contributions of each variable, and automatically eliminating those with a minor impact, resulting in a reduced and more parsimonious list.

4.3.1. Selection in the Proposed Hybrid Model

As detailed in Section 2.5, the proposed hybrid model has a three-stage structure consisting of submodel LM1, responsible for the initial estimation of the GTI series based on irradiance variables with direct physical relationships to the historical GTI; submodel SVQR, dedicated to nonlinear and quantile modeling of the residuals from LM1; and submodel LM2, responsible for mapping the probabilistic GTI forecasts to photovoltaic power (P).
In LM1, the variables submitted to LASSO were restricted to those with direct physical relationships to global irradiance and its components. This choice ensures both physical interpretability and statistically consistent residuals for the subsequent stage. As a result, six explanatory variables were selected: one GTI-related variable (derived from CAMS GHI components), two GHI variables (from CAMS and MERRA-2/SoDa), two BHI variables (from CAMS, including one derived from CAMS GTI), and one clear-sky DHI variable (from CAMS).
In the SVQR stage, selection began with all available variables (meteorological, irradiance, numerical weather forecasts, and derived variables), with LM1 residuals as the dependent variable. The resulting list remained extensive after the initial LASSO application. To refine it, an additional cutoff criterion was applied based on cumulative contribution percentage, retaining only variables that accounted for up to 80% of the total relevance (80%_list). A progressive SVQR training process was subsequently implemented:
  • The first iteration included only the most relevant variable from 80%_list.
  • Additional variables were incorporated in decreasing order of relevance.
  • MAE was monitored at each step to identify performance saturation.
This procedure was repeated over three independent rounds, producing distinct lists. The combined evaluation employed MAE, RMSE, and quantile coverage at τ = {0.05, 0.10, 0.90, 0.95} for the power (P) forecast. Each sublist was ranked using penalty scores from 1 (best performance) to 10 (worst performance), and the sublist with the lowest total score was selected as the final set of explanatory variables.
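The cumulative-contribution cutoff and the progressive training loop can be outlined as below. This Python sketch makes several assumptions that the text leaves open: relevance is taken as a non-negative score per variable, a variable is kept while the cumulative share does not exceed the cutoff, and saturation is declared when the relative MAE gain falls below a threshold. The names, the threshold, and the MAE values are hypothetical.

```python
def cumulative_cutoff(relevance, cutoff=0.80):
    """Keep the top-ranked variables whose cumulative share stays within cutoff."""
    ranked = sorted(relevance, key=relevance.get, reverse=True)
    total = sum(relevance.values())
    kept, running = [], 0.0
    for name in ranked:
        running += relevance[name]
        if running / total > cutoff + 1e-12:
            break
        kept.append(name)
    return kept


def progressive_selection(ranked, evaluate, min_gain=0.01):
    """Add variables in rank order until the relative MAE gain drops below min_gain."""
    selected, best = [], None
    for name in ranked:
        score = evaluate(selected + [name])
        if best is not None and (best - score) < min_gain * best:
            break  # performance has saturated
        selected.append(name)
        best = score
    return selected


relevance = {"a": 50.0, "b": 30.0, "c": 15.0, "d": 5.0}  # hypothetical scores
shortlist = cumulative_cutoff(relevance)

mae_by_size = {1: 50.0, 2: 40.0, 3: 39.9}  # hypothetical MAE per subset size
chosen = progressive_selection(["a", "b", "c"], lambda s: mae_by_size[len(s)])
```

In practice `evaluate` would train the quantile model on the candidate subset and return its validation MAE; here it is replaced by a lookup table so that the sketch stays self-contained.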
At the end of this process, four variables were retained: clear-sky GHI, representing the fundamental radiative component for GTI forecasting and capturing variations related to atmospheric transparency; daily mean temperature, summarizing meteorological influences on radiation attenuation and photovoltaic efficiency; and two astronomical variables that characterize the irradiance profile on the tilted plane, namely the solar zenith angle, which describes the daily geometry of the solar path (solar height), and the solar declination, which captures the seasonal variations in the solar trajectory. In LM2, as anticipated in Section 2.5, only GTI was considered as the explanatory variable, owing to the strong correlation between global tilted irradiance (GTI) and measured photovoltaic power (P). This choice ensured the model’s physical interpretability and the consistency of the final forecasts. Indeed, a high coefficient of determination (R² = 0.9939) confirmed the suitability of a simple linear mapping for converting predicted GTI into power (P).
The resulting set of explanatory variables met two key requirements: (i) statistical relevance, ensured by the LASSO process and progressive validation, and (ii) physical coherence, particularly in the linear stage, which avoided artificial combinations that could compromise the residual interpretation. This approach enabled the hybrid model to exploit efficiently both direct physical information (through LM1) and complex nonlinear structures (through SVQR) while maintaining parsimony in the number of predictors.
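The linear mapping from GTI to power can be illustrated with a single-predictor least-squares fit applied to each GTI quantile. The Python sketch below uses hypothetical data and function names; it only shows the mechanics of such a mapping and makes no claim about the fitted coefficients of LM2.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b * x with one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    return my - b * mx, b


def map_quantiles(gti_quantiles, a, b):
    """Convert GTI quantiles into power quantiles with the fitted line."""
    return [a + b * q for q in gti_quantiles]


gti = [0.0, 500.0, 1000.0]   # hypothetical irradiance, W/m^2
power = [0.0, 400.0, 800.0]  # hypothetical power, kW
a, b = fit_line(gti, power)
p_quantiles = map_quantiles([100.0, 200.0, 300.0], a, b)
```

Because the fitted slope is positive, the mapping preserves the ordering of the quantiles, so non-crossing GTI quantiles remain non-crossing in power.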

4.3.2. Selection in the Benchmark Models

In the benchmark models, the selection of explanatory variables was also conducted using the LASSO algorithm, following the same methodological approach applied to the hybrid model.
Convergence was observed regarding the most relevant variables for models B3_MLR_point_GTI, B4_QR_point_GTI, B5_SVR_point_GTI, B7_QR_probabilistic_GTI, the QR submodel of B8_Hybrid_QR_and_LM_probabilistic_P, and the components of the hybrid model, resulting in the adoption of the same set of six variables used in submodel LM1. Accordingly, the CAMS irradiance variables (the GTI-related variable, GHI, the two BHI variables, and clear-sky DHI) as well as GHI from MERRA-2/SoDa were retained as the predictor base. Naturally, the B6_Ensemble_point_GTI model was not subject to this explanatory variable selection process, as it is derived directly from the arithmetic mean of the forecasts generated by its constituent models.
Thus, the strategy adopted for the benchmarks aimed to balance parsimony and representativeness, enabling the comparison models to reflect an explanatory foundation consistent with and comparable to that used in the proposed hybrid model.

4.4. Results of the Computational Simulation with Case Study Data

The evaluation of the methodology was conducted using an hourly series of global tilted irradiance (GTI) and measured photovoltaic power (P), separately considering the training/validation and testing stages. At each stage, the performance of the reference (benchmark) models and the proposed hybrid model was analyzed for both point and probabilistic forecasts.
Regarding the computational time required for model training, the simpler benchmarks and auxiliary components—specifically B1 (Naive_point_GTI), B2 (CAMS_point_GTI), B4 (QR_point_GTI), B6 (Ensemble_point_GTI), B7 (QR_probabilistic_GTI), B8 (QR_LM_probabilistic_P), and the P ~ GTI linear predictor LM2—exhibited negligible runtimes, consistently remaining below 0.1 s. The Linear Model point GTI predictor (LM1 and B3_point_GTI) recorded an intermediate execution time of 12.6 s. Notably, among the kernel-based methods, the proposed SVQR approach demonstrated superior efficiency, requiring 16.5 s compared to the standard SVR benchmark (B5), which demanded the highest processing time of 51.1 s. This performance relies on the proposed model utilizing an optimized set of four variables derived from a refined multi-stage selection process, in contrast to the six variables identified by the single-step LASSO employed for the SVR.

4.4.1. Performance in Point Forecasts

Table 1 and Table 2 present the GTI point forecast estimates for the training/validation and testing phases, respectively. The benchmark model B3_MLR_point_GTI was adopted as the baseline for comparison.
The results in Table 1 show that in the training/validation stage, the naive benchmark (B1) exhibited weak performance, with a mean absolute error (MAE) = 387.7 W/m² and a root mean square error (RMSE) = 483.7 W/m², both substantially higher than those of the reference, resulting in negative relative reductions. The CAMS-based benchmark (B2) achieved moderately better performance, although still within the negative reduction range (−14.4% for MAE and −9.0% for RMSE), while the linear quantile regression (B4) achieved only marginal improvements. The ensemble model (B6) demonstrated intermediate gains, outperforming B2, B3, and B4 with an MAE of 42.6 W/m² and an RMSE of 67.9 W/m², corresponding to reductions of 7.7% and 5.2%, respectively. However, it did not surpass the Support Vector Regression (B5), which yielded consistent reductions in error (17.4% in MAE and 13.2% in RMSE). Finally, the first two stages of the proposed hybrid model (propMod_LM1–SVQR) outperformed all benchmarks, achieving reductions of 30.5% in MAE and 14.7% in RMSE, confirming its strong ability to capture residual nonlinearities and enhance forecast accuracy.
In the testing stage (Table 2), a similar behavior was observed: the naive model (B1) continued to perform poorly, showing negative reductions exceeding −700%. The CAMS-based model (B2) again exhibited limitations, while the quantile regression (B4) and support vector regression (B5) provided incremental improvements. The ensemble model (B6) outperformed B2, B3, and B4, achieving an MAE of 43.7 W/m² and an RMSE of 70.2 W/m², corresponding to reductions of 6.4% and 3.1%, respectively. However, consistent with the training results, it did not surpass the Support Vector Regression (B5), which provided higher error reductions. Notably, the first two stages of the proposed hybrid model (propMod_LM1–SVQR) achieved reductions of 16.3% in MAE and 4.8% in RMSE, confirming its robust generalization and outperforming all benchmark models.
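The relative reductions quoted here are not defined explicitly in this section. A plausible reading, assuming they are computed against the B3 baseline for the same metric and the same stage, is

$$\Delta_{\%} = 100 \times \frac{E_{\mathrm{B3}} - E_{\mathrm{model}}}{E_{\mathrm{B3}}}$$

where $E$ stands for MAE or RMSE. Under this convention a positive value means a lower error than the baseline and a negative value a higher one. For instance, hypothetical errors of 50 for the baseline and 40 for a model would give a reduction of 20%, while a model error of 100 would give −100%.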
Regarding photovoltaic power (P) (Table 3 and Table 4), the proposed hybrid model (propMod_LM1–SVQR-LM2) achieved a MAE = 29.4 kW and RMSE = 49.4 kW during the training/validation stage, and 37.5 kW and 58.9 kW in the testing stage, respectively, demonstrating effective transfer of the improvements obtained for GTI to the final target variable.

4.4.2. Performance in Probabilistic Forecasts

In the probabilistic evaluation, prediction intervals defined by the p5, p10, p90, and p95 quantiles were adopted. Performance was then assessed using the metrics PICP (Prediction Interval Coverage Probability), PINAW (Prediction Interval Normalized Average Width), CRPS (Continuous Ranked Probability Score), and tail indicators (UTail).
In the case of GTI (Table 5), the probabilistic quantile regression benchmark (B7) demonstrated good performance, achieving PICP = 89.9% and PINAW = 14.2% during the training/validation phase, but it had relatively high CRPS values (33.3). It achieved PICP = 87.8%, PINAW = 12.9%, and CRPS = 33.5 in the testing stage. In comparison, the first two stages of the proposed hybrid model (propMod_LM1–SVQR) significantly reduced both the interval width (PINAW = 11.0% in training/validation and 10.5% in testing) and the CRPS (25.7 and 30.0, respectively) while maintaining coverage levels close to the 90% target. It also exhibited reasonable tail behavior, particularly around the median, reinforcing its ability to provide helpful information about the distribution’s extremes.
Regarding photovoltaic power (P) (Table 6), the hybrid QR–LM benchmark (B8) showed lower coverage, particularly in the testing stage (PICP = 78.8%), although with relatively narrow intervals (PINAW = 14.1% in training/validation and 12.9% in testing). In addition, the CRPS values remained high (28.6 and 29.2, respectively). The proposed hybrid model (propMod_LM1–SVQR–LM2) demonstrated clear improvements, achieving a better balance between coverage (PICP = 84.5% in training/validation and 81.9% in testing) and narrower interval widths (PINAW = 10.9% and 10.4%), along with reduced CRPS values (22.2 and 27.9). These results indicate that the proposed model provides more accurate and informative probabilistic forecasts.
A relevant aspect concerns the behavior in the distribution tails, assessed using the UTail indicators. In the case of GTI, the hybrid model (propMod_LM1–SVQR) exhibited a distinct improvement in the central tendency of the extremes: while the Mean Excess UTailmean increased in the test phase compared to the QR benchmark (B7), likely due to specific outliers, the Median Excess (UTailmed) was substantially reduced (from 13.0 W/m² to 5.1 W/m²). This indicates a correction of the systematic bias observed in the benchmark’s upper tail. Regarding photovoltaic power (P), the complete hybrid model (propMod_LM1–SVQR–LM2) followed a similar pattern. Although the Mean Excess remained comparable to the benchmark (B8), the proposed model successfully reduced the Median Excess during testing (from ≈17.6 kW to ≈13.9 kW). These results suggest that, beyond achieving gains in overall coverage and sharpness, the proposed methodology enhances the robustness of the forecast for the majority of high-value events, mitigating the typical risk of underestimation during generation peaks.
Clear and Non-Clear Sky Probabilistic Performance Effect
To address the limitations of aggregate metrics, the probabilistic performance was decomposed based on the clear and non-clear sky regimes, assessing how the proposed hybrid methodology adapts its prediction intervals to distinct uncertainty levels.
To implement this decomposition, an automated classification algorithm was applied to segregate days into clear-sky and non-clear-sky regimes based on comparisons between observed GTI and a clear-sky reference model. A day is classified as clear-sky only if it simultaneously satisfies three daily criteria [78,79,80]: (i) a mean irradiance ratio (GTI_observed/GTI_clearSky) greater than 0.85, ensuring sufficient magnitude; (ii) a temporal correlation coefficient exceeding 0.75, ensuring shape similarity; and (iii) a roughness index (normalized standard deviation of first-order differences) below 0.35, ensuring curve smoothness. All days failing to meet this joint set of conditions were categorized as non-clear sky. The application of these criteria yielded 120 clear and 177 non-clear days for the training/validation phase, and 107 clear versus 139 non-clear days for the testing phase, providing a consistent data distribution for the segregated analysis.
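The three-criterion classification can be sketched in Python as follows. The thresholds come from the text, but the exact definitions are assumptions: the irradiance ratio is taken as the ratio of daily means, and the roughness index is computed on the hourly observed-to-clear-sky ratio (standard deviation of its first differences divided by its mean). Only daytime hours with nonzero clear-sky irradiance are assumed to be passed in; function names and profiles are hypothetical.

```python
from statistics import mean, pstdev


def pearson(a, b):
    """Pearson correlation coefficient of two equally long series."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def is_clear_sky_day(observed, clear_sky, ratio_min=0.85, corr_min=0.75, rough_max=0.35):
    """Apply the magnitude, shape, and smoothness criteria jointly."""
    ratio = mean(observed) / mean(clear_sky)            # (i) magnitude
    corr = pearson(observed, clear_sky)                 # (ii) shape similarity
    index = [o / c for o, c in zip(observed, clear_sky)]
    steps = [b - a for a, b in zip(index, index[1:])]
    roughness = pstdev(steps) / mean(index)             # (iii) smoothness (assumed form)
    return ratio > ratio_min and corr > corr_min and roughness < rough_max


clear = [100.0, 300.0, 500.0, 700.0, 800.0, 700.0, 500.0, 300.0, 100.0]
sunny = [0.95 * c for c in clear]                       # hypothetical clear day
cloudy = [90.0, 100.0, 450.0, 150.0, 700.0, 200.0, 400.0, 100.0, 80.0]
```

A day is labeled clear only if all three conditions hold, so failing any single criterion is enough to place it in the non-clear group.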
The quantitative assessment of these regimes is summarized in Table 7, which contrasts the performance of the benchmarks B7 and B8 against the proposed hybrid model. The evaluation focuses on coverage reliability (PICP), interval sharpness (PINAW), and overall probabilistic accuracy (CRPS) during the testing phase. Initially, as expected, it can be observed that the average values presented in Table 5 and Table 6 are intermediate between those obtained in Table 7, where the segregation of clear and non-clear sky regimes is considered.
The results in Table 7 also reveal distinct performance characteristics depending on the cloud cover regime. Under clear-sky conditions, the proposed hybrid methodology exhibited its most significant advantages. For GTI, the model achieved a high coverage probability (PICP = 94.3%) while significantly reducing the interval width (PINAW = 9.2%) and improving overall probabilistic accuracy (CRPS = 18.6%) compared to the benchmark (89.1%, 12.0% and 24.3%, respectively). This efficiency effectively transferred to the photovoltaic power (P) forecasts, where the proposed model improved coverage to 84.5% (surpassing the benchmark’s 77.7%) and minimized both the CRPS to 20.5% and PINAW to 9.2%. These figures confirm the model’s ability to correctly identify stable atmospheric periods and narrow the uncertainty bounds accordingly, avoiding the excessive conservatism observed in the benchmark models.
In non-clear sky scenarios, characterized by higher volatility, the proposed model demonstrated consistently superior sharpness. For GTI, it maintained a narrower average interval (PINAW = 12.0%) compared to the benchmark (14.3%), resulting in a lower CRPS (39.9%). For power (P), CRPS increased and the coverage (PICP) dropped to approximately 80% for both models due to the difficulty in capturing rapid ramps; nevertheless, the proposed method maintained a tighter envelope (PINAW = 12.2% vs. 14.6% for the benchmark). This suggests that while the benchmark models tend to widen intervals indiscriminately to capture outliers, the proposed approach provides more precise and informative intervals, even under uncertain conditions.

4.4.3. Examples of Hourly Curves with Probabilistic Forecasts

Figure 6 and Figure 7 present examples (13 April 2013) of quantile forecast curves for (a) GTI and (b) P obtained using the quantile regression benchmark models B7_QR (GTI) and B8_QR-LM (P) and the proposed model (LM1–SVQR stage and the complete model, LM1–SVQR–LM2), including the quantile ranges from 0.05 to 0.95 and from 0.10 to 0.90 as well as the central forecast together with the observed values. The results indicate that the proposed model delivers sharper forecasts than the benchmark: its predictive bands are narrower while still covering the observations, whereas the benchmark produces wide, less informative bands. This precision is also notable during the rising and falling ramps, where the proposed model’s central forecast tightly tracks the observed data.
Additional examples for different days within the testing dataset are shown in Figure 8 and Figure 9. In general, the superior performance of the proposed model over the benchmarks B7_QR and B8_QR-LM is consistently reaffirmed. As also observed in Figure 6 and Figure 7, the proposed model exhibits excellent agreement with the observed data while maintaining narrower and more accurate quantile prediction bands.

4.4.4. Structural Characteristics of the SVQR Implementation

The implementation of the SVQR model in this study utilizes the qtSVM solver from the liquidSVM package [51]. A fundamental architectural feature of this solver is the independent estimation of each target quantile τ. Unlike simultaneous estimation methods, this approach treats each quantile as a distinct regression task. This strategy has dual implications: it allows for highly specific hyperparameter optimization for different parts of the probability distribution (addressed in Section Hyperparameter Optimization and Model Sensitivity), but it may require ex-post verification to obtain statistical monotonicity, if relevant (addressed in Section Crossing Quantile Evaluation).
Hyperparameter Optimization and Model Sensitivity
To ensure reproducibility and verify model robustness, we extracted the internal hyperparameters utilized by the solver. Since the training phase is not monolithic, the algorithm executes an automated grid search for every quantile and every validation fold. This process selects the optimal combination of kernel bandwidth γ and regularization strength λ that minimizes the validation error for that specific task, effectively acting as a localized sensitivity analysis.
The extracted parameters demonstrate a logical adaptation to the data’s statistical behavior. For the median (τ = 0.5), the solver consistently favored stability, selecting higher regularization (λ ≈ 10⁻³) and smoother kernels (γ < 0.99). Conversely, for the extreme tails (τ = 0.05 and 0.95), the algorithm increased model complexity by minimizing regularization (λ ≈ 10⁻⁶ to 10⁻⁷) and increasing kernel sensitivity (γ ≈ 1.71–2.96). This configuration allows the regression surface to track sharp stochastic variations and rapid gradients.
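The per-quantile selection described above can be sketched as follows. This is a minimal illustration, not the liquidSVM interface: the selection inside the solver is internal, and the function names, the candidate (γ, λ) pairs, and the validation predictions below are all hypothetical. The only ingredient taken from the text is the idea that each τ is scored separately (here with the pinball loss) and keeps its own best configuration.

```python
def pinball_loss(y, q, tau):
    """Mean pinball (quantile) loss of predictions q for quantile level tau."""
    total = 0.0
    for obs, pred in zip(y, q):
        diff = obs - pred
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y)


def select_per_quantile(y_val, candidates, taus):
    """Pick, for each tau, the configuration with the lowest validation loss.

    candidates maps a (gamma, lam) tuple to {tau: validation predictions}.
    """
    best = {}
    for tau in taus:
        best[tau] = min(
            candidates,
            key=lambda cfg: pinball_loss(y_val, candidates[cfg][tau], tau),
        )
    return best


# Hypothetical validation data and two illustrative configurations.
y_val = [1.0, 2.0, 3.0, 10.0]
candidates = {
    (0.9, 1e-3): {0.5: [1.0, 2.0, 3.0, 4.0], 0.95: [2.0, 3.0, 4.0, 5.0]},
    (2.0, 1e-6): {0.5: [0.0, 4.0, 1.0, 8.0], 0.95: [2.0, 3.0, 4.0, 11.0]},
}

best = select_per_quantile(y_val, candidates, [0.5, 0.95])
print(best)
```

In this toy setting the smoother configuration wins for the median and the more flexible one wins for the upper tail, which mirrors the pattern reported for the extracted parameters.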
Crossing Quantile Evaluation
While independent training allows for the precise tuning described above, it does not inherently enforce non-crossing constraints during the optimization process, as noted by Shin and Jung [81]. This can lead to quantile crossing—a phenomenon where estimated curves intersect, violating the statistical property of monotonicity (i.e., a higher-quantile curve must always exceed a lower-quantile curve).
To assess the extent of this issue in the proposed hybrid model, an ex-post verification was conducted on the subset of productive hours. Monotonicity violations were identified with frequencies ranging from 0.79% to 3.45% across consecutive quantile pairs for both Global Tilted Irradiance (GTI) and Photovoltaic Power (P). Although specific pairs (e.g., 0.90–0.95) exhibited frequencies slightly above 1%, the impact analysis indicates that their practical effect on the estimation of PV system production was negligible.
For GTI, the average error magnitude during crossing events is less than 0.2 W/m², a deviation virtually imperceptible relative to typical daytime irradiance levels. Regarding PV power, the effect was even lower, ranging from 0.04 to 0.14 kW. Consequently, even during these rare moments of inconsistency (occurring approximately 3% of the time in the worst-case scenario), the average estimation error remained below 0.15 kW. Given the 960 kWp capacity of the SPV system, these results confirm that the physical integrity and operational utility of the generation forecasts remain uncompromised.
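An ex-post check of this kind can be sketched as below: for each pair of consecutive quantile levels, count the hours in which the lower-quantile forecast exceeds the higher-quantile forecast and average the size of the violation. The function names and the toy forecasts are assumptions for illustration; the exact procedure and filtering of productive hours used in the study are not reproduced here.

```python
def crossing_stats(q_low, q_high):
    """Return (frequency in %, mean magnitude) of crossings where q_low > q_high."""
    gaps = [lo - hi for lo, hi in zip(q_low, q_high) if lo > hi]
    freq = 100.0 * len(gaps) / len(q_low)
    mean_mag = sum(gaps) / len(gaps) if gaps else 0.0
    return freq, mean_mag


def check_all_pairs(quantiles):
    """quantiles maps tau -> list of forecasts; checks consecutive tau pairs."""
    taus = sorted(quantiles)
    return {
        (a, b): crossing_stats(quantiles[a], quantiles[b])
        for a, b in zip(taus, taus[1:])
    }


# Hypothetical GTI quantile forecasts (W/m²) for four productive hours.
quantiles = {
    0.05: [100.0, 200.0, 300.0, 400.0],
    0.50: [150.0, 250.0, 350.0, 450.0],
    0.95: [200.0, 249.8, 400.0, 500.0],
}

report = check_all_pairs(quantiles)
for pair, (freq, mag) in report.items():
    print(pair, f"{freq:.2f}% of hours, mean magnitude {mag:.2f}")
```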

4.5. Discussion

The results indicate that combining a linear regression stage, which captures the primary physical relationship between GTI and P, with an SVQR module for residual-based probabilistic GTI estimation, followed by a linear mapping to probabilistic power forecasts, produces systematic improvements relative to all evaluated benchmarks. The proposed hybrid model demonstrated greater balance among accuracy, coverage, and robustness, achieving reductions in mean errors, narrower uncertainty intervals, coverage levels close to the desired targets, and a more consistent representation of events located in the tails of the frequency distributions.
The qualitative analysis of the temporal curves reinforces these findings. The central forecasts followed the observed GTI and P data more closely, particularly during rising and falling ramps, where the quantile bands appeared narrower and better positioned. In contrast, the probabilistic benchmarks showed wider, less calibrated intervals in several cases. The combined use of p5/p95 and p10/p90 bands proved useful, effectively capturing abrupt variations even on clear-sky days, as illustrated in Figure 6 and Figure 7.
The independent quantile training used by the SVQR, the central stage of the proposed model, enables precise tuning but lacks intrinsic non-crossing constraints [81]. Consequently, minor monotonicity violations (0.79–3.45%) were detected. However, these are physically negligible: the average error during such events remained below 0.2 W/m² (irradiance) and 0.15 kW (power). Given the 960 kWp plant capacity, these small deviations do not compromise the forecasts’ operational utility.
Despite the robustness of the framework, limitations became more evident in the testing stage, which constitutes the most rigorous test of generalization.
Regarding GTI, although the hybrid model achieved significant reductions relative to the multiple linear baseline (B3) and outperformed the probabilistic benchmark QR (B7) in interval width (PINAW) and CRPS, its improvement over the SVR (B5) in RMSE was modest (69.0 vs. 69.8 W/m²). This suggests that there may be cases where the SVR’s ability to capture nonlinearities remains competitive with the proposed hybrid approach.
In the probabilistic domain, the forecasting of photovoltaic power (P) reinforces this conclusion. The proposed hybrid model outperformed the B8 (QR + LM) benchmark in both PINAW and CRPS, producing narrower and more informative intervals. However, the coverage (PICP) remained below the nominal 90% level, reaching 80.5% during the testing stage. Furthermore, the tail analysis revealed a trade-off: while the model improved the median behavior in extreme events (UTailmed), the increase in mean excess (UTailmean) indicates a persistent sensitivity to sporadic large outliers.
Regime-based analysis validates the framework’s adaptability while highlighting volatility limits. Under clear skies, the model achieved high reliability (94.3% PICP) with significantly narrower intervals. Conversely, non-clear sky regimes posed challenges for all methods, with coverage dropping to ≈80% (below the 90% target) due to rapid generation ramps. However, the proposed model maintained superior sharpness (PINAW = 12.2% vs. 14.6% for the benchmark) and accuracy.
These drawbacks and findings point toward possible methodological refinements, particularly to address the coverage deficits observed under highly variable conditions. One promising direction is to upgrade the final conversion stage (LM2): replacing the simple linear mapping with a piecewise linear model could better capture the non-linear response of the PV system across distinct operating regimes (low, medium, and high irradiance), potentially reducing errors in the tails. Additionally, given the performance disparity between clear and non-clear skies, regime-based data segmentation (using cloudiness indices or extreme quantile clustering prior to training) could allow for more specialized tuning of the uncertainty envelopes. Finally, integrating SVQR into hybrid ensembles with more complex architectures is a potential way to improve operational robustness even under critical atmospheric volatility.
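To make the piecewise linear idea for LM2 concrete, the sketch below fits an independent least-squares line in each irradiance regime and routes a GTI value to the line of its regime. It is only one possible reading of the proposal: the breakpoints (250 and 700 W/m²), the function names, and the toy data are hypothetical, and continuity at the breakpoints is not enforced.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def fit_piecewise(gti, power, breakpoints):
    """Fit one line per regime; each regime needs at least two distinct GTI values."""
    edges = [float("-inf")] + list(breakpoints) + [float("inf")]
    segments = []
    for lo, hi in zip(edges, edges[1:]):
        xs = [g for g in gti if lo <= g < hi]
        ys = [p for g, p in zip(gti, power) if lo <= g < hi]
        segments.append((lo, hi, fit_line(xs, ys)))
    return segments


def predict_piecewise(segments, g):
    """Map a GTI value (or GTI quantile) to power using its regime's line."""
    for lo, hi, (slope, intercept) in segments:
        if lo <= g < hi:
            return slope * g + intercept
    raise ValueError("GTI value outside all segments")


# Hypothetical training pairs: GTI in W/m², power in kW.
gti = [0.0, 100.0, 200.0, 300.0, 400.0, 600.0, 800.0, 1000.0]
power = [0.0, 90.0, 180.0, 265.0, 345.0, 505.0, 645.0, 765.0]

segments = fit_piecewise(gti, power, breakpoints=[250.0, 700.0])
print(predict_piecewise(segments, 500.0))
```

Because the mapping stays increasing within each regime in this toy case, applying it to each GTI quantile preserves the quantile ordering inside a regime; behavior across breakpoints would need to be checked separately.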

5. Conclusions

This work presented the development, implementation, and validation of a hybrid framework for short-term probabilistic forecasting of photovoltaic (PV) power generation organized into three stages: an initial multiple linear regression (LM1) to generate GTI point forecasts; residual modeling via support vector quantile regression (SVQR), which refines the GTI predictions and produces corresponding quantile estimates; and a final conversion step using linear regression (LM2) to transform the corrected GTI quantiles into probabilistic PV power forecasts. This configuration effectively combines the interpretability of linear models with the flexibility of machine learning methods, resulting in accurate and well-calibrated point and probabilistic forecasts.
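The three-stage structure can be summarized in a short sketch. It is deliberately simplified and rests on several assumptions: a single hypothetical NWP predictor stands in for the multiple regression of LM1, and the SVQR stage is replaced by empirical quantiles of the LM1 residuals, which is a crude unconditional stand-in and not the kernel-based estimator used in the study. All names and values are illustrative.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def empirical_quantile(values, tau):
    """Linearly interpolated sample quantile (stand-in for SVQR)."""
    s = sorted(values)
    pos = tau * (len(s) - 1)
    i = int(pos)
    if i + 1 >= len(s):
        return s[-1]
    return s[i] + (pos - i) * (s[i + 1] - s[i])


def hybrid_forecast(nwp, gti, power, nwp_new, taus):
    a1, b1 = fit_line(nwp, gti)        # Stage 1 (LM1): GTI point forecast
    residuals = [g - (a1 * x + b1) for x, g in zip(nwp, gti)]
    a2, b2 = fit_line(gti, power)      # Stage 3 (LM2): GTI -> power mapping
    gti_point = a1 * nwp_new + b1
    forecast = {}
    for tau in taus:
        # Stage 2: residual quantile correction (simplified stand-in for SVQR)
        gti_q = gti_point + empirical_quantile(residuals, tau)
        forecast[tau] = a2 * gti_q + b2
    return forecast


# Hypothetical training data: NWP irradiance, measured GTI (W/m²), power (kW).
nwp = [100.0, 300.0, 500.0, 700.0, 900.0]
gti = [120.0, 280.0, 530.0, 690.0, 880.0]
power = [0.9 * g for g in gti]

forecast = hybrid_forecast(nwp, gti, power, 600.0, [0.05, 0.5, 0.95])
print(forecast)
```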
The case study, based on an hourly series of global tilted irradiance (GTI) and measured PV power (P) from a real 960 kWp plant located in Monteroni di Lecce, Italy, demonstrated consistent improvements over both point and probabilistic forecast benchmarks.
In point forecasts, the hybrid model reduced mean and quadratic errors relative to classical linear models and performed competitively with SVR, although the additional gain in RMSE was modest. A high coefficient of determination (R² ≈ 0.99) was preserved in LM2, supporting the physical coherence of the final conversion stage.
In probabilistic forecasts, the proposed model stood out by achieving narrower uncertainty intervals (low PINAW) and systematic reductions in CRPS, and by providing helpful information about extreme quantile ranges. Visual analyses confirmed the model’s ability to reproduce both stable and rapidly changing conditions under variable cloud cover. Nonetheless, some limitations were identified: probabilistic coverage (PICP) remained below the nominal 90% level for P, averaging around 80.5%, mainly due to volatility in non-clear sky regimes, and improvements over SVR in GTI RMSE were relatively small. Furthermore, the tail analysis revealed a trade-off: while the model improved the median behavior in extreme events (UTailmed), the increase in mean excess (UTailmean) indicates a persistent sensitivity to sporadic large outliers. Regime-based analysis further validated the framework’s adaptability while highlighting volatility limits. Under clear skies, the model achieved high reliability (94.3% PICP) with significantly narrower intervals. Conversely, non-clear sky regimes posed challenges for all methods, with coverage falling below the nominal 90% target because of rapid generation ramps. Crucially, however, the proposed model maintained superior sharpness (PINAW = 12.2% vs. 14.6% for the benchmark) and accuracy even in these volatile conditions.
These findings indicate potential directions for methodological refinement, including (i) adopting piecewise linear models in the LM2 stage to account for different irradiance regimes; (ii) segmenting data by cloudiness conditions or extreme quantiles to enhance probabilistic coverage; and (iii) integrating the current framework into more sophisticated hybrid ensembles.
In summary, SVQR-based probabilistic forecasting proved to be a promising approach for addressing the inherent uncertainties of photovoltaic generation. The proposed framework contributes to both applied research and operational practice by providing more calibrated, informative, and decision-relevant forecasts for power systems with increasing shares of variable and intermittent renewable energy sources. In practical terms, the availability of reliable predictive quantiles can directly support day-ahead operational scheduling, reserve allocation, and risk-aware decision-making by system operators and energy managers, as well as cost-effective expansion planning. In particular, the improved characterization of tail events may assist in anticipating high-generation scenarios, reducing imbalance costs, and enhancing the robustness of operational plans under uncertain weather conditions.

Author Contributions

Conceptualization, R.P.C., A.C.G.M. and D.M.F.; methodology, R.P.C., A.C.G.M. and D.M.F.; validation, R.P.C., A.C.G.M. and D.M.F.; formal analysis, R.P.C., A.C.G.M. and D.M.F.; investigation, R.P.C., A.C.G.M. and D.M.F.; resources, A.C.G.M. and D.M.F.; data curation, R.P.C.; writing—original draft preparation, R.P.C.; writing—review and editing, A.C.G.M. and D.M.F.; visualization, R.P.C.; supervision, A.C.G.M. and D.M.F.; project administration, A.C.G.M. and D.M.F.; funding acquisition, A.C.G.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are openly available in Data on Photovoltaic Power Forecasting Models for Mediterranean Climate at 10.1016/j.dib.2016.04.063, reference number [72].

Acknowledgments

This work was partially developed at the Stochastic Optimization and Simulation Laboratory Applied to Renewable Energy Systems (Lab SOLARES), at the Rio de Janeiro State University (UERJ). The first two authors acknowledge the support provided by the Capacity Building, Research, and Innovation Programme (Qualitec 2023), offered by the Office of the Pro-Rector for Graduate Studies and Research at UERJ, through the Innovation Department (InovUERJ).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. International Energy Agency. Greenhouse Gas Emissions from Energy Data Explorer. Available online: https://www.iea.org/data-and-statistics/data-tools/greenhouse-gas-emissions-from-energy-data-explorer (accessed on 2 December 2025).
  2. International Energy Agency. Renewables 2025—Analysis—IEA. Available online: https://www.iea.org/reports/renewables-2025 (accessed on 2 December 2025).
  3. Maceira, M.E.P.; Melo, A.C.G.; Pessanha, J.F.M.; Cruz, C.B.; Almeida, V.A.; Justino, T.C. Combining Monthly Wind and Inflow Uncertainties in the Stochastic Dual Dynamic Programming: Application to the Brazilian Interconnected System. Energy Syst. 2023.
  4. Falcao, D.M.; Taranto, G.N.; Hincapie, C.C.O. Chronological Simulation of the Interaction between Intermittent Generation and Distribution Network. In Proceedings of the 2013 IEEE PES Innovative Smart Grid Technologies LATIN AMERICA (ISGT LA 2013), Sao Paulo, Brazil, 15–17 April 2013.
  5. Kumar, D.S.; Yagli, G.M.; Kashyap, M.; Srinivasan, D. Solar Irradiance Resource and Forecasting: A Comprehensive Review. IET Renew. Power Gener. 2020, 14, 1641–1656.
  6. Housmans, C.; Ipe, A.; Bertrand, C. Tilt to Horizontal Global Solar Irradiance Conversion: An Evaluation at High Tilt Angles and Different Orientations. Renew. Energy 2017, 113, 1529–1538.
  7. Marion, B. A Model for Deriving the Direct Normal and Diffuse Horizontal Irradiance from the Global Tilted Irradiance. Sol. Energy 2015, 122, 1037–1046.
  8. Thekaekara, M.P. Solar radiation measurement: Techniques and instrumentation. Sol. Energy 1976, 18, 309–325.
  9. Sengupta, M.; Habte, A.; Wilbert, S.; Gueymard, C.; Remund, J.; Lorenz, E.; van Sark, W.; Jensen, A. Best Practices Handbook for the Collection and Use of Solar Resource Data for Solar Energy Applications: Fourth Edition; National Renewable Energy Laboratory (NREL): Golden, CO, USA, 2024.
  10. Ghassemi, A.; Myers, D.R.; Vignola, F.; Michalsky, J.; Stoffel, T. Solar Radiation: Practical Modeling for Renewable Energy Applications Solar and Infrared Radiation Measurements, 1st ed.; Myers, D.R., Ed.; CRC Press: Boca Raton, FL, USA, 2017; ISBN 978-1-4665-0294-9.
  11. Sørensen, M.L.; Nystrup, P.; Bjerregård, M.B.; Møller, J.K.; Bacher, P.; Madsen, H. Recent Developments in Multivariate Wind and Solar Power Forecasting. Wiley Interdiscip. Rev. Energy Environ. 2023, 12, e465.
  12. Inness, A.; Ades, M.; Agustí-Panareda, A.; Barr, J.; Benedictow, A.; Blechschmidt, A.M.; Jose Dominguez, J.; Engelen, R.; Eskes, H.; Flemming, J.; et al. The CAMS Reanalysis of Atmospheric Composition. Atmos. Chem. Phys. 2019, 19, 3515–3556.
  13. Inman, R.H.; Pedro, H.T.C.; Coimbra, C.F.M. Solar Forecasting Methods for Renewable Energy Integration. Prog. Energy Combust. Sci. 2013, 39, 535–576.
  14. Diagne, H.M.; David, M.; Lauret, P.; Boland, J.; Schmutz, N. Review of Solar Irradiance Forecasting Methods and a Proposition for Small-Scale Insular Grids. Renew. Sustain. Energy Rev. 2013, 27, 65–76.
  15. Voyant, C.; Notton, G.; Kalogirou, S.; Nivet, M.-L.; Paoli, C.; Motte, F.; Alexis, F. Machine Learning Methods for Solar Radiation Forecasting: A Review. Renew. Energy 2017, 105, 569–582.
  16. Benitez, I.B.; Singh, J.G. A Comprehensive Review of Machine Learning Applications in Forecasting Solar PV and Wind Turbine Power Output. J. Electr. Syst. Inf. Technol. 2025, 12, 54.
  17. Boser, B.E.; Guyon, I.M.; Vapnik, V.N. A Training Algorithm for Optimal Margin Classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT ’92), Pittsburgh, PA, USA, 27–29 July 1992; Association for Computing Machinery: New York, NY, USA, 2017; pp. 144–152.
  18. Drucker, H.; Burges, C.J.C.; Kaufman, L.; Smola, A.; Vapnik, V. Support Vector Regression Machines. In Advances in Neural Information Processing Systems 9 (NIPS 1996); NeurIPS; MIT Press: Cambridge, MA, USA, 1996.
  19. Schölkopf, B.; Smola, A.J.; Williamson, R.C.; Bartlett, P.L. New Support Vector Algorithms. Neural Comput. 2000, 12, 1207–1245.
  20. Hamel, L. Knowledge Discovery with Support Vector Machines; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2009; ISBN 9780470371923.
  21. Awad, M.; Khanna, R. Support Vector Regression. In Efficient Learning Machines; Apress: Berkeley, CA, USA, 2015; pp. 67–80.
  22. Moguerza, J.M.; Muñoz, A. Support Vector Machines with Applications. Stat. Sci. 2006, 21, 322–336.
  23. Xu, H.; Caramanis, C.; Mannor, S. Robustness and Regularization of Support Vector Machines. J. Mach. Learn. Res. 2009, 10, 1485–1510.
  24. Saadati, T.; Barutcu, B. Forecasting Solar Energy: Leveraging Artificial Intelligence and Machine Learning for Sustainable Energy Solutions. J. Econ. Surv. 2025, 39, 1929–1946.
  25. Rhafes, M.Y.; Moussaoui, O.; Raboaca, M.S. Machine Learning Models in Renewable Energy Forecasting: A Systematic Literature Review. Indones. J. Electr. Eng. Comput. Sci. 2025, 37, 1874–1886.
  26. Barhmi, K.; Heynen, C.; Golroodbari, S.; van Sark, W. A Review of Solar Forecasting Techniques and the Role of Artificial Intelligence. Solar 2024, 4, 99–135.
  27. Ahmed, R.; Sreeram, V.; Mishra, Y.; Arif, M.D. A Review and Evaluation of the State-of-the-Art in PV Solar Power Forecasting: Techniques and Optimization. Renew. Sustain. Energy Rev. 2020, 124, 109792.
  28. Tsai, W.C.; Tu, C.S.; Hong, C.M.; Lin, W.M. A Review of State-of-the-Art and Short-Term Forecasting Models for Solar PV Power Generation. Energies 2023, 16, 5436.
  29. Başaran, K.; Bozyiğit, F.; Siano, P.; Taşer, P.Y.; Kılınç, D. Systematic Literature Review of Photovoltaic Output Power Forecasting. IET Renew. Power Gen. 2021, 14, 3961–3973.
  30. Liu, H.; Chen, C.; Lv, X.; Wu, X.; Liu, M. Deterministic Wind Energy Forecasting: A Review of Intelligent Predictors and Auxiliary Methods. Energy Convers. Manag. 2019, 195, 328–345.
  31. Kou, M.; Wang, J.; Li, J.; Li, R.; Li, Z. A Framework for Photovoltaic Power Forecasting Based on Hybrid Data Reconstruction, Neural Network Models Fusion, and Multi-Objective Optimization. Eng. Appl. Artif. Intell. 2025, 162, 112400.
  32. Devi, K.V.B.; Srivenkatesh, M. An Advanced Hybrid Meta-Heuristic Model for Solar Power Generation Forecasting via Ensemble Deep Learning. Ing. Syst. d’Inf. 2023, 28, 1395–1407.
  33. Sansine, V.; Ortega, P.; Hissel, D.; Hopuare, M. Solar Irradiance Probabilistic Forecasting Using Machine Learning, Metaheuristic Models and Numerical Weather Predictions. Sustainability 2022, 14, 5260.
  34. Yang, D.; Kleissl, J.; Gueymard, C.A.; Pedro, H.T.C.; Coimbra, C.F.M. History and Trends in Solar Irradiance and PV Power Forecasting: A Preliminary Assessment and Review Using Text Mining. Sol. Energy 2018, 168, 60–101.
  35. Horat, N.; Klerings, S.; Lerch, S. Improving Model Chain Approaches for Probabilistic Solar Energy Forecasting through Post-Processing and Machine Learning. Adv. Atmos. Sci. 2025, 42, 297–312.
  36. Gneiting, T.; Lerch, S.; Schulz, B. Probabilistic Solar Forecasting: Benchmarks, Post-Processing, Verification. Sol. Energy 2023, 252, 72–80.
  37. Koenker, R.; Bassett, G., Jr. Regression Quantiles. Econometrica 1978, 46, 33–50.
  38. Takeuchi, I.; Furuhashi, T. Non-Crossing Quantile Regressions by SVM. In Proceedings of the 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No.04CH37541), Budapest, Hungary, 25–29 July 2004; pp. 401–406.
  39. Takamatsu, T.; Ohtake, H.; Oozeki, T. Support Vector Quantile Regression for the Post-Processing of Meso-Scale Ensemble Prediction System Data in the Kanto Region: Solar Power Forecast Reducing Overestimation. Energies 2022, 15, 1330.
  40. Yagli, G.M.; Yang, D.; Srinivasan, D. Reconciling Solar Forecasts: Probabilistic Forecasting with Homoscedastic Gaussian Errors on a Geographical Hierarchy. Sol. Energy 2020, 210, 59–67.
  41. Liu, J.; Liu, J.; Liu, X.; Ding, T.; Wang, G.; Liu, X.; Zhao, Y. Extreme Probabilistic Solar Power Prediction via Localized Sample Structure Recognition and Generalized Error Estimation. IEEE Trans. Sustain. Energy 2025, 16, 3110–3123.
  42. Doelle, O.; Kalysh, I.; Amthor, A.; Ament, C. Comparison of Intraday Probabilistic Forecasting of Solar Power Using Time Series Models. In Proceedings of the SEST 2021—4th International Conference on Smart Energy Systems and Technologies, Vaasa, Finland, 6–8 September 2021.
  43. Jing, T.; Chen, S.; Du, T.; Li, M. DiffSolar: Prior Knowledge-Guided Residual Diffusion Model for Enhanced Regional-Scale Probabilistic Solar Irradiance Forecasting. TechRxiv 2025.
  44. Uniejewski, B.; Ziel, F. Probabilistic Forecasts of Load, Solar and Wind for Electricity Price Forecasting. arXiv 2025, arXiv:2501.06180.
  45. Montgomery, D.C.; Peck, E.A.; Vining, G.G. Introduction to Linear Regression Analysis, 6th ed.; John Wiley & Sons: Hoboken, NJ, USA, 2021; ISBN 9781119578758.
  46. Koenker, R. Quantile Regression; Econometric Society Monographs; Cambridge University Press: Cambridge, UK, 2005; Volume 38, ISBN 1139444719.
  47. Fitzenberger, B.; Wilke, R.A. Quantile Regression Methods; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2015.
  48. Nielsen, H.A.; Madsen, H.; Nielsen, T.S. Using Quantile Regression to Extend an Existing Wind Power Forecasting System with Probabilistic Forecasts. Wind Energy 2006, 9, 95–108.
  49. Christmann, A.; Steinwart, I. Consistency of Kernel-Based Quantile Regression. Appl. Stoch. Models Bus. Ind. 2008, 24, 171–183.
  50. Steinwart, I.; Christmann, A. Estimating Conditional Quantiles with the Help of the Pinball Loss. Bernoulli 2011, 17, 211–225.
  51. Steinwart, I.; Thomann, P. liquidSVM: A Fast and Versatile SVM Package. arXiv 2017, arXiv:1702.06899.
  52. Anand, P. Uncertainty Quantification in SVM Prediction. arXiv 2025, arXiv:2505.15429.
  53. Pessanha, J.F.M.; Melo, A.C.G.; Caldas, R.P.; Falcão, D.M. An Approach for Data Treatment of Solar Photovoltaic Generation. IEEE Lat. Am. Trans. 2020, 18, 1563–1571.
  54. Härdle, W.; Simar, L. Applied Multivariate Statistical Analysis; Springer: Berlin/Heidelberg, Germany, 2003.
  55. Cleveland, W.S. Robust Locally Weighted Regression and Smoothing Scatterplots. J. Am. Stat. Assoc. 1979, 74, 829–836.
  56. Kohonen, T. Self-Organization and Associative Memory, 3rd ed.; Springer Series in Information Sciences; Springer: Berlin/Heidelberg, Germany, 1989; ISBN 978-3-540-51387-2.
  57. Pfenninger, S.; Staffell, I. Long-Term Patterns of European PV Output Using 30 Years of Validated Hourly Reanalysis and Satellite Data. Energy 2016, 114, 1251–1265.
  58. SoDa. CAMS Radiation Service. Available online: https://www.soda-pro.com/web-services/radiation/cams-radiation-service (accessed on 1 December 2025).
  59. Aler, R.; Martín, R.; Valls, J.M.; Galván, I.M. A Study of Machine Learning Techniques for Daily Solar Energy Forecasting Using Numerical Weather Models. In Intelligent Distributed Computing VIII (Studies in Computational Intelligence); Springer International Publishing: Cham, Switzerland, 2015; Volume 570, pp. 269–278.
  60. Hedar, A.R.; Almaraashi, M.; Abdel-Hakim, A.E.; Abdulrahim, M. Hybrid Machine Learning for Solar Radiation Prediction in Reduced Feature Spaces. Energies 2021, 14, 7970.
  61. Almaraashi, M. Investigating the Impact of Feature Selection on the Prediction of Solar Radiation in Different Locations in Saudi Arabia. Appl. Soft Comput. J. 2018, 66, 250–263.
  62. Schroedter-Homscheidt, M.; Azam, F.; Betcke, J.; Hoyer-Klick, C.; Lefèvre, M.; Saint-Drenan, Y.-M.; Wey, E.; Saboret, L. User Guide to the CAMS Radiation Service (CRS); Copernicus GmbH: Göttingen, Germany, 2021; Volume 6.
  63. GMAO. Modern-Era Retrospective Analysis for Research and Applications, Version 2 (MERRA-2). Available online: https://gmao.gsfc.nasa.gov/gmao-products/merra-2/data-access_merra-2/ (accessed on 1 December 2025).
  64. Weather API Documentation. World Weather Online. Available online: https://www.worldweatheronline.com/weather-api/api/docs/ (accessed on 1 December 2025).
  65. NOAA—National Oceanic and Atmospheric Administration. NOAA Solar Calculations. Available online: https://gml.noaa.gov/grad/solcalc/NOAA_Solar_Calculations_year.xls (accessed on 1 December 2025).
  66. Tibshirani, R. Regression Shrinkage and Selection Via the Lasso. J. R. Stat. Society. Ser. B Methodol. 1996, 58, 267–288.
  67. Zhang, H.; Wang, J.; Sun, Z.; Zurada, J.M.; Pal, N.R. Feature Selection for Neural Networks Using Group Lasso Regularization. IEEE Trans. Knowl. Data Eng. 2020, 32, 659–673.
  68. Kohavi, R. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal, QC, Canada, 20–25 August 1995.
  69. Hyndman, R.J.; Koehler, A.B. Another Look at Measures of Forecast Accuracy. Int. J. Forecast. 2006, 22, 679–688.
  70. van der Meer, D.W.; Widén, J.; Munkhammar, J. Review on Probabilistic Forecasting of Photovoltaic Power Production and Electricity Consumption. Renew. Sustain. Energy Rev. 2018, 81, 1484–1512.
  71. Coles, S. An Introduction to Statistical Modeling of Extreme Values; Springer: London, UK, 2001; pp. 78–91.
  72. Malvoni, M.; De Giorgi, M.G.; Congedo, P.M. Data on Photovoltaic Power Forecasting Models for Mediterranean Climate. Data Brief. 2016, 7, 1639–1642.
  73. Müller, R.; Pfeifroth, U.; Träger-Chatterjee, C.; Cremer, R.; Trentmann, J.; Hollmann, R. Surface Solar Radiation Data Set—Heliosat (SARAH)—Edition 1. Available online: https://navigator.eumetsat.int/product/EO:EUM:CM:MULT:SARAH_V001 (accessed on 20 November 2025).
  74. Pfenninger, S.; Staffell, I. Simulations of Hourly Power Output from Wind and Solar Power Plants—Renewables.ninja. Available online: https://www.renewables.ninja/ (accessed on 1 December 2025).
  75. R: The R Project for Statistical Computing. Available online: https://www.r-project.org/ (accessed on 1 December 2025).
  76. Thomann, P.; Steinwart, I. LiquidSVM: Documentation for R. Available online: https://pnp.mathematik.uni-stuttgart.de/isa/steinwart/software/R/documentation.html (accessed on 1 December 2025).
  77. CRAN: Package Quantreg. Available online: https://cran.r-project.org/web/packages/quantreg/index.html (accessed on 1 December 2025).
  78. Reno, M.J.; Hansen, C.W. Identification of periods of clear sky irradiance in time series of GHI measurements. Renew. Energy 2016, 90, 520–531.
  79. Engerer, N.A.; Mills, F.P. KPV: A clear-sky index for photovoltaics. Sol. Energy 2014, 105, 679–693.
  80. Stein, J.S.; Hansen, C.W.; Reno, M.J. The Variability Index: A New and Novel Metric for Quantifying Irradiance and PV Output Variability. In Proceedings of the World Renewable Energy Forum (WREF), Denver, CO, USA, 13–17 May 2012.
  81. Shin, W.; Jung, Y. Deep support vector quantile regression with non-crossing constraints. Comput. Stat. 2023, 38, 1947–1976.
Figure 1. Basic SVR formulation [20]. (a) ε-insensitive regression hypertube (shaded region), containing observations (blue dots) and SVR hyperplane (red solid line); (b) optimal alignment (support vectors as red dots).
Figure 2. SVR formulation with slack variables [20]. (a) ε-insensitive regression hypertube containing all observations and the resulting hyperplane (red dashed line); (b) optimal alignment using slack variables ξ, highlighting outliers (yellow dots).
Figure 3. ε pinball loss function [52].
Figure 4. Schematic diagram of the proposed model, propMod_LM1-SVQR-LM2, for short-term probabilistic forecasting of solar photovoltaic generation, combining statistical methods and machine learning in a hybrid approach.
Figure 5. Processing of measured historical GTI and P data [53].
Figure 6. Quantile regression benchmark model (B8_QR-LM_probP): example of quantile forecast curves for (a) forecasted global tilted irradiance GTI and (b) forecasted photovoltaic power P.
Figure 7. Proposed hybrid model (propMod_LM1–SVQR-LM2): example of quantile forecast curves for (a) forecasted global tilted irradiance GTI and (b) forecasted photovoltaic power P.
Figure 8. Quantile regression benchmark models B7_QR (GTI) and B8_QR-LM (P): additional examples of quantile forecast curves for global tilted irradiance (GTI) and PV power output (P). Sample dates (2013), top to bottom: 11–23 May; 11–23 July; 4–16 September; 17–29 December.
Figure 9. Proposed hybrid model (LM1-SVQR stage for GTI and complete LM1-SVQR-LM2 for P): additional examples of quantile forecast curves for global tilted irradiance (GTI) and PV power output (P). Sample dates (2013), top to bottom: 11–23 May; 11–23 July; 4–16 September; 17–29 December.
Table 1. Training/validation phase (GTI point forecast values). Bold values indicate the best-performing method for each criterion.

| Model | MAE (W/m²) | RMSE (W/m²) | MAE reduction (%) | RMSE reduction (%) |
|---|---|---|---|---|
| B1_Naive_point_GTI | 387.7 | 483.7 | −738.7 | −575.4 |
| B2_CAMS_point_GTI | 52.9 | 78.0 | −14.4 | −9.0 |
| B3_MLR_point_GTI | 46.2 | 71.6 | 0.0 | 0.0 |
| B4_QR_point_GTI | 45.8 | 72.6 | 1.0 | −1.4 |
| B5_SVR_point_GTI | 38.4 | 62.5 | 17.0 | 12.8 |
| B6_Ensemble_point_GTI | 42.6 | 67.9 | 7.7 | 5.2 |
| propMod_LM1-SVQR | 32.1 | 61.1 | 30.5 | 14.7 |

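The percentage-reduction columns in Table 1 appear to be computed relative to B3_MLR_point_GTI, the only row with a reduction of 0.0. A minimal sketch of that arithmetic follows, assuming the usual formula 100 × (reference − value) / reference. The dictionary keys and the helper `percent_reduction` are illustrative names, and because the inputs are the rounded table values, recomputed figures can differ from the published ones in the last digit.

```python
# MAE values (W/m²) transcribed from Table 1, training/validation phase.
mae = {
    "B1_Naive": 387.7,
    "B2_CAMS": 52.9,
    "B3_MLR": 46.2,
    "B4_QR": 45.8,
    "B5_SVR": 38.4,
    "B6_Ensemble": 42.6,
    "propMod_LM1-SVQR": 32.1,
}


def percent_reduction(value, reference):
    """Percentage reduction of an error metric relative to a reference model.

    Positive means lower error than the reference; negative means higher.
    """
    return 100.0 * (reference - value) / reference


# Assumed baseline: the row whose tabulated reduction is 0.0.
reference = mae["B3_MLR"]
reductions = {
    name: round(percent_reduction(value, reference), 1)
    for name, value in mae.items()
}

for name, red in reductions.items():
    print(f"{name:>18s}: {red:7.1f} %")
```

The same function applies to the RMSE column by swapping in the RMSE values and using 71.6 W/m² as the reference.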
Table 2. Testing phase (GTI point forecast values). Bold values indicate the best-performing method for each criterion.

| Model | MAE (W/m²) | RMSE (W/m²) | MAE reduction (%) | RMSE reduction (%) |
|---|---|---|---|---|
| B1_Naive_point_GTI | 406.6 | 501.4 | −770.6 | −592.4 |
| B2_CAMS_point_GTI | 55.6 | 81.7 | −19.1 | −12.8 |
| B3_MLR_point_GTI | 46.7 | 72.4 | 0.0 | 0.0 |
| B4_QR_point_GTI | 45.9 | 72.9 | 1.7 | −0.7 |
| B5_SVR_point_GTI | 42.1 | 69.8 | 9.9 | 3.6 |
| B6_Ensemble_point_GTI | 43.7 | 70.2 | 6.4 | 3.1 |
| propMod_LM1-SVQR | 39.1 | 69.0 | 16.3 | 4.8 |

Table 3. Training/validation phase (P point forecast values).

| Model | MAE (kW) | RMSE (kW) |
|---|---|---|
| propMod_LM1-SVQR-LM2 | 29.4 | 49.4 |

Table 4. Testing phase (P point forecast values).

| Model | MAE (kW) | RMSE (kW) |
|---|---|---|
| propMod_LM1-SVQR-LM2 | 37.5 | 58.9 |

Table 5. Training/validation and testing (GTI probabilistic forecast values). Bold values indicate the best-performing method for each criterion.

| Model | Phase | p5 (%) | p10 (%) | p90 (%) | p95 (%) | PICP (%) | PINAW (%) | CRPS (%) | UTail_mean (W/m²) | UTail_med (W/m²) |
|---|---|---|---|---|---|---|---|---|---|---|
| B7_QR_probGTI | Train./Val. | 5.0 | 10.1 | 90.1 | 94.9 | 89.9 | 14.2 | 33.3 | 32.8 | 4.8 |
| B7_QR_probGTI | Test | 7.8 | 12.9 | 88.7 | 95.6 | 87.8 | 12.9 | 33.5 | 31.6 | 13.0 |
| propMod_LM1-SVQR | Train./Val. | 4.0 | 9.1 | 90.0 | 95.6 | 91.6 | 11.0 | 25.7 | 33.4 | −1.6 |
| propMod_LM1-SVQR | Test | 6.1 | 12.5 | 92.0 | 95.7 | 89.7 | 10.5 | 30.0 | 51.8 | 5.1 |

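In Table 5 the PICP column matches the difference between the p95 and p5 empirical coverages to within rounding (for example, 94.9 − 5.0 = 89.9 for B7_QR_probGTI in training/validation). The sketch below shows how PICP and PINAW are commonly computed from a prediction interval. It assumes the textbook definitions (share of observations inside the interval, and mean interval width normalized by the observed range), which may differ in detail from the normalization used in the paper. The data are toy values, not taken from the study.

```python
def picp(y, lower, upper):
    """Prediction interval coverage probability, in percent.

    Share of observations that fall inside [lower, upper].
    """
    inside = [lo <= obs <= up for obs, lo, up in zip(y, lower, upper)]
    return 100.0 * sum(inside) / len(inside)


def pinaw(y, lower, upper):
    """Prediction interval normalized average width, in percent.

    Mean interval width divided by the range of the observations
    (assumed normalization; other choices such as rated capacity exist).
    """
    mean_width = sum(up - lo for lo, up in zip(lower, upper)) / len(y)
    return 100.0 * mean_width / (max(y) - min(y))


# Toy GTI observations (W/m²) with hypothetical 5% and 95% quantile forecasts.
y = [100.0, 300.0, 500.0, 700.0, 900.0]
q05 = [50.0, 250.0, 520.0, 600.0, 800.0]
q95 = [150.0, 350.0, 620.0, 800.0, 1000.0]

coverage = picp(y, q05, q95)
width = pinaw(y, q05, q95)
print(f"PICP = {coverage:.1f} %, PINAW = {width:.1f} %")
```

A forecast is better calibrated when PICP is close to the nominal level (90% for a p5 to p95 interval) and sharper when PINAW is smaller, so the two columns should be read together.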
Table 6. Training/validation and testing (P probabilistic forecast values). Bold values indicate the best-performing method for each criterion.

| Model | Phase | p5 (%) | p10 (%) | p90 (%) | p95 (%) | PICP (%) | PINAW (%) | CRPS (%) | UTail_mean (kW) | UTail_med (kW) |
|---|---|---|---|---|---|---|---|---|---|---|
| B8_QR-LM_probP | Train./Val. | 9.4 | 15.5 | 88.4 | 93.3 | 84.0 | 14.1 | 28.6 | 31.5 | 7.1 |
| B8_QR-LM_probP | Test | 15.2 | 22.3 | 91.1 | 94.0 | 78.8 | 12.9 | 29.2 | 28.9 | 17.6 |
| propMod_LM1-SVQR-LM2 | Train./Val. | 8.5 | 17.2 | 88.2 | 94.1 | 85.6 | 10.9 | 22.2 | 28.8 | 7.9 |
| propMod_LM1-SVQR-LM2 | Test | 12.3 | 23.4 | 91.1 | 94.2 | 81.9 | 10.4 | 27.9 | 45.6 | 13.9 |

Table 7. Clear and Non-Clear Sky Probabilistic Performance Effect (testing phase). Bold values indicate the best-performing method for each criterion.

| Variable | Weather Condition | Model | PICP (%) | PINAW (%) | CRPS (%) |
|---|---|---|---|---|---|
| GTI | Clear Sky | B7_QR_probGTI | 89.1 | 12.0 | 24.3 |
| GTI | Clear Sky | propMod_LM1-SVQR | 94.3 | 9.2 | 18.6 |
| GTI | Non-Clear | B7_QR_probGTI | 86.7 | 14.3 | 41.6 |
| GTI | Non-Clear | propMod_LM1-SVQR | 85.7 | 12.0 | 39.9 |
| P | Clear Sky | B8_QR-LM_probP | 77.7 | 12.0 | 24.1 |
| P | Clear Sky | propMod_LM1-SVQR-LM2 | 84.5 | 9.2 | 20.5 |
| P | Non-Clear | B8_QR-LM_probP | 79.8 | 14.6 | 33.7 |
| P | Non-Clear | propMod_LM1-SVQR-LM2 | 79.7 | 12.2 | 34.5 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Caldas, R.P.; Melo, A.C.G.; Falcão, D.M. Hybrid Linear and Support Vector Quantile Regression for Short-Term Probabilistic Forecasting of Solar PV Power. Energies 2026, 19, 569. https://doi.org/10.3390/en19020569
