1. Introduction
The continuously increasing installed capacity of photovoltaic (PV) power plants highlights the growing importance of accurate PV power generation forecasting for power system operation and market participation [1]. PV generation is primarily driven by solar irradiance; therefore, improving the accuracy of day-ahead plane-of-array (POA) irradiance forecasting is critical for reducing operational uncertainty under variable weather conditions.
In many practical applications, PV generation forecasts rely on weather forecast data or satellite-derived irradiance products rather than on-site measurements [2]. While satellite-based irradiance estimates provide broad spatial coverage and operational availability, they often exhibit systematic deviations from locally measured POA irradiance due to spatial resolution limits, local atmospheric conditions, and site-specific characteristics. As a result, statistical mapping and site adaptation of satellite-derived irradiance have emerged as important approaches for improving site-specific accuracy [3].
Existing research on solar irradiance modeling can be broadly grouped into three main directions. First, statistical post-processing and bias correction methods aim to reduce systematic deviations in satellite-derived irradiance products using machine learning techniques [4]. These approaches focus on improving site-specific accuracy through data-driven correction of satellite inputs.
Second, a large body of work has focused on direct irradiance forecasting using machine learning and deep learning models [5,6,7]. Tree-based ensemble methods such as XGBoost, CatBoost, LightGBM, and Gradient Boosted Trees have demonstrated strong performance in nonlinear regression and time-series prediction tasks [8,9,10,11], while deep learning architectures, including N-BEATS and N-HiTS, have shown promising results in capturing complex temporal patterns in energy-related data [12,13]. Third, hybrid approaches combining physical models and data-driven techniques, as well as studies on the impact of input data quality on forecasting performance, have also been explored [14].
However, several important aspects remain insufficiently addressed. First, the distinction between forecasting, site adaptation, and irradiance estimation tasks is often not explicitly defined, leading to ambiguity in problem formulation. Second, most studies focus primarily on improving predictive accuracy under realistic forecasting conditions, while relatively limited attention is given to isolating the intrinsic modeling capability of machine learning approaches under controlled input conditions. Third, the impact of irradiance estimation accuracy on downstream photovoltaic energy yield calculations is not consistently evaluated within a unified framework.
In this study, the problem is formulated as a regression-based site adaptation task, where day-ahead POA irradiance is forecasted from satellite-derived global horizontal irradiance (GHI) and auxiliary predictors within a time-series regression framework. Although the task is defined as day-ahead POA irradiance forecasting, it is evaluated under controlled input conditions to isolate the intrinsic modeling capability rather than full operational forecasting performance and can therefore be interpreted as an upper-bound performance scenario.
To address the defined problem, several machine learning and deep learning models are evaluated, including tree-based algorithms and deep learning architectures. The models are trained and evaluated using rolling training windows of 5, 10, and 15 days, motivated by recent findings suggesting that shorter training windows may better capture rapidly changing weather dynamics [15,16]. Hyperparameter optimization is performed using both grid search and automated Bayesian optimization techniques [17].
The target variable corresponds to measured plane-of-array (POA) irradiance, while satellite-derived global horizontal irradiance (GHI) is used as the primary input variable. Model performance is evaluated using multiple statistical metrics, including normalized RMSE (nRMSE), average R2 across forecasted days, average R2 for days with positive R2, the number of days with positive R2, and the percentage of days with positive R2 [18].
As baseline references, satellite-derived GHI is used as a simple proxy for site POA irradiance and compared with locally measured values. In addition, a day-ahead persistence model is introduced as a benchmark, where POA irradiance is approximated by the value from the same hour of the previous day.
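The persistence benchmark can be illustrated with a minimal sketch. It assumes an hourly POA series stored as a flat list with 24 values per day; the function name `persistence_forecast` and the toy values are hypothetical and are not taken from the study.

```python
from typing import List, Optional

HOURS_PER_DAY = 24  # hourly resolution of the aligned time series


def persistence_forecast(poa_hourly: List[float]) -> List[Optional[float]]:
    """Day-ahead persistence: the forecast for hour h of day d is the
    measured value at hour h of day d-1.

    The first day has no predecessor, so its forecasts are None.
    """
    forecast: List[Optional[float]] = []
    for i in range(len(poa_hourly)):
        if i < HOURS_PER_DAY:
            forecast.append(None)
        else:
            forecast.append(poa_hourly[i - HOURS_PER_DAY])
    return forecast


# Toy example (W/m2): the second day is half as bright as the first.
day1 = [0.0] * 6 + [100.0, 300.0, 500.0, 600.0, 500.0, 300.0, 100.0] + [0.0] * 11
day2 = [0.5 * v for v in day1]
forecast = persistence_forecast(day1 + day2)
```

In this toy case the forecast for the second day simply repeats the first day, so a change in cloud conditions between consecutive days translates directly into forecast error.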
The experimental setup is designed to reflect an operational forecasting structure while using ideal meteorological inputs to isolate the modeling capability and represent an upper-bound performance scenario for day-ahead POA irradiance forecasting under controlled conditions.
Following model evaluation, the best-performing model is selected for further analysis. The impact of statistical feature augmentation and changepoint-informed training is then investigated for this selected model in order to assess their contribution to performance improvements. Statistical feature augmentation is commonly used in machine learning-based forecasting to enrich input representation and capture nonlinear relationships between variables [19], while feature selection and dimensionality reduction techniques are often applied to improve model robustness and generalization performance [20].
Solar irradiance time series exhibit non-stationary behavior due to atmospheric regime transitions and abrupt variations in cloud dynamics. Changepoint detection provides a systematic framework for identifying structural breaks in temporal data [21,22].
Finally, the impact of irradiance forecasting accuracy on photovoltaic power generation is evaluated by comparing PV energy yield calculations under two scenarios: (i) using measured irradiance and temperature data, and (ii) using forecasted irradiance while keeping the remaining inputs unchanged. This comparison allows quantifying how irradiance forecasting errors propagate to PV energy production estimates.
The main contributions of this work can be summarized as follows:
Investigation of regression-based mapping and site adaptation of satellite-derived irradiance for day-ahead POA irradiance forecasting.
Comparative evaluation of multiple machine learning and deep learning models for day-ahead POA irradiance forecasting using sliding training windows.
Systematic comparison between grid search and automated hyperparameter optimization strategies.
Integration and evaluation of changepoint detection within a rolling forecasting framework for the selected best-performing model.
Quantification of the impact of irradiance input accuracy on photovoltaic power generation estimation.
The remainder of this article is structured as follows:
Section 2 presents the literature review.
Section 3 describes the methodology and modeling framework.
Section 4 introduces the test system.
Section 5 presents the data and preprocessing procedures.
Section 6 details the training and evaluation process.
Section 7 discusses the results.
Section 8 concludes the study.
2. Literature Review
Solar irradiance modeling and photovoltaic (PV) power forecasting have been widely studied, with existing approaches generally falling into several main categories.
A first group of studies focuses on statistical post-processing and site adaptation of satellite-derived irradiance data. Methods such as regression-based correction and empirical quantile mapping have been proposed to reduce systematic deviations between satellite estimates and ground measurements [4].
Similarly, simple regression models and machine learning techniques, including linear regression, XGBoost, and multilayer perceptrons, have been applied to correct satellite-derived global horizontal irradiance (GHI), often showing that increased model complexity does not always lead to substantial accuracy improvements [23]. However, these approaches are typically limited to static bias correction and do not explicitly account for temporal dependencies, rolling-window training strategies, or the estimation of plane-of-array (POA) irradiance relevant for PV systems.
A second research direction focuses on short-term irradiance and PV power forecasting using machine learning and deep learning models. Various approaches have been proposed, including tree-based ensemble methods, neural networks, and hybrid architectures combining multiple data sources [15,24,25,26]. These studies demonstrate predictive performance, particularly when combining forecasted meteorological inputs with historical measurements. In addition, advanced architectures such as LSTM, CNN, and hybrid deep learning frameworks have been applied to day-ahead forecasting tasks, often incorporating hyperparameter optimization strategies such as grid search or Bayesian optimization [15,27]. Nevertheless, most of these works focus on direct forecasting of irradiance or power, rather than on regression-based mapping between satellite-derived inputs and locally measured POA irradiance. Several studies have also evaluated the performance of different photovoltaic power prediction models under real operating conditions, highlighting the influence of local climate and measurement conditions on model accuracy [28]. In particular, it has been shown that prediction models may not generalize across different geographic and weather conditions, reinforcing the need for site-specific modeling approaches.
A third group of studies investigates model architecture design and performance optimization in time-series forecasting. Comparative analyses of machine learning and deep learning models, including N-BEATS, N-HiTS, TCN, and Transformer-based architectures, have shown that model performance depends strongly on the forecasting horizon, data characteristics, and architectural design choices [12,29,30,31]. In addition, the selection of appropriate time lags and input structures has been shown to significantly influence model performance in time-series forecasting tasks [32]. In particular, recent studies highlight that carefully designed architectures and appropriate training strategies may provide significant performance gains without increasing model complexity. However, these works are typically evaluated in standard forecasting setups and do not explicitly address site adaptation or the influence of input data errors.
In parallel, several studies have examined the role of data quality and availability in PV forecasting. Investigations into the use of global meteorological datasets as substitutes for local measurements suggest that acceptable accuracy can be achieved under certain conditions, although the replacement of local data introduces additional uncertainty [14]. Other works have explored transfer learning approaches, where models trained on existing PV plants are adapted to new sites using limited local data [27]. While these approaches address practical deployment challenges, they do not explicitly isolate the contribution of the modeling approach from the uncertainty introduced by input data.
Despite the extensive body of literature, several important gaps remain. First, the distinction between forecasting, site adaptation, and irradiance estimation tasks is often not clearly defined, leading to ambiguity in problem formulation. Second, relatively limited work has focused on evaluating the intrinsic modeling capability of machine learning approaches under controlled input conditions, where the influence of errors in meteorological forecast inputs is minimized. Third, the combined problem of regression-based site adaptation, temporal forecasting structure, and plane-of-array irradiance estimation has not been systematically addressed within a unified evaluation framework.
This study addresses these gaps by formulating the problem as a regression-based site adaptation task for day-ahead POA irradiance forecasting using satellite-derived GHI and auxiliary predictors within a time-series framework. In contrast to existing studies, the proposed approach combines rolling-window training, systematic model comparison, and controlled-input evaluation to isolate modeling performance and assess its impact on downstream PV energy yield estimation.
3. Methodology
The methodological framework of this study combines machine learning and deep learning models with changepoint detection techniques and a rolling-window training strategy for day-ahead forecasting of plane-of-array (POA) irradiance from satellite-derived global horizontal irradiance (GHI) and additional meteorological predictors. The main models used in this research include XGBoost, CatBoost, LightGBM, Gradient Boosted Trees, N-BEATS, N-HiTS, and Temporal Convolutional Networks (TCN).
3.1. Tree-Based Machine Learning Models
Tree-based gradient boosted ensemble methods were selected due to their strong performance in nonlinear regression problems, robustness to multicollinearity, and ability to model complex interactions between meteorological inputs and solar radiation.
3.1.1. XGBoost (eXtreme Gradient Boosting)
XGBoost is a gradient boosting framework based on additive tree ensembles with second-order optimization. It incorporates regularization terms to control model complexity and prevent overfitting. The algorithm sequentially fits decision trees to the residuals of previous trees using gradient descent in function space [33].
3.1.2. CatBoost
CatBoost is a gradient boosting algorithm that employs ordered boosting and symmetric (oblivious) trees. The symmetric tree structure ensures fast inference and stable behavior across varying training window sizes [34]. In this study, CatBoost is evaluated under rolling training schemes to assess its robustness to short historical samples.
3.1.3. LightGBM
LightGBM is a histogram-based gradient boosting framework that grows trees leaf-wise rather than level-wise [35]. This approach enables efficient training and improved accuracy on large feature spaces. LightGBM incorporates Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), which reduce computational cost while preserving predictive power. Its ability to efficiently handle large lag structures makes it suitable for time-series forecasting with multiple autoregressive inputs.
3.1.4. Gradient Boosted Trees
Gradient Boosted Trees serve as a baseline ensemble boosting model. Unlike the optimized frameworks above, this implementation follows classical stage-wise additive modeling without advanced sampling strategies. GBT provides a reference point for evaluating the added value of specialized boosting frameworks under identical sliding-window validation conditions [36].
3.2. Deep Learning Models
Deep learning architectures were included to evaluate their ability to capture temporal dependencies and long-range patterns in solar radiation time series.
The models were trained using rolling windows identical to those used for the tree-based methods, enabling a consistent comparison.
3.2.1. N-BEATS
N-BEATS (Neural Basis Expansion Analysis for Time Series) is a fully connected deep neural architecture based on backward and forward residual stacking. The model decomposes time series into trend and seasonal components through learned basis expansions. Its block-based structure allows flexible representation of nonlinear patterns without requiring recurrent connections [37]. In this study, N-BEATS is applied in both univariate and multivariate configurations, incorporating lagged solar radiation and meteorological covariates.
3.2.2. N-HiTS
N-HiTS (Neural Hierarchical Interpolation for Time Series) extends the N-BEATS architecture by introducing hierarchical interpolation and multi-resolution learning. It improves long-horizon forecasting stability by processing time series at multiple temporal scales [38]. This property makes N-HiTS suitable for day-ahead forecasting, where capturing both intra-day patterns and inter-day variability is critical.
3.2.3. TCN
Temporal Convolutional Networks (TCN) are convolutional architectures specifically designed for sequential data. Instead of recurrent connections, TCN employs causal and dilated convolutions to model long-range dependencies. The dilation mechanism enables exponential growth of the receptive field while maintaining computational efficiency. Compared to recurrent neural networks, TCN offers parallel training and improved gradient stability. In the context of solar radiation forecasting, TCN can extract temporal patterns across multiple lag horizons while incorporating exogenous covariates [39].
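The exponential growth of the receptive field can be made concrete with a short calculation. The sketch below assumes that layer i uses a dilation of `dilation_base**i` and, by default, one causal convolution per layer; actual TCN implementations often stack two convolutions per residual block, which the hypothetical `convs_per_layer` argument accounts for.

```python
def tcn_receptive_field(kernel_size: int, dilation_base: int, num_layers: int,
                        convs_per_layer: int = 1) -> int:
    """Receptive field, in time steps, of a stack of causal dilated convolutions.

    Each convolution with dilation d widens the receptive field by
    (kernel_size - 1) * d time steps.
    """
    field = 1
    for layer in range(num_layers):
        dilation = dilation_base ** layer
        field += convs_per_layer * (kernel_size - 1) * dilation
    return field


# With kernel size 3 and dilation base 2, one extra layer roughly doubles the reach.
four_layers = tcn_receptive_field(kernel_size=3, dilation_base=2, num_layers=4)
five_layers = tcn_receptive_field(kernel_size=3, dilation_base=2, num_layers=5)
```

Because the dilation is multiplied by the base at every layer, the number of past time steps visible to the output grows geometrically with depth, while the number of parameters grows only linearly.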
Tree-based models rely primarily on engineered lag features and nonlinear regression mechanisms, whereas deep learning models learn temporal representations directly from sequential input data. Comparative evaluation under identical sliding-window conditions enables assessment of whether explicit feature engineering or representation learning is more suitable for short-term solar radiation forecasting.
3.3. Changepoint Detection
Changepoint detection (CPD) refers to the identification of structural breaks in time series, where statistical properties such as mean or variance change due to regime transitions. Such shifts are common in solar radiation data due to atmospheric variability.
In this study, CPD is performed using the Pruned Exact Linear Time (PELT) algorithm, which formulates segmentation as a penalized cost minimization problem [40]. PELT identifies multiple structural breaks by minimizing the total within-segment cost while penalizing excessive segmentation. Owing to its pruning strategy, the algorithm achieves near-linear computational complexity, making it suitable for rolling window applications.
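The penalized cost minimization and the pruning step can be sketched in a few lines. The example below is an illustrative pure-Python version that assumes a squared-error (mean-shift) segment cost and a user-chosen penalty; the cost function, penalty value, and function name `pelt_mean_shift` are assumptions and do not reproduce the configuration used in the study.

```python
from typing import List


def pelt_mean_shift(signal: List[float], penalty: float) -> List[int]:
    """Return changepoint indices (start positions of new segments).

    Segment cost (assumption): sum of squared deviations from the segment mean.
    """
    n = len(signal)
    # Prefix sums give O(1) evaluation of the segment cost.
    csum = [0.0] * (n + 1)
    csum_sq = [0.0] * (n + 1)
    for i, y in enumerate(signal):
        csum[i + 1] = csum[i] + y
        csum_sq[i + 1] = csum_sq[i] + y * y

    def cost(s: int, t: int) -> float:
        """Cost of the segment signal[s:t]."""
        total = csum[t] - csum[s]
        return (csum_sq[t] - csum_sq[s]) - total * total / (t - s)

    best = [0.0] * (n + 1)   # best[t]: optimal penalized cost of signal[:t]
    best[0] = -penalty
    last_cp = [0] * (n + 1)  # last_cp[t]: start of the final segment of the optimum
    candidates = [0]
    for t in range(1, n + 1):
        values = [(best[s] + cost(s, t) + penalty, s) for s in candidates]
        best[t], last_cp[t] = min(values)
        # Pruning: discard start points that can never become optimal again.
        candidates = [s for v, s in values if v - penalty <= best[t] + 1e-9]
        candidates.append(t)

    # Backtrack from the end of the series to recover all changepoints.
    changepoints = []
    t = n
    while last_cp[t] > 0:
        t = last_cp[t]
        changepoints.append(t)
    return sorted(changepoints)


# Toy series with one abrupt level shift after 20 samples.
breaks = pelt_mean_shift([0.0] * 20 + [10.0] * 20, penalty=5.0)
```

A larger penalty yields fewer detected breaks, so in a rolling-window setting the penalty effectively controls how readily a change in the irradiance regime is accepted as a changepoint.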
4. Test System Description
The data used in this research were recorded on-site at a grid-connected PV power plant near Medkovetz, Bulgaria, with an installed power of 103 kWp, provided by 448 south-oriented 230 Wp PV modules with a 30° inclination and eight 12 kW string inverters [16]. Plane-of-array (POA) solar irradiance, wind speed, module and ambient temperature, and other parameters are measured using a weather station and a dedicated SCADA system.
5. Data Description and Pre-Processing
The on-site measurements are recorded in daily files with a temporal resolution of 5 min, resulting in 288 samples per day. The parameters used in this study are the plane-of-array (POA) solar irradiance and ambient temperature.
The analyzed dataset spans 253 consecutive days, from 5 September 2012 to 15 May 2013, yielding a total of 72,864 five-minute observations.
Due to the requirements of the sliding-window time-series modeling framework, a continuous temporal sequence is necessary. Two types of missing data were identified: (i) days with partially missing observations, where daily recordings start later within the day due to technical or operational reasons, and (ii) days with completely missing data.
To ensure data continuity, missing values were handled through two preprocessing strategies adapted to the characteristics of the dataset. The reconstruction procedure was performed using only temporally preceding days to avoid any potential information leakage and to preserve the chronological structure of the time series.
For partially missing days (5.5% of the dataset), only the missing segments were reconstructed. The reconstruction was performed by copying the corresponding time intervals from a previous or temporally close day with a similar irradiance pattern at the beginning of the day. The selection of a suitable reference day was supported by comparison with available satellite-derived irradiance (NASA GHI) to ensure consistency in daily irradiance trends.
For fully missing days (9.5%), complete daily profiles were replaced using data from temporally adjacent days exhibiting similar seasonal characteristics. In some cases, a scaling adjustment was applied to the reference-day profile based on the relative change observed in satellite-derived irradiance, in order to account for differences in overall irradiance levels.
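One possible form of the scaling adjustment is sketched below. It assumes that the scaling factor is the ratio of the daily satellite GHI totals of the missing day and the reference day; the text does not specify the exact rule, so both this ratio and the function name `scale_reference_profile` are illustrative assumptions.

```python
from typing import List


def scale_reference_profile(reference_poa: List[float],
                            ghi_reference_day: List[float],
                            ghi_missing_day: List[float]) -> List[float]:
    """Scale a reference-day POA profile by the ratio of daily satellite GHI totals.

    Assumption: relative change = sum(GHI, missing day) / sum(GHI, reference day).
    """
    ref_total = sum(ghi_reference_day)
    if ref_total <= 0:
        raise ValueError("reference-day GHI total must be positive")
    factor = sum(ghi_missing_day) / ref_total
    return [factor * value for value in reference_poa]


# Toy profiles (W/m2): satellite GHI is 20% lower on the missing day.
imputed = scale_reference_profile(
    reference_poa=[0.0, 200.0, 400.0, 200.0, 0.0],
    ghi_reference_day=[0.0, 100.0, 200.0, 100.0, 0.0],
    ghi_missing_day=[0.0, 80.0, 160.0, 80.0, 0.0],
)
```

A single daily factor preserves the shape of the reference profile and changes only its level, which matches the stated aim of accounting for differences in overall irradiance.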
In total, approximately 15% of the dataset was subject to reconstruction or imputation.
The distribution of reconstructed data across training and testing subsets was controlled to avoid potential bias, with 6.3% (training) and 2.2% (testing) of partially reconstructed days, and 7.2% (training) and 20.0% (testing) of fully imputed days.
Although this reconstruction introduces approximation error, it preserves the structural consistency and temporal patterns of the time series.
A quantitative validation was performed for the full-day reconstruction case, which represents the more demanding scenario. The results indicate limited reconstruction error, with nMAE of approximately 2% and nRMSE around 4%. Since partial-day reconstruction affects only a restricted portion of the daily profile, its impact is expected to be smaller, although it was not evaluated separately and remains a limitation of the study.
Meteorological forecast data were obtained from NASA [41]. The external dataset provides hourly values of global horizontal solar irradiance (GHI) and ambient temperature for the same 253-day period, resulting in 6072 hourly observations.
To ensure consistency between the two data sources, the on-site 5-min measurements were aggregated to hourly values using arithmetic averaging. This aggregation yields aligned hourly time series suitable for model training and comparative analysis.
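The aggregation step amounts to averaging each block of twelve 5-min samples. The sketch below assumes that the samples are ordered chronologically and that each hour is formed from the twelve samples starting at the full hour; the hour-labelling convention and the function name are assumptions.

```python
from typing import List

SAMPLES_PER_HOUR = 12  # 60 min / 5 min


def aggregate_to_hourly(samples_5min: List[float]) -> List[float]:
    """Arithmetic mean of each consecutive block of twelve 5-min samples."""
    if len(samples_5min) % SAMPLES_PER_HOUR != 0:
        raise ValueError("series length must be a multiple of 12")
    return [
        sum(samples_5min[i:i + SAMPLES_PER_HOUR]) / SAMPLES_PER_HOUR
        for i in range(0, len(samples_5min), SAMPLES_PER_HOUR)
    ]


# One day of 288 samples: 0 W/m2 except a constant 600 W/m2 during the 13th hour.
day = [0.0] * 288
day[144:156] = [600.0] * 12
hourly = aggregate_to_hourly(day)
```

Applied to one day, 288 five-minute values reduce to 24 hourly means, which can then be paired with the hourly satellite record.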
6. Training Process
6.1. Training Process Description—Rolling Training Strategy
A hold-out test dataset, consisting of the last 45 days, was separated prior to model development. The remaining data were used for rolling validation and hyperparameter optimization. The final selected model was evaluated exclusively on the unseen test set to ensure unbiased performance estimation.
To address seasonal drift, a sliding-window training scheme was adopted. For a training length of N days (N = 5, 10, or 15), the first N days are used for model training, while the subsequent day serves as the validation day. The training window then shifts forward by one day, and the process is repeated until the end of the validation period.
This procedure ensures day-ahead forecasting consistency, avoids leakage of future information, and realistically simulates operational model retraining.
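The sliding-window scheme can be expressed as a small generator of training and validation day indices. This is a minimal sketch using 0-based day indices and a hypothetical function name; it illustrates the splitting logic only and not the model retraining itself.

```python
from typing import Iterator, List, Tuple


def rolling_day_splits(num_days: int, train_days: int) -> Iterator[Tuple[List[int], int]]:
    """Yield (training day indices, validation day index) pairs.

    Each window covers train_days consecutive days, is followed by exactly one
    validation day, and then shifts forward by one day.
    """
    for start in range(num_days - train_days):
        train = list(range(start, start + train_days))
        yield train, start + train_days


# Toy example: 8 available days and a 5-day training window.
splits = list(rolling_day_splits(num_days=8, train_days=5))
```

Since every validation day lies strictly after its training window, no information from the forecasted day can enter the training data.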
The total validation period length differs depending on the training window size. The last validation day is fixed for all configurations (31 March 2013), while the starting day shifts according to the number of training days required. As a result:
5-day training → 202 validation days
10-day training → 197 validation days
15-day training → 192 validation days
Although this leads to slightly different validation lengths, it enables the use of the maximum available dataset for each configuration.
Forecasting performance is assessed using RMSE, the average R2 over the full validation period, the average R2 over days with positive R2, the number of days with positive R2, and the percentage of days with positive R2.
While RMSE directly reflects the magnitude of forecasting error and is particularly relevant for PV generation estimation, it does not necessarily indicate whether the model captures the functional relationship between inputs and target variables.
For example, under low-radiation conditions, a model predicting near-constant values may achieve relatively low RMSE without learning meaningful temporal dynamics. Therefore, R2 is used as a complementary metric to evaluate explanatory capability.
However, average R2 may become significantly negative if a limited number of poorly predicted days dominate the metric. To provide additional insight into model robustness, two supplementary metrics are introduced: the proportion of days with positive R2 and the mean R2 calculated only for those days.
These indicators reflect the frequency with which the model captures daily structure successfully.
Therefore, in practical PV forecasting applications, RMSE and nRMSE are typically prioritized due to their direct relation to forecasting error, while R2-based metrics provide complementary insight into structural consistency.
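The metrics discussed above can be computed as in the following sketch. It assumes that RMSE is pooled over all validation samples, that R2 is computed separately for each day, and that the nRMSE normalization constant is passed in explicitly through the hypothetical argument `norm`, since its definition is not given here.

```python
from math import sqrt
from typing import Dict, List


def r2_score(y_true: List[float], y_pred: List[float]) -> float:
    """Coefficient of determination for one day (assumes a non-constant target)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rolling_metrics(days_true: List[List[float]], days_pred: List[List[float]],
                    norm: float) -> Dict[str, float]:
    """Pooled RMSE/nRMSE plus the daily R2 summary statistics."""
    sq_errors = [(t - p) ** 2
                 for dt, dp in zip(days_true, days_pred)
                 for t, p in zip(dt, dp)]
    rmse = sqrt(sum(sq_errors) / len(sq_errors))
    daily_r2 = [r2_score(dt, dp) for dt, dp in zip(days_true, days_pred)]
    positive = [r for r in daily_r2 if r > 0]
    return {
        "rmse": rmse,
        "nrmse_pct": 100.0 * rmse / norm,
        "avg_r2": sum(daily_r2) / len(daily_r2),
        "avg_r2_positive": sum(positive) / len(positive) if positive else float("nan"),
        "n_positive": len(positive),
        "pct_positive": 100.0 * len(positive) / len(daily_r2),
    }


# Toy case: one perfectly predicted day and one day predicted as a constant.
truth = [[0.0, 2.0, 4.0, 2.0, 0.0], [0.0, 2.0, 4.0, 2.0, 0.0]]
preds = [[0.0, 2.0, 4.0, 2.0, 0.0], [4.0, 4.0, 4.0, 4.0, 4.0]]
metrics = rolling_metrics(truth, preds, norm=4.0)
```

In this toy case a single poorly predicted day is enough to make the average R2 negative, although half of the days are reproduced perfectly, which is the behavior that motivates the positive-day metrics.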
Two hyperparameter search strategies are compared: grid search [42] and automated Bayesian optimization [43].
6.1.1. Grid Search
For each model type (tree-based and deep learning), a predefined parameter grid was constructed. The rolling validation procedure described in Section 6.1 was repeated for every hyperparameter combination.
The grid search also included evaluation of preprocessing transformations such as:
Although exhaustive, grid search is computationally expensive and limited to predefined parameter ranges.
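The exhaustive nature of grid search is visible in a minimal sketch. The grid values and the scoring function `toy_validation_rmse` below are hypothetical stand-ins: in the study, the score of each combination would be the RMSE obtained from the rolling validation procedure.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple


def grid_search(param_grid: Dict[str, List],
                score: Callable[[Dict], float]) -> Tuple[Dict, float]:
    """Evaluate every combination in param_grid and return the lowest-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        value = score(params)
        if value < best_score:
            best_params, best_score = params, value
    return best_params, best_score


def toy_validation_rmse(params: Dict) -> float:
    """Hypothetical stand-in for the rolling-validation RMSE of one configuration."""
    return (params["depth"] - 4) ** 2 + 100 * abs(params["learning_rate"] - 0.01)


grid = {"depth": [2, 4, 6], "learning_rate": [0.01, 0.05, 0.1]}
best_params, best_score = grid_search(grid, toy_validation_rmse)
```

The number of evaluations is the product of the grid sizes (nine here), and each evaluation requires a full rolling validation run, which explains the computational cost noted above.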
6.1.2. Automated Optimization
Automated hyperparameter optimization was performed using Optuna [44].
A total of 400 trials were executed for the tree-based models and 50 trials for the deep learning models, owing to the long computation time of the latter. The objective function minimized the validation RMSE, averaged over the full rolling validation period.
Compared to grid search, Bayesian optimization generally enables continuous parameter sampling, more efficient exploration of high-dimensional search spaces, and improved adaptation to model-specific sensitivities.
Both optimization strategies were conducted using the 5-day training window configuration, which ensures comparable search conditions across models and limits computational cost. The same optimized hyperparameters were subsequently applied to the 10-day and 15-day training window configurations. As a result, the comparison across different training window lengths may not fully reflect the optimal performance achievable for each configuration.
For deep learning architectures (N-BEATS, N-HiTS, and TCN), the same rolling-window framework was applied. Models were retrained for each window using identical training lengths as in tree-based models.
Early stopping was employed based on validation loss to mitigate overfitting, particularly under short training windows. Dropout regularization and weight decay were used where applicable.
This unified framework enables direct comparison between feature-engineered tree-based models and neural architectures based on representation learning.
To investigate whether structural breaks affect forecasting accuracy, changepoint detection was implemented for the best-performing model.
Detected changepoints were used to adjust the weights of the input parameters from the current training window by prioritizing the values after the accepted changepoint of the time series. This mechanism aims to reduce the influence of abrupt weather pattern transitions, mainly bad weather days.
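One way to read this weighting mechanism is as a vector of per-sample training weights. The sketch below assumes that samples after the accepted changepoint keep full weight while earlier samples are down-weighted by a fixed factor; the factor 0.5 and the function name are illustrative assumptions, not values reported in the study.

```python
from typing import List, Optional


def changepoint_sample_weights(n_samples: int, changepoint: Optional[int],
                               weight_before: float = 0.5) -> List[float]:
    """Per-sample weights for one training window.

    Samples at or after the changepoint index get weight 1.0; earlier samples
    get weight_before (assumed value). Without a changepoint, all weights are 1.0.
    """
    if changepoint is None:
        return [1.0] * n_samples
    return [weight_before if i < changepoint else 1.0 for i in range(n_samples)]


# Window of 10 samples with an accepted changepoint at index 6.
weights = changepoint_sample_weights(10, changepoint=6, weight_before=0.25)
```

Gradient boosting libraries generally accept such weights at fit time, so the model architecture itself does not need to change.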
To examine whether additional descriptive information improves forecasting accuracy, the best-performing model was provided with statistical features computed over the training window. These include rolling mean, standard deviation, minimum, maximum, variance, and short-term trend indicators of solar radiation and temperature.
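The listed statistics can be computed per training window as in the sketch below. It assumes population (rather than sample) variance and standard deviation, and uses the least-squares slope against the sample index as the short-term trend indicator; these choices and the function name are assumptions.

```python
from statistics import mean, pstdev, pvariance
from typing import Dict, List


def window_statistics(values: List[float]) -> Dict[str, float]:
    """Summary statistics of one variable over the training window."""
    n = len(values)
    mu = mean(values)
    # Least-squares slope against the sample index as a simple trend indicator.
    t_mean = (n - 1) / 2
    denom = sum((t - t_mean) ** 2 for t in range(n))
    slope = (sum((t - t_mean) * (v - mu) for t, v in enumerate(values)) / denom
             if denom else 0.0)
    return {
        "mean": mu,
        "std": pstdev(values),
        "min": min(values),
        "max": max(values),
        "var": pvariance(values),
        "trend": slope,
    }


# Toy window with a steady increase of one unit per step.
stats = window_statistics([1.0, 2.0, 3.0, 4.0, 5.0])
```

The resulting values would be appended as additional input columns, once for solar radiation and once for temperature.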
The objective was to assess whether summary statistics of recent variability provide complementary information. The augmented model was evaluated under the same rolling validation framework, and its performance was compared against the results obtained without statistical features, using RMSE, R2, and the proportion of positive-R2 days.
The impact of changepoint integration was evaluated by comparing performance metrics with and without the detection mechanism under identical sliding-window conditions.
In addition to solar radiation forecasting, the impact on PV generation estimation was evaluated under two scenarios: using measured on-site radiation and temperature, and using forecasted radiation while keeping the remaining inputs unchanged.
This comparison quantifies the increase in generation error when measured radiation is replaced by forecasted radiation inputs, thereby assessing the practical implications of meteorological forecast accuracy.
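The two-scenario comparison can be illustrated with a deliberately simple PV model. The sketch below assumes a linear power model with a temperature correction, hourly inputs, and a temperature coefficient of -0.004 per °C; the PV model and this coefficient are assumptions and not taken from the study, and only the 103 kWp rating comes from the test system description.

```python
from typing import List

P_STC_KW = 103.0      # installed power of the test plant (kWp)
GAMMA_PER_C = -0.004  # assumed power temperature coefficient (1/°C)


def pv_energy_kwh(poa_wm2: List[float], module_temp_c: List[float],
                  p_stc_kw: float = P_STC_KW, gamma: float = GAMMA_PER_C) -> float:
    """Energy from a simple linear PV model; inputs are hourly mean values."""
    energy = 0.0
    for g, t in zip(poa_wm2, module_temp_c):
        power_kw = p_stc_kw * (g / 1000.0) * (1.0 + gamma * (t - 25.0))
        energy += max(power_kw, 0.0)  # one hour per sample, so kW equals kWh
    return energy


def relative_energy_error(measured_poa: List[float], forecast_poa: List[float],
                          module_temp_c: List[float]) -> float:
    """Relative change in energy when measured POA is replaced by forecasted POA."""
    e_measured = pv_energy_kwh(measured_poa, module_temp_c)
    e_forecast = pv_energy_kwh(forecast_poa, module_temp_c)
    return (e_forecast - e_measured) / e_measured


# Toy day: the forecast overestimates irradiance by 10% at every hour.
measured = [0.0, 500.0, 1000.0, 500.0, 0.0]
forecast = [1.1 * g for g in measured]
temps = [25.0] * 5
error = relative_energy_error(measured, forecast, temps)
```

Because the temperature input is identical in both scenarios, the resulting difference in energy is attributable to the irradiance input alone.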
6.2. Training Process Validation and Testing
6.2.1. Grid Search Analysis (Tree-Based Models)
Grid search was performed exclusively for the tree-based models (XGBoost, CatBoost, LightGBM, and GBT). The selected hyperparameter configurations are summarized below:
XGBoost: n_estimators = 250, scale_pos_weight = 20, learning rate = 0.01, gamma = 0, subsample = 0.8
CatBoost: iterations = 300, learning rate = 0.01, depth = 4, L2_leaf_reg = 1
LightGBM: n_estimators = 300, boosting type = goss, max_depth = 6, learning rate = 0.01
GBT: max_depth = 5, max_features = sqrt
As shown in Table 1, no single model dominates across all evaluation metrics. The optimal model depends on the selected performance criterion.
In terms of RMSE and nRMSE, CatBoost with a 5-day training window achieves the best performance (103.6 W/m2 and 9.3%, respectively), followed by CatBoost with a 15-day window (106.3 W/m2 and 9.6%, respectively).
When considering average validation R2, LightGBM with a 5-day training window exhibits the least negative value, indicating comparatively better explanatory capacity over the full validation period. However, its RMSE remains higher than the best CatBoost configuration.
The highest average R2 calculated only on positive-R2 days is obtained by LightGBM with a 15-day window, although this configuration does not provide the highest proportion of positive days.
CatBoost with a 5-day training window achieves the lowest RMSE, the second-best average R2, strong R2 performance on positive days and the highest percentage of positive-R2 days.
This balanced behavior suggests that CatBoost, with a short training window, provides the most stable performance among the grid-searched configurations.
Following the implementation of automated Bayesian hyperparameter optimization, significantly improved performance was observed compared to grid search for tree-based models.
Given that the automated approach covers a substantially larger hyperparameter space, that most of the models’ hyperparameters are continuous, and that Bayesian optimization is more efficient, grid search was not applied to the deep learning models. Instead, automated optimization was directly adopted as the primary tuning strategy for neural architectures.
This decision reduces computational cost while enabling a more effective exploration of high-dimensional hyperparameter spaces.
6.2.2. Automated Optimization Analysis
Automated hyperparameter optimization was performed using Bayesian search with 400 trials for the tree-based models and 50 trials for the deep learning architectures.
Tree-Based Models
Among the tree-based models, substantial improvements in RMSE were observed compared to grid search results, as shown in Table 1.
CatBoost achieves one of the lowest RMSE and nRMSE values within this group (75.5 W/m2 and 6.8% for both 10-day and 15-day training windows). LightGBM and GBT demonstrate highly comparable performance, with RMSE values also remaining within the 75–80 W/m2 range depending on the training window size. XGBoost exhibits slightly higher RMSE values but maintains strong structural stability across configurations.
In general, automated optimization significantly narrows the performance gap between tree-based methods and reduces sensitivity to training window length.
The selected configurations for the tree-based models are summarized below:
XGBoost: max_depth = 2, n_estimators = 268, min_child_weight = 1, learning rate = 0.0196, gamma = 0.601, subsample = 0.329, colsample_bytree = 0.947, reg_alpha = 0.490, reg_lambda = 0.527, lags = 259;
CatBoost: iterations = 1000, learning rate = 0.0620, depth = 2, subsample = 0.904, colsample_bylevel = 0.831, min_data_in_leaf = 98, lags = 70;
LightGBM: n_estimators = 1000, bagging_freq = 1, learning rate = 0.00306, num_leaves = 312, subsample = 0.441, colsample_bytree = 0.990, min_data_in_leaf = 3, lags = 0;
GBT: n_estimators = 1000, learning rate = 0.00721, subsample = 0.568, max_depth = 5, max_leaf_nodes = 127, min_impurity_decrease = 74.82, alpha = 0.752, lags = 0
Figure 1 shows the recorded on-site (true) values compared with the estimated values of POA solar irradiance using the CatBoost model. The NASA GHI values are shown only for the 5-day training period, illustrating the input–output relationship and model fit.
Figure 2 shows the recorded on-site (true) values compared with the estimated values of POA solar irradiance using the XGBoost model. The NASA GHI values are shown only for the 10-day training period.
Deep Learning Models
Deep learning architectures were optimized exclusively using Bayesian search due to their high-dimensional hyperparameter spaces. The best configurations are summarized below:
N-BEATS: learning rate = 4.18 × 10⁻⁴, batch size = 28, dropout = 0.043, 14 stacks, 3 blocks, 5 layers per block, width = 256, expansion coefficient = 8, polynomial trend degree = 2.
N-HiTS: learning rate = 1.45 × 10⁻⁴, batch size = 48, dropout = 0.0045, 1 stack, 4 blocks, 5 layers, width = 256, MaxPool enabled, validation retraining period = 3 epochs.
TCN: learning rate = 6.51 × 10⁻⁴, batch size = 66, dropout = 0.0227, kernel size = 100, 86 filters, 6 layers, dilation base = 4.
The most stable deep learning model is N-HiTS, as shown in Table 1, achieving RMSE between 76.3 and 77.6 W/m² and nRMSE of 6.9–7.0% across training windows, a positive average validation R² for the 5-day window (0.31), and the highest proportion of positive-R² days among all models (up to 87.8%).
The high share of positive-R² days achieved by N-HiTS, together with its comparable RMSE across training window sizes, suggests stable predictive behavior under the rolling training scheme.
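The error and consistency metrics used in this comparison can be sketched as follows. This is a minimal illustration, not the evaluation code of the study: the function names are hypothetical, the normalization constant `norm` for nRMSE is an assumption (its definition is not restated in this section), and the toy values are invented.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root-mean-square error over paired samples."""
    n = len(y_true)
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def nrmse_percent(y_true, y_pred, norm):
    """RMSE as a percentage of a normalization constant (assumed, not from the paper)."""
    return 100.0 * rmse(y_true, y_pred) / norm

def r2(y_true, y_pred):
    """Coefficient of determination; negative when worse than predicting the mean.

    Assumes y_true is not constant (otherwise the denominator is zero).
    """
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def positive_r2_share(days):
    """Return (percentage of days with R2 > 0, mean R2 over those days)."""
    scores = [r2(t, p) for t, p in days]
    positive = [s for s in scores if s > 0]
    share = 100.0 * len(positive) / len(scores)
    mean_pos = sum(positive) / len(positive) if positive else float("nan")
    return share, mean_pos

# Toy data in W/m2 (hypothetical, not taken from the study):
# each day is a pair (measured values, estimated values).
day_good = ([0.0, 200.0, 600.0, 200.0], [10.0, 190.0, 590.0, 210.0])
day_bad = ([0.0, 200.0, 600.0, 200.0], [600.0, 0.0, 0.0, 600.0])
share, mean_pos = positive_r2_share([day_good, day_bad])
```

Computing R² per day and then counting the days with R² > 0 separates the question of how large the errors are (RMSE) from whether the daily profile is captured at all, which is the distinction drawn between the tree-based models and N-HiTS.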
N-BEATS demonstrates moderate performance (best RMSE ≈ 90.8 W/m²) but exhibits higher variability and more negative average R² values.
TCN shows comparatively weaker performance, with RMSE and nRMSE exceeding 100 W/m² and 9.4%, respectively.
Figure 3 shows the relative importance of the N-BEATS hyperparameters obtained through Bayesian optimization.
Figure 4 shows the optimization history, illustrating the evolution of the objective metric over 50 trials. Each point corresponds to the result of a single trial.
Comparative Interpretation
When considering pure error minimization (RMSE and nRMSE), the best-performing configuration is CatBoost (75.5 W/m² and 6.8%), closely followed by GBT and LightGBM.
However, when structural consistency is considered (percentage of positive-R² days and average R²), N-HiTS demonstrates strong robustness.
This distinction highlights an important trade-off:
Tree-based models achieve slightly lower absolute error.
Neural hierarchical architectures (N-HiTS) provide more stable structural pattern capture.
Neural architectures require higher computational resources and training time.
These differences can be partially explained by the interaction between model architecture and the characteristics of the dataset. Tree-based models benefit from the structured lag-based input representation and relatively limited training data, allowing efficient learning of nonlinear relationships. In contrast, neural architectures are designed to exploit temporal dependencies and may benefit from larger datasets or longer training sequences; however, in the present setup, N-HiTS demonstrates competitive performance even under short rolling windows, suggesting robustness to limited training context.
The relatively stronger performance of N-HiTS compared to N-BEATS may also be related to differences in architectural design, sensitivity to data preprocessing, and interaction with the adopted training configuration.
To further assess whether the observed performance differences between the best-performing models are meaningful, statistical significance analysis was performed using paired comparison tests on the daily RMSE values across the validation period.
The comparison between CatBoost and N-HiTS using both paired t-tests and Wilcoxon signed-rank tests did not indicate statistically significant differences at the 95% confidence level for the evaluated rolling-window configurations.
These results suggest that, although CatBoost achieves slightly lower RMSE values in some configurations, the forecasting performance of CatBoost and N-HiTS can be considered statistically comparable under the examined dataset and evaluation framework.
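The paired comparison on daily RMSE values can be illustrated with a small standard-library sketch. It is not the authors' test code: the function names and the toy RMSE values are hypothetical, the Wilcoxon p-value uses the large-sample normal approximation without tie or continuity correction, and the p-value of the t-test (which requires the Student t distribution) is left to a statistics library.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for the paired differences a - b."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    return mean(d) / (stdev(d) / sqrt(n)), n - 1

def wilcoxon_signed_rank(a, b):
    """Return (W_plus, W_minus, two-sided p) via the normal approximation.

    Zero differences are dropped and tied |differences| receive average ranks.
    No tie or continuity correction is applied (a simplification).
    """
    d = [x - y for x, y in zip(a, b) if x != y]
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    w_minus = sum(r for r, v in zip(ranks, d) if v < 0)
    mu = n * (n + 1) / 4
    sigma = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return w_plus, w_minus, p

# Hypothetical daily RMSE values (W/m2) for two models on the same days.
rmse_a = [75.0, 80.0, 70.0, 90.0, 85.0, 78.0, 74.0, 88.0]
rmse_b = [76.0, 78.0, 73.0, 86.0, 90.0, 72.0, 81.0, 80.0]
t_stat, dof = paired_t_statistic(rmse_a, rmse_b)
w_plus, w_minus, p_value = wilcoxon_signed_rank(rmse_a, rmse_b)
```

For these toy values the t statistic is about 0.26 on 7 degrees of freedom, well below the two-sided 5% critical value of roughly 2.36, and the approximate Wilcoxon p-value is far above 0.05, so neither test would reject equal performance. For samples this small an exact Wilcoxon table is preferable to the normal approximation.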
Furthermore, automated optimization increases the number of positive-R² days across almost all models compared to grid search, even when the average R² over positive days slightly decreases. This suggests that Bayesian optimization expands the set of moderately well-predicted days rather than maximizing performance on already well-fitted cases.
When prioritizing day-ahead operational forecasting accuracy, where RMSE directly reflects expected generation error, CatBoost with a 10-day training window demonstrates one of the most favorable overall performances among the evaluated models.
In addition to achieving the lowest RMSE, CatBoost demonstrates competitive structural consistency and significantly lower computational complexity compared to deep learning architectures, making it particularly suitable for practical deployment scenarios.
Although N-HiTS achieves the highest proportion of positive-R² days, its RMSE remains slightly higher, and its training time is substantially longer.
Therefore, considering forecasting accuracy, computational efficiency, retraining complexity, and deployment practicality, CatBoost represents a particularly well-balanced solution for day-ahead PV generation forecasting in the examined case study.
Overall, automated hyperparameter optimization proves clearly superior to grid search, particularly in high-dimensional search spaces and under short rolling training conditions.
Figure 5 shows the measured on-site (true) values compared with estimated POA solar irradiance using the CatBoost model with a 10-day training window. The model is optimized using GridSearch. Deviations are observed during the early morning and late evening hours, where irradiance levels change more rapidly.
Figure 6 shows the measured on-site (true) values compared with estimated POA solar irradiance using the XGBoost model with a 10-day training window. The model is optimized using GridSearch. The estimation captures the irradiance profile more accurately during the early morning and late evening hours compared to the CatBoost model for this example day. However, the overall performance of XGBoost remains slightly lower than that of CatBoost.
6.2.3. Impact of Statistical Feature Augmentation
The inclusion of additional statistical features in the CatBoost model does not lead to performance improvement. RMSE values remain comparable to the baseline and are slightly degraded in several configurations, as presented in Table 2. Re-optimizing hyperparameters for the augmented feature set does not yield gains. This indicates that lag-based inputs already capture the essential temporal structure of the series.
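A lag-based input representation of this kind can be sketched as below. This is an illustration under stated assumptions: it treats the "lags" setting as the number of past values of the input series appended to the current value, the function name `make_lagged_rows` is hypothetical, and the GHI values are invented; the exact feature construction used in the study is defined elsewhere in the paper.

```python
def make_lagged_rows(series, n_lags):
    """Return rows [x_t, x_{t-1}, ..., x_{t-n_lags}] for every t with full history."""
    return [
        [series[t - k] for k in range(n_lags + 1)]
        for t in range(n_lags, len(series))
    ]

# Hypothetical hourly satellite GHI values in W/m2 (not from the study).
ghi = [0.0, 120.0, 340.0, 510.0, 480.0, 260.0]
rows = make_lagged_rows(ghi, n_lags=2)
```

Under this reading, `n_lags=0` reduces each row to the current value alone, and each additional lag costs one usable sample at the start of the training window.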
6.2.4. CPD-Based Sample Weighting Approach
After selecting CatBoost as the best-performing model, an additional experiment was conducted to evaluate whether short-term regime shifts in forecasting performance could be leveraged to improve predictive accuracy. The rolling training windows, validation procedure, and test horizon remained unchanged.
Before each retraining step, changepoint detection was applied to the most recent 10-day rolling MAE sequence produced by the baseline CatBoost model (without CPD) [45]. The PELT algorithm with a fixed penalty and L2 cost function was used to detect changes in the mean level of the error series. Detected changepoints were interpreted as transitions between short-term forecasting performance regimes.
If no changepoint was detected, retraining proceeded with uniform sample weights. When a changepoint was identified, observations preceding the most recent change were down-weighted (weight factor 0.5), while more recent observations retained full weight.
The weight factor was selected empirically based on preliminary experiments. No extensive sensitivity analysis or parameter optimization of the changepoint detection procedure was performed, as the objective is to provide an initial assessment of its applicability within the proposed framework.
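The weighting rule described above can be sketched in a few lines. Several points are assumptions made for illustration only: the detector below is plain optimal partitioning, which minimizes the same penalized L2 objective as PELT but without pruning; the MAE sequence is treated as one value per day; each training sample is mapped to a day index within that sequence; and the penalty value and all function names are hypothetical.

```python
def l2_cost_factory(x):
    """Build prefix sums so the L2 cost of any segment x[s:e] is O(1)."""
    s1, s2 = [0.0], [0.0]
    for v in x:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def cost(s, e):
        # Sum of squared deviations from the segment mean.
        total = s1[e] - s1[s]
        return (s2[e] - s2[s]) - total * total / (e - s)

    return cost

def detect_changepoints(x, penalty):
    """Minimize the sum of segment L2 costs plus a penalty per changepoint.

    Plain optimal partitioning, O(n^2); PELT targets the same objective.
    Returns the sorted indices at which a new segment starts.
    """
    n = len(x)
    cost = l2_cost_factory(x)
    best = [-penalty] + [float("inf")] * n  # best[e]: optimum for x[:e]
    last = [0] * (n + 1)                    # start of the final segment
    for e in range(1, n + 1):
        for s in range(e):
            cand = best[s] + cost(s, e) + penalty
            if cand < best[e]:
                best[e], last[e] = cand, s
    cps, e = [], n
    while last[e] > 0:
        cps.append(last[e])
        e = last[e]
    return sorted(cps)

def sample_weights(day_of_sample, daily_mae, penalty, old_weight=0.5):
    """Down-weight samples from days before the most recent changepoint."""
    cps = detect_changepoints(daily_mae, penalty)
    if not cps:
        return [1.0] * len(day_of_sample)  # no changepoint: uniform weights
    latest = cps[-1]
    return [old_weight if d < latest else 1.0 for d in day_of_sample]

# Hypothetical daily MAE values (W/m2) with a level shift after day 6.
mae = [40.0, 42.0, 41.0, 39.0, 40.0, 41.0, 70.0, 72.0, 69.0, 71.0]
weights = sample_weights(list(range(10)), mae, penalty=500.0)
```

The resulting list can be passed as per-sample weights when refitting the regressor. A larger penalty makes the detector more conservative, so fewer retraining steps deviate from uniform weighting.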
The approach was first evaluated on the validation period (Table 3) and subsequently applied to the final 45-day test set. Performance improvements remained consistent across different rolling training window lengths.
For the 5-day training window, the inclusion of changepoint detection does not improve RMSE (78.6 W/m² vs. 78.4 W/m² for the baseline). This is expected, as short training windows already prioritize recent data; using a 5-day detection window also did not improve the results.
For longer training windows, modest but consistent improvements are observed:
The impact is therefore more visible when longer historical segments are used for training, where structural shifts may have a stronger influence on model fitting.
Additionally, the number of detected changepoints was:
These results indicate that changepoint detection provides measurable, though moderate, gains under extended training windows by limiting the influence of regime transitions.
6.2.5. Summary of Model Enhancements and Final Selection
The comparative analysis reveals distinct outcomes across the examined model refinements.
First, automated hyperparameter optimization substantially improved performance across all model families compared to grid search, with CatBoost emerging as the most balanced solution in terms of RMSE and computational efficiency.
Second, the introduction of additional statistical features did not lead to measurable performance gains. RMSE values remained comparable to the baseline configuration and, in several cases, slightly degraded. This suggests that the lag-based input structure already captures the dominant temporal dynamics of the series, limiting the added value of further statistical descriptors.
In contrast, the integration of changepoint detection introduced a CPD-based weighting approach that yielded modest but consistent RMSE improvements for longer training windows (10 and 15 days), with reductions of approximately 0.8% and 0.3%, respectively. The effect was negligible for short (5-day) windows, where recent observations already dominate the learning process.
Overall, considering forecasting accuracy, stability across validation and test horizons, and computational efficiency, the CatBoost model with automated hyperparameter optimization and CPD-based sample weighting under a 10-day training window represents the most suitable configuration for day-ahead PV generation forecasting in the present case study.
6.3. Test Set Evaluation
The final CatBoost configuration was evaluated on the hold-out test set covering the last 45 days (1 April 2013–15 May 2013). The model was tested both with and without changepoint detection (CP), using a 10-day detection window, as presented in Table 4.
Without changepoint detection, the baseline RMSE and nRMSE values are:
5-day window: 119.0 W/m² and 10.70%;
10-day window: 121.1 W/m² and 10.89%;
15-day window: 120.0 W/m² and 10.80%.
The percentage of positive-R² days remains high (93–96%), indicating strong structural performance on unseen data.
With changepoint detection enabled, the results are:
5-day window: 119.0 W/m² (no change);
10-day window: RMSE is reduced from 121.1 W/m² to 119.8 W/m² and nRMSE from 10.89% to 10.78% (~1.1% improvement);
15-day window: RMSE is reduced from 120.0 W/m² to 119.5 W/m² and nRMSE from 10.80% to 10.75% (~0.4% improvement).
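The quoted relative improvements follow directly from the reported RMSE values, as the short check below shows. The helper name is hypothetical; the numbers are those listed above.

```python
def relative_reduction_percent(baseline, improved):
    """Relative reduction of an error metric, in percent of the baseline."""
    return 100.0 * (baseline - improved) / baseline

# Test-set RMSE values (W/m2) without and with changepoint detection.
rmse_10_day = relative_reduction_percent(121.1, 119.8)
rmse_15_day = relative_reduction_percent(120.0, 119.5)
```

Rounded to one decimal place, these evaluate to about 1.1% for the 10-day window and 0.4% for the 15-day window.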
The number of accepted changepoints was:
Interpretation
The test set results confirm the validation findings:
Changepoint detection does not degrade performance.
Modest improvements are observed for longer training windows.
The effect is most visible for the 10-day window (~1% RMSE and nRMSE reduction).
Although the absolute improvements are moderate, they are consistent and achieved without significantly increasing model complexity or inference time.
These findings suggest that changepoint detection may help mitigate the influence of changing temporal patterns when longer historical windows are used.
Overall, the test set evaluation confirms that CPD-based sample reweighting provides small but stable performance gains without introducing additional model complexity or overfitting effects.
Figure 7 shows the measured on-site (true) values compared with estimated POA solar irradiance using the CatBoost model without changepoint detection (CPD) and a 10-day training window. The figure corresponds to the forecasted day of 14 April 2013.
Figure 8 shows the measured on-site (true) values compared with estimated POA solar irradiance using the CatBoost model with changepoint detection (CPD) and a 10-day training window. The figure corresponds to the forecasted day of 14 April 2013. A slight difference is observed in the peak irradiance, consistent with the overall performance comparison between models with and without CPD.
6.4. Benchmark Power Model Evaluation
As a complementary analysis, a benchmark model based on multilinear regression with polynomial features (degree = 2) was implemented to provide a reference for PV generation estimation. This model has previously demonstrated strong performance when using measured on-site meteorological data, achieving R² values above 0.99 [16].
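A regression of this type can be sketched with the standard library alone. The sketch assumes ordinary least squares solved through the normal equations, two scaled inputs (irradiance and temperature), and an invented toy target; the function names are hypothetical, and the benchmark in [16] may differ in inputs, scaling, and fitting procedure.

```python
from itertools import combinations_with_replacement

def poly2_features(row):
    """Expand one sample into [1, x_i ..., x_i * x_j for i <= j]."""
    feats = [1.0] + list(row)
    feats += [a * b for a, b in combinations_with_replacement(row, 2)]
    return feats

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_poly2(X, y):
    """Ordinary least squares on degree-2 features via the normal equations."""
    Phi = [poly2_features(r) for r in X]
    m = len(Phi[0])
    A = [[sum(p[i] * p[j] for p in Phi) for j in range(m)] for i in range(m)]
    b = [sum(p[i] * t for p, t in zip(Phi, y)) for i in range(m)]
    return solve(A, b)

def predict_poly2(coef, X):
    """Evaluate the fitted polynomial on new samples."""
    return [sum(c * f for c, f in zip(coef, poly2_features(r))) for r in X]

# Toy data (hypothetical): irradiance g in kW/m2, temperature t in tens of deg C,
# and a target that is exactly quadratic in the inputs.
X = [[g, t] for g in (0.2, 0.5, 0.8, 1.0) for t in (1.0, 2.0, 3.0)]
y = [5.0 * g - 0.1 * g * t for g, t in X]
coef = fit_poly2(X, y)
pred = predict_poly2(coef, X)
```

Scaling the inputs to comparable magnitudes keeps the normal equations well conditioned; with raw irradiance in W/m², a QR-based or library least-squares solver would be the safer choice.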
The benchmark was evaluated under the same rolling-window framework as the machine learning models, with hyperparameters optimized during validation. Results are presented in Table 5.
Using measured insolation, the model achieves:
When replacing measured insolation with forecasted insolation (the inputs in this case are forecasted insolation, locally measured ambient temperature, and NASA-calculated ambient temperature and insolation), RMSE increases consistently:
This corresponds to a relative increase in RMSE of approximately 10–20%, depending on the training window.
While the absolute RMSE difference is moderate, this relative increase is non-negligible and demonstrates the direct impact of replacing measured with forecasted insolation.
It should be noted that the test period is characterized predominantly by stable, close to clear-sky conditions with high solar irradiance levels. Under such conditions, PV generation patterns are more regular and easier to predict, which may contribute to the relatively high performance observed for both measured and forecasted input scenarios. However, lower forecast accuracy might be expected under lower irradiance conditions.
The benchmark model is intentionally kept simple to isolate the impact of irradiance estimation errors on PV power generation while still achieving high accuracy with measured inputs (R² > 0.99 [16]). More complex models may compensate for input inaccuracies and obscure this relationship; therefore, this component serves as a foundation for future work.
7. Discussion
The results indicate that automated hyperparameter optimization may play a more significant role in performance improvement than model architecture alone. Tree-based ensemble models, particularly CatBoost, generally achieve lower RMSE values compared to deep learning architectures under the rolling-window framework.
The observed differences may also be influenced by the optimization budget and model-specific sensitivity to hyperparameter tuning, particularly for the deep learning architectures, which were tuned with fewer optimization trials due to computational constraints.
Additional paired statistical comparison between CatBoost and N-HiTS using daily RMSE values indicates that the observed performance differences are not statistically significant at the 95% confidence level for the evaluated rolling-window configurations.
This suggests that both model families achieve statistically comparable forecasting performance under the examined dataset and evaluation framework, despite their substantially different architectural complexity and computational requirements.
Hyperparameter optimization was performed only for the 5-day training window, which may limit the comparability of results across different window lengths. The same hyperparameters were subsequently applied to the 10-day and 15-day configurations; therefore, the reported performance differences may not fully reflect the optimal performance achievable for each configuration.
The experiments with statistical feature augmentation indicate that additional descriptive variables do not systematically improve performance. This suggests that the lag-based input structure already captures the dominant temporal dynamics relevant for day-ahead POA irradiance forecasting.
Changepoint detection introduces modest but consistent improvements, especially for longer training windows. While the absolute RMSE reduction remains limited, the improvement is stable across validation and test sets, indicating that selective adaptation to temporal shifts may enhance predictive stability without increasing model complexity.
The results further indicate that the proposed machine-learning framework can effectively learn a site-specific mapping between satellite-derived global horizontal irradiance (GHI) and measured plane-of-array (POA) irradiance. This highlights the ability of data-driven models to capture both local site characteristics and the transformation between horizontal and plane-of-array irradiance.
The benchmark regression model confirms that PV generation estimation can achieve very high R² values when measured meteorological inputs are used. However, replacing measured irradiance inputs with satellite-derived or forecasted values leads to a clear increase in generation error. This highlights that the accuracy of irradiance input data directly influences the final PV power forecasting performance.
The evaluation period is dominated by relatively stable irradiance conditions; therefore, the impact of the proposed approach under highly variable or cloudy conditions remains an open question for further investigation.
The comparison of different training window lengths is intended to assess the sensitivity of model performance to the amount of recent training data under a consistent hyperparameter configuration, rather than to fully optimize each configuration.
Overall, the findings indicate that for operational day-ahead PV forecasting, model selection should prioritize error minimization and computational efficiency, while additional feature engineering provides limited or no benefits. The results also suggest that regression-based site adaptation of satellite-derived irradiance represents an effective approach for improving site-specific POA irradiance forecasting under operational constraints.
While the results demonstrate strong performance for the considered dataset, the proposed approach is inherently site-specific, and its applicability to other locations may depend on local climatic conditions and data availability. Nevertheless, the framework itself is transferable and can be applied to other sites provided that sufficient representative data are available for model training.
Ambient temperature plays a significant role in photovoltaic power generation forecasting, influencing both system efficiency and output variability. While this study focuses on the site adaptation of solar irradiance, extending the proposed framework to include site-specific correction of temperature inputs represents a promising direction for future work.
Figure 9 illustrates a comparison between the NASA forecasted and locally measured ambient temperature over an 11-day period. The results show generally good agreement, with moderate deviations during certain intervals. This suggests that, similarly to solar irradiance, temperature inputs may also introduce additional uncertainty and could benefit from site-specific correction.
Additionally, the results encourage further research on machine-learning approaches for direct site-specific modeling and forecasting of PV power plant production using satellite-derived global horizontal irradiance (GHI).
8. Conclusions
This study presents a comparative evaluation framework for day-ahead forecasting of plane-of-array (POA) solar irradiance and PV generation using tree-based machine learning models, deep learning architectures, changepoint detection, and statistical feature engineering under a rolling-window validation scheme. The proposed approach is based on a data-driven mapping between satellite-derived global horizontal irradiance (GHI) and site-measured POA irradiance.
The main findings can be summarized as follows:
Automated hyperparameter optimization significantly outperforms grid search, particularly in high-dimensional parameter spaces. Bayesian optimization enables more efficient exploration and leads to consistent RMSE reduction across model families.
Among the evaluated models, CatBoost provides the most favorable trade-off between forecasting accuracy and computational efficiency, achieving the lowest RMSE and nRMSE under the 10-day training window while maintaining high structural consistency on unseen data. Although paired statistical tests indicate that the performance differences between CatBoost and N-HiTS are not statistically significant at the 95% confidence level, CatBoost remains preferable from an operational perspective due to its lower computational complexity and training time.
The inclusion of additional statistical features does not lead to systematic performance improvement, indicating that lag-based representations already capture the dominant temporal dynamics relevant for short-term POA irradiance forecasting.
Changepoint detection introduces modest but consistent RMSE improvements for longer training windows without degrading stability. Although the gains are limited, they are consistent across validation and test sets and suggest a potential benefit under the examined conditions.
The benchmark regression model confirms that PV generation estimation can achieve very high R² values when driven by measured meteorological inputs. However, replacing measured irradiance inputs with satellite-derived or forecasted values leads to a non-negligible increase in generation error, demonstrating the direct impact of irradiance input accuracy on power prediction performance.
Overall, the findings indicate that regression-based site adaptation of satellite-derived irradiance represents an effective and practical approach for improving site-specific day-ahead POA irradiance forecasting under operational constraints.
Future work may investigate multi-site validation, probabilistic forecasting approaches, ambient temperature site-specific adaptation, and integration of improved meteorological forecast sources to further reduce generation-level error.