1. Introduction
The global push for decarbonization has driven regulatory developments such as the U.S. Federal Energy Regulatory Commission (FERC) Order 2222 [1], which enables the participation of distributed energy resources (DERs), including rooftop solar photovoltaic (PV) systems, in wholesale electricity markets. This regulatory shift is accelerating the integration of renewable energy resources and reshaping both power system operations and market dynamics [2]. However, integrating variable renewable energy resources (VREs) like solar PV presents significant challenges for grid stability, reliability, and dispatch [3]. Unlike conventional generators, VREs exhibit volatile output due to rapid fluctuations in cloud cover, humidity, and other localized weather conditions. As a result, accurate and high-resolution solar irradiance forecasting is essential for managing DERs effectively within the power grid. However, existing global irradiance models typically operate at coarse spatial and temporal resolutions, limiting their utility for site-specific forecasting and real-time grid operations.
Advances in power electronic converter (PEC) technology, such as improved dynamic models [4] and virtual inertia solutions [5], address some stability concerns. However, accurate solar PV forecasting remains a critical challenge for grid operators and market participants. Existing approaches for solar PV forecasting primarily rely on numerical weather prediction (NWP) models [6,7], statistical methods and machine-learning techniques [8,9,10,11], deep learning algorithms [12,13], and hybrid approaches [9,14,15]. Some studies incorporate probabilistic methods to quantify uncertainty [16,17,18], while others consider the spatial influence of neighboring sites on a target location [12,19]. Although these methods have demonstrated success, many rely on an implicit assumption that global solar irradiance datasets provide accurate site-specific representations. This assumption introduces inaccuracies because global datasets, such as those from the National Solar Radiation Database (NSRDB) [20], provide irradiance values averaged over large spatial scales (e.g., 4 km × 4 km), which fail to capture fine-grained local variability at individual DERs.
A key challenge in solar PV forecasting is the need for high-resolution, site-specific solar irradiance data. Current datasets from sources like NSRDB and Open Meteo [21] cover broad spatial regions, with some reaching resolutions as coarse as 11 km × 11 km. This leads to approximation errors that can undermine accurate capacity estimation and power system operations such as dispatch and balancing. While the ideal solution would involve deploying weather stations at every prospective PV site to obtain direct irradiance measurements, this approach is impractical due to cost, maintenance, and logistical constraints. Furthermore, it would be infeasible for forecasting irradiance at prospective solar PV sites where physical measurements are not yet available. Therefore, there is a pressing need for advanced downscaling techniques that can transform coarse global irradiance data into high-resolution, site-specific estimates suitable for real-time market operations and DER integration.
By leveraging historical data on both global and local solar irradiance measurements, models can be developed to learn the mapping from the global values to the local values. Additionally, by incorporating spatial correlations, the dependencies among neighboring sites can be captured to infer forecast values for sites whose local solar irradiance measurements are unknown. As promising as downscaling is for power system operations, there is a dearth of literature on this method. Most solar irradiance downscaling studies have focused on temporal downscaling, generating high-temporal-resolution values from coarse-grained solar irradiance forecasts [22,23,24,25,26]. Studies that focused on spatial downscaling include [27], where coarse-resolution downward shortwave radiation is disaggregated to a 30 m scale using an atmospheric transmittance-based weighting technique. Although the proposed model achieves reliable downscaling results, especially for mountainous areas, it depends heavily on a satellite-derived dataset. Similarly, artificial neural networks (ANNs) were used to downscale weather variables from a 1.2 km spatial resolution to a 240 m resolution [28]. The study, however, used a simulated dataset generated from the Weather Research and Forecasting (WRF) model in Large Eddy Simulation mode with a 240 m grid, which may call into question the practical applicability of the model. The study in [29] adopted the nearest-neighbor Gaussian process (NNGP) to downscale solar irradiance from global resolution to a more fine-grained local resolution. However, the model in [29] performed interpolation instead of extrapolation (i.e., forecasting) to obtain day-ahead predictions. By evaluating the NNGP model on a temporal point within the training temporal space, the study failed to validate the capability of the NNGP to accurately forecast downscaled irradiance for time points beyond the training temporal space.
The main contribution of this paper is the development of an accurate, scalable spatiotemporal downscaling model for day-ahead solar irradiance forecasting. Our study designs a novel approach to spatiotemporal downscaling of solar irradiance for modeling and forecasting using a Nearest-Neighbor Random Forest (NNRF). The performance of the NNRF is compared with that of an NNGP adapted from [29] so that it performs forecasting rather than interpolation.
In the rest of the paper, the methodology is presented in Section 2, which describes the NNGP and NNRF spatiotemporal models. The simulation setup and data collection are presented in Section 3. Section 4 discusses the results. Finally, conclusions are drawn in Section 5, alongside future research directions. Additionally, the data preprocessing steps are shown in Appendix A.
3. Simulation Setup
In this study, we trained the NNGP and NNRF models on datasets from eight different sites and compared their performance in terms of accuracy and computational speed. The datasets were first cleaned of outliers and missing data. Details of the preprocessing steps are provided in Appendix A. The temporal features were engineered to capture the periodic patterns in the dataset. The Euclidean distances between the sites were computed to determine the nearest neighboring sites. The NNGP model was first built using the squared exponential kernel, and the GP was then replaced with an RF to form the NNRF model. Both models were developed in Python version 3.11.7. Major packages used in the code implementation include scikit-learn, mainly for the machine-learning modeling, and geopandas for the spatial analysis of the sites’ coordinates. Both models were executed on a local desktop equipped with an Intel® Core™ processor (16 CPUs, each operating at a clock speed of 2.0 GHz) and 16 GB of RAM. The rest of this section describes the dataset used in the study, the implementation details of NNGP and NNRF, the scalability test of the models, and the evaluation metrics.
3.1. Data Collection
Texas has significant solar energy potential and deployed solar resources, with annual average global tilt solar radiation ranging from 4.76 kWh/m²/day to 6.58 kWh/m²/day across the state [40]. Its abundant solar resources, combined with the unique characteristics of the isolated ERCOT grid, make it an ideal region for this study. Eight sites across Texas were selected for analysis, with their locations and coordinates listed in Table 1, and their spatial distribution illustrated in Figure 2. In this study, Sites 2 and 7 share the same global 4 km² grid space, while every other site has a distinct global region.
The spatial distribution of the selected sites, shown in Figure 2, indicates that most of the sites are sparsely distributed across the region. However, Sites 3, 4, 5, and 8 form a relatively close cluster, while Sites 2 and 7 share the same global space, as shown by the inset plot in Figure 2. Understanding this spatial arrangement is essential for accurately capturing the influence of neighboring sites on local conditions. Table 2 presents the ground distances between the sites in km, further illustrating their relative proximity and separation.
The global dataset for each site was obtained from the NSRDB website [20], while the local dataset was sourced from Ambient Weather [41]. The global dataset consists of 5-min interval data on solar irradiance, temperature, pressure, wind speed, and dew point, spanning the period from 1 January 2022 to 31 December 2022. During the same period and with the same temporal resolution, ground-measured solar irradiance data were collected as the local irradiance dataset. The dataset for each site has 105,120 data points, enough to effectively train a machine-learning model.
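As a quick arithmetic check on the record count, a non-leap year sampled every 5 min yields 365 × 288 samples. The sketch below reproduces this figure using only the Python standard library; the helper name `count_samples` is illustrative and not part of the study's code.

```python
from datetime import datetime, timedelta


def count_samples(start: datetime, end: datetime, step_minutes: int = 5) -> int:
    """Count fixed-step samples in the half-open interval [start, end)."""
    step = timedelta(minutes=step_minutes)
    return (end - start) // step


# One full year of 5-min data, 1 January 2022 through 31 December 2022.
n_2022 = count_samples(datetime(2022, 1, 1), datetime(2023, 1, 1))
print(n_2022)  # 105120
```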
Figure 3 and Figure 4 present the annual profiles of global and local irradiance, respectively, for Site 1 in 2022. The plots show significant differences between the global and local irradiance. This is confirmed by the residual plot in Figure 5, which shows the numerical deviations between the global and local irradiance at each time point for Site 1. The global irradiance, being forecast values from NSRDB, does not accurately represent the ground measurements. The datasets of the other seven sites show similar distributions and similar deviations between the global and local irradiance. This observation reveals that using global solar irradiance for solar PV estimation could result in inaccurate solar PV forecasting, posing risks to power dispatch reliability and even network stability. This affirms the need to downscale the forecast global irradiance to more accurate site-specific values.
3.2. Feature Engineering
Solar irradiance is influenced by temporal and meteorological factors. Therefore, critical temporal features such as the hour of the day and day of the year were encoded using sine and cosine transformations to capture cyclical trends accurately. This cyclical encoding ensures continuity across periodic boundaries, such as the transition across midnight or from December to January. Capturing these cyclical temporal variations is crucial in reducing predictive errors, particularly during transitional periods with rapid changes in solar angles. Meteorological parameters such as temperature, atmospheric pressure, wind speed, and dew point from the global dataset were incorporated to reflect weather-dependent variability in irradiance. Integrating meteorological data provides essential contextual information that improves the model’s ability to generalize under diverse weather conditions and thus enhances its robustness. The Hour_sin and Hour_cos features were computed as:

$$\text{Hour\_sin} = \sin\left(\frac{2\pi \cdot \text{Hour}}{24}\right), \qquad \text{Hour\_cos} = \cos\left(\frac{2\pi \cdot \text{Hour}}{24}\right)$$
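Similarly, the DayOfYear_sin and DayOfYear_cos features were computed as:

$$\text{DayOfYear\_sin} = \sin\left(\frac{2\pi \cdot \text{DayOfYear}}{365}\right), \qquad \text{DayOfYear\_cos} = \cos\left(\frac{2\pi \cdot \text{DayOfYear}}{365}\right)$$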
The input data to the models are the encoded hour and day values, temperature, pressure, wind speed, dew point, and the global solar irradiance of the target and neighboring sites. These features were scaled using scikit-learn’s MinMaxScaler. The models were then trained to predict local solar irradiance as the output.
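The cyclical encoding and min-max scaling steps can be sketched as follows. This is a minimal illustration rather than the study's implementation: it assumes a period of 24 for the hour and 365 for the day of year, the helper names are hypothetical, and the scaling is written out by hand to stay dependency-free (the study itself uses scikit-learn's scaler).

```python
import math


def cyclical_encode(value: float, period: float) -> tuple[float, float]:
    """Map a periodic quantity to (sin, cos) coordinates on the unit circle."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def min_max_scale(column: list[float]) -> list[float]:
    """Rescale a feature column linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# 23:30 and 00:00 are 23.5 h apart numerically but adjacent on the circle.
late_sin, late_cos = cyclical_encode(23.5, period=24.0)
midnight_sin, midnight_cos = cyclical_encode(0.0, period=24.0)
gap = math.hypot(late_sin - midnight_sin, late_cos - midnight_cos)

# Assumed 365-day period for the day of year (illustrative only).
doy_sin, doy_cos = cyclical_encode(365, period=365.0)

scaled_temperature = min_max_scale([10.0, 20.0, 30.0])
```

The small value of `gap` illustrates why the encoding preserves continuity across midnight, whereas the raw hour values would place the two samples at opposite ends of the feature range.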
3.3. NNGP Implementation
The NNGP model was implemented to predict site-specific GHI values using a combination of spatial, temporal, and feature-based distances. For each site, its k-NNs were identified based on Euclidean distance in the spatial domain, with geographic coordinates serving as inputs. By focusing solely on these neighbors, the NNGP framework effectively localized the GP. This significantly reduced the dimensionality of the covariance matrices. Temporal dependencies were incorporated by considering the time difference between the current observation and historical data from neighboring sites. Temporal distances were calculated in minutes to ensure fine-grained temporal resolution suitable for solar irradiance modeling. Furthermore, feature-based distances were included to account for meteorological conditions such as temperature, pressure, wind speed, and dew point. These feature-based distances were computed in the feature-space to capture similarities and differences in environmental conditions across sites.
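The neighbor-selection step can be illustrated with a short sketch. It assumes that site coordinates are already expressed in a planar system (here, kilometres), so that Euclidean distance is meaningful; the site names and coordinates are made up for illustration and are not those of Table 1.

```python
import math


def k_nearest_sites(
    coords: dict[str, tuple[float, float]], target: str, k: int = 3
) -> list[str]:
    """Return the k sites closest to `target` by Euclidean distance."""
    tx, ty = coords[target]
    distances = [
        (math.hypot(x - tx, y - ty), name)
        for name, (x, y) in coords.items()
        if name != target
    ]
    distances.sort()  # ascending distance, ties broken by site name
    return [name for _, name in distances[:k]]


# Hypothetical planar coordinates in km (not the study's sites).
sites = {
    "A": (0.0, 0.0),
    "B": (3.0, 4.0),
    "C": (1.0, 0.0),
    "D": (10.0, 10.0),
    "E": (0.0, 2.0),
}
neighbors_of_a = k_nearest_sites(sites, "A", k=3)
print(neighbors_of_a)  # ['C', 'E', 'B']
```

Restricting each site to its k nearest neighbors is what keeps the covariance matrices small: their size grows with k rather than with the total number of sites.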
The squared exponential kernel in (5), used to compute the covariances, incorporated spatial, temporal, and feature-based distances, with a separate length-scale parameter for each distance type. Larger values of these parameters implied that distant neighbors and observations had a significant influence on the prediction outcome. These parameters, together with the variance, were tuned to balance their influence on the predictions. The final values of the hyperparameters used for the results in this paper are presented in Table 3.
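To make the role of the length scales concrete, the following sketch evaluates one common form of a squared exponential kernel over three distance types. The exact expression in (5) is not reproduced in this section, so the additive form inside the exponent, the parameter names, and the numerical values are assumptions made for illustration; they are not the tuned values of Table 3.

```python
import math


def se_kernel(
    d_space: float,
    d_time: float,
    d_feat: float,
    variance: float = 1.0,
    ell_space: float = 1.0,
    ell_time: float = 1.0,
    ell_feat: float = 1.0,
) -> float:
    """Squared exponential covariance with one length scale per distance type.

    Assumed form: variance * exp(-0.5 * sum((d / ell)^2)).
    """
    scaled = (
        (d_space / ell_space) ** 2
        + (d_time / ell_time) ** 2
        + (d_feat / ell_feat) ** 2
    )
    return variance * math.exp(-0.5 * scaled)


# Same 100 km separation, two different spatial length scales.
cov_short_scale = se_kernel(100.0, 0.0, 0.0, ell_space=50.0)
cov_long_scale = se_kernel(100.0, 0.0, 0.0, ell_space=200.0)
```

With the longer spatial length scale, the covariance at the same separation is larger, which is the sense in which larger length scales let distant neighbors influence the prediction.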
To enhance the predictive capacity of the model, cyclical encoding of time was applied to represent the periodic nature of hours and days of the year. The covariance matrix among training points was regularized for numerical stability during inversion. Regularization was achieved by adding a small offset (penalty) term to the training covariance matrix to prevent singularities. Predictions were made using the GP regression formula provided in (6). This was achieved by combining the computed covariance matrices to scale the residuals between the neighbors’ local GHI and their predicted values from the linear regression model computed in (7). The resulting scaled residuals, known as the Gaussian correction term, were added to the linear regression estimates for the target site to obtain the final downscaled GHI.
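The regularized prediction step can be sketched with NumPy as below. This is a hedged illustration of the standard GP mean update, not a transcription of (6) and (7): the function name, the argument names, the choice of adding the offset to the diagonal, and the jitter value of 1e-6 are all assumptions.

```python
import numpy as np


def gp_corrected_prediction(
    K_train: np.ndarray,     # covariance among the neighbors' training points
    k_star: np.ndarray,      # covariance between the target point and those points
    y_nbrs: np.ndarray,      # neighbors' measured local GHI
    mean_nbrs: np.ndarray,   # linear-regression estimates for the neighbors
    mean_target: float,      # linear-regression estimate for the target
    jitter: float = 1e-6,    # assumed small offset for numerical stability
) -> float:
    """Add a Gaussian correction term to the regression estimate of the target."""
    K_reg = K_train + jitter * np.eye(K_train.shape[0])
    residuals = y_nbrs - mean_nbrs
    # Solve the linear system instead of forming an explicit inverse.
    weights = np.linalg.solve(K_reg, residuals)
    return float(mean_target + k_star @ weights)


# Illustrative numbers only.
K = np.array([[1.0, 0.5], [0.5, 1.0]])
k_target = np.array([0.8, 0.3])
prediction = gp_corrected_prediction(
    K, k_target,
    y_nbrs=np.array([500.0, 450.0]),
    mean_nbrs=np.array([480.0, 460.0]),
    mean_target=470.0,
)
```

If the neighbors' residuals are all zero, the correction vanishes and the prediction reduces to the regression estimate; the jitter matters mainly when two training points are nearly identical, which would otherwise make the covariance matrix singular.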
3.4. NNRF Implementation
Similar to the NNGP model, the NNRF model was developed to predict site-specific GHI values by leveraging spatial, temporal, and meteorological information. For each site, the model utilized its k-NNs, determined using Euclidean distances computed from the geographic coordinates of the sites. A k value of 3 was used in this model. This nearest-neighbor framework enabled the model to incorporate local spatial relationships, ensuring that predictions were informed by the most relevant neighboring data. Temporal dependencies were integrated by incorporating the encoded Hour and Day as part of the input features. This was carried out to ensure that training data reflected consistent diurnal and seasonal patterns. Additionally, meteorological features such as temperature, pressure, wind speed, and dew point were included to capture the environmental factors influencing solar irradiance variability.
The global GHI values of neighboring sites were also included as additional features to provide spatial context for each prediction. These neighbor features were aligned temporally to ensure consistency between the target site and its neighbors. The RF model, comprising an ensemble of decision trees, was trained using these features to predict the local GHI values. The model was initialized with 150 estimators to balance accuracy and computational efficiency. The ensemble approach enabled the RF model to handle complex non-linear relationships between input features and the target variable.
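The temporal alignment of neighbor features can be sketched as follows. The data layout (dictionaries keyed by timestamp) and the function name are assumptions chosen for brevity; the resulting rows could then be passed to a random forest regressor, for example scikit-learn's `RandomForestRegressor` with `n_estimators=150`.

```python
def build_feature_rows(
    target_features: dict[str, list[float]],
    neighbor_ghi: list[dict[str, float]],
) -> dict[str, list[float]]:
    """Append each neighbor's global GHI to the target's features, per timestamp.

    Timestamps missing from any neighbor series are dropped so that every row
    has the same length and refers to a single point in time.
    """
    rows = {}
    for timestamp, features in target_features.items():
        if all(timestamp in series for series in neighbor_ghi):
            rows[timestamp] = list(features) + [
                series[timestamp] for series in neighbor_ghi
            ]
    return rows


# Hypothetical inputs: [hour_sin, hour_cos, global GHI of the target site].
target = {
    "2022-06-01T12:00": [0.0, -1.0, 850.0],
    "2022-06-01T12:05": [-0.02, -1.0, 845.0],
}
neighbors = [
    {"2022-06-01T12:00": 830.0, "2022-06-01T12:05": 828.0},
    {"2022-06-01T12:00": 790.0},  # second neighbor has a gap at 12:05
]
rows = build_feature_rows(target, neighbors)
```

Dropping incomplete timestamps is one simple alignment policy; imputing the missing neighbor value would be an alternative, at the cost of introducing estimated inputs.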
3.5. Scalability Test
Additionally, we investigated how well both models scale in terms of accuracy and computational speed with varying numbers of k-NNs. We achieved this by adding seven more sites to the existing ones to make fifteen, and retrained the models on a new dataset. To ensure adequate evaluation of the model’s scaling performance, the new training dataset was collected from 1 January to 30 December 2023, different from the former training dataset. Similar to earlier simulations, the dataset for 31 December 2023 was excluded to validate the accuracy of the scaled models. The k-NNs were varied from 2 to 14 for each of the sites. As a result of the increased data size, the scalability test was performed on one compute node of the South Dakota State University’s Innovator’s Cluster equipped with 2 Intel Xeon Gold 6342 CPUs with 48 cores. Each processor operated at a clock speed of 2.80 GHz and was supported by 256 GB of RAM.
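A scalability sweep of this kind can be organized with a small timing harness such as the one below. The harness and the stand-in workload are hypothetical; in practice, the callable would train and evaluate the NNGP or NNRF for the given number of neighbors.

```python
import time
from typing import Callable, Iterable


def time_over_k(
    fit_and_predict: Callable[[int], object], k_values: Iterable[int]
) -> dict[int, float]:
    """Record wall-clock seconds taken by `fit_and_predict` for each k."""
    timings = {}
    for k in k_values:
        start = time.perf_counter()
        fit_and_predict(k)
        timings[k] = time.perf_counter() - start
    return timings


def dummy_model(k: int) -> int:
    """Stand-in workload whose cost grows with k (not a real model)."""
    return sum(i * i for i in range(1000 * k))


# Sweep k from 2 to 14, matching the range used in the scalability test.
timings = time_over_k(dummy_model, range(2, 15))
```

Plotting `timings` against k would then show whether the run time grows roughly linearly or polynomially with the number of neighbors.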
3.6. Evaluation of Models
After the initial training of the two models on the dataset from 1 January 2022 to 30 December 2022, the models were validated by making day-ahead predictions of the downscaled local GHI for all eight sites for 31 December 2022. The models’ performance was evaluated using the Mean Absolute Error (MAE) and the Goodness of Fit (GoF). The MAE was computed as:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|$$

where $y_i$ is the actual or true value of the i-th observation, $\hat{y}_i$ is the predicted value corresponding to the i-th observation, and N denotes the sample size in the testing set. To measure how well the estimated models capture the patterns of the dataset, the GoF is computed using the normalized Root Mean Squared Error (NRMSE), such that:
where NRMSE is given by:
Using the NRMSE to compute the GoF has the advantage of normalizing the error, making it dimensionless and allowing easier comparison of model performance across datasets with different units. Additionally, while MAE, NRMSE, and GoF measure the effectiveness of the models, we assessed their efficiency by measuring the computational time, as speed is required for real-time power system operations.
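The metrics can be sketched in a few lines of Python. The MAE follows its standard definition. For the NRMSE and GoF, the sketch assumes one common convention, namely normalizing the error norm by the norm of the mean-centered observations and reporting GoF = 100 × (1 − NRMSE) as a percentage; other normalizations (for example, by the range of the observations) are also in use, so this choice is an assumption.

```python
import math


def mae(y_true: list[float], y_pred: list[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def nrmse(y_true: list[float], y_pred: list[float]) -> float:
    """Error norm divided by the norm of the mean-centered observations (assumed)."""
    mean_true = sum(y_true) / len(y_true)
    error_norm = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)))
    spread_norm = math.sqrt(sum((t - mean_true) ** 2 for t in y_true))
    return error_norm / spread_norm


def gof_percent(y_true: list[float], y_pred: list[float]) -> float:
    """Goodness of fit as a percentage, assuming GoF = 100 * (1 - NRMSE)."""
    return 100.0 * (1.0 - nrmse(y_true, y_pred))


# Illustrative values in W/m^2 (not measurements from the study).
observed = [0.0, 200.0, 400.0, 600.0]
predicted = [10.0, 190.0, 410.0, 590.0]
print(mae(observed, predicted))  # 10.0
example_gof = gof_percent(observed, predicted)
```

Under this convention, a perfect prediction gives a GoF of 100%, and a model no better than predicting the mean of the observations gives a GoF of 0%.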
3.7. Case Study on Puerto Rico
To evaluate the generalizability of the proposed data-driven models, it is essential to test their performance not only on unseen data but also across distinct climate zones. To this end, we conducted an additional validation using a new dataset from Puerto Rico, which represents a markedly different climatic environment compared to Texas. Puerto Rico was chosen due to its highly variable and erratic weather patterns, providing a test of the models’ robustness. Successful performance in this setting would demonstrate the models’ ability to adapt to diverse environmental conditions. As with the Texas simulation, comparable datasets, including global and local solar irradiance as well as meteorological variables, were collected for this analysis. The data span from 1 February 2023 to 31 December 2023, across five geographically distributed sites in Puerto Rico. Figure 6 presents the map of the selected locations.
Both the NNGP and the NNRF were trained on the dataset from Puerto Rico from 1 February 2023 to 30 December 2023. Similar to the earlier simulations, predictions were made for 31 December 2023. The hyperparameters of both models were kept unchanged, except that the number of decision trees for the RF was tuned from 100 trees to 150 trees.
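The day-ahead validation protocol, in which the final day is held out and all earlier data are used for training, can be sketched as a simple temporal split. The record layout and the function name are assumptions for illustration.

```python
from datetime import date, datetime


def split_last_day(
    records: list[tuple[datetime, float]], test_day: date
) -> tuple[list[tuple[datetime, float]], list[tuple[datetime, float]]]:
    """Split records into training (strictly before test_day) and test (on test_day)."""
    train = [(ts, value) for ts, value in records if ts.date() < test_day]
    test = [(ts, value) for ts, value in records if ts.date() == test_day]
    return train, test


# Hypothetical (timestamp, local GHI) records around the hold-out boundary.
records = [
    (datetime(2022, 12, 30, 23, 55), 0.0),
    (datetime(2022, 12, 31, 0, 0), 0.0),
    (datetime(2022, 12, 31, 12, 0), 512.3),
]
train, test = split_last_day(records, date(2022, 12, 31))
```

Because every test timestamp lies after the last training timestamp, the evaluation measures extrapolation in time (forecasting) rather than interpolation within the training period.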
5. Conclusions
This study evaluates the performance of the NNRF and NNGP models in downscaling solar irradiance to localized measurements. The NNRF model performs better than the NNGP model in terms of accuracy and computational time during both training and validation. The NNRF recorded an average validation accuracy of 90.61%, which outperformed the 85.88% recorded by the NNGP model. The NNRF model also achieved a 2.5-fold improvement in computational speed over the NNGP model. A case study on five sites in Puerto Rico further confirmed the superior performance of the NNRF model over the NNGP model in terms of accuracy and computational speed. Although the NNGP lags behind in terms of accuracy and computational speed, it shows strength in terms of interpretability and prediction uncertainty quantification. Its hyperparameters, such as the variance and the temporal, spatial, and feature-based scale parameters, give insight into how far in time and space past observations and neighboring sites influence the prediction outcome. This makes the NNGP more suitable for probabilistic estimates, which demand more transparency of the modeling process.
Furthermore, scalability tests that measured both models’ accuracy and computational speed with the number of k-NNs varied from 2 to 14 showed linear scaling for the NNRF and polynomial scaling for the NNGP in terms of computational time. The NNRF reached a locally optimal accuracy with 3 k-NNs, while the NNGP required 6 k-NNs to do so. Irrespective of the increased number of k-NNs, the NNGP still could not outperform the NNRF, which had a scaling accuracy of 90.13% compared to 88.74% for the NNGP. These findings support the superiority of the NNRF for large-scale solar irradiance downscaling tasks. Further studies could investigate improving the NNRF’s accuracy by combining the nearest-neighbor model with artificial neural networks, extreme gradient boosting (XGBoost), or other advanced machine-learning models.
These findings could be very useful in the implementation of FERC Order 2222 by enabling accurate estimation of solar PV capacity for day-ahead dispatch. Additionally, owing to the improved accuracy and computational speed of the NNRF model, its application can be extended to the real-time electricity market, with accurate PV estimates on a 5-min rolling basis. Finally, the use of downscaled solar irradiance forecasts reduces the uncertainty in solar PV generation, along with the associated reserve requirements and instability issues in power dispatch planning.