1. Introduction
In recent decades, Renewable Energy Sources (RESs) have globally penetrated the Electric Power System (EPS), as they offer a wide range of advantages in the evolving energy landscape. RESs are inexhaustible sources of environmentally friendly energy that contribute to reducing dependence on conventional generation units. Among them, solar energy stands out as the most abundant and globally accessible resource [
1]. One of the primary applications of solar energy is photovoltaic (PV) power systems, which use PV cells to convert solar radiation into electric power. PV systems are considered highly reliable and offer considerable flexibility in installation, a decreasing cost over time [
2] and an increasing average efficiency (through the introduction of advanced materials such as perovskite and technologies like tandem PVs) [
3,
4]. However, PV power generation is characterized by high variability that is primarily attributed to cloud movement, which undermines its reliability.
In the context of increasing the integration of PV generation into the EPS while simultaneously maintaining system reliability, significant attention has been directed towards PV power forecasting. PV power forecasting is utilized in a wide range of applications across various spatial and temporal scales, and its accuracy can significantly impact the stability of the EPS. PV power forecasting also brings value to all stakeholders within the electricity market. For power system operators, it facilitates congestion management and the extraction of operational flexibility. Energy producers benefit through improved participation in electricity and balancing markets while also minimizing the risk of penalties. Finally, forecasting is advantageous for prosumers—individuals who both consume and produce electricity—by enabling more effective management of household energy loads [
5].
Depending on the forecasting horizon, PV power forecasts are typically categorized into day-ahead, intra-day, and intra-hour forecasts [
6]. This paper focuses on intra-hour forecasting, as it is critically important for the safe and economically efficient operation of EPS. Intra-hour minute-scale forecasting plays a key role in various applications, such as ramp-rate control, optimal management of energy storage systems, and real-time demand response [
5].
A widely adopted approach for intra-hour PV power forecasting involves the use of ground-based sky images. Compared to numerical data alone, sky images provide significantly richer information regarding the presence and movement of clouds [
7]. PV power forecasting based on sky images can be classified into two categories. The first group of methods directly translates sky images into PV power output using deep learning techniques [
8,
9]. The second group introduces an intermediate stage, in which cloud motion is modeled and future sky conditions are predicted before being translated into PV generation [
10]. Compared with the first group of methods, motion-based approaches have the advantage of establishing a clearer physical link between cloud dynamics and PV variability while also improving the robustness of forecasts under rapidly changing sky conditions [
11].
Methods for modeling cloud motion in solar forecasting can be grouped into Cloud Motion Vector (CMV)-based methods [
12,
13,
14] and Artificial Intelligence (AI) approaches [
10]. CMV-based methods, including Optical Flow (OF) and Block Matching (BM) algorithms, are computationally efficient and interpretable but struggle with rapidly evolving or overlapping clouds because they assume linear cloud motion. This linearity assumption limits the forecasting horizon and imposes the requirement for high temporal resolution in the input imagery. AI-based approaches, such as Convolutional Neural Networks (CNNs), capture non-linear spatiotemporal patterns and enhance robustness under variable conditions; however, they typically require deeper architectures to avoid premature convergence to local optima, demanding large datasets for training and substantial computational resources [
11], which makes them impractical for local smart microgrids. A summary of CMV-based and AI-based Cloud Motion Modeling (CMM) methods used in solar forecasting is provided in
Section 2.
This paper proposes a novel multi-step sky image prediction model, which can be applied for minute-scale PV power forecasting. Unlike previous methods, the proposed hybrid approach combines physics-informed data pre-processing with deep learning to effectively capture non-linearities of cloud dynamics without requiring excessive computational resources. To this end, a dataset of sky images was classified into clusters using a recently proposed method based on unsupervised learning and hybrid image feature representation, and cluster-specific CNNs are trained to forecast sequences of sky images. The key contributions of this paper are summarized as follows:
The combination of Auto-Encoder (AE)-like CNNs with a physics-informed data preprocessing pipeline primarily focusing on input classification. To the best of our knowledge, no prior deep learning approach for sky image forecasting has utilized a physics-informed data preprocessing pipeline. The proposed model simplifies the original forecasting problem by decomposing it into simpler subproblems comprising more homogeneous data. This approach lowers the risk of premature convergence to suboptimal solutions and thus decreases training data requirements and enhances the generalization capability of the AE-like CNNs.
A sensitivity analysis is separately conducted for each cluster. The optimal kernel size and number of hidden layers are separately determined for the AE-like CNN associated with each cluster, rather than being universally fixed across all clusters. This per-cluster sensitivity analysis allows for optimal adaptation to the specific characteristics of each sky condition and further reduces the risk of premature convergence. In previous works, the hyperparameter selection process was either not clearly defined or limited to global sensitivity analysis, without considering cluster-specific variability.
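To make the proposed workflow more tangible, a minimal Python sketch of the cluster-then-train-then-route pipeline is given below. All function names (cluster_of, build_and_train_cnn, predict) are hypothetical placeholders standing in for the components detailed in Sections 3 and 4; this is an illustrative outline under those assumptions, not the actual implementation.

```python
# Hypothetical sketch: images are first assigned to sky-condition clusters,
# one AE-like CNN is trained per cluster, and at inference time an input
# sequence is routed to the model of its cluster.
import numpy as np

def train_pipeline(image_sequences, cluster_of, build_and_train_cnn):
    """cluster_of(seq) -> cluster id; build_and_train_cnn(seqs, cid) -> trained model."""
    labels = np.array([cluster_of(seq) for seq in image_sequences])
    return {cid: build_and_train_cnn(
                [s for s, l in zip(image_sequences, labels) if l == cid], cid)
            for cid in np.unique(labels)}

def forecast(input_sequence, models, cluster_of):
    """Route an incoming sky image sequence to its cluster-specific model."""
    return models[cluster_of(input_sequence)].predict(input_sequence)
```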
The remainder of this paper is organized as follows: In Section 2, a brief overview of the related literature is provided. In Section 3, the methodology of the proposed sky image forecasting framework is presented along with the fundamental theoretical background. Details of the experimental setup and the proposed prediction process are provided in Section 4. Section 5 presents and discusses the experimental results. The main conclusions are summarized in Section 6.
2. Related Work
For several years, sky image forecasting typically relied on physics-informed CMM techniques to extract CMVs and extrapolate the future position of clouds. A comprehensive survey of CMV-based methods can be found in [
15]. Commonly used CMM techniques for sky images include OF [
12], BM [
13], and Particle Image Velocimetry (PIV) [
14]. These methods generally rely on linear motion assumptions and thus fail to capture the non-linear nature of cloud dynamics, such as cloud deformation and displacement [
16]. Moreover, traditional CMV-based methods assume brightness consistency between consecutive images, making them prone to errors induced by reflections, noise, and the low resolution of cheap camera systems. These limitations constrain the forecasting horizon [
17] and necessitate sky images to be captured at high temporal resolutions, which is not always feasible in practice.
Various efforts have been made to overcome the challenges of traditional CMV-based methods. A novel 3D CMM approach leveraging a network of All-Sky Imagers (ASIs) was introduced in [
18]. In [
19], several modifications to the sector ladder method were introduced to address periods of high intermittency and enable real-time irradiance forecasting. In [
20], a CMV-based technique incorporating image-phase-shift invariance and Fourier phase correlation theory was developed for improved cloud displacement estimation and short-term PV power forecasting. While these methods managed to improve cloud displacement forecasting accuracy, they remain constrained by the inherent linear assumptions of traditional CMV-based methods, particularly under highly variable sky conditions and coarser temporal resolutions.
In addition, multiple-camera systems have also been explored to enhance the spatial coverage and robustness of CMV estimation. For instance, ref. [
21] presented a doctoral study where a network of all-sky imagers was utilized to derive CMVs and improve nowcasting performance. This work demonstrated the advantages of multi-view setups in reducing motion ambiguity and improving accuracy under highly variable sky conditions.
In [
16], OF and BM approaches were linearly combined with a feature matching method into an ensemble model, with weights determined using Particle Swarm Optimization (PSO). The model was separately calibrated for each of the sky image classes generated using k-Means clustering on features extracted from Gray Level Co-occurrence Matrices (GLCM). The ensemble method consistently outperformed the standalone approaches, highlighting the effectiveness of combining complementary techniques. Furthermore, the classification of input images played a crucial role in improving accuracy, as it allowed the ensemble to be tailored to distinct sky conditions. However, the ensemble model still exhibited relatively high errors in some of the more challenging classes, likely due to the inherent linear nature of the ensemble combination and the coarse temporal resolution of the sky images. Moreover, fine-tuning the hyperparameters of each standalone model of the ensemble remains a challenging task, particularly when fine-tuning is performed separately for each cluster.
The rapid advancement of AI in recent years has driven widespread adoption of deep learning techniques in computer vision applications. Inspired by video prediction models, [
10] utilized AE-like CNNs for sequential sky image prediction based on previous image sequences. Unlike CMV-based methods, this approach demonstrated greater robustness to noise and coarser temporal resolutions. Other studies have bypassed sky image forecasting altogether, directly predicting PV generation from sky images through deep end-to-end CNN-based models, i.e., Deep Neural Networks (DNNs). For example, ref. [
8] developed several end-to-end models, with those leveraging sequences of sky images as input outperforming others under more dynamic conditions. In [
9], ECLIPSE was proposed for the joint prediction of segmented sky images or satellite images alongside associated irradiance values.
DNN-based models have also been successfully applied to satellite imagery for solar irradiance forecasting. In [
22] a deep learning framework that replaces traditional CMV extraction with CNN-based motion modeling on satellite images is proposed. Although this work focuses on satellite data rather than ground-based sky images, it highlights a similar trend toward replacing physics-based motion vector estimation with data-driven approaches, reinforcing the motivation for CNN-based sky image forecasting frameworks.
Although DNN methods effectively capture local non-linear cloud dynamics, they exhibit limitations. Optimization via backpropagation-based gradient descent is prone to local optima entrapment, often resulting in premature convergence and sub-optimal model calibration, particularly in complex scenarios with highly non-convex objective spaces, such as those encountered in minute-scale PV generation forecasting [
23]. In addition, the inherent locality of convolutional filters restricts their ability to capture the global structure of sky images, impairing cloud tracking performance under highly dynamic sky conditions [
24]. Furthermore, purely data-driven DNN architectures depend heavily on historical datasets, limiting their generalization capability to sky conditions that share little mutual information with the training data. To mitigate these limitations, deep generative AI models have recently attracted growing interest. In [
24], an end-to-end multi-modal model utilizing Vision Transformers (ViTs) was proposed for short-term irradiance forecasting. Acknowledging the limitations of end-to-end modeling for cloud tracking, ref. [
17] introduced a two-step approach for PV power forecasting, combining a U-net model with SkyGPT, a deep generative AI model for stochastic sky image sequence prediction. While deep generative AI models help address part of the shortcomings of DNN approaches, they still rely on backpropagation-based gradient descent optimization algorithms and demand large datasets and extensive training on resource-intensive platforms, significantly increasing computational requirements. To alleviate this, in this paper we propose a hybrid approach that deviates from the pure data-driven paradigm.
5. Results
5.1. Assessment Metrics
To evaluate the performance and estimate the average error of the sky image prediction models, a comparison between the predicted images and the target (ground-truth) images is required. This comparison is achieved using quantitative evaluation metrics. In the case of sky images, commonly employed metrics include the Mean Squared Error (MSE), the Structural Similarity Index Measure (SSIM), and the Peak Signal-to-Noise Ratio (PSNR) [10,30].
5.1.1. Mean Squared Error
MSE compares two images pixel by pixel based on their intensity values. MSE is computed as follows [31]:
$$\mathrm{MSE}(x, y) = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\left[x(i,j) - y(i,j)\right]^2,$$
where $x$ is the predicted image, $y$ is the target image, $(i, j)$ are the pixel coordinates, and $M$, $N$ are the image dimensions. The smaller the MSE value, the more similar the two images are.
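For illustration, a minimal NumPy-based sketch of the MSE computation between a predicted and a target image is given below; it is not the exact implementation used in our experiments.

```python
import numpy as np

def mse(x: np.ndarray, y: np.ndarray) -> float:
    """Pixel-wise Mean Squared Error between predicted image x and target image y."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    return float(np.mean((x - y) ** 2))
```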
5.1.2. Structural Similarity Index Measure
SSIM quantifies the degree of similarity between two images. It depends on the following three factors [
32]:
Luminance: A measure of the brightness difference in the two images;
Contrast: A contrast comparison (i.e., the difference between bright and dark regions within the image) between the two images;
Structure: An estimation of the spatial arrangement of luminance patterns within the images;
The mathematical formulation of SSIM between two images is presented through Equations (10)–(14):
$$\mathrm{SSIM}(x, y) = l(x, y) \cdot c(x, y) \cdot s(x, y),$$
$$l(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \quad c(x, y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \quad s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3},$$
where $x$ is the predicted image, $y$ is the target image, $l(x,y)$ is the luminance term, $c(x,y)$ is the contrast term, $s(x,y)$ is the structural term, $\mu_x$ and $\mu_y$ are the mean pixel values of images $x$ and $y$, respectively, $\sigma_x$ and $\sigma_y$ are the standard deviations of the pixel values of images $x$ and $y$, respectively, $\sigma_{xy}$ is the covariance of the pixel values of images $x$ and $y$, and $C_1$, $C_2$, $C_3$ are constants added to avoid division by values close to zero in the denominators of the terms. In this paper, we select the constant values suggested in [10]. Unlike MSE, the higher the SSIM value, the more similar the two images are.
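For illustration, SSIM can be computed with off-the-shelf tooling; the sketch below uses scikit-image and assumes grayscale images and the library's default constants rather than the exact values adopted from [10].

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def ssim_score(x: np.ndarray, y: np.ndarray) -> float:
    """SSIM between predicted image x and target image y (grayscale assumed)."""
    # data_range is required for float inputs; fall back to 1.0 for constant images.
    data_range = float(y.max() - y.min()) if y.max() > y.min() else 1.0
    return float(ssim(x, y, data_range=data_range))
```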
5.1.3. Peak Signal-to-Noise Ratio
The PSNR is a traditional image quality metric that estimates fidelity by comparing the maximum possible signal strength to the distortion introduced by reconstruction or compression. The PSNR is mathematically modeled through the following equation:
$$\mathrm{PSNR}(x, y) = 10 \log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}(x, y)}\right),$$
where $x$ is the predicted image, $y$ is the target image, $\mathrm{MAX}$ is the maximum possible pixel intensity value, and $\mathrm{MSE}(x, y)$ is computed over the pixel coordinates $(i, j)$ and image dimensions $M$, $N$ as defined above. Like SSIM, the higher the PSNR value, the more similar the two images are.
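A corresponding NumPy sketch of the PSNR computation, assuming 8-bit images with a maximum pixel value of 255, is shown below.

```python
import numpy as np

def psnr(x: np.ndarray, y: np.ndarray, max_value: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio (in dB) between predicted image x and target y."""
    err = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    if err == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / err))
```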
5.2. Benchmark Forecasting Models
The following benchmark methods are developed for comparison with the proposed 8-Cluster AE-like CNN sky image prediction model:
Persistence;
CMV-based method;
1-Cluster AE-like CNN;
3-Cluster AE-like CNNs;
6-Cluster AE-like CNNs.
5.2.1. Persistence Method
The persistence method assumes no further change in a random variable's value. If $X_t$ is the sky image at time $t$, then the prediction for time $t+k$ will be:
$$\hat{X}_{t+k} = X_t,$$
where $k$ is any timestep in the forecasting horizon.
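As a reference, the persistence baseline reduces to repeating the last observed frame over the forecasting horizon; a minimal sketch (not the exact benchmark script) is given below.

```python
import numpy as np

def persistence_forecast(last_image: np.ndarray, horizon: int) -> list:
    """Persistence baseline: every future frame equals the last observed frame."""
    return [last_image.copy() for _ in range(horizon)]
```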
5.2.2. CMV-Based Method
The CMV-based benchmark is based on the Gunnar Farneback OF method [
33]. This method compares the pixel intensities between two consecutive sky images to extract a dense CMV field, which is then used to linearly extrapolate future cloud movement. The Gunnar Farneback OF method has been widely used for CMM from sky images, both as a standalone method and in combination with other approaches [
34].
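A simplified sketch of such a CMV-based benchmark is given below, using OpenCV's Farneback dense optical flow and a backward-warping approximation for linear extrapolation; the parameter values and the warping scheme are illustrative assumptions, not the exact configuration used in our experiments.

```python
import cv2
import numpy as np

def farneback_extrapolation(img_prev: np.ndarray, img_curr: np.ndarray, steps: int) -> list:
    """Linearly extrapolate cloud motion from two consecutive grayscale (8-bit) frames.

    The dense flow field estimated between the last two frames is assumed
    constant and is used to warp the current frame forward step by step.
    """
    flow = cv2.calcOpticalFlowFarneback(img_prev, img_curr, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = img_curr.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    forecasts = []
    for k in range(1, steps + 1):
        # Approximate forward warping by sampling backwards along the flow field.
        map_x = (grid_x - k * flow[..., 0]).astype(np.float32)
        map_y = (grid_y - k * flow[..., 1]).astype(np.float32)
        forecasts.append(cv2.remap(img_curr, map_x, map_y, cv2.INTER_LINEAR))
    return forecasts
```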
5.2.3. AE-like CNN
Apart from the proposed 8-Cluster AE-like CNN model, similar AE-like CNN models with 3 and 6 clusters were developed according to the procedure that was thoroughly described in Section 3 and Section 4, to assess the impact of the number of clusters on the forecasting performance. In addition, a 1-Cluster AE-like CNN, trained on the entire dataset, was included to assess the model's generalization capability when using all available data.
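For illustration, a minimal PyTorch sketch of a 2D AE-like encoder-decoder CNN is provided below. The depth, width, and activation choices shown here are illustrative placeholders under the assumption of 32 × 32 inputs scaled to [0, 1]; the actual architectures and cluster-specific hyperparameters are described in Sections 3 and 4 and in Section 5.3.

```python
import torch
import torch.nn as nn

class AELikeCNN(nn.Module):
    """Illustrative 2D encoder-decoder ("AE-like") CNN for sky image sequence
    prediction: n_in past frames stacked as channels in, n_out future frames out."""

    def __init__(self, n_in: int = 4, n_out: int = 4, kernel_size: int = 5, width: int = 32):
        super().__init__()
        pad = kernel_size // 2
        self.encoder = nn.Sequential(
            nn.Conv2d(n_in, width, kernel_size, stride=2, padding=pad), nn.ReLU(),
            nn.Conv2d(width, 2 * width, kernel_size, stride=2, padding=pad), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(2 * width, width, kernel_size, stride=2,
                               padding=pad, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(width, n_out, kernel_size, stride=2,
                               padding=pad, output_padding=1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

# Example: AELikeCNN()(torch.rand(1, 4, 32, 32)) returns a (1, 4, 32, 32) tensor.
```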
5.3. Sensitivity Analysis
As mentioned in Section 4.3, the proposed 8-Cluster sky image prediction model uses AE-like CNNs for the clusters that are associated with sky conditions of intense variability and the persistence method for the rest of the clusters. In order to find the optimal combination of hyperparameters for which the AE-like CNNs perform best, a sensitivity analysis was conducted. The hyperparameters to be fine-tuned were the convolution type (CT), the kernel size (KS), the number of hidden layers (NHL), and the input image dimensions (IID). The complete set of experiments is represented by the following Cartesian product:
$$E = \mathrm{CT} \times \mathrm{KS} \times \mathrm{NHL} \times \mathrm{IID}, \quad \mathrm{CT} = \{\text{2D}, \text{3D}\}, \quad \mathrm{IID} = \{32 \times 32,\; 64 \times 64\}.$$
Preliminary tests showed that in cases where CT was 3D, the AE-like CNNs exhibited worse performance and a significantly longer training time. The experiments with 64 × 64 input resolution revealed an average increase of approximately 200% in training time and about 30% in inference time, while the gain in prediction accuracy was limited to an average of 3.57% in terms of MSE. Given that the proposed framework is designed to be deployable on low-cost hardware platforms (e.g., standard personal computers), computational efficiency was prioritized over marginal accuracy improvements. For this reason, and to reduce the overall number of evaluations and the computational overhead, CT = 2D and IID = 32 × 32 were selected, and the sensitivity analysis continued for KS and NHL. The simplified Cartesian product is now as follows:
$$E' = \mathrm{KS} \times \mathrm{NHL}, \quad \text{with } \mathrm{CT} = \text{2D and } \mathrm{IID} = 32 \times 32 \text{ fixed}.$$
This simplification makes the implementation of a per-cluster sensitivity analysis computationally feasible. Thus, each cluster obtains its own cluster-specific hyperparameter values that correspond to the particular sky condition. The per-cluster sensitivity analysis results for each of the eight clusters of the proposed model are shown in Table 2, with the optimal combination of KS and NHL for each cluster marked in bold. As can be seen, the model's performance is highly affected by changes in hyperparameter values. In many cases, premature convergence can be observed, causing a significant deterioration in the assessment metric values. The sensitivity analysis is also visualized using boxplots for the MSE and SSIM metrics in Figure 8 (for better visualization clarity, outliers have been excluded so that the distribution of the remaining boxplots can be properly observed).
From the sensitivity analysis results it can be concluded that for KS = 3 the kernels are too small, resulting in overly local feature extraction, while AE-like CNNs require at least 9 layers to adequately model the input–output relationship. In general, KS = 5 and NHL = 9 or 11 yield the best results for most clusters.
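A hypothetical sketch of the per-cluster grid search is shown below; the candidate KS and NHL sets and the train_fn interface are assumptions made for illustration, not the exact experimental setup of Section 5.3.

```python
from itertools import product

# Assumed candidate values, with CT = 2D and 32x32 inputs fixed as in Section 5.3.
KS_CANDIDATES = [3, 5, 7]
NHL_CANDIDATES = [5, 7, 9, 11]

def per_cluster_search(train_fn, clusters):
    """train_fn(cluster_data, ks, nhl) is assumed to return a validation MSE."""
    best = {}
    for cluster_id, data in clusters.items():
        scores = {(ks, nhl): train_fn(data, ks, nhl)
                  for ks, nhl in product(KS_CANDIDATES, NHL_CANDIDATES)}
        best[cluster_id] = min(scores, key=scores.get)  # lowest validation MSE
    return best
```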
5.4. Final Forecasting Results
Table 3 shows the final values of the assessment metrics for each sky image forecasting model that was implemented. The results for the 3-Cluster AE-like CNN, the 6-Cluster AE-like CNN, and the proposed 8-Cluster AE-like CNN are aggregated across all clusters, weighted by sample count. From
Table 3 it can be seen that all AE-like CNN models perform better compared to the persistence method, with the proposed 8-Cluster model yielding the best results. The OF model achieves an improvement of 13.68% on SSIM and a deterioration of 21.32% on MSE, compared to persistence. This performance deterioration is likely due to the limitations of OF in coarser temporal resolutions. In the case of the model without classification, although the evaluation metrics indicate satisfactory similarity between the images, visual inspection highlights the need for improved results. The output image sequences suggested that the model responded adequately under clear-sky conditions, as it successfully identified the sun’s position and captured image brightness to a reasonably good extent. However, under cloudy conditions, although the sun’s location is detected, the model fails to accurately predict cloud distribution. These observations indicate that, for image-based forecasting tasks, relying solely on quantitative metrics is inadequate, as such measures treat images merely as numerical arrays and may overlook perceptual differences. Thus, complementing quantitative assessment with qualitative (visual) evaluation is essential to obtain a more complete understanding of model performance.
The performance of the AE-like CNN models improves significantly with the increase in the number of clusters, reaching improvements of 73.1% and 24.3% in MSE and SSIM, respectively, compared to persistence. Based on this observation, it can be concluded that the AE-like CNN model performs better, becoming more capable of recognizing patterns, when it is trained on more homogeneous image subsets. From Table 3, it can be seen that from the 6-Cluster model to the 8-Cluster model the improvements are comparatively small (0.86% in SSIM and 7.02% in MSE), suggesting that further classification is unnecessary. This saturation was expected to occur at some point, as beyond a certain level the subproblems become sufficiently simple and the data highly homogeneous, allowing the AE-like CNN to handle them effectively without the risk of getting trapped in a local optimum. The 8-Cluster model was selected as the proposed model, as it appears to be the “knee point” beyond which the additional accuracy gain is not worth the additional effort.
As far as training time is concerned, it declined as the number of clusters increased. Specifically, on the relatively low-capability computer system (described in Section 4.4.2) on which we executed our simulations, the model without classification required approximately 26.75 h for training, whereas the proposed 8-Cluster model required approximately 16.6 h of total training time, a significant reduction of 37.94%. This training time reduction may be attributed to several factors, such as the reduced per-epoch training time resulting from the smaller data subsets of each cluster and the overall fewer epochs required for convergence, since the more homogeneous clusters create simpler sub-problems. The processing time was at most 324 ms for a single prediction, and the memory requirements varied from 500 MB to 550 MB.
Table 4 presents the results per cluster for the proposed model. Overall, each cluster achieved strong performance, with notably low MSE values and SSIM values exceeding 90%, indicating strong correlation between real and predicted sequences. Clusters 1 and 6 achieved the best MSE (0.0155%) and SSIM (98.5%) values, respectively. Persistence demonstrated sufficient accuracy for the chosen clusters that are associated with sky conditions of mild variability, combining effectiveness and low computational cost.
In Figure 9 and Figure 10, examples of generated sky image forecasts are depicted for clusters 6 and 8, respectively. In each figure, the first row illustrates the real sequence, the second row the forecasted one, and the third row a heat map that visualizes the per-pixel MSE between the two images. In these two cases, different color bar scales are applied to the heat maps to ensure that the variations in error are clearly distinguishable. Comparing the forecasted sky images to the real ones, it is evident that the cloud distribution and coverage ratio have been accurately modeled, and the model has also correctly predicted whether the sun is blocked or not. Even in the case of Cluster 8, where more pronounced variations are present, the model was able to predict them with high accuracy. The heat maps confirm that the model has achieved its objective, since they are mostly white, indicating low MSE values, with a few isolated red patches that reveal localized error spikes.
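For reference, a per-pixel error heat map of the kind shown in Figures 9 and 10 can be generated with a few lines of Matplotlib; the sketch below assumes grayscale frames and a white-to-red colormap, and is not the exact plotting script used for the figures.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_error_heatmap(real: np.ndarray, pred: np.ndarray, vmax: float = None):
    """Per-pixel squared error between a real and a predicted frame."""
    err = (real.astype(np.float64) - pred.astype(np.float64)) ** 2
    plt.imshow(err, cmap="Reds", vmin=0.0, vmax=vmax)  # white = low error, red = high error
    plt.colorbar(label="squared error")
    plt.axis("off")
    plt.show()
```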
The per-cluster sensitivity analysis is a major factor in the performance of the proposed sky image prediction model. If a predetermined set of hyperparameters, derived from a sensitivity analysis on the entire dataset, is applied instead, the results of the proposed 8-Cluster model would be considerably worse. To demonstrate this, the AE-like CNNs of all clusters are trained using KS = 3 and NHL = 5, as suggested in [
10], and the results are compared to those obtained from the cluster-specific sensitivity analysis.
From the results in Table 5, it can be seen that with the use of the per-cluster sensitivity analysis, MSE and SSIM are improved by 61.87% and 12.97%, respectively. This happens because each cluster addresses a different subproblem of cloud motion, and thus its corresponding prediction model requires a specific configuration to adapt to the characteristics of its sky image cluster.
5.5. Evaluation on a Second Dataset
In order to further validate the robustness and generalizability of the proposed model, it was also applied to a second case study. The open-access dataset that was employed consists of high-resolution sky images (2048 × 2048) captured by a ground-based camera with a fish-eye lens at a frequency of 1 frame per minute at Stanford University [
35]. A total of 22,519 daytime images, sampled from January, February, May, June, and July, were acquired. While this additional case study represents a further step toward demonstrating generalizability, we acknowledge that important limitations remain regarding the evaluation of the proposed model across a broader range of contexts; thus, further future work is still required, as discussed in
Section 6.
The clustering process resulted in 8 distinct clusters, each representing a characteristic sky condition. Specifically, the clusters correspond to: clear sky with limited broken clouds (4487 images), thin cirrus clouds (3420 images), clear sky (5950 images), raindrops with the sun not blocked (1399 images), sunrise (1593 images), raindrops with the sun blocked (2409 images), high turbulence (1994 images), and overcast conditions (1267 images). The number of images assigned to each cluster indicates a balanced representation of diverse sky scenarios within the dataset.
Figure 11 presents the clustering results for the additional dataset.
For this new case study, the experiments were repeated only for the persistence model, the 1-Cluster model, and the proposed 8-Cluster model. As
Table 6 shows, all AE-like CNN models perform better than the persistence method, with the proposed 8-Cluster model yielding the best results. In particular, the 8-Cluster model achieves improvements of 76.3% and 20% in MSE and SSIM, respectively, compared to persistence.
5.6. Alternative Dataset Split
To examine the models' performance under a different dataset split, all experiments were repeated for the first case study using a split that allocates 50% of the data to the training set, 25% to the validation set, and 25% to the test set. This split was chosen as a more balanced option between training and validation and to evaluate the robustness of the proposed approach under limited training data availability. The results are depicted in
Table 7.
When compared with the 1-Cluster model, the benefit of clustering remains clear, while the 3-Cluster model even shows an improvement in the alternative split, suggesting better generalization with fewer training samples. On the other hand, the 6-Cluster model performs notably worse, as the division into six groups does not ensure stable training; in some clusters the model converged prematurely, leading to weak overall results. The proposed 8-Cluster model proves the most consistent, with only slight differences observed across the two splits. This indicates that the approach remains effective even when the amount of training data is reduced.
5.7. Random Perturbation Check
In order to assess the impact of clustering performance on the forecasting outcome, the clustering results were perturbed in a controlled way: two experiments were conducted for the first case study, in which random fractions of 5% and 10% of the images were reassigned to alternative random clusters. The AE-like CNNs were then re-trained using the perturbed clusters.
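A minimal sketch of the label perturbation procedure is given below (NumPy-based, with an assumed cluster count of eight and a hypothetical random seed; not the exact script used in our experiments).

```python
import numpy as np

def perturb_labels(labels: np.ndarray, fraction: float, n_clusters: int = 8,
                   seed: int = 0) -> np.ndarray:
    """Randomly reassign a given fraction of images to a different random cluster."""
    rng = np.random.default_rng(seed)
    perturbed = labels.copy()
    idx = rng.choice(labels.size, size=int(fraction * labels.size), replace=False)
    for i in idx:
        choices = [c for c in range(n_clusters) if c != labels[i]]
        perturbed[i] = rng.choice(choices)  # force a different cluster assignment
    return perturbed
```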
Figure 12 indicates that random perturbation of the sky image clusters significantly affects performance across all clusters, particularly for the clusters associated with more dynamic, cloudy conditions (clusters 1, 6, 7, and 8). In five out of eight clusters, larger perturbations lead to higher MSE values, as expected. In clusters 1, 3, and 4, the 10% perturbation leads to slightly better results compared to the 5% perturbation, potentially due to the random nature of the imposed perturbation. Even with just a 5% perturbation, forecasting performance deteriorates significantly, with an aggregated MSE increase of 37.81% and an aggregated SSIM decrease of 3.98%. A 10% perturbation further increases the aggregated MSE by 16.41% and decreases the aggregated SSIM by an additional 1.71%. Overall, these results demonstrate the sensitivity of the proposed sky image forecasting model to the input classification performance and suggest that even a slightly worse sky image classification method would significantly impact forecasting performance. This behavior is expected, as random cluster perturbations decrease cluster cohesion. The larger the perturbation, the more a cluster resembles the original sky image dataset in terms of variability, but now with significantly fewer training samples.
6. Conclusions
This paper proposes a multi-step forecasting framework for predicting sequences of ground-based sky images. The proposed approach combined physics-informed clustering of images and AE-like CNN models, which were trained separately for each cluster. In addition, a per-cluster sensitivity analysis was conducted, allowing the model to adapt better to the specific characteristics of each cluster and the rapidly changing sky conditions. Dividing the forecasting task into smaller and more homogeneous groups of data simplified the learning process, reduced training effort, and allowed the CNNs to capture the underlying patterns more effectively, thereby improving their generalization capability. The per-cluster sensitivity analysis further improved the results by up to 61.87% in terms of MSE. These results confirm that physics-informed preprocessing, together with targeted hyperparameter tuning, can make multi-step ahead sky image forecasting feasible even when the available data have relatively low temporal resolution, without the need for excessive computational resources. Apart from the numerical gains, the findings underline the practical importance of careful data preparation when forecasts need to cope with highly variable sky conditions.
Accurate sky image predictions provide insight into short-term cloud movement and deformation, which in turn enables various solar forecasting applications. The outputs generated by the proposed model can be used as inputs to deep-learning-based PV generation forecasting models that translate predicted sky image sequences into short-term PV energy yield estimates using computer vision techniques such as CNNs. Because the performance of these models strongly depends on the quality of the input data, improvements in sky image forecasting accuracy directly enhance the reliability of PV energy yield predictions. More accurate PV generation forecasting helps mitigate the operational challenges arising from the stochastic behavior of solar generation, particularly under conditions with high PV penetration in electric power systems. Improved short-term PV generation forecasts enable proactive grid management strategies, such as ramp rate mitigation, energy storage scheduling, and load balancing, that reduce voltage dips, frequency fluctuations, and other power quality issues. Consequently, accurate sky image prediction plays a key role in supporting the effective integration of PV systems into electric grids.
Despite the promising results, this study has several limitations that should be acknowledged and addressed in future work. First, both case studies are aligned with the constraints and practical requirements of local community-scale PV systems, where sky images are preprocessed to reduce computational complexity and enable low-cost scalable deployment. Future research could evaluate the proposed method in fundamentally different deployment contexts, i.e., for large-scale PV power plants where fewer hardware constraints allow for the use of high-resolution sky images and more powerful computing platforms and ultimately enable alternative modeling strategies and more detailed image analysis. Second, the proposed method incorporates physics-informed knowledge only at the preprocessing stage, thereby influencing the training of the AE-like CNNs only indirectly. As recent research suggests that embedding physical constraints directly into deep learning model architectures may improve both accuracy and computational efficiency, future work could explore alternative physics-informed CNN designs, which may also enable the use of higher-resolution, colored images. Additional future research directions include integrating the proposed model with a short-term PV generation forecasting framework to enable downstream evaluation of its impact on PV generation prediction performance, as well as conducting expanded comparative studies with other, more computationally intensive state-of-the-art architectures (e.g., Vision Transformers and Generative Adversarial Networks) to assess performance trade-offs relative to increased model complexity.