Multi-Step Sky Image Prediction Using Cluster-Specific Convolutional Neural Networks for Solar Forecasting Applications

Stylianos P. Schizas; Markos A. Kousounadis-Knousen; Francky Catthoor; Pavlos S. Georgilakis

doi:10.3390/en18215860

School of Electrical and Computer Engineering, National Technical University of Athens (NTUA), 15780 Athens, Greece

* Author to whom correspondence should be addressed.

Energies 2025, 18(21), 5860; https://doi.org/10.3390/en18215860

This article belongs to the Special Issue Challenges and Progresses of Electric Power Systems

Abstract

Effective integration of photovoltaic (PV) systems into electric power grids presents significant challenges due to the inherent variability in solar energy. Therefore, accurate PV power forecasting in various timescales is critical for the reliable operation of modern electric power systems. For short-term horizons, the primary source of solar power stochasticity is cloud movement and deformation, which are typically captured at high spatiotemporal resolutions using ground-based sky images. In this paper, we propose a novel multi-step sky image prediction framework for improved cloud tracking, which can be deployed for short-term PV power forecasting. The proposed method is based on deep learning, but instead of being purely data-driven, we propose a hybrid approach where we combine Auto-Encoder-like Convolutional Neural Networks (AE-like CNNs) with physics-informed sky image clustering to enhance robustness towards fast-varying sky conditions and effectively model non-linearities without adding to the computational overhead. The proposed method is compared against several state-of-the-art approaches using a real-world case study comprising minutely sky images. The experimental results show improvements of up to 17.97% on structural similarity and 62.14% on mean squared error, compared to persistence. These findings demonstrate that by combining effective physics-informed preprocessing with deep learning, multi-step ahead sky image forecasting can be reliably achieved even at low temporal resolutions.

Keywords:

ground-based sky images; multi-step forecasting; convolutional neural networks; image classification; photovoltaic generation

1. Introduction

In recent decades, Renewable Energy Sources (RESs) have globally penetrated the Electric Power System (EPS), as they offer a wide range of advantages in the evolving energy landscape. RESs are inexhaustible sources of environmentally friendly energy that contribute to reducing dependence on conventional generation units. Among them, solar energy stands out as the most abundant and globally accessible resource []. One of the primary applications of solar energy is photovoltaic (PV) power systems, which use PV cells to convert solar radiation into electric power. PV systems are considered highly trustworthy and offer considerable flexibility in installation, a decreasing cost over time [] and an increasing average efficiency (through the introduction of advanced materials such as perovskite and technologies like tandem PVs) [,]. However, PV power generation is characterized by high variability that is primarily attributed to cloud movement, which undermines its reliability.

In the context of increasing the integration of PV generation into the EPS while simultaneously maintaining system reliability, significant attention has been directed towards PV power forecasting. PV power forecasting is utilized in a wide range of applications across various spatial and temporal scales, and its accuracy can significantly impact the stability of the EPS. PV power forecasting also brings value to all stakeholders within the electricity market. For power system operators, it facilitates congestion management and the extraction of operational flexibility. Energy producers benefit through improved participation in electricity and balancing markets while also minimizing the risk of penalties. Finally, forecasting is advantageous for prosumers—individuals who both consume and produce electricity—by enabling more effective management of household energy loads [].

Depending on the forecasting horizon, PV power forecasts are typically categorized into day-ahead, intra-day, and intra-hour forecasts []. This paper focuses on intra-hour forecasting, as it is critically important for the safe and economically efficient operation of EPS. Intra-hour minute-scale forecasting plays a key role in various applications, such as ramp-rate control, optimal management of energy storage systems, and real-time demand response [].

A widely adopted approach for intra-hour PV power forecasting involves the use of ground-based sky images. Compared to numerical data alone, sky images provide significantly richer information regarding the presence and movement of clouds []. PV power forecasting based on sky images can be classified into two categories. The first group of methods directly translates sky images into PV power output using deep learning techniques [,]. The second group introduces an intermediate stage, in which cloud motion is modeled and future sky conditions are predicted before being translated into PV generation []. Compared with the first group of methods, motion-based approaches have the advantage of establishing a clearer physical link between cloud dynamics and PV variability while also improving the robustness of forecasts under rapidly changing sky conditions [].

Methods for modeling cloud motion in solar forecasting can be grouped into Cloud Motion Vector (CMV)-based methods [,,] and Artificial Intelligence (AI) approaches []. CMV-based methods, including Optical Flow (OF) and Block Matching (BM) algorithms, are computationally efficient and interpretable but struggle with rapidly evolving or overlapping clouds because they assume linear cloud motion. This linearity assumption limits the forecasting horizon and imposes the requirement for high temporal resolution in the input imagery. AI-based approaches, such as Convolutional Neural Networks (CNNs), capture non-linear spatiotemporal patterns and enhance robustness under variable conditions; however, they typically require deeper architectures to avoid premature convergence to local optima, demanding large datasets for training and substantial computational resources [], which makes them impractical for local smart microgrids. A summary of CMV-based and AI-based Cloud Motion Modeling (CMM) methods used in solar forecasting is provided in Section 2.

This paper proposes a novel multi-step sky image prediction model, which can be applied for minute-scale PV power forecasting. Unlike previous methods, the proposed hybrid approach combines physics-informed data preprocessing with deep learning to effectively capture non-linearities of cloud dynamics without requiring excessive computational resources. To this end, a dataset of sky images is classified into clusters using a recently proposed method based on unsupervised learning and hybrid image feature representation, and cluster-specific CNNs are trained to forecast sequences of sky images. The key contributions of this paper are summarized as follows:

The combination of Auto-Encoder (AE)-like CNNs with a physics-informed data preprocessing pipeline primarily focusing on input classification. No prior deep learning approach has utilized a physics-informed data preprocessing pipeline. The proposed model simplifies the original forecasting problem by decomposing it into simpler subproblems comprising more homogeneous data. This approach lowers the risk of premature convergence to suboptimal solutions and thus decreases training data requirements and enhances the generalization capability of the AE-like CNNs.
A sensitivity analysis is separately conducted for each cluster. The optimal kernel size and number of hidden layers are separately determined for the AE-like CNN associated with each cluster, rather than being universally fixed across all clusters. This per-cluster sensitivity analysis allows for optimal adaptation to the specific characteristics of each sky condition and further reduces the risk of premature convergence. In previous works, the hyperparameter selection process was either not clearly defined or limited to global sensitivity analysis, without considering cluster-specific variability.

The remainder of this paper is organized as follows: In Section 2, a brief overview of the related literature is provided. In Section 3, the methodology of the proposed sky image forecasting framework is presented along with the fundamental theoretical background. Details of the experimental setup and the proposed prediction process are provided in Section 4. Section 5 presents and discusses the experimental results. The main conclusions are summarized in Section 6.

2. Related Work

For several years, sky image forecasting typically relied on physics-informed CMM techniques to extract CMVs and extrapolate the future position of clouds. A comprehensive survey of CMV-based methods can be found in []. Commonly used CMM techniques for sky images include OF [], BM [], and Particle Image Velocimetry (PIV) []. These methods generally rely on linear motion assumptions and thus fail to capture the non-linear nature of cloud dynamics, such as cloud deformation and displacement []. Moreover, traditional CMV-based methods assume brightness consistency between consecutive images, making them prone to errors induced by reflections, noise, and the low resolution of cheap camera systems. These limitations constrain the forecasting horizon [] and necessitate sky images to be captured at high temporal resolutions, which is not always feasible in practice.

Various efforts have been made to overcome the challenges of traditional CMV-based methods. A novel 3D CMM approach leveraging a network of All-Sky Imagers (ASIs) was introduced in []. In [], several modifications to the sector ladder method were introduced to address periods of high intermittency and enable real-time irradiance forecasting. In [], a CMV-based technique incorporating image-phase-shift invariance and Fourier phase correlation theory was developed for improved cloud displacement estimation and short-term PV power forecasting. While these methods managed to improve cloud displacement forecasting accuracy, they remain constrained by the inherent linear assumptions of traditional CMV-based methods, particularly under highly variable sky conditions and coarser temporal resolutions.

In addition, multiple-camera systems have also been explored to enhance the spatial coverage and robustness of CMV estimation. For instance, ref. [] presented a doctoral study where a network of all-sky imagers was utilized to derive CMVs and improve nowcasting performance. This work demonstrated the advantages of multi-view setups in reducing motion ambiguity and improving accuracy under highly variable sky conditions.

In [], OF and BM approaches were linearly combined with a feature matching method into an ensemble model, with weights determined using Particle Swarm Optimization (PSO). The model was separately calibrated for each of the sky image classes generated using k-Means clustering on features extracted from Gray Level Co-occurrence Matrices (GLCM). The ensemble method consistently outperformed the standalone approaches, highlighting the effectiveness of combining complementary techniques. Furthermore, the classification of input images played a crucial role in improving accuracy, as it allowed the ensemble to be tailored to distinct sky conditions. However, the ensemble model still exhibited relatively high errors in some of the more challenging classes, likely due to the inherent linear nature of the ensemble combination and the coarse temporal resolution of the sky images. Moreover, finetuning the hyperparameters of each standalone model of the ensemble remains a challenging task—particularly when finetuning is performed separately for each cluster.

The rapid advancement of AI in recent years has driven widespread adoption of deep learning techniques in computer vision applications. Inspired by video prediction models, [] utilized AE-like CNNs for sequential sky image prediction based on previous image sequences. Unlike CMV-based methods, this approach demonstrated greater robustness to noise and coarser temporal resolutions. Other studies have bypassed sky image forecasting altogether, directly predicting PV generation from sky images through deep end-to-end CNN-based models, referred to here as Deep Neural Networks (DNNs). For example, ref. [] developed several end-to-end models, with those leveraging sequences of sky images as input outperforming others under more dynamic conditions. In [], ECLIPSE was proposed for the joint prediction of segmented sky images or satellite images alongside associated irradiance values.

DNN-based models have also been successfully applied to satellite imagery for solar irradiance forecasting. In [] a deep learning framework that replaces traditional CMV extraction with CNN-based motion modeling on satellite images is proposed. Although this work focuses on satellite data rather than ground-based sky images, it highlights a similar trend toward replacing physics-based motion vector estimation with data-driven approaches, reinforcing the motivation for CNN-based sky image forecasting frameworks.

Although DNN methods effectively capture local non-linear cloud dynamics, they exhibit limitations. Optimization via backpropagation-based gradient descent is prone to local optima entrapment, often resulting in premature convergence and sub-optimal model calibration, particularly in complex scenarios with highly non-convex objective spaces, such as those encountered in minute-scale PV generation forecasting []. In addition, the inherent locality of convolutional filters restricts their ability to capture the global structure of sky images, impairing cloud tracking performance under highly dynamic sky conditions []. Furthermore, purely data-driven DNN architectures depend heavily on historical datasets, limiting their generalization capability to sky conditions with low mutual information with the training data. To mitigate these limitations, deep generative AI models have recently attracted growing interest. In [], an end-to-end multi-modal model utilizing Vision Transformers (ViTs) was proposed for short-term irradiance forecasting. Acknowledging the limitations of end-to-end modeling for cloud tracking, ref. [] introduced a two-step approach for PV power forecasting, combining a U-net model with SkyGPT, a deep generative AI model for stochastic sky image sequence prediction. While deep generative AI models help address some of the shortcomings of DNN approaches, they still rely on backpropagation-based gradient descent optimization algorithms and demand large datasets and extensive training on resource-intensive platforms, significantly increasing computational requirements. To alleviate this, in this paper we propose a hybrid approach that deviates from the purely data-driven paradigm.

3. Methodology

3.1. Forecasting Framework

In the present work, the forecasting framework consists of multi-step sky image prediction. The model takes a sequence of n consecutive sky images and returns a sequence of the next m sky images (Figure 1). In Figure 1, t_{0} is the forecast issuance time, I_{t_{0}} is the sky image at t = t_{0}, S is the sky image sequence used as input, and S′ is the forecasted sky image sequence.

Figure 1. Overview of the sky image sequence prediction framework, where a sequence of past frames is used as input to predict a sequence of future frames.

3.2. Auto Encoder-like Convolutional Neural Networks

The sky image prediction model in this paper employs AE-like CNNs. CNNs are feed-forward artificial neural networks that contain convolutional layers. They are widely used in computer vision, due to their capability of extracting features and patterns from images, using linear algebra methods []. The employed CNN imitates the structure of an AE (Figure 2). It contains an encoder (takes the input images and compresses them into a latent vector), a bottleneck layer (the encoded information) and a decoder (decompresses the encoded information into a new set of images). This process can be mathematically modeled through the following relation:

S' = f_{DE}(f_{EN}(S))

(1)

where S is the input sky image sequence, S′ is the forecasted sky image sequence, and f_{DE} and f_{EN} are the decoding and encoding mapping functions, respectively. The encoder and the bottleneck layer use convolutional layers, while the decoder uses transposed convolutional layers [].

Figure 2. Schematic representation of a typical fully-connected Auto-Encoder architecture, consisting of an encoder, a bottleneck layer, and a decoder. Black circles represent the neurons of the artificial neural network.

In the proposed AE-like CNN, the output of each layer is calculated using the following procedure: First, the output of the previous layer is multiplied with the corresponding weights, and a bias matrix is added. Then, Batch Normalization (BN) and activation are applied, giving the output values of the layer []. The architecture of the proposed AE-like CNN is shown in Figure 3.

Figure 3. Architecture of the proposed AE-like CNN model, showing the encoder, bottleneck, and decoder stages with the main layer types.

The AE-like CNN models are trained to extract the input-output relationship. For this purpose, an early stopping mechanism is included, which halts training if no improvement is observed in the validation error for N consecutive epochs (where N is defined by a patience parameter) to prevent overfitting.
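The early stopping rule described above can be sketched in a few lines. This is a minimal illustration that only assumes a list of per-epoch validation errors is available; the function name `early_stopping_epoch` and the example values are hypothetical and do not come from the paper.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses.

    Training stops once the validation loss has not improved for
    `patience` consecutive epochs. Epochs are 0-indexed.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: reset the patience counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # The epoch budget ran out before the patience was exhausted.
    return len(val_losses) - 1, best_epoch
```

In practice, the weights saved at `best_epoch` would typically be the ones kept for evaluation, although the text does not state which checkpoint the authors retain.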

3.3. 2D and 3D Convolutions

The proposed sky image prediction model was implemented using two types of convolutions. The first one is the 2D convolution, in which a 2D filter is applied to each image of the sequence simultaneously, and the result is a 2D output frame. For K filters, the outcome of each convolution layer is K 2D output frames:

Input (H1 × W1 × L) → Output (H2 × W2 × K)

(2)

The 2D convolution is illustrated in Figure 4a.

Figure 4. Illustration of convolution types: (a) 2D Convolution (b) 3D Convolution. The asterisk symbol (*) denotes the convolution operation.

The second one is the 3D convolution, in which a 3D filter is used instead. The outcome is a 3D output volume for each filter (Figure 4b):

Input (H1 × W1 × L) → Output (H2 × W2 × L′ × K)

(3)

In (2) and (3), H1 × W1 are the dimensions of the input frames, H2 × W2 are the dimensions of the output frames, L is the length of the input sequence, L′ is the length of the output volume, and K is the number of filters used per layer []. With f denoting the size of the filters, L′ can be calculated as follows:

L' = L - f + 1

(4)
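The shape relations in (2) to (4) can be checked with a small helper. The sketch below is illustrative only: the function names are hypothetical, and the filter size f = 3 and filter count K = 64 used in the usage note are example values, not settings reported at this point in the text.

```python
def output_length_3d(seq_len, filter_size):
    """Eq. (4): length L' of the output volume, L' = L - f + 1."""
    if filter_size < 1 or filter_size > seq_len:
        raise ValueError("filter_size must satisfy 1 <= f <= L")
    return seq_len - filter_size + 1


def conv2d_output_shape(h2, w2, num_filters):
    """Eq. (2): K filters produce K 2D output frames of size H2 x W2."""
    return (h2, w2, num_filters)


def conv3d_output_shape(h2, w2, seq_len, filter_size, num_filters):
    """Eq. (3): each of the K filters produces an H2 x W2 x L' volume."""
    return (h2, w2, output_length_3d(seq_len, filter_size), num_filters)
```

For a sequence of L = 10 frames and an assumed filter size f = 3, `output_length_3d(10, 3)` gives L′ = 8, so a 3D layer with 64 filters on 32 × 32 output frames yields a 32 × 32 × 8 × 64 output, whereas the 2D layer yields 32 × 32 × 64.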

3.4. Data Preprocessing

Sky images are typically captured at high resolutions in color (RGB images). Consequently, each image has 3 channels of H1 × W1 pixels. The number of operations per convolution is proportional to the dimensions of the image; thus, retaining the original image dimensions is computationally impractical. Moreover, although handling raw data potentially enables the extraction of more information from the source data, it also increases the likelihood of getting stuck in local optima. This occurs because the backpropagation algorithm is not guided by any prior information about the signals or their correlations. Therefore, all images undergo a preprocessing pipeline that includes conversion to grayscale, resolution downscaling, and classification. This pipeline reflects the constraints and design goals of resource-efficient local sky imager platforms, which are intended for deployment in local, community-scale PV systems. In this context, although grayscaling and downscaling result in loss of information related to cloud tonalities and solar scattering, this preprocessing pipeline is a practically necessary step to reduce computational complexity and enable experiments on resource-limited hardware with scalable deployment across multiple sites.

3.4.1. Grayscaling

The first stage of sky image preprocessing involves conversion to grayscale, by reducing the number of channels per image from three (RGB) to one. This step decreases the computational load while preserving the essential information required for recognition of objects, such as the sun and cloud distribution—both critical for future sky image prediction. The grayscale conversion was performed according to the following equation:

Y_{i} = 0.299 R_{i} + 0.587 G_{i} + 0.114 B_{i}

(5)

where R_{i}, G_{i}, and B_{i} are the values of the i-th pixel in the red, green, and blue channels, respectively, of the original image, and Y_{i} is the value of the i-th pixel of the new image.
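As a minimal sketch of Equation (5), the conversion can be written for an image stored as nested lists of (R, G, B) tuples. The data layout and the function name `to_grayscale` are assumptions made for illustration; the paper does not describe its implementation.

```python
def to_grayscale(rgb_image):
    """Apply Eq. (5) pixel by pixel: Y = 0.299 R + 0.587 G + 0.114 B.

    `rgb_image` is a list of rows, each row a list of (R, G, B) tuples.
    Returns a list of rows of single-channel values.
    """
    return [
        [0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
        for row in rgb_image
    ]
```

Because the three weights sum to 1, a pure white pixel keeps its intensity, while a pure red pixel (255, 0, 0) maps to roughly 76.2.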

3.4.2. Downscaling

The downscaling preprocessing step is essential, as it significantly decreases the number of pixels per image, thereby reducing the number of convolutions and the total training time. Importantly, this resolution reduction does not compromise the ability to recognize the key visual elements (i.e., clouds and the sun). To downscale the sky images, the value of each new pixel is computed as the average of the original pixels within its corresponding region:

P_{j}^{'} = \frac{1}{N} \sum_{i = 1}^{N} P_{i}

(6)

where P_{j}^{'} is the value of the new (downscaled) pixel j, P_{i} is the value of each original pixel i that falls within the region of the new pixel j, and N is the number of original pixels that fall within that region. This downscaling approach corresponds to an average pooling operation. Compared to nearest neighbor interpolation, this method avoids aliasing artifacts, and compared to bilinear or bicubic interpolation, it is computationally simpler while still maintaining the key spatial patterns necessary for sky condition recognition [].
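The block averaging in Equation (6) can be sketched as follows for a grayscale image stored as nested lists. The sketch assumes square, non-overlapping regions whose side (`factor`, a hypothetical parameter name) divides both image dimensions; the paper does not state how non-divisible sizes are handled.

```python
def downscale_average(image, factor):
    """Average pooling: each output pixel is the mean of a factor x factor block."""
    height, width = len(image), len(image[0])
    if height % factor or width % factor:
        raise ValueError("image dimensions must be divisible by factor")
    downscaled = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            # Collect the N = factor * factor original pixels of this region.
            block = [
                image[y][x]
                for y in range(top, top + factor)
                for x in range(left, left + factor)
            ]
            row.append(sum(block) / len(block))
        downscaled.append(row)
    return downscaled
```

For example, a 4 × 4 image with values 1 to 16 in row-major order, downscaled by a factor of 2, gives the 2 × 2 image [[3.5, 5.5], [11.5, 13.5]].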

3.4.3. Classification of the Input Data

In this paper, the input data are passed through a classification process that divides the sky images into clusters of similar sky conditions. Classification helps reduce the impact of dataset imbalance and variability by creating more coherent subsets. For each cluster, a class-specific sky image forecasting model is trained separately. Hence, the original problem is decomposed into smaller and simpler subproblems, since each model is trained on more homogeneous data. This is especially crucial because, when the data exhibit less non-linearity and non-homogeneity, less hill climbing is required during optimization, and the chances of the gradient-descent backpropagation algorithm finding the “global optimum” increase significantly.

The employed sky image classification approach is a recently proposed automatic method based on unsupervised learning and downstream evaluation []. A schematic overview of the employed approach is provided in Figure 5. A total of 42 global spectral and textural handcrafted features is extracted from multiple color spaces of each sky image to capture tonal variations and color distributions. These features are based on the rich literature on cloud classification and are described in detail in []. Seven additional handcrafted features related to the total cloud coverage, cloud velocity, and solar elevation are incorporated to account for PV energy yield variations. These features are computed using only the current sky image available at the time of issuing the forecast. The total cloud coverage is represented by the cloud coverage percentage (estimated using a multi-colored threshold technique), the clear sky index (using solar irradiance measurements and a reference clear-sky day), and the luminance (based on Otsu’s threshold) []. Solar elevation is represented through the solar zenith and azimuth angles, while for cloud velocity, the two most recent sky images are used to estimate the average CMV via a dense OF method [].

Figure 5. Overview of the employed approach for the automatic classification of the sky images.

The total 49 handcrafted features are encoded into a reduced set of 15 latent features using a hybrid dimensionality reduction technique that combines Principal Component Analysis (PCA) and shallow, fully connected feed-forward AEs. The fully connected AE is symmetrical and contains a single hidden layer []. The resulting latent feature set is then clustered using k-Means clustering. The optimal number of clusters is identified through a novel forecast-driven strategy, which co-optimizes k with a minute-scale PV energy yield forecasting model—leveraging PV generation data associated with the sky images—and subsequent downstream evaluation. More details on the employed sky image classification approach can be found in [].

The employed sky image classification method is primarily selected for its ability to extract multi-class partitions without relying on pre-assigned ground-truth labels []. This eliminates the need for manual labeling, enhancing both the practicality and scalability of the classification process. Furthermore, determining the number of clusters with respect to the minimization of the average forecasting error of the downstream PV model results in more detailed partitions than those typically achieved using standard clustering metrics based on cohesion and separation. This is particularly relevant as sky image classification involves fine-grained distinctions where instances are not easily separable []. Unlike [], in which the sky image classification was used for minute-scale PV energy yield forecasting, here the classification is conducted for minute-scale sky image sequence prediction; therefore, sky images closer to dusk and dawn are included in the training dataset. In contrast to self-supervised learning methods based on DNNs, the employed approach extracts global features from the sky images that facilitate the identification of subtle distinctions and fine-grained classification. Unlike similar unsupervised learning methods, the employed approach incorporates non-instantaneous features into the original handcrafted feature set, e.g., CMV and solar elevation variations, creating a form of physics-informed machine learning that improves performance and reduces the computational requirements []. These variations are directly related to cloud movement and should thus be considered when classifying the input for sky image forecasting.

4. Experimental Setup

4.1. Data Presentation and Analysis

In this study, ground-based sky images were captured over a 14-day period (from 16 November 2023 to 29 November 2023), from sunrise to sunset, at a temporal resolution of one frame per minute, yielding 8168 images []. Data were collected near a PV system with an installed capacity of 1.2 kW, located in the region of Boeotia, Greece. Although the dataset pertains to a mild arid climate, it is quite diverse (32% sunny, 22% cloudy and 46% overcast). The images were captured using an EKO ASI-16 sky monitoring camera manufactured by CMS Ing. Dr. Schreder GmbH, Kirchbichl, Austria. The camera features a 180° field of view, 5 MP resolution, and a wide-angle fisheye lens, enclosed within a reflective and durable quartz dome equipped with an air circulation system to prevent fogging. All images were captured in color (Red, Green, and Blue; RGB) and stored in JPG format, with an average file size of 76.52 KB. To ensure the validity of the results, the image acquisition location was carefully selected to minimize noise from external obstructions (e.g., buildings, trees).

4.2. Classification Results

According to the classification method described in Section 3.4.3, the whole dataset was divided into eight clusters that correspond to eight different sky conditions:

(1) Overcast (936 sequences);
(2) Sunny (1963 sequences);
(3) Clear sky and the sun near sunset (561 sequences);
(4) Almost overcast (1442 sequences);
(5) Clear sky and the sun at sunrise (794 sequences);
(6) Sun low on the horizon and partial cloud cover (517 sequences);
(7) Partial cloud cover with thin clouds (885 sequences);
(8) Partial cloud cover with thick clouds (753 sequences).

Figure 6 presents representative examples of sky images contained in each cluster.

Figure 6. Representative sky image samples per cluster derived from the proposed clustering approach: (a) Overcast (Cluster 1); (b) Sunny (Cluster 2); (c) Clear sky, sun near sunset (Cluster 3); (d) Almost overcast (Cluster 4); (e) Clear sky, sun near sunrise (Cluster 5); (f) Sun low on the horizon, partial cloud cover (Cluster 6); (g) Partial cloud cover, thin clouds (Cluster 7); (h) Partial cloud cover, thick clouds (Cluster 8).

4.3. Proposed Prediction Process

Out of the eight clusters, some demonstrate considerable variability, while others exhibit only slight changes. For example, clusters (1), (4), (6), (7) and (8) are associated with relatively high cloud coverage and thus intense cloud motion on an intra-minute scale, in contrast to clusters (2), (3) and (5), where the sky is almost clear. Thus, while AE-like CNN models are required for the former clusters to capture their inherent variability, the persistence method can be used for the latter to decrease the overall computational load. Persistence is a method that assumes that the future values of a random variable are the same as the present ones (worst case scenario); it performs well for short-term forecasting horizons and low variability.

The prediction process begins with the classification of the current sky image into a suitable cluster, according to the classification method described in Section 3.4.3 []. If the cluster belongs to the first category (high variability), the associated trained AE-like CNN model is called; otherwise, the persistence method is used instead. The selected model then produces the future sky image sequence. The flow chart of the proposed sky image prediction process is illustrated in Figure 7.

Figure 7. Overview of the proposed sky image prediction process, including the decision mechanism that selects between persistence and AE-like CNN models according to cluster variability.
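The decision mechanism can be sketched as a simple dispatch on the cluster index. This is an illustrative outline, not the authors' code: `cluster_id` is assumed to come from the classifier of Section 3.4.3, `cnn_models` is a hypothetical mapping from cluster index to a trained model exposed as a callable, and frames can be any objects.

```python
# Clusters that the text associates with high variability (Section 4.3).
HIGH_VARIABILITY_CLUSTERS = {1, 4, 6, 7, 8}


def persistence_forecast(input_frames, horizon):
    """Persistence: every future frame equals the most recent observed frame."""
    latest = input_frames[-1]
    return [latest for _ in range(horizon)]


def predict_sequence(input_frames, cluster_id, cnn_models, horizon=10):
    """Select the cluster-specific CNN or persistence, then forecast.

    `cnn_models[cluster_id]` is assumed to be a callable that maps the
    input sequence to the forecasted sequence.
    """
    if cluster_id in HIGH_VARIABILITY_CLUSTERS:
        return cnn_models[cluster_id](input_frames)
    return persistence_forecast(input_frames, horizon)
```

With this layout, nearly clear-sky clusters such as (2), (3) and (5) never invoke a neural network, which is where the reduction in computational load comes from.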

4.4. Configuration Setup

4.4.1. Data Organization

Sky images inherently provide limited spatial coverage. Therefore, the forecasting horizon should remain relatively short (typically no more than 25–30 min []) to reduce the influence of out-of-frame clouds that appear after forecast issuance. To balance computational efficiency with adequate lead time for control actions, we set the forecasting horizon to 10 min in our experiments. The temporal resolution is set to 1 min, a relatively coarse setting under which baseline linear methods are known to underperform. We select this challenging condition to assess whether the proposed model remains effective at coarse temporal resolutions; if it does, it can be expected to perform at least as well under less demanding conditions. Additionally, we use a sequence of images as input to provide the temporal information necessary for the AE-like CNNs to capture cloud motion dynamics. Specifically, we select 10 input frames as a trade-off between forecasting accuracy and training time.

The sky image prediction model takes a sequence of ten input sky images and returns a sequence of ten forecasted sky images. The forecasting procedure can be mathematically formulated as follows:

\{I_{t_{0} - (n - 1) t_{in}}, \ldots, I_{t_{0} - t_{in}}, I_{t_{0}}\} \xrightarrow{\text{prediction model}} \{I_{t_{0} + t_{out}}, \ldots, I_{t_{0} + (h - 1) t_{out}}, I_{t_{0} + h t_{out}}\}

(7)

where I refers to a sky image, t_{in} and t_{out} are the temporal resolutions of the input and output sequences, respectively (t_{out} = t_{in} = 1 min), t_{0} is the forecast issue time, n is the input sequence length (n = 10), and h is the forecasting horizon (h = 10). For this purpose, the preprocessed data were organized in batches of 20 consecutive images, of which the first 10 represent the input and the last 10 represent the output. In total, 7851 sequences were obtained.
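The organization of the data into input/output pairs can be illustrated with a short windowing routine. This is a sketch under stated assumptions: the function name `make_sequences` and the `stride` parameter are hypothetical (the text does not state whether consecutive windows overlap), and the frames are assumed to be ordered in time with no gaps.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def make_sequences(
    frames: Sequence[T], n_in: int = 10, n_out: int = 10, stride: int = 1
) -> List[Tuple[List[T], List[T]]]:
    """Cut a time-ordered frame list into (input, target) pairs.

    Each window holds n_in + n_out consecutive frames: the first n_in are
    the model input and the remaining n_out are the target sequence.
    `stride` (an assumption) controls how far the window advances.
    """
    window = n_in + n_out
    samples = []
    for start in range(0, len(frames) - window + 1, stride):
        chunk = list(frames[start:start + window])
        samples.append((chunk[:n_in], chunk[n_in:]))
    return samples
```

With the default settings, each sample spans 20 frames, i.e., 20 min of imagery at the 1 min resolution used here.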

After the classification step, each cluster was split such that the first 70% of the samples were assigned to the training set, the following 15% to the validation set, and the final 15% to the test set. This chronological partitioning was applied to avoid overlapping image sequences and to ensure independence between the datasets.
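A chronological split of this kind can be sketched as follows. The function name `chronological_split` and the integer-percentage interface are illustrative assumptions; the samples of a cluster are assumed to be already sorted by time.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def chronological_split(
    samples: Sequence[T], train_pct: int = 70, val_pct: int = 15
) -> Tuple[List[T], List[T], List[T]]:
    """Split time-ordered samples into train/validation/test without shuffling."""
    if train_pct < 0 or val_pct < 0 or train_pct + val_pct > 100:
        raise ValueError("percentages must be non-negative and sum to at most 100")
    n = len(samples)
    n_train = n * train_pct // 100  # integer arithmetic avoids rounding surprises
    n_val = n * val_pct // 100
    train = list(samples[:n_train])
    val = list(samples[n_train:n_train + n_val])
    test = list(samples[n_train + n_val:])  # the remainder goes to the test set
    return train, val, test
```

Because no shuffling takes place, every training sample precedes every validation sample, which in turn precedes every test sample.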

4.4.2. Model Architectures, Implementation Details, and Environment

The proposed AE-like CNN models were implemented using 2D and 3D convolutions, with 3, 5, 7, 9, and 11 hidden layers and input image dimensions of 32 × 32 and 64 × 64.

The number of filters and output frame dimensions per layer were chosen according to []. The following equation gives the number of filters a_{m} applied in the m-th layer of an l-layer CNN:

a_{m} = 32 \left( \frac{l + 3}{2} - \left| m - \frac{l + 1}{2} \right| \right), \quad m = 1, 2, \dots, l

(8)
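As a worked illustration of Equation (8), the following sketch evaluates the filter count for every layer. The function name `filters_per_layer` is hypothetical, and the restriction to odd layer counts is an assumption based on the layer counts listed above (for even l the formula would produce non-integer multiples of 32).

```python
from typing import List


def filters_per_layer(num_layers: int, base: int = 32) -> List[int]:
    """Evaluate Equation (8) for m = 1..l (odd l only, by assumption)."""
    if num_layers < 1 or num_layers % 2 == 0:
        raise ValueError("this sketch assumes an odd, positive number of layers")
    peak = (num_layers + 3) // 2    # multiplier reached at the middle layer
    centre = (num_layers + 1) // 2  # index of the middle layer
    return [base * (peak - abs(m - centre)) for m in range(1, num_layers + 1)]


print(filters_per_layer(5))  # [64, 96, 128, 96, 64]
```

The filter count rises linearly towards the middle layer and falls symmetrically afterwards, starting and ending at 64 filters for any admissible l.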

Symmetrical zero-padding is applied to the borders so that the output retains the same spatial dimensions as the input. The maximum number of epochs and the patience value (mentioned in Section 3.2) were set to 100 and 10, respectively; both were selected empirically after a series of tests, aiming to maintain model accuracy while limiting training time and computational cost. The Adam optimizer was used for training, with the learning rate set to 0.001. Furthermore, Leaky ReLU was adopted as the activation function, with a negative slope parameter of α = 0.2, and the Mean Squared Error (MSE) was employed as the loss function. A detailed description of the parameter configuration can be found in Table 1.
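The roles of the epoch limit, the patience value, and the Leaky ReLU slope can be illustrated with a framework-free sketch. It assumes that patience refers to early stopping on the validation loss, which is the usual meaning but is defined in Section 3.2 rather than here; the function names are hypothetical and no actual network is trained.

```python
from typing import Iterable, Tuple


def leaky_relu(x: float, alpha: float = 0.2) -> float:
    """Leaky ReLU: identity for positive inputs, slope alpha otherwise."""
    return x if x > 0 else alpha * x


def run_early_stopping(
    val_losses: Iterable[float], max_epochs: int = 100, patience: int = 10
) -> Tuple[int, int, float]:
    """Replay a stream of validation losses under an early-stopping rule.

    Returns (epochs_run, best_epoch, best_loss). Training stops after
    `patience` consecutive epochs without improvement, or at `max_epochs`.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    epochs_run = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if epoch > max_epochs:
            break
        epochs_run = epoch
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return epochs_run, best_epoch, best_loss
```

For example, if the validation loss stops improving after epoch 3, this rule ends training at epoch 13 instead of running all 100 epochs.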

Table 1. Parameter configuration for the training of the proposed AE-like CNN model.

All models were implemented in Python 3.12.4, using the Spyder environment and the TensorFlow–Keras library. The computational system used was a laptop computer with an 11th Gen Intel® Core™ i5-1135G7 processor @ 2.40 GHz and 8.00 GB of RAM.

5. Results

5.1. Assessment Metrics

To evaluate the performance and estimate the average error of the sky image prediction models, a comparison between the predicted images and the target (ground-truth) images is required. This comparison is achieved using quantitative evaluation metrics. In the case of sky images, commonly employed metrics include the Mean Squared Error (MSE), the Structural Similarity Index Measure (SSIM), and the Peak Signal-to-Noise Ratio (PSNR) [,].

5.1.1. Mean Squared Error

MSE compares two images pixel by pixel based on their intensity values. MSE is computed as follows []:

\mathrm{MSE} = \frac{1}{MN} \sum_{i = 1}^{M} \sum_{j = 1}^{N} [x(i, j) - y(i, j)]^{2}

(9)

where x is the predicted image, y is the target image, (i, j) are the pixel coordinates, and M, N are the image dimensions. The smaller the MSE value, the more similar the two images are.
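Equation (9) translates directly into code. The sketch below uses plain nested lists so that it runs without third-party packages; the function name `mse` and the list-based image representation are illustrative choices, not part of the original implementation.

```python
from typing import List, Sequence


def mse(pred: Sequence[Sequence[float]], target: Sequence[Sequence[float]]) -> float:
    """Mean squared error between two equally sized 2D images (Equation (9))."""
    if len(pred) != len(target) or any(
        len(rp) != len(rt) for rp, rt in zip(pred, target)
    ):
        raise ValueError("images must have identical dimensions")
    n_pixels = sum(len(row) for row in pred)
    total = sum(
        (p - t) ** 2 for rp, rt in zip(pred, target) for p, t in zip(rp, rt)
    )
    return total / n_pixels


# Squared differences are 1, 1, 1, 9, so the mean is 12 / 4 = 3.0
print(mse([[0, 0], [0, 0]], [[1, 1], [1, 3]]))  # 3.0
```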

5.1.2. Structural Similarity Index Measure

SSIM quantifies the degree of similarity between two images. It depends on the following three factors []:

Luminance: A measure of the brightness difference between the two images;
Contrast: A comparison of contrast (i.e., the difference between bright and dark regions within an image) between the two images;
Structure: An estimation of the spatial arrangement of luminance patterns within the images.

The mathematical formulation of SSIM between two images is presented through Equations (10)–(14):

\mathrm{SSIM}(x, y) = [l(x, y)]^{α} [c(x, y)]^{β} [s(x, y)]^{γ}

(10)

l(x, y) = \frac{2 μ_{x} μ_{y} + C_{1}}{μ_{x}^{2} + μ_{y}^{2} + C_{1}}

(11)

c(x, y) = \frac{2 σ_{x} σ_{y} + C_{2}}{σ_{x}^{2} + σ_{y}^{2} + C_{2}}

(12)

s(x, y) = \frac{σ_{xy} + C_{3}}{σ_{x} σ_{y} + C_{3}}

(13)

C_{1} = (K_{1} L)^{2}, \quad C_{2} = (K_{2} L)^{2}, \quad C_{3} = C_{2} / 2

(14)

where x is the predicted image, y is the target image, l(x, y) is the luminance term, c(x, y) is the contrast term, s(x, y) is the structural term, μ_{x} and μ_{y} are the mean pixel values of images x and y, respectively, σ_{x} and σ_{y} are the standard deviations of the pixel values of images x and y, respectively, σ_{xy} is the covariance of the pixel values of images x and y, and C_{1}, C_{2}, C_{3} are constants added to avoid division by values close to zero in the denominators of the terms. In this paper, we select the values suggested in [], i.e., α = β = γ = 1, K_{1} = 0.001, K_{2} = 0.003, and L = 255. Unlike MSE, the higher the SSIM value, the more similar the two images are.
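Equations (10)–(14) can be evaluated as in the sketch below, with α = β = γ = 1. Two points are assumptions, because the text does not specify them: the statistics are computed once over the whole image (many SSIM implementations instead average over local windows), and population (divide-by-N) variances are used. The function name `global_ssim` is hypothetical; the default constants are those stated above.

```python
from math import sqrt
from typing import Sequence


def global_ssim(
    x: Sequence[Sequence[float]],
    y: Sequence[Sequence[float]],
    k1: float = 0.001,
    k2: float = 0.003,
    dynamic_range: float = 255.0,
) -> float:
    """SSIM from whole-image statistics (Equations (10)-(14), exponents = 1)."""
    px = [v for row in x for v in row]
    py = [v for row in y for v in row]
    if len(px) != len(py) or not px:
        raise ValueError("images must be non-empty and of equal size")
    n = len(px)
    mu_x, mu_y = sum(px) / n, sum(py) / n
    var_x = sum((v - mu_x) ** 2 for v in px) / n  # population variance (assumed)
    var_y = sum((v - mu_y) ** 2 for v in py) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(px, py)) / n
    sd_x, sd_y = sqrt(var_x), sqrt(var_y)
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    c3 = c2 / 2
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2 * sd_x * sd_y + c2) / (var_x + var_y + c2)
    structure = (cov + c3) / (sd_x * sd_y + c3)
    return luminance * contrast * structure
```

Identical images yield a value of 1 up to floating-point rounding, while an image compared with its photometric negative yields a value close to −1, because the covariance term becomes negative.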

5.1.3. Peak Signal-to-Noise Ratio

The PSNR is a traditional image quality metric that estimates fidelity by comparing the maximum possible signal strength to the distortion introduced by reconstruction or compression. It is defined as follows:

\mathrm{PSNR}(x, y) = 10 \log_{10} \left( \frac{255^{2}}{\frac{1}{MN} \sum_{i = 1}^{M} \sum_{j = 1}^{N} [x(i, j) - y(i, j)]^{2}} \right)

(15)

where x is the predicted image, y is the target image, (i, j) are the pixel coordinates, and M, N are the image dimensions. Like SSIM, the higher the PSNR value, the more similar the two images are.
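Since the denominator in Equation (15) is the MSE of Equation (9), the PSNR can be computed from an MSE value alone. The sketch below assumes 8-bit images with a peak value of 255, as in Equation (15); the function name `psnr_from_mse` and the `peak` parameter are illustrative additions.

```python
import math


def psnr_from_mse(mse_value: float, peak: float = 255.0) -> float:
    """PSNR in dB for a given MSE (Equation (15)); infinite for identical images."""
    if mse_value < 0:
        raise ValueError("MSE cannot be negative")
    if mse_value == 0:
        return math.inf  # no distortion at all
    return 10.0 * math.log10(peak ** 2 / mse_value)
```

Because of the logarithm, reducing the MSE by a factor of 10 raises the PSNR by 10 dB.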

5.2. Benchmark Forecasting Models

The following benchmark methods are developed for comparison with the proposed 8-Cluster AE-like CNN sky image prediction model:

Persistence;
CMV-based method;
1-Cluster AE-like CNN;
3-Cluster AE-like CNNs;
6-Cluster AE-like CNNs.

5.2.1. Persistence Method

The persistence method assumes that the variable of interest does not change over the forecasting horizon. If A_{t_{0}} is the sky image at t = t_{0}, then the prediction \hat{A}_{t_{0} + k} for t = t_{0} + k is:

\hat{A}_{t_{0} + k} = A_{t_{0}}

(16)

where k is any time step in the forecasting horizon.

5.2.2. CMV-Based Method

The CMV-based benchmark is based on the Gunnar Farneback OF method []. This method compares the pixel intensities between two consecutive sky images to extract a dense CMV field, which is then used to linearly extrapolate future cloud movement. The Gunnar Farneback OF method has been widely used for CMM from sky images, both as a standalone method and in combination with other approaches [].
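The linear extrapolation step can be illustrated with a deliberately simplified sketch. It does not implement the Farneback method: instead of a dense per-pixel CMV field, it assumes a single integer motion vector (`dx`, `dy`) for the whole image, which in practice would have to be estimated from two consecutive frames by an optical flow routine. All names are hypothetical.

```python
from typing import List, Sequence


def extrapolate_uniform_motion(
    frame: Sequence[Sequence[float]], dx: int, dy: int, steps: int, fill: float = 0.0
) -> List[List[List[float]]]:
    """Shift `frame` by k * (dx, dy) pixels for k = 1..steps.

    Pixels that would come from outside the frame are set to `fill`,
    since nothing is known about clouds beyond the field of view.
    """
    rows, cols = len(frame), len(frame[0])
    forecasts = []
    for k in range(1, steps + 1):
        shifted = [[fill] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                src_r, src_c = r - k * dy, c - k * dx  # where this pixel came from
                if 0 <= src_r < rows and 0 <= src_c < cols:
                    shifted[r][c] = frame[src_r][src_c]
        forecasts.append(shifted)
    return forecasts
```

The sketch also shows why linear extrapolation degrades with lead time: content moved out of the frame is lost, and the vacated region must be filled with a guess.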

5.2.3. AE-like CNN

Apart from the proposed 8-Cluster AE-like CNN model, similar AE-like CNN models with 3 and 6 clusters were developed according to the procedure that was thoroughly described in Section 3 and Section 4, to assess the impact of the number of clusters on the forecasting performance. In addition, a 1-Cluster AE-like CNN, trained on the entire dataset, was included to assess the model’s generalization capability when using all available data.

5.3. Sensitivity Analysis

As mentioned in Section 4.3, the proposed 8-Cluster sky image prediction model uses AE-like CNNs for the clusters that are associated with sky conditions of intense variability and the persistence method for the rest of the clusters. To find the combination of hyperparameters for which the AE-like CNNs perform best, a sensitivity analysis was conducted. The hyperparameters to be tuned were the convolution type (CT), the kernel size (KS), the number of hidden layers (NHL), and the input image dimensions (IID). The complete set of experiments is given by the following Cartesian product:

\mathrm{CT} \times \mathrm{KS} \times \mathrm{NHL} \times \mathrm{IID} = \{2\mathrm{D}, 3\mathrm{D}\} \times \{3, 5, 7\} \times \{5, 7, 9, 11\} \times \{32, 64\}

(17)

Preliminary tests showed that in cases where CT was 3D, the AE-like CNNs exhibited worse performance and a significantly longer training time. The experiments with 64 × 64 input resolution revealed an average increase of approximately 200% in training time and about 30% in inference time, while the gain in prediction accuracy was limited to an average of 3.57% in terms of MSE. Given that the proposed framework is designed to be deployable on low-cost hardware platforms (e.g., standard personal computers), computational efficiency was prioritized over marginal accuracy improvements. For this reason, and to reduce the overall evaluations and computational overhead, CT = 2D and IID = 32 were selected, and the sensitivity analysis continued for KS and NHL. The simplified Cartesian product is now as follows:

\mathrm{KS} \times \mathrm{NHL} = \{3, 5, 7\} \times \{5, 7, 9, 11\}

(18)
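The size of the two search spaces in Equations (17) and (18) can be checked by enumerating them. The sketch below only lists the configurations; the variable names are illustrative and no training is performed.

```python
from itertools import product

CONV_TYPES = ("2D", "3D")
KERNEL_SIZES = (3, 5, 7)
HIDDEN_LAYERS = (5, 7, 9, 11)
INPUT_DIMS = (32, 64)

# Equation (17): full grid over all four hyperparameters
full_grid = list(product(CONV_TYPES, KERNEL_SIZES, HIDDEN_LAYERS, INPUT_DIMS))

# Equation (18): reduced grid after fixing CT = 2D and IID = 32
reduced_grid = list(product(KERNEL_SIZES, HIDDEN_LAYERS))

print(len(full_grid), len(reduced_grid))  # 48 12
```

Fixing CT and IID therefore cuts the number of configurations per cluster from 48 to 12.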

This simplification makes the implementation of a per-cluster sensitivity analysis computationally feasible. Thus, each cluster sets its own cluster-specific hyperparameter values that correspond to the particular sky condition. The per-cluster sensitivity analysis results for each of the eight clusters of the proposed model are shown in Table 2. The optimum combination of KS and NHL for each cluster is in bold. As can be seen, the model’s performance is highly affected by changes in hyperparameter values. In many cases, premature convergence can be noticed, causing a significant deterioration in the assessment metric values. The sensitivity analysis is also visualized using boxplots for the MSE and SSIM metrics in Figure 8 (for better visualization clarity, outliers have been excluded so that the distribution of the remaining boxplots can be properly observed).

Table 2. Sensitivity analysis of kernel size (KS) and number of hidden layers (NHL) for the proposed 8-Cluster AE-like CNN model.

Figure 8. Visualization of sensitivity analysis of kernel size (KS) and number of hidden layers (NHL) for the proposed 8-Cluster AE-like CNN model for the assessment metrics: (a) MSE; (b) SSIM.

From the sensitivity analysis results it can be concluded that for KS = 3 the kernels are too small, resulting in overly local feature extraction, while AE-like CNNs require at least 9 layers to adequately model the input–output relationship. In general, KS = 5 and NHL = 9 or 11 yield the best results for most clusters.

5.4. Final Forecasting Results

Table 3 shows the final values of the assessment metrics for each sky image forecasting model that was implemented. The results for the 3-Cluster AE-like CNN, the 6-Cluster AE-like CNN, and the proposed 8-Cluster AE-like CNN are aggregated across all clusters, weighted by sample count. Table 3 shows that all AE-like CNN models outperform the persistence method, with the proposed 8-Cluster model yielding the best results. The OF model achieves an improvement of 13.68% on SSIM but a deterioration of 21.32% on MSE, compared to persistence. This MSE deterioration is likely due to the limitations of OF at coarser temporal resolutions. In the case of the model without classification (the 1-Cluster AE-like CNN), although the evaluation metrics indicate satisfactory similarity between the images, visual inspection highlights the need for improved results. The output image sequences suggest that the model responds adequately under clear-sky conditions, as it successfully identifies the sun’s position and captures image brightness reasonably well. However, under cloudy conditions, although the sun’s location is detected, the model fails to accurately predict cloud distribution. These observations indicate that, for image-based forecasting tasks, relying solely on quantitative metrics is inadequate, as such measures treat images merely as numerical arrays and may overlook perceptual differences. Thus, complementing quantitative assessment with qualitative (visual) evaluation is essential to obtain a more complete understanding of model performance.

Table 3. Performance assessment metrics (PSNR, MSE, and SSIM) for all models, where aggregate values are computed weighted by sample count.
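The sample-count weighting used for the aggregate values can be sketched as follows. The function name `weighted_average` is hypothetical, and the numbers in the usage line are made up purely to show the arithmetic; they are not results from this study.

```python
from typing import Sequence


def weighted_average(values: Sequence[float], counts: Sequence[int]) -> float:
    """Aggregate per-cluster metric values, weighting each by its sample count."""
    if len(values) != len(counts) or not values:
        raise ValueError("values and counts must be non-empty and of equal length")
    total = sum(counts)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    return sum(v * c for v, c in zip(values, counts)) / total


# Hypothetical example: two clusters with 1 and 3 test samples
print(weighted_average([1.0, 3.0], [1, 3]))  # 2.5
```

Weighting by sample count makes the aggregate equal to the metric averaged over all test samples, so large clusters influence it more than small ones.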

The performance of the AE-like CNN models improves significantly as the number of clusters increases, reaching improvements of 73.1% and 24.3% on MSE and SSIM, respectively, compared to persistence. This observation suggests that the AE-like CNN model performs better, i.e., becomes more capable of recognizing patterns, when it is trained on more homogeneous image subsets. Table 3 also shows that the improvements from the 6-Cluster model to the 8-Cluster model are marginal (0.86% and 7.02% on SSIM and MSE, respectively), indicating that further classification is unnecessary. This saturation was expected to occur at some point: beyond a certain level, the subproblems become sufficiently simple and the data highly homogeneous, allowing the AE-like CNN to handle them effectively without the risk of getting trapped in a local optimum. The 8-Cluster configuration was therefore selected as the proposed model, as it appears to be the “knee point” beyond which the additional accuracy gain does not justify the additional effort.

Training time declined as the number of clusters increased. Specifically, on the relatively low-capability computer system used for our simulations (described in Section 4.4.2), the model without classification required approximately 26.75 h for training, whereas the proposed 8-Cluster model required approximately 16.6 h of total training time, a significant reduction of 37.94%. This reduction may be attributed to several factors, such as the shorter per-epoch training time on the smaller data subsets of each cluster and the fewer epochs required for convergence, since more homogeneous clusters create simpler subproblems. The maximum processing time measured for a single prediction was 324 ms, and the memory requirements varied from 500 MB to 550 MB.
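The reported training-time reduction can be reproduced from the two durations with a relative-change calculation. The function name `percent_reduction` is illustrative; that the percentage improvements quoted for the error metrics follow the same convention is an assumption, not something stated in the text.

```python
def percent_reduction(baseline: float, new: float) -> float:
    """Relative reduction of `new` with respect to `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (baseline - new) / baseline


# Training time: 26.75 h without classification vs. 16.6 h with 8 clusters
print(round(percent_reduction(26.75, 16.6), 2))  # 37.94
```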

Table 4 presents the results per cluster for the proposed model. Overall, each cluster achieved strong performance, with notably low MSE values and SSIM values exceeding 90%, indicating strong correlation between real and predicted sequences. Clusters 1 and 6 achieved the best MSE (0.0155%) and SSIM (98.5%) values, respectively. Persistence demonstrated sufficient accuracy for the chosen clusters that are associated with sky conditions of mild variability, combining effectiveness and low computational cost.

Table 4. Per-cluster performance metrics (PSNR, MSE, and SSIM) for the proposed 8-Cluster AE-like CNN model.

In Figure 9 and Figure 10, examples of generated sky image forecasts are depicted for clusters 6 and 8, respectively. For each figure, the first row illustrates the real sequence, the second row the forecasted one and the third is a heat map that visualizes the MSE for each pixel between the two images. In these two examined cases, different color bar scales are applied to the heat maps to ensure that the variations in error are clearly distinguishable. Comparing the forecasted sky images to the real ones, it is obvious that the cloud distribution and coverage ratio have been accurately modeled, and the model has also correctly predicted whether the sun is blocked or not. Even in the case of Cluster 8, where more pronounced variations are present, the model was able to predict them with high accuracy. By observing the heat maps, it becomes clear that the model has achieved its objective, since the heat maps are mostly white—indicating low MSE values—with a few isolated red patches that reveal localized error spikes.

Figure 9. Example of real and forecasted sky image sequences for cluster 6, including pixel-wise error visualization through heat maps based on MSE values.

Figure 10. Example of real and forecasted sky image sequences for cluster 8, including pixel-wise error visualization through heat maps based on MSE values.

The per-cluster sensitivity analysis is a major factor in the performance of the proposed sky image prediction model. If a predetermined set of hyperparameters, derived from a sensitivity analysis on the entire dataset, is applied instead, the results of the proposed 8-Cluster model would be considerably worse. To demonstrate this, the AE-like CNNs of all clusters are trained using KS = 3 and NHL = 5, as suggested in [], and the results are compared to those obtained from the cluster-specific sensitivity analysis.

From the results in Table 5 it seems that with the use of the per-cluster sensitivity analysis, MSE and SSIM are improved by 61.87% and 12.97%, respectively. That happens because each cluster addresses a different subproblem of cloud motion and thus its corresponding prediction model requires a specific configuration to adapt to the characteristics of the sky image cluster.

Table 5. Comparison of the proposed 8-Cluster AE-like CNN model with and without per-cluster sensitivity analysis, based on MSE and SSIM metrics.

5.5. Evaluation on a Second Dataset

In order to further validate the robustness and generalizability of the proposed model, it was also applied to a second case study. The open-access dataset that was employed consists of high-resolution sky images (2048 × 2048) captured at Stanford University by a ground-based camera with a fish-eye lens at a frequency of 1 frame per minute []. A total of 22,519 daytime images were sampled from the months of January, February, May, June, and July. While this additional case study represents a further step toward demonstrating generalizability, we acknowledge that important limitations remain regarding the evaluation of the proposed model across a broader range of contexts; thus, further work is still required, as discussed in Section 6.

The clustering process resulted in 8 distinct clusters, each representing a characteristic sky condition. Specifically, the clusters correspond to: clear sky with limited broken clouds (4487 images), thin cirrus clouds (3420 images), clear sky (5950 images), raindrops with the sun not blocked (1399 images), sunrise (1593 images), raindrops with the sun blocked (2409 images), high turbulence (1994 images), and overcast conditions (1267 images). The number of images assigned to each cluster indicates a balanced representation of diverse sky scenarios within the dataset. Figure 11 presents the clustering results for the additional dataset.

Figure 11. Representative sky image samples per cluster derived from the proposed clustering approach for the second case study: (a) Clear sky with limited broken clouds (Cluster 1); (b) Thin cirrus clouds (Cluster 2); (c) Clear sky (Cluster 3); (d) Raindrops—sun not blocked (Cluster 4); (e) Sunrise (Cluster 5); (f) Raindrops—sun blocked (Cluster 6); (g) High turbulence (Cluster 7); (h) Overcast (Cluster 8).

For this new case study, the experiments were repeated only for the persistence model, the 1-Cluster model, and the proposed 8-Cluster model. As Table 6 shows, all AE-like CNN models perform better than the persistence method, with the proposed 8-Cluster model yielding the best results. In particular, the 8-Cluster model achieves improvements of 76.3% and 20% on MSE and SSIM, respectively, compared to persistence.

Table 6. Performance assessment metrics (PSNR, MSE, and SSIM) for all models, where aggregate values are computed weighted by sample count for the second dataset.

5.6. Alternative Dataset Split

To examine the models’ capability under a different dataset split, all experiments were repeated for the first case study using a split of 50% for the training set, 25% for the validation set, and 25% for the test set. This split was chosen as a more balanced option between training and validation, and to evaluate the robustness of the proposed approach under limited training data availability. The results are presented in Table 7.

Table 7. Evaluation results (PSNR, MSE, and SSIM) of the AE-like CNN models for different dataset splits (70-15-15 and 50-25-25).

When compared with the 1-cluster model, the benefit of clustering remains clear, while the 3-Cluster model even shows an improvement in the alternative split, suggesting better generalization with fewer training samples. On the other hand, the 6-Cluster model performs notably worse, as the division into six groups does not secure stable training; in some clusters the model converged prematurely, leading to weak overall results. The proposed 8-Cluster model proves the most consistent, with only slight differences observed across the two splits. This indicates that the approach remains effective even when the amount of training data is reduced.

5.7. Random Perturbation Check

In order to assess the impact of clustering performance on the forecasting outcome, the clustering results were perturbed in a controlled way: Two experiments were conducted for the first case study, where a random fraction of 5% and 10% of the images were reassigned to alternative random clusters. The AE-like CNNs were then re-trained using the perturbed clusters.
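The controlled perturbation can be sketched as a random relabeling routine. This is an illustration under assumptions: images and replacement clusters are drawn uniformly at random, clusters are numbered 1 to 8, and a fixed seed is used for reproducibility; the function name `perturb_labels` is hypothetical.

```python
import random
from typing import List, Sequence


def perturb_labels(
    labels: Sequence[int], fraction: float, num_clusters: int = 8, seed: int = 0
) -> List[int]:
    """Reassign a random fraction of images to a different, random cluster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    rng = random.Random(seed)
    n_perturb = round(len(labels) * fraction)
    chosen = rng.sample(range(len(labels)), n_perturb)  # images to relabel
    perturbed = list(labels)
    for i in chosen:
        # Draw from every cluster except the image's current one
        alternatives = [c for c in range(1, num_clusters + 1) if c != labels[i]]
        perturbed[i] = rng.choice(alternatives)
    return perturbed
```

Excluding the current cluster from the candidates guarantees that exactly the requested fraction of labels changes.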

Figure 12 indicates that random perturbation of sky image clusters significantly affects performance across all clusters, particularly for clusters associated with more dynamic, cloudy conditions (clusters 1, 6, 7, and 8). In five out of eight clusters, larger perturbations lead to higher MSE values, as expected. In clusters 1, 3, and 4, the 10% perturbation leads to slightly better results than the 5% perturbation, potentially due to the random nature of the imposed perturbation. Even with just a 5% perturbation, forecasting performance deteriorates significantly, with an aggregated MSE increase of 37.81% and an aggregated SSIM decrease of 3.98%. A 10% perturbation further increases the aggregated MSE by 16.41% and decreases the aggregated SSIM by an additional 1.71%. Overall, these results demonstrate the sensitivity of the proposed sky image forecasting model to the input classification performance and suggest that even a slightly worse sky image classification method would significantly impact forecasting performance. This behavior is expected, as random cluster perturbations decrease cluster cohesion. The larger the perturbation, the more a cluster resembles the original sky image dataset in terms of variability, but with significantly fewer training samples.

Figure 12. Effect of input perturbation on cluster-wise MSE performance.

6. Conclusions

This paper proposes a multi-step forecasting framework for predicting sequences of ground-based sky images. The proposed approach combines physics-informed clustering of images with AE-like CNN models that are trained separately for each cluster. In addition, a per-cluster sensitivity analysis was conducted, allowing the model to adapt better to the specific characteristics of each cluster and to rapidly changing sky conditions. Dividing the forecasting task into smaller and more homogeneous groups of data simplified the learning process, reduced training effort, and allowed the CNNs to capture the underlying patterns more effectively, thereby improving their generalization capability. The per-cluster sensitivity analysis further improved the results by up to 61.87% in terms of MSE. These results confirm that physics-informed preprocessing, together with targeted hyperparameter tuning, can make multi-step ahead sky image forecasting feasible even when the available data have relatively low temporal resolution, without the need for excessive computational resources. Apart from the numerical gains, the findings underline the practical importance of careful data preparation when forecasts need to cope with highly variable sky conditions.

Accurate sky image predictions provide insight into short-term cloud movement and deformation, which in turn enables various solar forecasting applications. The outputs generated by the proposed model can be used as inputs to deep-learning-based PV generation forecasting models that translate predicted sky image sequences into short-term PV energy yield estimates using computer vision techniques such as CNNs. Because the performance of these models strongly depends on the quality of the input data, improvements in sky image forecasting accuracy directly enhance the reliability of PV energy yield predictions. More accurate PV generation forecasting helps mitigate the operational challenges arising from the stochastic behavior of solar generation, particularly under conditions with high PV penetration in electric power systems. Improved short-term PV generation forecasts enable proactive grid management strategies, such as ramp rate mitigation, energy storage scheduling, and load balancing, that reduce voltage dips, frequency fluctuations, and other power quality issues. Consequently, accurate sky image prediction plays a key role in supporting the effective integration of PV systems into electric grids.

Despite the promising results, this study has several limitations that should be acknowledged and addressed in future work. First, both case studies are aligned with the constraints and practical requirements of local community-scale PV systems, where sky images are preprocessed to reduce computational complexity and enable low-cost scalable deployment. Future research could evaluate the proposed method in fundamentally different deployment contexts, i.e., for large-scale PV power plants where fewer hardware constraints allow for the use of high-resolution sky images and more powerful computing platforms and ultimately enable alternative modeling strategies and more detailed image analysis. Second, the proposed method incorporates physics-informed knowledge only at the preprocessing stage, thereby influencing the training of the AE-like CNNs only indirectly. As recent research suggests that embedding physical constraints directly into deep learning model architectures may improve both accuracy and computational efficiency, future work could explore alternative physics-informed CNN designs, which may also enable the use of higher-resolution, colored images. Additional future research directions include integrating the proposed model with a short-term PV generation forecasting framework to enable downstream evaluation of its impact on PV generation prediction performance, as well as conducting expanded comparative studies with other, more computationally intensive state-of-the-art architectures (e.g., Vision Transformers and Generative Adversarial Networks) to assess performance trade-offs relative to increased model complexity.

Author Contributions

Conceptualization, M.A.K.-K.; Methodology, S.P.S., M.A.K.-K., F.C. and P.S.G.; Software, S.P.S. and M.A.K.-K.; Validation, S.P.S. and M.A.K.-K.; Formal analysis, S.P.S. and M.A.K.-K.; Investigation, S.P.S., M.A.K.-K., F.C. and P.S.G.; Data curation, M.A.K.-K.; Writing—original draft, S.P.S. and M.A.K.-K.; Writing—review and editing, S.P.S., M.A.K.-K., F.C. and P.S.G.; Supervision, F.C. and P.S.G.; Project administration, F.C. and P.S.G.; Funding acquisition, F.C. and P.S.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets presented in this article are not readily available because they belong to the company that collected the data and provided them to us for research purposes. Requests to access the datasets should be directed to the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Chow, S.K.H.; Lee, E.W.M.; Li, D.H.W. Short-term prediction of photovoltaic energy generation by intelligent approach. Energy Build. 2012, 55, 660–667. [Google Scholar] [CrossRef]
International Energy Agency (IEA). Renewable Energy Market Update—June 2023: Will Solar PV and Wind Costs Finally Begin to Fall Again in 2023 and 2024? IEA: Paris, France, 2023. Available online: https://www.iea.org/reports/renewable-energy-market-update-june-2023 (accessed on 20 October 2025).
Snaith, H.J. Present status and future prospects of perovskite photovoltaics. Nat. Mater. 2018, 17, 372–376. [Google Scholar] [CrossRef]
Martinho, F. Challenges for the future of tandem photovoltaics on the path to terawatt levels: A technology review. Energy Environ. Sci. 2021, 14, 3840–3871. [Google Scholar] [CrossRef]
Sengupta, M.; Habte, A.; Wilbert, S.; Gueymard, C.; Remund, J.; Lorenz, E.; van Sark, W.; Jensen, A.R. Best Practices Handbook for the Collection and Use of Solar Resource Data for Solar Energy Applications, 4th ed.; IEA PVPS: Paris, France, 2024. [Google Scholar]
Diagne, M.; David, M.; Lauret, P.; Boland, J.; Schmutz, N. Review of solar irradiance forecasting methods and a proposition for small-scale insular grids. Renew. Sustain. Energy Rev. 2013, 27, 65–76. [Google Scholar] [CrossRef]
Barbieri, F.; Rajakaruna, S.; Ghosh, A. Very short-term photovoltaic power forecasting with cloud modeling: A review. Renew. Sustain. Energy Rev. 2017, 75, 242–263. [Google Scholar] [CrossRef]
Kong, W.; Jia, Y.; Dong, Z.Y.; Meng, K.; Chai, S. Hybrid approaches based on deep whole-sky-image learning to photovoltaic generation forecasting. Appl. Energy 2020, 280, 115875. [Google Scholar] [CrossRef]
Paletta, Q.; Hu, A.; Arbod, G.; Lasenby, J. ECLIPSE: Envisioning cloud induced perturbations in solar energy. Appl. Energy 2022, 326, 119924. [Google Scholar] [CrossRef]
Fu, Y.; Chai, H.; Zhen, Z.; Wang, F.; Xu, X.; Li, K.; Shafie-Khah, M.; Dehghanian, P.; Catalão, J.P.S. Sky image prediction model based on convolutional auto-encoder for minutely solar PV power forecasting. IEEE Trans. Ind. Appl. 2021, 57, 3272–3281. [Google Scholar] [CrossRef]
Lin, F.; Zhang, Y.; Wang, J. Recent advances in intra-hour solar forecasting: A review of ground-based sky image methods. Int. J. Forecast. 2023, 39, 244–265. [Google Scholar] [CrossRef]
Wood-Bradley, P.; Zapata, J.; Pye, J. Cloud tracking with optical flow for short-term solar forecasting. In Proceedings of the 50th Conference of the Australian Solar Energy Society, Melbourne, Australia, 6–7 December 2012. [Google Scholar]
Chow, C.W.; Urquhart, B.; Lave, M.; Dominguez, A.; Kleissl, J.; Shields, J.; Washom, B. Intra-hour forecasting with a total sky imager at the UC San Diego solar energy testbed. Sol. Energy 2011, 85, 2881–2893. [Google Scholar] [CrossRef]
Marquez, R.; Coimbra, C.F.M. Intra-hour DNI forecasting based on cloud tracking image analysis. Sol. Energy 2013, 91, 327–336. [Google Scholar] [CrossRef]
Sawant, M.; Shende, M.K.; Feijóo-Lorenzo, A.E.; Bokde, N.D. The state-of-the-art progress in cloud detection, identification, and tracking approaches: A systematic review. Energies 2021, 14, 8119. [Google Scholar] [CrossRef]
Zhen, Z.; Pang, S.; Wang, F.; Li, K.; Li, Z.; Ren, H.; Shafie-Khah, M.; Catalão, J.P.S. Pattern classification and PSO optimal weights based sky images cloud motion speed calculation method for solar PV power forecasting. IEEE Trans. Ind. Appl. 2019, 55, 3331–3342. [Google Scholar] [CrossRef]
Nie, Y.; Zelikman, E.; Scott, A.; Paletta, Q.; Brandt, A. Skygpt: Probabilistic ultra-short-term solar forecasting using synthetic sky images from physics-constrained videogpt. Adv. Appl. Energy 2024, 14, 100172. [Google Scholar] [CrossRef]
Peng, Z.; Yu, D.; Huang, D.; Heiser, J.; Yoo, S.; Kalb, P. 3D cloud detection and tracking system for solar forecast using multiple sky imagers. Sol. Energy 2015, 118, 496–519.
Bone, V.; Pidgeon, J.; Kearney, M.; Veeraragavan, A. Intra-hour direct normal irradiance forecasting through adaptive clear-sky modelling and cloud tracking. Sol. Energy 2018, 159, 852–867.
Wang, F.; Zhen, Z.; Liu, C.; Mi, Z.; Hodge, B.-M.; Shafie-Khah, M.; Catalão, J.P.S. Image phase shift invariance based cloud motion displacement vector calculation method for ultra-short-term solar PV power forecasting. Energy Convers. Manag. 2018, 157, 123–135.
Blum, N.B. Nowcasting of Solar Irradiance and Photovoltaic Production Using a Network of All-Sky Imagers. Ph.D. Dissertation, RWTH Aachen University, Aachen, Germany, 2022.
Straub, N.; Karalus, S.; Herzberg, W.; Lorenz, E. Satellite-Based Solar Irradiance Forecasting: Replacing Cloud Motion Vectors by Deep Learning. Sol. RRL 2024, 8, 2400475.
Kousounadis-Knousen, M.A.; Anagnostos, D.; Bazionis, I.K.; Bakovasilis, A.; Georgilakis, P.S.; Catthoor, F. Accurate PV Energy Yield Forecasting. In Energy Production, Load and Battery Management Framework with Supporting Methods for Smart Microgrids; Catthoor, F., Taniguchi, I., Georgilakis, P.S., Zhao, D., Soudris, D., Siozios, K., Kazantzidis, A., Eds.; Springer: Cham, Switzerland, 2025; pp. 25–46.
Liu, J.; Zang, H.; Cheng, L.; Ding, T.; Wei, Z.; Sun, G. A transformer-based multimodal-learning framework using sky images for ultra-short-term solar irradiance forecasting. Appl. Energy 2023, 342, 121160.
Zhang, A.; Lipton, Z.; Li, M.; Smola, A.J. Dive into Deep Learning. arXiv 2021, arXiv:2106.11342.
Dumoulin, V.; Visin, F. A guide to convolution arithmetic for deep learning. arXiv 2018, arXiv:1603.07285.
Gonzalez, R.C.; Woods, R.E. Digital Image Processing, 4th ed.; Pearson: London, UK, 2018.
Kousounadis-Knousen, M.A.; Catthoor, F.; Bakovasilis, A.; Georgilakis, P.S. Automatic multiclass classification of unlabeled ground-based sky images for minute-scale PV energy yield forecasting. IEEE Access 2025, 13, 120547–120562.
Shirazi, E.; Gordon, I.; Reinders, A.; Catthoor, F. Sky Images for Short-Term Solar Irradiance Forecast: A Comparative Study of Linear Machine Learning Models. IEEE J. Photovolt. 2024, 14, 691–698.
Horé, A.; Ziou, D. Image Quality Metrics: PSNR vs. SSIM. In Proceedings of the 20th International Conference on Pattern Recognition (ICPR), Istanbul, Turkey, 23–26 August 2010; pp. 2366–2369.
Michael, N.E.; Suykens, J.A.K.; Deconinck, G.; De Vos, K. Short-Term Solar Power Predicting Model Based on Multi-Step CNN-Stacked LSTM Technique. Energies 2022, 15, 2150.
Xu, F.; Sun, Y.; Guo, M. Prediction of Solar Flux Density Distribution Concentrated by a Heliostat Using a Ray Tracing-Assisted Generative Adversarial Neural Network. Energies 2025, 18, 1451.
Farnebäck, G. Two-frame motion estimation based on polynomial expansion. In Proceedings of the Scandinavian Conference on Image Analysis, Halmstad, Sweden, 29 June–2 July 2003; Springer: Berlin/Heidelberg, Germany, 2003; pp. 363–370.
Arrais, J.M.; Cerentini, A.; Martins, B.J.; Chaves, T.Z.L.; Neto, S.L.M.; von Wangenheim, A. Systematic Review on Ground-Based Cloud Tracking Methods for Photovoltaics Nowcasting. Am. J. Clim. Change 2024, 13, 452–476.
Stanford University. 2019 Sky Images and Photovoltaic Power Generation Dataset for Short-Term Solar Forecasting (Stanford Raw). Available online: https://purl.stanford.edu/jj716hx9049 (accessed on 29 October 2025).

Figure 1. Overview of the sky image sequence prediction framework, where a sequence of past frames is used as input to predict a sequence of future frames.

Figure 2. Schematic representation of a typical fully-connected Auto-Encoder architecture, consisting of an encoder, a bottleneck layer, and a decoder. Black circles represent the neurons of the artificial neural network.

Figure 3. Architecture of the proposed AE-like CNN model, showing the encoder, bottleneck, and decoder stages with the main layer types.

Figure 4. Illustration of convolution types: (a) 2D convolution; (b) 3D convolution. The asterisk symbol (*) denotes the convolution operation.

Figure 5. Overview of the employed approach for the automatic classification of the sky images.

Figure 6. Representative sky image samples per cluster derived from the proposed clustering approach: (a) Overcast (Cluster 1); (b) Sunny (Cluster 2); (c) Clear sky, sun near sunset (Cluster 3); (d) Almost overcast (Cluster 4); (e) Clear sky, sun near sunrise (Cluster 5); (f) Sun low on the horizon, partial cloud cover (Cluster 6); (g) Partial cloud cover, thin clouds (Cluster 7); (h) Partial cloud cover, thick clouds (Cluster 8).

Figure 7. Overview of the proposed sky image prediction process, including the decision mechanism that selects between persistence and AE-like CNN models according to cluster variability.
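The decision mechanism of Figure 7 can be illustrated with a minimal Python sketch. The variability criterion itself is not given in this section, so the sketch simply hard-codes the cluster-to-model assignment reported in Table 4 (persistence for Clusters 2, 3, and 5; an AE-like CNN otherwise). The names `predict_sequence`, `persistence_forecast`, and `cnn_models` are hypothetical, and each CNN is assumed to be a callable that takes the past frames and the horizon.

```python
from typing import Callable, Dict, List, Sequence

Frame = List[List[float]]  # one grayscale frame as rows of pixel intensities
Predictor = Callable[[Sequence[Frame], int], List[Frame]]

# Clusters served by persistence according to Table 4 (assumed fixed here).
PERSISTENCE_CLUSTERS = {2, 3, 5}


def persistence_forecast(past: Sequence[Frame], horizon: int) -> List[Frame]:
    """Repeat the most recent observed frame for every future step."""
    last = past[-1]
    return [[row[:] for row in last] for _ in range(horizon)]


def predict_sequence(
    cluster: int,
    past: Sequence[Frame],
    horizon: int,
    cnn_models: Dict[int, Predictor],
) -> List[Frame]:
    """Route a sequence to persistence or to its cluster-specific CNN."""
    if cluster in PERSISTENCE_CLUSTERS:
        return persistence_forecast(past, horizon)
    return cnn_models[cluster](past, horizon)


if __name__ == "__main__":
    past_frames = [[[0.1, 0.2]], [[0.3, 0.4]]]  # two tiny 1x2 frames
    print(predict_sequence(2, past_frames, 3, cnn_models={}))
```

Keeping the routing table separate from the predictors means that a cluster can be moved between persistence and a trained network without touching either model.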

Figure 8. Visualization of sensitivity analysis of kernel size (KS) and number of hidden layers (NHL) for the proposed 8-Cluster AE-like CNN model for the assessment metrics: (a) MSE; (b) SSIM.

Figure 9. Example of real and forecasted sky image sequences for Cluster 6, including pixel-wise error visualization through heat maps based on MSE values.

Figure 10. Example of real and forecasted sky image sequences for Cluster 8, including pixel-wise error visualization through heat maps based on MSE values.

Figure 11. Representative sky image samples per cluster derived from the proposed clustering approach for the second case study: (a) Clear sky with limited broken clouds (Cluster 1); (b) Thin cirrus clouds (Cluster 2); (c) Clear sky (Cluster 3); (d) Raindrops—sun not blocked (Cluster 4); (e) Sunrise (Cluster 5); (f) Raindrops—sun blocked (Cluster 6); (g) High turbulence (Cluster 7); (h) Overcast (Cluster 8).

Figure 12. Effect of input perturbation on cluster-wise MSE performance.

Table 1. Parameter configuration for the training of the proposed AE-like CNN model.

Parameter	Value
Leaky ReLU negative slope	0.2
Maximum epochs	100
Patience	10
Learning rate	0.001
Optimizer	Adam
Padding	Symmetrical zero-padding
Loss function	MSE
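As a rough illustration of how the settings in Table 1 interact, the sketch below stores them in a small configuration object and shows the stopping rule they imply. It assumes that "Patience" refers to early stopping on a validation loss, which this section does not state explicitly; `TrainConfig`, `leaky_relu`, and `stopping_epoch` are hypothetical names, and no deep learning framework is used.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TrainConfig:
    """Values copied from Table 1."""
    leaky_relu_negative_slope: float = 0.2
    max_epochs: int = 100
    patience: int = 10
    learning_rate: float = 1e-3
    optimizer: str = "Adam"
    padding: str = "symmetrical zero-padding"
    loss: str = "MSE"


def leaky_relu(x: float, negative_slope: float = 0.2) -> float:
    """Leaky ReLU: identity for x >= 0, scaled by the slope otherwise."""
    return x if x >= 0 else negative_slope * x


def stopping_epoch(val_losses: Iterable[float],
                   cfg: TrainConfig = TrainConfig()) -> int:
    """Return the 1-based epoch at which training would stop.

    Assumption: training stops once the validation loss has failed to
    improve for `patience` consecutive epochs, or at `max_epochs`.
    """
    best = float("inf")
    stale = 0
    epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
        if stale >= cfg.patience or epoch >= cfg.max_epochs:
            break
    return epoch


if __name__ == "__main__":
    losses = [1.0, 0.9] + [0.95] * 50  # no improvement after epoch 2
    print(stopping_epoch(losses))
```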

Table 2. Sensitivity analysis of kernel size (KS) and number of hidden layers (NHL) for the proposed 8-Cluster AE-like CNN model.

KS	NHL	Cluster 1 MSE (%)	Cluster 1 SSIM (%)	Cluster 4 MSE (%)	Cluster 4 SSIM (%)	Cluster 6 MSE (%)	Cluster 6 SSIM (%)	Cluster 7 MSE (%)	Cluster 7 SSIM (%)	Cluster 8 MSE (%)	Cluster 8 SSIM (%)
3	5	1.2971	59.08	0.157	67.24	1.0872	58.16	0.1182	78.78	0.2921	62.2
3	7	0.7376	64.82	0.1041	79.11	1.1235	60.71	0.0928	83.36	0.169	79.35
3	9	0.0408	89.29	0.0619	88.66	1.0422	48.07	0.0804	87.80	0.0981	89.41
3	11	0.8922	68.18	0.0635	88.06	1.26	57.12	0.0805	87.21	0.1247	86.6
5	5	1.0360	60.82	0.1328	72.44	1.2877	54.83	0.1298	80.63	0.2448	69.26
5	7	0.0450	88.03	0.0562	89.37	2.0304	36.64	2.0543	28.05	1.8057	27.12
5	9	0.0354	90.57	0.038 *	92.68 *	0.2552	89.59	1.9815	25.28	0.0773	91.31
5	11	0.0155 *	96.34 *	0.049	91.01	2.0244	55.5	0.0304 *	95.12 *	0.051 *	95.3 *
7	5	0.6987	65.78	0.117	76.09	4.1567	20.14	0.1181	80.28	1.1463	41.59
7	7	1.014	58.35	0.0536	89.91	2.3391	32.4	3.3495	19.46	2.1185	32.86
7	9	1.124	55.27	0.0455	92.14	0.0774 *	98.5 *	10.9458	13.64	5.5331	25.66
7	11	0.63	78.08	0.0408	92.46	0.2265	94.97	2.9834	32.5	1.5028	33.53

* Best metric value for each cluster.
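The per-cluster selection summarized in Table 2 amounts to a small grid search over (KS, NHL). The sketch below reproduces it for Clusters 1 and 6 using the tabulated values. It assumes the configuration is chosen by minimum MSE; in Table 2 the lowest MSE and the highest SSIM coincide for every cluster, so either criterion gives the same choice. `best_config` is a hypothetical helper name.

```python
from typing import Dict, Tuple

Config = Tuple[int, int]        # (kernel size, number of hidden layers)
Scores = Tuple[float, float]    # (MSE, SSIM) as listed in Table 2

CLUSTER_1: Dict[Config, Scores] = {
    (3, 5): (1.2971, 59.08), (3, 7): (0.7376, 64.82),
    (3, 9): (0.0408, 89.29), (3, 11): (0.8922, 68.18),
    (5, 5): (1.0360, 60.82), (5, 7): (0.0450, 88.03),
    (5, 9): (0.0354, 90.57), (5, 11): (0.0155, 96.34),
    (7, 5): (0.6987, 65.78), (7, 7): (1.014, 58.35),
    (7, 9): (1.124, 55.27), (7, 11): (0.63, 78.08),
}

CLUSTER_6: Dict[Config, Scores] = {
    (3, 5): (1.0872, 58.16), (3, 7): (1.1235, 60.71),
    (3, 9): (1.0422, 48.07), (3, 11): (1.26, 57.12),
    (5, 5): (1.2877, 54.83), (5, 7): (2.0304, 36.64),
    (5, 9): (0.2552, 89.59), (5, 11): (2.0244, 55.5),
    (7, 5): (4.1567, 20.14), (7, 7): (2.3391, 32.4),
    (7, 9): (0.0774, 98.5), (7, 11): (0.2265, 94.97),
}


def best_config(results: Dict[Config, Scores]) -> Config:
    """Return the (KS, NHL) pair with the lowest MSE."""
    return min(results, key=lambda cfg: results[cfg][0])


if __name__ == "__main__":
    print("Cluster 1:", best_config(CLUSTER_1))
    print("Cluster 6:", best_config(CLUSTER_6))
```

The two clusters end up with different kernel sizes, which is the motivation for tuning each cluster-specific network separately rather than sharing one architecture.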

Table 3. Performance assessment metrics (PSNR, MSE, and SSIM) for all models, where aggregate values are averages weighted by sample count.

Model	PSNR (dB)	MSE (%)	SSIM (%)
Persistence	30.42	0.197	76.1
OF	31.10	0.239	86.51
1-Cluster AE-like CNN	32.06	0.14	80.18
3-Cluster AE-like CNN	33.35	0.093	84.54
6-Cluster AE-like CNN	34.32	0.057	93.78
8-Cluster AE-like CNN (proposed)	35.56 *	0.053 *	94.59 *

* Best value for each metric.
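Relative improvements between two rows of Table 3 follow from simple arithmetic, sketched below. `relative_improvement` is a hypothetical helper; MSE is treated as lower-is-better and SSIM as higher-is-better. With the rounded table values, the proposed 8-Cluster model improves on the 1-Cluster AE-like CNN by about 62.14% in MSE and 17.97% in SSIM, and on persistence by about 73.10% in MSE and 24.30% in SSIM.

```python
def relative_improvement(baseline: float, proposed: float,
                         lower_is_better: bool) -> float:
    """Percentage improvement of `proposed` over `baseline`."""
    if lower_is_better:
        change = baseline - proposed
    else:
        change = proposed - baseline
    return 100.0 * change / baseline


# (MSE %, SSIM %) copied from Table 3
PERSISTENCE = (0.197, 76.1)
ONE_CLUSTER = (0.14, 80.18)
PROPOSED = (0.053, 94.59)

if __name__ == "__main__":
    for name, base in (("persistence", PERSISTENCE),
                       ("1-Cluster AE-like CNN", ONE_CLUSTER)):
        mse_gain = relative_improvement(base[0], PROPOSED[0], True)
        ssim_gain = relative_improvement(base[1], PROPOSED[1], False)
        print(f"vs {name}: MSE {mse_gain:.2f}%, SSIM {ssim_gain:.2f}%")
```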

Table 4. Per-cluster performance metrics (PSNR, MSE, and SSIM) for the proposed 8-Cluster AE-like CNN model.

Cluster	Model	Sequences	PSNR (dB)	MSE (%)	SSIM (%)
1	AE-like CNN	936	38.53	0.0155	96.34
2	Persistence	1963	37.68	0.029	96.13
3	Persistence	561	34.40	0.0858	92.27
4	AE-like CNN	1442	34.69	0.038	92.68
5	Persistence	794	32.29	0.1678	90.39
6	AE-like CNN	517	39.46	0.0774	98.5
7	AE-like CNN	885	34.09	0.0304	95.12
8	AE-like CNN	753	31.43	0.051	95.3
	Aggregate	7851	35.56	0.053	94.59
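The aggregate row can be checked by recomputing the sample-count-weighted mean of each metric from the per-cluster rows, as in the sketch below (`weighted_mean` is a hypothetical helper). With the rounded per-cluster values this gives about 35.57 dB, 0.0526%, and 94.63%, which agrees with the reported aggregates up to small differences that are presumably due to rounding of the tabulated entries.

```python
from typing import Sequence

# (cluster, sequences, PSNR in dB, MSE in %, SSIM in %) copied from Table 4
ROWS = [
    (1, 936, 38.53, 0.0155, 96.34),
    (2, 1963, 37.68, 0.029, 96.13),
    (3, 561, 34.40, 0.0858, 92.27),
    (4, 1442, 34.69, 0.038, 92.68),
    (5, 794, 32.29, 0.1678, 90.39),
    (6, 517, 39.46, 0.0774, 98.5),
    (7, 885, 34.09, 0.0304, 95.12),
    (8, 753, 31.43, 0.051, 95.3),
]


def weighted_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """Mean of `values` weighted by `weights` (here: sequence counts)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


counts = [row[1] for row in ROWS]
agg_psnr = weighted_mean([row[2] for row in ROWS], counts)
agg_mse = weighted_mean([row[3] for row in ROWS], counts)
agg_ssim = weighted_mean([row[4] for row in ROWS], counts)

if __name__ == "__main__":
    print(f"sequences={sum(counts)} PSNR={agg_psnr:.2f} dB "
          f"MSE={agg_mse:.4f}% SSIM={agg_ssim:.2f}%")
```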

Table 5. Comparison of the proposed 8-Cluster AE-like CNN model with and without per-cluster sensitivity analysis, based on MSE and SSIM metrics.

Cluster	Model	MSE (%) with	MSE (%) without	SSIM (%) with	SSIM (%) without
1	AE-like CNN	0.0155	0.0775	96.34	81.73
2	Persistence	0.029	0.029	96.13	96.13
3	Persistence	0.0858	0.0858	92.27	92.27
4	AE-like CNN	0.038	0.1533	92.68	68.08
5	Persistence	0.1678	0.1678	90.39	90.39
6	AE-like CNN	0.0774	0.266	98.5	73.65
7	AE-like CNN	0.0304	0.113	95.12	78.97
8	AE-like CNN	0.051	0.286	95.3	62.12
	Aggregate	0.053	0.139	94.59	83.73

Table 6. Performance assessment metrics (PSNR, MSE, and SSIM) for all models on the second dataset, where aggregate values are averages weighted by sample count.

Model	PSNR (dB)	MSE (%)	SSIM (%)
Persistence	31.83	0.202	80.23
1-Cluster AE-like CNN	33.16	0.132	81.37
8-Cluster AE-like CNN (proposed)	36.03 *	0.0478 *	96.28 *

* Best value for each metric.

Table 7. Evaluation results (PSNR, MSE, and SSIM) of the AE-like CNN models for different dataset splits (70-15-15 and 50-25-25).

Model	PSNR (dB) 70-15-15	PSNR (dB) 50-25-25	MSE (%) 70-15-15	MSE (%) 50-25-25	SSIM (%) 70-15-15	SSIM (%) 50-25-25
1-Cluster AE-like CNN	32.06	30.41	0.14	0.17	80.18	80.1
3-Cluster AE-like CNN	33.35	33.85	0.093	0.089	84.54	86.27
6-Cluster AE-like CNN	34.32	33.01	0.057	0.085	93.78	86.36
8-Cluster AE-like CNN	35.56 *	34.25 *	0.053 *	0.055 *	94.59 *	93.02 *

* Best value for each metric and split.
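The two partitions compared in Table 7 can be illustrated with a short sketch that divides a dataset into training, validation, and test index ranges. It assumes that the three numbers denote training, validation, and test percentages in that order, and that the split is contiguous and unshuffled; neither point is stated in this section. `split_indices` is a hypothetical name.

```python
from typing import Tuple


def split_indices(n: int, train_pct: int, val_pct: int,
                  test_pct: int) -> Tuple[range, range, range]:
    """Split indices 0..n-1 into contiguous train/validation/test ranges."""
    if train_pct + val_pct + test_pct != 100:
        raise ValueError("percentages must sum to 100")
    n_train = n * train_pct // 100
    n_val = n * val_pct // 100
    train = range(0, n_train)
    val = range(n_train, n_train + n_val)
    test = range(n_train + n_val, n)  # remainder goes to the test set
    return train, val, test


if __name__ == "__main__":
    for split in ((70, 15, 15), (50, 25, 25)):
        parts = split_indices(1000, *split)
        print(split, [len(p) for p in parts])
```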

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
