Probabilistic Short-Term Sky Image Forecasting Using VQ-VAE and Transformer Models on Sky Camera Data

Seyidbayli, Chingiz; Nezakat, Soheil; Reinhardt, Andreas

doi:10.3390/jimaging12040165


by Chingiz Seyidbayli *, Soheil Nezakat and Andreas Reinhardt

Department of Informatics, Clausthal University of Technology, 38678 Clausthal-Zellerfeld, Germany

* Author to whom correspondence should be addressed.

J. Imaging 2026, 12(4), 165; https://doi.org/10.3390/jimaging12040165

Submission received: 22 February 2026 / Revised: 2 April 2026 / Accepted: 4 April 2026 / Published: 10 April 2026

(This article belongs to the Special Issue AI-Driven Image and Video Understanding)


Abstract

Cloud cover significantly reduces the electrical power output of photovoltaic systems, making accurate short-term cloud movement predictions essential for reliable solar energy production planning. This article presents a deep learning framework that directly estimates cloud movement from ground-based all-sky camera images, rather than predicting future production from past power data. The system is based on a three-step process: First, a lightweight Convolutional Neural Network segments cloud regions and produces probabilistic masks that represent the spatial distribution of clouds in a compact and computationally efficient manner. This allows subsequent models to focus on the geometry of clouds rather than irrelevant visual features such as illumination changes. Second, a Vector Quantized Variational Autoencoder compresses these masks into discrete latent token sequences, reducing dimensionality while preserving fundamental cloud structure patterns. Third, a GPT-style autoregressive transformer learns temporal dependencies in this token space and predicts future sequences based on past observations, enabling iterative multi-step predictions, where each prediction serves as the input for subsequent time steps. Our evaluations show an average intersection-over-union ratio of 0.92 and a pixel accuracy of 0.96 for single-step (5 s ahead) predictions, while performance smoothly decreases to an intersection-over-union ratio of 0.65 and an accuracy of 0.80 in 10 min autoregressive propagation. The framework also provides prediction uncertainty estimates through token-level entropy measurement, which shows positive correlation with prediction error and serves as a confidence indicator for downstream decision-making in solar energy forecasting applications.

Keywords: cloud motion forecasting; ground-based sky imaging; vector-quantized variational autoencoders; autoregressive transformer; uncertainty-aware prediction

1. Introduction

The use of solar panels, also known as photovoltaic (PV) systems, is growing rapidly [1]. As a result, accurate short-term forecasts of the solar radiation received on the ground are more important than ever [2]. The dynamic and frequently changing nature of clouds introduces significant variability in solar irradiance, making solar power generation less predictable under cloudy conditions. Rapid changes in cloud cover can cause abrupt fluctuations in solar irradiance; therefore, accurate forecasts of cloud motion are essential for understanding PV generation better and for using this knowledge to balance electricity supply and demand [3,4].

Sky cameras are useful tools to capture timestamped images of cloud fields. These observations are widely used in both research and operational settings for short-term cloud-motion tracking and solar irradiance forecasting [5,6]. From a remote sensing perspective, all-sky cameras constitute a form of ground-based passive optical remote sensing, in which data quality is affected by instrument-specific characteristics such as wide-angle geometric distortion inherent to fisheye optics, radiometric variations induced by atmospheric aerosol scattering, and changes in solar elevation angle throughout the day [7,8]. While satellite-based cloud products offer broad spatial coverage but are limited by coarse temporal resolution, ground-based sky cameras provide higher-temporal-resolution imagery over a localized domain (typically on the order of a few kilometers), serving as a complementary observational modality with sub-minute temporal sampling directly relevant to site-specific PV energy management and intra-hour nowcasting [9,10]. However, even with high-resolution sky-camera observations, predicting the subsequent evolution of cloud fields remains challenging. This difficulty stems from the fact that cloud evolution is governed by nonlinear, multiscale atmospheric dynamics in which small differences in initial conditions can lead to substantially different outcomes, thereby limiting forecast predictability [11].

In previous studies using sky cameras, cloud motion has often been estimated using optical flow in conjunction with physics-based advection models [5]. Optical flow is an image processing approach that calculates the motion vector field between successive frames under the assumption of approximate brightness constancy. However, these traditional methods can break down under rapidly changing conditions (e.g., cloud formation, deformation, or dissipation), as they ignore non-stationary deformations and dynamic changes in cloud geometry, leading to unstable or inconsistent motion estimates [12]. Another limitation of these conventional methods is that the degree of certainty associated with their predictions is often not reported, which restricts their use in applications that call for risk-aware decision-making. This limitation motivates the incorporation of uncertainty quantification so that forecasts can be accompanied by an explicit and interpretable confidence indicator.

The development of deep neural networks has significantly improved spatiotemporal forecasting capabilities by enabling models to learn complex spatial and temporal dependencies across high-dimensional environmental data [13]. Neural network-based convolutional recurrent models, including the Convolutional Long Short-Term Memory (ConvLSTM) architecture, have been widely used for cloud movement and weather forecasting [14]. However, ConvLSTM-based methods often produce smoothed or blurry predictions at longer forecast lead times, indicating limitations in preserving fine spatial structure and detailed motion dynamics [15]. Another limitation is that deterministic pixel-level losses often yield blurry predictions, as the model is penalized equally for all deviations from the ground truth and thus learns to produce conservative, mean-like outputs rather than sharp spatial structure [16]. Also, small discrepancies at the beginning of the forecast have been reported to grow into large errors later on [17]. Most of these pixel-to-pixel models also struggle to explain the uncertainty of their own predictions [16,18,19].

Rather than predicting how pixel patterns move over time directly (i.e., in pixel space), learned latent representations can be used to model the future evolution of cloud movements. A suitable way to encode images into compact discrete latent representations is through the use of Vector-Quantized Variational Autoencoders (VQ-VAEs) [20]. The resulting discrete representation reduces spatial complexity and enables sequence-based temporal modeling [21]. Transformer-based autoregressive models have shown robust performance in video prediction tasks due to their ability to model long-term dependencies [22,23]. Uncertainty quantification is critical in renewable energy forecasting because it enhances the reliability of forecasts by presenting probability intervals for possible outcomes and supports risk-aware operational decisions [24]. Accordingly, uncertainty-aware forecasts enable better decision-making by providing richer information than single-point forecasts [25].

Existing cloud-motion and irradiance nowcasting pipelines often rely on assumptions (e.g., appearance constancy or slowly varying cloud structure) that may not hold true under rapid cloud formation, deformation, or dissipation, leading to degraded predictive performance. To mitigate these limitations, this work adopts a modular framework that separates (i) cloud-field characterization, (ii) a compact state representation, and (iii) temporal evolution modeling, with the aim of improving robustness under dynamic cloud conditions. The specific architectural design choices—namely CNN-based segmentation for cloud mask extraction, latent tokenization via VQ-VAE, and Transformer-based temporal prediction—are described in detail in Section 3. Notably, predicting binary cloud masks rather than full RGB images allows the model to focus on physically relevant cloud dynamics while reducing representational complexity [26,27].

2. Related Work

Short-term cloud motion forecasting has been studied from multiple perspectives, including image-based sensing, cloud segmentation, and spatiotemporal prediction models. In the following sections, we review the most significant works in sky-camera analysis, deep learning approaches to cloud-pattern segmentation, and representative state-of-the-art methods for predicting cloud movement over time.

2.1. Sky Camera Cloud Observation

For monitoring clouds at a local scale, sky cameras offer high-resolution observations that are more accessible and cost-effective than satellite-based imaging. Their ability to capture rapid changes at a low cost makes them ideal for short-term tasks. Early studies demonstrated the feasibility of estimating cloud motion vectors from sky images using optical flow and geometric projection models [5,28]. More recent works have focused on using sky imagery to improve solar irradiance forecasting [29,30]. Research in this area suggests that accurately estimating cloud motion from sky images is a powerful tool for reducing the uncertainty of short-term solar power forecasts [22]. In addition, studies based on all-sky camera observations have shown that high-resolution cloud segmentation and cloudiness retrieval at the local scale can complement satellite products and improve the understanding of short-term cloud variability [31]. Advances in data-driven methods, particularly deep learning-based cloud segmentation from all-sky images, further highlight the potential of sky cameras for robust cloud monitoring under diverse atmospheric conditions [32,33]. However, many traditional methods still rely on physical assumptions or on handcrafted features such as color thresholds, edge patterns, or optical-flow vectors. These methods often struggle when faced with complex cloud behavior, including cloud growth or dissipation, multi-layer motion, and rapid appearance changes caused by varying illumination, all of which occur frequently in real-world sky camera observations [34,35].

2.2. Cloud Segmentation Using Deep Learning

Correctly segmenting images into cloud and non-cloud areas is crucial, as even small errors in identifying cloud patterns can lead to significant inaccuracies in the final prediction [27,33]. Datasets such as Singapore Whole sky Nychthemeron Image Segmentation (SWINySEG) [36] provide the pixel-level cloud–sky annotations necessary for supervised training, laying the foundation for a fair comparison of cloud segmentation models. Classical threshold-based and color-space methods often fail in challenging illumination conditions such as sunrise, sunset, or thin cloud layers. To address these limitations, Convolutional Neural Networks (CNNs) have been increasingly adopted for cloud–sky segmentation tasks [36]. Lightweight CNN architectures have gained traction because they are well-suited for the limited processing power of edge devices and embedded systems. Models such as U-Net variants and compact encoder–decoder networks have shown strong performance in binary cloud masking while maintaining low computational cost [36]. In this context, a lightweight CNN inspired by UCloudNet is attractive: UCloudNet was developed to remain cost-efficient while ensuring high accuracy, and it has been shown to achieve both high segmentation accuracy and computational efficiency for cloud mask extraction from sky images [37].

2.3. Spatiotemporal Forecasting of Cloud Dynamics

Because cloud fields exhibit variability in both morphology and motion, accurate forecasting of their evolution requires models capable of capturing complex spatiotemporal dependencies. Early methods applied optical flow and motion extrapolation directly to cloud masks or image intensities [38]. While these methods are computationally efficient, they are limited in predicting nonlinear cloud motion. Recent developments in deep learning have led to major improvements in spatiotemporal forecasting, offering a level of precision that was previously difficult to achieve. Recurrent neural network (RNN)-based architectures, particularly ConvLSTM models [14], have been extensively applied to precipitation nowcasting and cloud motion prediction [39,40,41]. These models effectively capture temporal patterns, but when trained with deterministic pixel-based loss functions, they tend to produce fuzzy predictions, especially in multimodal scenarios where multiple possible future scenarios exist. For cloud boundary prediction, this fuzziness can compromise the spatial accuracy of the predicted cloud masks and contribute to error accumulation over longer prediction horizons. Transformers have emerged as a powerful alternative for sequence modeling, owing to their ability to capture long-range dependencies through self-attention mechanisms. Their effectiveness has been demonstrated across diverse domains, including video modeling, time-series forecasting, and audio processing [42,43]. Several studies have applied transformer-based architectures to video prediction and weather forecasting tasks, showing improved temporal consistency compared to recurrent models [19,44]. Despite their strengths, applying transformers directly to high-resolution image prediction remains computationally expensive.

2.4. Latent-Space and Token-Based Forecasting

To reduce computational cost, recent work has shifted toward latent-space forecasting as a more efficient approach. Several methods exist for representing images as discrete tokens including patch-based tokenization and learned codebooks, but VQ-VAE has gained prominence because its learned discrete latent codes offer a compact representation that is well suited for high-resolution image modeling and autoregressive sequence prediction [20,45]. Accordingly, sequence models can operate directly on these tokens rather than in high-dimensional pixel space. This strategy has been successfully applied to video generation and prediction using autoregressive transformers [46]. Latent token forecasting excels at cloud prediction because it tracks essential structural patterns while ignoring irrelevant pixel-level noise. Recent studies suggest that working with tokens helps maintain a clear spatial structure while still allowing the model to explore different possible outcomes through probabilistic sampling [46,47,48].

2.5. Physics-Informed Temporal Modeling

A common issue with forecasting is that models can start to ignore basic physical constraints during longer, multi-step forecasts. To overcome this issue, physics-informed neural networks embed physical laws into the model’s training objective by enforcing the governing differential equations, which helps constrain solutions and improve robustness, especially when data are limited [49]. The Physics-informed Cell (PhyCell) framework introduces a recurrent cell structure inspired by physical evolution equations and a prediction-correction paradigm (drawing on data-assimilation ideas), enabling partial differential equation (PDE)-constrained prediction in a learned latent space and improving long-term forecasting behavior and robustness to missing inputs [50].

Hybrid models that combine deep learning with physics-guided components have also shown practical benefits in Earth-system prediction, for example by integrating a process-based ecosystem model with a long short-term memory to improve accuracy while retaining physical interpretability [51]. We see that existing methods often focus on either pixel-level prediction or latent modeling, but they frequently lack a framework for explicit physical guidance; moreover, physical laws may not apply directly in pixel space for generic video/cloud imagery, motivating latent spaces where physical and residual factors can be disentangled [50]. There are limited approaches that simultaneously handle computational efficiency, physical laws, and uncertainty in the context of short-term cloud motion prediction; in this direction, operator-learning surrogates highlight how offline training can enable fast online inference (a simple forward pass) once trained [52]. The proposed framework addresses these gaps by merging efficient cloud segmentation with transformer-based forecasting and physical laws to predict cloud movement.

3. Methodology

The proposed cloud motion nowcasting system given in Figure 1 operates as a three-stage sequential pipeline: (1) cloud segmentation via a lightweight CNN to produce probability masks, (2) compression into discrete token sequences using a VQ-VAE, and (3) autoregressive temporal forecasting with a Generative Pre-trained Transformer (GPT)-style transformer. Each stage is trained independently, with outputs from earlier modules serving as inputs to subsequent ones.

By predicting cloud probability masks rather than raw RGB images, the system focuses on meteorologically relevant cloud geometries while filtering out illumination variations and background noise [53,54]. The VQ-VAE further abstracts these masks into a discrete token vocabulary, enabling efficient transformer-based sequence modeling and preventing physically implausible predictions during multi-step rollout [20,45]. The following subsections detail each component’s architecture and training procedure.

3.1. Cloud Detection Using a Lightweight CNN

Cloud detection is performed using a lightweight CNN inspired by the UCloudNet architecture [37] but substantially simplified for computational efficiency. The primary objective of this network is to transform raw sky images into continuous pixel-wise cloud probability masks. Unlike binary (0/1) segmentation, these probability masks produce a value in the range [0,1] for each pixel, where 0 represents clear sky with certainty, 1 represents cloud with certainty, and intermediate values represent ambiguous or partial cloud coverage, such as thin cirrus layers, cloud edges, or semi-transparent regions. This continuous representation provides smoother and more realistic spatial transitions compared to hard binary segmentation, particularly near cloud boundaries and in regions with thin cloud structures [36]. While the original UCloudNet employs a deep U-Net architecture with multiple encoder–decoder levels, the proposed network is simplified to a single-level minimal encoder–decoder structure to reduce computational cost. The model consists of only four convolutional layers and contains approximately 14,300 parameters, roughly 23 times fewer than UCloudNet’s ~330,000 parameters. This reduction in model complexity makes real-time inference possible even on resource-constrained devices.

As shown in Figure 2, the encoder comprises two successive $3 \times 3$ convolutional layers (with padding = 1 to preserve spatial dimensions) followed by a single $2 \times 2$ max-pooling operation that downsamples the spatial resolution by a factor of two. Each convolutional layer uses 32 feature channels and ReLU activation. The decoder restores the original spatial resolution through a learned transpose convolution layer ($2 \times 2$ kernel, stride = 2) followed by a $1 \times 1$ convolutional layer that projects the 32 feature channels into a single output channel. Transpose convolution is preferred over bilinear interpolation, as it provides learnable upsampling weights, enabling the model to generate sharper cloud boundaries. The model accepts input images of size $H \times W \times 3$ in RGB format, where $H \times W$ denotes the image dimensions after resizing to $256 \times 256$; however, since grayscale images are used in practice, all three channels contain identical values. The 3-channel structure has been retained for compatibility with standard PyTorch vision models, whose pre-trained weights expect 3-channel RGB inputs. The output consists of logit values of size $H \times W \times 1$, which are converted to probability masks in the range [0,1] via sigmoid activation. The network is trained using the Binary Cross-Entropy with Logits loss (BCEWithLogitsLoss) to optimize pixel-wise classification accuracy.

Several architectural components present in UCloudNet have been deliberately removed based on task-specific considerations. First, skip connections, a hallmark of U-Net architectures, are omitted because cloud–sky segmentation is a relatively simple two-class problem where cloud regions exhibit high internal homogeneity and are easily distinguishable from the sky background, unlike complex multi-class medical imaging tasks that typically motivate U-Net designs. While this simplification introduces minor blurring (1–2 pixels) at cloud edges, this loss is negligible in the overall pipeline since the subsequent VQ-VAE encoder applies 8× spatial downsampling. Second, residual connections are excluded because the shallow depth of the network (only four convolutional layers) does not pose a risk of vanishing gradients, rendering residual paths unnecessary. Third, batch normalization layers are removed to reduce inference overhead and avoid batch-size dependency, which is problematic in real-time single-image processing scenarios; simple input normalization (scaling images to [0,1]) provides sufficient training stability for this lightweight architecture. This minimalist design philosophy prioritizes deployment efficiency without compromising the core segmentation capability required for cloud motion nowcasting. The resulting model achieves a favorable trade-off between computational cost and task-specific performance, making it suitable for edge deployment and real-time applications.
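The layer configuration described in this subsection can be checked against the stated parameter count with simple arithmetic. The sketch below is illustrative only and rests on two assumptions that the text does not state explicitly: every layer carries a bias term, and the transpose convolution keeps 32 channels. The helper name `conv_params` is hypothetical.

```python
# Parameter count for the four-layer encoder-decoder described above.
# Assumptions (not stated in the text): every layer has a bias term, and the
# transpose convolution maps 32 channels to 32 channels.

def conv_params(in_ch, out_ch, kernel):
    """Weights plus biases of a 2D (transpose) convolution with a square kernel."""
    return in_ch * out_ch * kernel * kernel + out_ch

layers = [
    ("conv 3x3, 3 -> 32", conv_params(3, 32, 3)),
    ("conv 3x3, 32 -> 32", conv_params(32, 32, 3)),
    ("transpose conv 2x2, 32 -> 32", conv_params(32, 32, 2)),
    ("conv 1x1, 32 -> 1", conv_params(32, 1, 1)),
]
total_params = sum(count for _, count in layers)

for name, count in layers:
    print(f"{name:30s} {count:6d}")
print("total parameters:", total_params)
print("ratio to 330,000:", round(330_000 / total_params, 1))
```

Under these assumptions the total is 14,305, which agrees with the reported figure of approximately 14,300 parameters and with the stated factor of roughly 23 relative to UCloudNet.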

3.2. Discrete Latent Representation with VQ-VAE

To represent cloud masks in a compressed and structured format, a customized VQ-VAE architecture is used [20]. The primary function of this module is to convert high-resolution spatial cloud masks (H×W pixel array) into discrete symbolic token sequences, so that the subsequent transformer-based prediction model can operate on meaningful cloud structure patterns selected from a limited codebook rather than on continuous pixel values. The VQ-VAE model architecture is shown in Figure 3. The encoder downsamples the binary cloud mask spatially by a factor of 8 using three consecutive strided-convolution layers and produces a compact latent representation in which each spatial location is described by a 64-dimensional continuous feature vector.

The 8× spatial downsampling filters out local pixel noise in cloud masks, preserving only large-scale cloud structures (cloud clusters, boundaries, and holes); this allows the transformer model to operate on 64 times fewer tokens per frame, with a corresponding reduction in computational cost. The encoder output is converted into discrete symbols using a fixed-size embedding table, which is a learnable codebook consisting of 1024 entries, each being a 64-dimensional vector. This codebook size was selected because it has precedent in the literature [45]. Furthermore, the ablation study results for codebooks of varying sizes and the rationale for the selected configuration are provided in Section 5. For each spatial location, the encoder output is mapped to its nearest codebook vector according to the Euclidean distance, which yields one token index per spatial location. Consequently, each cloud mask is represented as a token map of size $H / 8 \times W / 8$, in which every token corresponds to a cloud structure prototype selected from the codebook. The quantized latent representation is converted back to the original cloud mask using a decoder with three consecutive transpose convolution (deconvolution) layers. This reconstruction is necessary because the transformer operates in a compressed discrete symbol space for computational efficiency, whereas visualization and comparative evaluation against real data require interpretable pixel-level cloud masks.
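The nearest-neighbour lookup described above can be illustrated with a toy example. This is a minimal sketch and not the authors' implementation: the three-entry, two-dimensional codebook and the names `quantize` and `squared_distance` are hypothetical stand-ins for the 1024-entry, 64-dimensional codebook used in the paper.

```python
# Toy nearest-neighbour vector quantisation. The 3-entry, 2-dimensional
# codebook is hypothetical; the paper uses 1024 entries of dimension 64.

def squared_distance(a, b):
    """Squared Euclidean distance (same argmin as the Euclidean distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(latent_map, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    return [
        [min(range(len(codebook)), key=lambda k: squared_distance(vec, codebook[k]))
         for vec in row]
        for row in latent_map
    ]

codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
latent_map = [
    [(0.1, -0.1), (0.9, 0.2)],
    [(0.2, 0.8), (0.6, 0.1)],
]
token_map = quantize(latent_map, codebook)
print(token_map)

# Token-map size for a 256 x 256 mask with 8x spatial downsampling.
tokens_per_frame = (256 // 8) * (256 // 8)
print(tokens_per_frame)
```

For the $256 \times 256$ inputs used here, the token map has $32 \times 32 = 1024$ entries per frame.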

3.3. Temporal Forecasting Using an Autoregressive Transformer

To model cloud motion, a GPT-style autoregressive transformer architecture that learns the temporal evolution of discrete token sequences is used [21]. The main task of this module is to predict the cloud masks at future time steps at the token level, taking the token sequences of past time steps as input (e.g., predicting the next 8 masks from the last 4 cloud masks). At each time step, the cloud mask is converted to $H / 8 \times W / 8$ tokens by the VQ-VAE encoder. This 2D token map is flattened into a 1D sequence. If 4 past time steps are used, the total input length is $4 \times (H / 8 \times W / 8)$ tokens. Each token is an index value selected from the VQ-VAE codebook. Our model uses a decoder-only transformer architecture, the same approach used in GPT and similar autoregressive language models. This design includes only a causal self-attention mechanism; that is, each token can attend only to the tokens preceding it and has no access to future tokens.
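The flattening step and the causal constraint can be sketched in a few lines. The example below is illustrative only: the $2 \times 2$ token maps are hypothetical, the function names are not taken from the paper, and row-major flattening is assumed.

```python
def flatten_frames(token_maps):
    """Flatten each 2D token map row by row and concatenate frames in time order."""
    return [tok for frame in token_maps for row in frame for tok in row]

def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]

# Two hypothetical 2 x 2 token maps (entries are codebook indices).
frames = [[[5, 7], [7, 2]], [[5, 5], [7, 2]]]
sequence = flatten_frames(frames)
mask = causal_mask(len(sequence))
print(sequence)

# Context length for 4 frames of 256 x 256 masks with 8x downsampling.
context_length = 4 * (256 // 8) * (256 // 8)
print(context_length)
```

With four context frames of $256 \times 256$ masks, the flattened context contains 4096 tokens.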

Each token index is first converted into a 256-dimensional continuous vector via a learnable embedding lookup table, a $1024 \times 256$ weight matrix where the token index serves as a key to retrieve the corresponding row vector. Since transformers do not naturally carry sequence information, a learnable positional encoding is added to each token. Learnable encoding is preferred over fixed sinusoidal encoding because the spatiotemporal structure of cloud sequences differs from that of natural language, and it is more appropriate for the model to learn this structure itself.

Concretely, a learnable positional embedding matrix of shape $(T + K) \cdot S \times d_{model}$ is employed, where $T$ and $K$ denote the numbers of context and target frames, respectively, $S = \frac{H}{8} \times \frac{W}{8}$ is the number of tokens per frame, and $d_{model} = 256$ is the embedding dimension. Each of the $(T + K) \cdot S$ positions receives a unique trainable vector that jointly encodes both the spatial location of the token within its frame and its temporal offset across frames. Although the 2D token map is serialised into a 1D sequence via raster-scan (row-major) order, the learned embeddings implicitly recover spatial proximity. To verify this, pairwise cosine similarities between all token embeddings within a single frame are computed on the trained model: tokens that are direct spatial neighbours in the $\frac{H}{8} \times \frac{W}{8}$ grid (4-connected, Manhattan distance $= 1$) yield a mean cosine similarity of $0.051$, whereas tokens separated by Manhattan distance $\geq 5$ yield only $0.004$, giving a locality ratio of $12.4 \times$. This demonstrates that spatial context is preserved during autoregressive sequence modelling despite the 1D serialisation.
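The locality check described above can be expressed as a short procedure. The following sketch is an assumption-laden illustration rather than the authors' evaluation code: the function names are hypothetical, and the embeddings are synthetic stand-ins on an $8 \times 8$ grid, so the resulting numbers are unrelated to the values reported for the trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def locality_stats(embeddings, grid_h, grid_w, far_threshold=5):
    """Mean cosine similarity of 4-connected neighbours and of distant pairs.

    embeddings[r * grid_w + c] is the positional vector of grid cell (r, c).
    """
    near, far = [], []
    cells = [(r, c) for r in range(grid_h) for c in range(grid_w)]
    for a, (r1, c1) in enumerate(cells):
        for b in range(a + 1, len(cells)):
            r2, c2 = cells[b]
            dist = abs(r1 - r2) + abs(c1 - c2)   # Manhattan distance on the grid
            if dist == 1 or dist >= far_threshold:
                sim = cosine(embeddings[a], embeddings[b])
                (near if dist == 1 else far).append(sim)
    return sum(near) / len(near), sum(far) / len(far)

# Synthetic stand-in embeddings on an 8 x 8 grid (not the trained model's).
STEP = 0.1
toy = [
    (math.cos(r * STEP), math.sin(r * STEP), math.cos(c * STEP), math.sin(c * STEP))
    for r in range(8) for c in range(8)
]
near_mean, far_mean = locality_stats(toy, 8, 8)
locality_ratio = near_mean / far_mean
print(near_mean, far_mean, locality_ratio)
```

A ratio above one indicates that neighbouring grid positions have more similar embeddings than distant ones, which is the property examined in the text.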

The model consists of 6 identical transformer layers, each containing multi-head causal self-attention, a feed-forward network, residual connections, and layer normalization. The ablation study regarding the selection of the number of layers is presented in Section 5. In encoder–decoder transformers, encoder layers use bidirectional full attention, while decoder layers use causal attention. Since the proposed model has a decoder-only architecture, causal masking is applied in all layers. This design provides critical advantages in terms of temporal consistency and long-range dependencies [19], together with the benefits of working at the token level instead of the pixel level [47,48].

Operating at the token level is also crucial for performance. If the model operated at the pixel level, the attention matrix would contain ${(H \times W)}^{2}$ entries. However, since the model operates at the token level, both the height and width dimensions are reduced by a factor of 8. In this case, the total matrix size is

$${\left(\frac{H}{8} \times \frac{W}{8}\right)}^{2} = \frac{{(H \times W)}^{2}}{64^{2}}. \tag{1}$$

Therefore, the attention matrix is $64^{2} = 4096$ times smaller than at the pixel level, and the cost of computing attention is reduced by roughly the same factor.
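As a worked instance of Equation (1), the sizes can be evaluated for the $256 \times 256$ inputs used in this work. The variable names below are illustrative, and the calculation considers a single frame only.

```python
H = W = 256          # resized input resolution (Section 3.1)
DOWNSAMPLE = 8       # VQ-VAE spatial downsampling factor

pixels_per_frame = H * W
tokens_per_frame = (H // DOWNSAMPLE) * (W // DOWNSAMPLE)

# Number of entries in a full self-attention matrix over one frame.
pixel_attention_entries = pixels_per_frame ** 2
token_attention_entries = tokens_per_frame ** 2
reduction = pixel_attention_entries // token_attention_entries

print(pixels_per_frame, tokens_per_frame)
print(pixel_attention_entries, token_attention_entries, reduction)
```

For one frame this gives 65,536 pixels against 1024 tokens, and the attention matrix shrinks by the factor of 4096 stated above.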

3.4. Autoregressive Multi-Step Forecasting Strategy

Our framework employs an autoregressive forecasting approach where predictions are generated iteratively in a sliding-window manner. Given an initial temporal context of four consecutive cloud mask token sequences (frames at $t - 3, t - 2, t - 1, t$), the transformer predicts the token sequence for the next time step ($t + 1$). This predicted sequence is then incorporated into the context by appending it and dropping the oldest frame, creating a new context window (frames at $t - 2, t - 1, t, t + 1$) for predicting $t + 2$. This recursive process continues for the desired forecast horizon, enabling multi-step predictions without requiring ground-truth future observations during inference.
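The sliding-window procedure can be sketched as follows. This is a minimal illustration under stated assumptions: `rollout` and `persistence` are hypothetical names, each frame is represented by a short list of toy tokens, and the persistence predictor merely stands in for the trained transformer.

```python
def rollout(context, predict_next, horizon, window=4):
    """Iteratively forecast `horizon` frames with a sliding context window."""
    context = list(context)[-window:]
    forecasts = []
    for _ in range(horizon):
        next_frame = predict_next(context)      # tokens for the next time step
        forecasts.append(next_frame)
        context = context[1:] + [next_frame]    # drop oldest frame, append prediction
    return forecasts

def persistence(context):
    """Placeholder predictor: repeat the most recent frame."""
    return list(context[-1])

history = [[1, 1], [1, 2], [2, 2], [2, 3]]      # frames t-3 ... t (toy tokens)
predicted = rollout(history, persistence, horizon=3)
print(predicted)
```

Because each prediction re-enters the context, any error made at one step is carried into all later steps, which is consistent with the gradual performance decrease over longer horizons reported in the abstract.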

4. Data Selection and Model Training

The complete training pipeline consists of three sequential stages, each optimizing a distinct component of the forecasting system. The lightweight CNN model was trained using the SWINySEG dataset [36]. Then, binary mask images of the Clausthal All Sky Camera Recordings (CASCAR) dataset [55], which has a 5 s sampling interval, were extracted using this pre-trained model, and the other models were trained on these masks.

4.1. Stage 1: Cloud Segmentation Network

The lightweight CNN is trained end-to-end to produce pixel-wise cloud probability masks from raw grayscale sky images. Training data consists of 6768 image-mask pairs from the SWINySEG dataset, where masks are binary ground-truth labels (0 = clear sky, 1 = cloud) created through the annotation procedure described in Section 3.

Images are randomly shuffled at each epoch to prevent overfitting to temporal ordering. No explicit data augmentation (e.g., rotation and flipping) is applied, as fisheye sky images have inherent rotation invariance that the model must learn.

The model is trained using BCEWithLogitsLoss, which combines a sigmoid activation with the binary cross-entropy criterion in a numerically stable form:

$$L_{seg} = - \frac{1}{H W} \sum_{i} \left[ y_{i} \log {\hat{p}}_{i} + (1 - y_{i}) \log (1 - {\hat{p}}_{i}) \right], \tag{2}$$

where $y_{i} \in \{0, 1\}$ is the ground-truth cloud label and ${\hat{p}}_{i}$ is the predicted cloud probability at pixel $i$. This loss is well-suited for binary cloud/sky segmentation because it directly models per-pixel binary probabilities and operates on logits for improved numerical stability, leading to reliable and stable optimization in dense prediction tasks. The Adam optimizer was used for model training with a learning rate of $10^{- 3}$.
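To illustrate why operating on logits is numerically safer, the sketch below compares a direct transcription of Equation (2) with one common stable rearrangement. It is written in plain Python for clarity and is not claimed to reproduce the internal implementation of PyTorch; the function names and sample values are hypothetical.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from logits.

    Uses max(z, 0) - z*y + log(1 + exp(-|z|)), which equals Equation (2) with
    p = sigmoid(z) but avoids overflow and log(0) for large |z|.
    """
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

def bce_naive(logits, targets):
    """Direct transcription of Equation (2); breaks down for extreme logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(logits)

logits = [2.0, -1.5, 0.3, -4.0]
targets = [1, 0, 1, 0]
stable = bce_with_logits(logits, targets)
naive = bce_naive(logits, targets)

# Confidently wrong predictions: the naive form would raise an error here.
extreme = bce_with_logits([1000.0, -1000.0], [0, 1])
print(stable, naive, extreme)
```

Both forms agree for moderate logits, while only the rearranged form remains finite and well defined for very large magnitudes.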

4.2. Stage 2: VQ-VAE for Discrete Latent Representation

The VQ-VAE is trained to compress cloud probability masks (output from Stage 1) into discrete token sequences while preserving structural cloud patterns. Training data consists of temporal sequences with a sliding window approach: each sample contains $T_{context} + T_{pred}$ consecutive frames (e.g., 4 context + 1 prediction = 5 frames total), where frames are sampled with stride 5 s to capture temporal dynamics at sufficient time scales. All frames within each temporal sequence (both context and prediction frames) are processed independently through the VQ-VAE to maximize data utilization.
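The sliding-window construction of training samples can be sketched as follows. The helper `make_sequences` is hypothetical, integer frame identifiers stand in for consecutive 5 s frames, and the step of one frame between successive windows is an assumption that the text does not specify.

```python
def make_sequences(frames, t_context=4, t_pred=1, step=1):
    """Return every window of t_context + t_pred frames, `step` frames apart."""
    window = t_context + t_pred
    return [
        (frames[start:start + t_context], frames[start + t_context:start + window])
        for start in range(0, len(frames) - window + 1, step)
    ]

frame_ids = list(range(8))            # stand-ins for 8 consecutive 5 s frames
samples = make_sequences(frame_ids)
for context_frames, target_frames in samples:
    print(context_frames, target_frames)
```

Eight consecutive frames yield four overlapping samples of five frames each in this configuration.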

Training uses the standard VQ-VAE objective combining L1 reconstruction loss and commitment loss [20], where the codebook is updated using an exponential moving average (EMA) instead of direct backpropagation for numerical stability. Training is considered converged when the reconstruction quality stabilizes and the codebook usage reaches sufficient diversity (measured by complexity). The final model is frozen and reused as a fixed encoder/decoder pair for the next stage.

The codebook size of

K = 1024

and the commitment cost

β = 0.25

follow the original VQ-VAE formulation of [20], in which

β

was identified as the empirically optimal value that prevents codebook collapse while maintaining a useful gradient signal for the encoder. The hidden channel width of 128 and latent dimension of 64 were chosen to balance representational capacity against computational cost for single-channel

256 \times 256

inputs.
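The quantization step, the commitment term, and the EMA codebook update described in this stage can be illustrated with a deliberately simplified sketch. It operates on a single two-dimensional encoder output and a three-entry toy codebook. The EMA decay of 0.99, the averaging of the commitment term over dimensions, and all function names are assumptions made for illustration; a full EMA formulation additionally tracks running cluster counts and sums per code.

```python
def nearest_code(z_e, codebook):
    """Index of the codebook vector closest to z_e (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(z_e, e)) for e in codebook]
    return dists.index(min(dists))

def commitment_term(z_e, code, beta=0.25):
    """beta * ||z_e - code||^2, averaged over dimensions (code treated as constant)."""
    return beta * sum((a - b) ** 2 for a, b in zip(z_e, code)) / len(z_e)

def ema_update(code, assigned_mean, decay=0.99):
    """Simplified EMA step: move the code toward the mean of its assigned outputs."""
    return [decay * c + (1.0 - decay) * m for c, m in zip(code, assigned_mean)]

codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]  # toy codebook with K = 3
z_e = [0.9, 0.8]                                 # hypothetical encoder output
k = nearest_code(z_e, codebook)                  # discrete token index
loss = commitment_term(z_e, codebook[k])
codebook[k] = ema_update(codebook[k], z_e)
```

The token index `k` is what the transformer in Stage 3 consumes, while the commitment term keeps encoder outputs close to their assigned codes.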

4.3. Stage 3: Autoregressive Transformer for Temporal Forecasting

The GPT-based transformer is trained to predict future token sequences given past context, conditioned on the frozen VQ-VAE encoder. The VQ-VAE weights from the previous Stage 2 are loaded and frozen to prevent catastrophic forgetting of the learned discrete representation. Training data consists of temporal sequences where context frames (

T = 4

) and target frames (

K = 1

or more) are encoded into discrete tokens via the frozen VQ-VAE encoder.

For each training sample, context and target frames are first encoded to discrete tokens via the frozen VQ-VAE encoder. These token sequences are concatenated into a single flat sequence of length L, then split into input

x = [z_{1}, \dots, z_{L - 1}]

and target

y = [z_{2}, \dots, z_{L}]

for standard teacher forcing: the model predicts the next token at each position given all previous tokens. This formulation trains the transformer to autoregressively generate token sequences that, when decoded by the VQ-VAE decoder, correspond to future cloud masks. The causal attention mask ensures that predictions at position i only depend on tokens

z_{1}, \dots, z_{i}

, preventing information leakage from future time steps.
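The input/target shift and the causal mask can be sketched on a toy example. Three frames with two tokens each stand in for the real frames; the helper names are illustrative assumptions.

```python
def teacher_forcing_pair(frame_tokens):
    """Flatten per-frame token lists and return the shifted (input, target) pair."""
    flat = [tok for frame in frame_tokens for tok in frame]
    return flat[:-1], flat[1:]

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

frames = [[5, 9], [5, 2], [7, 2]]   # toy token indices, 3 frames x 2 tokens
x, y = teacher_forcing_pair(frames)
mask = causal_mask(len(x))
```

Here `x` is `[5, 9, 5, 2, 7]` and `y` is `[9, 5, 2, 7, 2]`, so the output at each position is trained against the token that follows it, and the lower-triangular mask blocks attention to later positions.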

The GPT Transformer is trained with standard autoregressive cross-entropy loss over the discrete token vocabulary

V

:

L_{GPT} = - \frac{1}{L - 1} \sum_{t = 1}^{L - 1} \log P_{θ} (x_{t} \mid x_{< t}),

(3)

where

x_{t} \in V

is the ground-truth token at position t and

L = (T + K) \cdot S

is the total sequence length, with T context frames, K prediction frames, and S tokens per frame. The autoregressive cross-entropy objective is the standard training criterion for GPT-style sequence models [56], and transfers directly to discrete VQ token sequences. The architecture consists of 6 transformer layers, 512 hidden dimensions, and 8 attention heads (each operating in a 64-dimensional subspace), where the hidden dimensions and attention head configuration follow the design principles of [21], and the number of layers is determined based on the ablation study presented in Section 5.6. A dropout rate of 0.1 is applied throughout, consistent with the original transformer implementation of [21].
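As a rough illustration of Equation (3), the sketch below computes the sequence length and the mean negative log-likelihood of target tokens from raw logits. The value S = 1024 tokens per frame is an assumption derived from a 256 × 256 input with 8× spatial downsampling; the function names and the uniform-logit example are likewise illustrative.

```python
import math

def sequence_length(t_context, k_pred, tokens_per_frame):
    """L = (T + K) * S."""
    return (t_context + k_pred) * tokens_per_frame

def autoregressive_nll(logits_per_step, targets):
    """Mean negative log-likelihood of the target tokens under softmax(logits)."""
    total = 0.0
    for logits, tgt in zip(logits_per_step, targets):
        m = max(logits)  # subtract the maximum for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))
        total -= logits[tgt] - log_z
    return total / len(targets)

S = (256 // 8) ** 2               # assumption: 32 x 32 = 1024 tokens per frame
L = sequence_length(4, 1, S)      # 5 frames in total
# An untrained model with uniform logits over 1024 codes scores log(1024).
uniform = autoregressive_nll([[0.0] * 1024] * 3, [0, 17, 1023])
```

Under these assumptions L = 5120, and a uniform predictor gives a loss of log(1024) ≈ 6.93, which serves as a reference point for a trained model's loss.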

5. Evaluation of the System

The objective of this evaluation is to verify whether the proposed pipeline can produce short-term cloud-mask forecasts from ground-based sky-camera sequences with high accuracy, and whether its probabilistic sampling provides uncertainty estimates that are informative about prediction errors. We focus on three practical questions that can be answered directly by means of the experiments:

(Q1): Following the standard next-frame prediction protocol in the video prediction literature [57], can the model accurately predict the next cloud mask from four past frames on unseen sequences?
(Q2): Does the model remain usable under recursive rollout, i.e., does forecast quality degrade gradually rather than collapsing over multiple steps?
(Q3): Do higher uncertainty values tend to coincide with higher prediction error, indicating that the uncertainty estimates are meaningful for identifying less reliable predictions?

5.1. Sky Image Acquisition and Pre-Processing

Data for this study were obtained from the CASCAR dataset [55]. The dataset was collected via a ground-based automated sky observation system that uses a roof-mounted, sky-facing camera. The system consists of an Oculus All-Sky Camera 150° [58] and an Oculus 180 Lens [59] (Starlight Xpress Ltd., The Old Dairy, Allanbay Park, Howe Lane, Binfield, Berkshire RG42 5QA, UK), controlled via the Instrument Neutral Distributed Interface (INDI) protocol. Image capture was performed during daylight hours, with sunrise and sunset times calculated daily from the geographic location using astronomy libraries. The dataset is sampled at a high temporal resolution, with between four and twelve images per minute.

This sampling strategy was adopted to capture the short-term evolution of cloud dynamics and, in particular, to preserve the temporal continuity of fast-moving clouds. To ensure consistent image quality under variable lighting conditions, an automatic exposure optimization algorithm was run every hour, starting at sunrise each day. The algorithm performs an iterative search that drives the 99th-percentile normalized pixel intensity of the image toward a target value of 0.60. This approach adapts to atmospheric lighting conditions that change from morning to noon to evening with the sun’s altitude, preventing both overexposure (saturation) and underexposure. Each image has a size of

800 \times 600

pixels (width × height) in the dataset.

5.2. Experimental Setup

A total of 103,304 images were used in the study. They were captured on different days and at different hours of the day (152 different sequences) and cover a range of cloud densities and rates of change in cloud movement [60]. When selecting these data, care was taken to choose partly cloudy images in which clouds could be segmented and their movements predicted, rather than completely cloud-free or completely overcast images. Each sequence consists of binary cloud probability masks generated by the cloud segmentation model described in Section 3. The training data follows the annotation principles of publicly available ground-based sky datasets such as SWINySEG [36], ensuring compatibility with existing benchmarks [37]. As shown in Figure 4, each fisheye sky image in SWINySEG is paired with a manually annotated binary ground-truth mask in which each pixel is labeled as cloud or clear sky.

The evaluation focuses on short-term cloud forecasting, which is crucial for managing PV power variability and supporting grid-balancing operations, where the model predicts future cloud masks based on a fixed number of past frames. In this study, 15% (23 sequences) and 5% (7 sequences) of the data were used for training and validation, respectively. The remaining data (122 sequences, totaling 82,931 images) were used to measure model performance, with the model required to predict the next frame in each sequence.

All models were trained and evaluated on a server equipped with two 16-core AMD EPYC 7282 processors and Nvidia A100 80 GB GPU, running CUDA 12.4.

5.3. Evaluation Metrics

To evaluate model performance, the metrics Intersection over Union (IoU), F1-score, pixel accuracy, Mean Absolute Error (MAE), Mean Squared Error (MSE), and Structural Similarity Index Measure (SSIM) are employed, which are widely used in the remote sensing and image segmentation literature. Before calculating the binary classification metrics, both the predicted masks and the ground-truth masks are passed through a threshold of t = 128/255 to obtain a binary representation.

The IoU quantifies the spatial overlap between predicted and ground-truth cloud regions. Unlike accuracy-based measures, it is largely unaffected by class imbalance, making it particularly suitable for scenes where cloud coverage is sparse:

\mathrm{IoU} = \frac{| P \cap G |}{| P \cup G |},

(4)

where P and G denote the sets of predicted and ground-truth cloud pixels, respectively.

The F1-score, equivalent to the Dice coefficient in binary segmentation, accounts for both false positives and false negatives and provides a balanced measure of overlap between the predicted and reference masks:

\mathrm{F1} = \frac{2 | P \cap G |}{| P | + | G |} .

(5)

Pixel accuracy measures the fraction of correctly classified pixels across the entire frame:

\mathrm{Accuracy} = \frac{N_{correct}}{N_{total}},

(6)

where

N_{correct}

is the number of correctly classified pixels and

N_{total}

is the total number of pixels in the image.
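Equations (4) to (6) can be sketched on a small example. The sketch binarizes a predicted probability mask at 128/255 and derives IoU, F1, and pixel accuracy from the confusion counts. Treating values equal to the threshold as cloud, and returning 1.0 when both masks are empty, are conventions assumed here for illustration.

```python
def binarize(mask, threshold=128 / 255):
    """Map values at or above the threshold to 1 (cloud), others to 0 (clear sky)."""
    return [[1 if v >= threshold else 0 for v in row] for row in mask]

def overlap_metrics(pred, truth):
    """Return (IoU, F1, pixel accuracy) for two binary masks of equal shape."""
    tp = fp = fn = tn = 0
    for prow, trow in zip(pred, truth):
        for p, g in zip(prow, trow):
            if p and g:
                tp += 1
            elif p:
                fp += 1
            elif g:
                fn += 1
            else:
                tn += 1
    union = tp + fp + fn
    iou = tp / union if union else 1.0           # assumed convention for empty masks
    f1 = 2 * tp / (2 * tp + fp + fn) if union else 1.0
    acc = (tp + tn) / (tp + fp + fn + tn)
    return iou, f1, acc

pred = binarize([[0.9, 0.7, 0.1], [0.2, 0.6, 0.0]])  # hypothetical probabilities
truth = [[1, 1, 0], [1, 0, 0]]
iou, f1, acc = overlap_metrics(pred, truth)
```

In this toy case there are two true positives, one false positive, one false negative, and two true negatives, giving IoU = 0.5 and F1 = accuracy = 2/3. Note that |P| + |G| in Equation (5) equals 2·TP + FP + FN.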

MAE and MSE capture pixel-level intensity differences between the predicted and ground-truth images. MAE computes the mean absolute deviation and is more robust to outliers, while MSE penalizes larger errors more heavily by squaring the residuals:

\mathrm{MAE} = \frac{1}{N} \sum_{i = 1}^{N} | {\hat{y}}_{i} - y_{i} |,

(7)

\mathrm{MSE} = \frac{1}{N} \sum_{i = 1}^{N} {({\hat{y}}_{i} - y_{i})}^{2},

(8)

where

{\hat{y}}_{i}

and

y_{i}

are the predicted and ground-truth intensity values at pixel i, and N is the total number of pixels.
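The different sensitivity of the two measures to large errors can be seen in a minimal sketch of Equations (7) and (8). The 2 × 2 masks are hypothetical values chosen so that one pixel carries a much larger error than the others.

```python
def mae_mse(pred, truth):
    """Return (MAE, MSE) over all pixels of two equally sized 2D arrays."""
    diffs = [p - g for prow, trow in zip(pred, truth) for p, g in zip(prow, trow)]
    n = len(diffs)
    mae = sum(abs(d) for d in diffs) / n
    mse = sum(d * d for d in diffs) / n
    return mae, mse

pred = [[0.9, 0.1], [0.5, 0.0]]    # hypothetical predicted intensities in [0, 1]
truth = [[1.0, 0.0], [1.0, 0.0]]
mae, mse = mae_mse(pred, truth)
```

Here MAE = 0.175 and MSE = 0.0675. The single 0.5 error contributes about 71% of the absolute error but about 93% of the squared error, which is the heavier penalty on large residuals described above.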

Finally, SSIM evaluates perceptual similarity by jointly considering luminance, contrast, and structural information between two images:

\mathrm{SSIM} (\hat{I}, I) = \frac{(2 μ_{\hat{I}} μ_{I} + c_{1}) (2 σ_{\hat{I} I} + c_{2})}{(μ_{\hat{I}}^{2} + μ_{I}^{2} + c_{1}) (σ_{\hat{I}}^{2} + σ_{I}^{2} + c_{2})},

(9)

where

\hat{I}

denotes the reconstructed image and

I

the reference image,

μ_{\hat{I}}

and

μ_{I}

are the mean intensities of

\hat{I}

and

I

, respectively,

σ_{\hat{I}}^{2}

and

σ_{I}^{2}

are their corresponding variances,

σ_{\hat{I} I}

the covariance between the two images, and

c_{1}

,

c_{2}

are small stabilizing constants included to avoid division by zero. SSIM ranges from

- 1

to 1, with a value of 1 indicating perfect structural agreement.
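A simplified sketch of Equation (9) is given below. It evaluates the formula once over the whole image, whereas common SSIM implementations average the same expression over local windows. The constants c1 = 0.01² and c2 = 0.03² are the widely used defaults for images scaled to [0, 1] and are an assumption here, since the values used in this work are not stated.

```python
def global_ssim(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Equation (9) evaluated over a single window covering the whole image."""
    x = [v for row in a for v in row]
    y = [v for row in b for v in row]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((v - mx) ** 2 for v in x) / n
    vy = sum((v - my) ** 2 for v in y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y)) / n
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx * mx + my * my + c1) * (vx + vy + c2)
    return num / den

checker = [[0.0, 1.0], [1.0, 0.0]]
inverted = [[1.0, 0.0], [0.0, 1.0]]
same = global_ssim(checker, checker)
opposite = global_ssim(checker, inverted)
```

Identical images give a value of 1, while the inverted pattern has the same mean and variance but negative covariance and therefore yields a value close to -1.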

5.4. Training Performance

Figure 5 illustrates the convergence behavior during the training of the VQ-VAE model, which constitutes the second stage of the pipeline. Two complementary loss components are observed: (a) the reconstruction loss

L_{recon}

, which measures how accurately the decoder recovers the original cloud probability mask from the quantized tokens, and (b) the commitment loss

L_{VQ}

, which ensures that the encoder outputs remain close to the code book embeddings.

Figure 5a shows that the reconstruction MSE decreases rapidly from 0.0105 to approximately 0.0045 in the first 7 epochs, then gradually approaches its final value of 0.0037 at epoch 29. Training stops at this point because the validation loss shows no further improvement.

This trajectory indicates that the encoder–decoder pair quickly learns coarse cloud structure representations (e.g., large cloud clusters and open sky regions) during early training, then refines fine details (e.g., cloud edge textures and thin cirrus patterns) as training progresses. The final MSE of 0.0037 corresponds to a root mean square error of approximately 0.06 in the normalized [0,1] mask space, representing an average deviation of approximately 6% per pixel between the input and reconstructed masks.

Figure 5b shows that the commitment loss

L_{VQ}

decreases from 0.0021 to 0.0017 and stabilizes after the 7th epoch. This pattern reflects the convergence of the codebook learning process via exponential moving average: early training involves frequent reassignment of encoder outputs to different codebook entries as the model explores the latent space, while later training exhibits stable token assignments once the codebook captures the diverse vocabulary of recurring cloud patterns. The low final commitment loss (0.0017) indicates minimal inconsistency between continuous encoder outputs and their closest discrete codebook neighbors, suggesting that quantization does not introduce a significant information bottleneck.

The synchronized convergence of both loss components after epoch 7 suggests that the VQ-VAE reaches a stable equilibrium where reconstruction quality and codebook consistency are jointly optimized. This stability is critical for downstream transformer training (Stage 3), as inconsistent token assignments would cause distribution shift and prevent the autoregressive model from learning reliable temporal patterns.

5.5. Model Performance

To measure the performance of the proposed pipeline, its ability to predict the next time step (

t + 1

) is evaluated on the test set when four context frames (

t - 3, t - 2, t - 1, t

) are provided. The choice of

T = 4

context frames follows the established convention in the video prediction literature [57], where a four-frame window has been shown to provide sufficient short-range temporal context for next-frame prediction while remaining computationally tractable. This single-step accuracy serves as a baseline for assessing subsequent multi-step rollout degradation.

The model was evaluated on 122 sequences, each containing an average of 680 consecutive frames. The model’s performance was compared with deep learning and traditional baseline models. Optical flow estimates per-pixel displacement between consecutive frames using the Lucas–Kanade method [61] and warps the most recent frame along the estimated flow field to produce a prediction. Phase correlation computes the cross-power spectrum of two successive frames in the Fourier domain to recover a global translation vector, which is then applied to the last context frame. Both methods are parameter-free and serve as classical baselines.

ConvLSTM [14] embeds convolutional operators within long short-term memory (LSTM) gate functions, enabling spatiotemporal feature learning with locally consistent dynamics. It is trained with the same BCEWithLogitsLoss objective and Adam optimizer (

η = 10^{- 3}

) as the proposed model, using a 2-layer architecture with 64 hidden channels. Predictive Recurrent Neural Network (PredRNN) [62] extends ConvLSTM via a zigzag spatiotemporal memory that propagates information across both time steps and network depth. U-Net Predictor [63] is an encoder–decoder network with skip connections that receives all four context frames concatenated along the channel dimension as input and directly regresses the next cloud mask. The models used for comparison are trained using the same training settings as the proposed architecture.

Single-Step Prediction Performance

The proposed pipeline delivers consistent improvements across all metrics reported in Table 1, compared to both classical motion estimation methods and deep learning models. The performance improvements observed in the proposed pipeline are based on the complementary roles of the proposed architectural components and have been examined in detail through an ablation study.

Figure 6 provides a frame-by-frame breakdown of IoU and pixel accuracy across the 122 test sequences, revealing temporal variability in prediction quality. Frames with lower IoU (typically <0.85) correspond to rapid cloud formation events or to sunset/sunrise transitions with complex lighting, while high-quality predictions (IoU > 0.95) occur during stable cloud configurations with persistent motion patterns. Figure 7 presents qualitative examples from three representative time periods, showing (left to right): original grayscale sky images, ground-truth cloud masks, and model predictions. Visual inspection confirms that the model accurately captures cloud spatial distribution, preserves boundary sharpness, and maintains structural coherence, with minor discrepancies occurring primarily in sunlit regions.

5.6. Ablation Studies

5.6.1. Role of the Segmentation Module

The 9.7% drop in IoU and 8 point decrease in SSIM observed in the Lightweight CNN variant demonstrates that prediction quality is fundamentally limited by the accuracy of the segmentation in the previous stage. The residual encoder of UCloudNet, which inspired this work, applies multi-resolution feature fusion and enables the network to learn semantically meaningful features at 1/2 and 1/4 output scales via auxiliary losses [37]. When this stage is skipped, blurred mask boundaries propagate as corrupted token assignments into the VQ-VAE codebook, and this cascading error manifests in every downstream metric. This behavior is consistent with the principle that perceptual noise upstream of the discrete tokenization stage increases the codebook entropy and degrades the quality of the token sequences fed to the predictor [20].

5.6.2. Role of the Transformer Predictor

Replacing the Transformer with a simpler predictor results in a 5.5 point drop in IoU and a 5.7 point decrease in SSIM. This demonstrates that local spatial operators alone are insufficient for capturing the long-range temporal dependencies specific to cloud advection across four input frames. Unlike recurrent models, which must propagate information sequentially through their hidden states, the Transformer’s full self-attention mechanism attends jointly over all temporal tokens, enabling the model to represent non-local atmospheric circulation patterns that extend beyond the effective receptive field of convolutional or recurrent kernels [21]. The discrete latent vocabulary produced by the VQ-VAE fits well with the Transformer sequence modeling framework because it converts continuous cloud masks into sequences drawn from a finite codebook. The prediction task thus becomes an autoregressive token classification problem, a setting in which Transformers show strong generalization in spatiotemporal prediction tasks [20,21].

5.6.3. Codebook Ablation

In the codebook ablation study, codebooks of varying sizes with 64 embedding dimensions and codebooks with 1024 embeddings of varying embedding dimensions were evaluated. For this study, models were retrained on 20% of the training data, and performance was evaluated using five consecutive frames per sequence. As shown in Table 2, which presents the results across different codebook configurations, the highest utilization rate was achieved by the codebook consisting of 128 embeddings, each with 64 embedding dimensions. No codebook collapse was observed for this configuration. The proposed codebook structure—consisting of 1024 embeddings, each with 64 embedding dimensions—ranked second in utilization rate and utilized only 30% of the total tokens; however, it yielded slightly better reconstruction performance (0.3% higher Reconstruction IoU) compared to the cb128 configuration, where cb128 denotes a codebook with 128 embeddings, each of 64 dimensions. While the smaller codebook proved to be more efficient in terms of compactness, the configuration achieving the highest overall performance was selected for this study.
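The utilization rate discussed here can be computed from the token indices emitted by the encoder, as in the minimal sketch below. The perplexity returned alongside it is a commonly reported companion statistic and is included as an assumption; the text itself reports only utilization. The function name and the toy token list are illustrative.

```python
import math
from collections import Counter

def codebook_usage(token_ids, codebook_size):
    """Return (utilization, perplexity) of the observed token indices.

    utilization: fraction of codebook entries that appear at least once.
    perplexity:  exp(entropy) of the empirical token distribution, i.e. the
                 effective number of codes in use.
    """
    counts = Counter(token_ids)
    total = len(token_ids)
    utilization = len(counts) / codebook_size
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return utilization, math.exp(entropy)

# Toy example: 4 of 8 codes are used, each equally often.
tokens = [0, 0, 1, 1, 2, 2, 3, 3]
utilization, perplexity = codebook_usage(tokens, codebook_size=8)
```

In this toy case half of the codebook is used and the perplexity is 4, matching the four equally frequent codes. A utilization near 30% for 1024 embeddings would correspond to roughly 300 active codes.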

5.6.4. Transformer Model Ablation

The most commonly encountered architecture in the literature is the 8-layer transformer, consistent with the original architecture [21]. A comparative study was conducted to evaluate the performance of 8-layer and 6-layer transformer models. As shown in Table 3, the 6-layer architecture achieved marginally better performance. Furthermore, the 6-layer transformer was adopted in this study to reduce overall computational complexity.

5.7. Comparison with Baselines

5.7.1. Classical Motion Estimation

Optical flow and phase correlation are the most commonly used signal processing methods for estimating cloud motion. Optical flow is based on the assumption of luminance constancy, meaning that pixel intensity remains constant between consecutive frames [64]. However, cloud regions exhibit changes in radiative intensity due to variations in solar angle and atmospheric scattering while simultaneously undergoing translational motion and morphological deformation, so this assumption breaks down in satellite and ground-based sky images [65,66].

5.7.2. Deep Learning Models

ConvLSTM captures spatiotemporal dependencies by embedding convolutional operators within LSTM gate functions, which makes it highly suitable for locally consistent dynamics [14]. PredRNN extends this paradigm via a zigzag spatiotemporal LSTM that spreads memory across both time steps and network depth, partially mitigating the accumulated error of deep recurrent stacks [62]. The advantage of PredRNN over ConvLSTM (

Δ IoU = 0.003

) is consistent with this architectural improvement, but both models remain constrained by convolutional receptive fields. Temporal dynamics are modeled through local spatial operations, and large-scale advection patterns must propagate gradually through recurrent hidden states rather than being handled globally.

The superiority of the proposed model over PredRNN, particularly the reduction in MAE from 0.017 to 0.011 and the 0.006 increase in SSIM, can be attributed to the Transformer’s ability to establish direct attention paths between any two spatial tokens across its entire temporal context window; a capability that recurrent architectures can only approximate imperfectly through state propagation [21]. The relatively weaker performance of U-Net Predictor [63] supports the idea that, without a dedicated temporal memory or attention mechanism, encoder–decoder architectures lose the sequential order and dynamic structure of temporal context by combining four input frames into a single hidden representation.

5.8. Multi-Step Forecasting and Error Propagation

To assess temporal stability, long-horizon forecasting using recursive rollout is evaluated. Long-term forecasts were generated by autoregressively predicting up to 120 future frames (10 min), where each newly predicted frame is appended to the input sequence for the subsequent prediction step. Due to the error accumulation inherent in autoregressive feedback loops, accuracy gradually decreases over the prediction horizon; however, the model provides meaningful spatial consistency as confirmed by IoU values above 0.65 for up to 10 min. Multi-step rollout performance across different horizons is summarized in Table 4. As the table shows, model performance declines as the prediction horizon increases, because the model continues its prediction without seeing any original images during the 10 min prediction period. In other words, to predict the 120th image, the model first generates 119 frames, and it achieves a 0.63 IoU and 0.86 pixel accuracy value at the end of the period. Finally, Figure 8 compares the actual image, ground-truth cloud masks and the corresponding predictions for the next 1, 5, and 10 min.
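The recursive rollout can be sketched as a short loop. The sketch assumes a sliding context of the four most recent frames and uses a stand-in predictor that extrapolates scalar values; `predict_next` represents the full encode, predict, and decode pipeline, and all names are hypothetical.

```python
def rollout(context, predict_next, steps, t_context=4):
    """Autoregressive rollout: every prediction is fed back as future context."""
    history = list(context)
    predictions = []
    for _ in range(steps):
        nxt = predict_next(history[-t_context:])  # only the latest frames are used
        predictions.append(nxt)
        history.append(nxt)                       # no ground truth enters the loop
    return predictions

def toy_predictor(frames):
    """Stand-in for the forecasting model: linear extrapolation of scalar 'frames'."""
    return frames[-1] + (frames[-1] - frames[-2])

preds = rollout([0, 1, 2, 3], toy_predictor, steps=120)
horizon_seconds = 120 * 5   # 120 steps at a 5 s sampling interval
```

After the fourth step the context consists entirely of predicted frames, which is why errors compound over the 600 s (10 min) horizon.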

5.9. Model Complexity Comparison

Table 5 summarizes the computational complexity of all evaluated models in terms of trainable parameters, inference latency, peak GPU memory consumption, and giga floating-point operations (GFLOPs), which quantify how many billions of arithmetic operations, specifically additions and multiplications, are needed to complete one forward pass through a neural network.

The proposed pipeline contains 29.4 M trainable parameters and requires 180.6 GFLOPs per inference step, consuming approximately 3.7 GB of GPU memory, with a measured latency of 91.5 ms on a single Nvidia A100 GPU. Notably, despite having far fewer parameters (1.5 million and 2.3 million, respectively), ConvLSTM and PredRNN exhibit significantly higher computational costs of 602.8 and 1200.7 GFLOPs, respectively. This is due to their recurrent spatial convolution operations, which process full

H \times W

feature maps repeatedly across multiple time steps, leading to high FLOP values despite the compact number of parameters.

The U-Net Predictor achieves the lowest latency (2.1 ms) and GFLOPs (27.5) among learned models, but its inferior forecasting performance indicates that raw spatial regression without discrete latent modeling is insufficient for accurate cloud motion forecasting.

Considering performance efficiency, the proposed model delivers the best performance with an IoU/GFLOPs ratio of

5 \times 10^{- 3}

, despite consuming several times more GPU memory than competing models.
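The reported ratio can be checked with a short calculation. The sketch uses the single-step IoU of 0.92 and the 180.6 GFLOPs per inference step stated for the proposed pipeline; the function name is an illustrative assumption.

```python
def iou_per_gflop(iou, gflops):
    """Forecast quality obtained per billion floating-point operations."""
    return iou / gflops

proposed = iou_per_gflop(0.92, 180.6)
```

The result is approximately 5.1 × 10⁻³, in line with the rounded value quoted above.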

5.10. Uncertainty Analysis

Modeling uncertainty in cloud evolution is a key feature of the proposed framework, enabling the model to express a distribution of possible outcomes rather than a single deterministic prediction. Following [67], we distinguish three sources of predictive uncertainty in cloud motion forecasting: epistemic uncertainty arising from model parameters, aleatoric uncertainty arising from input data noise, and chaotic uncertainty intrinsic to atmospheric dynamics [68].

5.10.1. Uncertainty Source Decomposition

The following estimation strategies are used to identify sources of uncertainty:

Epistemic uncertainty is estimated via Monte Carlo (MC) Dropout [69]. Dropout layers are kept active at inference time, and $K = 10$ stochastic forward passes are executed with deterministic (argmax) token decoding. The pixel-wise variance across these passes reflects uncertainty attributable to model parameters.
Aleatoric uncertainty is estimated by input perturbation. Independent Gaussian noise ( $σ = 0.05$ ) is added to each context frame across K trials, with deterministic decoding applied in each trial. The resulting pixel-wise variance reflects the model’s sensitivity to input observation noise.
Chaotic uncertainty is approximated as the residual component:

$U_{chaotic} (i, j) = max (U_{total} (i, j) - U_{epistemic} (i, j) - U_{aleatoric} (i, j), 0),$

(10)

where $U_{total}$ is estimated via temperature sampling. This residual captures the irreducible stochasticity of cloud motion that cannot be attributed to model parameters or input noise.
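The residual in Equation (10) can be sketched per pixel as follows. The three input maps are hypothetical variance values; in practice they would come from temperature sampling, MC Dropout, and input perturbation, respectively.

```python
def chaotic_residual(u_total, u_epistemic, u_aleatoric):
    """Equation (10): per-pixel residual variance, clipped at zero."""
    return [
        [max(t - e - a, 0.0) for t, e, a in zip(t_row, e_row, a_row)]
        for t_row, e_row, a_row in zip(u_total, u_epistemic, u_aleatoric)
    ]

# Hypothetical 1 x 2 variance maps.
u_total = [[0.10, 0.02]]
u_epistemic = [[0.02, 0.01]]
u_aleatoric = [[0.03, 0.02]]
u_chaotic = chaotic_residual(u_total, u_epistemic, u_aleatoric)
```

The first pixel retains a residual of about 0.05, while the second is clipped to zero because the epistemic and aleatoric estimates together exceed the total. The clipping keeps the residual a valid variance when the three independently estimated terms are not exactly additive.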

Figure 9 shows the average pixel variance contributed by each source in the test set. Chaotic uncertainty accounts for the largest share (54%), followed by the aleatoric (28%) and epistemic (17%) components. The dominance of chaotic uncertainty is consistent with the inherently unpredictable nature of atmospheric dynamics on short timescales [68], and the low epistemic contribution indicates that the model was well-trained with limited parameter uncertainty.

5.10.2. Uncertainty–Error Correlation Analysis

In this study, uncertainty is estimated by generating 10 future predictions for each input sequence and measuring how much these predictions differ from one another at each pixel. When these outcomes vary strongly at certain locations, the model reports higher uncertainty there, indicating less confidence in the prediction.

Let

{\hat{Y}}^{(k)} = [{\hat{Y}}_{i, j}^{(k)}] \in {[0, 1]}^{H \times W}

denote the k-th probabilistic cloud mask prediction, where

k = 1, \dots, K

and

(i, j) \in \{1, \dots, H\} \times \{1, \dots, W\}

denotes pixel coordinates. Let

Y = [Y_{i, j}] \in \{0, 1\}^{H \times W}

denote the corresponding binary ground-truth cloud mask.

Pixel-wise predictive uncertainty is computed as the variance across multiple sampled predictions [69]:

U (i, j) = Var ({\hat{Y}}_{i, j}^{(1)}, {\hat{Y}}_{i, j}^{(2)}, \dots, {\hat{Y}}_{i, j}^{(K)}),

(11)

where

(i, j)

denotes the pixel location. Then the prediction error is quantified using the IoU metric between the predicted mask and the ground truth. For correlation analysis, the error is defined as

E = 1 - IoU (\hat{Y}, Y) .

(12)
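Equations (11) and (12) can be sketched with plain Python lists. The sketch uses the population variance (division by K); whether the population or the sample variance was used in the experiments is not stated, so this choice is an assumption, as are the function names and toy values.

```python
def pixel_variance(samples):
    """Equation (11): per-pixel variance across K sampled H x W predictions."""
    k = len(samples)
    height, width = len(samples[0]), len(samples[0][0])
    var = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            vals = [s[i][j] for s in samples]
            mean = sum(vals) / k
            var[i][j] = sum((v - mean) ** 2 for v in vals) / k
    return var

def iou_error(pred, truth):
    """Equation (12): E = 1 - IoU for two binary masks."""
    pairs = [(p, g) for prow, grow in zip(pred, truth) for p, g in zip(prow, grow)]
    inter = sum(1 for p, g in pairs if p and g)
    union = sum(1 for p, g in pairs if p or g)
    return 1.0 - (inter / union if union else 1.0)

# Two hypothetical sampled predictions for a 1 x 2 image (K = 2).
samples = [[[0.2, 0.8]], [[0.4, 0.8]]]
uncertainty = pixel_variance(samples)
error = iou_error([[1, 1, 0, 0]], [[1, 0, 0, 0]])
```

The first pixel, where the samples disagree, receives a variance of 0.01, while the second, where they agree, receives zero. The binary example yields E = 0.5.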

The quality of the uncertainty estimate was evaluated at two different levels of detail. At the pixel level, the estimated uncertainty exhibits a Pearson correlation of

r = 0.299

and a Spearman rank correlation of

ρ = 0.365

with the pixel-based binary prediction error; this indicates a moderate yet consistent relationship between local uncertainty and local accuracy. At the sequence level, the mean uncertainty and the IoU error per sequence exhibit a significantly stronger Spearman rank correlation of

ρ = 0.873

(

p \approx 0.031

); this confirms that sequences with higher prediction uncertainty reliably align with those having lower prediction quality. The near-zero Pearson coefficient at this level (

r = 0.031

,

p = 0.736

) indicates that the relationship is monotonic but not linear; this is consistent with the bounded nature of the IoU metric.
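The distinction between a monotonic and a linear relationship can be illustrated with a minimal sketch of both coefficients. The uncertainty and error values below are hypothetical and chosen to saturate, as a bounded error measure would; they do not reproduce the coefficients reported above.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

mean_uncertainty = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06]   # hypothetical
iou_error = [0.10, 0.60, 0.62, 0.63, 0.635, 0.64]         # saturating, bounded
rho = spearman(mean_uncertainty, iou_error)
r = pearson(mean_uncertainty, iou_error)
```

Because the error rises with uncertainty at every step, the rank correlation is 1, while the saturation pulls the Pearson coefficient down to about 0.70. A stronger nonlinearity widens this gap further.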

Rather than producing overconfident predictions with high probability values but low accuracy, the proposed model exhibits a lower correlation between the predicted cloud probability and the actual cloud masks, indicating that prediction reliability appropriately decreases in uncertain regions. Instead of generating spurious high-frequency details under uncertain conditions, the model produces outputs with broader spatial distributions that are consistent with increased ambiguity in cloud evolution, as reflected in the variance across sampled predictions. This property is important for practical applications, as it enables the model both to predict cloud movement and to signal when its forecasts may be less reliable. Figure 10 illustrates the model’s performance over a 120-step (10 min) forecast horizon on a sequence. As accuracy declines due to error accumulation, uncertainty estimates increase correspondingly after step 50, indicating that the model correctly signals reduced confidence at longer forecast horizons.

6. Conclusions

This study presents a modular deep learning framework for short-term cloud motion forecasting using ground-based all-sky images. The system is based on a three-stage process: (1) a lightweight Convolutional Neural Network segments cloud regions and generates probabilistic masks, (2) a Vector Quantized Variational Autoencoder compresses these masks into discrete token sequences, and (3) a GPT-style autoregressive transformer predicts future token sequences based on temporal context. The framework is trained and evaluated on 103,304 images selected from the CASCAR dataset [55], with held-out test sequences used in both single-step and multi-step autoregressive prediction scenarios.

Quantitative evaluation shows that the model achieves an average intersection-over-union ratio of 0.92 and pixel accuracy of 0.96 for one-step (5 s ahead) predictions on the test set. Under an autoregressive rollout extending up to 120 steps (10 min), performance degrades due to error accumulation, with IoU dropping to 0.65 at the final prediction horizon and pixel accuracy falling to 0.80. Specifically, while 80% of single-step predictions maintain IoU above 0.90, 89% of frames in the entire 120-step rollout maintain pixel accuracy above 0.85. Uncertainty measurement via token-level prediction-based entropy shows a positive correlation with prediction error (1 - IoU) and indicates that the model signals reduced reliability at longer prediction horizons. These results demonstrate that the framework is suitable for operational nowcasting applications requiring reliable predictions for up to 10 min.

The fundamental limitation of the current approach is that the quality of predictions longer than 10 min decreases due to the accumulation of errors in the autoregressive feedback loop. While discrete token representation is computationally efficient, it produces small correction artifacts at cloud boundaries due to 8× spatial downsampling.

Additionally, the model operates in an entirely data-driven way without explicitly incorporating atmospheric physics (e.g., wind speed fields and mass conservation constraints), which may limit extrapolation to cloud motion patterns poorly represented in the training distribution.

To evaluate the effectiveness of the proposed method, ablation studies were conducted and the method was compared against traditional and deep learning-based prediction models commonly used in the literature, all tested on the same dataset [14,62,63]. The quantitative results demonstrate that the proposed pipeline exhibits significantly better prediction performance than all comparison models.

In future research, integrating deep learning models trained on all-sky camera images with satellite data and wind measurements within a multimodal framework has the potential to significantly improve prediction accuracy and model robustness, and may enable an extension of the forecasting horizon. Additionally, transformations of the model inputs based on the geographic coordinate system could give the model broader applicability without incurring additional computational overhead.

Author Contributions

Conceptualization, C.S., S.N. and A.R.; methodology, C.S., S.N. and A.R.; software, C.S. and S.N.; validation, C.S. and S.N.; formal analysis, C.S. and S.N.; investigation, C.S., S.N. and A.R.; resources, A.R.; data curation, C.S., S.N. and A.R.; writing—original draft preparation, C.S. and S.N.; writing—review and editing, C.S., S.N. and A.R.; visualization, C.S. and S.N.; supervision, A.R.; project administration, A.R.; funding acquisition, A.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the German Federal Ministry for Economic Affairs and Climate Action (grant number 03EI6073A).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this study is openly available in Zenodo at https://doi.org/10.5281/zenodo.18657514 (accessed on 16 February 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

PV	Photovoltaic
CNN	Convolutional Neural Network
ConvLSTM	Convolutional Long Short-Term Memory
VQ-VAE	Vector-Quantized Variational Autoencoder
RGB	Red-Green-Blue
IoU	Intersection over Union
PhyCell	Physics-informed Cell
SWINySEG	Singapore Whole sky Nychthemeron Image Segmentation
U-Net	U-Net
RNN	Recurrent neural network
PDE	Partial differential equations
INDI	Instrument Neutral Distributed Interface
GPT	Generative Pre-trained Transformer
CASCAR	Clausthal All Sky Camera Recordings
BCEWithLogitsLoss	Binary Cross-Entropy with Logits Loss
MAE	Mean Absolute Error
MSE	Mean Squared Error
SSIM	Structural Similarity Index Measure
PredRNN	Predictive Recurrent Neural Network
LSTM	Long short-term memory
GFLOPs	Giga Floating-Point Operations
MC	Monte Carlo

References

1. IEA PVPS Task 1. Snapshot of Global PV Markets 2025; Task 1: Strategic PV Analysis and Outreach; Report; International Energy Agency Photovoltaic Power Systems Programme (IEA PVPS): Paris, France, 2025.
2. Inman, R.H.; Pedro, H.T.; Coimbra, C.F. Solar forecasting methods for renewable energy integration. Prog. Energy Combust. Sci. 2013, 39, 535–576.
3. Hummon, M.; Ibanez, E.; Brinkman, G.; Lew, D. Sub-Hour Solar Data for Power System Modeling from Static Spatial Variability Analysis: Preprint; National Renewable Energy Laboratory (NREL): Golden, CO, USA, 2012; Volume 11.
4. Yang, D.; Wu, E.; Kleissl, J. Operational solar forecasting for the real-time market. Int. J. Forecast. 2019, 35, 1499–1519.
5. Chow, C.W.; Urquhart, B.; Lave, M.; Dominguez, A.; Kleissl, J.; Shields, J.; Washom, B. Intra-hour forecasting with a total sky imager at the UC San Diego solar energy testbed. Sol. Energy 2011, 85, 2881–2893.
6. Dev, S.; Manandhar, S.; Yuan, F.; Lee, Y.H.; Winkler, S. Cloud Radiative Effect Study Using Sky Camera. In Proceedings of the 2017 USNC-URSI Radio Science Meeting (Joint with AP-S Symposium); IEEE: New York, NY, USA, 2017.
7. Logothetis, S.A.; Salamalikis, V.; Wilbert, S.; Remund, J.; Zarzalejo, L.F.; Xie, Y.; Nouri, B.; Ntavelis, E.; Nou, J.; Hendrikx, N.; et al. Benchmarking of solar irradiance nowcast performance derived from all-sky imagers. Renew. Energy 2022, 199, 246–261.
8. Blum, N.; Matteschk, P.; Fabel, Y.; Nouri, B.; Román, R.; Zarzalejo, L.F.; Antuña-Sánchez, J.C.; Wilbert, S. Geometric calibration of all-sky cameras using sun and moon positions: A comprehensive analysis. Sol. Energy 2025, 295, 113476.
9. Nouri, B.; Kuhn, P.; Wilbert, S.; Hanrieder, N.; Prahl, C.; Zarzalejo, L.; Kazantzidis, A.; Blanc, P.; Pitz-Paal, R. Cloud height and tracking accuracy of three all sky imager systems for individual clouds. Sol. Energy 2019, 177, 213–228.
10. Perera, M.; Hoog, J.D.; Bandara, K.; Halgamuge, S. Distributed solar generation forecasting using attention-based deep neural networks for cloud movement prediction. arXiv 2024, arXiv:2411.10921.
11. Selvam, A.M. Nonlinear Dynamics and Chaos: Applications in Atmospheric Sciences. arXiv 2010, arXiv:1006.4554.
12. Su, X.; Li, T.; An, C.; Wang, G. Prediction of Short-Time Cloud Motion Using a Deep-Learning Model. Atmosphere 2020, 11, 1151.
13. Yu, M.; Huang, Q.; Li, Z. Deep Learning for Spatiotemporal Forecasting in Earth System Science: A Review. Int. J. Digit. Earth 2024, 17, 2391952.
14. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. In Proceedings of the Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2015; Volume 28.
15. Huang, Q.; Chen, S.; Tan, J. TSRC: A Deep Learning Model for Precipitation Short-Term Forecasting over China Using Radar Echo Data. Remote Sens. 2023, 15, 142.
16. Mathieu, M.; Couprie, C.; LeCun, Y. Deep Multi-Scale Video Prediction Beyond Mean Square Error. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, PR, USA, 2–4 May 2016.
17. Bengio, S.; Vinyals, O.; Jaitly, N.; Shazeer, N. Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 7–12 December 2015; MIT Press: Cambridge, MA, USA, 2015.
18. Babaeizadeh, M.; Finn, C.; Erhan, D.; Campbell, R.; Levine, S. Stochastic Variational Video Prediction. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018.
19. Ravuri, S.; Lenc, K.; Willson, M.; Kangin, D.; Lam, R.; Mirowski, P.; Fitzsimons, M.; Athanassiadou, M.; Kashem, S.; Madge, S.; et al. Skillful Precipitation Nowcasting using Deep Generative Models of Radar. arXiv 2021, arXiv:2104.00954.
20. van den Oord, A.; Vinyals, O.; Kavukcuoglu, K. Neural Discrete Representation Learning. In Proceedings of the Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
21. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; Volume 30.
22. Wu, Y.; Gao, R.; Park, J.; Chen, Q. Future Video Synthesis with Object Motion Prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2020; pp. 5539–5548.
23. Rakhimov, R.; Volkhonskiy, D.; Artemov, A.; Zorin, D.; Burnaev, E. Latent Video Transformer. arXiv 2020, arXiv:2006.10704.
24. Sakib, N.; Hosen, M.A.; Khan, B.; Gunn, B.; Johnstone, M. Prediction Interval in Renewable Energy Forecasting: A Comprehensive Review of Uncertainty Quantification Methods. IEEE Access 2024, 13, 185466–185492.
25. Lakshminarayanan, B.; Pritzel, A.; Blundell, C. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; Volume 30.
26. Penteliuc, M.; Frincu, M. Prediction of Cloud Movement from Satellite Images Using Neural Networks. In Proceedings of the 2019 21st International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC); IEEE: New York, NY, USA, 2019; pp. 222–229.
27. Magnone, L.; Sossan, F.; Scolari, E.; Paolone, M. Cloud Motion Identification Algorithms Based on All-Sky Images to Support Solar Irradiance Forecast. In Proceedings of the 2017 IEEE 44th Photovoltaic Specialist Conference (PVSC); IEEE: New York, NY, USA, 2017; pp. 1415–1420.
28. Yang, H.; Kurtz, B.; Nguyen, D.; Urquhart, B.; Chow, C.W.; Ghonima, M.; Kleissl, J. Solar irradiance forecasting using a ground-based sky imager developed at UC San Diego. Sol. Energy 2014, 103, 502–524.
29. Du, J.; Min, Q.; Zhang, P.; Guo, J.; Yang, J.; Yin, B. Short-Term Solar Irradiance Forecasts Using Sky Images and Radiative Transfer Model. Energies 2018, 11, 1107.
30. Wei, L.; Zhu, T.; Guo, Y.; Ni, C.; Zheng, Q. CloudpredNet: An Ultra-Short-Term Movement Prediction Model for Ground-Based Cloud Image. IEEE Access 2023, 11, 97177–97188.
31. Rivonirina, J.M.; Portafaix, T.; Rakotoniaina, S.; Morel, B.; Tang, C.; Lamy, K.; Lothon, M.; Toulouse, T.; Liandrat, O.; Rakotondraompiana, S.; et al. Cloudiness Retrieved from All-Sky Camera and MSG Satellite over Reunion Island and Antananarivo Madagascar. Ann. Geophys. 2025, 43, 651–666.
32. Theis, N.; Behrens, G.; Boschert, A.; Zehner, M. Cloud Segmentation and Matching Using Deep Learning in All-Sky Images. In Proceedings of the PV-Symposium 2024, Bad Staffelstein, Germany, 27–29 February 2024; Volume 1.
33. Magiera, D.; Fabel, Y.; Nouri, B.; Blum, N.; Schnaus, D.; Zarzalejo, L.F. Advancing semantic cloud segmentation in all-sky images: A semi-supervised learning approach with ceilometer-driven weak labels. Sol. Energy 2025, 300, 113822.
34. Buntin, S.; Copperwheat, C.M.; Jermak, H.E. Nighttime cloud detection, tracking and prediction with All-Sky cameras. RAS Tech. Instruments 2025, 4, rzaf034.
35. Tsourounis, D.; Tzoumanikas, P.; Kotronis, A.; Panagopoulos, O.; Kazantzidis, A.; Economou, G.; Theocharatos, C. Ground-Based Cloud Observation Using Wide-View Optical and Thermal Representations. In Proceedings of the ETCEI Conference, Thessaloniki, Greece, 19–20 October 2023.
36. Dev, S.; Nautiyal, A.; Lee, Y.H.; Winkler, S. CloudSegNet: A Deep Network for Nychthemeron Cloud Image Segmentation. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1814–1818.
37. Li, Y.; Wang, H.; Wang, S.; Lee, Y.H.; Pathan, M.S.; Dev, S. UCloudNet: A Residual U-Net with Deep Supervision for Cloud Image Segmentation. In Proceedings of the 2024 IEEE International Geoscience and Remote Sensing Symposium (IGARSS); IEEE: New York, NY, USA, 2024; pp. 5553–5557.
38. Shakya, S.; Kumar, S. Characterising and predicting the movement of clouds using fractional-order optical flow. IET Image Process. 2019, 13, 1375–1381.
39. An, S.; Oh, T.J.; Sohn, E.; Kim, D. Deep learning for precipitation nowcasting: A survey from the perspective of time series forecasting. Expert Syst. Appl. 2025, 268, 126301.
40. Naz, F.; She, L.; Sinan, M.; Shao, J. Enhancing Radar Echo Extrapolation by ConvLSTM2D for Precipitation Nowcasting. Sensors 2024, 24, 459.
41. Son, Y.; Zhang, X.; Yoon, Y.; Cho, J.; Choi, S. LSTM–GAN based cloud movement prediction in satellite images for PV forecast. J. Ambient Intell. Humaniz. Comput. 2023, 14, 12373–12386.
42. Wen, Q.; Zhou, T.; Zhang, C.; Chen, W.; Ma, Z.; Yan, J.; Sun, L. Transformers in Time Series: A Survey. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, Macao, China, 19–25 August 2023.
43. Selva, J.; Johansen, A.S.; Escalera, S.; Nasrollahi, K.; Moeslund, T.B.; Clapés, A. Video Transformers: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12922–12943.
44. Franch, G.; Tomasi, E.; Wanjari, R.; Poli, V.; Cardinali, C.; Alberoni, P.P.; Cristoforetti, M. GPTCast: A Weather Language Model for Precipitation Nowcasting. arXiv 2024, arXiv:2407.02089.
45. Esser, P.; Rombach, R.; Ommer, B. Taming Transformers for High-Resolution Image Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: New York, NY, USA, 2021; pp. 12873–12883.
46. Razavi, A.; van den Oord, A.; Vinyals, O. Generating Diverse High-Fidelity Images with VQ-VAE-2. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates Inc.: Red Hook, NY, USA, 2019; Volume 32.
47. Yan, W.; Zhang, Y.; Abbeel, P.; Srinivas, A. VideoGPT: Video Generation using VQ-VAE and Transformers. arXiv 2021, arXiv:2104.10157.
48. Walker, J.; Razavi, A.; van den Oord, A. Predicting Video with VQ-VAE. arXiv 2021, arXiv:2103.01950.
49. Raissi, M.; Perdikaris, P.; Karniadakis, G. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 2019, 378, 686–707.
50. Le Guen, V.; Thome, N. Disentangling Physical Dynamics from Unknown Factors for Unsupervised Video Prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2020; pp. 11474–11484.
51. Xi, X.; Zhuang, Q.; Liu, X. A Hybrid Physics-Guided Deep Learning Modeling Framework for Predicting Surface Soil Moisture. J. Geophys. Res. Mach. Learn. Comput. 2025, 2, e2025JH000682.
52. Lu, L.; Jin, P.; Pang, G.; Zhang, Z.; Karniadakis, G.E. Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nat. Mach. Intell. 2021, 3, 218–229.
53. Nielsen, A.H.; Iosifidis, A.; Karstoft, H. CloudCast: A Satellite-Based Dataset and Baseline for Forecasting Clouds. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 3485–3494.
54. Nie, Y.; Paletta, Q.; Scott, A.; Pomares, L.M.; Arbod, G.; Sgouridis, S.; Lasenby, J.; Brandt, A. Sky image-based solar forecasting using deep learning with heterogeneous multi-location data: Dataset fusion versus transfer learning. Appl. Energy 2024, 369, 123467.
55. Reinhardt, A.; Seyidbayli, C. CASCAR: Clausthal All Sky Camera Recordings. 2026. Available online: https://zenodo.org/records/18657514 (accessed on 16 February 2026).
56. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9.
57. Shrivastava, G.; Shrivastava, A. Video prediction by modeling videos as continuous multi-dimensional processes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2024; pp. 7236–7245.
58. Starlight Xpress Ltd., Binfield, Berkshire, UK. OCULUS ALL-SKY CAMERA 150°. Available online: https://www.sxccd.com/product/oculus-all-sky-camera-150/ (accessed on 12 February 2026).
59. Starlight Xpress Ltd., Binfield, Berkshire, UK. 180° f/2 Fish-Eye Lens for the Oculus All-Sky Camera. Available online: https://www.sxccd.com/product/oculus-180-lens (accessed on 12 February 2026).
60. de Sá Campos, M.H.; Tiba, C. Global Horizontal Irradiance Modeling for All Sky Conditions Using an Image-Pixel Approach. Energies 2020, 13, 6719.
61. Lucas, B.D.; Kanade, T. An iterative image registration technique with an application to stereo vision. In Proceedings of the IJCAI’81: 7th International Joint Conference on Artificial Intelligence; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1981; Volume 2, pp. 674–679.
62. Wang, Y.; Wu, H.; Zhang, J.; Gao, Z.; Wang, J.; Yu, P.S.; Long, M. PredRNN: A Recurrent Neural Network for Spatiotemporal Predictive Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 2208–2225.
63. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2015; pp. 234–241.
64. Alfarano, A.; Maiano, L.; Papa, L.; Amerini, I. Estimating optical flow: A comprehensive review of the state of the art. Comput. Vis. Image Underst. 2024, 249, 104160.
65. Marchesoni-Acland, F.; Herrera, A.; Mozo, F.; Camiruaga, I.; Castro, A.; Alonso-Suárez, R. Deep learning methods for intra-day cloudiness prediction using geostationary satellite images in a solar forecasting framework. Sol. Energy 2023, 262, 111820.
66. Raut, B.A.; Muradyan, P.; Sankaran, R.; Jackson, R.C.; Park, S.; Shahkarami, S.A.; Dematties, D.; Kim, Y.; Swantek, J.; Conrad, N.; et al. Optimizing cloud motion estimation on the edge with phase correlation and optical flow. Atmos. Meas. Tech. 2023, 16, 1195–1209.
67. Kendall, A.; Gal, Y. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; Volume 30.
68. Buizza, R. Chaos and Weather Prediction; Meteorological Training Course Lecture Series, ECMWF Training Lecture; European Centre for Medium-Range Weather Forecasts (ECMWF): Reading, UK, 2002.
69. Gal, Y.; Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In Proceedings of the 33rd International Conference on Machine Learning; PMLR: Norfolk, MA, USA, 2016; Volume 48, pp. 1050–1059.

Figure 1. The general architecture of the pipeline consisting of a lightweight CNN, VQ-VAE, and an autoregressive Transformer.

Figure 2. Lightweight CNN architecture, inspired by UCloudNet, which converts the input image into a binary mask of the same size.

Figure 3. VQ-VAE model architecture.

Figure 4. Original image and corresponding binary mask from the SWINySEG dataset [36].

Figure 5. Training convergence of the VQ-VAE tokenizer. (a) Reconstruction loss demonstrates improved fidelity of cloud mask reconstruction over training epochs. (b) Vector quantization loss indicates stabilization of the learned discrete codebook.

Figure 6. Single-step prediction performance of 122 different sequences.

Figure 7. Single-step prediction samples. (a) 11 AM: actual image (left), ground-truth mask (middle), and single-step prediction (right) with IoU = 0.8996 and pixel accuracy = 0.9561. (b) 10 AM: actual image (left), ground-truth mask (middle), and single-step prediction (right) with IoU = 0.924842 and pixel accuracy = 0.970657. (c) 4 PM: actual image (left), ground-truth mask (middle), and single-step prediction (right) with IoU = 0.907641 and pixel accuracy = 0.961639.

Figure 8. Qualitative examples of autoregressive cloud mask predictions at three forecast horizons. Each row shows (left to right): original sky image, ground-truth mask, and predicted mask.

Figure 9. Decomposition of predictive uncertainty into epistemic, aleatoric, and chaotic components, measured as mean pixel variance.

Figure 10. Intersection over union, pixel accuracy, uncertainty, and error over a 120-step autoregressive cloud mask forecasting horizon.

Table 1. Quantitative comparison of the proposed model against ablated variants and state-of-the-art baselines.

Model	IoU	F1-Score	Pix. Acc.	MAE	MSE	SSIM
Full Model	0.920	0.935	0.948	0.011	0.011	0.911
w/o CNN	0.837	0.888	0.928	0.022	0.002	0.835
w/o Transformer	0.865	0.904	0.925	0.035	0.035	0.854
Optical Flow	0.880	0.913	0.932	0.028	0.025	0.867
Phase Correlation	0.861	0.902	0.923	0.036	0.036	0.852
ConvLSTM [14]	0.908	0.930	0.943	0.017	0.015	0.904
PredRNN [62]	0.911	0.930	0.943	0.017	0.015	0.905
U-Net Predictor [63]	0.903	0.925	0.941	0.019	0.017	0.897

Table 2. Ablation study of codebook size. In variant names, “cb” denotes codebooks with a varying number of embeddings and an embedding dimension of 64, while “emb” denotes codebooks with 1024 embeddings and a varying embedding dimension.

Variant	Number of Embeddings	Embedding Dimension	Codebook Util. Rate (%)	Number of Active Codebook Entries	Recon. IoU	Recon. MAE	Pred. IoU	Pred. F1-Score
cb128	128	64	100	128	0.968	0.002	0.936	0.952
cb256	256	64	9.38	24	0.961	0.005	0.930	0.942
cb512	512	64	5.08	26	0.963	0.004	0.932	0.950
Proposed Architecture	1024	64	30.86	316	0.971	0.001	0.938	0.953
emb32	1024	32	4.39	45	0.963	0.004	0.934	0.951
emb128	1024	128	2.15	22	0.955	0.008	0.925	0.946
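
The codebook utilization rate in Table 2 is consistent with the number of active codebook entries divided by the number of embeddings. The following minimal Python sketch illustrates this relationship; the function name `codebook_utilization` and the toy token list are hypothetical, and the exact criterion used in this study to count an entry as active is not stated here, so "appears at least once in the token stream" is an assumption.

```python
def codebook_utilization(token_indices, num_embeddings):
    """Return (active entries, utilization in %) for a stream of token indices.

    An entry is counted as active if its index occurs at least once
    (assumed criterion for this sketch).
    """
    active = {int(i) for i in token_indices if 0 <= int(i) < num_embeddings}
    n_active = len(active)
    return n_active, 100.0 * n_active / num_embeddings


# Toy token stream over a codebook with 8 entries: indices 0, 1, 3, 7 are used.
tokens = [3, 3, 7, 1, 7, 3, 0, 1]
print(codebook_utilization(tokens, 8))  # (4, 50.0)

# Consistency check against Table 2:
# variant -> (number of embeddings, active entries, reported utilization in %).
TABLE_2 = {
    "cb128": (128, 128, 100.0),
    "cb256": (256, 24, 9.38),
    "cb512": (512, 26, 5.08),
    "proposed": (1024, 316, 30.86),
    "emb32": (1024, 45, 4.39),
    "emb128": (1024, 22, 2.15),
}
for name, (num_embeddings, n_active, reported) in TABLE_2.items():
    rate = 100.0 * n_active / num_embeddings
    assert abs(rate - reported) < 0.01, name
```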

Table 3. Performance comparison of 6- and 8-layer transformer models.

Layer Count	IoU	F1-Score	Pix. Acc.	MAE	MSE	SSIM
6-layer	0.920	0.935	0.948	0.011	0.011	0.911
8-layer	0.919	0.935	0.947	0.011	0.011	0.910

Table 4. Mean IoU and accuracy values for successive horizon slots of the model’s 10 min prediction process.

Horizon Slot	Mean IoU	Mean Accuracy
(0, 3) min	0.6920	0.8915
(3, 6) min	0.6410	0.8680
(6, 10) min	0.6327	0.8615

Table 5. Model complexity comparison table.

Model	Params (M)	GFLOPs	Latency (ms)	GPU Mem (MB)
Proposed Pipeline	29.43	180.6	91.5	3699
ConvLSTM [14]	1.50	602.8	10.7	622
PredRNN [62]	2.31	1200.7	27.0	756
U-Net Predictor [63]	7.76	27.5	2.1	186

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Seyidbayli, C.; Nezakat, S.; Reinhardt, A. Probabilistic Short-Term Sky Image Forecasting Using VQ-VAE and Transformer Models on Sky Camera Data. J. Imaging 2026, 12, 165. https://doi.org/10.3390/jimaging12040165

