1. Introduction
Clouds are visible aggregates suspended in the atmosphere, composed of microscopic water droplets formed by condensation of atmospheric water vapor and ice crystals formed through deposition. The geostationary meteorological satellite (hereafter referred to as “satellite”) captures cloud layer structures from high altitude, serving as a critical information source for studying atmospheric morphology and evolutionary mechanisms over Earth’s surface [
1]. Particularly in monitoring and forecasting operations for severe convective weather events like tropical cyclones, satellite cloud imagery demonstrates irreplaceable value [
2,
3]. As meteorological modernization accelerates, weather forecasting, climate prediction, and meteorological disaster early-warning systems face increasingly stringent precision requirements, necessitating more detailed and accurate analysis of satellite cloud imagery. Consequently, high-resolution meteorological satellite cloud data has become an essential foundational resource for atmospheric science research and operational meteorology.
While physical upgrades to satellite imaging systems can improve image resolution, this approach typically faces two major limitations: (1) lack of flexibility and high cost, given the dynamically changing image-acquisition requirements of practical applications; and (2) the capability only to capture new high-resolution (HR) images, rather than to enhance existing low-resolution (LR) images. Compared to such hardware-based “hard” solutions, signal-processing-oriented “soft” super-resolution techniques offer greater flexibility and cost-effectiveness: as a software-driven methodology, image super-resolution (SR) enhances resolution without equipment upgrades. This technique has been widely adopted in remote sensing, medical imaging, entertainment, and video surveillance applications, attracting sustained academic attention globally. Therefore, when meteorological satellites cannot yet provide imagery with sufficient spatiotemporal resolution, developing refined cloud image interpretation techniques in software, using existing satellite systems and acquired observational data, holds significant practical value for improved monitoring and forecasting of tropical cyclones and other severe convective weather phenomena.
Traditional satellite cloud image super-resolution methods primarily focus on enhancing image details through general digital image processing techniques. For instance, Wu et al. [
4] proposed a cartoon-texture decomposition approach based on tensor diffusion for satellite image preprocessing, effectively reducing noise while sharpening cloud edge transitions. Yin et al. [
5] introduced an adaptive nonlinear image enhancement fusion method based on grayscale mean values, leveraging domain knowledge that visible-light cloud imagery exhibits monotonically increasing grayscale over time, while infrared imagery shows monotonic decreases, thereby fusing visible and infrared images to reveal additional cloud details. Cai et al. [
6] developed an image inpainting method combining Hough transform with line-loss distribution characteristics in satellite imagery. Kim et al. [
7] improved high-resolution cloud image restoration quality by incorporating estimated interpolation error into interpolated images. Demirel et al. [
8] and Ahire et al. [
9] achieved significant improvements in reconstructed satellite image quality through discrete wavelet transform-based super-resolution approaches.
Compared to traditional satellite cloud image super-resolution algorithms, deep-learning-based super-resolution (SR) techniques offer a novel research paradigm for enhancing cloud image quality. Convolutional neural networks (CNNs) such as SRCNN [
10] and EDSR [
11], which learn end-to-end mappings from low-resolution (LR) to high-resolution (HR) images, have achieved remarkable progress in natural image domains. In specialized satellite cloud image SR tasks, Jin et al. [
12] constructed an anti-aliasing directional multi-scale transform integrated with stochastic projection techniques to introduce smoothed projection algorithms into block-based compressed sensing reconstruction. He et al. [
13] incorporated Tetrolet transform—a representation capable of capturing directional texture and edge information—into compressed sensing’s sparse representation stage, computing differences between reference cloud images and adjacent temporal images to reconstruct high-quality outputs within the compressed sensing framework. Shi et al. [
14] proposed a coupled dictionary learning algorithm that modifies dictionary pair update strategies and employs optimal orthogonal matching pursuit to generate HR cloud images meeting reconstruction constraints. Zhou et al. [
15] introduced a sparse representation-based infrared cloud image super-resolution method that structures image patches into groups as sparsity units, exploiting structural similarity information in infrared cloud data to enhance resolution. Zhang et al. [
16] presented deep-learning-based SR restoration approaches, with Su et al. [
17] demonstrating superior performance over interpolation and sparse methods through CNN-based satellite cloud image SR research. Jing et al. [
18] successfully applied adversarial-learning-based super-resolution algorithms to satellite cloud image SR tasks. Cornebise et al. [
19] proposed a multi-path network model called SRCloudNet, which combines a back-projection network and a local residual network for joint feature extraction to achieve more accurate super-resolution reconstruction. To promote the widespread application of machine learning in satellite image super-resolution research, Zhang et al. [
20] have created a professional dataset.
We fundamentally recognize that satellite cloud imagery represents quantitative maps of physical variables like brightness temperature, making its super-resolution an inverse physical problem rather than a conventional visual enhancement task. However, existing deep-learning-based super-resolution methods face particular challenges when applied to this domain. First, cloud systems exhibit complex multi-scale textures where different cloud types show substantial variations in both morphological characteristics and frequency-domain distributions. Second, the non-rigid deformations and fractal features along cloud edges render traditional spatial convolution operations inadequate for capturing crucial high-frequency details. Most critically, meteorological applications demand strict physical consistency, requiring preserved geometric integrity and radiometric interpretability—constraints that current methods fail to sufficiently address.
Given the rich details and complex textures in satellite cloud images, combined with the dynamic evolution and deformable nature of cloud systems across spatiotemporal dimensions, this study proposes a Multi-scale Residual Deformable Attention Model (MRDAM) based on deep learning for satellite cloud image super-resolution. The model integrates visual attention mechanisms to address the unique characteristics of satellite cloud data within a Generative Adversarial Network (GAN) framework. The generator architecture comprises two key components: the Multi-scale Feature Progressive Fusion Module (MFPFM), which enhances texture detail preservation and spectral consistency in reconstructed images, and the Deformable Attention Additive Fusion Module (DAAFM), which captures irregularly shaped cloud features through adaptive attention mechanisms. Our main contributions are as follows:
(1) Multi-scale Feature Progressive Fusion Module (MFPFM): This module enables multi-scale feature perception, allowing simultaneous focus on low-frequency structural information and high-frequency detail restoration. By implementing progressive feature fusion across network depths, it achieves superior global–local feature extraction capabilities to capture both large-scale cloud systems and small-scale convective cells in satellite cloud imagery.
(2) Deformable Attention Additive Fusion Module (DAAFM): The deformable attention mechanism improves sensitivity to meteorologically critical details (e.g., cloud boundaries and textures) and irregular cloud pattern features. Through cross-layer additive attention strategies, the attention matrices from smaller-scale layers are incorporated as prior knowledge into current-scale computations. Additionally, deformable convolution operations are employed to better characterize cloud system deformations in satellite cloud images.
(3) Texture-Aware Loss Function: A texture-preserving loss component is integrated into the overall network loss formulation to compensate for insufficient high-frequency detail supervision in conventional loss functions. This enhancement improves the reconstruction of fine-scale structural details in super-resolved satellite cloud images while maintaining meteorological interpretability.
2. Method
Optical images represent visual appearances, whereas satellite cloud images are quantitative maps of physical variables such as brightness temperature. Super-resolution of a cloud image is therefore an inverse physical problem, aiming to recover high-resolution distributions of these variables. Accordingly, our model employs a multi-scale architecture with deformable attention to capture atmospheric dynamical states and incorporates a composite loss function that constrains the super-resolution results to maintain high visual quality while remaining physically plausible.
Inspired by the human visual selective attention mechanism, the computer vision field has modeled attention as a prior cognitive tool. Among these, spatial-attention mechanisms, which are sensitive to spatial structures and capable of effectively localizing key regions, have demonstrated significant advantages in complex texture reconstruction tasks. However, traditional spatial-attention mechanisms (e.g., fixed-window or global attention) suffer from rigid sampling patterns that struggle to adapt to non-regular cloud structures such as cloud deformation and vortex motion. Additionally, their ability to jointly model multi-scale meteorological features in cloud imagery—such as localized convective cells and global spiral rainbands—is limited. To address these challenges, this study proposes a cross-layer additive deformable spatial-attention mechanism and constructs a multi-scale feature fusion network module to achieve super-resolution restoration of satellite cloud images.
2.1. Network Structure
To address the aforementioned challenges, this paper proposes a Multi-scale Residual Deformable Attention Model (MRDAM) for satellite cloud image super-resolution, as illustrated in
Figure 1. Designed to meet the dual requirements of multi-scale cloud system meteorological features and super-resolution tasks in satellite cloud imagery, MRDAM achieves high-frequency detail restoration and low-frequency structural fidelity while realizing progressive hierarchical feature extraction through cross-scale attention mechanisms. By leveraging the synergistic effects of its modular components, MRDAM effectively enhances the quality of satellite cloud image super-resolution.
Firstly, MRDAM adopts the Super-Resolution Generative Adversarial Network (SRGAN) [
21] as its backbone framework. Through dynamic adversarial interaction between two neural networks, the Generator (G) and the Discriminator (D), both networks iteratively improve their capabilities: the generator progressively enhances its ability to produce realistic samples, while the discriminator strengthens its capacity to distinguish real from synthetic data. Specifically, to enhance contextual information modeling, the generator incorporates a Multi-scale Feature Progressive Fusion Module (MFPFM). By perceiving multi-scale features, this module enables superior global–local feature extraction, allowing the model to better capture both large-scale cloud systems and small-scale convective cell characteristics in satellite cloud imagery. Additionally, based on visual attention principles, a Deformable Attention Additive Fusion Module (DAAFM) is constructed to better detect meteorologically critical high-frequency details (e.g., cloud boundaries and textures). Its cross-layer additive strategy integrates attention matrices from smaller-scale layers as prior knowledge, adding them to the attention matrices computed at the current scale to generate scale-specific attention matrices, thereby realizing a prior-knowledge-driven attention mechanism.
Meanwhile, the discriminator enforces adversarial feedback to compel the generator to learn real-image distributions. To differentiate between generated super-resolved (SR) images and authentic high-resolution (HR) images, the discriminator guides the generator toward producing more realistic details through adversarial training. The discriminator comprises eight convolutional blocks, each consisting of a “convolution layer + activation layer + batch normalization layer” sequence. Finally, a Sigmoid activation outputs a scalar value between 0 and 1, representing the probability that the input image is a real high-resolution image.
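For concreteness, the sketch below shows how such a discriminator could be assembled in PyTorch: eight “convolution + activation + batch normalization” blocks followed by a Sigmoid output. The channel widths, strides, and the pooled fully connected head are assumptions borrowed from the SRGAN discriminator rather than values specified here.

```python
# A hedged PyTorch sketch of the discriminator described above: eight
# "conv + activation + BN" blocks and a Sigmoid probability head.
# Channel widths, strides, and the pooled fully connected head are
# assumptions in the spirit of SRGAN, not values given in the text.
import torch
import torch.nn as nn

def disc_block(in_ch, out_ch, stride):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.LeakyReLU(0.2, inplace=True),
        nn.BatchNorm2d(out_ch),
    )

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        cfg = [(3, 64, 1), (64, 64, 2), (64, 128, 1), (128, 128, 2),
               (128, 256, 1), (256, 256, 2), (256, 512, 1), (512, 512, 2)]
        self.features = nn.Sequential(*[disc_block(i, o, s) for i, o, s in cfg])
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(512, 1024), nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(1024, 1), nn.Sigmoid(),  # probability that the input is a real HR image
        )

    def forward(self, x):
        return self.head(self.features(x))

# Example: score a batch of 96x96 crops.
# probs = Discriminator()(torch.randn(4, 3, 96, 96))  # shape (4, 1)
```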
2.2. Loss Function
To obtain a loss function better suited to satellite cloud image super-resolution, this study introduces a texture-aware loss and integrates it into the SRGAN loss. This composite loss function jointly constrains the super-resolution output to preserve both high-frequency texture details and atmospheric physical consistency in the reconstructed cloud image.
The Mean Squared Error (MSE) is the most fundamental and widely used loss function in image restoration tasks. It evaluates model performance by computing the average of squared differences between predicted values and ground-truth values. The mathematical formulation is as follows:

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2 \qquad (1)$$

where N denotes the number of samples, $y_i$ represents the actual value of the i-th sample, and $\hat{y}_i$ is the model’s predicted value for that sample. When used as an optimization objective, MSE minimizes the average of squared prediction errors, which assigns higher weights to samples with larger errors during training. This prioritizes the reduction of errors in such samples, driving overall loss reduction. While this penalty mechanism enhances sensitivity to large deviations and guides data fitting, it also introduces practical challenges, such as vulnerability to outliers and potential vanishing gradients, due to its overemphasis on large errors.
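As a point of reference, Equation (1) corresponds to the following minimal PyTorch computation (equivalent to torch.nn.functional.mse_loss with the default mean reduction):

```python
# A minimal sketch of the MSE objective in Equation (1): the mean of squared
# differences between predictions and ground truth.
import torch

def mse_loss(y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
    return ((y_pred - y_true) ** 2).mean()
```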
In the SRGAN framework, the generator G does not adopt traditional MSE as its loss function. The authors argue that while MSE constraints yield strong peak signal-to-noise ratio (PSNR) metrics, the resulting super-resolved images often lack rich high-frequency details, appearing overly smoothed. This limitation arises because MSE only quantifies pixel-wise differences between the generated high-resolution (HR) image and the ground truth, failing to account for structural or textural variations. Relying solely on low-level pixel discrepancies is insufficient for capturing perceptual realism. To address this, higher-level image features must be incorporated into the loss function to enhance sensitivity to detail variations and improve the super-resolution performance of deep learning models.
This SRGAN framework innovatively introduces a composite loss function, expressed as Equation (2):

$$\mathcal{L}_{\mathrm{SR}} = \mathcal{L}_{\mathrm{VGG}} + \lambda_{\mathrm{adv}}\,\mathcal{L}_{\mathrm{adv}} + \lambda_{\mathrm{TV}}\,\mathcal{L}_{\mathrm{TV}} \qquad (2)$$

where $\mathcal{L}_{\mathrm{VGG}}$ denotes the content loss, which employs the VGG network to extract features and computes the Euclidean distance between high-level semantic features of the super-resolved image and the ground-truth high-resolution image; $\mathcal{L}_{\mathrm{adv}}$ represents the adversarial loss for discriminating generated samples; and $\mathcal{L}_{\mathrm{TV}}$ is a regularization term based on total variation, which constrains the consistency and stability of the physical meaning represented by the grayscale values of satellite cloud images before and after super-resolution. $\lambda_{\mathrm{adv}}$ and $\lambda_{\mathrm{TV}}$ are the corresponding weighting coefficients.
Compared to the traditional MSE loss, the VGG feature loss reduces image blurring by matching deep feature distributions, preserving high-frequency details while retaining semantic content. It captures semantic information (e.g., edges, textures, and structures) and aligns generated images with human perceptual preferences rather than pixel-level similarity, avoiding excessive smoothing. However, the VGG feature loss is sensitive to the choice of feature layer: the impact of different layers (e.g., the shallow conv1_2 versus the deep conv5_4 in VGG) varies significantly. Additionally, VGG is pre-trained on natural images (e.g., ImageNet), limiting its ability to capture domain-specific features in satellite cloud imagery, which has stringent structural requirements. This results in suboptimal fine-structure reconstruction in super-resolved images.
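A hedged sketch of such a VGG content loss is given below; the specific VGG19 cut-off layer and the use of ImageNet weights are illustrative assumptions, and, as noted above, the choice of layer materially affects the result.

```python
# A hedged sketch of a VGG-based content loss: Euclidean distance between deep
# features of the super-resolved and ground-truth HR images. Cutting VGG19 at
# the 36th module (roughly conv5_4) is an assumption, not a prescribed choice.
import torch
import torch.nn as nn
from torchvision.models import vgg19, VGG19_Weights

class VGGContentLoss(nn.Module):
    def __init__(self, cut: int = 36):
        super().__init__()
        features = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features[:cut]
        for p in features.parameters():
            p.requires_grad_(False)          # frozen feature extractor
        self.features = features.eval()

    def forward(self, sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
        return torch.mean((self.features(sr) - self.features(hr)) ** 2)
```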
To address this problem, this study proposes a texture-aware loss function that supervises texture preservation by computing the absolute difference of variances between image patches in the generated super-resolved image and the original high-resolution image. This loss operates between the low-level MSE loss and the high-level feature loss $\mathcal{L}_{\mathrm{VGG}}$, serving as an intermediate strategy for texture-aware supervision. The mathematical formulation is defined in Equation (3):

$$\mathcal{L}_{\mathrm{tex}} = \left|\,\mathrm{Var}\!\left(I^{SR}\right) - \mathrm{Var}\!\left(I^{HR}\right)\right|, \qquad \mathrm{Var}(I) = \frac{1}{WH}\sum_{x=1}^{W}\sum_{y=1}^{H}\left(I(x,y) - \mu_{I}\right)^{2} \qquad (3)$$

in which $I^{HR}$ denotes a high-resolution image patch, $I^{LR}$ represents a low-resolution image patch, and $I^{SR}$ indicates the corresponding super-resolved image patch. Var(·) calculates the variance of an image patch, with $\mu_{I^{SR}}$ and $\mu_{I^{HR}}$ denoting the corresponding mean values of the image patches. W and H represent the width and height of an image patch in pixels, respectively, and x and y are the pixel indices within the patch.
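The following sketch illustrates one way Equation (3) could be computed in PyTorch, assuming single-channel cloud images split into non-overlapping patches; the patch size is an illustrative choice, not a value given in the text.

```python
# A hedged sketch of the texture-aware loss in Equation (3): mean absolute
# difference between per-patch variances of the super-resolved and HR images.
# Assumes single-channel (grayscale) cloud images; patch size is illustrative.
import torch
import torch.nn.functional as F

def texture_loss(sr: torch.Tensor, hr: torch.Tensor, patch: int = 8) -> torch.Tensor:
    # unfold into non-overlapping patches: (B, C*patch*patch, n_patches)
    sr_p = F.unfold(sr, kernel_size=patch, stride=patch)
    hr_p = F.unfold(hr, kernel_size=patch, stride=patch)
    # variance over the pixels (x, y) inside each patch, per Equation (3)
    var_sr = sr_p.var(dim=1, unbiased=False)
    var_hr = hr_p.var(dim=1, unbiased=False)
    return (var_sr - var_hr).abs().mean()
```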
In the generator network G of this work, the following loss function is employed to further overcome existing algorithmic bottlenecks and enhance the performance of the super-resolution model:

$$\mathcal{L}_{G} = \mathcal{L}_{\mathrm{VGG}} + \lambda_{\mathrm{adv}}\,\mathcal{L}_{\mathrm{adv}} + \lambda_{\mathrm{TV}}\,\mathcal{L}_{\mathrm{TV}} + \lambda_{\mathrm{tex}}\,\mathcal{L}_{\mathrm{tex}} \qquad (4)$$

in which $\mathcal{L}_{\mathrm{VGG}}$, $\mathcal{L}_{\mathrm{adv}}$, and $\mathcal{L}_{\mathrm{TV}}$ are described in detail in reference [21], $\mathcal{L}_{\mathrm{tex}}$ is the texture-aware loss of Equation (3), and $\lambda_{\mathrm{tex}}$ is its weighting coefficient.
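A possible assembly of Equation (4) is sketched below; the weighting coefficients are illustrative placeholders, not the values used by the authors, and vgg_loss and texture_loss stand for the content and texture terms defined earlier.

```python
# A hedged sketch of assembling the generator objective in Equation (4):
# the SRGAN terms (VGG content, adversarial, total variation) plus the
# texture-aware term. The weights below are illustrative placeholders.
import torch

def generator_loss(sr, hr, d_fake, vgg_loss, texture_loss,
                   w_adv=1e-3, w_tv=2e-8, w_tex=1e-2):
    l_content = vgg_loss(sr, hr)                                   # VGG feature distance
    l_adv = -torch.log(d_fake + 1e-8).mean()                       # adversarial term
    l_tv = (sr[..., :, 1:] - sr[..., :, :-1]).abs().mean() \
         + (sr[..., 1:, :] - sr[..., :-1, :]).abs().mean()         # total variation
    l_tex = texture_loss(sr, hr)                                   # Equation (3)
    return l_content + w_adv * l_adv + w_tv * l_tv + w_tex * l_tex
```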
In the discriminator, we use the Binary Cross-Entropy loss as the loss function, as in Equation (5):

$$\mathcal{L}_{D} = -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i \log \hat{y}_i + \left(1 - y_i\right)\log\left(1 - \hat{y}_i\right)\right] \qquad (5)$$

where $y_i$ is the ground truth of the i-th sample, and $\hat{y}_i$ is the predicted value.
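Equation (5) corresponds to the following minimal PyTorch formulation, with real HR images labeled 1 and generated SR images labeled 0:

```python
# A minimal sketch of the discriminator objective in Equation (5): binary
# cross-entropy on real HR images (label 1) and generated SR images (label 0).
import torch
import torch.nn as nn

bce = nn.BCELoss()  # the discriminator already ends in a Sigmoid

def discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    real_term = bce(d_real, torch.ones_like(d_real))
    fake_term = bce(d_fake, torch.zeros_like(d_fake))
    return real_term + fake_term
```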
3. Multi-Scale Feature Progressive Fusion Module
Multi-scale feature processing has proven highly effective in digital image processing tasks. Particularly for satellite cloud imagery, where cloud systems exhibit strong scale variations, relying solely on single-scale features fails to capture complete contextual information. Parallel feature extraction across multiple scales enables models to simultaneously focus on both local details and global structural information. To address this, we propose extracting features from satellite cloud imagery at different resolution scales and sharing a deformable attention mechanism across these features to facilitate context-aware fusion over extended spatial ranges.
The Multi-scale Feature Progressive Fusion Module (MFPFM) employs a parallel multi-branch network architecture. First, multi-scale features (denoted as F in
Figure 2) extracted by the Multi-scale Feature Extraction module are fed into residual blocks at different scales for further representation learning, yielding enhanced output features (denoted as F’ in
Figure 2). Subsequently, features from smaller scales are progressively upsampled and merged with those from larger scales.
This process is repeated at subsequent scales until feature extraction completes at the maximum resolution. Consequently, both scale-specific features and fused features integrating multi-scale contextual information are obtained. Finally, the feature maps from each scale are independently transmitted to subsequent network modules. This hierarchical fusion strategy strengthens the contextual relevance of features across spatial resolutions, thereby improving the model’s capacity to represent cloud structures at varying scales.
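The sketch below illustrates the progressive fusion idea, assuming addition as the fusion operator and a simple two-convolution residual block; both are illustrative choices rather than the exact MFPFM design.

```python
# A hedged sketch of the progressive fusion in the MFPFM: per-scale residual
# refinement (F -> F'), then upsampling the smaller-scale feature and fusing it
# into the next larger scale. Residual-block design and additive fusion are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.PReLU(),
            nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

def progressive_fuse(feats, blocks):
    """feats: multi-scale features ordered from smallest to largest resolution."""
    outs, carried = [], None
    for f, block in zip(feats, blocks):
        if carried is not None:
            # upsample the fused smaller-scale feature and add it in
            f = f + F.interpolate(carried, size=f.shape[-2:], mode="bilinear",
                                  align_corners=False)
        carried = block(f)           # refined, fused feature at this scale
        outs.append(carried)         # each scale is passed on independently
    return outs

# Example: three scales of a 64-channel feature pyramid.
# blocks = nn.ModuleList([ResBlock(64) for _ in range(3)])
# feats = [torch.randn(1, 64, 16, 16), torch.randn(1, 64, 32, 32), torch.randn(1, 64, 64, 64)]
# outs = progressive_fuse(feats, blocks)
```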
4. Deformable Attention Additive Fusion Module
Atmospheric motion is ubiquitous, and large-scale cloud targets observed over Earth’s surface do not conform to regular 2D planar structures. Image processing for such targets necessitates deformable operators. This section first reviews deformable convolution and deformable attention mechanisms, followed by a detailed description of the proposed Deformable Attention Additive Fusion Module (DAAFM).
4.1. Deformable Convolution
Figure 3 illustrates the schematic of standard convolution versus deformable convolution (DC) [
22]. Specifically, the left panel shows a traditional 3 × 3 convolution kernel (green points), while the right panel introduces offset vectors (light blue arrows) at each kernel position, enabling adaptive geometric transformations.
Deformable convolution introduces learnable offsets into the convolution operation; these offsets are predicted by convolutional layers with learnable parameters that generate displacements in the x and y directions. The feature map produced by standard convolution is generally defined as Equation (6):

$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n)\, x(p_0 + p_n) \qquad (6)$$

where w denotes the convolution kernel, x represents the input image, y indicates the output feature, and $y(p_0)$ corresponds to the output feature at position $p_0$ associated with the kernel center. $p_n$ enumerates the sampling positions in the image over the regular grid $\mathcal{R}$, in one-to-one correspondence with the positions in the convolution kernel. In contrast, the feature map from deformable convolution follows Equation (7), which incorporates an offset $\Delta p_n$ into the standard convolution operation:

$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n)\, x(p_0 + p_n + \Delta p_n) \qquad (7)$$
The offsets are obtained through convolutional layers with learnable parameters. The offset-predicting kernel shares the same spatial structure as a standard kernel but outputs two channels per sampling point, corresponding to the offsets in the x and y directions. Because this offset branch is itself a standard convolutional unit, the deformable convolution layer integrates directly into the main network architecture and can be trained end-to-end via gradient backpropagation.
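A minimal sketch of such a layer is shown below, using torchvision’s DeformConv2d with an ordinary convolution as the offset branch; the channel counts and the zero-initialization of the offsets are illustrative choices.

```python
# A minimal sketch of a deformable convolution layer as described above: a
# standard convolution predicts the (x, y) offsets, which are then consumed by
# torchvision's DeformConv2d. Channel counts are illustrative.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableConvBlock(nn.Module):
    def __init__(self, in_ch=64, out_ch=64, k=3):
        super().__init__()
        # offset branch: an ordinary conv whose 2*k*k channels are the
        # x/y offsets for each kernel sampling point (Equation (7))
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        nn.init.zeros_(self.offset_conv.weight)   # start from the regular sampling grid
        nn.init.zeros_(self.offset_conv.bias)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x):
        offsets = self.offset_conv(x)              # learned via backpropagation
        return self.deform_conv(x, offsets)

# y = DeformableConvBlock()(torch.randn(1, 64, 48, 48))  # (1, 64, 48, 48)
```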
As illustrated in
Figure 4, after training, the deformable convolution kernel dynamically adjusts its size and position according to the actual shape of the target in the image, regardless of the object’s original shape. This adaptive sampling mechanism allows the kernel positions to automatically vary with image content, accommodating geometric deformations such as object shape, scale, and orientation. While standard convolution employs regular rectangular templates for feature extraction in satellite cloud imagery, deformable convolution templates adaptively reshape according to target geometry. Consequently, deformable convolution effectively handles complex cloud patterns caused by movements, rotations, deformations, and scale variations in satellite cloud imagery.
However, how to appropriately integrate deformable convolution into the main architecture of deep neural networks remains an open question. It is known that feature maps in deeper network layers possess richer semantic information and larger receptive fields, enabling them to model more complex geometric transformations. Therefore, placing deformable convolutions at these deeper layers allows them to most effectively capture deformations and offsets present in cloud imagery. In contrast, feature maps from shallow layers primarily contain low-level information such as edges and colors, where the benefits of deformable convolutions are limited and may instead introduce unnecessary computational overhead. This paper investigates embedding strategies for deformable convolution in the SRGAN generator. We evaluated deformable convolution at three positions in the network: front (at the input layer, before the first residual block), middle (after the seventh residual block), and back (at the output layer, after the last residual block); Scheme 1 introduces multiple deformable convolution blocks, while Scheme 2 employs a single block. The experimental results, as shown in
Table 1, demonstrate that optimal performance is achieved when the deformable convolution (with a single block) is placed near the network backend. This improvement stems from the availability of high-level semantic features in later stages, making adaptive feature focusing more efficient [
23]. Consequently, taking all of the above into consideration, we adopt the strategy of embedding a single deformable convolution block in the latter part of the network.
4.2. Deformable Attention
Attention mechanisms have rapidly advanced in computer vision, fundamentally serving as a biomimetic reconstruction of human visual selective attention mechanisms at the algorithmic level. These mechanisms are typically formalized as spatial, channel, and temporal attention computational paradigms.
Conventional attention mechanisms often rely on fixed windows or global computations, struggling to adapt to complex features with irregular vortex structures in satellite cloud imagery. Deformable attention (DA) [
24] addresses the limitations of standard self-attention mechanisms, which suffer from prohibitively high computational complexity when processing high-resolution images or long sequential data. By dynamically adjusting spatial deformations in feature sampling regions, DA significantly enhances the model’s capability to capture intricate textures and non-rigid cloud system structures in satellite cloud imagery within super-resolution tasks.
The deformable attention mechanism generates a uniform grid of reference points based on the input feature map $x \in \mathbb{R}^{H \times W \times C}$. First, the grid size is determined by the input feature dimensions H and W and a downsampling factor r: $H_G = H/r$, $W_G = W/r$. The reference coordinates are linearly spaced 2D positions $\{(0, 0), \ldots, (H_G - 1, W_G - 1)\}$. Subsequently, these coordinates are normalized to the range [−1, +1] according to the feature map shape, where (−1, −1) denotes the top-left corner, and (+1, +1) represents the bottom-right corner.

To obtain offsets for each reference point, query tokens are generated via linear projection $q = xW_q$. These tokens are fed into a lightweight subnetwork $\theta_{\mathrm{offset}}(\cdot)$ to predict offsets. To stabilize training, offsets are typically scaled and clamped within predefined bounds using a normalization factor s, yielding $\Delta p = s \cdot \tanh\left(\theta_{\mathrm{offset}}(q)\right)$. Finally, features at the deformed positions are sampled as key–value pairs and projected through learnable matrices to produce the output variables defined in Equations (8) and (9):

$$q = xW_q, \quad \tilde{k} = \tilde{x}W_k, \quad \tilde{v} = \tilde{x}W_v \qquad (8)$$

$$\tilde{x} = \phi\left(x;\, p + \Delta p\right) \qquad (9)$$

The key and value in the deformable attention mechanism are denoted as $\tilde{k}$ and $\tilde{v}$, respectively. Specifically, we define the sampling function $\phi(\cdot\,;\cdot)$ as a bilinear interpolation operation to ensure its differentiability:

$$\phi\left(z;\,(p_x, p_y)\right) = \sum_{(r_x, r_y)} g\left(p_x, r_x\right) g\left(p_y, r_y\right)\, z\left[r_y, r_x, :\right], \quad g(a, b) = \max\left(0,\, 1 - |a - b|\right) \qquad (10)$$

Here, $r_x$ and $r_y$ denote all spatial positions on the feature map $z$. Since $g(a, b)$ is non-zero only at the four nearest integer coordinates around the offset position $p + \Delta p$, Equation (9) simplifies to a weighted average over these four discrete grid points. Following existing approaches, we apply multi-head attention to q, $\tilde{k}$, and $\tilde{v}$, incorporating a relative position bias R. The output of each attention head is formulated as Equation (11):

$$z^{(m)} = \sigma\!\left(\frac{q^{(m)}\tilde{k}^{(m)\top}}{\sqrt{d}} + \phi\left(\hat{B};\,R\right)\right)\tilde{v}^{(m)} \qquad (11)$$

Here, $\sigma(\cdot)$ denotes the softmax function, d is the dimension of each attention head, and $\phi\left(\hat{B};\,R\right)$ denotes the position encoding derived from ref. [24] with several modifications.
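To make the sampling pipeline concrete, the single-head sketch below follows Equations (8)–(11) in spirit: a reference grid is perturbed by a lightweight offset subnetwork, deformed features are gathered by bilinear interpolation (grid_sample), and standard scaled dot-product attention is applied. The relative position bias and the multi-head bookkeeping of ref. [24] are omitted, and the offset normalization is an illustrative choice.

```python
# A hedged, single-head sketch of deformable attention: reference grid,
# query-conditioned offsets, bilinear sampling of deformed keys/values,
# and scaled dot-product attention. Simplified relative to ref. [24].
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttention(nn.Module):
    def __init__(self, dim=64, r=4, offset_scale=2.0):
        super().__init__()
        self.r, self.s = r, offset_scale
        self.q_proj = nn.Conv2d(dim, dim, 1)
        self.k_proj = nn.Conv2d(dim, dim, 1)
        self.v_proj = nn.Conv2d(dim, dim, 1)
        self.out_proj = nn.Conv2d(dim, dim, 1)
        # lightweight offset subnetwork theta_offset(q) -> (dx, dy)
        self.offset_net = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim), nn.GELU(),
            nn.Conv2d(dim, 2, 1),
        )

    def forward(self, x):
        B, C, H, W = x.shape
        q = self.q_proj(x)
        # uniform reference grid at reduced resolution (H/r, W/r), in [-1, 1]
        Hg, Wg = H // self.r, W // self.r
        ys = torch.linspace(-1, 1, Hg, device=x.device)
        xs = torch.linspace(-1, 1, Wg, device=x.device)
        ref = torch.stack(torch.meshgrid(xs, ys, indexing="xy"), dim=-1)  # (Hg, Wg, 2)
        ref = ref.expand(B, -1, -1, -1)
        # offsets predicted from the (downsampled) queries, bounded by tanh
        q_small = F.adaptive_avg_pool2d(q, (Hg, Wg))
        offsets = torch.tanh(self.offset_net(q_small)) * self.s / max(Hg, Wg)
        pos = (ref + offsets.permute(0, 2, 3, 1)).clamp(-1, 1)
        # bilinear sampling of deformed keys/values (Equation (10))
        x_tilde = F.grid_sample(x, pos, mode="bilinear", align_corners=True)
        k = self.k_proj(x_tilde).flatten(2)                  # (B, C, Hg*Wg)
        v = self.v_proj(x_tilde).flatten(2)
        attn = torch.softmax(q.flatten(2).transpose(1, 2) @ k / C ** 0.5, dim=-1)
        out = (attn @ v.transpose(1, 2)).transpose(1, 2).reshape(B, C, H, W)
        return self.out_proj(out)

# y = DeformableAttention()(torch.randn(1, 64, 32, 32))
```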
Deformable attention improves computational efficiency and flexibility by introducing offset vectors that adapt to geometric variations of targets in the input feature maps. Unlike conventional attention mechanisms, where weights are computed over fixed sampling patterns at predefined positions, deformable attention dynamically adjusts the shape and scale of its attention regions to better accommodate varying task requirements and input characteristics. This adaptive mechanism is illustrated in
Figure 5.
4.3. Attention Stacked Fusion
We propose a low-level attention stacked fusion strategy that enables subsequent network layers to directly access low-level attention matrices from preceding layers. This strategy computes the current layer’s attention matrix by integrating knowledge from the previous layer’s attention matrix through cross-layer connections, thereby establishing structural dependencies across attention matrices.
The network architecture implementing this strategy is named the Deformable Attention Additive Fusion Module (DAAFM), which consists of three interconnected components: attention matrix generation, cross-layer connection, and feed-forward stacking. Specifically, the attention matrix generation module produces the current layer’s attention matrix. Subsequently, cross-layer connections introduce the preceding layer’s attention matrix into the current layer. Finally, feed-forward stacking computes the refined attention matrix for the current layer. The architectural design is illustrated in
Figure 6.
Specifically, we design a cross-layer architecture termed prior-attention-constrained attention mechanism. The core module comprises an attention generation layer and cross-layer pathways, enabling attention matrices to propagate through two distinct pathways: one processed by the deformable attention module and another preserving the previous attention matrix or applying linear projections. By reformulating the conventional attention stacking architecture into a cross-layer configuration, this design integrates multi-scale attention matrices, enhancing the model’s capability to focus on and extract complex satellite cloud features (e.g., edges and textures). The arithmetic operations in the cross-layer deformable attention module are formulated as Equation (
12):

$$A^{(l)} = \alpha \cdot A_{\mathrm{DA}}^{(l)} + \beta \cdot T\!\left(A^{(l-1)}\right) \qquad (12)$$

Here, $A_{\mathrm{DA}}^{(l)}$ is the attention matrix produced by the deformable attention module at the current layer, $A^{(l-1)}$ is the attention matrix propagated from the preceding (smaller-scale) layer, and $\alpha$ and $\beta$ are scalar weights whose optimal values are determined through experimentation over three weighting scenarios. As shown in Figure 7, under the selected setting of $\alpha$ and $\beta$ the attention heatmap focuses more comprehensively on the structure of complex cloud systems in the cloud image, and we therefore adopt this setting in this work. $T(\cdot)$ denotes an attention transformation function that takes an n×n matrix as input and outputs an n×n matrix. This function serves to transform the attention matrix from the previous layer into an attention prior usable in the current layer through learnable parameters.
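The sketch below illustrates Equation (12) under the assumption that T is realized as a learnable linear map over attention rows and that the fused matrix is re-normalized before being applied to the values; both choices, and the alpha/beta defaults, are illustrative rather than the authors’ exact configuration.

```python
# A hedged sketch of the cross-layer additive attention of Equation (12): the
# current layer's attention matrix is blended with a learnable transform T of
# the previous layer's matrix, then applied to the current values.
import torch
import torch.nn as nn

class CrossLayerAdditiveAttention(nn.Module):
    def __init__(self, n_tokens: int, alpha: float = 0.5, beta: float = 0.5):
        super().__init__()
        self.alpha, self.beta = alpha, beta
        self.T = nn.Linear(n_tokens, n_tokens)   # n x n -> n x n attention transform

    def forward(self, attn_current, attn_prev, values):
        # attn_*: (B, n, n) attention matrices; values: (B, n, C)
        fused = self.alpha * attn_current + self.beta * self.T(attn_prev)
        fused = torch.softmax(fused, dim=-1)      # re-normalize rows after fusion
        return fused @ values, fused              # output features and the matrix
                                                  # passed on as the next layer's prior

# Example with 64 tokens of dimension 32:
# m = CrossLayerAdditiveAttention(n_tokens=64)
# out, a = m(torch.randn(2, 64, 64), torch.randn(2, 64, 64), torch.randn(2, 64, 32))
```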