Article

Explainable Multi-Scale CAM Attention for Interpretable Cloud Segmentation in Astro-Meteorological Applications

School of Mathematics and Statistics, Nanjing University of Information Science and Technology, Nanjing 210044, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(15), 8555; https://doi.org/10.3390/app15158555
Submission received: 15 July 2025 / Revised: 29 July 2025 / Accepted: 30 July 2025 / Published: 1 August 2025
(This article belongs to the Special Issue Explainable Artificial Intelligence Technology and Its Applications)

Abstract

Accurate cloud segmentation is critical for astronomical observations and solar forecasting. However, traditional threshold- and texture-based methods suffer from limited accuracy (65–80%) under complex conditions such as thin cirrus or twilight transitions. Although U-Net-based deep-learning segmentation effectively captures both low-level and high-level features and has substantially improved accuracy, current methods still lack interpretability and multi-scale feature integration and often produce fuzzy boundaries or fragmented predictions. In this paper, we propose multi-scale CAM, an explainable AI (XAI) framework that integrates class activation mapping (CAM) with hierarchical feature fusion to quantify pixel-level attention across hierarchical features, thereby enhancing the model’s discriminative capability. To achieve precise segmentation, we integrate CAM into an improved U-Net architecture, incorporating multi-scale CAM attention for adaptive feature fusion and dilated residual modules for large-scale context extraction. Experimental results on the SWINSEG dataset demonstrate that our method outperforms existing state-of-the-art methods, improving recall by 3.06%, F1 score by 1.49%, and MIoU by 2.21% over the best baseline. The proposed framework balances accuracy, interpretability, and computational efficiency, offering a trustworthy solution for cloud detection systems in operational settings.

1. Introduction

Cloud cover significantly impacts ground-based astronomical observations by attenuating incoming starlight, distorting photometric measurements, and reducing observational efficiency [1]. Studies demonstrate that even thin cirrus clouds (optical depth < 0.3) can introduce photometric errors exceeding 0.1 magnitudes, substantially affecting precision measurements [2]. The presence of clouds also alters sky brightness distribution, complicating background subtraction in wide-field surveys [3]. Traditional ground-based cloud detection methodologies face three fundamental challenges. Threshold-based approaches relying on RGB ratios or intensity thresholds [4] exhibit strong diurnal performance variation. Their accuracy drops from 85% at noon to below 60% during twilight due to rapidly changing illumination conditions [5]. Texture-analysis methods using wavelet transforms or local binary patterns (LBPs) [6] show limited generalization across cloud types. While achieving 92% accuracy for cumulus clouds, their performance degrades to 68–72% for stratus and cirrostratus formations [7]. This limitation stems from their inability to capture the diffuse boundaries characteristic of high-altitude clouds [8]. Physical model-based techniques incorporating radiative transfer calculations [9] require precise knowledge of atmospheric parameters that are rarely available in operational scenarios [10]. Recent comparative studies [11] highlight that these conventional methods exhibit mean absolute errors of 18–25% in cloud fraction estimation under realistic observing conditions. Cloud thickness presents additional challenges for segmentation algorithms, as optically thin clouds (τ < 0.3) exhibit significantly different scattering properties compared to thick cumulus layers (τ > 10) [12]. Studies show that traditional methods underestimate thin cirrus coverage by 20–40% due to their semi-transparent nature [13] while frequently overestimating thick cloud boundaries by 5–15% from shadowing effects [14].
Recent advancements in deep learning have significantly enhanced the field of ground-based cloud image detection and segmentation. Deep-learning approaches, particularly those leveraging convolutional neural networks (CNNs) and fully convolutional networks (FCNs), have demonstrated remarkable efficacy in this domain. According to Zhou et al. [15], CNNs are capable of automatically extracting hierarchical features from images, which has led to substantial improvements in the accuracy of cloud detection and segmentation. Additionally, Hariharan et al. [16] highlighted the effectiveness of FCNs in achieving pixel-level predictions, which is crucial for the precise delineation of cloud boundaries. These methods have also shown a strong ability to generalize across different imaging conditions, as noted by Drozdzal et al. [17], who emphasized the importance of skip connections in U-Net architectures for retaining critical spatial information. Furthermore, the integration of attention mechanisms into deep-learning models, as discussed by Zinner et al. [18], has further improved the performance of cloud segmentation by allowing the model to focus on the most relevant features. The development of more sophisticated architectures and the optimization of training strategies have collectively contributed to the robustness and reliability of deep-learning-based cloud image analysis. However, these deep-learning models often operate as “black boxes,” lacking interpretability in their decision-making processes. This lack of interpretability poses significant challenges for validating results in operational meteorology and climate science [19]. To address this limitation, class activation maps (CAMs) have been introduced to visualize the regions influencing model predictions, thereby enhancing the interpretability of deep-learning models in cloud segmentation tasks. CAMs provide a way to understand which features the model is focusing on when making predictions, making the models more transparent and trustworthy for researchers and practitioners in the field of cloud detection.
CAMs generate coarse heatmaps by aggregating high-level convolutional features through global average pooling (GAP), highlighting areas that contribute most to the classification output [20]. Global average pooling (GAP) is a pooling operation that computes the spatial average of each feature map, reducing each channel to a single scalar value and eliminating the need for fully connected layers in classification tasks. Nevertheless, standard CAM implementations generate low-resolution activation maps (typically 14 × 14 pixels) that fail to precisely localize cloud boundaries while often misclassifying bright surfaces as clouds due to gradient saturation effects [21]. While gradient-weighted CAM (Grad-CAM and Grad-CAM++) [22,23] improves upon traditional CAMs by weighting activations with gradient information, its computational overhead increases training time by 40% [24]. The subsequent HiResCAM [23] paradigm eliminated global pooling dependencies to preserve high-resolution details, proving particularly effective for ground-based cloud segmentation tasks, albeit with heightened sensitivity to shallow-layer gradient noise [25]. Concurrently, FullGrad [20] adopted a comprehensive gradient aggregation strategy across all network layers, enhancing attribution faithfulness but requiring significant memory overhead due to full-network backpropagation [26]. Most recently, LayerCAM [27] emerged as a multi-scale solution, dynamically fusing hierarchical features to balance semantic precision and spatial detail, though necessitating careful layer contribution tuning to avoid artifact propagation [28].
Score-CAM [29] is a visualization technique for interpreting deep-learning model decisions that generates heat maps of the regions critical to model predictions by weighting activation maps with the confidence scores they induce for a specific class. The core idea of this approach is to generate more accurate class activation maps (CAMs) by utilizing the activation maps from the forward propagation process and weighting them by the scores of the target classes. The main advantage of Score-CAM over the traditional Grad-CAM method is that it does not rely on gradient information; instead, it generates heat maps by weighting the feature maps obtained during forward propagation, which allows it to provide more accurate localization when dealing with images rich in detail [23]. In addition, Score-CAM is not affected by the gradient saturation problem, which typically causes gradient-based methods to produce a weaker response in some important regions of the image, thus improving its ability to capture image details. For example, several studies have shown that Score-CAM provides clearer and more accurate feature localization than Grad-CAM in challenging cloud segmentation scenarios such as thin cirrus clouds over bright backgrounds or heterogeneous cloud formations, which is particularly important for precise cloud coverage estimation in meteorological applications [20]. These XAI methods have significantly improved the interpretability of deep-learning models in cloud segmentation, allowing researchers to better understand the decision-making process of the models and increasing trust in their predictions. However, existing CAM implementations still have limitations in terms of multi-scale feature integration and pixel-level segmentation tasks.
Moreover, the existing CAM implementations have three key limitations: (1) they only consider features before the fully connected layer, (2) they lack multi-scale feature integration, and (3) they focus only on classification targets and perform poorly in pixel-level segmentation tasks, particularly in ground-based cloud image analysis (as shown in Figure 1). Gradient-based methods such as FullGrad, Grad-CAM, and Grad-CAM++ produce blurred activations (third and fourth columns), a known limitation attributed to their reliance on gradient computations [22]. Score-CAM (seventh column) and LayerCAM (sixth column) show fragmented attention distributions, especially in regions with complex cloud textures. These artifacts arise from their dependence on forward-pass scoring or layer-specific activations, which may overlook multi-scale dependencies [27]. HiResCAM and LayerCAM, while preserving high-resolution details, fail to integrate global context effectively.
While deep learning has demonstrated promising results in ground-based cloud segmentation, significant challenges remain in practical implementations. Current class activation mapping (CAM) techniques exhibit three critical limitations when applied to cloud segmentation tasks. These deficiencies are particularly problematic for astronomical applications, where precise cloud coverage estimation is crucial. U-Net architecture, widely adopted for semantic segmentation tasks, also faces challenges when applied to cloud detection. While its encoder–decoder structure with skip connections effectively captures multi-scale features, standard U-Net variants struggle with fine boundary delineation of semi-transparent clouds and suffer from false positives in heterogeneous landscapes [30]. Recent analyses indicate that these limitations stem from the uniform treatment of features across scales, without adaptive weighting of contextually relevant regions [31]. To address these challenges and improve the interpretability of cloud segmentation models, we propose an enhanced U-Net architecture incorporating multi-scale feature inputs and a novel multi-scale CAM attention mechanism. This approach not only aims to improve the accuracy of cloud segmentation but also seeks to enhance the transparency and interpretability of the model’s decision-making process, making it more reliable and trustworthy for operational cloud detection systems. Our principal contributions include the following:
  • Multi-scale CAM: Multi-scale features are processed through channel cascading and score-based weighting to generate post hoc visualizations that enhance interpretability while maintaining segmentation accuracy.
  • Multi-scale CAM Attention U-Net: Incorporates a mask attention module that synergistically combines pretrained CAM saliency maps with original feature maps through adaptive weight fusion. This innovation enhances boundary sensitivity and morphological recognition while maintaining feature representation integrity.
  • Multi-scale Feature Enhancement: Augments the model’s receptive field through dilated convolutions in residual separation modules, significantly improving large-scale feature extraction capability while reducing the semantic gap between skip connections and upsampled features.
By integrating these XAI techniques into the cloud segmentation framework, we aim to provide a more interpretable and reliable solution for ground-based cloud detection, which is essential for advancing research and applications in astronomical observations and solar forecasting.

2. Related Work

Class activation mapping (CAM) visualizes region-specific contributions to model predictions in CNNs. As a foundational XAI tool, CAM enhances model transparency by revealing decision-critical features in vision tasks like image segmentation.
Definition 1.
(Class Activation Map) CAM was initially introduced to identify the pixels in an image that a CNN model deems important. It creates a heat map highlighting the regions that affect the network’s predictions by projecting the weights of the model’s output layer onto the feature maps of a convolutional layer (usually the last one). For a class of interest $c$, the CAM explanation is defined as follows:
$\mathrm{CAM} = \mathrm{ReLU}\left( \sum_k \alpha_k^c A_{l-1}^k \right),$
where
$\alpha_k^c = \omega_{l,l+1}^c(k),$
and $\omega_{l,l+1}^c(k)$ is the weight of the $k$th neuron after pooling.
The original CAM framework [20] pioneered CNN decision visualization by mapping convolutional features to classification results via global average pooling (GAP). Its mathematical explicability required minimal architectural changes (replacing fully connected layers with GAP + linear layers [32]) but suffered from inflexible GAP-dependence [20] and single-scale feature limitations causing coarse localization [22].
Definition 2.
(Grad-CAM) Grad-CAM is a prominent variant of CAM that uses the gradients of the target concept flowing into the last convolutional layer to weight the importance of each spatial location. The core formula for Grad-CAM is given by
$\mathrm{GradCAM} = \mathrm{ReLU}\left( \sum_k \alpha_k^c A_l^k \right),$
where
$\alpha_k^c = \mathrm{GAP}\left( \frac{\partial Y^c}{\partial A_l^k} \right),$
$\mathrm{GAP}(\cdot)$ denotes the global average pooling operation, $A_l^k$ represents the activation map of the $k$th channel in the $l$th layer, and $Y^c$ is the network’s output for category $c$.
Grad-CAM overcomes these limitations through a gradient-based feature weighting approach [33], enabling architecture-agnostic applications while preserving spatial coherence for higher-resolution activation maps [32]. However, it introduced gradient saturation in ReLU networks and insensitivity to discriminative negative gradients [34].
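For concreteness, the following is a minimal PyTorch sketch of the Grad-CAM computation in Definition 2. It is not the authors’ implementation; the model interface, the hook-able layer, and all names are illustrative assumptions.

import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, x, class_idx):
    """Compute a Grad-CAM heatmap for one input x of shape (1, C, H, W)."""
    activations, gradients = {}, {}

    def fwd_hook(_, __, output):
        activations["A"] = output          # A_l: feature maps of the target layer

    def bwd_hook(_, grad_in, grad_out):
        gradients["dA"] = grad_out[0]      # dY^c / dA_l

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)

    scores = model(x)                      # class scores Y
    scores[0, class_idx].backward()        # backpropagate the target class score

    # alpha_k^c = GAP over spatial dimensions of the gradients
    alpha = gradients["dA"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((alpha * activations["A"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)

    h1.remove(); h2.remove()
    return cam / (cam.max() + 1e-8)        # normalize to [0, 1]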
Score-CAM further advances the field of XAI by providing more accurate and reliable visualizations of model predictions. By addressing the gradient saturation problem and incorporating forward-pass importance scoring, it offers a more comprehensive view of the features that contribute to a model’s decisions. This enhancement is crucial for building trust in AI systems and enabling more effective model validation and refinement in real-world applications.
Definition 3.
(Score-CAM) Score-CAM extends Grad-CAM by replacing gradient-based weights with forward-pass confidence scores, aiming to provide more precise localization [35]. The core formula for Score-CAM (the structure diagram is shown in Figure 2) is
$\mathrm{ScoreCAM} = \mathrm{ReLU}\left( \sum_k \alpha_k^c A_l^k \right),$
where
$\alpha_k^c = C\left( A_l^k \right),$
$C(\cdot)$ denotes the channel-wise increase of confidence (CIC) for activation map $A_l^k$. CIC is defined as
$C\left( A_l^k \right) = f\left( X \circ H_l^k \right) - f\left( X_b \right),$
where
$H_l^k = s\left( \mathrm{Up}\left( A_l^k \right) \right),$
$s(\cdot)$ is the normalization function that maps each element of the input matrix to $[0, 1]$, and $\mathrm{Up}(A_l^k)$ denotes the operation that upsamples $A_l^k$ to the input size. $X_b$ is the baseline, and $X$ is the input.
Score-CAM replaced gradients with perturbation-based importance scoring, eliminating gradient artifacts and improving noise robustness via multi-pass averaging [23]. This achieved superior visualizations but incurred substantial computational overhead (N forward passes for N channels, increasing inference time 8–10× versus Grad-CAM [22]). Effectiveness diminished in deep networks (>50 layers) due to activation saturation [34], motivating efficient variants like Fast-Score-CAM [35].
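The CIC weighting in Definition 3 can be illustrated with a short, gradient-free sketch. This is a simplified version under assumed shapes (a classifier f and pre-extracted activations A_l), not the reference implementation.

import torch
import torch.nn.functional as F

@torch.no_grad()
def score_cam(f, x, A_l, class_idx, baseline=None):
    """x: input (1, 3, H, W); A_l: layer-l activations (1, K, h, w); returns a heatmap."""
    _, K, _, _ = A_l.shape
    H, W = x.shape[-2:]
    x_b = torch.zeros_like(x) if baseline is None else baseline
    base_score = f(x_b)[0, class_idx]

    weights = []
    for k in range(K):
        up = F.interpolate(A_l[:, k:k + 1], size=(H, W), mode="bilinear",
                           align_corners=False)
        # s(.): min-max normalization of the upsampled activation map to [0, 1]
        s = (up - up.min()) / (up.max() - up.min() + 1e-8)
        # CIC: confidence on the masked input minus the baseline confidence
        weights.append(f(x * s)[0, class_idx] - base_score)

    alpha = torch.stack(weights).view(1, K, 1, 1)
    cam = F.relu((alpha * A_l).sum(dim=1))          # ReLU(sum_k alpha_k A_l^k)
    return cam / (cam.max() + 1e-8)

Note how the per-channel forward passes in the loop are exactly what makes Score-CAM robust to gradient saturation but 8–10× slower than Grad-CAM, as discussed above.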
These XAI methods have significantly enhanced the interpretability of deep-learning models in various applications. By providing more accurate and reliable visualizations of model predictions, they enable researchers and practitioners to better understand and trust the models they develop and deploy. As XAI continues to evolve, these techniques will play an increasingly important role in advancing the reliability and adoption of AI systems in fields ranging from healthcare to autonomous systems.

3. Methods

3.1. Multi-Scale Class Activation Mapping

To address the specific challenges of image segmentation tasks, we have developed a novel post hoc visualization interpretation method based on class activation mapping: multi-scale CAM. Distinct from conventional CAM-based approaches, our multi-scale CAM method fully utilizes the multi-scale feature information of the two layers before the fully connected layer, enabling simultaneous capture of fine image details while maintaining an optimal balance between shallow and deep semantic information. This dual capability significantly enhances both segmentation accuracy and model robustness. The weighting coefficients for each activation map are derived through the categorical discriminative capacity of target classes, thereby eliminating reliance on gradient computations. The final visualization output is generated through a linear combination of these optimized weights with their corresponding activation maps. The complete implementation details are formally described in Algorithm 1. The core formula for multi-scale CAM (the structure diagram is shown in Figure 3) is
$\mathrm{MultiscaleCAM} = \mathrm{ReLU}\left( \sum_k \alpha_k^c A_{concat}^k \right),$
where
$\alpha_k^c = C\left( A_{concat}^k \right),$
$C(\cdot)$ denotes the channel-wise increase of confidence (CIC) [22] for activation map $A_{concat}^k$, defined as
$C\left( A_{concat}^k \right) = f\left( X \circ H_l^k \right) - f\left( X_b \right),$
where
$H_l^k = s\left( A_{concat}^k \right),$
$A_{concat}^k = \left[ A_l^k,\ \mathrm{Up}\left( A_{l-1}^k \right) \right],$
and $[\cdot\,,\cdot]$ denotes channel-wise concatenation.
The input images are first passed through a pre-trained U-Net to extract features, producing feature maps with discriminative capability at different spatial scales so that fine-grained details and high-level semantic information are captured simultaneously. These feature maps are upsampled to recover spatial resolution and align feature dimensions across scales, and are then concatenated along the channel dimension to strengthen the representation of region-specific saliency. Each channel of the aggregated multi-scale activation map is multiplied pixel by pixel with the input image, so that the most relevant spatial locations in the original input space are emphasized; each perturbed input is then passed through the pre-trained U-Net to obtain pixel-level scores of the same spatial size as the activation maps. These pixel-level scores are multiplied pixel by pixel with the corresponding channels of the multi-scale activation map and summed, i.e., the activation maps are weighted by their scores to obtain a preliminary CAM. Finally, the foreground prediction of the input image is applied as a mask to the preliminary CAM to obtain the foreground CAM, and the background prediction is applied analogously to obtain the background CAM, allowing the model to better explain the spatial locations of the features of interest.

3.2. Framework Architecture

The architectural framework of our proposed model, highlighting its key innovative components, is illustrated in Figure 4. The schematic demonstrates the multi-scale input strategy implemented in the second and third encoding layers, where low-resolution input images are first normalized through “Conv+BN+ReLU” operations. The operation is a sequential combination of a convolutional layer, batch normalization, and ReLU activation function, which performs feature extraction while stabilizing training dynamics and introducing non-linearity. These normalized inputs are then channel-wise concatenated with downsampled features from preceding layers, forming a combined representation that undergoes 1 × 1 convolutional compression for dimensionality reduction. The diagram further delineates the integration of multi-scale image feature blocks, which process these fused features to enhance hierarchical representation. This design effectively captures both local fine-grained details and broader contextual information, addressing the limitations in traditional single-scale feature extraction methods and cloud segmentation architectures.
The schematic diagram illustrates three key architectural advancements. Firstly, the multi-scale feature integration module enhances feature diversity through an adaptive hierarchical fusion strategy. In the second and third encoding layers, input images at different resolutions (224 × 224 and 112 × 112 pixels) are processed in parallel streams. Each stream starts with batch-normalized convolutional operations (Conv-BN-ReLU), employing 3 × 3 kernels with a stride of 1, in line with feature preservation best practices from previous research. The normalized feature maps are then concatenated with downsampled representations from preceding layers, achieved through max-pooling with a stride of 2. This creates a rich multi-scale feature space. Secondly, the framework includes a dimensionality reduction and feature enhancement component. After concatenation, the features pass through a 1 × 1 convolutional compression layer. This is followed by processing through our novel multi-scale residual feature aggregation module (MRFA).
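A minimal sketch of this multi-scale input path is given below, assuming a PyTorch encoder; channel counts, stage sizes, and module names are illustrative rather than taken from the released code.

import torch
import torch.nn as nn

class MultiScaleInput(nn.Module):
    def __init__(self, img_ch, feat_ch, out_ch):
        super().__init__()
        self.img_branch = nn.Sequential(          # "Conv+BN+ReLU" on the resized image
            nn.Conv2d(img_ch, feat_ch, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(feat_ch),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)              # downsample previous stage
        self.compress = nn.Conv2d(2 * feat_ch, out_ch, kernel_size=1)  # 1x1 compression

    def forward(self, low_res_img, prev_feat):
        x = torch.cat([self.img_branch(low_res_img), self.pool(prev_feat)], dim=1)
        return self.compress(x)

# e.g. stage 2: a 112x112 resized input fused with 224x224 stage-1 features
fused = MultiScaleInput(img_ch=3, feat_ch=64, out_ch=128)(
    torch.randn(1, 3, 112, 112), torch.randn(1, 64, 224, 224))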
To further optimize performance, the network incorporates multi-scale saliency attention modules following convolutional feature extraction in both encoder and decoder pathways. These modules significantly enhance global context modeling capabilities through a sophisticated attention mechanism: after independently computing attention scores for both feature modalities, a dynamically learnable weighting factor combines these scores to precisely establish token-to-token correlations. This process enables simultaneous token updating, non-local feature extraction, noise reduction, and adaptive feature reorganization to preserve the most semantically valuable information.
Our framework is designed to overcome the main limitations of conventional approaches. It addresses the issue of single-scale feature dependency by using parallel multi-resolution processing streams. Finally, it tackles the problem of boundary blurring through the use of dilated convolutions and skip connections.
Algorithm 1 Multi-Scale CAM algorithm
Require: Image $X_0$, model $f(X)$, layer $l$
Ensure: $L_{MSCAM}$
  Initialization
  // get activations of two layers
  $A_l, A_{l-1}, \mathrm{logit} \leftarrow f(X_0)$
  $\mathrm{predict\_class} \leftarrow \mathrm{sigmoid}(\mathrm{logit})$
  $\mathrm{Cloud\_mask} \leftarrow (\mathrm{predict\_class} \ge 0.5)$
  $\mathrm{Sky\_mask} \leftarrow (\mathrm{predict\_class} < 0.5)$
  // upsample $A_{l-1}$ to match the spatial dimensions of $A_l$
  $A_{l-1}^{up} \leftarrow \mathrm{Upsample}(A_{l-1})$
  // concatenate features along the channel dimension
  $A_{concat} \leftarrow \mathrm{Concat}(A_l, A_{l-1}^{up})$
  $M \leftarrow [\ ]$
  $C \leftarrow$ the number of channels in $A_{concat}$
  for $k$ in $0, \ldots, C-1$ do
      $M_l^k \leftarrow A_{concat}^k$
      // normalize the activation map
      $M_l^k \leftarrow s(M_l^k)$
      // Hadamard product with the input image
      $M.\mathrm{append}(M_l^k \circ X_0)$
  end for
  $M \leftarrow \mathrm{Batchify}(M)$
  // $f(\cdot)$ returns the class logits
  $S \leftarrow f(M)$
  $\alpha_k \leftarrow \mathrm{Sigmoid}(S^k) \circ \mathrm{Cloud\_mask}$
  $\beta_k \leftarrow \mathrm{Sigmoid}(S^k) \circ \mathrm{Sky\_mask}$
  $L_{CloudCAM} \leftarrow \mathrm{ReLU}\left( \sum_k \alpha_k A_{concat}^k \right)$
  $L_{SkyCAM} \leftarrow \mathrm{ReLU}\left( \sum_k \beta_k A_{concat}^k \right)$
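For readers who prefer code, the following is a compact PyTorch sketch of Algorithm 1. It assumes a segmentation model that returns the activations of the last two feature levels together with the logit map; this interface and all names are assumptions, not the authors’ released code.

import torch
import torch.nn.functional as F

@torch.no_grad()
def multiscale_cam(model, x0):
    A_l, A_lm1, logit = model(x0)                      # two feature levels + logits
    pred = torch.sigmoid(logit)                        # pixel-level class probability
    cloud_mask, sky_mask = (pred >= 0.5).float(), (pred < 0.5).float()

    A_up = F.interpolate(A_lm1, size=A_l.shape[-2:], mode="bilinear",
                         align_corners=False)
    A_cat = torch.cat([A_l, A_up], dim=1)              # channel-wise concatenation
    A_cat = F.interpolate(A_cat, size=x0.shape[-2:], mode="bilinear",
                          align_corners=False)

    masked = []
    for k in range(A_cat.shape[1]):
        m = A_cat[:, k:k + 1]
        m = (m - m.min()) / (m.max() - m.min() + 1e-8)  # s(.): normalize to [0, 1]
        masked.append(m * x0)                           # Hadamard product with the input
    batch = torch.cat(masked, dim=0)                    # Batchify(M)

    _, _, s_logit = model(batch)                        # pixel-level scores per channel
    s = torch.sigmoid(s_logit)                          # shape (K, 1, H, W)
    alpha = (s * cloud_mask).transpose(0, 1)            # cloud-masked channel scores
    beta = (s * sky_mask).transpose(0, 1)               # sky-masked channel scores

    cloud_cam = F.relu((alpha * A_cat).sum(dim=1))      # L_CloudCAM
    sky_cam = F.relu((beta * A_cat).sum(dim=1))         # L_SkyCAM
    return cloud_cam, sky_cam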

3.3. Multi-Scale Saliency Attention (MSA)

In order to fully utilize the spatial prior information of the CAM saliency map, an innovative dual-attention fusion mechanism is designed in this module (Figure 5). Specifically, a pre-trained model produces a CAM with rich multi-scale features; the generated CAM is passed through Conv-BN-ReLU feature extraction and then a linear layer to obtain the query vectors $Q_c$ and key vectors $K_c$, from which the attention score $M$ of the saliency map is computed. In parallel, the traditional query $Q$ and key $K$ are computed from the original feature layer to obtain its attention score, where $F$ is the input feature matrix and $W_Q$ and $W_K$ are trainable projection weights for the query and key, respectively. The core innovation of the mask attention mechanism is the weighted fusion of the saliency map’s attention score $M$ with the feature layer’s attention score through a dynamically learnable weight $\theta$, as shown in Equation (14), in which $M$ is added as an additional attention augmentation term to the traditional scaled dot-product attention computation:
$\mathrm{CamAtten}(Q, K, V, M) = \mathrm{Softmax}\left( \frac{Q K^T}{\sqrt{d_k}} + \theta M \right) V,$
where the standard query $Q = F W_Q$ and key $K = F W_K$ matrices are derived through linear transformations of the input features $F$. The multi-scale CAM-conditioned query and key are generated via a convolutional pathway defined as $Q_c = K_c = \mathrm{Linear}(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(\mathrm{CAM}))))$. The attention map $M = \frac{Q_c K_c^T}{\sqrt{d_k}}$, scaled by the square root of the key dimension $d_k$, encodes spatial dependencies guided by the multi-scale CAM, enhancing region-specific feature aggregation.
The fundamental innovation of our mask attention mechanism lies in its dynamic fusion strategy, where saliency map attention scores M are combined with feature layer attention scores through learnable weighting parameter θ (Equation (14)). This integration introduces M as an auxiliary attention augmentation term within the traditional scaled dot-product attention computation framework.
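A hedged sketch of the dual-attention fusion in Equation (14) is shown below. The token and CAM shapes, the shared projection for Q_c and K_c, and all module names are illustrative assumptions (the CAM map is assumed to be at the same spatial resolution as the feature tokens).

import torch
import torch.nn as nn

class CamAttention(nn.Module):
    def __init__(self, dim, cam_ch=1):
        super().__init__()
        self.W_Q = nn.Linear(dim, dim)
        self.W_K = nn.Linear(dim, dim)
        self.W_V = nn.Linear(dim, dim)
        self.cam_embed = nn.Sequential(                 # Conv-BN-ReLU on the CAM map
            nn.Conv2d(cam_ch, dim, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(dim), nn.ReLU(inplace=True))
        self.cam_proj = nn.Linear(dim, dim)             # shared projection for Q_c = K_c
        self.theta = nn.Parameter(torch.zeros(1))       # learnable fusion weight
        self.scale = dim ** -0.5

    def forward(self, feat, cam):
        # feat: (B, N, dim) flattened feature tokens; cam: (B, 1, H, W) with H*W == N
        Q, K, V = self.W_Q(feat), self.W_K(feat), self.W_V(feat)
        qc = self.cam_proj(self.cam_embed(cam).flatten(2).transpose(1, 2))
        M = (qc @ qc.transpose(-2, -1)) * self.scale    # CAM-guided attention scores
        attn = torch.softmax(Q @ K.transpose(-2, -1) * self.scale + self.theta * M, dim=-1)
        return attn @ V

Initializing theta at zero lets the module start as standard self-attention and learn how strongly the CAM prior should steer the scores; this is a design choice in the sketch, not necessarily the authors’ initialization.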
This sophisticated design achieves significant advancements by effectively utilizing the spatial priors from CAM saliency maps to guide precise attention allocation in cloud regions. It establishes a synergistic relationship between global semantic comprehension and local feature perception. Furthermore, it demonstrates substantial performance gains in cloud image segmentation through two complementary mechanisms. The spatial guidance from saliency maps suppresses background noise interference while enhancing boundary recognition accuracy for complex cloud formations. Meanwhile, the dynamic weighting scheme enables context-aware attention strategy adaptation, maintaining segmentation precision across diverse cloud types (including thin cirrus, thick cumulonimbus, and multi-layered systems) and challenging conditions (variable morphology, indistinct boundaries, and complex illumination).

3.4. Multi-Scale Residual Feature Aggregation Module (MRFA)

Due to varying cloud thickness leading to fragmentation in segmentation results, a multi-scale residual feature aggregation module is introduced in the decoder. Through serial fusion of 2D standard convolution and 2D dilated convolution, this approach achieves local detail extraction in segmentation tasks while preserving global context. Standard convolution captures pixel-level high-resolution cloud edges through dense sliding windows, ensuring continuity and precision of local features. The 3 × 3 convolution with a dilation rate of 2 expands the receptive field through interval sampling, covering larger image regions without increasing the parameter count, thereby capturing overall cloud morphology and spatial distribution relationships. This design uses standard convolution to compensate for potential fine-structure fragmentation that may result from the interval sampling of dilated convolution, while leveraging dilated convolution to overcome the receptive field limitations of traditional convolution, effectively avoiding the computational overhead and noise interference associated with large kernel convolutions [36,37].
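The serial standard/dilated convolution fusion can be sketched as follows; the channel counts and the residual placement are assumptions for illustration, not the exact published configuration.

import torch
import torch.nn as nn

class MRFABlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.local = nn.Sequential(                      # dense 3x3: local edge detail
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.context = nn.Sequential(                    # dilated 3x3 (rate 2): wider receptive field
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

    def forward(self, x):
        return x + self.context(self.local(x))           # serial fusion plus residual connection

out = MRFABlock(64)(torch.randn(1, 64, 56, 56))          # spatial size is preserved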

3.5. Loss Function

In this paper, we combine the binary cross-entropy (BCE) loss with Dice loss, which is calculated as in Equations (15)–(17).
$\mathrm{Loss}_{BCE} = -\left[ y_n \log x_n + (1 - y_n)\log(1 - x_n) \right],$
$\mathrm{Loss}_{Dice} = 1 - \frac{2\sum_i x_i y_i}{\sum_i x_i + \sum_i y_i},$
$\mathrm{Loss}_{total} = \lambda\, \mathrm{Loss}_{BCE} + (1 - \lambda)\, \mathrm{Loss}_{Dice},$
where $x_n$ denotes the value of an element in the prediction map, $y_n$ denotes the value of the corresponding element in the label map, and $\lambda$ is a hyperparameter. In the Dice loss, $x_i$ is the probability that the $i$th element in the prediction map belongs to the foreground class, and $y_i$ is the true value of the $i$th element in the label map. The Dice loss is not affected by the size of the foreground, while the BCE loss guides the Dice loss during network learning. The BCE loss provides pixel-level classification guidance with strong gradient signals for individual pixels, while the Dice loss addresses class imbalance by focusing on the overlap between the predicted and true regions. This is particularly important for cloud segmentation because the cloud/sky proportion varies considerably. The combination leverages the local optimization ability of BCE and the global shape consistency of Dice, thereby achieving more robust boundary delineation and shape preservation.
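A minimal sketch of the combined objective in Equations (15)–(17) is given below; the value of λ (here lam) is a hyperparameter and the default shown is an assumption.

import torch
import torch.nn.functional as F

def combined_loss(pred_logits, target, lam=0.5, eps=1e-6):
    """pred_logits, target: tensors of shape (B, 1, H, W); target values in {0, 1}."""
    prob = torch.sigmoid(pred_logits)
    bce = F.binary_cross_entropy_with_logits(pred_logits, target)        # Eq. (15)
    inter = (prob * target).sum(dim=(1, 2, 3))
    dice = 1 - (2 * inter + eps) / (prob.sum(dim=(1, 2, 3))
                                    + target.sum(dim=(1, 2, 3)) + eps)   # Eq. (16)
    return lam * bce + (1 - lam) * dice.mean()                           # Eq. (17)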

4. Experimental Results

4.1. Dataset

To address the need for standardized, high-quality datasets in cross-platform remote sensing (RS) segmentation, we adopt four widely used benchmarks—SWINSEG [38], SWIMSEG [38], TCDD [39], and HRC_WHU [40]—each designed to tackle specific challenges in single-modal RGB segmentation.
The SWINSEG dataset contains 1128 daytime and nighttime sky/cloud images, along with the corresponding binary ground-truth maps. The images are drawn from two earlier sky/cloud image segmentation datasets from Nanyang Technological University in Singapore—SWIMSEG and SWINSEG. All images were captured in Singapore over a 12-month period from January to December 2016 using the calibrated ground-based full-sky imager WAHRSIS. The ground-truth annotation was produced in collaboration with experts from the Singapore Meteorological and Geophysical Bureau.
The TCDD (Tianjin Dual Platform Dataset) dataset was released by Tianjin University in 2023 and consists of 320 pairs of image sets, covering 80 geographical areas in Tianjin, China. Each image set includes drone images captured by the DJI Matrice 300 RTK camera and satellite images from the High-Resolution Satellite 6 mission (China’s High-Resolution Earth Observation System). The dataset covers urban, suburban, and rural areas, ensuring the diversity of land-cover types. It is often used to evaluate the ability of models to adapt to the differences in spatial resolution, perspective, and feature granularity between drone and satellite data.
The HRC_WHU dataset contains 1776 virtual images along with their corresponding depth maps and camera parameters. These images were captured by a five-angle tilt camera installed on a drone and cover an area of approximately 6.7 × 2.2 square kilometers in Meitan County, Guizhou Province, China. The dataset was processed and edited using software and manual methods, creating a complete 3D city scene and simulating the imaging process of a single-lens camera, generating a virtual aerial image with a resolution of 5376 × 5376 pixels. This dataset is typically used to verify fine segmentation and boundary preservation.

4.2. Implementation Details

The environment used in this experiment was Python 3.10 and torch 2.4.0 + cu124. An NVIDIA A600 GPU (NVIDIA Corporation, Santa Clara, CA, USA) was used for training and testing the model. During training on the SWIMSEG, SWINSEG, TCDD, and HRC_WHU datasets, the initial learning rate was set to 0.0005, the optimizer was AdamW, and cosine decay was used to gradually reduce the learning rate to 5 × 10−6. The batch size was 4, and the number of training epochs was 100. The input and output image sizes of the model were both 224 × 224. Data augmentation added Gaussian noise with a probability of 0.5. The loss function for all models was set to binary cross-entropy loss to optimize model performance. Additionally, the best weights were saved each time the validation-set performance improved.
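The stated optimizer and schedule can be reproduced with a few lines of PyTorch, sketched below; the placeholder model and the omitted data pipeline are assumptions, not part of the described setup.

import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, 3, padding=1)      # placeholder for the segmentation network
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=5e-6)

for epoch in range(100):
    # ... one training pass over the 224x224 inputs (batch size 4, Gaussian-noise
    # augmentation with probability 0.5) would go here ...
    scheduler.step()                        # cosine decay of the learning rate toward 5e-6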

4.3. Evaluation Metrics

In the experimental component, we used two evaluation metrics, MIoU (mean Intersection over Union) and the Dice similarity coefficient (equivalent to the F1 score for binary masks), to measure the similarity between predictions and labels. We also used three evaluation metrics—recall, error rate, and precision—to measure the classification accuracy of the prediction map. The formulas are as follows:
$\mathrm{MIoU} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{IoU}_i,$
$\mathrm{Error\ Rate} = \frac{FP + FN}{TP + FP + TN + FN},$
$\mathrm{Recall} = \frac{TP}{TP + FN},$
$\mathrm{Precision} = \frac{TP}{TP + FP},$
where TP (true positive) denotes the foreground region correctly predicted as foreground, TN (true negative) the background region correctly predicted as background, FP (false positive) the region predicted as foreground but labeled as background, and FN (false negative) the region predicted as background but labeled as foreground.
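These metrics can be computed directly from the confusion-matrix counts, as in the following sketch; the 0.5 binarization threshold is an assumption for illustration.

import torch

def segmentation_metrics(pred, target, thr=0.5, eps=1e-8):
    p, t = (pred >= thr).float(), (target >= thr).float()
    tp = (p * t).sum()                                  # foreground predicted as foreground
    tn = ((1 - p) * (1 - t)).sum()                      # background predicted as background
    fp = (p * (1 - t)).sum()                            # background predicted as foreground
    fn = ((1 - p) * t).sum()                            # foreground predicted as background
    iou = tp / (tp + fp + fn + eps)                     # per-image IoU; average over images for MIoU
    return {
        "recall": (tp / (tp + fn + eps)).item(),
        "precision": (tp / (tp + fp + eps)).item(),
        "error_rate": ((fp + fn) / (tp + fp + tn + fn + eps)).item(),
        "iou": iou.item(),
    }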

4.4. Comparison with Other Methods

This section presents a visual comparison of cloud segmentation results between the proposed method (Figure 6) and several state-of-the-art approaches, including U-Net [41], Att-UNet [42], Swin-UNet [43], UNeXt [44], MA-SegCloud [45], Laplacian [46], CSWin-UNet [47], and Rolling-Unet [48]. We evaluate performance on both daytime and nighttime images from the SWIMSEG/SWINSEG datasets, focusing on edge precision, noise robustness, and morphological detail preservation.
Figure 7 offers a visual comparison of cloud segmentation outcomes from various state-of-the-art methods on daytime and nighttime images sourced from the SWIMSEG/SWINSEG datasets. The evaluation centers on edge precision, noise resilience, and retention of morphological details. Results indicate that our proposed method surpasses other leading approaches in multiple critical aspects. It exhibits a marked advantage in defining cloud boundaries with greater precision and localization accuracy. Unlike other techniques, which can produce indistinct or incomplete edges, our method maintains clear delineation, particularly in complex cloud formations and areas of varying cloud density.
The segmentation results of the TCDD dataset (as shown in Figure 8) reveal that our model demonstrates certain advantages in the case of class imbalance (first row) and rich texture (second and third rows). Specifically, in the segmentation of the sparse small cloud patches (the first row), most models failed to accurately restore the cloud area at the lower left corner of the image. The red boxes highlight regions where baseline methods struggle with texture-rich scenarios. For instance, models like U-Net and Att-UNet show fragmented or incomplete segmentation of thin cloud strips, while CSWin-UNet tends to over-smooth these detailed structures. Our multi-scale CAM attention mechanism specifically addresses this limitation by preserving multi-scale texture information through hierarchical feature fusion. Our proposed multi-scale CAM module can dynamically adjust the model’s focus in the case of class imbalance through its adaptive weight fusion mechanism. This allows the model to better capture the subtle features of sparse small cloud patches, avoiding local over-segmentation and omission. From the edge segmentation results in the second and third rows, it can be seen that when the imbalance of categories and the rich edge texture occur simultaneously, the multi-scale CAM attention U-Net, with its enhanced boundary sensitivity and morphological recognition capabilities, effectively addresses the issue of “discontinuous segmentation of thin cloud strips” commonly found in traditional models. For example, when processing thin cloud strips, the model leverages the attention weights from the CAM module to enhance the expression of edge features. It combines shallow edge detection features with deep semantic features to determine whether the thin strip belongs to the cloud body, thereby achieving continuous and clear segmentation of thin cloud strips.
As can be seen from the segmentation results on the HRC_WHU high-resolution dataset (Figure 9), the first and second rows represent barren and urban scenes, respectively. In the segmentation results, we use a set of transposed rectangular boxes to display. When the edge detail textures have directional characteristics, our multi-scale feature enhancement module, which utilizes dilated convolutions in residual separation modules, effectively captures the directional edge details. This enables the model to restore the image details well in these complex scenes. The third row shows that when the segmentation target is thin clouds and the non-target area has complex, detailed structures, our multi-scale CAM module can effectively distinguish the target from the background through its feature decoupling mechanism. This reduces the phenomenon of misclassification and enhances the model’s ability to handle complex scenes. The fourth row shows the cloud map segmentation result on farmland. The combination of the multi-scale CAM module and the multi-scale feature enhancement module allows the image to retain the detailed information of different scale targets in edge detail-rich segmentation tasks. This demonstrates the model’s robustness in cross-platform data.
The visualization results of the above comparative experiments show that our method demonstrates strong generalization ability in both daytime and nighttime images. It can accurately segment clouds under various lighting conditions and with complex backgrounds. At the same time, it has strong anti-noise capabilities and can generate smoother and more coherent segmentation masks. It can effectively preserve the fine morphological features of clouds, including thin cirrus clouds and small cumulus clouds, and accurately capture their complex structures and shape changes.
By integrating the multi-scale CAM module and the multi-scale CAM attention U-Net, our method achieves more accurate and detailed cloud segmentation results. In terms of edge accuracy, anti-noise ability, and morphological feature retention, it outperforms other state-of-the-art methods in various imaging conditions. The multi-scale CAM module enhances the model’s ability to focus on relevant features, while the attention U-Net improves boundary sensitivity and morphological recognition ability. These innovations make our method more reliable and interpretable in the cloud detection task.
The results in Table 1 and Table 2 show that the proposed method has the highest recall rates in daytime, nighttime, and all-day image segmentation, which are 92.02%, 91.72% and 91.99%, respectively, indicating that it can better detect cloud pixels. The proposed method has the highest F1 scores in daytime, nighttime, and all-day image segmentation, which are 92.22%, 88.69%, and 91.85% respectively, indicating that it has achieved the best balance between the accuracy and recall rate of cloud segmentation. The proposed method has the lowest error rates in daytime, nighttime, and all-day image segmentation, which are 6.38%, 9.21%, and 6.67% respectively, indicating that its segmentation results are the most accurate. The proposed method has the highest MIoU values in daytime, nighttime, and all-day image segmentation, which are 86.72%, 80.39%, and 86.05%, respectively, indicating that its segmentation results exhibit the highest overlap degree with the real labels and the highest segmentation accuracy. The parameter count of the proposed method is 93.23M. Although it is slightly higher than that of some other methods, its performance improvement is significant, indicating that it has higher efficiency and accuracy in the task of cloud image segmentation.
The ROC curves (Figure 10) demonstrate the performance of different algorithms in the cloud image segmentation task. The curve of the proposed method (ours) lies closest to the upper left corner, indicating that it attains the highest true positive rate (TPR) at the lowest false positive rate (FPR), and its performance is superior to the other seven algorithms. The AUC value (area under the curve) is an important indicator of algorithm performance; the AUC of the proposed method is the largest, indicating the highest accuracy and robustness in the cloud image segmentation task.

4.5. Ablation Experiments

To enhance feature representation in cloud segmentation tasks, we strategically integrate multi-scale saliency attention across diverse architectural paradigms. In pure convolutional networks (U-Net, Att-UNet, and UNeXt), multi-scale saliency attention is embedded within the bottleneck layers to establish non-local dependencies for advanced semantic features while enhancing the focus on the target region. For transformer-based frameworks (e.g., Laplacian), we replace the attention module in the bottleneck layer of the original model with multi-scale CAM-guided attention, enabling the model to prioritize semantically salient regions during token mixing. The unified design leverages CAM-derived spatial priors to refine both local (convolutional) and global (transformer) feature aggregation, addressing limitations of fixed-receptive-field operations in heterogeneous cloud morphology segmentation.
Comparative analysis of Figure 11 and Figure 12 reveals distinct performance characteristics across architectures. Att-UNet demonstrates compromised edge precision in nocturnal scenes (error rate: 12.65% vs. daytime 8.17%), particularly for stratocumulus formations, while maintaining acceptable daytime performance for dense clouds. The Laplace method achieves superior daytime accuracy (89.37% recall) but struggles with nocturnal thin cirrus detection (76.69% recall), where its physical assumptions break down. U-Net variants exhibit systematic limitations across all conditions, with 15.53% error rates in nighttime scenarios due to inadequate small-scale feature retention. Notably, CAM integration universally enhances performance—improving Att-UNet’s nocturnal recall by 4.63% and reducing Laplace’s boundary errors by 0.51%, though computational overhead increases proportionally. The proposed architecture uniquely maintains consistent accuracy across illumination conditions (day–night recall variance: 0.3% vs. competitors’ 8–12%), demonstrating particular advantages in cumulus-sparse regions and twilight transitions. The segmentation experiment results in Figure 11 show that our method performs best in nighttime image segmentation. It can accurately segment the main regions of clouds and capture the details and edge information of clouds well, and the segmentation results at cumulus and stratocumulus clouds are especially smooth and accurate. The results (Figure 12) highlight improvements in thin cloud segmentation and boundary sensitivity compared to baseline and CAM-enhanced variants (e.g., Att-UNet+CAM, U-Net+CAM).
It can be seen from the results in Table 3 that after the introduction of CAM, the recall rates of Att-UNet, Laplacian, U-Net, and UNeXt have all improved. However, the recall rate of the proposed method is still the highest, indicating that it has a stronger detection ability in cloud segmentation. The F1 scores have all improved, but the F1 score of the proposed method is still the highest, indicating that it has achieved the best balance between the accuracy and recall rate of cloud segmentation. The error rates have all decreased, but the error rate of the proposed method is still the lowest, indicating that its segmentation results are the most accurate. The MIoU values have all increased, but the MIoU value of the proposed method is still the highest, indicating that the overlap degree between its segmentation result and the real label is the highest, and the segmentation accuracy is the highest.
With regard to the occasional performance degradation observed in Table 3 (e.g., Att-UNet+CAM’s lower F1 score for nighttime images), the following factors are the most likely causes. Multi-scale CAM’s dynamic weighting α may overfit to coarse cloud structures while neglecting fine details (e.g., thin cirrus edges) when fused with shallow features; this is exacerbated in models like Att-UNet, where skip connections already prioritize high-level semantics [41]. For compact models (e.g., UNeXt [44]), adding CAM introduces redundant parameters without proportional gains, leading to suboptimal feature fusion. CAM’s upsampled activations often blur cloud boundaries (Figure 8), conflicting with precise segmentation needs; while the proposed multi-scale CAM (Figure 3) alleviates this via hierarchical features, residual errors persist in complex scenes. The dependence of CAM on high-level features can also intensify noise in low-light environments (Section 4.2). Architecture-specific adaptive adjustments (such as adding low-level skip connections) may alleviate this problem and are a current research focus; we will investigate the behavior of CAM in more depth in future work.
The ablation study presented in Table 4 systematically evaluates the contribution of each proposed module to the overall performance of our model. The baseline architecture, serving as the reference, achieves a recall of 89.35%, an F1 score of 89.81%, an error rate of 8.16%, and an MIoU of 83.11%. The incremental addition of each module demonstrates their individual and collective efficacy in addressing the limitations outlined in the introduction.
The introduction of multi-scale input significantly enhances performance, achieving a recall of 89.62%, an F1 score of 91.40%, an error rate of 7.16%, and an MIoU of 85.51%. This improvement is attributed to the module’s capability to process images at multiple resolutions, effectively capturing both local fine-grained details and broader contextual information. From the perspective of explainable AI (XAI), this multi-scale input mechanism enhances the model’s interpretability by allowing it to consider features at various scales, making the decision-making process more transparent and understandable. This addresses a critical limitation of traditional CAM methods, which rely on single-scale features and consequently struggle to precisely localize cloud boundaries, especially in complex scenarios like thin cirrus or dense cumulonimbus clouds. In summary, the multi-scale input module enhances the model’s ability to handle the variability in cloud textures and boundaries by integrating hierarchical features from different network depths, as demonstrated by the quantitative improvements in key metrics [21].
Incorporating the multi-scale feature extraction module further enhances performance, yielding a recall of 90.55%, an F1 score of 91.63%, an error rate of 6.81%, and an MIoU of 85.87%. The module employs dilated convolutions and residual connections to expand the receptive field without excessive computational overhead, preserving edge details and suppressing noise—critical for segmenting small or semi-transparent cloud structures. In terms of XAI, this module improves the model’s ability to provide detailed and noise-resistant representations of cloud features, which is essential for generating reliable and interpretable segmentation results. By leveraging residual learning principles [43], the module ensures both low-level spatial fidelity and high-level semantic abstraction, addressing the gradient decay problem in deeper networks [44].
The most significant performance gains are observed with the addition of the MSA module, achieving a recall of 91.99%, an F1 score of 91.85%, an error rate of 6.67%, and an MIoU of 86.05%. This module dynamically fuses CAM-derived saliency maps with original feature maps through learnable weights, refining attention allocation in cloud regions. The MSA module resolves the misclassification of bright non-cloud objects (e.g., urban areas) [21] by adaptively weighting multi-scale dependencies and suppressing background noise. Within the XAI framework, the MSA module significantly enhances the model’s interpretability by adaptively weighting multi-scale dependencies and suppressing background noise. This allows for a clearer distinction between cloud regions and background, making the model’s predictions more transparent and trustworthy. Its innovative dual-attention mechanism, which combines spatial priors from CAM with feature-layer attention scores, enhances boundary sensitivity and morphological recognition.
The progressive improvements in Table 4 underscore the synergistic effect of the proposed modules. The multi-scale input lays the foundation for hierarchical feature integration, while the feature extraction module ensures detailed and noise-resistant representations. The MSA module then refines these features through context-aware attention, achieving state-of-the-art segmentation accuracy. Collectively, these modules address the three key limitations of existing CAM methods: (1) reliance on pre-fully connected layer features, (2) lack of multi-scale integration, and (3) poor performance in pixel-level segmentation tasks. By integrating these XAI techniques, our model not only improves segmentation accuracy but also provides a more transparent and interpretable decision-making process, which is crucial for the application of AI in fields requiring high reliability and accountability. The results validate our design choices and highlight the importance of combining multi-scale processing with adaptive attention mechanisms for robust cloud detection.
In conclusion, the ablation study demonstrates that each module contributes uniquely to overcoming the challenges of cloud segmentation, with the MSA module playing a pivotal role in achieving superior performance. This holistic approach not only advances the accuracy and interpretability of cloud detection systems but also offers a scalable and transparent framework for other remote sensing applications, aligning with the goals of XAI to build more trustworthy and understandable AI systems.

5. Conclusions

Clouds and their shadows have a significant impact on the quality and usability of optical remote sensing images, so cloud detection is a crucial research topic in this field. This paper proposes an enhanced U-Net architecture that effectively improves the segmentation effect of cloud images by combining multi-scale feature input with a multi-scale saliency attention module. The main contributions of this method are threefold: firstly, a multi-scale class activation map (CAM) is proposed, which integrates multi-scale features to capture fine details and broader contextual information; secondly, a U-Net framework based on multi-scale saliency attention is implemented, using multi-scale weighted encoders to enhance the network’s ability to extract detailed features; finally, a multi-scale image feature module containing dilated convolutions is introduced to further enhance the model’s ability to capture large-scale features.
During the experimental stage, the model was rigorously trained and tested on four datasets. The experimental results show that this method performs well in the cloud image segmentation task, outperforming existing methods in key metrics such as recall, F1 score, error rate, and MIoU, while keeping the parameter count within a reasonable range for the performance gained. Compared with other models that incorporate class activation maps (CAMs), this method’s superiority in cloud segmentation is further confirmed. Additionally, the analysis of ROC curves and AUC values again demonstrates the high accuracy and stability of this method in cloud image segmentation.
From the perspective of explainable artificial intelligence (XAI), this optimized U-Net architecture provides a transparent and interpretable method for cloud detection. By combining multi-scale CAM and multi-scale saliency attention modules, the model provides a clear basis for its predictions. These modules enable the model to effectively highlight the most influential regions in its decision-making process, thereby making its reasoning more comprehensible to researchers and practitioners. This is in line with the goals of XAI: making artificial intelligence systems transparent, interpretable, and trustworthy. The proposed method demonstrates the potential of explainable artificial intelligence in practical optical remote sensing applications. By enhancing the interpretability of the cloud detection model, this study contributes to the broader field of explainable artificial intelligence by applying advanced techniques to practical problems and providing a template for further development. The improvements in accuracy and efficiency, together with the model’s interpretability, make it a valuable tool in scenarios requiring reliable and interpretable artificial intelligence decisions.

Author Contributions

Conceptualization, Q.X. and Y.C.; methodology, Q.X., Z.Z. and Y.C.; software, Q.X. and Z.Z.; validation, Q.X., Z.Z. and Y.C.; formal analysis, Q.X.; investigation, Q.X.; resources, Q.X., Z.Z. and Y.C.; data curation, Q.X. and Z.Z.; writing—original draft preparation, Q.X., Z.Z. and Y.C.; writing—review and editing, Q.X.; visualization, Q.X.; supervision, Y.C.; project administration, Q.X., G.W. and Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, Q.; Zhou, C.; Zhuge, X.; Liu, C.; Weng, F.; Wang, M. Retrieval of cloud properties from thermal infrared radiometry using convolutional neural network. Remote Sens. Environ. 2022, 278, 113079. [Google Scholar] [CrossRef]
  2. Gupta, R.; Nanda, S.J. Cloud detection in satellite images with classical and deep neural network approach: A review. Multimed. Tools Appl. 2022, 81, 31847–31880. [Google Scholar] [CrossRef]
  3. Bulgin, C.E.; Maidment, R.I.; Ghent, D.; Merchant, C.J. Stability of cloud detection methods for Land Surface Temperature (LST) Climate Data Records (CDRs). Remote Sens. Environ. 2024, 315, 114440. [Google Scholar] [CrossRef]
  4. Geng, J.; Zhang, Y.; Jiang, W. Polarimetric SAR Image Classification Based on Hierarchical Scattering-Spatial Interaction Transformer. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5205014. [Google Scholar] [CrossRef]
  5. Segal-Rozenhaimer, M.; Li, A.; Das, K.; Chirayath, V. Cloud detection algorithm for multi-modal satellite imagery using convolutional neural-networks (CNN). Remote Sens. Environ. 2020, 237, 111446. [Google Scholar] [CrossRef]
  6. Yang, Y.; Zhao, C.; Sun, Y.; Chi, Y.; Fan, H. Convective Cloud Detection and Tracking Using the New-Generation Geostationary Satellite Over South China. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4103912. [Google Scholar] [CrossRef]
  7. Sawant, M.; Shende, M.K.; Feijóo-Lorenzo, A.E.; Bokde, N.D. The State-of-the-Art Progress in Cloud Detection, Identification, and Tracking Approaches: A Systematic Review. Energies 2021, 14, 8119. [Google Scholar] [CrossRef]
  8. Matsunobu, L.M.; Pedro, H.T.; Coimbra, C.F. Cloud detection using convolutional neural networks on remote sensing images. Sol. Energy 2021, 230, 1020–1032. [Google Scholar] [CrossRef]
  9. Peng, Z.; Yu, D.; Huang, D.; Heiser, J.; Yoo, S.; Kalb, P. 3D cloud detection and tracking system for solar forecast using multiple sky imagers. Sol. Energy 2015, 118, 496–519. [Google Scholar] [CrossRef]
  10. Mahajan, S.; Fataniya, B. Cloud detection methodologies: Variants and development—A review. Complex Intell. Syst. 2020, 6, 251–261. [Google Scholar] [CrossRef]
  11. Shang, H.; Letu, H.; Xu, R.; Wei, L.; Wu, L.; Shao, J.; Nagao, T.M.; Nakajima, T.Y.; Riedi, J.; He, J.; et al. A hybrid cloud detection and cloud phase classification algorithm using classic threshold-based tests and extra randomized tree model. Remote Sens. Environ. 2024, 302, 113957. [Google Scholar] [CrossRef]
  12. Miroszewski, A.; Mielczarek, J.; Szczepanek, F.; Czelusta, G.; Grabowski, B.; Le Saux, B.; Nalepa, J. Cloud Detection in Multispectral Satellite Images Using Support Vector Machines with Quantum Kernels. In Proceedings of the IGARSS 2023-2023 IEEE International Geoscience and Remote Sensing Symposium, Pasadena, CA, USA, 16–21 July 2023; pp. 796–799. [Google Scholar] [CrossRef]
  13. Yun, Y.; Kim, T.; Lee, C.; Han, Y. Deep Learning-Based Cloud Detection in High-Resolution Satellite Imagery Using Various Open-Source Cloud Images. In Proceedings of the IGARSS 2023-2023 IEEE International Geoscience and Remote Sensing Symposium, Pasadena, CA, USA, 16–21 July 2023; pp. 6538–6541. [Google Scholar] [CrossRef]
  14. Silva Neto, C.P.D.; Alves Barbosa, H.; Assis Beneti, C.A. A method for convective storm detection using satellite data. Atmósfera 2016, 29, 343–358. [Google Scholar] [CrossRef]
  15. Yi, L.; Li, M.; Liu, S.; Shi, X.; Li, K.-F.; Bendix, J. Detection of dawn sea fog/low stratus using geostationary satellite imagery. Remote Sens. Environ. 2023, 294, 113622. [Google Scholar] [CrossRef]
  16. Hariharan, B.; Arbelaez, P.; Girshick, R.; Malik, J. Object Instance Segmentation and Fine-Grained Localization Using Hypercolumns. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 627–639. [Google Scholar] [CrossRef]
  17. Drozdzal, M.; Vorontsov, E.; Chartrand, G.; Kadoury, S.; Pal, C. The Importance of Skip Connections in Biomedical Image Segmentation. arXiv 2016, arXiv:1608.04117. [Google Scholar] [CrossRef]
  18. Marshak, A.; Davis, A.B. Satellite and Airborne Remote Sensing of Clouds and Aerosols. In Fast Processes in Large Scale Atmospheric Models; Wiley: Hoboken, NJ, USA, 2023; pp. 361–397. [Google Scholar] [CrossRef]
  19. Yuan, Q.; Shen, H.; Li, T.; Li, Z.; Li, S.; Jiang, Y.; Xu, H.; Tan, W.; Yang, Q.; Wang, J.; et al. Deep learning in environmental remote sensing: Achievements and challenges. Remote Sens. Environ. 2020, 241, 111716. [Google Scholar] [CrossRef]
  20. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2921–2929. [Google Scholar] [CrossRef]
  21. Li, W.; Zhang, F.; Lin, H.; Chen, X.; Li, J.; Han, W. Cloud Detection and Classification Algorithms for Himawari-8 Imager Measurements Based on Deep Learning. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4107117. [Google Scholar] [CrossRef]
  22. Chattopadhay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V.N. Grad-Cam++: Generalized Gradient-Based Visual Explanations for Deep Convolutional Networks. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 839–847. [Google Scholar] [CrossRef]
  23. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar] [CrossRef]
  24. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  25. Adebayo, J.; Gilmer, J.; Muelly, M.; Goodfellow, I.; Hardt, M.; Kim, B. Sanity Checks for Saliency Maps in Deep Vision. IEEE TPAMI 2022, 44, 3121–3135. [Google Scholar]
  26. Albekairi, M.; Mohamed, M.V.O.; Kaaniche, K.; Abbas, G.; Alanazi, M.D.; Alanazi, T.M.; Emara, A. Multimodal medical image fusion combining saliency perception and generative adversarial network. Sci. Rep. 2025, 15, 10609. [Google Scholar] [CrossRef]
  27. Jiang, P.-T.; Zhang, C.-B.; Hou, Q.; Cheng, M.-M.; Wei, Y. LayerCAM: Exploring Hierarchical Class Activation Maps for Localization. IEEE Trans. Image Process. 2021, 30, 5875–5888. [Google Scholar] [CrossRef]
  28. Wang, T.; Wang, H.; Wang, Z.; Du, M.; Yang, F.; Zhang, Z.; Ding, S. Score-CAM: Score-Weighted Class Activation Mapping for Localizing Discriminative Regions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  29. Lau, M.M.; Lim, K.H. Review of Adaptive Activation Function in Deep Neural Network. In Proceedings of the 2018 IEEE-EMBS Conference on Biomedical Engineering and Sciences (IECBES), Kuching, Malaysia, 3–6 December 2018; pp. 686–690. [Google Scholar]
  30. Siddique, N.; Sidike, P.; Elkin, C.; Devabhaktuni, V. U-Net and its variants for medical image segmentation: Theory and applications. arXiv 2022, arXiv:2011.01118. [Google Scholar] [CrossRef]
  31. Feng, Y.; Fan, Z.; Yan, Y.; Jiang, Z.; Zhang, S. MFAFNet: Multi-Scale Feature Adaptive Fusion Network Based on DeepLab V3+ for Cloud and Cloud Shadow Segmentation. Remote Sens. 2025, 17, 1229. [Google Scholar] [CrossRef]
  32. Adebayo, J.; Gilmer, J.; Muelly, M.; Goodfellow, I.; Hardt, M.; Kim, B. Sanity Checks for Saliency Maps. Adv. Neural Inf. Process. Syst. (NeurIPS) 2018, 31, 9525–9536. [Google Scholar]
  33. Sundararajan, M.; Taly, A.; Yan, Q. Axiomatic Attribution for Deep Networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, Australia, 6–11 August 2017; pp. 3319–3328. [Google Scholar]
  34. Hooker, S.; Erhan, D.; Kindermans, P.J.; Kim, B. A Benchmark for Interpretability Methods. Adv. Neural Inf. Process. Syst. (NeurIPS) 2019. [Google Scholar] [CrossRef]
  35. Zhang, Y.; Li, J.; Zhang, D. Fast-Score-CAM: Accelerated Visual Interpretation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  36. Wang, W.; Su, C.; Han, G.; Zhang, H. A lightweight crack segmentation network based on knowledge distillation. J. Build. Eng. 2023, 76, 107200. [Google Scholar] [CrossRef]
  37. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 5998–6008. [Google Scholar]
  38. Zhao, H.; Kong, X.; He, J.; Qiao, Y.; Dong, C. Efficient image super-resolution using pixel attention. In Computer Vision–ECCV 2020 Workshops; 2020; pp. 56–72. arXiv 2020, arXiv:2010.01073. [Google Scholar] [CrossRef]
  39. Zhang, Z.; Yang, S.; Liu, S.; Xiao, B.; Cao, X. Ground-Based Cloud Detection Using Multiscale Attention Convolutional Neural Network. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8019605. [Google Scholar] [CrossRef]
  40. Liu, J.; Ji, S. A Novel Recurrent Encoder-Decoder Structure for Large-Scale Multi-View Stereo Reconstruction from an Open aerial Dataset. arXiv 2020, arXiv:2003.00637. [Google Scholar] [CrossRef]
  41. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015. Lecture Notes in Computer Science; Navab, N., Hornegger, J., Wells, W., Frangi, A., Eds.; Springer: Cham, Switzerland, 2015; Volume 9351. [Google Scholar] [CrossRef]
  42. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M. Attention U-Net: Learning Where to Look for the Pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar] [CrossRef]
  43. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like pure transformer for medical image segmentation. arXiv 2021, arXiv:2105.05537. [Google Scholar] [CrossRef]
  44. Valanarasu, J.M.J.; Patel, V.M. UNeXt: MLP-based rapid medical image segmentation network. arXiv 2022, arXiv:2203.04967. [Google Scholar] [CrossRef]
  45. Zhang, L.; Wei, W.; Qiu, B.; Luo, A.; Zhang, M.; Li, X. A Novel Ground-Based Cloud Image Segmentation Method Based on a Multibranch Asymmetric Convolution Module and Attention Mechanism. Remote Sens. 2022, 14, 3970. [Google Scholar] [CrossRef]
  46. Azad, R.; Kazerouni, A.; Azad, B.; Aghdam, E.K.; Velichko, Y.; Bagci, U.; Merhof, D. Laplacian-former: Overcoming the limitations of vision transformers in local texture detection. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2023; Springer: Cham, Switzerland, 2023; pp. 1–11. [Google Scholar] [CrossRef]
  47. Liu, X.; Gao, P.; Yu, T.; Wang, F.; Yuan, R.-Y. CSWin-UNet: Transformer UNet with cross-shaped windows for medical image segmentation. Inf. Fusion 2025, 113, 102634. [Google Scholar] [CrossRef]
  48. Liu, Y.; Zhu, H.; Liu, M.; Yu, H.; Chen, Z.; Gao, J. Rolling-UNet: Revitalizing MLP’s ability to efficiently extract long-distance dependencies for medical image segmentation. Proc. AAAI Conf. Artif. Intell. 2024, 38, 3819–3827. [Google Scholar] [CrossRef]
Figure 1. Visualization of our proposed method along with Fullgrad CAM [20], Grad-CAM [23], Grad-CAM++ [22], HiResCAM [23], LayerCAM [27], and Score-CAM [28].
Figure 2. The structure of Score-CAM. In Stage 1, activation maps are extracted by the feature extractor. In Stage 2, each activation map is applied as a mask to the original image, and the forward score of the masked image for the target class is obtained; Stage 2 is repeated N times, where N is the number of activation maps. Finally, the result is generated through a linear combination of the score-based weights and the activation maps. Stage 1 and Stage 2 share the same CNN module as the feature extractor.
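The procedure in the caption of Figure 2 can be summarized in code. The sketch below follows the Score-CAM recipe [28] under the assumption that `model` returns class scores of shape [1, num_classes] and that the Stage 1 activation maps are supplied by the surrounding pipeline; it is a simplified illustration, not the reference implementation.

```python
# Minimal sketch of the Score-CAM procedure described in Figure 2 [28].
# Assumptions (not from the paper's code): `model` maps an image batch to
# class scores of shape [1, num_classes], and `activations` are the Stage 1
# feature maps of shape [1, N, h, w] taken from a chosen CNN layer.
import torch
import torch.nn.functional as F

@torch.no_grad()
def score_cam(model, image, activations, target_class):
    """image: [1, 3, H, W]; returns a [1, 1, H, W] saliency map in [0, 1]."""
    _, n_maps, _, _ = activations.shape
    H, W = image.shape[-2:]
    scores = []
    for k in range(n_maps):
        # Upsample the k-th activation map and normalize it to [0, 1] as a mask.
        mask = F.interpolate(activations[:, k:k + 1], size=(H, W),
                             mode="bilinear", align_corners=False)
        mask = (mask - mask.min()) / (mask.max() - mask.min() + 1e-8)
        # Stage 2: forward the masked image and record the target-class score.
        scores.append(model(image * mask)[0, target_class])
    weights = torch.softmax(torch.stack(scores), dim=0)  # score-based weights
    # Linear combination of weights and activation maps, followed by ReLU.
    cam = torch.relu((weights.view(1, -1, 1, 1) * activations).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=(H, W), mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```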
Figure 3. The structure of multi-scale CAM.
Figure 4. The proposed model framework.
Figure 5. The structure of MSA.
Figure 6. Structure of the multi-scale residual feature aggregation (MRFA) module.
Figure 7. Qualitative examples of cloud segmentation on SWIMSEG and SWINSEG datasets.
Figure 8. Qualitative examples of cloud segmentation on the TCDD dataset.
Figure 9. Qualitative examples of cloud segmentation on the HRC_WHU dataset.
Figure 10. ROC curve of our method and seven other algorithms.
Figure 11. SWINSEG comparative experimental prediction plots.
Figure 12. SWIMSEG comparative experimental prediction plot.
Table 1. Performance comparison between our algorithm and the other eight cloud image segmentation algorithms. Bold indicates the best option.

Type | Method | Recall ↑ | F1 Score ↑ | Error Rate ↓ | MIoU ↑
Day | U-Net [41] | 80.39% | 85.99% | 10.86% | 77.99%
Day | Att-UNet [42] | 87.23% | 89.66% | 8.17% | 82.91%
Day | Swin-UNet [43] | 88.82% | 90.70% | 7.60% | 84.40%
Day | UNeXt [44] | 89.78% | 91.00% | 7.31% | 84.80%
Day | MA-SegCloud [45] | 87.59% | 90.45% | 7.59% | 84.20%
Day | Laplacian [46] | 89.37% | 90.67% | 7.58% | 84.50%
Day | CSWin-UNet [47] | 88.05% | 90.82% | 7.82% | 84.52%
Day | Rolling-Unet [48] | 87.66% | 90.27% | 7.91% | 83.69%
Day | Ours | 92.02% | 92.22% | 6.38% | 86.72%
Night | U-Net | 68.90% | 76.24% | 15.53% | 65.03%
Night | Att-UNet | 78.74% | 83.67% | 12.65% | 73.65%
Night | Swin-UNet | 85.87% | 85.80% | 11.27% | 76.72%
Night | UNeXt | 81.75% | 84.98% | 12.04% | 75.71%
Night | MA-SegCloud | 77.21% | 82.77% | 12.31% | 72.70%
Night | Laplacian | 76.69% | 83.39% | 12.56% | 73.62%
Night | CSWin-UNet | 71.26% | 80.57% | 13.16% | 70.06%
Night | Rolling-Unet | 83.51% | 83.69% | 12.69% | 74.33%
Night | Ours | 91.72% | 88.69% | 9.21% | 80.39%
Day + Night | U-Net | 79.18% | 84.97% | 11.35% | 76.62%
Day + Night | Att-UNet | 83.71% | 87.91% | 8.99% | 80.60%
Day + Night | Swin-UNet | 88.51% | 90.19% | 7.98% | 83.59%
Day + Night | UNeXt | 88.93% | 90.36% | 7.81% | 83.84%
Day + Night | MA-SegCloud | 86.23% | 89.55% | 8.22% | 82.85%
Day + Night | Laplacian | 88.04% | 89.91% | 8.10% | 83.36%
Day + Night | CSWin-UNet | 86.28% | 89.74% | 8.38% | 83.00%
Day + Night | Rolling-Unet | 87.22% | 89.58% | 8.41% | 82.71%
Day + Night | Ours | 91.99% | 91.85% | 6.67% | 86.05%
Table 2. Performance comparison between our algorithm and the other eight cloud image segmentation algorithms on the TCDD and HRC_WHU datasets. Bold indicates the best option.

TCDD:
Method | Recall ↑ | F1 Score ↑ | Error Rate ↓ | MIoU ↑
U-Net | 87.29% | 83.22% | 9.30% | 75.81%
Att-UNet | 85.57% | 81.80% | 8.75% | 77.15%
Swin-UNet | 85.96% | 82.87% | 8.96% | 75.12%
UNeXt | 85.15% | 80.96% | 8.66% | 75.17%
MA-SegCloud | 86.72% | 80.74% | 11.80% | 73.41%
Laplacian | 83.33% | 80.14% | 9.60% | 73.77%
CSWin-UNet | 83.95% | 80.69% | 8.48% | 74.27%
Rolling-Unet | 87.41% | 82.67% | 9.30% | 74.93%
Ours | 88.20% | 83.71% | 7.38% | 82.69%

HRC_WHU:
Method | Recall ↑ | F1 Score ↑ | Error Rate ↓ | MIoU ↑
U-Net | 88.59% | 91.36% | 6.06% | 84.36%
Att-UNet | 89.65% | 91.90% | 5.71% | 85.34%
Swin-UNet | 90.93% | 91.81% | 5.64% | 85.32%
UNeXt | 89.38% | 89.92% | 6.85% | 82.08%
MA-SegCloud | 90.49% | 91.95% | 5.68% | 85.34%
Laplacian | 83.39% | 84.98% | 9.98% | 75.23%
CSWin-UNet | 89.12% | 92.23% | 5.23% | 85.88%
Rolling-Unet | 88.26% | 90.38% | 6.73% | 82.91%
Ours | 92.12% | 92.60% | 5.15% | 86.12%
Table 3. Comparison of the performance of the four segmentation models and the corresponding models with CAM added. The values in parentheses indicate the changes after adding CAM.

Type | Method | Recall ↑ | F1 Score ↑ | Error Rate ↓ | MIoU ↑
Day | U-Net | 80.39% | 85.99% | 10.86% | 77.99%
Day | U-Net+CAM | 85.67% (+5.28%) | 88.85% (+2.86%) | 8.53% (−2.33%) | 81.92% (+3.93%)
Day | Att-UNet | 87.23% | 89.66% | 8.17% | 82.91%
Day | Att-UNet+CAM | 89.80% (+2.57%) | 90.80% (+1.14%) | 7.16% (−1.01%) | 84.74% (+1.83%)
Day | UNeXt | 89.78% | 91.00% | 7.31% | 84.80%
Day | UNeXt+CAM | 89.77% (−0.01%) | 91.23% (+0.23%) | 7.31% (−0.00%) | 85.08% (+0.28%)
Day | Laplacian | 89.37% | 90.67% | 7.58% | 84.50%
Day | Laplacian+CAM | 89.90% (+0.53%) | 91.31% (+0.64%) | 7.07% (−0.51%) | 85.50% (+1.00%)
Night | U-Net | 68.90% | 76.24% | 15.53% | 65.03%
Night | U-Net+CAM | 78.00% (+9.10%) | 82.13% (+5.89%) | 12.82% (−2.71%) | 71.96% (+6.93%)
Night | Att-UNet | 78.74% | 83.67% | 12.65% | 73.65%
Night | Att-UNet+CAM | 83.37% (+4.63%) | 83.32% (−0.35%) | 13.44% (+0.79%) | 73.84% (+0.19%)
Night | UNeXt | 81.75% | 84.98% | 12.04% | 75.71%
Night | UNeXt+CAM | 84.42% (+2.67%) | 87.09% (+2.11%) | 10.61% (−1.43%) | 78.31% (+2.60%)
Night | Laplacian | 76.69% | 83.39% | 12.56% | 73.62%
Night | Laplacian+CAM | 78.04% (+1.35%) | 84.33% (+0.94%) | 12.55% (−0.01%) | 74.43% (+0.81%)
Day + Night | U-Net | 79.18% | 84.97% | 11.35% | 76.62%
Day + Night | U-Net+CAM | 84.86% (+5.68%) | 88.14% (+3.17%) | 8.98% (−2.37%) | 80.87% (+4.25%)
Day + Night | Att-UNet | 83.71% | 87.91% | 8.99% | 80.60%
Day + Night | Att-UNet+CAM | 87.33% (+3.62%) | 89.15% (+1.24%) | 8.50% (−0.49%) | 82.28% (+1.68%)
Day + Night | UNeXt | 88.93% | 90.36% | 7.81% | 83.84%
Day + Night | UNeXt+CAM | 89.20% (+0.27%) | 90.80% (+0.44%) | 7.66% (−0.15%) | 84.36% (+0.52%)
Day + Night | Laplacian | 88.04% | 89.91% | 8.10% | 83.36%
Day + Night | Laplacian+CAM | 88.65% (+0.61%) | 90.58% (+0.67%) | 7.65% (−0.45%) | 84.33% (+0.97%)
Table 4. Ablation experiment results.

Configuration | Recall ↑ | F1 Score ↑ | Error Rate ↓ | MIoU ↑
Baseline | 89.35% | 89.81% | 8.16% | 83.11%
+ Multi-Scale Input | 89.62% | 91.40% | 7.16% | 85.51%
+ Multi-Scale Saliency Attention (MSA) | 90.55% | 91.63% | 6.81% | 85.87%
+ Multi-Scale Residual Feature Aggregation (MRFA) | 91.99% | 91.85% | 6.67% | 86.05%
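For context, the recall, F1 score, error rate, and MIoU values reported in Tables 1–4 can be computed from pixel-level confusion counts roughly as in the sketch below; the error-rate and MIoU definitions used here (fraction of misclassified pixels and the mean of the cloud- and sky-class IoUs) are assumptions about the paper's conventions, not its published evaluation code.

```python
# Minimal sketch: recall, F1 score, error rate, and MIoU from a predicted and
# a ground-truth binary cloud mask. Metric definitions are assumptions.
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray):
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()        # cloud pixels correctly detected
    fp = np.logical_and(pred, ~gt).sum()       # sky pixels labeled as cloud
    fn = np.logical_and(~pred, gt).sum()       # cloud pixels missed
    tn = np.logical_and(~pred, ~gt).sum()      # sky pixels correctly kept

    recall = tp / (tp + fn + 1e-8)
    precision = tp / (tp + fp + 1e-8)
    f1 = 2 * precision * recall / (precision + recall + 1e-8)
    error_rate = (fp + fn) / (tp + fp + fn + tn)
    miou = 0.5 * (tp / (tp + fp + fn + 1e-8) + tn / (tn + fp + fn + 1e-8))
    return recall, f1, error_rate, miou
```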
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
