1. Introduction
With the rapid advancement of satellite remote sensing technology, high-resolution remote sensing images have found widespread application across numerous fields. However, cloud cover has posed a formidable challenge in remote sensing image processing, as it impedes the capture of surface information by optical sensors, resulting in data loss and potential biases in subsequent applications [1]. Consequently, cloud detection has remained a pivotal and challenging research focus in the preprocessing of remote sensing images within the remote sensing community.
Convolutional Neural Networks (CNNs) [2], leveraging local receptive fields, weight sharing, translation invariance, and pooling, excel in extracting spectral and geospatial features, enabling deep architectures for capturing high-level semantic representations crucial for cloud detection in remote sensing. DCNet [3] integrates deformable convolutions within the encoder–decoder framework to adaptively capture cloud morphology, addressing spatial information loss. The encoder–decoder architecture, however, faces issues of spatial information loss and feature dilution, prompting a shift in cloud detection methods from purely convolutional structures towards global feature extraction techniques, such as attention modules. Several studies [4,5] have begun to incorporate attention mechanisms into UNet [6]. CDUNet [7] incorporates Spatial Prior Self-Attention (SPSA) alongside a dual attention mechanism, which effectively reduces feature redundancy and enhances the network’s ability to detect clouds. CDnetV2 [8] and CFCA-Net [9] further leverage channel and spatial attention modules to refine feature maps and highlight critical color, texture, and spatial cloud features. Zhang et al.’s proposed cloud detection model [10] combines probabilistic upsampling with attention mechanisms to enhance cloud regions, while SEUNet++ [11] and AFMUNet [12] integrate lightweight channel/spatial attention modules for adaptive field of view adjustment, improving cloud detection performance.
To enhance deep learning networks’ ability to capture intricate features, researchers have explored multi-feature fusion strategies. Zhang et al. introduced CSD-Net [13], incorporating Multi-scale Feature Fusion (MFF) and Controllable Depth Supervision and Feature Fusion (CDSFF), effectively capturing salient semantic features for clouds and snow. Wang et al. designed ABNet [14] with an All-scale Feature Fusion (AF) module, enabling decoders to integrate features from all resolutions. GCDB-UNet [15] augmented UNet with Global Context Dense Blocks, boosting thin cloud detection. Wu et al.’s boundary-based model [16] strengthened multi-scale feature extraction for clouds of varying sizes. CRSNet [17] employed a Multi-scale Global Attention module, improving channel and spatial information for higher accuracy. BABFNet [18] used a boundary prediction branch to enhance cloud detection in complex regions. GANet [19] introduced GFAM and PAPM, bridging spatial details and high-level semantics while extracting multi-scale global features. MCDNet [20] leveraged MSFF to compensate for spatial information loss, improving sensitivity to fragmented clouds. Furthermore, augmenting feature inputs [21,22,23,24,25] demonstrates the effectiveness of multi-feature fusion in enriching network capabilities for cloud detection.
CNNs excel in local connections but struggle with global context capture, whereas Transformers, exemplified by ViT [26] and Swin Transformer [27], excel at global feature extraction and context understanding. Li et al. proposed a novel cloud and shadow (CS) detection algorithm, CSDFormer [28], which uses a hierarchical Transformer structure in the encoder stage to extract CS features. Each Transformer layer incorporates multiple multi-head self-attention mechanisms for computing long-range dependencies between pixels. Several studies combine Transformers with CNNs to enhance cloud detection networks. Lu et al. [29] and Gong et al. [30] integrate Swin Transformer and CNNs for semantic and spatial detail extraction. CNN-TransNet [31] employs a hybrid CNN-Transformer with Differential Feature Enhancement for improved cloud discrimination. Gu et al.’s [32] hybrid model incorporates Axial Shared Hybrid Attention and an Attention Guidance Module for efficient fusion. CD-CTFM [33] utilizes a lightweight CNN-Transformer encoder–decoder with attention gates. MAFNet [34] combines ResNet50 and Swin Transformer with Multi-branch Attention Fusion. The MMA network [35] leverages multi-scale overlapping blocks, PVT, strip convolution, and Multi-scale Global Aggregation for rich semantic extraction. These hybrid approaches demonstrate the potential of combining CNNs and Transformers for advanced cloud detection. Xu et al. proposed an Integration Transformer with gradient-aware feature aggregation (TransGA-Net) [36]. This framework employs Transformers as encoders, significantly enhancing the modeling capability for global features and long-range dependencies. Gao et al. proposed SwinCloud [37] for cloud detection in the thermal infrared spectral range. The network augments the Swin Transformer’s window attention module with a CNN-based parallel pathway to effectively model global-local information. Zhou et al. proposed a dual-branch collaborative optimization network (DFC-Net) [38], which achieved cross-scale feature fusion and detail enhancement through a dual-branch interaction mechanism that combines global context modeling and local detail perception.
However, state-of-the-art cloud detection models predominantly rely on computationally intensive backbone networks, leading to substantial parameter counts and slow inference speeds. This hinders their deployment on resource-constrained platforms such as edge devices or satellite-borne systems. Existing approaches are often caught in a trade-off: they either enhance detection accuracy by increasing network complexity—sacrificing practicality—or prioritize efficiency through excessive architectural simplification. This simplification particularly compromises the model’s ability to represent weak features (e.g., thin clouds) and precise boundaries, which are critical in challenging scenarios like urban landscapes and snow-cloud coexistence, thereby fundamentally limiting detection performance under these conditions.
To address these limitations, this paper introduces a lightweight cloud detection network. The proposed model comprises four core components: a dual-path feature enhancer, a lightweight backbone network, a progressive feature pyramid decoder, and a bidirectional gated fusion module. Through this systematic and lightweight-oriented design, the network maintains competitive detection accuracy while significantly improving computational efficiency. The main contributions of this work are summarized as follows:
(1) We propose a dual-path feature enhancer, deployed at the front end of the network. It consists of two dedicated paths that respectively capture global structural information and local detail features. An adaptive fusion mechanism is introduced to effectively integrate multi-scale representations from both paths. This design considerably enriches the expressiveness of input features and reduces reliance on a heavy backbone network.
(2) We design a bidirectional gated fusion module. It employs a bidirectional gated attention mechanism to adaptively select informative features from both the multi-scale feature stream (provided by the front-end enhancer) and the deep semantic stream (from the backbone decoder). Combined with dynamic convolution and an enhanced attention mechanism, the module establishes a “selection–fusion–refinement” workflow that preserves spatial details and strengthens semantic consistency during feature fusion.
(3) Experimental results demonstrate that the proposed network achieves competitive detection accuracy on the HRC_WHU dataset while operating with low computational overhead. This results in an effective balance between accuracy and efficiency, offering a practical solution for deploying cloud detection models in resource-limited scenarios.
2. Methodology
Our proposed lightweight cloud detection network comprises four core components: a Dual-Path Feature Enhancer, a ResNet-18 backbone, a Progressive Feature Pyramid Decoder, and a Bidirectional Gated Fusion Module. First, the Dual-Path Feature Enhancer extracts rich multi-scale contextual features to minimize reliance on a complex backbone. These features are subsequently encoded into deep semantic representations by the ResNet-18 backbone. The Progressive Feature Pyramid Decoder then hierarchically decodes the backbone features, while the Bidirectional Gated Fusion Module adaptively merges the decoder’s deep semantic features with the multi-scale features from the front-end enhancer through gated attention and dynamic convolution. This integrated architecture achieves an effective balance between detection accuracy and computational efficiency.
Figure 1 depicts the overall framework of the model.
2.1. Overall Design Rationale
The proposed architecture follows a progressive refinement pipeline where each component addresses a specific challenge and its output serves as optimized input for the subsequent module.
We begin with raw remote sensing images that contain complex backgrounds, noise, and varying cloud scales. Directly feeding such inputs into a deep backbone would force it to simultaneously handle low-level denoising and high-level semantic abstraction, competing objectives that reduce efficiency. The dual-path feature enhancer therefore first performs low-level feature refinement, suppressing background clutter while amplifying salient primitives such as cloud edges and thin clouds. This provides a cleaner, more structured representation for the backbone.
With the enhancer handling low-level extraction, the ResNet-18 backbone can focus its limited capacity on building hierarchical semantic representations through progressive downsampling, generating a multi-scale feature pyramid. This functional decoupling, where the enhancer handles spatial primitives and the backbone handles semantic abstraction, justifies the use of a lightweight backbone without sacrificing performance.
The backbone’s feature pyramid contains rich semantics but at varying resolutions. Directly concatenating these features often leads to semantic misalignment between levels. The progressive decoder addresses this through top-down recursive fusion: deeper semantic features are gradually refined with shallower spatial information, while horizontal alignment ensures feature consistency before each fusion step.
Finally, we must integrate two semantically distinct streams: the enhancer’s output rich in spatial primitives and the decoder’s output rich in semantic context. The bidirectional fusion module projects both streams into a unified space, applies gated attention for channel-wise adaptive recalibration, and uses dynamic convolution for sample-adaptive cross-stream interactions, ensuring the fused representation leverages the strengths of both streams.
In essence, the architecture progresses from low-level refinement to hierarchical semantics, then to consistent multi-scale decoding, and finally to adaptive cross-stream fusion, with each component motivated by the specific needs of its position in this pipeline.
2.2. Dual-Path Feature Enhancer
We propose a hierarchical dual-path feature enhancement architecture that progressively integrates multi-level features through attention-guided enhancement and gated fusion mechanisms.
The concept of dual-path design has been explored in prior works such as BiSeNet [
39], where spatial and context paths operate in parallel to preserve details and capture context. However, our Dual-Path Enhancer differs fundamentally. First, it is positioned as a front-end preprocessing module before the backbone, rather than a parallel branch throughout the network. Second, its purpose is low-level feature refinement, enabling the lightweight backbone to focus solely on semantic modeling. This serialized decoupling contrasts with BiSeNet’s parallel complementarity. Third, its output is later fused with decoder features via bidirectional gating, rather than used directly for prediction. These distinctions position our enhancer as an input refinement mechanism rather than a segmentation backbone.
The entire process is illustrated in Figure 2. This design enables adaptive extraction of complementary information from different feature sources: one path preserves global structural information to provide a stable foundation, while the other focuses on extracting and amplifying local detail features (e.g., thin clouds and edge details) that may be overlooked in the basic representation. Separating them early prevents the dilution of global structure by local details, allowing each path to specialize before fusion, after which fine-grained weight adjustment optimizes the final feature representation.
Let the base feature map and the enhanced feature map each have size C × H × W, where C, H, and W represent the number of channels, height, and width, respectively. The complete processing pipeline comprises two core stages:
2.2.1. Attention Feature Enhancement Stage
First, the original features are recalibrated through an attention mechanism. This stage generates spatial-channel adaptive attention weights that enhance useful features while suppressing redundant information:
Here, the feature fusion module is implemented as a lightweight 1 × 1 convolutional layer that reduces the concatenated dual-path features (2C channels) to a compact representation of C channels, followed by batch normalization and ReLU activation. This design enables cross-path information exchange at low computational cost. The attention module first applies global average pooling to aggregate spatial information, then captures channel-wise dependencies through a two-layer bottleneck structure (with reduction ratio 4), and finally generates spatial-channel attention weights via a sigmoid activation. The output is split into two sets of attention weights, one for the base path and one for the enhanced path. This mechanism allows the model to adaptively emphasize informative channels and spatial regions while suppressing redundant ones.
The enhanced features are then obtained through element-wise multiplication:
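As a compact summary of this stage, the two steps can be sketched in symbols. All symbol names below are assumptions introduced for illustration and are not taken from the original notation: F_b and F_e are the base and enhanced feature maps, \phi is the 1 × 1 fusion module, \psi is the attention module, and A_b and A_e are the two attention weights obtained by splitting its output.

```latex
\begin{aligned}
F_{f} &= \phi\big([F_{b};\, F_{e}]\big), \qquad F_{f} \in \mathbb{R}^{C \times H \times W},\\
[A_{b},\, A_{e}] &= \psi(F_{f}),\\
\hat{F}_{b} &= A_{b} \odot F_{b}, \qquad \hat{F}_{e} = A_{e} \odot F_{e}.
\end{aligned}
```

Here [ · ; · ] denotes channel-wise concatenation and ⊙ element-wise multiplication, with the attention weights broadcast over any dimensions they do not span.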
2.2.2. The Gating Fusion Stage
While attention provides spatial-channel recalibration, adaptively balancing the relative contributions of the two enhanced streams remains a challenge. To address this, we introduce a gating mechanism that learns optimal fusion weights based on the combined feature context:
Here, the gating weight generation module consists of two sequential 1 × 1 convolutional layers: the first reduces the channel dimension from C to a smaller intermediate width with ReLU activation, and the second produces a 2-channel gating map. A Softmax activation is applied along the channel dimension to normalize the gate weights, yielding two gate maps that sum to one at every spatial location. This ensures the two feature streams compete proportionally, with their relative importance adaptively determined per spatial location. This design enables the model to dynamically determine how much each feature stream should contribute based on local context, rather than using fixed fusion ratios.
The final output is then obtained through gated weighted summation, which fuses the two streams according to the learned gate weights:
This weighted summation ensures that the final representation preserves both global structural stability from the base path and local detail sensitivity from the enhanced path, with their relative importance adaptively adjusted per spatial location, a key advantage over simple concatenation or addition.
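The gated weighted summation can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the single-channel toy maps, and the hand-written gate logits (which in the network would come from the two 1 × 1 convolutions) are all assumptions made for illustration.

```python
import math


def softmax2(a, b):
    """Numerically stable softmax over two logits; the two gates sum to one."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    s = ea + eb
    return ea / s, eb / s


def gated_sum(base, enh, logits_base, logits_enh):
    """Fuse two single-channel maps (lists of rows) with per-pixel softmax gates."""
    fused = []
    for rb, re_, lb, le in zip(base, enh, logits_base, logits_enh):
        row = []
        for xb, xe, a, b in zip(rb, re_, lb, le):
            g_base, g_enh = softmax2(a, b)
            row.append(g_base * xb + g_enh * xe)
        fused.append(row)
    return fused


# Hypothetical 2 x 2 inputs: the base map is all 1.0, the enhanced map all 3.0.
base = [[1.0, 1.0], [1.0, 1.0]]
enh = [[3.0, 3.0], [3.0, 3.0]]
# Hypothetical gate logits: equal, base-dominant, enhanced-dominant, equal.
logits_base = [[0.0, 5.0], [-5.0, 0.0]]
logits_enh = [[0.0, -5.0], [5.0, 0.0]]

fused = gated_sum(base, enh, logits_base, logits_enh)
for row in fused:
    print(row)
```

Where the logits are equal the output is the plain average of the two streams; where one logit dominates, the output follows that stream almost exclusively, which is the per-location behavior described above.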
2.3. Backbone
In the proposed architecture, the backbone network serves the crucial function of extracting hierarchical semantic features from preprocessed inputs. After thorough evaluation of the trade-off between computational efficiency and representational capacity, ResNet-18 was selected as the foundational backbone. This choice is justified by the fact that the preceding dual-path enhancement module already generates rich and well-structured primary features. By performing initial feature extraction and refinement, it reduces the complexity of the input space and provides the backbone with a cleaner, more informative starting point. Consequently, the backbone is relieved from low-level feature learning responsibilities and can focus its limited capacity on building hierarchical semantic feature pyramids, making a lightweight architecture like ResNet-18 sufficient for the task.
The ResNet-18 backbone comprises four consecutive residual layers that collectively generate a multi-resolution feature pyramid through progressive downsampling. The forward propagation process can be formally described as follows:
Here, the i-th residual layer produces the i-th level of the output feature set, which constitutes a feature pyramid covering resolutions from 1/4 to 1/32. This multi-scale feature representation provides rich spatial and semantic information for the subsequent decoder and segmentation head.
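To make the pyramid concrete, the following sketch lists the feature shapes a standard ResNet-18 produces at its four residual stages. The stage widths (64, 128, 256, 512) and strides (4, 8, 16, 32) are those of the standard architecture; the 256 × 256 input size and the function name are assumptions for illustration, and any modification the paper makes to the stem is not reflected here.

```python
def resnet18_pyramid_shapes(height, width):
    """Return (channels, H, W) for the four residual stages of a standard ResNet-18."""
    channels = (64, 128, 256, 512)   # standard ResNet-18 stage widths
    strides = (4, 8, 16, 32)         # cumulative downsampling factor per stage
    return [(c, height // s, width // s) for c, s in zip(channels, strides)]


# Hypothetical input size, chosen to be divisible by 32.
shapes = resnet18_pyramid_shapes(256, 256)
for level, shape in enumerate(shapes, start=1):
    print(f"stage {level}: {shape}")
```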
To accelerate convergence and enhance generalization capability, the backbone network is initialized with weights pre-trained on the large-scale ImageNet dataset. This initialization strategy leverages generic feature representations learned from diverse visual tasks, effectively mitigating overfitting risks when training data is limited.
2.4. Progressive Feature Pyramid Decoder
To effectively leverage the multi-scale feature pyramid produced by the backbone network, this study designs a Progressive Feature Pyramid Decoder. Through lateral connections, progressive fusion, and gating mechanisms, the decoder achieves top-down, multi-level feature integration and ultimately produces high-quality segmentation feature maps.
2.4.1. Horizontal Connection and Feature Alignment
Let the feature pyramid output by the backbone network consist of four levels, one per residual layer. First, features at each level are projected into a unified feature space through lateral convolutional layers:
Here, every lateral feature map shares the same decoder channel dimension, with one lateral feature map produced for each level i. This operation standardizes encoder features with varying channel numbers into a consistent dimensionality, thereby facilitating subsequent fusion operations.
2.4.2. Top-Down Progressive Fusion
The decoding process begins with the deepest semantic features and progressively incorporates shallow spatial information. Each level i yields one refined feature map.
A key challenge in multi-scale fusion is the semantic gap between features from different levels, where deeper features carry rich semantics but coarse spatial details, while shallower features contain fine spatial information but limited semantics. To address this, we first align both resolution and semantic distribution before fusion.
For each level below the deepest, proceeding top-down, four steps are applied in order. First, the refined features from the next deeper level are upsampled to the spatial size of the current level, which resolves the resolution mismatch. Second, the horizontal alignment module aligns the semantic distribution of the lateral features to match that of the upsampled deeper features, reducing semantic discrepancies that would otherwise degrade fusion quality. This two-step alignment ensures that subsequent fusion operates on features that are both spatially and semantically consistent. Third, the gated skip connection module adaptively regulates feature contributions from the two sources. Fourth, the progressive fusion block further optimizes the resulting feature representation.
The initial condition sets the refined feature at the deepest level equal to the deepest lateral feature, so the deepest lateral features serve as the starting point for fusion.
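The top-down recursion can be sketched in symbols as follows. The notation is assumed for illustration only: L_i and P_i are the lateral and refined features at level i, Up is upsampling to the size of level i, and \mathcal{A}, \mathcal{G}, and \mathcal{F} stand for the horizontal alignment module, the gated skip connection, and the progressive fusion block. The argument lists are a plausible reading of the description, not a confirmed specification.

```latex
\begin{aligned}
P_{4} &= L_{4},\\
\hat{P}_{i+1} &= \mathrm{Up}(P_{i+1}), && i = 3, 2, 1,\\
\tilde{L}_{i} &= \mathcal{A}\big(L_{i},\, \hat{P}_{i+1}\big),\\
S_{i} &= \mathcal{G}\big(\tilde{L}_{i},\, \hat{P}_{i+1}\big),\\
P_{i} &= \mathcal{F}(S_{i}).
\end{aligned}
```

Under this reading, P_1 is the highest-resolution refined feature passed on to the output projection.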
2.4.3. Output Feature Generation
Following the top-down progressive integration, the highest-resolution refined feature is obtained. The decoder output is subsequently generated through an output projection layer, implemented as a two-stage convolutional projection that progressively compresses the feature dimension.
This decoder design offers several notable advantages: First, the lateral connections effectively preserve spatial details from the encoder. Second, the progressive fusion mechanism ensures smooth propagation of semantic information across different levels. Third, the gated skip connections enhance the model’s capacity for adaptive feature importance selection. Finally, the horizontal alignment module improves semantic consistency among features. From the perspective of feature optimization and gradient propagation, these design choices collectively ensure feature consistency by aligning multi-scale representations before fusion, and facilitate gradient flow through recursive connections that create shorter paths for backpropagation. These are key advantages over simpler decoder architectures.
2.5. Bidirectional Gated Fusion Module
To effectively integrate the multi-scale features extracted by the front-end enhancer with the deep semantic features from the backbone decoder, this study proposes a Bidirectional Gated Fusion Module, as shown in Figure 3. This module achieves multi-level adaptive feature fusion through bidirectional feature projection, gated attention mechanisms, and dynamic convolution operations, significantly enhancing the richness and discriminability of feature representations. This design aligns with the broader trend in the detection literature that leverages adaptive feature aggregation to enhance representation quality [40].
2.5.1. Feature Projection and Spatial Alignment
Given the dual-path features and the decoder features, both are first projected into a unified fusion space with a predefined fusion channel dimension through 1 × 1 convolutional layers. When spatial dimensions between input features are mismatched, bilinear interpolation is applied for spatial alignment:
This alignment operation ensures spatial consistency for subsequent fusion steps and establishes the foundation for effective feature interaction.
2.5.2. Bidirectional Gated Attention Mechanism
The core innovation of this module resides in its bidirectional gated attention mechanism, which generates channel-adaptive gating weights by leveraging global contextual information. Specifically, channel-wise statistics are first extracted through global average pooling:
To model the relationships between channels and generate adaptive weights, we pass the channel descriptors through a bottleneck structure that captures inter-channel dependencies:
The reduction ratio of 8 in the bottleneck compresses channel information to learn compact representations of channel relationships, while the sigmoid activation produces soft gating values in (0, 1). This design enables the model to emphasize interdependent channels while suppressing less relevant ones based on global context. The two weight matrices of the bottleneck are learnable parameters. The same computational process is applied to the decoder features to generate their gating weights.
Ultimately, the gating weights are applied to their corresponding features through element-wise multiplication:
where ⊙ denotes the element-wise multiplication operation.
This mechanism enables the model to adaptively emphasize information-rich channels while suppressing redundant or noisy ones, thereby achieving fine-grained regulation at the feature level.
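The channel gate described in this subsection can be sketched in pure Python for one stream. This is an illustration under stated assumptions, not the authors' code: the function name, the ReLU between the two bottleneck layers, the omission of biases, and the toy weights are all assumed. Only the global average pooling, the reduction ratio of 8, the sigmoid, and the element-wise gating come from the text.

```python
import math


def channel_gate(feat, w1, w2):
    """Squeeze-and-gate for one stream.

    feat: list of C channels, each a list of rows (H x W).
    w1:   (C // r) x C bottleneck weights (reduction), assumed bias-free.
    w2:   C x (C // r) bottleneck weights (expansion), assumed bias-free.
    Returns the per-channel gates and the gated feature map.
    """
    # Global average pooling: one descriptor per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feat]
    # Reduction layer followed by ReLU (the ReLU is an assumption).
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    # Expansion layer followed by sigmoid, giving gates in (0, 1).
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Element-wise multiplication, broadcasting each gate over its channel.
    gated = [[[g * x for x in row] for row in ch] for g, ch in zip(gates, feat)]
    return gates, gated


# Hypothetical toy input: C = 8 channels of size 1 x 2, so r = 8 gives 1 hidden unit.
feat = [[[float(c), float(c)]] for c in range(8)]
w1 = [[0.1] * 8]                       # 1 x 8
w2 = [[1.0]] * 4 + [[-1.0]] * 4        # 8 x 1: first four channels kept, last four suppressed

gates, gated = channel_gate(feat, w1, w2)
print([round(g, 3) for g in gates])
```

Applying the same function to the second stream with its own weights yields the other half of the bidirectional gating.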
2.5.3. Dynamic Feature Fusion
To further enhance the fusion performance, this module introduces a dynamic convolution operation. First, the bidirectional gated features are concatenated along the channel dimension. While gated attention provides channel-wise selection, effectively fusing the two streams also requires capturing spatial interactions that vary across samples. To address this, we introduce dynamic convolution that adapts its parameters to each input:
Unlike standard convolution with fixed kernels, dynamic convolution generates kernel weights conditioned on the input features, enabling sample-adaptive cross-stream interactions. This allows the model to capture varying feature relationships across different images (e.g., different cloud types or backgrounds). The dynamic convolution is followed by an enhanced attention module for further refinement.
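The text does not specify which form of dynamic convolution is used. As one common variant, the sketch below mixes K candidate 1 × 1 kernels with input-dependent softmax weights; the kernel-mixing form, the routing layer, the function names, and all numbers are assumptions chosen only to show how the effective kernel changes per sample.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def dynamic_conv1x1(feat, kernels, router):
    """Input-conditioned 1 x 1 convolution (kernel-mixing variant, assumed).

    feat:    C_in channels, each H x W (lists of rows).
    kernels: K candidate kernels, each a C_out x C_in matrix.
    router:  K x C_in matrix mapping pooled features to mixing logits.
    """
    pooled = [sum(sum(r) for r in ch) / (len(ch) * len(ch[0])) for ch in feat]
    pi = softmax([sum(w * p for w, p in zip(row, pooled)) for row in router])
    c_out, c_in = len(kernels[0]), len(kernels[0][0])
    # Aggregate the candidate kernels into one sample-specific kernel.
    mixed = [[sum(pi[k] * kernels[k][o][i] for k in range(len(kernels)))
              for i in range(c_in)] for o in range(c_out)]
    h, w = len(feat[0]), len(feat[0][0])
    out = [[[sum(mixed[o][i] * feat[i][y][x] for i in range(c_in))
             for x in range(w)] for y in range(h)] for o in range(c_out)]
    return pi, out


# Hypothetical setup: two input channels, one output channel, K = 2 candidate kernels.
kernels = [[[2.0, 2.0]], [[-1.0, -1.0]]]
router = [[10.0, 0.0], [0.0, 10.0]]

feat_a = [[[1.0]], [[0.0]]]   # sample dominated by channel 0
feat_b = [[[0.0]], [[1.0]]]   # sample dominated by channel 1

pi_a, out_a = dynamic_conv1x1(feat_a, kernels, router)
pi_b, out_b = dynamic_conv1x1(feat_b, kernels, router)
print(pi_a, out_a)
print(pi_b, out_b)
```

The two samples pass through the same layer, yet the first is processed almost entirely by the first candidate kernel and the second by the other, which is the sample-adaptive behavior the paragraph describes.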
2.5.4. Output Transformation and Feature Refinement
Finally, high-quality fused features are generated through the output projection layer:
The module returns identical feature pairs for subsequent processing, maintaining feature consistency while providing sufficient semantic information.
The bidirectional gated fusion module offers several significant advantages: First, the bidirectional projection mechanism ensures effective alignment and interaction of features from different sources. Second, the global context-based gated attention enables adaptive feature selection at the channel level. Third, the dynamic convolution operation enhances the model’s adaptability to input variations. Finally, the fully differentiable end-to-end design supports efficient gradient propagation and parameter optimization. Experimental results demonstrate that this module effectively improves feature representation quality and discriminability in complex visual understanding tasks.
2.6. Loss Function
This study employs a joint primary-auxiliary loss optimization strategy to balance the learning of deep semantic features and shallow detail features. The overall loss function comprises the main segmentation loss and an auxiliary supervision loss:
Here, both the main segmentation loss and the auxiliary supervision loss employ the cross-entropy loss function. The auxiliary loss weight follows a linear decay schedule and is dynamically adjusted throughout the training process:
Here, t denotes the current training epoch, and the schedule is parameterized by the total number of training epochs together with the initial and final weight values. The initial and final values were chosen by empirical tuning, as this linear decay schedule showed better convergence than fixed weights in preliminary experiments. This design allows the model to initially rely more heavily on auxiliary supervision for training stability, while progressively shifting focus toward primary task optimization in later stages.
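A minimal sketch of such a linearly decayed auxiliary weight is given below. The function names are hypothetical, the interpolation over t / T is one plausible reading of "linear decay", and the start and end values (0.4 and 0.1) are placeholders chosen for illustration, not the values used in the paper.

```python
def aux_loss_weight(epoch, total_epochs, w_start, w_end):
    """Linearly interpolate the auxiliary weight from w_start to w_end over training."""
    frac = min(max(epoch / total_epochs, 0.0), 1.0)  # clamp progress to [0, 1]
    return w_start + (w_end - w_start) * frac


def total_loss(main_loss, aux_loss, weight):
    """Joint objective: main segmentation loss plus weighted auxiliary loss."""
    return main_loss + weight * aux_loss


# Placeholder values (assumed): 100 epochs, weight decaying from 0.4 to 0.1.
for epoch in (0, 25, 50, 75, 100):
    print(epoch, round(aux_loss_weight(epoch, 100, 0.4, 0.1), 4))
```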
3. Experimental Setup and Results
This section elaborates on the datasets employed, parameter configurations set, and accuracy evaluation criteria adopted in the experiments. Following this, we conducted ablation experiments aimed at verifying the effectiveness and contribution of the modules. Lastly, we performed an in-depth qualitative and quantitative comparative analysis between the proposed model and other state-of-the-art methods across multiple datasets, to comprehensively evaluate its performance and robustness.
3.1. Dataset
The HRC_WHU dataset [41], created by the SENDIMAGE Lab of Wuhan University, is designed for cloud detection tasks in high-resolution remote sensing images. The dataset comprises 150 high-resolution images sourced from Google Earth, covering various regions of the globe and featuring five primary land cover types: water, vegetation, urban, snow, and barren, as presented in Figure 4. In our experiments, the HRC_WHU dataset was cropped into mutually non-overlapping, fixed-size image patches. The training set consists of 8800 patches, while the test set comprises 2800 patches.
In the HRC_WHU dataset ground truth masks, pixel values 0 and 1 represent background and cloud, respectively. Thus, in Figure 4, black pixels in row b correspond to background.
To increase the diversity of the dataset and improve the generalization ability of the model, during the training phase we apply random cropping followed by rescaling to the specified input size. The area ratio of the cropped region varies randomly between 15% and 75%.
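The cropping step of this augmentation can be sketched as follows. The function name, the choice to preserve the aspect ratio, the 256 × 256 patch size, and the fixed seed are assumptions for illustration; only the area-ratio range is taken from the text, and the subsequent resize back to the input size is not shown.

```python
import random


def random_crop_box(height, width, min_area, max_area, rng):
    """Sample a crop box whose area is a random fraction of the image area.

    The aspect ratio is kept equal to that of the image (an assumption).
    Returns (top, left, crop_height, crop_width).
    """
    ratio = rng.uniform(min_area, max_area)
    scale = ratio ** 0.5                      # side scale for the sampled area ratio
    crop_h = max(1, round(height * scale))
    crop_w = max(1, round(width * scale))
    top = rng.randint(0, height - crop_h)     # inclusive bounds keep the box inside
    left = rng.randint(0, width - crop_w)
    return top, left, crop_h, crop_w


rng = random.Random(0)  # fixed seed for a repeatable illustration
boxes = [random_crop_box(256, 256, 0.15, 0.75, rng) for _ in range(100)]
print(boxes[:3])
```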
3.2. Experimental Setup and Evaluation Metrics
Our experiments were conducted on a workstation with an NVIDIA RTX 3090 GPU (24 GB VRAM), using PyTorch 1.12.1 with CUDA 11.6 and cuDNN 8.4. All models were trained for 100 epochs in FP32 precision, with a batch size of 16 constrained by GPU memory. We adopted the AdamW optimizer (initial learning rate: , β1 = 0.9, β2 = 0.999) and applied weight decay (0.01).
A segmented learning rate scheduling mechanism is employed, comprising a warm-up phase followed by a cosine annealing phase:
Warm-up phase (while the current epoch is below the number of warm-up epochs):
Cosine annealing phase (for all remaining epochs):
The schedule is parameterized by the number of warm-up epochs, the initial learning rate, and the minimum and maximum bounds of the learning rate. This design ensures training stability through the warm-up phase while achieving fine-tuned optimization via cosine annealing in later stages.
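A warm-up plus cosine annealing schedule of this kind can be sketched as below. The linear shape of the warm-up, the function name, and every numeric value except the 100 training epochs (5 warm-up epochs, a peak of 1e-3, a floor of 1e-6) are assumptions for illustration and are not taken from the paper.

```python
import math


def learning_rate(epoch, total_epochs, warmup_epochs, lr_max, lr_min):
    """Linear warm-up to lr_max, then cosine annealing down to lr_min."""
    if epoch < warmup_epochs:
        # Warm-up: ramp linearly so the last warm-up epoch reaches lr_max.
        return lr_max * (epoch + 1) / warmup_epochs
    # Cosine phase: progress runs from 0 (end of warm-up) to 1 (end of training).
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Placeholder hyperparameters (assumed), with the 100 training epochs from the text.
for epoch in (0, 4, 5, 50, 100):
    print(epoch, f"{learning_rate(epoch, 100, 5, 1e-3, 1e-6):.6g}")
```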
To ensure reproducibility, we fixed random seeds and reported mean results over 3 runs.
The performance of cloud detection results is evaluated comprehensively using a variety of widely adopted quantitative assessment metrics, namely accuracy, precision, recall, F1-score, and intersection over union (IoU). Below are the definitions of these key metrics:
Here, TP, TN, FP, and FN stand for true positive, true negative, false positive, and false negative, respectively. The mean IoU (MIoU) is the arithmetic mean of the IoU over both classes (cloud and background).
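For reference, the standard definitions of these metrics can be computed from the four confusion-matrix counts as in the sketch below. The function names and the example counts are hypothetical; cloud is treated as the positive class, and the background IoU is obtained by swapping the roles of TP and TN.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def cloud_metrics(tp, tn, fp, fn):
    """Standard binary segmentation metrics with cloud as the positive class."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    iou_cloud = safe_div(tp, tp + fp + fn)
    iou_background = safe_div(tn, tn + fp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "iou": iou_cloud,
        "miou": (iou_cloud + iou_background) / 2,
    }


# Hypothetical pixel counts for illustration.
metrics = cloud_metrics(tp=40, tn=50, fp=5, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```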
3.3. Analysis of Feature Representation in Dual Paths
Experimental results from the dual-path network reveal a clear functional divergence, which is analyzed as follows.
3.3.1. Mechanism of Feature Representation and Attention
As presented in Figure 5, the Base branch’s feature map exhibits a relatively high mean value (0.3031), indicating generally large activated pixel values and strong responses. The visualization results reveal that this branch retains the majority of the original image’s structural and content information with clear contours, suggesting its role in learning global and fundamental representations. In contrast, the Enhance branch’s feature map has a lower mean value (0.1931), implying sparser feature activation and a more uniform overall response compared to the Base branch. However, it shows relatively higher activation in regions with significant texture variations. This phenomenon indicates that the Enhance branch is functionally geared not toward global reconstruction, but toward extracting and amplifying local detail information within the image.
Furthermore, the attention map analysis corroborates this functional divergence. While both branches attend to cloud regions as expected for a cloud detection task, a closer examination of their activation patterns reveals distinct yet complementary specializations.
The attention weights of the Base branch are more concentrated in areas with rich information and distinct texture features, such as the cloud center regions. In contrast, the Enhance branch exhibits a more dispersed attention pattern throughout the image, showing a more significant response in low-contrast and texture-weakened areas (such as thin clouds), features that are crucial for accurately classifying cloud layers. This is clearly demonstrated in the green and blue boxes in Figure 5, where the thin cloud regions in the original image show significantly higher attention values in the Enhance branch, confirming its role in restoring fine texture details. Due to the smoothing process in the base feature extraction, the activation level of the Base branch is lower in such areas. In the thick and uniform cloud regions marked by the pink boxes (with a smooth appearance and minimal texture features), the attention value of the Enhance branch is slightly higher than that of the Base branch, indicating that it can meaningfully contribute to feature representation even in texture-less areas. Overall, the Enhance branch consistently maintains higher attention across all cloud types, especially in thin cloud regions where detail restoration is most critical, and also plays an important role in smooth and texture-less areas. Although the Base branch shows slightly lower absolute attention values, it provides the essential structural background. When combined with the continuous enhancement effect of the Enhance branch, it enables more comprehensive cloud detection than either branch acting alone.
3.3.2. Attention and Entropy
As evidenced by the data in
Figure 6, the Base and Enhance branches exhibit distinct attention patterns and information entropy profiles, underscoring their complementary roles. The Base branch, characterized by a dispersed attention distribution (with the top 30% high-attention areas covering only 46.4% of the image) and high information entropy (15,972.31), ensures comprehensive coverage and stable preservation of the global structure. In contrast, the Enhance branch employs a highly concentrated attention mechanism (the top 30% high-attention areas cover 72.9%), coupled with a lower information entropy (13,234.88), which facilitates the precise localization and enhanced processing of critical regions. This collaborative dynamic between the two branches provides an effective solution for image processing tasks, successfully balancing global fidelity with local refinement.
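The text does not state how the two statistics above are defined. As a minimal sketch, assume that concentration is the share of total attention mass held by the top 30% of pixels and that entropy is the Shannon entropy of the attention map normalized to sum to one. The function names and the toy 4×4 maps are hypothetical. Since the Shannon entropy of a normalized map cannot exceed the base-2 logarithm of the pixel count, the reported magnitudes suggest a different, unnormalized definition, so this illustrates the idea rather than reproducing the numbers.

```python
import math

def top_fraction_mass(attn, fraction=0.3):
    """Share of total attention held by the highest-valued `fraction` of pixels."""
    flat = sorted((v for row in attn for v in row), reverse=True)
    k = max(1, round(fraction * len(flat)))
    total = sum(flat)
    return sum(flat[:k]) / total if total > 0 else 0.0

def shannon_entropy(attn):
    """Shannon entropy in bits of the map after normalizing it to sum to 1."""
    flat = [v for row in attn for v in row]
    total = sum(flat)
    # Zero-valued pixels contribute nothing and are skipped to avoid log(0).
    return -sum((v / total) * math.log2(v / total) for v in flat if v > 0)

# Toy 4x4 maps (illustrative only, not data from the paper).
uniform = [[1.0] * 4 for _ in range(4)]   # fully dispersed attention
peaked = [[0.0] * 4 for _ in range(4)]
peaked[0][0] = 1.0                        # all attention on a single pixel

print(top_fraction_mass(uniform), shannon_entropy(uniform))
print(top_fraction_mass(peaked), shannon_entropy(peaked))
```

Under these assumed definitions, a dispersed map yields a low top-30% share and high entropy, while a concentrated map yields the opposite, which is the qualitative contrast drawn between the two branches.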
3.3.3. Correlation Analysis
As evidenced by the data in
Figure 7, the analysis of spatial correlation and activation values reveals a high degree of functional differentiation and complementarity between the Base and Enhance branches, reflected in their extremely low spatial correlation (0.0107) and activation comparison value (0.0041). These low values reflect structured complementarity rather than noise, as the scatter plot shows clustered distributions (indicating systematic spatial specialization) and the sample-wise comparison reveals consistent activation patterns across images. This statistical evidence confirms their distinct yet complementary roles: the Base branch serves as a foundation, providing a global and stable structural representation, whereas the Enhance branch acts as a specialized enhancer. It is responsible for precisely restoring and enhancing local details and weak features that are overlooked or smoothed out by the Base branch through selective and subtle activation modulation. The combined strategy of a “global foundation” with “local modulation” thus establishes an effective paradigm for high-quality image processing.
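A spatial correlation of this kind is commonly computed as the Pearson coefficient between the flattened feature maps of the two branches. The sketch below assumes that definition; the paper does not confirm it, and the function name and toy maps are hypothetical.

```python
import math

def pearson_correlation(map_a, map_b):
    """Pearson correlation between two equally sized 2-D maps, flattened."""
    x = [v for row in map_a for v in row]
    y = [v for row in map_b for v in row]
    if len(x) != len(y):
        raise ValueError("maps must contain the same number of pixels")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant map carries no spatial pattern to correlate
    return cov / math.sqrt(var_x * var_y)

# Toy maps (illustrative only): the second responds where the first does not.
base_map = [[1.0, 2.0], [3.0, 4.0]]
enhance_map = [[1.0, -1.0], [-1.0, 1.0]]
print(pearson_correlation(base_map, enhance_map))
```

A coefficient near zero, as in the toy pair, means the two maps vary independently across space. That is consistent with the complementary specialization described above, although a low value alone cannot distinguish it from noise, which is why the clustered scatter plot matters.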
3.4. Decoding the Role of the Bidirectional Gated Fusion Module
The proposed bidirectional gated fusion module demonstrates exceptional feature fusion capability across multiple test samples, as presented in
Figure 8. Its hierarchical design first applies a bidirectional gated attention mechanism to weight high-resolution details and semantic information selectively. This “feature purification” is crucial for the subsequent deep fusion.
Quantitative results on three representative samples (11_12.png, 19_69.png, 24_67.png) confirm the module’s efficacy. It achieves significant feature energy enhancement (by factors of 6.26×, 7.37×, and 7.04×, respectively), a nonlinear effect primarily attributable to the dynamic convolution layer. This layer intelligently integrates gated features through learnable kernel parameters, producing a synergistic effect beyond simple superposition.
Furthermore, the module optimizes the feature distribution via an enhanced attention mechanism, improving semantic consistency while preserving rich details. The stability of feature diversity is significantly increased (by factors of 1.32×, 1.42×, and 1.45×). This is reflected in the standard deviation of the features, which is optimized from 0.455 in the decoder features to a range of 0.5386–0.6002 after fusion, striking an ideal balance between distribution breadth and stability. Notably, sample 24_67.png achieved a 7.04× energy gain and a 1.45× diversity improvement despite a relatively low initial feature mean (0.3291), underscoring the module’s robust adaptability.
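The exact definitions of "feature energy" and "feature diversity" are not given in the text. As a hedged illustration, the sketch below assumes energy is the mean squared activation and diversity is the population standard deviation, and it reports each gain as the ratio of the fused value to the decoder value. All names and the toy feature vectors are hypothetical.

```python
import statistics

def feature_energy(values):
    """Mean squared activation (assumed definition of feature energy)."""
    return sum(v * v for v in values) / len(values)

def gain(before, after, stat):
    """Ratio of a statistic after fusion to the same statistic before fusion."""
    return stat(after) / stat(before)

# Toy flattened features (illustrative only, not data from the paper).
decoder_feat = [0.5, -0.5, 0.5, -0.5]
fused_feat = [1.0, -1.0, 1.0, -1.0]

energy_gain = gain(decoder_feat, fused_feat, feature_energy)
diversity_gain = gain(decoder_feat, fused_feat, statistics.pstdev)
print(f"energy gain: {energy_gain:.2f}x, diversity gain: {diversity_gain:.2f}x")
```

Under these assumptions, energy depends on both the mean and the spread of the activations, so an energy gain much larger than the squared diversity gain would indicate that fusion also raises the mean activation level.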
Finally, an output projection layer with 256-channel capacity ensures the enhanced features possess sufficient expressive power for subsequent cloud detection tasks, providing high activation intensity, rich diversity, and strong discriminability.
In summary, the module constructs an efficient pipeline through a serial process of “gated selection → dynamic fusion → attention refinement → capacity retention.” Its core advantage lies in performing adaptive, nonlinear, multi-dimensional deep fusion, thereby combining the “breadth” of high-resolution details with the “depth” of semantic information to generate high-quality feature representations with significant gains in both energy and diversity.
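To make the "gated selection" step concrete, the following is a deliberately simplified sketch, not the module's actual implementation. It assumes that "bidirectional" means each stream produces a sigmoid gate for the other, and it replaces the dynamic convolution and attention refinement stages with a plain weighted sum. The scalar weights `w_d`, `w_s`, `b_d`, and `b_s` are hypothetical stand-ins for learned parameters.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bidirectional_gated_fusion(detail, semantic, w_d=1.0, w_s=1.0, b_d=0.0, b_s=0.0):
    """Fuse two aligned feature vectors with cross-conditioned sigmoid gates."""
    fused = []
    for d, s in zip(detail, semantic):
        gate_d = sigmoid(w_d * s + b_d)  # semantic stream controls how much detail passes
        gate_s = sigmoid(w_s * d + b_s)  # detail stream controls how much semantics passes
        fused.append(gate_d * d + gate_s * s)
    return fused

# Toy features (illustrative only).
detail_feat = [1.0, 2.0]
semantic_feat = [3.0, 4.0]
print(bidirectional_gated_fusion(detail_feat, semantic_feat))
# With zero weights and biases both gates equal 0.5, so fusion reduces to averaging.
print(bidirectional_gated_fusion(detail_feat, semantic_feat, w_d=0.0, w_s=0.0))
```

The zero-weight case shows why gating generalizes fixed fusion: plain averaging is one point in the gate's parameter space, and learning moves the gates away from it wherever one stream is more informative.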
3.5. Ablation Experiments
To validate the individual contributions of the proposed dual-path architecture, gated fusion module, and auxiliary output mechanism, we conducted systematic ablation studies. As shown in
Table 1, all experiments used identical datasets and training settings. We quantitatively evaluated performance improvements by incrementally adding each module.
Our baseline (M1) is a standard U-Net architecture with a ResNet-18 encoder and a basic decoder consisting of upsampling layers and convolutional blocks. We first introduced the dual-path feature extraction module (M2). This addition yielded a significant performance gain, increasing the mIoU from 89.04% to 90.91% (+1.87 pp). This confirms the effectiveness of the dual-path design, which captures both high-level semantics and multi-scale details through its parallel base and enhance paths, thereby constructing a more discriminative feature representation.
Building upon the dual-path architecture, we integrated the bidirectional gated fusion module (M3). The results show that the mIoU was further improved to 92.02% (+1.11 pp over M2). This indicates that simple concatenation or addition is suboptimal for fusion. In contrast, the gating mechanism adaptively learns the interdependencies between the two paths and dynamically adjusts their contribution weights, leading to more efficient and robust feature fusion that fully leverages their complementary advantages.
Finally, we incorporated an auxiliary output supervision into M3 to form our complete proposed model (M4). This model achieved optimal performance across all metrics, with mIoU, F1-score, and OA reaching 92.82%, 96.27%, and 96.31%, respectively. Although the absolute improvement from the auxiliary output was smaller, it played a critical role in stabilizing the training process and mitigating gradient vanishing in deep networks, thereby contributing to the final performance.
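Auxiliary supervision is usually implemented by adding a weighted loss on an intermediate prediction to the main loss. The sketch below assumes a binary cross-entropy loss on both outputs and an auxiliary weight of 0.4. Neither the loss function nor the weight is stated in the text, and all names are hypothetical.

```python
import math

def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy between predicted cloud probabilities and 0/1 labels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)

def total_loss(main_probs, aux_probs, targets, aux_weight=0.4):
    """Main loss plus a down-weighted loss on the auxiliary output (assumed weight)."""
    main = binary_cross_entropy(main_probs, targets)
    aux = binary_cross_entropy(aux_probs, targets)
    return main + aux_weight * aux

# Toy pixel predictions (illustrative only): 1 = cloud, 0 = clear.
labels = [1, 0, 1, 0]
main_out = [0.9, 0.1, 0.8, 0.2]
aux_out = [0.7, 0.3, 0.6, 0.4]
print(total_loss(main_out, aux_out, labels))
```

Because the auxiliary term injects a gradient signal closer to the early layers, it can ease optimization even when its direct effect on the final metrics is small.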
In summary, our ablation study demonstrates the cumulative benefit of each component. The dual-path architecture was the primary source of improvement (contributing approximately 49.5% of the total mIoU gain), establishing the model’s fundamental capability. The gated fusion module was a key optimization (contributing approximately 29.4%), fully unlocking the potential of the dual paths. The auxiliary output provided essential training stability and regularization (contributing approximately 21.1%). These results collectively validate the necessity and collaborative efficacy of every proposed module in our framework.
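The contribution shares follow directly from the mIoU values reported for M1 to M4. The short script below reproduces the arithmetic; the component labels are shorthand introduced here for illustration.

```python
# mIoU (%) of each ablation variant, as reported in the text.
miou = {"M1": 89.04, "M2": 90.91, "M3": 92.02, "M4": 92.82}

# Each component is credited with the gain of the step that introduced it.
steps = [
    ("dual-path", "M1", "M2"),
    ("gated fusion", "M2", "M3"),
    ("auxiliary output", "M3", "M4"),
]

total_gain = miou["M4"] - miou["M1"]
shares = {name: (miou[b] - miou[a]) / total_gain * 100 for name, a, b in steps}

print(f"total mIoU gain: {total_gain:.2f} pp")
for name, share in shares.items():
    print(f"{name}: {share:.1f}% of the total gain")
```

Note that this attribution depends on the order in which modules were added: each share measures the marginal gain at that step, not the gain the module would deliver in isolation.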
3.6. Comparison Study
To comprehensively evaluate our model, we conducted a comparative analysis on the HRC_WHU dataset against several state-of-the-art methods. The selected competitors, which include CNN-based (AFMUNet [
12], BABFNet [
18]), Transformer-based (CSDFormer [
28]), and hybrid architectures (TransGA [
36], SwinCloud [
37], DFC-Net [
38]), encompass the main network paradigms currently prevailing in the field. This comprehensive comparison validates the synergistic effect of our two core innovations. First, the dual-path feature enhancer enriches input representations at the front-end, allowing for a lightweight backbone. Second, the Bidirectional Gated Fusion Module further refines these features, ultimately enhancing fusion accuracy and the model’s discriminative power beyond existing specialized or hybrid designs.
The proposed model achieves a breakthrough in balancing computational efficiency and classification accuracy. As shown in
Figure 9, at only 12.04 GFLOPs, our model has the lowest computational cost among all competitors, approximately 24% lower than the second most efficient model, AFMUNet (15.74 GFLOPs). Crucially, this efficiency is achieved without sacrificing accuracy. The model reaches an overall accuracy (OA) of 96.31%, outperforming all compared methods. It surpasses the second-ranked DFC-Net (95.82% OA) by 0.49 percentage points while reducing computational cost by approximately 58%. Furthermore, compared to AFMUNet (93.11% OA), which has similar complexity, our model delivers a substantial accuracy gain of 3.20 percentage points. This demonstrates that our architecture enables highly efficient parameter utilization, delivering the best classification performance among the compared methods with minimal computational burden, thus presenting a practical solution for high-precision vision tasks in resource-constrained environments.
As comprehensively summarized in
Table 2, our model achieves state-of-the-art performance across all evaluation metrics. It attains an mIoU of 92.82%, an F1-score of 96.27%, a precision of 96.26%, a recall of 96.29%, and an overall accuracy (OA) of 96.31%. Notably, it surpasses the strong DFC-Net baseline by a significant margin of 1.83 percentage points in the critical mIoU metric. Furthermore, our model secures the highest recall while maintaining top-tier precision, demonstrating an effective balance between minimizing false alarms and reducing missed detections. The results show that our method outperforms other advanced models, which represent different architectural paradigms, thereby validating the superiority and effectiveness of the proposed approach.
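For reference, the five metrics can all be derived from a pixel-level confusion matrix. The sketch below assumes a binary cloud / clear setting, with precision, recall, and F1 computed for the cloud class and mIoU averaged over both classes. The paper may instead average precision and recall over classes, so treat the conventions and the function name as assumptions.

```python
def segmentation_metrics(tp, fp, fn, tn):
    """Binary segmentation metrics from pixel counts (cloud is the positive class).

    Assumes every denominator is non-zero, i.e. both classes occur and are predicted.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    oa = (tp + tn) / (tp + fp + fn + tn)
    iou_cloud = tp / (tp + fp + fn)
    iou_clear = tn / (tn + fp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "oa": oa,
        "miou": (iou_cloud + iou_clear) / 2,
    }

# Toy pixel counts (illustrative only, not results from the paper).
metrics = segmentation_metrics(tp=40, fp=10, fn=10, tn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

mIoU penalizes both false alarms and missed detections in each class's denominator, which is why it is typically lower than OA and F1 for the same predictions.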
The qualitative comparison in
Figure 10 visually underscores the superior performance of our method across various challenging scenarios. In the first row, which presents a clear cloud image, all baseline models demonstrate competent detection performance at a coarse level. However, our proposed method yields significantly finer boundaries and preserves more complete edge details of the cloud bodies. In the second row, depicting an urban scene with thin clouds, our method achieves higher accuracy in capturing both the cloud boundaries and the subtle, semi-transparent regions where ground information is faintly visible, demonstrating an enhanced capability for weak feature extraction.
The coexistence of clouds and snow poses a persistent challenge in cloud detection. As shown in the third row, when clouds and snow are spatially separate, existing models retain a basic detection capability. However, when thin clouds overlap with underlying thick snow cover (rows four and five), these models struggle to differentiate between the two, leading to significant missed detections or misclassifications. In stark contrast, our method maintains robust and stable detection performance even in these complex scenarios, substantially outperforming all existing alternatives. While our dual-path design does not explicitly model the distinction between clouds and snow, we attribute this performance gain to the enhanced feature representations from complementary paths (global structure and local details) and the adaptive gating mechanism that dynamically selects the most relevant features for each region. These properties collectively improve discrimination in complex scenes, even without task-specific supervision for cloud-snow separation. Further investigation into explicit cloud-snow discrimination is left for future work.
4. Discussion
Advantage. Based on comprehensive experiments and analysis, our model demonstrates three key strengths. First, it achieves a breakthrough in balancing computational efficiency and detection accuracy. Operating at only 12.04 GFLOPs, it attains 96.31% overall accuracy and 92.82% mean IoU on the HRC_WHU dataset, substantially outperforming comparable models and offering a practical solution for compute-constrained onboard or edge deployment.
Second, the model enables efficient multi-level feature extraction and fusion through its dual-path feature enhancer and bidirectional gated fusion module. The front-end enhancer uses a parallel architecture to diversify input representations and reduce dependency on heavy backbones, while the fusion module employs gated attention and dynamic convolution to adaptively combine semantic and spatial features, significantly improving weak feature capture and boundary preservation.
Finally, the architecture exhibits strong robustness in complex scenarios. It consistently maintains high detection accuracy—even with thin clouds, urban landscapes, and cloud-snow coexistence—effectively balancing false alarms and missed detections. Its superior performance across diverse settings validates the effectiveness and general applicability of the proposed method under challenging real-world conditions.
Disadvantage. The proposed model was developed and evaluated exclusively on three-channel RGB imagery, which limits its ability to leverage the rich spectral information available in near-infrared and short-wave infrared bands. In the context of our current RGB-only experiments, this constitutes both a data limitation (the input lacks spectral information) and a model limitation (our architecture is designed for three-channel inputs). This inherently restricts its potential for more refined cloud analysis tasks, such as detailed cloud phase classification (e.g., discriminating between ice and water clouds) or fine-grained cloud type recognition (e.g., distinguishing stratiform clouds from convective clouds such as cumulonimbus). Consequently, the model’s discriminative capability in certain complex scenarios, particularly those where spectral characteristics are critical for accurate detection, remains limited compared to what could be achieved with multispectral or hyperspectral input. Addressing this limitation requires architectural evolution to accommodate multispectral inputs, which we outline as a primary direction in our future work.
5. Conclusions and Future Work
In this paper, we propose a novel lightweight network for cloud detection in high-resolution remote sensing imagery, effectively balancing accuracy and efficiency. To address the limitations of existing models that rely on computationally intensive backbones, we introduce two key innovations: a dual-path feature enhancer that enriches multi-scale input representations while reducing backbone dependency, and a bidirectional gated fusion module that adaptively integrates semantic and spatial features during decoding.
Experiments on the HRC_WHU dataset demonstrate that our model achieves state-of-the-art performance, with 96.31% OA and 92.82% mIoU, while requiring only 12.04 GFLOPs—significantly more efficient than comparable methods. The model maintains high precision on challenging features like thin clouds and fine boundaries without sacrificing efficiency.
These results confirm that our approach successfully bridges the gap between detection performance and operational practicality. The proposed network offers an efficient, robust solution suitable for resource-constrained environments such as onboard satellite processing and edge computing, advancing real-time cloud detection capabilities in operational scenarios.
Building on the presented RGB-based framework, future research will explore its extension into multispectral and multi-modal domains. A primary direction is to generalize the architecture for multispectral inputs, specifically utilizing VNIR and SWIR bands to advance edge analysis and precise cloud typing. Concurrently, we plan to evolve the Dual-Path Feature Enhancer into an adaptive multi-modal matching module. This innovation will empower it to extract and fuse complementary features from diverse data sources in a unified high-dimensional space, achieving powerful multi-modal fusion without resorting to modality-specific backbone alterations, thus upholding the model’s lightweight and versatile design principle.