Article

Dual-Branch Transformer–CNN Fusion for Enhanced Cloud Segmentation in Remote Sensing Imagery

Shengyi Cheng, Hangfei Guo, Hailei Wu and Xianjun Du
1 School of Microelectronics Industry-Education Integration, Lanzhou University of Technology, Lanzhou 730050, China
2 School of Automation and Electrical Engineering, Lanzhou University of Technology, Lanzhou 730050, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(18), 9870; https://doi.org/10.3390/app15189870
Submission received: 10 August 2025 / Revised: 4 September 2025 / Accepted: 5 September 2025 / Published: 9 September 2025

Abstract

Cloud coverage and obstruction significantly affect the usability of remote sensing images, making cloud detection a key prerequisite for optical remote sensing applications. Among existing cloud detection methods, U-shaped convolutional networks alone are limited in modeling long-range context, while Vision Transformers fall short in capturing local spatial features. To address these issues, this study proposes TransCNet, a dual-branch framework that combines Transformer and CNN architectures to enhance the accuracy and effectiveness of cloud detection. TransCNet employs two encoder branches: a Transformer branch capturing global dependencies and a CNN branch extracting local details. A novel feature aggregation module, enhanced by channel attention mechanisms, enables the complementary fusion of multi-level features from both branches at each encoder stage. To mitigate feature dilution during decoding, the aggregated features compensate for the information lost in sampling operations. Evaluations on 38-Cloud, SPARCS, and a high-resolution Landsat-8 dataset demonstrate TransCNet’s competitive performance across metrics, effectively balancing global semantic understanding and local edge preservation for clearer cloud boundary detection. The approach resolves key limitations of existing cloud detection frameworks through synergistic multi-branch feature integration.

1. Introduction

Clouds are critical components of meteorological systems, and their dynamic variations serve as essential indicators for climate analysis. It has been estimated that clouds obscure more than 65% of the Earth's surface at any given moment, presenting substantial challenges for remote sensing applications that rely on optical imagery [1]. Accordingly, the accurate delineation of cloud cover has become a subject of considerable research interest. However, owing to the heterogeneity and complexity inherent in remote sensing data, current cloud detection techniques remain susceptible to both omission and commission errors, making methodological advancement particularly challenging.
Among established cloud detection approaches, threshold-based methods have been extensively utilized. For instance, the Fmask algorithm introduced by Zhu et al. employs a decision tree for pixel-level discrimination between cloudy and non-cloudy regions [2]. Other notable thresholding strategies include the NOAA Advanced Very-High-Resolution Radiometer (AVHRR) cloud detection algorithm and the ACCA method [3,4]. These methods typically apply empirically defined thresholds to multispectral imagery for cloud discrimination. However, their effectiveness is often limited by their reliance on expert-driven parameter selection and ongoing calibration, which may constrain the detection accuracy.
The use of convolutional neural networks (CNNs) has greatly accelerated progress in automated cloud identification, largely owing to their powerful image feature extraction capabilities. Approaches including Cloud-Net and Cloud-Net+ [5,6] have delivered significant improvements in detection outcomes. Furthermore, research has shown that CNNs are adept at learning low-level cues, and, with the introduction of more advanced networks (e.g., CDNet, CDNetV2), refined fusion and attention mechanisms have been incorporated to further enhance the accuracy [7,8,9]. Nevertheless, CNNs often struggle to represent long-range dependencies and the global semantic context, which still limits the performance of these models.
Recently, Transformer-based architectures, which originated in natural language processing, have seen growing adoption in computer vision tasks, where they deliver remarkable results. With the introduction of self-attention mechanisms, Transformers can model relationships between distant parts of an image, thus capturing comprehensive global information. Networks such as DBNet and Cloudformer [10,11] have demonstrated improved results over traditional CNN models. By leveraging both CNNs' detailed local representations and Transformers' global modeling, next-generation frameworks are being developed to further advance the cloud detection accuracy.
This study presents a novel feature extraction system named TransCNet, which includes two parallel segmentation branches: a CNN path and a Transformer path. The Transformer path adopts the Pyramid Vision Transformer (PVT) for the modeling of global dependencies, while the CNN path—based on ResNet—specializes in capturing detailed local features from remote sensing images [12,13,14]. Since Transformer models focus primarily on the global context and may miss fine-grained object boundaries, incorporating the CNN path helps to ensure more complete feature representation. The feature aggregation module combines multi-scale features from both branches, supporting the precise joint prediction of cloud regions. Channel attention mechanisms are further applied in decoding to maximize semantic and spatial feature fusion. Validation experiments on the 38-Cloud and SPARCS datasets, as well as a new high-resolution Landsat-8 collection, confirm that the proposed method achieves state-of-the-art performance. The principal contributions of this work are outlined below.
(1)
A novel cloud detection approach is introduced, combining Transformer and CNN architectures to capture both local details and global contexts. This framework improves the precision of cloud region segmentation by aggregating features from both models.
(2)
The proposed feature aggregation module (FAM) efficiently merges multi-level outputs from both pathways, integrating spatial and contextual information. This design overcomes the limitations of using CNN or Transformer models alone, retaining more relevant details for accurate cloud delineation.
(3)
A large-scale, high-resolution cloud detection dataset, named CHLandsat-8, was assembled from 64 full scenes of Landsat-8 satellite imagery, covering various regions in China from January to December 2021. Each image includes pixel-level cloud annotations through manual labeling. The dataset will be publicly available, serving as a benchmark for the development and testing of cloud recognition techniques and supporting ongoing research in the field.

2. Related Works

2.1. Cloud Detection in Remote Sensing Images

Machine and deep learning have transformed remote sensing, as they allow algorithms to learn patterns from data and make predictions, leading to important advances in many applications [15,16]. Recently, deep learning approaches leveraging convolutional neural networks (CNNs) have become the standard for cloud recognition within remote sensing data. As a landmark, Mohajerani et al. put forward an end-to-end framework built upon fully convolutional architectures, laying the groundwork for subsequent improvements in the area [5,6]. Among these, Cloud-Net+ (2021) is notable for its effectiveness on Landsat-8 data, achieving improved detection performance. In pursuit of more precise boundary and spatial characterization, Hu et al. introduced CDUNet in 2021, employing deep learning techniques dedicated to cloud extraction [14]. Later, Pu et al. (2022) advanced the field with a new detection system that employs self-attention layers and a spatial pyramid pooling scheme, with DenseNet serving as the core feature encoder to enrich semantic representation [17]. It has been established that low-level CNN features sharpen the delineation of cloud edges, while higher-level abstractions from deeper layers help to pinpoint the overall cloud location. Yang et al. (2019) presented CDNet, combining codec structures, feature pyramids, and boundary refinement modules to improve the segmentation effectiveness [8]. Building on this, Guo et al. (2020) introduced CDNetV2, which incorporates adaptive feature fusion and semantic flow modules to better distinguish between clouds and snow in satellite imagery [9]. Building upon residual network architectures, Lu et al. presented a multi-scale strip-pool aggregation approach (2022), supplemented by pyramid pooling and strip-based boundary refinement to retain cloud edge information [18]. Additional research has focused on reinforcing global context modeling in CNNs. For instance, Xia et al. devised a global attention fusion residual network featuring specialized modules for accurate cloud region and edge delineation [19]. Guo et al. also presented a streamlined CNN-based network employing multi-feature fusion from visible and near-infrared data, achieving effective segmentation with a fully convolutional structure, termed ClouDet [20]. Despite these advances, limitations in modeling long-range dependencies persist, as the intrinsic characteristics of CNNs impede global contextual integration, thus constraining the segmentation accuracy. This limitation has motivated the exploration of Vision Transformer architectures, which are recognized for their superior global modeling capabilities and form the basis for the present study’s investigation into Transformer-based cloud segmentation [21].

2.2. Vision Transformer

The Transformer, built from stacked multi-head self-attention and multi-layer perceptron modules, was first presented by Vaswani et al. in 2017 for natural language processing [22], and its introduction has since driven significant progress in computer vision. Dosovitskiy et al. subsequently applied Transformer networks to image classification, demonstrating strong results on ImageNet and driving broad adoption for visual recognition tasks [21,23]. Numerous Transformer-based vision models have since emerged, frequently outperforming traditional CNN architectures across a range of computer vision applications. For example, Liu et al. developed a Transformer-only framework for both RGB and RGB-D saliency detection [24]. Qiu et al. designed an asymmetric bilateral U-shaped Transformer network to simultaneously model global contexts and local details for saliency tasks [25]. In medical image analysis, Zhang et al. constructed an intra-branch parallel architecture that integrates both CNN and Transformer components, yielding improved segmentation robustness [26]. In the context of remote sensing image cloud detection, hybrid architectures integrating CNNs and Transformers have been proposed. Lu et al. described a dual-branch framework for the separate extraction of semantic and spatial detail features, addressing the challenges of false positives and missed detections in high-resolution imagery [18]. Zhang et al. further advanced this approach by designing a detection framework that fuses CNN and Transformer outputs for accurate cloud identification in optical remote sensing data [26]. Nevertheless, the direct application of cloud detection methods not designed for remote sensing to satellite imagery remains suboptimal, largely due to the high spatial resolution and heterogeneity of satellite data. Current Transformer-based methods often fail to fully exploit the global contextual features captured by Transformers and the local detail features provided by CNNs, resulting in limited segmentation precision and increased rates of omission and false detection.

3. Method

The core architecture of TransCNet consists of a dual-branch feature extraction network that operates in parallel, along with a highly efficient feature aggregation module. This section first introduces the overall framework of TransCNet and then provides a detailed explanation of the design and construction of the feature aggregation module.

3.1. Problem Formulation

The primary objective of cloud detection in remote sensing imagery is to generate a binary mask that highlights the cloud regions within the input image. The input is represented in the RGB color space, and the output is a binary map that distinguishes between cloud and non-cloud areas. TransCNet, as introduced in this study, is formulated as a composite function that combines CNN and Transformer architectures. The central goal is to develop an effective mapping function that leverages the complementary strengths of both CNNs and Transformers, enabling the model to learn discriminative features for more accurate cloud region segmentation [27,28].

3.2. Overall Framework

The proposed framework is designed to simultaneously acquire global contextual information and local feature representations, which are critical for both the accurate localization and detailed boundary refinement of clouds. Conventional CNN-based methods for cloud detection in satellite imagery (Hu et al., 2021; Pu et al., 2022; Lu et al., 2022; Xia et al., 2021; Guo et al., 2021) have been shown to lack sufficient capabilities for context modeling [14,17,18,19,20], whereas Transformer-based approaches in various vision applications (Qiu et al., 2021; Zhang et al., 2021; Samplawski et al., 2021; Sun et al., 2021; Wang et al., 2021; Dang et al., 2022; Botach et al., 2022) have demonstrated limitations in effectively representing local spatial details [25,26,29,30,31,32,33]. To address these objectives, TransCNet is developed as a parallel dual-branch feature extraction network that leverages the complementary strengths of Transformer and CNN architectures. The network inherits the Transformer’s capacity for global context modeling alongside the CNN’s effectiveness in local feature extraction. TransCNet consists of a parallel dual-branch encoder and decoder.
As illustrated in Figure 1, the architecture features two distinct encoding branches: the grey Transformer path and the orange CNN path. The Transformer path is built on the well-known four-stage PVT network; its limitation is that it favors long-range dependencies while overlooking local detail. This shortcoming is mitigated by the CNN path, which is based on the widely used ResNet-34. A dedicated feature aggregation module is constructed to effectively integrate hierarchical features from the two branches. For cloud region segmentation, the decoding network incorporates a channel attention mechanism to enhance the accuracy of the final output.
Transformer Path. In this study, the widely recognized Transformer network PVT is utilized as the backbone for the Transformer path. The four stages of the PVT share a uniform architecture consisting of a patch embedding layer followed by multiple Transformer blocks. Specifically, for an input image of size H × W × 3, the patch embedding layer divides it into (H/4) × (W/4) patches, each of size 4 × 4 × 3. Positional embeddings are then added to the flattened patches, which are subsequently processed by Transformer blocks. From the second stage onward, each stage halves the spatial resolution of the feature map through an additional patch embedding layer, again adding positional embeddings before the Transformer blocks. The outputs of the four stages, denoted T1, T2, T3, and T4 from top to bottom in Figure 1, have 64, 128, 320, and 512 channels, respectively.
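As a concrete illustration of the stage geometry described above, the following minimal sketch (assuming the 352 × 352 training resolution used later in Section 4.2) prints the expected spatial size and channel width of T1–T4; it is bookkeeping only, not the PVT implementation itself.

```python
# Bookkeeping sketch of the Transformer-path feature shapes for a 352 x 352 x 3
# input: a 4x4 patch embedding at stage 1, spatial halving at each later stage,
# and channel widths 64/128/320/512 as stated above.
H = W = 352
channels = [64, 128, 320, 512]
for stage, ch in enumerate(channels, start=1):
    scale = 4 * 2 ** (stage - 1)      # /4 at stage 1, then /2 per stage
    print(f"T{stage}: {H // scale} x {W // scale} x {ch}")
# Expected output: T1: 88 x 88 x 64, T2: 44 x 44 x 128,
#                  T3: 22 x 22 x 320, T4: 11 x 11 x 512
```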
CNN Path. For the CNN branch, the widely used ResNet-34 network is adopted. As indicated in Figure 1, the features from stages C1, C2, C3, and C4 in the CNN path are fused with the corresponding outputs from the Transformer path. Notably, the CNN path is designed to be adaptable, allowing for the substitution of other convolutional architectures if desired. The input to the CNN path is the original remote sensing image, enabling the extraction of rich spatial detail features.
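For readers who wish to reproduce the CNN path, a minimal sketch is given below. It assumes a recent torchvision and uses its feature extractor to expose the four ResNet-34 stages as C1–C4; the node names and the extraction utility are assumptions for illustration, not the authors' released code.

```python
import torch
from torchvision.models import resnet34
from torchvision.models.feature_extraction import create_feature_extractor

# CNN-path sketch: expose the four ResNet-34 stages as C1..C4.
backbone = resnet34(weights="IMAGENET1K_V1")
cnn_path = create_feature_extractor(
    backbone,
    return_nodes={"layer1": "C1", "layer2": "C2", "layer3": "C3", "layer4": "C4"},
)
feats = cnn_path(torch.randn(1, 3, 352, 352))
print({name: tuple(f.shape) for name, f in feats.items()})
# C1: 1x64x88x88, C2: 1x128x44x44, C3: 1x256x22x22, C4: 1x512x11x11; the 1x1
# convolutions inside the aggregation module can align these widths with T1..T4.
```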
To improve the expressiveness of decoded features, an adapted channel attention (CA) module is integrated within the decoder. As illustrated in Figure 2, this module adopts the conventional squeeze-and-excitation (SE) block framework [34]. Specifically, the CA unit receives both the decoded feature Di and the fused feature Fi from the aggregation module, generating Di−1 as input for the following decoding layer.
The mathematical formulation describing channel attention is presented as follows:
$$\mathrm{CA}(v_h, W) = \sigma\!\left( conv_2\!\left( R\!\left( conv_1\!\left( G(v_h), W_1 \right) \right), W_2 \right) \right),$$
where $v_h$ denotes the channel feature vector, $W_1$ and $W_2$ denote the learned parameters, $\sigma(\cdot)$ is the Sigmoid function, $R(\cdot)$ is the ReLU activation function, $G(\cdot)$ denotes the global average pooling operation, and $conv_1(\cdot)$ and $conv_2(\cdot)$ are convolution operations.
The inputs and outputs of the channel attention module are defined as
$$D_{i-1} = \mathrm{CA}\!\left( conv_3\!\left( \mathrm{Cat}\!\left[ F_i, D_i \right] \right), W \right), \quad \text{for } i \in \{4, 3, 2, 1\},$$
where $\mathrm{Cat}[\cdot,\cdot]$ denotes the feature concatenation operation, $conv_3(\cdot)$ is a convolution operation, and $\mathrm{CA}(v_h, W)$ is the above-mentioned channel attention mechanism.
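A minimal PyTorch sketch of the two equations above is given below; the reduction ratio, the kernel sizes, and the bilinear upsampling of $D_i$ are assumptions not specified in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """CA(v_h, W): global average pooling, conv1 + ReLU, conv2 + Sigmoid.
    The reduction ratio r is an assumption (not specified in the paper)."""
    def __init__(self, ch, r=16):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, max(ch // r, 4), kernel_size=1)  # W1
        self.conv2 = nn.Conv2d(max(ch // r, 4), ch, kernel_size=1)  # W2

    def forward(self, x):
        v = F.adaptive_avg_pool2d(x, 1)                       # G(.)
        w = torch.sigmoid(self.conv2(F.relu(self.conv1(v))))  # channel weights
        return x * w                                          # re-weight channels

class DecoderStage(nn.Module):
    """D_{i-1} = CA(conv3(Cat[F_i, D_i])); bilinear upsampling of D_i to the
    resolution of F_i is an assumption."""
    def __init__(self, f_ch, d_ch, out_ch):
        super().__init__()
        self.conv3 = nn.Conv2d(f_ch + d_ch, out_ch, kernel_size=3, padding=1)
        self.ca = ChannelAttention(out_ch)

    def forward(self, F_i, D_i):
        D_i = F.interpolate(D_i, size=F_i.shape[-2:], mode="bilinear",
                            align_corners=False)
        return self.ca(self.conv3(torch.cat([F_i, D_i], dim=1)))
```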
The outputs generated by each decoder stage (P1–P5) are aggregated to produce the final prediction mask. To enhance training, strong supervision is imposed on these outputs. During the decoding phase, the features undergo batch normalization and activation, followed by two convolutional layers that reduce them to a single channel. A Sigmoid function is then used to produce the probability map, with values ranging from 0 to 1. For joint optimization, TransCNet is trained end-to-end using the binary cross-entropy (BCE) loss [35], which supervises all output layers during training. The overall loss function for deep supervision is given as
$$L = \sum_{i=1}^{5} \frac{i}{5} \, L_{BCE}\!\left( P_i, G \right),$$
Here, $G$ denotes the ground-truth cloud region, and $L_{BCE}$ refers to the binary cross-entropy loss function, defined as
$$L_{BCE} = - \sum_{(r,c)} \left[ G(r,c) \log S(r,c) + \left( 1 - G(r,c) \right) \log\!\left( 1 - S(r,c) \right) \right],$$
where $G(r,c)$ is the mask label of the pixel $(r,c)$ and $S(r,c)$ is the predicted cloud probability at that pixel.
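The deep-supervision loss can be sketched as follows; the correspondence between the weight $i/5$ and the side outputs P1–P5, and the use of the numerically stable logits form of BCE (which fuses the Sigmoid described above), are assumptions.

```python
import torch
import torch.nn.functional as F

def deep_supervision_loss(preds, gt):
    """L = sum_{i=1..5} (i/5) * L_BCE(P_i, G). `preds` holds the five side-output
    logits, ordered so that the final prediction receives the largest weight
    (an assumption); `gt` is the binary ground-truth mask."""
    loss = torch.zeros((), device=gt.device)
    for i, p in enumerate(preds, start=1):
        p = F.interpolate(p, size=gt.shape[-2:], mode="bilinear", align_corners=False)
        loss = loss + (i / 5.0) * F.binary_cross_entropy_with_logits(p, gt)
    return loss
```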

3.3. Feature Aggregation Module

To combine the representations obtained from the CNN and Transformer branches, this study introduces an original feature aggregation module (FAM), illustrated in Figure 3. The FAM incorporates both an attention mechanism and a multimodal feature fusion strategy. In Figure 3, $C_i$ denotes the features extracted by the CNN branch, $T_i$ refers to those from the Transformer branch, and $F_i$ represents the output features generated by the FAM. Here, the index $i$ takes values 1, 2, 3, and 4.
Specifically, the fused features are obtained by the following operations. First, inspired by DANet [36], the features of the two branches are fused through a channel-wise attention mechanism, defined as
$$\hat{C}_i = \mathrm{softmax}\!\left( conv_4(C_i) \times conv_5(T_i) \right) \times conv_6(C_i) + C_i,$$
$$\hat{T}_i = \mathrm{softmax}\!\left( conv_4(C_i) \times conv_5(T_i) \right) \times conv_7(T_i) + T_i,$$
Then, to further fuse the two branch features while enhancing the cloud region features and suppressing noise, the fused features are processed using a channel-wise SE-Block [34]:
$$CT_i = \mathrm{ChannelAttn}\!\left( \mathrm{Cat}\!\left[ conv_4(C_i), conv_5(T_i) \right] \right),$$
where $\mathrm{ChannelAttn}(\cdot)$ denotes the channel-wise SE-Block and $\mathrm{Cat}[\cdot,\cdot]$ denotes the feature concatenation operation. The final output features of the FAM are
$$F_i = conv_8\!\left( \mathrm{Cat}\!\left[ \hat{C}_i, \hat{T}_i, CT_i \right] \right),$$
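The three equations above can be sketched in PyTorch as follows; the 1 × 1 projections, the common channel width `ch`, and the residual-alignment convolutions are assumptions introduced only to keep the example self-contained and dimensionally consistent.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation block used as the channel-wise SE-Block in the FAM."""
    def __init__(self, ch, r=16):
        super().__init__()
        self.fc1 = nn.Conv2d(ch, max(ch // r, 4), 1)
        self.fc2 = nn.Conv2d(max(ch // r, 4), ch, 1)

    def forward(self, x):
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(x.mean(dim=(2, 3), keepdim=True)))))
        return x * w

class FAM(nn.Module):
    """Sketch of the feature aggregation module: a DANet-style channel affinity
    between the two branches, an SE block on the concatenated projections, and a
    final fusion convolution."""
    def __init__(self, cnn_ch, trans_ch, ch):
        super().__init__()
        self.conv4 = nn.Conv2d(cnn_ch, ch, 1)     # projects C_i
        self.conv5 = nn.Conv2d(trans_ch, ch, 1)   # projects T_i
        self.conv6 = nn.Conv2d(cnn_ch, ch, 1)
        self.conv7 = nn.Conv2d(trans_ch, ch, 1)
        self.res_c = nn.Conv2d(cnn_ch, ch, 1)     # aligns residual channels (assumed)
        self.res_t = nn.Conv2d(trans_ch, ch, 1)
        self.se = SEBlock(2 * ch)
        self.conv8 = nn.Conv2d(4 * ch, ch, 1)

    def forward(self, C, T):
        B, _, H, W = C.shape
        q = self.conv4(C).flatten(2)                              # B x ch x HW
        k = self.conv5(T).flatten(2)                              # B x ch x HW
        attn = torch.softmax(q @ k.transpose(1, 2), dim=-1)       # B x ch x ch affinity
        C_hat = (attn @ self.conv6(C).flatten(2)).view(B, -1, H, W) + self.res_c(C)
        T_hat = (attn @ self.conv7(T).flatten(2)).view(B, -1, H, W) + self.res_t(T)
        CT = self.se(torch.cat([q.view(B, -1, H, W), k.view(B, -1, H, W)], dim=1))
        return self.conv8(torch.cat([C_hat, T_hat, CT], dim=1))

# Example: fuse stage-3 features (C3: 256 channels from ResNet-34, T3: 320 from PVT).
# fam = FAM(cnn_ch=256, trans_ch=320, ch=320)
# F3 = fam(torch.randn(1, 256, 22, 22), torch.randn(1, 320, 22, 22))
```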

4. Experimental Results and Analysis

4.1. Experimental Setup

TransCNet is implemented in PyTorch 1.8.1. For training, the CNN branch is initialized with ResNet-34 weights, while the Transformer branch uses weights from PVT-Tiny. All other convolutional layers and modules are randomly initialized. The Adam optimizer is selected, with an initial learning rate of 5 × 10−3 and weight decay of 0.0005. The learning rate is reduced by a factor of ten every ten epochs. Models are trained for 40 epochs with a batch size of 16. All training is conducted on an NVIDIA Tesla V100 GPU, and evaluation is carried out on an NVIDIA RTX 3060 GPU with 12 GB RAM [25,26].
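The training schedule described above can be reproduced with the sketch below; the tiny placeholder network and the random tensors stand in for TransCNet and the CHLandsat-8-TR data loader, which are not shown here.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Sketch of the training schedule in Section 4.1 (placeholder model and data).
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-3, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)  # /10 every 10 epochs
loader = DataLoader(
    TensorDataset(torch.rand(32, 3, 352, 352),
                  torch.randint(0, 2, (32, 1, 352, 352)).float()),
    batch_size=16, shuffle=True)

for epoch in range(40):                                   # 40 epochs
    for images, masks in loader:
        optimizer.zero_grad()
        loss = nn.functional.binary_cross_entropy_with_logits(model(images), masks)
        loss.backward()
        optimizer.step()
    scheduler.step()
```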

4.2. Dataset

To comprehensively evaluate the approach, three remote sensing cloud detection datasets are utilized: the public 38-Cloud dataset, the SPARCS dataset, and a bespoke CHLandsat-8 dataset. Among these, CHLandsat-8 is notable for its higher spatial resolution and increased diversity of land cover types, posing greater challenges for cloud classification.
The 38-Cloud dataset, as described by Mohajerani S et al. (2019), consists of 38 Landsat-8 scenes, with 18 allocated for training and 20 for testing. Each image is preprocessed to 384 × 384 pixels, producing 8400 samples for training and 9201 for testing [5]. The SPARCS dataset includes 1000 × 1000 pixel patches extracted from 80 different Landsat-8 scenes [39] and is commonly used as a remote sensing benchmark.
CHLandsat-8 comprises 64 large-scale images collected by Landsat-8 satellites from various regions of China in 2021, representing the first open-access high-resolution dataset for cloud detection in the country. Its coverage extends over the northwest, north, Qinghai–Tibet, and southern regions, and it contains a diverse array of environments, including urban areas, snow, ice, grasslands, mountains, forests, oceans, and deserts. Each image is approximately 8000 × 8000 × 3 pixels. All cloud masks were manually annotated at the pixel level and are freely available to support further research.
Annotation was performed by a team of 4 annotators, each responsible for labeling the cloud regions. The process involved assigning a value of 1 for cloud pixels and 0 for background pixels. During annotation, a quality control procedure was implemented, where each annotation was continuously validated and adjusted based on real-time feedback, ensuring consistent and accurate labeling. In case of disagreements, the final decision was made by selecting the annotation with the better performance in tests, ensuring the most reliable output. Of the 64 images, 44 were randomly selected for training (CHLandsat-8-TR), while the remaining 20 were reserved for testing (CHLandsat-8-TE).
To accommodate GPU memory limitations, all images were resized to 352 × 352 × 3 prior to model training. Further information about each dataset is summarized in Table 1.
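A minimal preprocessing sketch of this resize step is shown below; any augmentation or normalization settings are not stated in the paper and are therefore omitted.

```python
from torchvision import transforms

# Resize patches to the 352 x 352 x 3 input size used for training.
preprocess = transforms.Compose([
    transforms.Resize((352, 352)),
    transforms.ToTensor(),          # HWC uint8 image -> CHW float tensor in [0, 1]
])
```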

4.3. Evaluation Indicators

Model evaluation is based on six quantitative criteria commonly used in remote sensing cloud detection: the maximum F-measure (MaxFm), mean absolute error (MAE), weighted F-measure (WFm), average F-measure (AvgFm), S-measure (Sm), and E-measure (Em).
The F-measure ($F_\beta$) balances precision and recall, and both the maximum (MaxFm) and average (AvgFm) values are reported for a comprehensive assessment [37,38]:
$$F_\beta = \frac{\left( 1 + \beta^2 \right) \cdot \mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}},$$
where $\beta^2$ is set to 0.3 to emphasize precision [39].
The MAE evaluates the average absolute difference, pixel by pixel, between the predicted and true cloud masks [40]:
$$\mathrm{MAE} = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} \left| G(i,j) - S(i,j) \right|,$$
where $S \in [0,1]^{W \times H}$ is the predicted mask and $G \in \{0,1\}^{W \times H}$ is the ground truth.
The weighted F-measure builds upon the standard F-measure by mapping TP, FP, TN, and FN to continuous values, assigning different error penalties [41], and leveraging local spatial neighborhood information $\omega$:
$$F_\beta^{\omega} = \frac{\left( 1 + \beta^2 \right) \cdot \mathrm{precision}^{\omega} \cdot \mathrm{recall}^{\omega}}{\beta^2 \cdot \mathrm{precision}^{\omega} + \mathrm{recall}^{\omega}},$$
The S-measure offers a structural assessment of the foreground by blending region-aware ($S_r$) and object-aware ($S_o$) similarity, with the formula
$$S_m = \alpha S_o + \left( 1 - \alpha \right) S_r,$$
where $\alpha$ is set to 0.5 in this study [42].
The E-measure quantifies both global agreement and local alignment in the segmentation output, given by [43]
$$E_m = \frac{1}{W \times H} \sum_{c=1}^{W} \sum_{r=1}^{H} \Phi(r, c),$$
where $\Phi$ is the enhanced alignment matrix, $H$ and $W$ denote the image size, and $(r, c)$ are pixel coordinates.
In summary, the MAE evaluates the average absolute difference between the predicted and true cloud masks. Sm primarily focuses on structural evaluation, blending region-aware and object-aware similarity, emphasizing the shapes and boundaries of clouds. Em quantifies both global consistency and local alignment, with a focus on pixel-level alignment. Meanwhile, WFm enhances the traditional F-measure by introducing a weighted mechanism and leveraging local spatial information, improving its performance in handling class imbalance and spatial structure in cloud detection.
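For clarity, the MAE and the F-measure sweep can be computed as in the sketch below; the threshold-sweep granularity is an assumption, and the structural measures Sm and Em require the region/object decompositions of [42,43], which are not reproduced here.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a predicted probability map and a binary mask."""
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()

def f_measures(pred, gt, beta2=0.3, num_thresholds=255):
    """Maximum and average F-measure over a sweep of binarization thresholds."""
    gt = gt.astype(bool)
    scores = []
    for t in np.linspace(0.0, 1.0, num_thresholds):
        binary = pred >= t
        tp = np.logical_and(binary, gt).sum()
        precision = tp / (binary.sum() + 1e-8)
        recall = tp / (gt.sum() + 1e-8)
        scores.append((1 + beta2) * precision * recall /
                      (beta2 * precision + recall + 1e-8))
    scores = np.asarray(scores)
    return scores.max(), scores.mean()
```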

4.4. Comparative Experiments

To provide a comprehensive evaluation, the proposed method is benchmarked against multiple leading models, including ResNet-34 [13], PVT [12], FCN8S [44], UNet [45], PSPNet [46], SEGNet [47], GFRNet [48], CDNet [8], CDNetV2 [9], Cloud-Net [5], ClouDet [20], Mask2former [49], Segformer [50], and Cloudformer [11]. All models are trained and tested using the same open-source framework and experimental configuration. Training utilizes the CHLandsat-8-TR set, with the evaluation conducted on three distinct test datasets.

4.4.1. Quantitative Experiments

The segmentation performance on the CHLandsat-8-TE test set is summarized in Table 2. The results show that TransCNet consistently provides segmentation outputs that closely match the manual labels and achieves top rankings in most evaluation metrics (MaxFm, MAE, WFm, AvgFm, Sm, and Em). The results for each metric are color-coded to indicate the ranking: red (best), blue (second), green (third). While TransCNet achieves leading results in five out of six indicators, its E-measure is only slightly lower than the top score, further validating the effectiveness of the proposed framework for cloud detection.
Table 3 summarizes the results obtained when training on CHLandsat-8-TR and testing on the 38-Cloud-Test dataset, following the same evaluation protocol. TransCNet ranks first in two of the six metrics. For MaxFm, it is 0.012 below UNet; for Em, it is 0.02 lower than SEGNet; for AvgFm, it is 0.012 lower than Cloud-Net; and, for WFm, it lags by 0.005 compared to ClouDet.
Table 4 reports the results on the SPARCS dataset following training on CHLandsat-8-TR. TransCNet consistently delivers the best performance across all six metrics, confirming its superior cloud detection capabilities relative to existing advanced approaches.

4.4.2. Qualitative Experiments

To thoroughly illustrate the segmentation performance, qualitative visualizations are presented for TransCNet and several established baseline models, including FCN8S, UNet, PSPNet, and SEGNet. These results demonstrate that TransCNet can reliably recognize a variety of cloud structures across different remote sensing contexts. The confusion-matrix categories underlying this assessment (true/false positives and negatives) are summarized in Table 5.
TransCNet demonstrates robust adaptability to a range of scene types, as evidenced by the results on the CHLandsat-8-TE evaluation set, which encompasses environments such as grasslands, deserts, plateaus, forests, oceans, and snowfields. Figure 4 displays the side-by-side visual detection outcomes for TransCNet and other competing approaches. It is observed that many CNN-based methods tend to produce more false alarms and omissions, especially when faced with challenging or heterogeneous landscapes, whereas TransCNet substantially reduces such errors.
Figure 5 displays the detection performance on the 38-Cloud-Test test dataset, featuring scene types such as simple, cloudy, partly cloudy, ice and snow, and ocean. TransCNet demonstrates stable and accurate detection, excelling especially in challenging ice and snow scenarios.
On the SPARCS dataset (Figure 6), the visual comparison covers scenes such as thin cloud, simple, complex, snow, cloudy, cloud shadow, confusing background, and partly cloudy. TransCNet consistently yields clearer, more precise cloud delineation than its counterparts.
The results across all three test datasets demonstrate that TransCNet consistently surpasses competing algorithms in overall cloud detection accuracy for a wide range of remote sensing scenes. While thick cloud layers are readily identified by both TransCNet and the baseline models, the detection of cirrus and thin clouds remains challenging due to their high transparency, resulting in increased false negatives. Notably, TransCNet achieves a lower false negative rate than other methods, owing to its integration of Transformer and CNN modules, which jointly enhance both local detail extraction and global contextual understanding. This complementary architecture enables more precise cloud localization and segmentation. However, thin clouds are often misclassified as bright surfaces such as water or snow due to spectral similarity, leading to residual errors. Incorporating additional spectral bands (e.g., NIR, SWIR) or frequency-domain features may help to alleviate these misclassifications, providing a potential direction for further improvement.
In contrast, a standalone CNN model is limited by the size of the convolutional kernels, making it difficult to effectively capture global semantic information. On the other hand, a standalone Transformer model lacks sufficient local information, which weakens its ability to detect small-sized clouds. TransCNet, through its feature aggregation module, cleverly integrates multi-level spatial features and contextual information from both branches, enabling the network to more accurately distinguish cloud regions, reduce false positives, and deliver clearer and more accurate cloud detection results under challenging conditions.
Moreover, TransCNet excels in scenarios characterized by cloud–snow coexistence and complex backgrounds, outperforming the comparison algorithms. As discussed above, this advantage is largely attributable to the feature aggregation module, which combines multi-level spatial and contextual features from both branches and allows the network to distinguish cloud regions more reliably, even in challenging conditions.

4.5. Ablation Experiments

To systematically examine the contributions of each architectural component, a series of ablation studies is performed. The process begins with training baseline networks that utilize either ResNet-34 (for the CNN branch) or PVT-Tiny (for the Transformer branch), as detailed in Table 6. Next, the channel attention (CA) module is incorporated into both baselines, producing the CNN + CA and Transformer + CA variants. Finally, both the CNN and Transformer branches are merged with the proposed feature aggregation module (FAM). Performance is assessed using the MAE, MaxFm, and Sm metrics, with the results summarized in Table 6. Building upon this, we replaced the backbone with ConvNeXt-Tiny [51] and Swin-T [52] for further ablation studies. After replacing the backbones, all metrics showed improvements.

4.6. Interpretive Experiments

To substantiate the contribution of the developed feature aggregation module (FAM), comparative visualizations of intermediate features from the dual-branch aggregation are presented in Figure 7 and Figure 8. In Figure 7, the second-stage feature maps, C2, T2, and F2, correspond to the outputs of the CNN branch, Transformer branch, and FAM, respectively, as defined in Figure 1. “Concatenation” refers to the direct concatenation of C2 and T2 features. The visualization results reveal that the F2 features provide a more accurate and distinct representation of cloud regions compared to simple concatenation, highlighting the effectiveness of the FAM in combining information from both branches.
The comparative results of the third-stage feature visualization of the two-branch feature aggregation are shown in Figure 8. Similarly, C3, T3, and F3 denote the CNN branch features, the Transformer branch features, and the FAM output features, respectively, as defined in Figure 1. In these visualizations, clouds appear as bright pixels: the brighter a pixel, the higher its probability of being cloud. Comparing C3 and T3 with F3 in Figure 8 shows that the FAM outputs features that are more focused on the cloud region while effectively suppressing background noise, so that the network as a whole learns features that are more valuable for cloud detection.
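The kind of feature map visualization used in Figures 7 and 8 can be reproduced with a channel-mean heat map, as sketched below (the normalization and colormap choices are assumptions).

```python
import torch
import matplotlib.pyplot as plt

def show_feature(feat, title):
    """Channel-mean heat map of an intermediate feature (e.g., C2, T2, or F2);
    brighter pixels indicate a stronger cloud response."""
    fmap = feat.detach().float().mean(dim=1)[0]            # B x C x H x W -> H x W
    fmap = (fmap - fmap.min()) / (fmap.max() - fmap.min() + 1e-8)
    plt.imshow(fmap.cpu().numpy(), cmap="gray")
    plt.title(title)
    plt.axis("off")
    plt.show()
```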

5. Discussion and Analysis

The experimental results across multiple benchmark datasets clearly demonstrate that TransCNet consistently outperforms conventional CNN-based and Transformer-based models in cloud detection tasks. This superiority can be attributed to the complementary strengths of CNNs in capturing fine-grained local spatial details and Transformers in modeling long-range contextual dependencies. By combining these two types of feature representations, TransCNet achieves robust and reliable performance across diverse scene types, as confirmed by both qualitative comparisons and quantitative evaluations.
It is worth noting that, compared with widely used architectures such as UNet and PSPNet, TransCNet produces more accurate segmentation under complex background conditions, particularly in regions with subtle cloud–background transitions. Moreover, in contrast to single-backbone models such as CDNetV1 and CDNetV2, which mainly rely on scaling the backbone complexity to improve the performance, TransCNet employs the feature aggregation module (FAM) to effectively integrate features from its dual backbones. This strategy enables the more efficient utilization of network capacity and achieves a favorable balance between accuracy and computational demands.
Building on the above discussion, Section 5.1 provides a quantitative analysis of the computational complexity to further investigate the trade-off between accuracy and efficiency across different models. This is followed by Section 5.2, which discusses the limitations of the proposed method, thereby offering a more comprehensive perspective on its applicability and outlining potential directions for future research.

5.1. Computational Complexity Analysis

Table 7 extends the complexity analysis of the CDNetV1 [8] and CDNetV2 [9] models to the other compared models. The results show that TransCNet outperforms the other models in cloud detection while maintaining moderate computational complexity, as reflected in its trainable parameters (Params), floating point operations (FLOPs), and running time at a resolution of 1 k × 1 k. Although TransCNet adopts a dual-backbone structure with ResNet-34 and PVT-Tiny for multi-level feature extraction and fuses the dual-backbone features through the FAM, it gradually restores the cloud mask during the decoding stage using a channel attention mechanism, and its computational complexity is mainly determined by the size of the backbone networks. By using ResNet-34 and PVT-Tiny as backbones, TransCNet ensures sufficient depth for the extraction of richer semantic information while maintaining reasonable computational complexity compared to ConvNeXt-Tiny [51] and Swin-T [52], which have channel dimensions of (96, 192, 384, 768); the dimensions of (64, 128, 320, 512) in TransCNet effectively prevent an excessive computational burden.
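Parameter counts and the 1 k × 1 k running time reported in Table 7 can be measured with a simple profiling sketch such as the one below; FLOP counting additionally requires a dedicated tool (e.g., fvcore or thop) and is not shown.

```python
import time
import torch

def profile(model, size=(1, 3, 1024, 1024), device="cuda"):
    """Parameter count (in millions) and single-image inference time at 1k x 1k."""
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    model = model.to(device).eval()
    x = torch.randn(size, device=device)
    with torch.no_grad():
        for _ in range(3):                 # warm-up iterations
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.time()
        model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    return params_m, time.time() - start
```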

5.2. Limitations

TransCNet adopts a dual-backbone structure as the encoder, which combines the strengths of CNNs and Transformers. While this hybrid architecture has shown strong performance in cloud detection tasks, it also introduces a higher runtime due to the computational cost associated with processing both convolutional and attention-based mechanisms. The use of two backbones, while enhancing feature extraction and enabling the model to capture both local and global dependencies, inevitably leads to an increase in computational complexity. As a result, the model’s runtime may not be ideal for real-time applications, especially when working with larger datasets or higher-resolution images.
Additionally, the current model is limited to cloud detection using RGB images and has not yet been extended to multispectral data. This restricts the model’s applicability to scenarios where multispectral or hyperspectral information could provide additional valuable insights, such as in detecting different types of cloud cover or distinguishing between clouds and other atmospheric phenomena. The potential for expanding TransCNet to handle multispectral data is considerable, but it remains an area for future work, as multispectral images would introduce greater complexity in terms of data preprocessing, feature extraction, and model training.

6. Conclusions

In this study, we propose TransCNet, a novel cloud detection model that integrates a dual-backbone architecture with CNN and Transformer branches. The feature aggregation module (FAM) effectively fuses information from both branches, as confirmed by intermediate feature visualizations (Figure 7 and Figure 8), where FAM-enhanced features (F2) yield more precise cloud region representations than simple concatenation. The experimental results demonstrate that TransCNet consistently outperforms competing models; however, it also exhibits a higher computational cost and is currently restricted to RGB inputs, which limits its applicability to multispectral data. These findings reveal a trade-off between performance and efficiency. Future research will focus on reducing the computational complexity, extending the framework to multispectral imagery, enabling real-time applications, and exploring cloud removal techniques to further improve the detection accuracy and overall model robustness.

Author Contributions

Conceptualization, S.C. and X.D.; Methodology, S.C., H.G. and H.W.; Software, H.W. and S.C.; Validation, S.C. and H.G.; Writing—original draft, S.C.; Writing—review and editing, S.C. and X.D.; Visualization, S.C., H.G. and H.W.; Supervision, X.D.; Project administration, S.C. and X.D.; Funding acquisition, S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Scientific and Technological Project of Gansu Province (24CXGA050), the Scientific and Technological Project of Lanzhou City (2024-QN-63), and the Innovation Fund Project of Gansu Provincial Department of Education (2025A-025).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors thank all the editors and the reviewers for their time.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hagolle, O.; Huc, M.; Pascual, D.V.; Dedieu, G. A multi-temporal method for cloud detection, applied to FORMOSAT-2, VENµS, LANDSAT and SENTINEL-2 images. Remote Sens. Environ. 2010, 114, 1747–1755. [Google Scholar] [CrossRef]
  2. Zhu, Z.; Wang, S.; Woodcock, C.E. Improvement and expansion of the Fmask algorithm: Cloud, cloud shadow, and snow detection for Landsats 4–7, 8, and Sentinel 2 images. Remote Sens. Environ. 2015, 159, 269–277. [Google Scholar] [CrossRef]
  3. Stowe, L.L.; Davis, P.A.; McClain, E.P. Scientific basis and initial evaluation of the CLAVR-1 global clear/cloud classification algorithm for the Advanced Very High Resolution Radiometer. J. Atmos. Ocean Technol. 1999, 16, 656–681. [Google Scholar] [CrossRef]
  4. Irish, R.R.; Barker, J.L.; Goward, S.N.; Arvidson, T. Characterization of the Landsat-7 ETM+ automated cloud-cover assessment (ACCA) algorithm. Photogramm. Eng. Remote Sens. 2006, 72, 1179–1188. [Google Scholar] [CrossRef]
  5. Mohajerani, S.; Saeedi, P. Cloud-Net: An end-to-end cloud detection algorithm for Landsat 8 imagery. In Proceedings of the IGARSS 2019-2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019. [Google Scholar] [CrossRef]
  6. Mohajerani, S.; Saeedi, P. Cloud and cloud shadow segmentation for remote sensing imagery via filtered jaccard loss function and parametric augmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 4254–4266. [Google Scholar] [CrossRef]
  7. Hou, Q.; Cheng, M.M.; Hu, X.; Borji, A.; Tu, Z.; Torr, P.H. Deeply supervised salient object detection with short connections. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  8. Yang, J.; Guo, J.; Yue, H.; Liu, Z.; Hu, H.; Li, K. CDnet: CNN-based cloud detection for remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2019, 57, 6195–6211. [Google Scholar] [CrossRef]
  9. Guo, J.; Yang, J.; Yue, H.; Tan, H.; Hou, C.; Li, K. CDnetV2: CNN-based cloud detection for remote sensing imagery with cloud-snow coexistence. IEEE Trans. Geosci. Remote Sens. 2020, 59, 700–713. [Google Scholar] [CrossRef]
  10. Lu, C.; Xia, M.; Qian, M.; Chen, B. Dual-branch Network for Cloud and Cloud Shadow Segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–12. [Google Scholar] [CrossRef]
  11. Zhang, Z.; Xu, Z.; Liu, C.A.; Tian, Q.; Wang, Y. Cloudformer: Supplementary aggregation feature and mask-classification network for cloud detection. Appl. Sci. 2022, 12, 3221. [Google Scholar] [CrossRef]
  12. Wang, Y.; Xu, Z.; Wang, X.; Shen, C.; Cheng, B.; Shen, H.; Xia, H. End-to-end video instance segmentation with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021. [Google Scholar]
  13. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  14. Hu, K.; Zhang, D.; Xia, M. CDUNet: Cloud Detection UNet for Remote Sensing Imagery. Remote Sens. 2021, 13, 4533. [Google Scholar] [CrossRef]
  15. Garofalo, S.P.; Ardito, F.; Sanitate, N.; De Carolis, G.; Ruggieri, S.; Giannico, V.; Rana, G.; Ferrara, R.M. Robustness of Actual Evapotranspiration Predicted by Random Forest Model Integrating Remote Sensing and Meteorological Information: Case of Watermelon (Citrullus lanatus, (Thunb.) Matsum. & Nakai, 1916). Water 2025, 17, 323. [Google Scholar] [CrossRef]
  16. Zhang, H.K.; Qiu, S.; Suh, J.W.; Luo, D.; Zhu, Z. Machine learning and deep learning in remote sensing data analysis. Ref. Modul. Earth Syst. Environ. Sci. 2024. [Google Scholar] [CrossRef]
  17. Pu, W.; Wang, Z.; Liu, D.; Zhang, Q. Optical Remote Sensing Image Cloud Detection with Self-Attention and Spatial Pyramid Pooling Fusion. Remote Sens. 2022, 14, 4312. [Google Scholar] [CrossRef]
  18. Lu, C.; Xia, M.; Lin, H. Multi-scale strip pooling feature aggregation network for cloud and cloud shadow segmentation. Neural Comput. Appl. 2022, 34, 6149–6162. [Google Scholar] [CrossRef]
  19. Xia, M.; Wang, T.; Zhang, Y.; Liu, J.; Xu, Y. Cloud/shadow segmentation based on global attention feature fusion residual network for remote sensing imagery. Int. J. Remote Sens. 2021, 42, 2022–2045. [Google Scholar] [CrossRef]
  20. Guo, H.; Bai, H.; Qin, W. ClouDet: A Dilated Separable CNN-Based Cloud Detection Framework for Remote Sensing Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 9743–9755. [Google Scholar] [CrossRef]
  21. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  22. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  23. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Fei-Fei, L. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  24. Liu, N.; Zhang, N.; Wan, K.; Shao, L.; Han, J. Visual saliency transformer. In Proceedings of the 18th IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
  25. Qiu, Y.; Liu, Y.; Zhang, L.; Xu, J. Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net. arXiv 2021, arXiv:2108.07851. [Google Scholar] [CrossRef]
  26. Zhang, Y.; Liu, H.; Hu, Q. Transfuse: Fusing transformers and cnns for medical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Strasbourg, France, 27 September–1 October 2021. [Google Scholar] [CrossRef]
  27. Du, X.; Wu, H. Cloud-Graph: A feature interaction graph convolutional network for remote sensing image cloud detection. J. Intell. Fuzzy Syst. 2023, 45, 9123–9139. [Google Scholar] [CrossRef]
  28. Du, X.; Wu, H. Gated aggregation network for cloud detection in remote sensing image. Vis. Comput. 2024, 40, 2517–2536. [Google Scholar] [CrossRef]
  29. Samplawski, C.; Marlin, B.M. Towards Transformer-Based Real-Time Object Detection at the Edge: A Benchmarking Study. In Proceedings of the MILCOM 2021-2021 IEEE Military Communications Conference, San Diego, CA, USA, 29 November–2 December 2021. [Google Scholar] [CrossRef]
  30. Sun, Z.; Cao, S.; Yang, Y.; Kitani, K.M. Rethinking transformer-based set prediction for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021. [Google Scholar]
  31. Wang, W.; Lai, Q.; Fu, H.; Shen, J.; Ling, H.; Yang, R. Salient object detection in the deep learning era: An in-depth survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3239–3259. [Google Scholar] [CrossRef] [PubMed]
  32. Dang, L.M.; Wang, H.; Li, Y.; Nguyen, T.N.; Moon, H. DefectTR: End-to-end defect detection for sewage networks using a transformer. Constr. Build. Mater. 2022, 325, 126584. [Google Scholar] [CrossRef]
  33. Botach, A.; Zheltonozhskii, E.; Baskin, C. End-to-end referring video object segmentation with multimodal transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022. [Google Scholar]
  34. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  35. De Boer, P.T.; Kroese, D.P.; Mannor, S.; Rubinstein, R.Y. A tutorial on the cross-entropy method. Ann. Oper. Res. 2005, 134, 19–67. [Google Scholar] [CrossRef]
  36. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
  37. Achanta, R.; Hemami, S.; Estrada, F.; Susstrunk, S. Frequency-tuned salient region detection. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009. [Google Scholar] [CrossRef]
  38. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021. [Google Scholar]
  39. Hughes, M.J.; Hayes, D.J. Automated detection of cloud and cloud shadow in single-date Landsat imagery using neural networks and spatial post-processing. Remote Sens. 2014, 6, 4907–4926. [Google Scholar] [CrossRef]
  40. Fu, K.; Fan, D.P.; Ji, G.P.; Zhao, Q.; Shen, J.; Zhu, C. Siamese network for RGB-D salient object detection and beyond. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 5541–5559. [Google Scholar] [CrossRef] [PubMed]
  41. Margolin, R.; Zelnik-Manor, L.; Tal, A. How to evaluate foreground maps? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  42. Fan, D.P.; Cheng, M.M.; Liu, Y.; Li, T.; Borji, A. Structure-measure: A new way to evaluate foreground maps. In Proceedings of the 16th IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  43. Fan, D.P.; Gong, C.; Cao, Y.; Ren, B.; Cheng, M.M.; Borji, A. Enhanced-alignment measure for binary foreground map evaluation. arXiv 2018, arXiv:1805.10421. [Google Scholar] [CrossRef]
  44. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  45. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted, Munich, Germany, 5–9 October 2015. [Google Scholar] [CrossRef]
  46. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  47. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  48. Amirul Islam, M.; Rochan, M.; Bruce, N.D.; Wang, Y. Gated feedback refinement network for dense image labeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  49. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention Mask Transformer for Universal Image Segmentation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 1280–1289. [Google Scholar] [CrossRef]
  50. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2021; Volume 34, pp. 12077–12090. [Google Scholar]
  51. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  52. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
Figure 1. The overall architecture of TransCNet, consisting of a dual-branch feature extraction network, with the Transformer branch capturing the global context and the CNN branch extracting local details. In the feature aggregation module (FAM), the features from both branches are aggregated to enhance the accuracy of cloud detection.
Figure 2. Structure of the channel attention module.
Figure 3. Structure of feature aggregation module.
Figure 4. Visual comparison of cloud detection results across diverse scene types on the CHLandsat-8-TE dataset, obtained by TransCNet (ours), FCN8S [44], UNet [45], GFRNet [48], PSPNet [46], SEGNet [47], CDNet [8], and CDNetV2 [9].
Figure 5. Visual comparison of cloud detection results across diverse scene types on the 38-Cloud-Test dataset, obtained by TransCNet (ours), FCN8S [44], UNet [45], GFRNet [48], PSPNet [46], SEGNet [47], CDNet [8], and CDNetV2 [9].
Figure 6. Visual comparison of cloud detection results across diverse scene types on the SPARCS dataset, obtained by TransCNet (ours), FCN8S [44], UNet [45], GFRNet [48], PSPNet [46], SEGNet [47], CDNet [8], and CDNetV2 [9].
Figure 7. Comparison results of the second stage of feature visualization for two-branch feature aggregation.
Figure 8. Comparison results of the third stage of feature visualization for two-branch feature aggregation.
Table 1. Dataset details [27].
Dataset         | Scenes | Images | Train/Test
CHLandsat-8-TR  | 44     | 22,616 | Train
CHLandsat-8-TE  | 20     | 10,080 | Test
38-Cloud-Test   | 20     | 10,906 | Test
SPARCS          | 80     | 720    | Test
Table 2. Quantitative comparison with SOTA methods on the CHLandsat-8-TE dataset [27].
CHLandsat-8-TE (20)
Model        | MAE   | MaxFm | AvgFm | WFm   | Sm    | Em
FCN8s        | 0.106 | 0.874 | 0.858 | 0.784 | 0.742 | 0.827
UNet         | 0.113 | 0.862 | 0.826 | 0.745 | 0.729 | 0.789
PSPNet       | 0.097 | 0.879 | 0.869 | 0.799 | 0.767 | 0.858
SEGNet       | 0.102 | 0.874 | 0.850 | 0.778 | 0.754 | 0.824
GFRNet       | 0.123 | 0.852 | 0.827 | 0.736 | 0.716 | 0.779
Cloud-Net    | 0.101 | 0.875 | 0.846 | 0.764 | 0.736 | 0.795
ClouDet      | 0.095 | 0.884 | 0.870 | 0.796 | 0.764 | 0.828
CDNet        | 0.129 | 0.848 | 0.814 | 0.722 | 0.709 | 0.763
CDNetV2      | 0.125 | 0.842 | 0.823 | 0.735 | 0.714 | 0.790
ResNet-34    | 0.133 | 0.841 | 0.823 | 0.716 | 0.691 | 0.763
PVT          | 0.144 | 0.850 | 0.818 | 0.706 | 0.753 | 0.824
Mask2former  | 0.102 | 0.877 | 0.839 | 0.765 | 0.737 | 0.797
Segformer    | 0.109 | 0.867 | 0.831 | 0.752 | 0.733 | 0.791
Cloudformer  | 0.112 | 0.856 | 0.828 | 0.742 | 0.766 | 0.781
TransCNet    | 0.082 | 0.893 | 0.874 | 0.815 | 0.850 | 0.844
Table 3. Quantitative comparison with SOTA methods on the 38-Cloud-Test dataset [27].
38-Cloud-Test (20)
Model        | MAE   | MaxFm | AvgFm | WFm   | Sm    | Em
FCN8s        | 0.069 | 0.857 | 0.833 | 0.768 | 0.768 | 0.817
UNet         | 0.064 | 0.879 | 0.862 | 0.797 | 0.785 | 0.856
PSPNet       | 0.065 | 0.831 | 0.823 | 0.759 | 0.777 | 0.889
SEGNet       | 0.056 | 0.857 | 0.850 | 0.800 | 0.806 | 0.912
GFRNet       | 0.079 | 0.843 | 0.824 | 0.751 | 0.754 | 0.817
Cloud-Net    | 0.055 | 0.890 | 0.878 | 0.761 | 0.798 | 0.870
ClouDet      | 0.052 | 0.896 | 0.882 | 0.819 | 0.824 | 0.899
CDNet        | 0.106 | 0.835 | 0.823 | 0.738 | 0.727 | 0.793
CDNetV2      | 0.108 | 0.817 | 0.805 | 0.718 | 0.721 | 0.781
ResNet-34    | 0.090 | 0.840 | 0.797 | 0.704 | 0.719 | 0.770
PVT          | 0.101 | 0.851 | 0.697 | 0.647 | 0.737 | 0.735
Mask2former  | 0.078 | 0.846 | 0.841 | 0.754 | 0.761 | 0.845
Segformer    | 0.080 | 0.842 | 0.831 | 0.751 | 0.746 | 0.821
Cloudformer  | 0.082 | 0.841 | 0.835 | 0.752 | 0.742 | 0.831
TransCNet    | 0.045 | 0.867 | 0.866 | 0.814 | 0.859 | 0.892
Table 4. Quantitative comparison with SOTA methods on the SPARCS dataset [27].
SPARCS (80)
Model        | MAE   | MaxFm | AvgFm | WFm   | Sm    | Em
FCN8s        | 0.143 | 0.464 | 0.386 | 0.307 | 0.517 | 0.456
UNet         | 0.131 | 0.527 | 0.451 | 0.365 | 0.542 | 0.506
PSPNet       | 0.126 | 0.543 | 0.480 | 0.376 | 0.541 | 0.550
SEGNet       | 0.110 | 0.631 | 0.555 | 0.470 | 0.592 | 0.595
GFRNet       | 0.131 | 0.516 | 0.444 | 0.363 | 0.547 | 0.512
Cloud-Net    | 0.121 | 0.547 | 0.462 | 0.380 | 0.553 | 0.517
ClouDet      | 0.105 | 0.566 | 0.502 | 0.452 | 0.581 | 0.554
CDNet        | 0.116 | 0.616 | 0.546 | 0.459 | 0.592 | 0.595
CDNetV2      | 0.122 | 0.587 | 0.514 | 0.425 | 0.570 | 0.560
ResNet-34    | 0.148 | 0.403 | 0.364 | 0.301 | 0.490 | 0.442
PVT          | 0.180 | 0.512 | 0.442 | 0.355 | 0.479 | 0.495
Mask2former  | 0.115 | 0.618 | 0.557 | 0.466 | 0.595 | 0.602
Segformer    | 0.122 | 0.545 | 0.471 | 0.378 | 0.562 | 0.546
Cloudformer  | 0.112 | 0.624 | 0.563 | 0.475 | 0.602 | 0.611
TransCNet    | 0.105 | 0.645 | 0.582 | 0.490 | 0.606 | 0.627
Table 5. Confusion matrix [28].
                     | Cloud Region   | Non-Cloud Region
Predicted: cloud     | True Positive  | False Positive
Predicted: non-cloud | False Negative | True Negative
Table 6. Results of the proposed network structure ablation experiment.
Backbone               | FAM | CA | CHLandsat8-TE (20)  | 38-Cloud-Test (20)  | SPARCS (80)
                       |     |    | MAE   MaxFm  Sm     | MAE   MaxFm  Sm     | MAE   MaxFm  Sm
ResNet-34              |     |    | 0.133 0.841  0.691  | 0.090 0.840  0.719  | 0.148 0.403  0.490
PVT-Tiny               |     |    | 0.144 0.850  0.753  | 0.101 0.851  0.737  | 0.180 0.512  0.479
ResNet-34              |     | ✓  | 0.128 0.840  0.698  | 0.082 0.851  0.708  | 0.149 0.408  0.451
PVT-Tiny               |     | ✓  | 0.137 0.855  0.747  | 0.096 0.848  0.743  | 0.174 0.522  0.483
ResNet-34 + PVT-Tiny   |     |    | 0.106 0.859  0.794  | 0.068 0.827  0.785  | 0.136 0.602  0.571
ResNet-34 + PVT-Tiny   | ✓   |    | 0.088 0.886  0.853  | 0.051 0.869  0.828  | 0.108 0.697  0.637
ResNet-34 + PVT-Tiny   | ✓   | ✓  | 0.082 0.893  0.850  | 0.045 0.867  0.859  | 0.098 0.728  0.662
ConvNeXt-Tiny + Swin-T |     |    | 0.097 0.871  0.813  | 0.061 0.832  0.819  | 0.125 0.647  0.608
ConvNeXt-Tiny + Swin-T | ✓   |    | 0.081 0.894  0.859  | 0.047 0.871  0.856  | 0.097 0.731  0.669
ConvNeXt-Tiny + Swin-T | ✓   | ✓  | 0.076 0.902  0.854  | 0.042 0.869  0.887  | 0.090 0.758  0.693
Table 7. Computational complexity analysis of different methods.
Model                  | Params (M) | FLOPs (G) (224 × 224) | Running Time (s) (1 k × 1 k)
UNet                   | 8.6        | 25.2                  | 1.09
PSPNet                 | 46.6       | 19.3                  | 1.05
SEGNet                 | 29.7       | 90.2                  | 1.28
CDNetV1                | 64.8       | 48.5                  | 1.26
CDNetV2                | 65.9       | 31.5                  | 1.31
ConvNeXt-Tiny + Swin-T | 31.6       | 33.8                  | 1.38
TransCNet              | 23.0       | 24.6                  | 1.29