3.1. Experimental Datasets
To comprehensively evaluate the performance and generalization capability of the proposed DSFNet in pixel-level cloud and cloud shadow semantic segmentation, experiments were conducted on three publicly available benchmark datasets: GF1_WHU, HRC_WHU, and the Cloud and Cloud Shadow Dataset. These datasets present significant challenges, including multi-scale targets, complex surface backgrounds, and variable lighting conditions, which makes them well suited for assessing the robustness of our network.
The GF1_WHU dataset, produced by the RS IDEA team at Wuhan University, serves as the primary benchmark for this study. It contains 108 Level-2A images captured by the Wide Field of View (WFV) sensor onboard the Gaofen-1 (GF-1) satellite. These images have a spatial resolution of 16 m and four multispectral bands: red, green, blue, and near-infrared. The dataset provides extensive global coverage across diverse land cover types (including urban areas, water bodies, vegetation, and deserts) under varying cloud cover conditions. The ground truth masks were annotated by experts into three categories: background (0), cloud shadow (128), and cloud (255). The diversity of cloud morphologies and the potentially confusing background features make it a suitable benchmark for validating segmentation accuracy. Representative samples and labels are illustrated in Figure 8.
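As a minimal sketch of how the three gray levels in the GF1_WHU masks could be turned into training labels, the snippet below maps 0, 128, and 255 to class indices. The index order (0 = background, 1 = cloud shadow, 2 = cloud) and the helper name `encode_mask` are assumptions made for illustration; the paper does not state how the labels were encoded.

```python
# Gray levels used in the GF1_WHU masks, as stated in the text.
# The class-index order on the right is an assumption for illustration.
GRAY_TO_CLASS = {0: 0, 128: 1, 255: 2}  # background, cloud shadow, cloud


def encode_mask(mask_rows):
    """Convert a 2-D mask of gray levels into class indices.

    Raises ValueError on any gray level outside the three documented
    values, so unexpected annotations are not silently mislabelled.
    """
    encoded = []
    for row in mask_rows:
        new_row = []
        for value in row:
            if value not in GRAY_TO_CLASS:
                raise ValueError(f"unexpected gray level: {value}")
            new_row.append(GRAY_TO_CLASS[value])
        encoded.append(new_row)
    return encoded


example = [[0, 128, 255], [255, 255, 0]]
print(encode_mask(example))  # [[0, 1, 2], [2, 2, 0]]
```

Rejecting unknown gray levels rather than defaulting them to background is a design choice that surfaces annotation or decoding problems early.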
Due to GPU memory constraints, the original large-scale images were divided into non-overlapping patches of size 256 × 256, and patches containing invalid background or severe blur were discarded. The resulting patches were then randomly divided into training, validation, and test subsets with a ratio of 8:1:1. To enhance robustness to multi-scale and multi-directional cloud structures, data augmentation techniques, including horizontal flipping, vertical flipping, and random rotation, were applied during training. Ultimately, 5428 patches were used for training, 680 for validation, and 680 for testing. The validation set was used for hyperparameter tuning and model selection, while the test set was reserved exclusively for the final performance evaluation.
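The patching and splitting steps can be sketched as follows. This is an illustrative outline only: dropping partial tiles at the right and bottom edges, the fixed random seed, and the function names `tile_origins` and `split_8_1_1` are assumptions, and the filtering of invalid or blurred patches is not shown.

```python
import random


def tile_origins(height, width, patch=256):
    """Top-left (row, col) of every full non-overlapping patch.

    Partial tiles at the right and bottom edges are dropped
    (an assumption; the paper does not describe edge handling).
    """
    return [
        (r, c)
        for r in range(0, height - patch + 1, patch)
        for c in range(0, width - patch + 1, patch)
    ]


def split_8_1_1(items, seed=0):
    """Shuffle items and split them into train/val/test at roughly 8:1:1."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = n * 8 // 10
    n_val = n // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test subset
    return train, val, test


origins = tile_origins(600, 700)
print(len(origins))  # 4

train, val, test = split_8_1_1(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```

Splitting at the patch level, as described in the text, keeps the three subsets disjoint; the seed only makes the illustrative split reproducible.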
To evaluate the generalization robustness of the model across multi-source data, we further used the HRC_WHU dataset and the Cloud and Cloud Shadow Dataset. The HRC_WHU dataset consists of 150 high-resolution RGB images sourced from Google Earth, with spatial resolutions ranging from 0.5 to 15 m. It covers five typical land surface types: water, vegetation, urban, ice/snow, and bare land, which often exhibit spectral similarities to clouds and thereby increase segmentation difficulty. This dataset provides binary annotations for cloud and non-cloud regions. Following the same protocol, the original images were cropped into 3200 patches of 256 × 256 pixels and then divided into 2560 training patches, 320 validation patches, and 320 test patches.
The Cloud and Cloud Shadow Dataset, also sourced from Google Earth, comprises high-resolution remote sensing imagery collected by professional meteorologists across various geographical regions, including the Yunnan-Guizhou Plateau, the Qinghai–Tibet Plateau, and the Yangtze River Delta. It covers five typical backgrounds: water, forest, farmland, towns, and deserts, with refined manual annotations for clouds, cloud shadows, and background. Because the original images in this dataset vary more in spatial extent and aspect ratio, and the effective target regions are relatively concentrated in some samples, the images were cropped into 224 × 224 patches rather than 256 × 256. In our preliminary preprocessing comparisons, 256 × 256 patches introduced more boundary redundancy and lower effective region utilization for this dataset, which was particularly unfavorable for preserving thin clouds and subtle edge details. After removing invalid samples, a total of 2560 patches were obtained and divided into 2048 training patches, 256 validation patches, and 256 test patches.
It should be emphasized that, within each dataset, all compared methods used exactly the same patch size and preprocessing strategy to ensure a fair comparison.
3.2. Experimental Details
All experiments were conducted using the PyTorch 2.2.2 framework on an NVIDIA RTX 4090 GPU, with CUDA 11.8 for acceleration. For training, this paper employed the Adam optimizer due to its superior convergence properties. The initial learning rate was set to 0.001, utilizing a cosine annealing strategy for dynamic decay to facilitate model convergence. The learning rate was adjusted according to the following formula:
$$\eta_t = \eta_{\min} + \frac{1}{2}\left(\eta_0 - \eta_{\min}\right)\left(1 + \cos\left(\frac{t}{T}\pi\right)\right)$$
where $\eta_t$ is the learning rate at epoch $t$, $\eta_0$ denotes the initial learning rate, $\eta_{\min}$ is the preset minimum learning rate, $T$ represents the total number of training epochs, and $t$ denotes the current epoch. Considering hardware memory limits and convergence characteristics, the batch size was set to 16, and the model was trained for a total of 200 epochs. Notably, the proposed DSFNet was optimized using a hybrid loss function, while the compared models used the standard cross-entropy loss. During training, hyperparameters were determined based on the validation set, and the checkpoint achieving the best validation MIoU was selected as the final model for test evaluation.
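The cosine annealing schedule described here can be sketched in a few lines of plain Python. The initial rate (0.001) and epoch count (200) come from the text; the minimum learning rate is not recoverable from the source, so the default `lr_min=0.0` below is a placeholder assumption, as is the function name.

```python
import math


def cosine_annealing_lr(epoch, total_epochs=200, lr_init=1e-3, lr_min=0.0):
    """Learning rate at a given epoch under cosine annealing.

    lr_min defaults to 0.0 only as a placeholder; the value used in
    the paper is not available in the source text.
    """
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_init - lr_min) * cos_term


for epoch in (0, 50, 100, 150, 200):
    print(epoch, cosine_annealing_lr(epoch))
```

The rate starts at the initial value, passes through the midpoint between the initial and minimum rates halfway through training, and reaches the minimum at the final epoch. In PyTorch, `torch.optim.lr_scheduler.CosineAnnealingLR` (with `T_max` and `eta_min`) provides a comparable schedule, though the text does not say whether that class was used.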
To quantitatively evaluate the segmentation accuracy and computational efficiency of DSFNet across the three datasets, seven core metrics were selected: Pixel Accuracy (PA), Mean Pixel Accuracy (MPA), F1-score, Mean Intersection over Union (MIoU), Frequency Weighted Intersection over Union (FWIoU), single-frame inference time (Time) and Floating-Point Operations (FLOPs).
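Calculation variables were defined based on the confusion matrix: for a dataset with $k$ classes (including background), $p_{ij}$ denotes the number of pixels belonging to class $i$ but predicted as class $j$. Therefore, for a specific class $i$, $p_{ii}$ represents the true positives (TP). The false positives (FP) and false negatives (FN) are calculated as $\sum_{j \neq i} p_{ji}$ and $\sum_{j \neq i} p_{ij}$, respectively.
PA is defined as the ratio of correctly classified pixels to the total pixels:
$$\mathrm{PA} = \frac{\sum_{i=1}^{k} p_{ii}}{\sum_{i=1}^{k}\sum_{j=1}^{k} p_{ij}}$$
In pixel-level semantic segmentation, PA is essentially equivalent to Overall Accuracy (OA), as both quantify the proportion of correctly classified pixels over the entire image.
MPA calculates the average accuracy across all categories:
$$\mathrm{MPA} = \frac{1}{k}\sum_{i=1}^{k}\frac{p_{ii}}{\sum_{j=1}^{k} p_{ij}}$$
F1-score is the harmonic mean of Precision (P) and Recall (R):
$$P = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad R = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad F1 = \frac{2PR}{P + R}$$
The Intersection over Union (IoU) for class $i$ is defined as:
$$\mathrm{IoU}_i = \frac{p_{ii}}{\sum_{j=1}^{k} p_{ij} + \sum_{j=1}^{k} p_{ji} - p_{ii}}$$
MIoU measures the overlap between predicted cloud/shadow regions and ground truth by averaging the IoU of each class:
$$\mathrm{MIoU} = \frac{1}{k}\sum_{i=1}^{k}\mathrm{IoU}_i$$
FWIoU evaluates overall segmentation performance by weighting the IoU of each class according to its frequency in the dataset:
$$\mathrm{FWIoU} = \frac{1}{\sum_{i=1}^{k}\sum_{j=1}^{k} p_{ij}}\sum_{i=1}^{k}\left(\sum_{j=1}^{k} p_{ij}\right)\mathrm{IoU}_i$$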
For the GF1_WHU and the Cloud and Cloud Shadow Dataset, three classes (cloud, cloud shadow, and background) were used for calculation. For the HRC_WHU dataset, two classes (cloud and background) were used.
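The accuracy metrics can be computed directly from a confusion matrix, as in the minimal sketch below. It assumes the row index is the true class and the column index is the predicted class, and that every class occurs at least once in both the labels and the predictions (otherwise a division by zero would occur). The function name and the toy matrix are illustrative and not taken from the paper.

```python
def segmentation_metrics(conf):
    """Compute PA, MPA, per-class IoU and F1, MIoU and FWIoU.

    conf[i][j] is the number of pixels of true class i predicted as class j.
    """
    k = len(conf)
    total = sum(sum(row) for row in conf)
    row_sum = [sum(conf[i]) for i in range(k)]                        # TP + FN
    col_sum = [sum(conf[i][j] for i in range(k)) for j in range(k)]  # TP + FP
    tp = [conf[i][i] for i in range(k)]

    pa = sum(tp) / total
    mpa = sum(tp[i] / row_sum[i] for i in range(k)) / k
    iou = [tp[i] / (row_sum[i] + col_sum[i] - tp[i]) for i in range(k)]
    miou = sum(iou) / k
    fwiou = sum(row_sum[i] / total * iou[i] for i in range(k))
    f1 = [2 * tp[i] / (row_sum[i] + col_sum[i]) for i in range(k)]
    return {"PA": pa, "MPA": mpa, "IoU": iou, "MIoU": miou,
            "FWIoU": fwiou, "F1": f1}


# Toy two-class example (background, cloud), as in the HRC_WHU setting.
toy = [[3, 1],
       [2, 4]]
metrics = segmentation_metrics(toy)
print(metrics["PA"])   # 0.7
print(metrics["IoU"])
```

For the three-class datasets the same function applies to a 3 × 3 matrix. How the F1-score is aggregated across classes or images in the paper is not specified, so the sketch returns it per class.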
3.4. Comparative Experiment
To verify the superiority and robustness of DSFNet in complex cloud shadow scenarios, we performed a benchmark comparison against several representative methods in the field. The comparison covers three mainstream architectural paradigms: classic CNN architectures (Unet, FCN, DeepLabV3+), advanced Transformers (SegFormer, Swin-Unet), and high-resolution or hybrid architectures designed specifically for cloud segmentation (HRNet, EDFF-Unet, MFAFNet). Experimental results (Table 3) indicate that while advanced models like HRNet and SegFormer perform well on certain metrics, DSFNet is consistently superior across multiple evaluation metrics. Specifically, our model reached the highest scores in PA (90.62%), MPA (85.42%), and MIoU (76.97%), indicating a clear advantage in complex feature extraction and boundary refinement.
To further assess the class-specific performance of the proposed model, evaluation was conducted for the cloud and cloud-shadow classes, as shown in Table 4. For cloud detection, DSFNet achieved the highest recall (94.01%), F1 score (93.18%), and IoU (87.23%), while its precision (92.36%) was slightly lower than that of HRNet. For the more challenging cloud-shadow class, DSFNet attained 82.15% precision, 70.84% recall, 75.99% F1 score, and 61.39% IoU, outperforming the second-best method by 3.3 percentage points in IoU. These results indicate that the proposed method not only performs strongly on the relatively easier cloud class, but also yields more substantial gains on the harder cloud-shadow class, which is more susceptible to background interference and boundary ambiguity.
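The relationship between the class-specific scores can be illustrated with a short calculation. When precision and recall are computed from pooled pixel counts, both F1 and IoU follow from them: F1 = 2PR / (P + R) and IoU = PR / (P + R − PR). The sketch below applies these identities to the precision and recall values quoted above; the function names are illustrative, and the pooled-count assumption is ours, since the text does not state how the per-class scores were aggregated.

```python
def f1_from_pr(p, r):
    """F1 as the harmonic mean of precision p and recall r (fractions)."""
    return 2 * p * r / (p + r)


def iou_from_pr(p, r):
    """IoU implied by precision and recall from the same pooled counts.

    Follows from 1/IoU = 1/P + 1/R - 1.
    """
    return p * r / (p + r - p * r)


# Precision and recall quoted in the text for DSFNet (as fractions).
cloud_p, cloud_r = 0.9236, 0.9401
shadow_p, shadow_r = 0.8215, 0.7084

print(round(100 * f1_from_pr(cloud_p, cloud_r), 2))     # 93.18
print(round(100 * iou_from_pr(cloud_p, cloud_r), 2))    # 87.23
print(round(100 * f1_from_pr(shadow_p, shadow_r), 2))   # 76.08
print(round(100 * iou_from_pr(shadow_p, shadow_r), 2))  # 61.39
```

The cloud values and the cloud-shadow IoU match the reported figures. The cloud-shadow F1 computed this way is about 76.08, slightly above the reported 75.99; a gap of this size could arise if the reported F1 was averaged per image rather than derived from pooled precision and recall, but the text does not say which was done.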
The visual results in Figure 9 show that classical models exhibit significant limitations when handling complex cloud shadow scenes. In the figure, black represents the background, white denotes clouds, and gray indicates cloud shadows, with red boxes highlighting regions of significant discrepancy. Although HRNet maintains high-resolution features and produces relatively smooth boundaries, it remains unstable when processing translucent thin cirrus clouds or irregular broken clouds, leading to partial false positives and loss of detail in complex scenes (first row of Figure 9). While Transformer architectures such as SegFormer and Swin-Unet outperform traditional CNNs in capturing global context, they fall short in restoring the local textures of cloud layers: Swin-Unet displays distinct jagged artifacts along segmentation edges, whereas SegFormer suffers from extensive false negatives in fragmented cloud regions (second row of Figure 9). DeepLabV3+ captures multi-scale context effectively via atrous convolutions, but its emphasis on a large receptive field limits its perception of high-frequency details; consequently, high-level semantics dominate the prediction, and the extremely faint cloud shadow in the bottom-left corner of the fourth row is missed. Building upon the strong semantic representation capabilities inherited from DeepLabV3+, DSFNet introduces DSRM to strengthen the extraction of directional broken clouds and strip-like structures. It leverages the statistical attention mechanism of ASCA to suppress complex background noise, significantly reducing false negatives for small cloud shadows. Furthermore, by adaptively aggregating multi-scale features through AGMF in the decoding stage, it sharpens boundaries and generates high-quality segmentation masks with tight contours and smooth edges.
To better demonstrate the generalization capability of our model in different scenarios, we compared the segmentation results of various models across eight different scenes, as shown in Figure 10: grassland, desert, rocky terrain, urban, mountainous areas, water bodies, snow/ice, and barren land. In the first set of images, HRNet shows relatively coarse edge recognition for irregular cloud shadows; while other models perform better in edge segmentation, their rendering of fine details is still inferior to that of our model.
In the second set of images, HRNet and SegFormer exhibit structural distortion when processing irregular cloud shadow edges. Although DeepLabV3+ and Swin-Unet recover the overall shape and contours of the cloud shadows, our proposed method best preserves their morphological integrity and edge sharpness, producing segmentation results that are most consistent with the ground truth labels.
In the third set of images, the red box highlights a cloud shadow with a ring-like topological structure. DeepLabV3+, SegFormer, and Swin-Unet tend to fill in the central hole, resulting in a loss of the topological structure. In contrast, our proposed method successfully preserves the hole features within the cloud shadow, reflecting its capability to maintain complex geometric structures.
In the fourth set of images, all comparative models merge adjacent independent cloud shadow patches into connected regions. Only our model successfully segments the intervals between cloud shadows, maintaining clear physical boundaries between targets.
In the fifth set of images, Swin-Unet confuses the terrain shadows caused by topographic variations in mountainous areas with cloud shadows. Our model, however, effectively distinguishes the shadows cast by clouds from the inherent terrain shadows of the surface.
In the sixth set of images, water body scenes are typically accompanied by spectral variations caused by waves or turbidity, making thin clouds extremely difficult to observe over water. For the extremely thin cloud layers above the water surface, HRNet and SegFormer almost completely miss them, whereas the proposed method successfully captures these low-contrast thin-cloud targets.
In the seventh set of images, HRNet is largely unable to distinguish between snow-capped mountain peaks and cloud pixels. Other competing methods suffer from interference caused by different objects with similar spectra, misclassifying portions of snow-covered summits as clouds. Our model exhibits the lowest false positive rate, demonstrating superior feature discrimination and robustness in complex backgrounds.
In the eighth set of images, DeepLabV3+ fails to adequately identify the fragmented black background areas within cloud layers, resulting in the loss of internal topological details. Our model, however, fully preserves the internal cavities within the cloud layers.
Figure 11 visualizes the segmentation performance of different methods for cloud and cloud-shadow regions with varying optical thicknesses and morphological heterogeneity. Cumulus clouds typically exhibit complex geometric boundaries and evident internal brightness variations, which increase the difficulty of maintaining consistent segmentation within the cloud body. In such cases, the proposed method produces more coherent predictions across both bright and dark cloud regions by effectively modeling long-range contextual dependencies. Broken clouds display highly discontinuous spatial distributions, and these fine-scale structures are easily smoothed out or omitted during feature extraction. As shown in the highlighted regions, DSFNet preserves these fragmented cloud components more effectively and reduces local omissions.
The interior of thick clouds is usually easier to identify due to the strong reflectance associated with large optical thickness. However, near the cloud margins, optical thickness gradually decreases and reflectance drops sharply, leading to blurred transitions between the cloud and the background. In these low-contrast regions, DSFNet yields cloud contours that are sharper and more consistent with the ground truth. Thin clouds, by contrast, are semi-transparent and allow background radiative signals to partially penetrate the cloud layer, so that the observed pixel values become a mixture of cloud and surface spectra. Over highly reflective backgrounds such as deserts, snow/ice, or urban areas, thin clouds are therefore particularly prone to omission. In these challenging cases, DSFNet suppresses background interference more effectively and improves the completeness of thin-cloud segmentation.
Cloud-shadow regions exhibit even stronger ambiguity under complex backgrounds. Although their overall shapes are not always regular or strictly elongated, they often show local directional extension, boundary continuity, and projection-consistent structural coherence, especially in weak-response regions and near irregular cloud boundaries. These characteristics are particularly difficult to preserve under low contrast and strong background interference. As further illustrated by the samples in rows 6–9 of Figure 11, such cloud-shadow regions may appear fragmented, locally stretched, or weakly contrasted rather than forming regular stripe-like patterns. Compared with the competing methods, DSFNet produces cloud-shadow predictions with better regional continuity, tighter boundary adherence, and fewer structural breaks.
Overall, the proposed method yields more complete cloud regions, more continuous cloud-shadow predictions, and more accurate boundaries across diverse cloud and cloud-shadow scenarios.
3.5. Generalization Performance Analysis
3.5.1. Evaluation on Additional Datasets
To further validate the cross-dataset generalization performance of the proposed DSFNet, comparative experiments were conducted on two benchmark datasets, namely the HRC_WHU dataset (Dataset 1) and the Cloud and Cloud Shadow Dataset (Dataset 2). The segmentation accuracy was quantitatively evaluated using three widely adopted metrics: PA, MPA, and MIoU. The quantitative results are reported in Table 5.
Experimental results indicate that our model achieves the best overall performance on both datasets. On Dataset 1, our model leads all compared models, achieving the highest scores of 94.57% in PA, 93.76% in MPA, and 87.81% in MIoU. Similarly, on Dataset 2, it exhibits the best overall performance with a PA of 97.78%, MPA of 96.73%, and MIoU of 93.31%. Notably, despite the distribution discrepancy between the two datasets, DSFNet maintains consistent superiority, highlighting its strong domain adaptability. Although models like DATransunet and MFAFNet achieve competitive results on certain metrics, our model outperforms them on both datasets, which supports its generalization capability and adaptability in complex cloud and cloud shadow segmentation tasks.
It is worth noting that differences in absolute performance across datasets should be interpreted with caution. They are mainly attributable to differences in task setting, spatial resolution, and scene complexity, rather than to any inconsistency in the evaluation protocol. In particular, HRC_WHU is a binary cloud/background segmentation task, whereas GF1_WHU and the Cloud and Cloud Shadow Dataset adopt a three-class setting including cloud shadow, which introduces additional inter-class confusion. Moreover, GF1_WHU has a coarser spatial resolution, making boundary localization more difficult due to mixed pixels and blurred transitions. It also contains more heterogeneous backgrounds and more complex scene variations, which further increase the segmentation difficulty. The relatively lower absolute metrics on GF1_WHU therefore mainly reflect the higher intrinsic difficulty of this dataset.
3.5.2. Cross-Dataset Transfer Evaluation
To further examine the transferability of the proposed model under domain shift, bidirectional cross-dataset experiments were conducted between the GF1_WHU dataset and the Cloud and Cloud Shadow Dataset (Dataset 2). Specifically, the model was first trained on GF1_WHU and evaluated on the Cloud and Cloud Shadow Dataset, after which the training and testing domains were reversed. The quantitative results are summarized in Table 6.
Under both transfer settings, DSFNet consistently outperforms DeepLabV3+. When trained on GF1_WHU and evaluated on the Cloud and Cloud Shadow Dataset, DSFNet improves PA, MPA, F1, and MIoU by 2.61%, 3.91%, 4.58%, and 5.62%, respectively. When the training and testing domains are reversed, the corresponding improvements are 3.35%, 4.50%, 5.36%, and 4.53%. These results indicate that, although both models suffer from performance degradation under cross-dataset testing, the proposed method maintains a stable advantage in the presence of distribution shifts.
The superior transfer robustness of DSFNet can be attributed to the complementary design of the proposed framework. The DSRM enhances direction-sensitive structural continuity, which is beneficial for preserving projected cloud-shadow morphology under varying scene layouts. The ASCA improves robustness to radiometric ambiguity and anomalous responses, thereby alleviating confusion between cloud shadows and visually similar dark regions. Meanwhile, the AGMF strengthens cross-level interaction by explicitly exploiting feature discrepancies, which helps preserve boundary details when the source and target distributions are inconsistent. As a result, compared with the baseline model, DSFNet exhibits stronger robustness under cross-dataset transfer and better tolerance to domain shifts.