Article

DSFNet: A Directional Statistical Fusion Network for Cloud and Cloud Shadow Segmentation

1 School of Mathematics and Statistics, Nanjing University of Information Science and Technology, Nanjing 210044, China
2 School of Automation, Nanjing University of Information Science and Technology, Nanjing 210044, China
3 Collaborative Innovation Center on Atmospheric Environment and Equipment Technology, Nanjing University of Information Science and Technology, Nanjing 210044, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(9), 1432; https://doi.org/10.3390/rs18091432
Submission received: 2 March 2026 / Revised: 18 April 2026 / Accepted: 1 May 2026 / Published: 4 May 2026

Highlights

What are the main findings?
  • DSFNet integrates directional structural enhancement, robust statistical modulation, and adaptive grouped multi-scale fusion for joint cloud and cloud shadow segmentation.
  • By incorporating directional refinement and adaptive grouped multi-scale fusion, DSFNet handles elongated shadow structures and achieves 76.97% mIoU on the GF1_WHU dataset, a 1.51% improvement over the DeepLabV3+ baseline.
What are the implications of the main findings?
  • The proposed framework demonstrates the value of combining directional structural priors with robust statistical modeling for remote sensing segmentation under heterogeneous background interference.
  • By preserving boundary details and semantic consistency, DSFNet enables more reliable preprocessing for downstream agricultural and climate monitoring.

Abstract

Accurate cloud and cloud shadow segmentation is a critical prerequisite for remote sensing image preprocessing. However, this task remains challenging due to the directional continuity of projected cloud shadows, the radiometric ambiguity between low-reflectance shadows and other dark surfaces, and the difficulty of preserving semantic consistency and fine boundaries in complex scenes. To address these issues, this paper proposes a Directional Statistical Fusion Network (DSFNet) based on an enhanced DeepLabV3+ architecture. Specifically, a Directional Scale Refinement Module (DSRM) is introduced in parallel with Atrous Spatial Pyramid Pooling to strengthen the representation of direction-sensitive cloud-shadow structures and multi-scale cloud regions. An Adaptive Statistical Context Attention (ASCA) module is further designed to perform robust feature modulation by jointly exploiting global statistics, edge-aware statistics, and median-based normalization, thereby suppressing anomalous responses under heterogeneous backgrounds. In the decoder, an Adaptive Grouped Multi-scale Fusion (AGMF) module is employed to adaptively fuse shallow detail features and high-level semantic features through discrepancy-guided grouped gating, improving structural consistency and boundary recovery. In addition, a hybrid loss is adopted to further optimize segmentation. Experiments on the GF1_WHU dataset show that DSFNet achieves 76.97% mIoU, demonstrating strong effectiveness and robustness in complex remote sensing scenes.

1. Introduction

With the rapid advancement of remote sensing technologies, optical imagery has become indispensable for global terrestrial monitoring. However, cloud cover and the associated shadows frequently obscure the land surface, degrade radiometric consistency, and reduce the reliability of subsequent image interpretation. Such interference poses substantial challenges to downstream applications, including land cover mapping, object detection, and environmental monitoring. Therefore, accurate segmentation of clouds and cloud shadows has become a fundamental prerequisite for remote sensing image preprocessing, as it is essential for mitigating atmospheric contamination and improving the usability and analytical reliability of remote sensing data in applications such as agricultural monitoring, solar energy assessment, and climate change research.
Traditional cloud and cloud shadow detection methods primarily rely on physical models and statistical features, and can generally be grouped into three main categories. The first category consists of threshold-based detection methods, which construct physical discrimination rules based on radiometric characteristics in the visible and thermal infrared bands [1,2]. These methods are simple and efficient, but they are often sensitive to thin clouds and high-reflectance backgrounds. The second category comprises geometry- and texture-based detection methods, which aim to mitigate spectral confusion by incorporating spatial geometric relationships and textural information [3,4]. However, their performance is usually unstable in complex scenes with weak boundaries or heterogeneous surface backgrounds. The third category includes time-series and statistical-learning-based methods, which enhance robustness by exploiting multi-temporal variation or statistical classifiers [5,6,7]. Despite their improved robustness, these methods still depend heavily on handcrafted features and prior assumptions [8], making it difficult to capture complex structures and fine boundary details.
Since 2015, deep learning has fundamentally reshaped semantic segmentation in remote sensing. Early representative models, such as the Fully Convolutional Network (FCN) [9], established the paradigm of end-to-end pixel-wise prediction, while subsequent architectures mainly advanced along three directions: preserving spatial details, strengthening multi-scale contextual modeling, and capturing long-range dependencies. Specifically, encoder–decoder networks such as U-Net [10] and SegNet [11], as well as high-resolution architectures such as HRNet [12], emphasized the recovery and preservation of fine spatial structures. To better handle scale variations and contextual aggregation, the DeepLab series [13,14,15,16] and PSPNet [17] introduced dilated convolutions, ASPP, and pyramid pooling to enlarge receptive fields and enhance multi-scale representation. More recently, Transformer-based models, including Swin-Unet [18], SegFormer [19], and DA-TransUNet [20], have further strengthened global dependency modeling through self-attention mechanisms. These advances have laid an important foundation for cloud and cloud shadow segmentation in remote sensing, where the targets often exhibit large scale variation, weak boundaries, and strong interference from complex land-surface backgrounds.
In recent years, increasing attention has been paid to the integration of direction-aware enhancement, attention-based modulation, and cross-scale feature fusion for cloud and cloud shadow segmentation in remote sensing imagery. For instance, MSFANet [21] improves the anisotropic structural representation of cloud shadows through multi-scale strip pooling, while AGASMA [22] introduces axial hybrid attention within a parallel CNN–Transformer framework to enhance fine-grained segmentation. In terms of multi-scale and multi-modal interaction, HyCloudX [23], Wang et al. [24], and Feng et al. [25] further improve segmentation performance from the perspectives of spectral–spatial collaboration, channel–spatial attention, and adaptive feature fusion, respectively. Collectively, these studies indicate the importance of direction-sensitive structural representation, robust feature modulation, and effective cross-level interaction.
Nevertheless, several critical challenges remain in joint cloud and cloud shadow segmentation. First, cloud shadows are not merely dark regions, but projection-induced structures formed under solar illumination. Their spatial morphology is closely related to solar elevation, solar azimuth, and local projection geometry, and therefore often exhibits strong structural continuity along specific directions [3,26,27]. Meanwhile, cloud bodies usually contain multi-scale textures and irregular local boundaries, making it difficult for mainstream isotropic convolutions and pooling operations to simultaneously preserve cloud details and shadow continuity. Second, from a radiometric perspective, cloud shadows usually appear as low-reflectance regions and are easily confused with water bodies, building shadows, and other dark surfaces [28], while snow, bright buildings, and highly reflective water may introduce anomalous responses. Consequently, attention mechanisms based on mean statistics remain sensitive to noise and outliers, leading to weight bias and misclassification in high-interference backgrounds [29,30]. Third, in the decoding stage, conventional multi-scale fusion strategies still rely mainly on direct concatenation or summation, without explicitly exploiting discrepancy information between features of different levels to guide aggregation, which may result in boundary degradation and structural discontinuity [31,32]. Therefore, a more task-oriented framework is still needed to jointly address projection-aware structural modeling, robust radiometric modulation, and selective cross-scale aggregation in a unified manner.
These observations directly motivate the proposed framework. Specifically, the direction-sensitive continuity of cloud-shadow projection motivates the Directional Scale Refinement Module (DSRM), the need to suppress radiometric ambiguity and anomalous responses motivates the Adaptive Statistical Context Attention (ASCA), and the limitation of direct cross-scale aggregation motivates the Adaptive Grouped Multi-scale Fusion (AGMF) in the decoder. From this perspective, DSFNet is not a simple combination of existing techniques, but a task-oriented framework that reorganizes these three methodological cues around the specific challenges of joint cloud and cloud shadow segmentation. On this basis, this paper proposes a Directional Statistical Fusion Network (DSFNet) for precise cloud and cloud shadow segmentation in complex remote sensing imagery. The main contributions of this work are summarized as follows:
  • We propose a Directional Statistical Fusion Network that enhances directional structural representation and improves statistical robustness, thereby alleviating cross-scale feature imbalance and background interference in cloud and cloud-shadow segmentation.
  • A Directional Scale Refinement Module is designed to enhance direction-aware feature modeling and improve representation of directionally continuous cloud shadows and anisotropic cloud structures without compromising multi-scale receptive fields.
  • The Adaptive Statistical Context Attention module is developed based on a median-centered robust modulation principle to suppress noise-sensitive feature responses in complex remote sensing environments.
  • To mitigate cross-scale feature imbalance, an Adaptive Grouped Multi-scale Fusion module is introduced to facilitate fine-grained fusion between deep semantic abstractions and shallow spatial details.
  • The training strategy integrates a hybrid loss function combining Logit-Adjusted Focal Cross-Entropy and Multi-class Focal Tversky Loss. This combination guides the model to prioritize cloud shadow boundaries and minority classes, reducing omission errors and boosting overall accuracy.

2. Materials and Methods

2.1. Network Structure

The overall architecture of the proposed DSFNet is illustrated in Figure 1. In the encoding stage, the backbone first extracts hierarchical features from the input image, where shallow features preserve rich boundary, texture, and local semantic information, while high-level features provide stronger abstract semantic representations. The high-level features are then fed into the ASPP module for multi-scale context aggregation, while a Directional Scale Refinement Module (DSRM) is introduced in parallel to further enhance the representation of elongated cloud-shadow structures and multi-scale cloud regions. The aggregated features are subsequently refined by the Adaptive Statistical Context Attention (ASCA) module, which suppresses anomalous responses under complex backgrounds and improves the stability of semantic features. In the decoding stage, instead of directly concatenating multi-level features, the network employs an Adaptive Grouped Multi-scale Fusion (AGMF) module to spatially align features and perform discrepancy-guided fusion of shallow detail features, high-level semantic features, and the enhanced semantic feature from the encoder. On this basis, the fused semantic feature produced by AGMF is further concatenated with the shallow feature to complement stable high-level semantics with low-level boundary and texture details, thereby reinforcing the structural consistency of cloud-shadow regions while preserving fine details, and finally generating the cloud and cloud shadow segmentation result.
The proposed framework is built upon two closely related ideas, namely directional modeling and statistical fusion. In DSRM, directional modeling does not simply refer to elongated objects, but to the representation of 2D spatial context as aggregated responses along predefined directional bases, thereby capturing direction-sensitive structural continuity in the feature space. This is particularly relevant to cloud shadows, whose spatial morphology is closely related to solar illumination and projection geometry and therefore often exhibits locally extended and direction-consistent structures. For ASCA, statistical fusion is reflected in robust feature modulation through the joint use of global statistics, edge-aware statistics, and median-based location estimation, which reduces the sensitivity of mean-dominated responses to outliers and anomalous background interference. Furthermore, AGMF extends statistical fusion to cross-level feature aggregation by explicitly exploiting discrepancy information between features of different levels to guide adaptive fusion, rather than relying solely on direct concatenation or summation. This design is consistent with the radiometric characteristics of remote sensing imagery, where cloud shadows usually appear as low-reflectance regions and are easily confused with other dark surfaces, while strong reflections from bright objects may introduce anomalous responses. By integrating direction-aware structural enhancement, robust statistical modulation, and discrepancy-guided cross-scale fusion, DSFNet improves the discrimination of cloud and cloud-shadow regions in complex scenes.

2.2. Directional Scale Refinement Module (DSRM)

In cloud and cloud shadow segmentation, cloud shadows often exhibit direction-sensitive structural continuity due to projection geometry, while cloud bodies usually involve multi-scale variations and irregular local textures. Such structural heterogeneity makes it difficult for conventional isotropic feature extraction to simultaneously preserve shadow continuity and cloud details.
Although ASPP enlarges the receptive field through multi-scale atrous convolutions, its sampling pattern remains spatially sparse. For elongated shadows and low-contrast transition regions, such sparse sampling may produce discontinuous feature responses, thereby leading to fragmented shadow boundaries or local missed detections. To alleviate this limitation without disturbing the original multi-scale receptive-field distribution, we introduce a Directional Scale Refinement Module as a parallel branch to ASPP, as shown in Figure 2. DSRM is designed to supplement ASPP with explicit directional context modeling and complementary local multi-scale refinement.
Formally, for an input feature map $X \in \mathbb{R}^{B \times C \times H \times W}$, its directional representation is defined as:
$$D(X) = \{ P_{\theta}(X) \mid \theta \in \Theta \}$$
where $\Theta = \{0, \pi/2\}$ denotes the predefined orthogonal directional bases corresponding to the horizontal and vertical directions, and $P_{\theta}$ denotes the strip-wise aggregation operator along direction $\theta$. Unlike conventional anisotropic convolutions that still operate within local receptive fields, this formulation performs global aggregation along one spatial dimension and therefore emphasizes long-range directional structural continuity. From a physical perspective, DSRM does not explicitly encode solar elevation or azimuth angles, but implicitly responds to the projection-induced structural regularity of cloud shadows in image space.
Figure 3 illustrates the complementary multi-scale feature refinement branch used in DSRM. Although directional aggregation is effective in capturing long-range structural continuity, it is insufficient for fully describing local boundary transitions, fragmented textures, and scale variations. To compensate for these missing cues, a multi-scale feature refinement branch [33] is first introduced to generate the complementary feature $F_m$. This branch employs two parallel depthwise convolution paths with kernel sizes $3 \times 3$ and $7 \times 7$ to capture local and mesoscale contexts, respectively, and fuses their responses to form $F_m$.
With this local complement, the directional context branch shown in Figure 2 extracts orientation-sensitive responses through Strip Average Pooling and Strip Maximum Pooling along the horizontal and vertical directions. Specifically, aggregation along width $W$ yields horizontal strip descriptors:
$$s_h^{\mathrm{avg}}(c, h, 1) = \frac{1}{W} \sum_{w=1}^{W} X(c, h, w),$$
$$s_h^{\mathrm{max}}(c, h, 1) = \max_{w} X(c, h, w),$$
while aggregation along height $H$ yields vertical strip descriptors:
$$s_w^{\mathrm{avg}}(c, 1, w) = \frac{1}{H} \sum_{h=1}^{H} X(c, h, w),$$
$$s_w^{\mathrm{max}}(c, 1, w) = \max_{h} X(c, h, w).$$
These strip-wise responses are concatenated and projected to produce the directional context feature $F_d$, which mainly captures long-range structural continuity and directional consistency:
$$F_d = \phi\left(\mathrm{BN}\left(\mathrm{Conv}_{1 \times 1}\left(\mathrm{Concat}\left[E(s_h^{\mathrm{avg}}), E(s_h^{\mathrm{max}}), E(s_w^{\mathrm{avg}}), E(s_w^{\mathrm{max}})\right]\right)\right)\right)$$
Finally, the directional context feature $F_d$ is fused with the multi-scale refined feature $F_m$ through channel concatenation and a $1 \times 1$ convolution, followed by residual addition with the input feature:
$$F_{out} = \mathrm{Conv}_{1 \times 1}(\mathrm{Concat}[F_d, F_m]) + X$$
In this way, DSRM establishes a direction-aware structural prior for cloud-shadow projection and complements it with local multi-scale refinement. By jointly enhancing global directional continuity and local structural variability, the proposed module effectively mitigates the discontinuity introduced by sparse atrous sampling, while improving shadow continuity preservation and cloud-detail representation in complex remote sensing scenes.
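The strip pooling step of DSRM can be illustrated with a small NumPy sketch. It covers only the four strip descriptors and their concatenation; the 1 × 1 convolution, batch normalization, and activation are omitted. The function names are hypothetical, the batch dimension is dropped, and treating the operator E as a plain broadcast back to the full H × W grid is an assumption, since the text does not define it.

```python
import numpy as np

def strip_descriptors(x):
    """Strip average/max pooling for one feature map x of shape (C, H, W)."""
    s_h_avg = x.mean(axis=2, keepdims=True)  # (C, H, 1): average over width
    s_h_max = x.max(axis=2, keepdims=True)   # (C, H, 1): maximum over width
    s_w_avg = x.mean(axis=1, keepdims=True)  # (C, 1, W): average over height
    s_w_max = x.max(axis=1, keepdims=True)   # (C, 1, W): maximum over height
    return s_h_avg, s_h_max, s_w_avg, s_w_max

def directional_context(x):
    """Broadcast each descriptor back to (C, H, W) and stack on channels.

    Assumption: E(.) in the text is modeled here as a simple broadcast.
    """
    expanded = [np.broadcast_to(s, x.shape) for s in strip_descriptors(x)]
    return np.concatenate(expanded, axis=0)  # (4C, H, W)

# Toy input: a single horizontal band, mimicking an elongated shadow.
x = np.zeros((1, 4, 6))
x[0, 2, :] = 1.0
s_h_avg, s_h_max, s_w_avg, s_w_max = strip_descriptors(x)
ctx = directional_context(x)
```

In this toy case the horizontal average descriptor responds fully on the row that contains the band and not at all elsewhere, while the vertical average descriptor is diluted to 0.25 in every column. This is the sense in which the descriptors are direction-sensitive.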

2.3. Adaptive Statistical Context Attention Module (ASCA)

Traditional attention mechanisms usually rely on global averaging to estimate feature importance. However, in remote sensing imagery, mean-based statistics are highly sensitive to radiometric ambiguity and extreme responses. In particular, low-reflectance cloud shadows are easily confused with other dark surfaces, while highly reflective objects may significantly bias the global mean, which is especially detrimental to thin clouds, faint shadows, and low-contrast transition regions. To alleviate this limitation, we introduce an Adaptive Statistical Context Attention module. The overall architecture of ASCA and the LGCG operator are illustrated in Figure 4 and Figure 5, respectively. Instead of relying on a single mean-based descriptor, ASCA incorporates multiple statistical cues and a median-centered robust spatial reference [34,35] to construct more stable and discriminative attention responses under complex radiometric conditions.
The LGCG operator first generates a statistically guided channel descriptor. Given an input feature tensor $F \in \mathbb{R}^{B \times C \times H \times W}$, global semantic statistics are first extracted by Global Average Pooling:
$$g_{global} = \mathrm{GAP}(F) \in \mathbb{R}^{B \times C}$$
To further incorporate structural variation cues, a depthwise convolution is applied to obtain a smoothed response, and the absolute difference between the input and the smoothed feature is then used to characterize edge-aware variation:
$$F_{edge} = \left| F - \mathrm{DWConv}_{3 \times 3}(F) \right|$$
$$g_{edge} = \mathrm{GAP}(F_{edge}) \in \mathbb{R}^{B \times C}$$
The two statistics are then fused to generate a channel-wise guidance vector:
$$g = \sigma\left(\mathrm{Conv}_{1 \times 1}(g_{global} + g_{edge})\right)$$
where $g \in (0, 1)^{B \times C}$. In this way, LGCG jointly encodes global semantic context and local variation information, making the channel guidance less susceptible to radiometric bias caused by complex backgrounds.
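A minimal NumPy sketch of the channel guidance computation may help make the two statistics concrete. Several simplifications are assumed: the learned depthwise 3 × 3 convolution is replaced by a fixed 3 × 3 mean filter, the 1 × 1 convolution on the pooled vector is written as a matrix product with a hypothetical `weight`, and the batch dimension is omitted.

```python
import numpy as np

def box_smooth3x3(f):
    """Per-channel 3x3 mean filter with edge padding.

    Stand-in (assumption) for the learned depthwise 3x3 convolution.
    """
    c, h, w = f.shape
    p = np.pad(f, ((0, 0), (1, 1), (1, 1)), mode="edge")
    out = np.zeros((c, h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += p[:, dy:dy + h, dx:dx + w]
    return out / 9.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lgcg_guidance(f, weight):
    """Channel guidance g for f of shape (C, H, W); weight has shape (C, C)."""
    g_global = f.mean(axis=(1, 2))             # global average pooling
    f_edge = np.abs(f - box_smooth3x3(f))      # edge-aware variation
    g_edge = f_edge.mean(axis=(1, 2))
    return sigmoid(weight @ (g_global + g_edge))

# Channel 0 is flat, channel 1 is a checkerboard with the same mean.
flat = np.full((4, 4), 0.5)
checker = (np.indices((4, 4)).sum(axis=0) % 2).astype(float)
f = np.stack([flat, checker])
g = lgcg_guidance(f, np.eye(2))
```

Both channels have the same global mean, so a purely mean-based descriptor could not separate them. The edge-aware term is zero for the flat channel and positive for the checkerboard, which gives the textured channel a larger guidance value under the identity weight used here.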
Based on the guidance signal produced by LGCG, ASCA further performs robust spatial modulation, as shown in Figure 4. The guided response $Q_{guided}$ is obtained by modulating the depthwise-convolved feature with the LGCG-generated channel guidance vector $g$. The spatial saliency map $W$ is then obtained by channel-wise aggregation:
$$W = \sum_{c=1}^{C} Q_{guided}^{(c)}$$
To suppress the influence of extreme responses, ASCA introduces a median-based spatial normalization strategy. Specifically, the saliency map is flattened along the spatial dimension, and its median is used as a robust location statistic:
$$m_{map} = \mathrm{Median}(W) \in \mathbb{R}^{B \times 1 \times 1 \times 1}$$
where $\mathrm{Median}(\cdot)$ denotes median estimation over the flattened spatial dimension. The normalized spatial attention is then formulated as:
$$\alpha = \sigma\left(\beta \, (W - m_{map})\right)$$
where $\beta$ is a learnable scaling factor and $\alpha \in (0, 1)^{B \times 1 \times H \times W}$ denotes the robust spatial attention map. Compared with conventional average-based attention, this design is less sensitive to outliers and therefore more suitable for remote sensing scenes with radiometric heterogeneity.
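The effect of centering on the median rather than the mean can be seen in a short sketch. It assumes the attention is a sigmoid of the scaled, centered saliency map, uses a fixed `beta` in place of the learnable factor, and drops the batch dimension; the function name is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def centered_attention(w_map, beta=1.0, center="median"):
    """Spatial attention from a saliency map of shape (H, W).

    center="median" follows the median-centered idea in the text;
    center="mean" is included only for comparison.
    """
    ref = np.median(w_map) if center == "median" else np.mean(w_map)
    return sigmoid(beta * (w_map - ref))

# A uniform saliency map with one extreme response (e.g. a bright outlier).
w_map = np.ones((5, 5))
w_map[0, 0] = 1000.0
a_med = centered_attention(w_map, center="median")
a_mean = centered_attention(w_map, center="mean")
```

With median centering the 24 ordinary pixels keep a neutral attention of 0.5, because the median ignores the single outlier. With mean centering the outlier drags the reference up to about 41, and the ordinary pixels are pushed to nearly zero attention.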
Finally, the channel guidance and the robust spatial attention are jointly used to modulate the feature response, followed by a $1 \times 1$ convolution and residual addition with the input feature to obtain the output of ASCA:
$$Y = F + \gamma \cdot \mathrm{Conv}_{1 \times 1}(V \odot \alpha \odot g)$$
where $V$ denotes the depthwise-convolved feature response and $\gamma$ is a learnable scaling coefficient that balances the contributions between the attention branch and the identity pathway.
By jointly exploiting global statistics, edge-aware statistics, and median-based robust normalization, ASCA effectively reduces the ambiguity between cloud shadows and other dark surfaces, suppresses interference from extreme radiometric responses, and enhances cloud-shadow discrimination in complex remote sensing backgrounds.

2.4. Adaptive Grouped Multi-Scale Fusion Module (AGMF)

In the decoding stage, accurate cloud and cloud shadow segmentation requires both stable semantic abstraction and sufficient boundary detail. However, these cues are not equally preserved across feature levels: shallow features retain richer local structures and boundary information, whereas deeper features encode more reliable semantic context. Conventional decoding strategies usually merge them through direct concatenation or summation, which may obscure the discrepancy between semantic abstraction and local texture responses. In complex remote sensing scenes, this cross-level discrepancy is especially informative for ambiguous cloud-shadow regions. Therefore, rather than relying on uniform fusion, the decoder should explicitly exploit such discrepancy as a statistical indicator to guide adaptive aggregation, so as to reduce ambiguity and preserve structural integrity.
To address this issue, we introduce an Adaptive Grouped Multi-scale Fusion Module, whose overall structure is shown in Figure 6. Let the deep semantic feature refined by ASCA and the subsequent $1 \times 1$ convolution be denoted as $F_{sem}$, the high-level feature from the backbone be denoted as $F_{high}$, and the shallow feature be denoted as $F_{low}$. To facilitate pixel-level fusion, taking the spatial resolution of the shallow feature as the baseline, $F_{sem}$ and $F_{high}$ are first upsampled to $\frac{H}{4} \times \frac{W}{4}$ by bilinear interpolation. Subsequently, a $1 \times 1$ convolution is applied to project the three branches into the same channel dimension $C$, yielding $\hat{F}_{sem}, \hat{F}_{high}, \hat{F}_{low} \in \mathbb{R}^{B \times C \times \frac{H}{4} \times \frac{W}{4}}$.
To avoid redundant coupling caused by full-channel fusion [36], the concatenated feature $X \in \mathbb{R}^{3C \times H \times W}$ is processed by grouped convolution. Formally, $X$ is split into $G$ channel groups $X = \{X_1, X_2, \ldots, X_G\}$, $X_g \in \mathbb{R}^{\frac{3C}{G} \times H \times W}$, and each group is independently transformed as:
$$Y_g = \phi\left(\mathrm{Conv}_g(X_g)\right)$$
The grouped responses are then concatenated to form the fused representation, thereby reducing redundant channel coupling. The grouped convolution structure used in this work is illustrated in Figure 7.
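To make the reduced channel coupling concrete, the following sketch applies a grouped 1 × 1 convolution in NumPy and compares weight counts against a dense convolution. The helper names are hypothetical, biases and the nonlinearity are left out, and a 1 × 1 kernel is used purely to keep the example short.

```python
import numpy as np

def grouped_conv1x1(x, weights):
    """Grouped 1x1 convolution.

    x: (C_in, H, W); weights: list of G arrays, each (C_out/G, C_in/G).
    Each group of input channels is transformed independently.
    """
    groups = np.split(x, len(weights), axis=0)
    outs = [np.einsum("oc,chw->ohw", w, xg) for w, xg in zip(weights, groups)]
    return np.concatenate(outs, axis=0)

def conv_param_count(c_in, c_out, k, groups=1):
    """Number of weights (no bias) of a k x k convolution with the given groups."""
    return c_out * (c_in // groups) * k * k

# Six input channels holding the values 0..5, split into three groups of two.
x = np.arange(6, dtype=float).reshape(6, 1, 1)
weights = [np.ones((2, 2)) for _ in range(3)]
y = grouped_conv1x1(x, weights)

dense_params = conv_param_count(6, 6, 1)
grouped_params = conv_param_count(6, 6, 1, groups=3)
```

Each output channel sees only the two input channels of its own group, so with three groups the weight count drops from 36 to 12, a reduction by the factor G.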
Based on the grouped-convolution design, an Adaptive Grouped Gating (AGG) unit is introduced as the basic fusion operator in AGMF. For each dual-input feature pair, AGG is responsible for performing discrepancy-aware adaptive gating before the subsequent multi-branch aggregation. Its detailed procedure is summarized in Algorithm 1.
Algorithm 1 Adaptive Grouped Gating (AGG)
Input: $H, L \in \mathbb{R}^{C \times H \times W}$
Output: $F_{agg}$
1: Compute discrepancy $D \leftarrow H - L$
2: Concatenate $X \leftarrow \mathrm{Concat}[H, L, D]$
3: Apply grouped $3 \times 3$ convolution to obtain local grouped representation
4: Apply grouped $1 \times 1$ convolution and Sigmoid to generate gate $\beta$
5: Fuse the two inputs as
           $F_{agg} \leftarrow \beta \odot H + (1 - \beta) \odot L$
6: Return $F_{agg}$
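The gating logic of Algorithm 1 can be sketched in a few lines of NumPy. This is a simplified illustration, not the module itself: the grouped 3 × 3 and 1 × 1 convolutions are collapsed into a single dense 1 × 1 projection with a hypothetical `gate_weight`, and the discrepancy is taken as a signed difference, which is an assumption about the exact form used.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def agg(high, low, gate_weight):
    """Discrepancy-aware gated fusion of two aligned features of shape (C, H, W).

    gate_weight: (C, 3C) projection standing in for the grouped convolutions.
    """
    d = high - low                                   # step 1: discrepancy (assumed signed)
    x = np.concatenate([high, low, d], axis=0)       # step 2: (3C, H, W)
    beta = sigmoid(np.einsum("oc,chw->ohw", gate_weight, x))  # steps 3-4: gate
    return beta * high + (1.0 - beta) * low          # step 5: convex blend

rng = np.random.default_rng(0)
high = rng.normal(size=(2, 3, 3))
low = rng.normal(size=(2, 3, 3))

# Zero weights give a gate of 0.5 everywhere, i.e. a plain average.
fused_neutral = agg(high, low, np.zeros((2, 6)))
# Arbitrary weights give a spatially varying blend.
fused = agg(high, low, rng.normal(size=(2, 6)))
```

Because the gate lies in (0, 1), the fused value at every position stays between the two inputs. The gate only decides how far to lean toward the semantic or the detail feature, with the discrepancy channel available as evidence for that decision.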
Accordingly, AGMF performs pairwise fusion on three aligned feature pairs: $\mathrm{AGG}(\hat{F}_{low}, \hat{F}_{high})$, $\mathrm{AGG}(\hat{F}_{low}, \hat{F}_{sem})$, and $\mathrm{AGG}(\hat{F}_{high}, \hat{F}_{sem})$. The three fusion outputs are then concatenated along the channel dimension and compressed by a $1 \times 1$ convolution:
$$F_{mix} = \phi\left(\mathrm{Conv}_{1 \times 1}\left(\mathrm{Concat}\left(\mathrm{AGG}(\hat{F}_{low}, \hat{F}_{high}), \mathrm{AGG}(\hat{F}_{low}, \hat{F}_{sem}), \mathrm{AGG}(\hat{F}_{high}, \hat{F}_{sem})\right)\right)\right)$$
where $\phi(\cdot)$ denotes a nonlinear transformation consisting of Batch Normalization followed by ReLU activation. To further enhance local spatial consistency, a depthwise separable $3 \times 3$ convolution is applied to obtain the final output:
$$F_{out} = \phi\left(\mathrm{DWConv}_{3 \times 3}(F_{mix})\right)$$
In this way, AGMF can be understood as a discrepancy-guided cross-scale fusion mechanism. By explicitly introducing cross-level discrepancy into the gating process, the proposed module better preserves semantic consistency and local structural details, while reducing the ambiguity between low-reflectance cloud shadows and visually similar dark surfaces in complex remote sensing scenes.

2.5. Loss Function

In cloud and cloud shadow segmentation tasks, cloud shadow pixels are typically far fewer than background pixels, and their boundaries exhibit characteristics of weak contrast and elongated structures. Such significant class imbalance and boundary blurriness tend to bias the model toward the background class during training, subsequently leading to missed detections of small-scale shadow targets and structural fragmentation. To address the aforementioned issues, this paper proposes a hybrid loss function that balances pixel-level discriminative stability and region-level structural consistency. This function is composed of a weighted combination of the Logit-Adjusted Focal Cross-Entropy Loss [37] and the multi-class Focal Tversky Loss (mFTL) [38]. For convenience, the proposed hybrid loss is denoted as FJFL.
At the pixel level, to alleviate the prior bias caused by the long-tail distribution, Logit prior adjustment is introduced into the Softmax layer, combined with a Focal mechanism to mine hard-to-classify samples. The pixel-level loss is defined as follows:
$$L_{LA\text{-}FocalCE} = -\sum_{k=1}^{C} \eta_k \, y_k \, (1 - p_k)^{\gamma} \log(p_k)$$
where $C$ denotes the total number of classes, $y_k \in \{0, 1\}$ represents the ground-truth label for class $k$, and $\eta_k$ is the class balancing weight. The focusing parameter $\gamma$ controls the penalty for hard-to-classify samples. The probability $p_k$ is computed based on the adjusted logit $\tilde{z}_k = z_k + \tau \log(\pi_k)$, where $z_k$ is the original network output logit, $\pi_k$ represents the empirical class frequency, and $\tau$ is the temperature scaling factor. By explicitly correcting the class priors, this mechanism effectively enhances the model’s discriminative capability for minority-class cloud shadow pixels.
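A compact NumPy sketch of the pixel-level loss is given below. It follows the formula above for flattened pixels; averaging over pixels, the default values of `tau` and `gamma`, and the function name are illustrative assumptions rather than settings taken from the paper.

```python
import numpy as np

def la_focal_ce(logits, labels, class_freq, eta, tau=1.0, gamma=2.0):
    """Logit-adjusted focal cross-entropy.

    logits: (N, C) raw scores; labels: (N,) integer class ids;
    class_freq: (C,) empirical class frequencies; eta: (C,) class weights.
    Returns the mean loss over the N pixels (averaging is an assumption).
    """
    z = logits + tau * np.log(class_freq)        # adjusted logits
    z = z - z.max(axis=1, keepdims=True)         # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    p_true = p[np.arange(len(labels)), labels]   # probability of the true class
    loss = -eta[labels] * (1.0 - p_true) ** gamma * np.log(p_true)
    return float(loss.mean())

# One pixel of a rare class (index 1) with uninformative logits.
logits = np.zeros((1, 3))
labels = np.array([1])
freq = np.array([0.8, 0.1, 0.1])
eta = np.ones(3)

plain = la_focal_ce(logits, labels, freq, eta, tau=0.0, gamma=0.0)
adjusted = la_focal_ce(logits, labels, freq, eta, tau=1.0, gamma=0.0)
```

With `tau=0` and `gamma=0` the expression reduces to ordinary cross-entropy, here log 3. Turning on the prior adjustment makes the rare-class pixel look harder during training (its probability drops to the class frequency 0.1), so the network has to produce a larger margin for minority classes to reach the same loss.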
At the region level, to further reduce the missed detection rate of cloud shadows and maintain the integrity of elongated structures, this paper introduces mFTL as a region-level constraint. By adjusting the penalty weights $(\alpha_k, \beta_k)$ for false positives and false negatives in the Tversky index and incorporating a focusing parameter, this loss function is able to concentrate more on difficult regions with low overlap:
$$L_{mFTL} = \sum_{k=1}^{C} w_k \left(1 - \frac{TP_k}{TP_k + \alpha_k FP_k + \beta_k FN_k}\right)^{\gamma}$$
where $w_k$ is the assigned weight for class $k$. $TP_k$, $FP_k$, and $FN_k$ denote the number of true positive, false positive, and false negative pixels for class $k$, respectively. The exponent $\gamma$ further scales the loss to emphasize difficult-to-segment regions.
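The region-level term can be sketched in the same style. To keep the loss differentiable in practice, the counts are usually computed from predicted probabilities rather than hard decisions; that soft-count formulation, the small `eps` for stability, and the default `gamma` are assumptions of this sketch.

```python
import numpy as np

def multiclass_focal_tversky(probs, onehot, alpha, beta, w, gamma=0.75, eps=1e-7):
    """Multi-class focal Tversky loss with soft counts.

    probs, onehot: (N, C) predicted probabilities and one-hot labels;
    alpha, beta, w: (C,) false-positive weights, false-negative weights,
    and class weights.
    """
    tp = (probs * onehot).sum(axis=0)
    fp = (probs * (1.0 - onehot)).sum(axis=0)
    fn = ((1.0 - probs) * onehot).sum(axis=0)
    tversky = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return float((w * np.clip(1.0 - tversky, 0.0, 1.0) ** gamma).sum())

onehot = np.array([[0.0, 1.0], [0.0, 1.0]])      # both pixels belong to class 1
perfect = onehot.copy()
half_missed = np.array([[0.0, 1.0], [1.0, 0.0]]) # second pixel predicted as class 0
w = np.ones(2)

loss_perfect = multiclass_focal_tversky(perfect, onehot, np.full(2, 0.5), np.full(2, 0.5), w)
loss_low_beta = multiclass_focal_tversky(half_missed, onehot, np.full(2, 0.7), np.full(2, 0.3), w)
loss_high_beta = multiclass_focal_tversky(half_missed, onehot, np.full(2, 0.3), np.full(2, 0.7), w)

# Class 1 alone: zero FP weight for class 0 removes class 0 from the sum.
only_cls1_low = multiclass_focal_tversky(half_missed, onehot, np.full(2, 0.7), np.full(2, 0.3), np.array([0.0, 1.0]))
only_cls1_high = multiclass_focal_tversky(half_missed, onehot, np.full(2, 0.3), np.full(2, 0.7), np.array([0.0, 1.0]))
```

For the class that suffers a missed detection, raising its false-negative weight lowers the Tversky index from about 0.77 to about 0.59 and therefore raises its loss contribution. This is the mechanism by which the loss can be tuned to penalize omissions of cloud shadow pixels more strongly.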
The final total loss function is integrated via a weighted fusion controlled by a balancing coefficient $\lambda$:
$$L = \lambda L_{LA\text{-}FocalCE} + (1 - \lambda) L_{mFTL}$$
where $\lambda$ serves as the balancing coefficient used to adjust the contribution ratio between pixel-level supervision and region-level supervision. This hybrid loss guarantees training stability while significantly improving the recall rate of cloud shadow regions.
During the experiments, multiple hyperparameter combinations were tested, as shown in Table 1. The best segmentation performance was observed when $\lambda = 0.6$. Therefore, this parameter setting was adopted for subsequent experiments.

3. Experiments and Analysis

3.1. Experimental Datasets

To comprehensively evaluate the performance and generalization capability of the proposed DSFNet in pixel-level cloud and cloud shadow semantic segmentation tasks, extensive experimental analyses were conducted on three publicly available benchmark datasets: GF1_WHU, HRC_WHU, and the Cloud and Cloud Shadow Dataset. These datasets present significant challenges, including multi-scale targets, complex surface backgrounds, and variable lighting conditions, making them highly valuable for assessing the robustness of our network.
The GF1_WHU dataset, produced by the RS IDEA team at Wuhan University, serves as the primary benchmark for this study. It contains 108 Level-2A images captured by the Wide Field of View (WFV) sensor onboard the Gaofen-1 (GF-1) satellite. These images feature a spatial resolution of 16 m and encompass four multispectral bands: red, green, blue, and near-infrared. The dataset provides extensive global coverage across diverse land cover types—including urban areas, water bodies, vegetation, and deserts—under varying cloud cover conditions. The ground truth masks were meticulously annotated by experts into three distinct categories: background (0), cloud shadow (128), and cloud (255). The diversity of cloud morphologies and the potentially confusing background features make it an ideal standard for validating segmentation accuracy. Representative samples and labels are illustrated in Figure 8.
Due to GPU memory constraints, the original large-scale images were divided into non-overlapping patches of size 256 × 256, and patches containing invalid background or severe blur were discarded. The resulting patches were then randomly divided into training, validation, and test subsets with a ratio of 8:1:1. To enhance robustness to multi-scale and multi-directional cloud structures, data augmentation techniques, including horizontal flipping, vertical flipping, and random rotation, were applied during training. Ultimately, 5428 patches were used for training, 680 for validation, and 680 for testing. The validation set was used for hyperparameter tuning and model selection, while the test set was reserved exclusively for the final performance evaluation.
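The patch extraction and splitting procedure described above can be sketched as follows. The text does not say how partial tiles at the image border or the random seed were handled, so dropping ragged borders, the seed value, and both function names are assumptions; the filtering of invalid or blurred patches is not shown.

```python
import numpy as np

def tile_patches(image, size=256):
    """Cut an (H, W, bands) image into non-overlapping size x size patches.

    Border regions that do not fill a whole patch are dropped (assumption).
    """
    h, w = image.shape[:2]
    return [image[r:r + size, c:c + size]
            for r in range(0, h - size + 1, size)
            for c in range(0, w - size + 1, size)]

def split_811(n_items, seed=0):
    """Random 8:1:1 split of item indices into train, validation, and test."""
    idx = np.random.default_rng(seed).permutation(n_items)
    n_train = n_items * 8 // 10
    n_val = n_items // 10
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# A toy four-band scene that holds exactly 2 x 2 full patches.
scene = np.zeros((600, 520, 4))
patches = tile_patches(scene)
train_idx, val_idx, test_idx = split_811(100)
```

Splitting at the patch level, as sketched here, is simple but means that patches from the same scene can land in different subsets; whether the original split was stratified by scene is not stated in the text.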
To evaluate the generalization robustness of the model across multi-source data, this paper further employed the HRC_WHU dataset and the Cloud and Cloud Shadow Dataset. The HRC_WHU dataset consists of 150 high-resolution RGB images sourced from Google Earth, with spatial resolutions ranging from 0.5 to 15 m. It covers five typical land surface types: water, vegetation, urban, ice/snow, and bare land, which often exhibit spectral similarities to clouds, thereby increasing segmentation difficulty. This dataset primarily provides binary annotations for cloud and non-cloud regions. Following the same protocol, the original images were cropped into 3200 patches of 256 × 256 pixels and then divided into 2560 training patches, 320 validation patches, and 320 test patches. The Cloud and Cloud Shadow Dataset, also sourced from Google Earth, comprises high-resolution remote sensing imagery collected by professional meteorologists across various geographical regions, including the Yunnan-Guizhou Plateau, the Qinghai–Tibet Plateau, and the Yangtze River Delta. It covers five typical backgrounds: water, forest, farmland, towns, and deserts, with refined manual annotations for clouds, cloud shadows, and background. Considering the larger variation in spatial extent and aspect-ratio distribution of the original images in this dataset, as well as the fact that the effective target regions are relatively concentrated in some samples, the images were cropped into 224 × 224 patches rather than 256 × 256. In our preliminary preprocessing comparisons, 256 × 256 patches were observed to introduce more boundary redundancy and lower effective region utilization for this dataset, which was particularly unfavorable for preserving thin clouds and subtle edge details. After removing invalid samples, a total of 2560 patches were obtained and further divided into 2048 training patches, 256 validation patches, and 256 test patches. 
It should be emphasized that, within each dataset, all compared methods used exactly the same patch size and preprocessing strategy to ensure a fair comparison.
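As an illustration of the 8:1:1 partition, the sketch below shuffles patch indices with a fixed seed and assigns one tenth each to validation and test. The function name, the seed, and the rounding rule (integer division, remainder to training) are assumptions; with this rule the counts match those reported for HRC_WHU (2560/320/320) and the Cloud and Cloud Shadow Dataset (2048/256/256), while the reported GF1_WHU counts suggest a slightly different rounding.

```python
import random


def split_811(n_items: int, seed: int = 0):
    """Randomly split range(n_items) into train/val/test index lists (about 8:1:1)."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_val = n_items // 10
    n_test = n_items // 10
    n_train = n_items - n_val - n_test  # remainder goes to training
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_811(3200)
```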

3.2. Experimental Details

All experiments were conducted using the PyTorch 2.2.2 framework on an NVIDIA RTX 4090 GPU, with CUDA 11.8 for acceleration. For training, we used the Adam optimizer. The initial learning rate was set to 0.001 and decayed with a cosine annealing schedule according to the following formula:
$$\eta_t = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\left(\frac{t}{T_{\max}}\pi\right)\right)$$
where $\eta_{\max}$ denotes the initial learning rate, the minimum learning rate $\eta_{\min}$ is set to $0.01\,\eta_{\max}$, $T_{\max}$ is the total number of training epochs, and $t$ is the current epoch. Considering hardware memory limits and convergence characteristics, the batch size was set to 16, and the model was trained for a total of 200 epochs. Notably, the proposed DSFNet was optimized using a hybrid loss function, while the compared models used standard cross-entropy loss. During training, hyperparameters were determined based on the validation set, and the checkpoint achieving the best validation MIoU was selected as the final model for test evaluation.
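The cosine annealing schedule can be evaluated directly from its definition. The sketch below uses the values stated in the text (initial rate 0.001, minimum rate equal to 1% of the initial rate, 200 epochs); the function and parameter names are illustrative.

```python
import math


def cosine_annealing_lr(t: int, t_max: int = 200,
                        eta_max: float = 1e-3,
                        eta_min_ratio: float = 0.01) -> float:
    """Learning rate at epoch t for cosine annealing from eta_max down to eta_min."""
    eta_min = eta_min_ratio * eta_max
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t / t_max))


lr_start = cosine_annealing_lr(0)    # equals eta_max
lr_mid = cosine_annealing_lr(100)    # midpoint between eta_max and eta_min
lr_end = cosine_annealing_lr(200)    # equals eta_min
```

In PyTorch, the same closed form is provided by `torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200, eta_min=1e-5)`; whether the authors used this scheduler or a custom implementation is not stated.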
To quantitatively evaluate the segmentation accuracy and computational efficiency of DSFNet across the three datasets, seven core metrics were selected: Pixel Accuracy (PA), Mean Pixel Accuracy (MPA), F1-score, Mean Intersection over Union (MIoU), Frequency Weighted Intersection over Union (FWIoU), single-frame inference time (Time) and Floating-Point Operations (FLOPs).
Calculation variables were defined based on the confusion matrix: for a dataset with $k+1$ classes (including background), $p_{i,j}$ denotes the number of pixels belonging to class $i$ but predicted as class $j$. Therefore, for a specific class $i$, $p_{i,i}$ represents the true positives (TP). The false positives (FP) and false negatives (FN) are calculated as $\sum_{j \neq i} p_{j,i}$ and $\sum_{j \neq i} p_{i,j}$, respectively.
PA is defined as the ratio of correctly classified pixels to the total pixels:
$$PA = \frac{\sum_{i=0}^{k} p_{i,i}}{\sum_{i=0}^{k}\sum_{j=0}^{k} p_{i,j}}$$
In pixel-level semantic segmentation, PA is essentially equivalent to Overall Accuracy (OA), as both quantify the proportion of correctly classified pixels over the entire image.
MPA calculates the average accuracy across all categories:
$$MPA = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{i,i}}{\sum_{j=0}^{k} p_{i,j}}$$
F1-score is the harmonic mean of Precision (P) and Recall (R):
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times P \times R}{P + R}$$
The Intersection over Union (IoU) for class i is defined as:
$$IoU_i = \frac{p_{i,i}}{\sum_{j=0}^{k} p_{i,j} + \sum_{j=0}^{k} p_{j,i} - p_{i,i}}$$
MIoU measures the overlap between predicted cloud/shadow regions and ground truth by averaging the IoU of each class:
$$MIoU = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{i,i}}{\sum_{j=0}^{k} p_{i,j} + \sum_{j=0}^{k} p_{j,i} - p_{i,i}}$$
FWIoU evaluates overall segmentation performance by weighting the IoU of each class according to its frequency in the dataset:
$$FWIoU = \frac{1}{\sum_{i=0}^{k}\sum_{j=0}^{k} p_{i,j}}\sum_{i=0}^{k}\frac{\left(\sum_{j=0}^{k} p_{i,j}\right) p_{i,i}}{\sum_{j=0}^{k} p_{i,j} + \sum_{j=0}^{k} p_{j,i} - p_{i,i}}$$
For the GF1_WHU and the Cloud and Cloud Shadow Dataset, three classes (cloud, cloud shadow, and background) were used for calculation. For the HRC_WHU dataset, two classes (cloud and background) were used.
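The accuracy metrics defined above all derive from a single confusion matrix. The following sketch, assuming NumPy and integer label maps, computes them with rows as ground truth and columns as predictions, matching the definition of p_{i,j}. The function names are illustrative, and the sketch assumes every class occurs at least once in both the labels and the predictions, so that no denominator is zero.

```python
import numpy as np


def confusion_matrix(y_true: np.ndarray, y_pred: np.ndarray, num_classes: int) -> np.ndarray:
    """cm[i, j] = number of pixels of true class i predicted as class j."""
    idx = num_classes * y_true.ravel().astype(np.int64) + y_pred.ravel().astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)


def segmentation_metrics(cm: np.ndarray) -> dict:
    cm = cm.astype(np.float64)
    tp = np.diag(cm)
    row = cm.sum(axis=1)   # ground-truth pixels per class: TP + FN
    col = cm.sum(axis=0)   # predicted pixels per class:    TP + FP
    total = cm.sum()
    precision = tp / col
    recall = tp / row
    iou = tp / (row + col - tp)
    return {
        "PA": tp.sum() / total,
        "MPA": recall.mean(),                 # per-class accuracy equals per-class recall
        "F1": 2 * precision * recall / (precision + recall),
        "IoU": iou,
        "MIoU": iou.mean(),
        "FWIoU": (row / total * iou).sum(),   # IoU weighted by class frequency
    }


y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
metrics = segmentation_metrics(confusion_matrix(y_true, y_pred, num_classes=3))
```

For the three-class datasets `num_classes` is 3, and for HRC_WHU it is 2. Accumulating one confusion matrix over the whole test set and computing the metrics once is a common convention; averaging per-image metrics would give slightly different values.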

3.3. Ablation Experiment

In this section, we first built a baseline on DeepLabV3+: the overall network structure was kept unchanged, and each proposed module was replaced by an auxiliary module consisting of standard convolutions and upsampling, whose outputs were concatenated to produce the prediction. We then progressively integrated the designed components (DSRM, ASCA, AGMF, and FJFL) into the network in place of the auxiliary module, thereby validating the contribution of each component and of the full network. Table 2 details the improvement in prediction accuracy contributed by each component.
  • Ablation of DSRM: Standard dilated convolutions in traditional ASPP struggle to capture directional geometric features. DSRM introduces strip pooling and multi-scale depthwise separable convolutions to effectively establish long-range directional dependencies while preserving local textures. Results show that DSRM improved MPA and MIoU by 0.24% and 0.56%, respectively, demonstrating its efficacy in enhancing directional consistency and textural detail.
  • Ablation of ASCA: At the end of feature aggregation, overlapping multi-scale context often results in an uneven distribution of spatial responses. ASCA utilizes semantic guidance vectors and a median-centered spatial reweighting mechanism to adaptively emphasize significant spatial positions while suppressing anomalous noise. Its inclusion improved MPA and MIoU by 0.22% each, indicating enhanced spatial discrimination and feature robustness in complex cloud shadow regions.
  • Ablation of AGMF: The feature fusion method in the decoding stage determines boundary precision. AGMF replaces simple concatenation with a dynamic grouping mechanism and local context gating, adaptively adjusting fusion granularity based on feature complexity. AGMF contributed a 0.32% increase in MIoU, validating its superiority in recovering spatial details. Although the F1-score slightly fluctuates, the consistent improvement in MIoU indicates that AGMF contributes to better overall regional overlap.
  • Ablation of FJFL: Cloud shadow segmentation suffers from severe class imbalance and weak contrast, particularly for elongated and small-scale shadow structures. To address this issue, FJFL integrates Logit-Adjusted Focal Cross-Entropy to mitigate long-tail distribution bias and incorporates a multi-class Focal Tversky Loss to emphasize hard-to-classify regions while preserving structural continuity. This hybrid loss formulation jointly enhances pixel-level discriminative robustness and region-level structural consistency. The inclusion of FJFL improves MPA, MIoU, and F1-score by 0.35%, 0.41%, and 0.63%, respectively, effectively reducing missed detections and improving overall segmentation robustness.
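To make the region-level term of FJFL more concrete, the sketch below shows one common formulation of a multi-class Focal Tversky loss in NumPy. It is a generic illustration, not the authors' implementation: the weights `alpha` and `beta`, the exponent `gamma` and its convention (some formulations use 1/γ), the averaging over classes, and the function name are all assumptions, and the Logit-Adjusted Focal Cross-Entropy term is omitted.

```python
import numpy as np


def focal_tversky_loss(probs: np.ndarray, onehot: np.ndarray,
                       alpha: float = 0.7, beta: float = 0.3,
                       gamma: float = 0.75, eps: float = 1e-7) -> float:
    """Generic multi-class Focal Tversky loss.

    probs, onehot: arrays of shape (N, C) with per-pixel class probabilities
    and one-hot labels. alpha weights false negatives and beta weights false
    positives (assumed values).
    """
    tp = (probs * onehot).sum(axis=0)
    fn = ((1.0 - probs) * onehot).sum(axis=0)
    fp = (probs * (1.0 - onehot)).sum(axis=0)
    tversky = (tp + eps) / (tp + alpha * fn + beta * fp + eps)  # per-class index in (0, 1]
    return float(np.mean((1.0 - tversky) ** gamma))


labels = np.eye(3)                        # three pixels, one per class
perfect = focal_tversky_loss(labels, labels)
uniform = focal_tversky_loss(np.full((3, 3), 1 / 3), labels)
```

Setting `alpha` above `beta` penalizes missed pixels more than false alarms, which is the usual motivation for Tversky-type losses on under-represented classes such as thin cloud shadows.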

3.4. Comparative Experiment

To verify the effectiveness and robustness of DSFNet in complex cloud shadow scenarios, we performed a benchmark comparison against several representative methods in the field. The comparison covers three mainstream architectural paradigms: classic CNN architectures (Unet, FCN, DeepLabV3+), Transformer-based models (SegFormer, Swin-Unet), and high-resolution or hybrid architectures, including methods designed specifically for cloud segmentation (HRNet, EDFF-Unet, MFAFNet). Experimental results (Table 3) indicate that while models such as HRNet and SegFormer perform well on certain metrics, DSFNet is consistently stronger across multiple evaluation metrics. Specifically, our model reached the highest scores in PA (90.62%), MPA (85.42%), and MIoU (76.97%), indicating an advantage in complex feature extraction and boundary refinement.
To further assess the class-specific performance of the proposed model, evaluation was conducted for the cloud and cloud-shadow classes, as shown in Table 4. For cloud detection, DSFNet achieved the highest recall (94.01%), F1 score (93.18%), and IoU (87.23%), while its precision (92.36%) was slightly lower than that of HRNet. For the more challenging cloud-shadow class, DSFNet attained 82.15% precision, 70.84% recall, 75.99% F1 score, and 61.39% IoU, outperforming the second-best method by 3.3 percentage points in IoU. These results indicate that the proposed method not only performs strongly on the relatively easier cloud class, but also yields more substantial gains on the harder cloud-shadow class, which is more susceptible to background interference and boundary ambiguity.
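When precision and recall are computed from pooled pixel counts, the per-class F1 score and IoU follow from them in closed form: F1 = 2PR/(P + R) and IoU = PR/(P + R − PR). The short sketch below applies these identities to the cloud-class precision and recall quoted above; the function names are illustrative.

```python
def f1_from_pr(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2.0 * p * r / (p + r)


def iou_from_pr(p: float, r: float) -> float:
    """IoU = TP / (TP + FP + FN), rewritten in terms of precision and recall."""
    return p * r / (p + r - p * r)


# Cloud-class precision and recall of DSFNet as quoted in the text.
cloud_f1 = f1_from_pr(0.9236, 0.9401)
cloud_iou = iou_from_pr(0.9236, 0.9401)
```

For the cloud class these identities give values that round to the reported 93.18% F1 and 87.23% IoU. Small differences from tabulated values can arise if metrics are averaged per image rather than computed from pooled counts.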
Based on an in-depth comparative analysis of the visual results in Figure 9, classical models exhibit significant limitations when handling complex cloud shadow scenes. In the figure, black represents the background, white denotes clouds, and gray indicates cloud shadows, with red boxes highlighting regions of significant discrepancy. Although HRNet maintains high-resolution features and produces relatively smooth boundaries, it still exhibits instability when processing thin cirrus clouds with translucent characteristics or irregular broken clouds, leading to partial false positives and loss of detail in complex scenes (as shown in the first row of Figure 9). While Transformer architectures like SegFormer and Swin-Unet outperform traditional CNNs in capturing global contexts, they fall short in restoring local textures of cloud layers: Swin-Unet displays distinct jagged artifacts on segmentation edges, whereas SegFormer suffers from extensive false negatives in fragmented cloud regions (as shown in the second row of Figure 9). Despite DeepLabV3+ demonstrating powerful multi-scale capture capabilities via atrous convolutions, its pursuit of a large receptive field limits its perception of high-frequency details; consequently, high-level semantics dominate the prediction results, causing subtle targets in the extremely faint cloud shadow region in the bottom-left corner of the fourth row to be missed. Building upon the strong semantic representation capabilities inherited from DeepLabV3+, DSFNet introduces DSRM to strengthen the extraction of directional broken clouds and strip-like structures. It leverages the statistical attention mechanism of ASCA to effectively suppress complex background noise, significantly reducing false negatives for small cloud shadows. 
Furthermore, by adaptively aggregating multi-scale features through AGMF in the decoding stage, it sharply defines boundaries and generates high-quality segmentation masks with tight contours and smooth edges.
To better demonstrate the generalization capability of our model in different scenarios, we compared the segmentation results of various models across eight different scenes as shown in Figure 10, including grassland, desert, rocky terrain, urban, mountainous areas, water bodies, snow/ice, and barren land. In the first set of images, HRNet shows relatively coarse edge recognition for irregular cloud shadows; while other models perform better in edge segmentation, their detailed rendering is still inferior to that of our model.
In the second set of images, HRNet and SegFormer exhibit structural distortion when processing irregular cloud shadow edges. Although DeepLabV3+ and Swin-Unet successfully recover the overall shape and contours of the cloud shadows, our proposed method best preserves the morphological integrity and edge sharpness of the cloud shadows, producing the segmentation results most consistent with the ground truth labels.
In the third set of images, the red box highlights a cloud shadow with a ring-like topological structure. DeepLabV3+, SegFormer, and Swin-Unet tend to fill in the central hole, resulting in a loss of the topological structure. In contrast, our proposed method successfully preserves the hole features within the cloud shadow, reflecting its capability to maintain complex geometric structures.
In the fourth set of images, all comparative models merge adjacent independent cloud shadow patches into connected regions. Only our model successfully segments the intervals between cloud shadows, maintaining clear physical boundaries between targets.
In the fifth set of images, Swin-Unet confuses the terrain shadows caused by topographic variations in mountainous areas with cloud shadows. Our model, however, effectively distinguishes the shadows cast by clouds from the inherent terrain shadows of the surface.
In the sixth set of images, water body scenes are typically accompanied by spectral variations caused by waves or turbidity, making thin clouds extremely difficult to observe over water. For the extremely thin cloud layers above the water surface, HRNet and SegFormer almost completely miss them, whereas the proposed method successfully captures these low-contrast thin-cloud targets.
In the seventh set of images, HRNet was largely unable to distinguish between snow-capped mountain peaks and cloud pixels. Other competing methods suffered from interference due to different objects with similar spectra, misclassifying portions of snow-covered summits as clouds. Our model exhibited the lowest false positive rate, demonstrating superior feature discrimination and robustness in complex backgrounds.
In the eighth set of images, DeepLabV3+ failed to adequately identify the fragmented black background areas within cloud layers, resulting in the loss of internal topological details. Our model, however, fully preserved the internal cavities within the cloud layers.
Figure 11 visualizes the segmentation performance of different methods for cloud and cloud-shadow regions with varying optical thicknesses and morphological heterogeneity. Cumulus clouds typically exhibit complex geometric boundaries and evident internal brightness variations, which increase the difficulty of maintaining consistent segmentation within the cloud body. In such cases, the proposed method produces more coherent predictions across both bright and dark cloud regions by effectively modeling long-range contextual dependencies. Broken clouds display highly discontinuous spatial distributions, and these fine-scale structures are easily smoothed out or omitted during feature extraction. As shown in the highlighted regions, DSFNet preserves these fragmented cloud components more effectively and reduces local omissions.
The interior of thick clouds is usually easier to identify due to the strong reflectance associated with large optical thickness. However, near the cloud margins, optical thickness gradually decreases and reflectance drops sharply, leading to blurred transitions between the cloud and the background. In these low-contrast regions, DSFNet yields cloud contours that are sharper and more consistent with the ground truth. Thin clouds, by contrast, are semi-transparent and allow background radiative signals to partially penetrate the cloud layer, so that the observed pixel values become a mixture of cloud and surface spectra. Over highly reflective backgrounds such as deserts, snow/ice, or urban areas, thin clouds are therefore particularly prone to omission. In these challenging cases, DSFNet suppresses background interference more effectively and improves the completeness of thin-cloud segmentation.
Cloud-shadow regions exhibit even stronger ambiguity under complex backgrounds. Although their overall shapes are not always regular or strictly elongated, they often show local directional extension, boundary continuity, and projection-consistent structural coherence, especially in weak-response regions and near irregular cloud boundaries. These characteristics are particularly difficult to preserve under low contrast and strong background interference. As further illustrated by the samples in rows 6–9 of Figure 11, such cloud-shadow regions may appear fragmented, locally stretched, or weakly contrasted rather than forming regular stripe-like patterns. Compared with the competing methods, DSFNet produces cloud-shadow predictions with better regional continuity, tighter boundary adherence, and fewer structural breaks.
Overall, the proposed method yields more complete cloud regions, more continuous cloud-shadow predictions, and more accurate boundaries across diverse cloud and cloud-shadow scenarios.

3.5. Generalization Performance Analysis

3.5.1. Evaluation on Additional Datasets

To further validate the cross-dataset generalization performance of the proposed DSFNet, comparative experiments were conducted on two benchmark datasets, namely the HRC_WHU dataset (Dataset 1) and the Cloud and Cloud Shadow Dataset (Dataset 2). The segmentation accuracy was quantitatively evaluated using three widely adopted metrics: PA, MPA, and MIoU. The quantitative results are reported in Table 5.
Experimental results indicate that our model achieves the best overall performance on both datasets. On Dataset 1, it leads all compared models, achieving the highest scores of 94.57% in PA, 93.76% in MPA, and 87.81% in MIoU. Similarly, on Dataset 2, it exhibits the best overall performance with a PA of 97.78%, an MPA of 96.73%, and an MIoU of 93.31%. Notably, despite the distribution discrepancy between the two datasets, DSFNet maintains a consistent advantage. Although models such as DA-TransUNet and MFAFNet achieve competitive results on certain metrics, our model outperforms them on both datasets, which supports the generalization capability and adaptability of the proposed model in complex cloud and cloud shadow segmentation tasks.
It is worth noting that differences in absolute performance across datasets should be interpreted with caution. They are mainly attributable to differences in task setting, spatial resolution, and scene complexity rather than to any inconsistency in the evaluation protocol. In particular, HRC_WHU is a binary cloud/background segmentation task, whereas GF1_WHU and the Cloud and Cloud Shadow Dataset adopt a three-class setting including cloud shadow, which introduces additional inter-class confusion. Moreover, GF1_WHU has a coarser spatial resolution, making boundary localization more difficult due to mixed pixels and blurred transitions, and it contains more heterogeneous backgrounds and more complex scene variations. The relatively lower absolute metrics on GF1_WHU therefore mainly reflect the higher intrinsic difficulty of this dataset.

3.5.2. Cross-Dataset Transfer Evaluation

To further examine the transferability of the proposed model under domain shift, bidirectional cross-dataset experiments were conducted between the GF1_WHU dataset and the Cloud and Cloud Shadow Dataset (Dataset 2). Specifically, the model was first trained on GF1_WHU and evaluated on the Cloud and Cloud Shadow Dataset, after which the training and testing domains were reversed. The quantitative results are summarized in Table 6.
Under both transfer settings, DSFNet consistently outperforms DeepLabV3+. When trained on GF1_WHU and evaluated on the Cloud and Cloud Shadow Dataset, DSFNet improves PA, MPA, F1, and MIoU over DeepLabV3+ by 2.61%, 3.91%, 4.58%, and 5.62%, respectively. When the training and testing domains are reversed, the corresponding improvements are 3.35%, 4.50%, 5.36%, and 4.53%. These results indicate that, although both models suffer from performance degradation under cross-dataset testing, the proposed method maintains a stable advantage in the presence of distribution shifts.
The superior transfer robustness of DSFNet can be attributed to the complementary design of the proposed framework. The DSRM enhances direction-sensitive structural continuity, which is beneficial for preserving projected cloud-shadow morphology under varying scene layouts. The ASCA improves robustness to radiometric ambiguity and anomalous responses, thereby alleviating confusion between cloud shadows and visually similar dark regions. Meanwhile, the AGMF strengthens cross-level interaction by explicitly exploiting feature discrepancies, which helps preserve boundary details when the source and target distributions are inconsistent. As a result, compared with the baseline model, DSFNet exhibits stronger robustness under cross-dataset transfer and better tolerance to domain shifts.

4. Discussion

4.1. Advantages of the DSFNet

The advantages of DSFNet mainly arise from the consistency between its design and the intrinsic characteristics of cloud and cloud-shadow segmentation in remote sensing imagery. In complex scenes, cloud shadows often exhibit weak responses, local directional extension, and strong confusion with visually similar dark surfaces, while cloud regions usually involve multi-scale structures and irregular boundaries. In addition, overexposed snow/ice regions may introduce abnormally strong responses and further increase the difficulty of distinguishing clouds from bright surface objects. Such characteristics are difficult to handle using conventional frameworks that rely mainly on isotropic context modeling and direct cross-level fusion.
In this framework, DSRM enhances direction-sensitive structural representation and helps preserve the continuity of elongated cloud-shadow regions. ASCA improves robustness to radiometric heterogeneity by suppressing anomalous responses through statistically guided feature modulation. AGMF further strengthens boundary recovery and local consistency by explicitly introducing cross-level discrepancy into adaptive fusion. Through the complementary action of these modules, DSFNet shows particular advantages in challenging scenarios involving low contrast, complex backgrounds, and ambiguous boundaries. At the same time, these improvements are achieved within a relatively conventional DeepLabV3+ based framework, which makes the proposed method both effective and practical for remote sensing image segmentation.

4.2. Outlook of the DSFNet

Despite achieving a competitive MIoU of 76.97%, DSFNet may still exhibit prediction errors and boundary ambiguity in highly challenging scenarios, particularly when cloud or cloud-shadow regions present low contrast, complex background interference, or confusing spectral responses with visually similar surface objects. From a physical perspective, remote sensing observations are inherently affected by atmospheric scattering, illumination variation, surface reflectance heterogeneity, and scale inconsistency, which make thin clouds, shadow transitions, and blurred boundaries difficult to separate in a strictly deterministic manner. From a mathematical perspective, the current framework still mainly relies on discriminative feature learning under pixel-level supervision, and thus may suffer from insufficient inter-class separability and elevated predictive uncertainty in structurally ambiguous or low-confidence regions. As a result, pixels around difficult boundaries are more likely to be oversmoothed toward dominant categories during optimization, leading to local omission and error accumulation.
To alleviate these limitations, future work will focus on developing a dynamic error-aware optimization framework that more closely couples physical prior information with uncertainty-guided representation learning. Specifically, an image-level performance evaluation repository can be established during training to continuously track the segmentation quality of individual samples and identify persistently difficult cases. Based on this repository, the network can adaptively prioritize low-confidence or poorly segmented samples for targeted hard example mining and iterative refinement, thereby improving its sensitivity to ambiguous structures and boundary transitions [41]. In addition, physically meaningful constraints, such as illumination consistency, boundary transition regularity, and cross-scale contextual coherence, may be further introduced to regularize the optimization process and enhance robustness in challenging regions. This direction is also supported by recent studies showing that global–local dependency modeling, cross-level feature interaction, and token-enhanced contextual representation are effective for capturing complex structures and difficult boundaries in remote sensing interpretation tasks [42,43,44,45].

5. Conclusions

Accurate segmentation of clouds and cloud shadows is a critical prerequisite for remote sensing image preprocessing, particularly in complex scenarios with diverse morphological variations and terrestrial interference. To address the inherent limitations of traditional methods in preserving fine edge details and semantic consistency, we proposed DSFNet, a novel framework built upon an enhanced DeepLabV3+ architecture.
DSFNet incorporates several targeted structural innovations to achieve high-precision segmentation. DSRM enhances the capture of elongated shadows and multi-scale clouds, while ASCA effectively suppresses terrestrial noise and bolsters feature robustness. Furthermore, AGMF, guided by a grouped gating mechanism, adaptively recovers critical boundary information during the decoding stage. Experimental results explicitly demonstrate the superiority of our approach: evaluated on the GF1_WHU dataset, DSFNet achieves a Mean Intersection over Union (MIoU) of 76.97%, outperforming the second-best method by a margin of 0.59%.
DSFNet provides practical value in meteorological monitoring and remote sensing analysis by supporting robust segmentation under intricate atmospheric conditions. However, similar to other high-performance multi-scale fusion networks, the computational complexity introduced by the parallel modules remains a limitation, potentially restricting its direct deployment on resource-constrained satellite edge devices. Future research will prioritize lightweight design strategies, such as knowledge distillation or model pruning, to minimize the network's parameter footprint while preserving its high segmentation accuracy. Ultimately, these advancements will facilitate low-latency, real-time remote sensing image processing for broader Earth observation applications.

Author Contributions

Methodology, Y.F. and Z.F.; Software, Y.F. and X.Y.; Validation, Z.F. and M.X.; Formal analysis, Y.F., Z.F., X.Y., M.X. and N.L.; Data curation, Y.F., Z.F. and N.L.; Writing—original draft, Y.F.; Writing—review & editing, Y.F., Z.F. and M.X.; Visualization, X.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ackerman, S.A.; Strabala, K.I.; Menzel, W.P.; Frey, R.A.; Moeller, C.C.; Gumley, L.E. Discriminating clear sky from clouds with MODIS. J. Geophys. Res. Atmos. 1998, 103, 32141–32157.
  2. Zhang, Y.; Guindon, B.; Cihlar, J. An image transform to characterize and compensate for spatial variations in thin cloud contamination of Landsat images. Remote Sens. Environ. 2002, 82, 173–187.
  3. Zhu, Z.; Woodcock, C.E. Object-based cloud and cloud shadow detection in Landsat imagery. Remote Sens. Environ. 2012, 118, 83–94.
  4. Fisher, A. Cloud and cloud-shadow detection in SPOT5 HRG imagery with automated morphological feature extraction. Remote Sens. 2014, 6, 776–800.
  5. Hagolle, O.; Huc, M.; Pascual, D.V.; Dedieu, G. A multi-temporal method for cloud detection, applied to FORMOSAT-2, VENµS, LANDSAT and SENTINEL-2 images. Remote Sens. Environ. 2010, 114, 1747–1755.
  6. Zhu, Z.; Woodcock, C.E. Automated cloud, cloud shadow, and snow detection in multitemporal Landsat data: An algorithm designed specifically for monitoring land cover change. Remote Sens. Environ. 2014, 152, 217–234.
  7. Gómez-Chova, L.; Camps-Valls, G.; Calpe-Maravilla, J.; Guanter, L.; Moreno, J. Cloud-screening algorithm for ENVISAT/MERIS multispectral images. IEEE Trans. Geosci. Remote Sens. 2007, 45, 4105–4118.
  8. Mahajan, S.; Fataniya, B. Cloud detection methodologies: Variants and development—A review. Complex Intell. Syst. 2020, 6, 251–261.
  9. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York City, NY, USA, 2015; pp. 3431–3440.
  10. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241.
  11. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
  12. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York City, NY, USA, 2019; pp. 5693–5703.
  13. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv 2014, arXiv:1412.7062.
  14. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848.
  15. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587.
  16. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV) 2018; Springer International Publishing: Cham, Switzerland, 2018; pp. 801–818.
  17. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York City, NY, USA, 2017; pp. 2881–2890.
  18. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022; pp. 205–218.
  19. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090.
  20. Sun, G.; Pan, Y.; Kong, W.; Xu, Z.; Ma, J.; Racharak, T.; Xin, J. DA-TransUNet: Integrating spatial and channel dual attention with transformer U-net for medical image segmentation. Front. Bioeng. Biotechnol. 2024, 12, 1398237.
  21. Chen, K.; Dai, X.; Xia, M.; Weng, L.; Hu, K.; Lin, H. MSFANet: Multi-scale strip feature attention network for cloud and cloud shadow segmentation. Remote Sens. 2023, 15, 4853.
  22. Gu, G.; Wang, Z.; Weng, L.; Lin, H.; Zhao, Z.; Zhao, L. Attention Guide Axial Sharing Mixed Attention (AGASMA) network for cloud segmentation and cloud shadow segmentation. Remote Sens. 2024, 16, 2435.
  23. Hu, Z.; Weng, L.; Xia, M.; Hu, K.; Lin, H. HyCloudX: A multibranch hybrid segmentation network with band fusion for cloud/shadow. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 6762–6778.
  24. Wang, X.; Fan, Z.; Jiang, Z.; Yan, Y.; Yang, H. EDFF-Unet: An Improved Unet-Based Method for Cloud and Cloud Shadow Segmentation in Remote Sensing Images. Remote Sens. 2025, 17, 1432.
  25. Feng, Y.; Fan, Z.; Yan, Y.; Jiang, Z.; Zhang, S. MFAFNet: Multi-Scale Feature Adaptive Fusion Network Based on DeepLab V3+ for Cloud and Cloud Shadow Segmentation. Remote Sens. 2025, 17, 1229.
  26. Wang, J.; Feng, Z.; Jiang, Y.; Yang, S.; Meng, H. Orientation attention network for semantic segmentation of remote sensing images. Knowl.-Based Syst. 2023, 267, 110415.
  27. Zhang, Y.; Mazurowski, M.A. Convolutional neural networks rarely learn shape for semantic segmentation. Pattern Recognit. 2024, 146, 110018. [Google Scholar] [CrossRef]
  28. Zhu, Z.; Wang, S.; Woodcock, C.E. Improvement and expansion of the Fmask algorithm: Cloud, cloud shadow, and snow detection for Landsats 4–7, 8, and Sentinel 2 images. Remote Sens. Environ. 2015, 159, 269–277. [Google Scholar] [CrossRef]
  29. Zhu, Z.; Qiu, S.; He, B.; Deng, C. Cloud and cloud shadow detection for Landsat images: The fundamental basis for analyzing Landsat time series. In Remote Sensing Time Series Image Processing; Weng, Q., Ed.; CRC Press: Boca Raton, FL, USA, 2018; pp. 3–24. [Google Scholar] [CrossRef]
  30. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 3–19. [Google Scholar]
  31. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York City, NY, USA, 2018; pp. 7132–7141. [Google Scholar]
  32. Hou, Q.; Zhang, L.; Cheng, M.M.; Feng, J. Strip pooling: Rethinking spatial pooling for scene parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York City, NY, USA, 2020; pp. 4003–4012. [Google Scholar]
  33. Xu, H.; Huang, Q.; Liao, H.; Nong, G.; Wei, W. MFFP-net: Building segmentation in remote sensing images via multi-scale feature fusion and foreground perception enhancement. Remote Sens. 2025, 17, 1875. [Google Scholar] [CrossRef]
  34. Duan, S.B.; Li, Z.L.; Min, X.; Wu, P.; Wei, R.; Liu, X.; Gao, C. An uncertainty-based outlier detection method for satellite-derived land surface temperature validation using in situ measurements. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–17. [Google Scholar] [CrossRef]
  35. Paul, M.; Kuchibhotla, A.K. Inference for Median and a Generalization of HulC. arXiv 2024, arXiv:2403.06357. [Google Scholar] [CrossRef]
  36. Sun, X.; Liu, W.; Wang, M.; Zhang, J.; Neri, F.; Wang, Y. Grouped convolution dual-attention network for time series forecasting of water temperature in offshore aquaculture net pen. Expert Syst. Appl. 2025, 278, 127438. [Google Scholar] [CrossRef]
  37. Zhou, X.; Wu, O.; Yang, N. Class and attribute-aware logit adjustment for generalized long-tail learning. In Proceedings of the AAAI Conference on Artificial Intelligence; Association for the Advancement of Artificial Intelligence: Washington, DC, USA, 2025; Volume 39, pp. 22991–22999. [Google Scholar]
  38. Abraham, N.; Khan, N.M. A novel focal tversky loss function with improved attention u-net for lesion segmentation. In 2019 IEEE 16th International Symposium on Biomedical imaging (ISBI 2019); IEEE: New York City, NY, USA, 2019; pp. 683–687. [Google Scholar]
  39. Yu, C.; Gao, C.; Wang, J.; Yu, G.; Shen, C.; Sang, N. Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation. Int. J. Comput. Vis. 2021, 129, 3051–3068. [Google Scholar] [CrossRef]
  40. Qin, X.; Zhang, Z.; Huang, C.; Dehghan, M.; Zaiane, O.R.; Jagersand, M. U2-Net: Going deeper with nested U-structure for salient object detection. Pattern Recognit. 2020, 106, 107404. [Google Scholar] [CrossRef]
  41. Sun, Y.; Liang, D.; Li, S.; Chen, S.; Huang, S.J. Handling noisy annotation for remote sensing semantic segmentation via boundary-aware knowledge distillation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–20. [Google Scholar] [CrossRef]
  42. Liu, S.; Zhu, C.; Yin, H.; Qin, K.; Lin, H.; Huang, J.; Weng, L. GLMamba: A Global–Local Mamba Network for Efficient Remote Sensing Change Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2026, 19, 11344–11360. [Google Scholar] [CrossRef]
  43. Ren, Z.; Weng, L.; Xia, M.; Lin, H. MCINet: Multi-attentive cross-level interaction network for cloud and snow segmentation. J. Appl. Remote Sens. 2026, 20, 021404. [Google Scholar] [CrossRef]
  44. Ni, Y.; Liu, S.; Guo, T.; Xia, M. TiBT-Net: A High-Resolution Remote Sensing Image Change Detection Network Integrating Bi-Temporal Space Enhancement and Token Interaction. Remote Sens. 2026, 18, 805. [Google Scholar] [CrossRef]
  45. Yin, H.; Wang, J.; Liu, S.; Wang, Y.; Liu, Y.; Guo, T.; Xia, M. MISA-Net: Multi-Scale Interaction and Supervised Attention Network for Remote-Sensing Image Change Detection. Remote Sens. 2026, 18, 376. [Google Scholar] [CrossRef]
Figure 1. Directional statistical fusion network structure based on DeepLabV3+.
Figure 2. The structure of the directional scale refinement module.
Figure 3. The structure of the multi-scale feature refinement branch.
Figure 4. The structure of the adaptive statistical context attention module.
Figure 5. The structure of the lightweight global context gating module.
Figure 6. The structure of the adaptive grouped multi-scale fusion module.
Figure 7. Illustration of the grouped convolution operation in AGMF.
Figure 8. Sample images from the GF1_WHU dataset. The first row shows the original images and the second row shows the corresponding labels, where black, gray, and white indicate background, cloud shadow, and cloud, respectively: (a) urban areas; (b) water bodies; (c) grasslands; (d) Gobi; (e) desert; (f) farmland.
Figure 9. Comparison of segmentation results of clouds and cloud shadows by different methods: (a) test image; (b) labels; (c) segmentation result of DSFNet; (d) segmentation result of DeepLabV3+; (e) segmentation result of HRNet; (f) segmentation result of SegFormer; (g) segmentation result of Swin-Unet. (The red boxes highlight the areas where the segmentation results differ between models.)
Figure 10. Comparison of segmentation results of clouds and cloud shadows by different methods in different scenarios: (a) test image; (b) labels; (c) segmentation result of DSFNet; (d) segmentation result of DeepLabV3+; (e) segmentation result of HRNet; (f) segmentation result of SegFormer; (g) segmentation result of Swin-Unet. (The red boxes highlight the areas where the segmentation results differ between models.)
Figure 11. Comparison of segmentation results of different cloud types and cloud shadows by different methods: (a) test image; (b) labels; (c) segmentation result of DSFNet; (d) segmentation result of DeepLabV3+; (e) segmentation result of HRNet; (f) segmentation result of SegFormer; (g) segmentation result of Swin-Unet. (The red boxes highlight the areas where the segmentation results differ between models.)
Table 1. Segmentation performance of the proposed model under different hyperparameter λ values.
| λ | MPA (%) | MIoU (%) |
| --- | --- | --- |
| 0.2 | 84.98 | 76.21 |
| 0.4 | 85.24 | 76.46 |
| 0.6 | 85.42 | 76.97 |
| 0.8 | 85.13 | 76.48 |
Note: Bold indicates the best result.
Table 2. Ablation results for different module combinations.
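| DeepLabV3+ | DSRM | ASCA | AGMF | FJFL | MPA (%) | MIoU (%) | F1 (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
|  |  |  |  |  | 84.26 | 75.46 | 85.69 |
|  |  |  |  |  | 84.50 | 76.02 | 85.75 |
|  |  |  |  |  | 84.72 | 76.24 | 86.47 |
|  |  |  |  |  | 85.07 | 76.56 | 86.09 |
|  |  |  |  |  | 85.42 | 76.97 | 86.72 |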
Note: Bold indicates the best result, and the underline indicates the second-best result.
Table 3. Comparison of segmentation results on the GF1_WHU dataset.
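| Networks | PA (%) | MPA (%) | MIoU (%) | FWIoU (%) | FLOPs (G) | Time (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| BiSeNetV2 [39] | 86.15 | 80.20 | 67.85 | 75.72 | 10.01 | 2.13 |
| Unet [10] | 89.71 | 83.11 | 74.62 | 80.15 | 30.77 | 9.04 |
| U2net [40] | 89.30 | 83.11 | 72.72 | 78.93 | 25.92 | 8.36 |
| DeepLabV3+ [16] | 89.46 | 84.26 | 75.46 | 81.48 | 20.69 | 35 |
| FCN [9] | 89.12 | 82.13 | 73.51 | 80.51 | 37.13 | 11.43 |
| SegNet [11] | 88.22 | 81.36 | 72.11 | 79.09 | 28.70 | 7.23 |
| PSPNet [17] | 89.12 | 82.90 | 74.21 | 79.88 | 28.74 | 10.36 |
| HRNet [12] | 89.30 | 83.95 | 74.69 | 80.99 | 14.80 | 14.05 |
| SegFormer [19] | 88.97 | 84.01 | 74.28 | 80.39 | 19.98 | 8.39 |
| Swin-Unet [18] | 89.79 | 85.21 | 75.63 | 81.83 | 32.87 | 5.78 |
| DATransunet [20] | 90.06 | 82.90 | 75.11 | 81.43 | 44.56 | 71 |
| EDFF-Unet [24] | 89.92 | 84.43 | 76.13 | 82.71 | 28.55 | 7.71 |
| MFAFNet [25] | 90.31 | 84.80 | 76.38 | 82.95 | 32.60 | 8.94 |
| DSFNet (ours) | 90.62 | 85.42 | 76.97 | 83.41 | 29.60 | 8.1 |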
Note: Bold indicates the best result, and the underline indicates the second-best result.
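The accuracy columns of Table 3 (PA, MPA, MIoU, FWIoU) are commonly derived from a single class confusion matrix. The sketch below assumes those standard definitions, which may differ in detail from the evaluation code used for the paper; the function name `segmentation_metrics` and the 3-class matrix are hypothetical and serve only as an illustration.

```python
from typing import Dict, List


def segmentation_metrics(cm: List[List[int]]) -> Dict[str, float]:
    """Compute PA, MPA, MIoU and FWIoU from a confusion matrix.

    cm[i][j] is the number of pixels of true class i predicted as class j.
    Assumes every class has at least one ground-truth pixel.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    tp = [cm[i][i] for i in range(n)]                            # correctly labelled pixels
    gt = [sum(cm[i]) for i in range(n)]                          # pixels per true class
    pred = [sum(cm[i][j] for i in range(n)) for j in range(n)]   # pixels per predicted class

    class_acc = [tp[i] / gt[i] for i in range(n)]
    class_iou = [tp[i] / (gt[i] + pred[i] - tp[i]) for i in range(n)]

    return {
        "PA": sum(tp) / total,                                   # overall pixel accuracy
        "MPA": sum(class_acc) / n,                               # mean per-class accuracy
        "MIoU": sum(class_iou) / n,                              # mean intersection over union
        "FWIoU": sum(gt[i] / total * class_iou[i] for i in range(n)),
    }


# Hypothetical confusion matrix (background, cloud shadow, cloud); not data from the paper.
cm = [
    [50, 2, 3],
    [4, 20, 1],
    [1, 2, 17],
]
metrics = segmentation_metrics(cm)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

FWIoU weights each class IoU by its share of ground-truth pixels, so it is dominated by large classes such as background, whereas MIoU treats the rarer cloud shadow class equally.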
Table 4. Comparison of segmentation results of different models in cloud and cloud shadow.
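| Networks | Cloud P (%) | Cloud R (%) | Cloud F1 (%) | Cloud IoU (%) | Cloud Shadow P (%) | Cloud Shadow R (%) | Cloud Shadow F1 (%) | Cloud Shadow IoU (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BiSeNetV2 | 87.13 | 89.93 | 88.51 | 79.38 | 77.26 | 52.13 | 62.25 | 45.20 |
| Unet | 89.65 | 92.18 | 90.92 | 83.31 | 78.62 | 64.43 | 70.81 | 54.82 |
| U2net | 89.10 | 91.20 | 90.14 | 82.05 | 77.80 | 62.50 | 69.32 | 53.04 |
| DeepLabV3+ | 90.95 | 92.16 | 91.55 | 84.42 | 81.45 | 64.53 | 72.01 | 56.26 |
| FCN | 90.62 | 92.44 | 91.52 | 84.37 | 79.04 | 61.82 | 69.31 | 53.11 |
| SegNet | 88.89 | 92.54 | 90.68 | 82.95 | 77.17 | 61.17 | 68.24 | 51.80 |
| PSPNet | 90.01 | 93.12 | 91.53 | 84.40 | 78.12 | 63.52 | 70.01 | 53.93 |
| HRNet | 92.55 | 90.44 | 91.48 | 84.30 | 75.56 | 69.12 | 72.19 | 56.49 |
| SegFormer | 91.06 | 91.38 | 91.21 | 83.86 | 75.62 | 70.34 | 72.88 | 57.34 |
| Swin-Unet | 91.58 | 93.18 | 92.37 | 85.83 | 74.29 | 71.52 | 72.88 | 57.33 |
| DATransunet | 92.11 | 92.57 | 92.34 | 85.77 | 81.81 | 62.79 | 71.05 | 55.10 |
| EDFF-Unet | 91.82 | 93.23 | 92.52 | 86.08 | 80.50 | 66.04 | 72.53 | 56.93 |
| MFAFNet | 92.08 | 93.52 | 92.79 | 86.56 | 81.07 | 67.21 | 73.34 | 58.09 |
| DSFNet (ours) | 92.36 | 94.01 | 93.18 | 87.23 | 82.15 | 70.84 | 75.99 | 61.39 |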
Note: Bold indicates the best result, and the underline indicates the second-best result.
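The per-class columns of Table 4 are linked by two standard identities: F1 = 2PR / (P + R) and, when F1 and IoU are computed from the same true-positive, false-positive, and false-negative counts, IoU = F1 / (2 − F1). The sketch below applies them to the BiSeNetV2 cloud row (P = 87.13%, R = 89.93%). It assumes the table was produced from pooled pixel counts; the helper names are illustrative, and rounding or per-image averaging can introduce small deviations in other rows.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    return 2.0 * precision * recall / (precision + recall)


def iou_from_f1(f1: float) -> float:
    """IoU implied by F1 when both come from the same TP/FP/FN counts."""
    return f1 / (2.0 - f1)


# BiSeNetV2, cloud class, values taken from Table 4 and converted to fractions.
precision, recall = 0.8713, 0.8993
f1 = f1_score(precision, recall)
iou = iou_from_f1(f1)
print(f"F1 = {100 * f1:.2f}%, IoU = {100 * iou:.2f}%")
```

For this row the computed values agree with the reported F1 of 88.51% and IoU of 79.38% to within rounding.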
Table 5. Comparison of segmentation results across different models on additional datasets.
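| Networks | HRC_WHU PA (%) | HRC_WHU MPA (%) | HRC_WHU MIoU (%) | Cloud and Cloud Shadow Dataset PA (%) | Cloud and Cloud Shadow Dataset MPA (%) | Cloud and Cloud Shadow Dataset MIoU (%) |
| --- | --- | --- | --- | --- | --- | --- |
| BiSeNetV2 | 92.71 | 91.55 | 85.62 | 94.42 | 92.36 | 86.91 |
| Unet | 93.49 | 92.62 | 86.53 | 96.83 | 95.61 | 91.83 |
| U2net | 93.28 | 92.41 | 86.31 | 96.54 | 95.23 | 91.46 |
| DeepLabV3+ | 93.17 | 91.98 | 86.29 | 97.05 | 95.64 | 92.03 |
| FCN | 92.85 | 91.67 | 85.93 | 94.31 | 92.17 | 87.12 |
| SegNet | 92.94 | 91.88 | 86.01 | 94.61 | 92.52 | 87.39 |
| PSPNet | 93.20 | 92.07 | 86.42 | 95.08 | 93.41 | 89.57 |
| HRNet | 93.08 | 91.91 | 86.15 | 96.52 | 94.88 | 91.08 |
| SegFormer | 93.71 | 93.12 | 86.74 | 97.14 | 96.03 | 92.28 |
| Swin-Unet | 93.79 | 93.21 | 87.05 | 97.26 | 96.18 | 92.61 |
| DATransunet | 94.43 | 93.38 | 87.12 | 97.39 | 96.31 | 92.74 |
| EDFF-Unet | 94.18 | 93.44 | 87.28 | 97.57 | 96.47 | 92.89 |
| MFAFNet | 94.28 | 93.51 | 87.47 | 97.48 | 96.49 | 92.97 |
| DSFNet (ours) | 94.57 | 93.76 | 87.81 | 97.78 | 96.73 | 93.31 |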
Note: Bold indicates the best result, and the underline indicates the second-best result.
Table 6. Cross-dataset transfer performance between GF1_WHU and the Cloud and Cloud Shadow Dataset.
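| Train | Test | Method | PA (%) | MPA (%) | F1 (%) | MIoU (%) |
| --- | --- | --- | --- | --- | --- | --- |
| GF1_WHU | Dataset 2 | DeepLabV3+ | 86.99 | 78.06 | 79.05 | 67.39 |
| GF1_WHU | Dataset 2 | DSFNet | 89.60 | 81.97 | 83.63 | 73.01 |
| Dataset 2 | GF1_WHU | DeepLabV3+ | 79.11 | 69.34 | 70.56 | 57.23 |
| Dataset 2 | GF1_WHU | DSFNet | 82.46 | 73.84 | 75.92 | 61.76 |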
Note: Bold indicates the better result between DeepLabV3+ and DSFNet.
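The transfer gains implied by Table 6 can be read off with simple arithmetic. The short sketch below assumes the last column of the table reports MIoU and copies its four values; the dictionary layout is only an illustrative choice.

```python
# (train set, test set) -> MIoU in percent for each method, copied from Table 6.
miou = {
    ("GF1_WHU", "Dataset 2"): {"DeepLabV3+": 67.39, "DSFNet": 73.01},
    ("Dataset 2", "GF1_WHU"): {"DeepLabV3+": 57.23, "DSFNet": 61.76},
}

# Absolute gain of DSFNet over DeepLabV3+ in percentage points, per transfer direction.
gains = {
    pair: round(scores["DSFNet"] - scores["DeepLabV3+"], 2)
    for pair, scores in miou.items()
}

for (train, test), gain in gains.items():
    print(f"{train} -> {test}: +{gain:.2f} points MIoU")
```

Both differences are absolute percentage points, not relative improvements.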

MDPI and ACS Style

Fang, Y.; Fan, Z.; Xia, M.; Li, N.; Yang, X. DSFNet: A Directional Statistical Fusion Network for Cloud and Cloud Shadow Segmentation. Remote Sens. 2026, 18, 1432. https://doi.org/10.3390/rs18091432
