Article

HDAMNet: Hierarchical Dilated Adaptive Mamba Network for Accurate Cloud Detection in Satellite Imagery

1 College of Information Science and Technology, Nanjing Forestry University, Nanjing 210037, China
2 Jiangsu Provincial Geological Surveying and Mapping Brigade, Nanjing 211102, China
3 College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(17), 2992; https://doi.org/10.3390/rs17172992
Submission received: 20 June 2025 / Revised: 17 August 2025 / Accepted: 27 August 2025 / Published: 28 August 2025

Abstract

Cloud detection is one of the primary challenges in preprocessing high-resolution remote sensing imagery, and its accuracy is severely constrained by the multi-scale and complex morphological characteristics of clouds. Many approaches have been proposed to detect clouds. However, these methods still face significant challenges, particularly in handling the complexities of multi-scale cloud clusters and reliably distinguishing clouds from snow, ice, and complex cloud shadows. To overcome these challenges, this paper proposes a novel cloud detection network based on the state space model (SSM), termed the Hierarchical Dilated Adaptive Mamba Network (HDAMNet). The network adopts an encoder–decoder architecture and introduces a Hierarchical Dilated Cross Scan (HDCS) mechanism in the encoder, significantly expanding the receptive field and improving the capture of fine-grained cloud boundaries. The multi-resolution adaptive feature extraction (MRAFE) module integrates multi-scale semantic information to reduce channel confusion and emphasize essential features. The Layer-wise Adaptive Attention (LAA) mechanism adaptively recalibrates features at skip connections, balancing fine-grained boundaries with global semantic information. On three public cloud detection datasets, HDAMNet achieves state-of-the-art performance across key evaluation metrics. Particularly noteworthy is its superior performance in identifying small-scale cloud clusters, delineating complex cloud–shadow boundaries, and mitigating interference from snow and ice.

1. Introduction

Currently, high-resolution remote sensing images produced by optical sensors onboard satellites are extensively used in various applications, including change monitoring [1,2], agricultural development [3], and disaster management [4]. However, the impact of cloud coverage on high-resolution remote sensing imagery analysis is significant, with global average annual cloud coverage being approximately 66% [5]. Thus, accurate cloud detection is an essential preprocessing step in remote sensing image analysis, directly influencing subsequent application accuracy.
With advancements in remote sensing technology, numerous cloud detection algorithms have been proposed, primarily categorized into traditional methods, machine learning-based methods, and deep learning-based methods. Traditional approaches mainly include threshold-based and spatial-feature-based methods. The threshold method, a classical cloud detection technique, distinguishes potential cloud pixels from non-cloud pixels using predefined spectral thresholds, allowing for the segmentation of cloud masks. Notable examples include the Fmask algorithm [6] and the automatic cloud coverage assessment (ACCA) method [7]. Additionally, spatial feature extraction methods segment images based on spectral and textural characteristics [8], reducing reliance on preset thresholds and prior knowledge, thereby improving detection accuracy and efficiency. However, these methods often lack generalization and are sensitive to thresholds or handcrafted rules. To overcome the rigidity of rule-based systems, machine learning methods were introduced. Unlike traditional algorithms that rely on fixed thresholds, machine learning models can automatically learn complex and non-linear decision boundaries from data. This data-driven approach allows them to achieve higher accuracy and better generalization across diverse imaging conditions and geographical locations. For instance, methods such as support vector machine (SVM) classifiers and multi-feature embedded learning have enhanced detection accuracy for various cloud types [9]. Moreover, Zhang et al. [10] proposed a multi-temporal image combination approach, effectively preserving cloud details and boundaries to generate accurate cloud masks. Similarly, its performance is constrained by the dependence on handcrafted features, a typical limitation of conventional machine learning approaches. With the rapid advancement of deep learning technologies, numerous CNN-based methods have emerged for cloud detection. Mohajerani et al. [11] developed Cloud-Net, a fully convolutional network built upon the U-Net architecture [12]. Furthermore, inspired by Transformer architectures’ success in natural language processing (NLP), Dosovitskiy et al. [13] introduced the Vision Transformer (ViT) to computer vision tasks by dividing images into multiple patches. Subsequently, the Swin Transformer [14] reduced computational complexity via shifted windowing mechanisms, achieving notable success in semantic segmentation tasks. Recently, Transformer-based models [15,16,17] have significantly advanced cloud detection. Methods like Cloudformer successfully leveraged self-attention to model global dependencies, surpassing the limited receptive fields of traditional CNNs. Furthermore, models such as CloudViT have specifically emphasized lightweight designs, enhancing their practicality for resource-constrained applications. However, a fundamental limitation persists in their reliance on dividing images into fixed-size patches, which can struggle with the highly irregular shapes and diverse scales inherent to clouds. This approach can lead to the loss of fine-grained boundary details. Moreover, lightweight architectures, while efficient, often achieve this by simplifying the model, which may reduce their capacity to discriminate clouds from spectrally similar surfaces like snow and ice in complex scenes. 
Therefore, a gap remains in developing an architecture that can achieve state-of-the-art accuracy, particularly in capturing multi-scale features and delineating cloud boundaries with high precision.
Recently, the state space model (SSM) has garnered attention due to its global receptive fields and linear computational complexity. The Mamba architecture, introduced by Gu et al. [18], enhances the structured state space sequence model through selective mechanisms and hardware optimization, showing remarkable performance in dense data processing. Numerous studies [19,20,21,22] have successfully applied Mamba to image processing tasks. For instance, the visual state space model (VSSM) [23] integrates a cross scan module (CSM) to traverse spatial domains and convert non-causal visual images into ordered patch sequences, further expanding Mamba’s applicability in computer vision. In remote sensing applications, Samba [24] and RS3Mamba [25] demonstrated Mamba’s unique strengths in global modeling and computational efficiency for high-resolution imagery. Currently, in the field of cloud detection in remote sensing imagery, despite extensive research and the application of various techniques, studies employing the Mamba architecture remain relatively limited.
To bridge this research gap, we propose the Hierarchical Dilated Adaptive Mamba Network (HDAMNet), contributing to the early exploration of applying the Mamba architecture to cloud detection in remote sensing imagery. Our primary contributions are the following:
  • A novel state-space-model-based deep neural network, termed HDAMNet, is proposed in this paper for cloud segmentation. The network innovatively introduces the visual state space model (VSSM) into the cloud detection domain by replacing the convolutional downsampling modules in traditional U-shaped network encoders with HDAMamba Blocks, effectively establishing long-range spatial dependency modeling.
  • We design the Hierarchical Dilated Cross Scan (HDCS) mechanism to address the multi-scale features of clouds. It employs a multi-scale shifted windowing scheme and an adaptive dilation strategy to expand the receptive field for hierarchical feature perception. The resulting features are then dynamically fused by our multi-resolution adaptive feature extraction (MRAFE) module to enhance the final representation.
  • To resolve boundary ambiguities, we introduce a Layer-wise Adaptive Attention (LAA) mechanism at the skip connections. This mechanism adaptively recalibrates feature channels to promote an effective fusion of shallow spatial details with deep semantic context, leading to more precise segmentation.
Experimental results on the HRC_WHU, CloudS_M24, and WHU Cloud datasets demonstrate that HDAMNet achieves state-of-the-art performance. Our code is available at https://github.com/ycwang31/HDAMNet (accessed on 10 August 2025).

2. Related Work

With the rapid advancement of deep learning, many methods [26,27] based on deep learning technology have been successfully applied to cloud detection in remote sensing imagery. These methods can be broadly divided into two categories: convolutional neural network (CNN)-based methods and Transformer-based methods. More recently, approaches based on the Mamba architecture have also been introduced into remote sensing analysis.

2.1. CNN-Based Methods

The U-Net model proposed by Ronneberger et al. [12] initially achieved substantial success in medical image segmentation, prompting extensive adoption of its U-shaped structure across numerous segmentation applications. Consequently, various U-Net-based algorithms have emerged for cloud detection tasks [11,28,29]. Researchers have increasingly recognized that specialized model designs addressing the unique characteristics and challenges of cloud detection are essential for improving accuracy. Yang et al. [30] proposed CDNet, which enhances cloud detection accuracy in low-resolution images by integrating boundary refinement with a feature pyramid structure. Li et al. [31] developed MSCFF, emphasizing multi-scale feature fusion for high-resolution remote sensing images. Hu et al. [32] introduced CDUNet, effectively capturing fine-grained details by exploiting the spatial hierarchy within cloud structures. Lu et al. [33] proposed the multi-scale strip pooling feature aggregation network, incorporating an improved pyramid pooling module to extract multi-scale features effectively.

2.2. Transformer-Based Methods

Transformers [34] have achieved remarkable results in natural language processing (NLP) and, subsequently, have expanded into image segmentation, exemplified by models such as SETR [35], SegFormer [36], UNetFormer [37], Swin-Unet [38], CMTFNet [39], and TransUNet [40]. The Swin Transformer [14], a hierarchical vision Transformer, reduces computational complexity by confining self-attention calculations within local windows, employing a shifted windowing scheme, and using a window shifting strategy to facilitate cross-window information exchange, thus effectively capturing multi-scale features. However, directly applying these models to cloud detection in remote sensing imagery poses challenges, such as differentiating thin clouds from underlying surfaces, identifying small cloud clusters, and avoiding the misclassification of bright surfaces or icebergs as clouds. Consequently, adapting or enhancing models originally designed for natural images to meet the specific requirements of remote sensing cloud detection is necessary. For instance, Zhang et al. [17] developed Cloudformer, integrating CNN and Transformer architectures via a dual-path decoder for the accurate differentiation of visually similar objects and employing a pyramid-structured encoder to effectively detect thin and sparse cloud regions. Zhang et al. subsequently proposed CloudformerV2 [41] and CloudformerV3 [42], further improving computational efficiency and detection accuracy. More recently, Zhang et al. [16] introduced CloudViT, a lightweight ViT-based cloud detection method addressing the need for robust feature extraction within constrained computational resources, facilitating practical deployment.

2.3. Mamba-Based Methods

Recently, Mamba architecture [18] based on the state space model (SSM) has demonstrated notable potential for modeling long-range dependencies in vision tasks. Several studies have validated Mamba’s efficacy in vision applications [43]. Vim [44] introduced a generic visual backbone employing bidirectional Mamba Blocks for image classification and semantic segmentation. VMamba [23] advanced this approach with a hierarchical backbone and cross-scan module (CSM), addressing orientation sensitivity issues arising from the disparity between one-dimensional sequences and two-dimensional images. Particularly noteworthy is Mamba’s extensive application in medical image segmentation. Models including U-Mamba [45], VM-UNet [19], SegMamba [46], and Mamba-UNet [20] have achieved state-of-the-art performance on their respective medical datasets. Furthermore, P-Mamba [47] effectively reduces background noise while preserving target boundaries by integrating probabilistic diffusion with Mamba architecture. The Swin-UMamba study [48] further emphasized the importance of ImageNet-based pre-training for Mamba networks in medical image segmentation tasks. Additionally, Vivim [49] applied the Mamba architecture successfully to medical video object segmentation. In remote sensing segmentation, Zhu et al. [24] introduced Samba, marking the initial exploration of Mamba in this domain. Subsequently, Ma et al. [25] developed RS3Mamba, a novel two-branch network utilizing Mamba architecture for enhanced segmentation performance.

3. Methodology

3.1. Preliminaries

State Space Model. The SSM is a powerful sequential model that maps an input sequence $x(t)$ to an output $y(t)$ through a hidden state $h(t)$. Specifically, the SSM can be formulated as follows:
h'(t) = A h(t) + B x(t)
y(t) = C h(t) + D x(t)
here, $x(t)$ denotes the input at the current time, $h(t)$ is the current hidden state, $h'(t)$ represents the updated (derivative of the) hidden state, and $y(t)$ indicates the prediction for the current state.
Discretization. However, conventional SSMs are unsuitable for discrete data inputs. To address this issue, the structured state space sequence model (S4) [50] employs the zero-order hold (ZOH) technique for discretizing the continuous SSM, represented by the following:
h_k = \bar{A} h_{k-1} + \bar{B} x_k
y_k = C h_k + D x_k
\bar{A} = e^{\Delta A}
\bar{B} = (\Delta A)^{-1} \left( e^{\Delta A} - I \right) \cdot \Delta B
here, $\bar{A}$ and $\bar{B}$ denote the discretized forms of matrices $A$ and $B$, respectively; $B, C \in \mathbb{R}^{D \times N}$ and $\Delta \in \mathbb{R}^{D}$. $h_{k-1}$ is the previous hidden state, and $h_k$ represents the current hidden state.
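To make the discretization concrete, the following is a minimal PyTorch sketch of the ZOH formulas and the resulting discrete recurrence. It assumes a single dense state matrix and a scalar step size $\Delta$, whereas practical S4/Mamba implementations use per-channel parameters and a parallel selective scan; the function names (discretize_zoh, ssm_scan) are ours, not the authors' code.

```python
import torch

def discretize_zoh(A, B, delta):
    """ZOH discretization: A_bar = exp(dA), B_bar = (dA)^-1 (exp(dA) - I) dB."""
    dA = delta * A                                   # (N, N)
    A_bar = torch.matrix_exp(dA)
    I = torch.eye(A.shape[0], dtype=A.dtype)
    B_bar = torch.linalg.solve(dA, A_bar - I) @ (delta * B)
    return A_bar, B_bar

def ssm_scan(x, A_bar, B_bar, C, D):
    """Sequentially apply h_k = A_bar h_{k-1} + B_bar x_k, y_k = C h_k + D x_k."""
    h = torch.zeros(A_bar.shape[0], dtype=x.dtype)
    ys = []
    for x_k in x:                                    # x: (seq_len, input_dim)
        h = A_bar @ h + B_bar @ x_k
        ys.append(C @ h + D @ x_k)
    return torch.stack(ys)
```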
Two-Dimensional Selective Scan. Models designed for one-dimensional (1D) data are ineffective at capturing two-dimensional (2D) spatial information, making the direct application of Mamba architecture to vision tasks challenging. The 2D Selective Scan (SS2D) [23] effectively resolves this limitation. The SS2D organizes image segments from four distinct orientations, generating four independent sequences. This four-directional scanning strategy ensures that each element in the feature map integrates spatial information from all directions, thus creating a global receptive field without increasing computational complexity. Subsequently, each feature sequence is processed using the selective scan state space sequence (S6) model, after which the 2D feature map is reconstructed through merging operations:
z_i = \mathrm{expand}(z, i)
\bar{z}_i = \mathrm{S6}(z_i)
\bar{z} = \mathrm{merge}(\bar{z}_1, \bar{z}_2, \bar{z}_3, \bar{z}_4)
here, $i \in \{1, 2, 3, 4\}$ represents one of the four scanning directions, and $\mathrm{expand}(\cdot)$ and $\mathrm{merge}(\cdot)$ denote the sweep expansion and merging operations defined in [23], respectively. The S6 block enables each element within the one-dimensional sequence to interact with previously scanned samples through a compressed hidden state.
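The four-directional expand/merge logic can be illustrated with the short sketch below: the feature map is flattened along four scan orders, each 1-D sequence is processed independently (here by a placeholder s6), and the branches are folded back to 2-D and summed. This is an illustrative sketch following the equations above, not the VMamba implementation.

```python
import torch

def expand(z, direction):
    # z: (B, C, H, W) -> (B, C, H*W) in one of four scan orders
    seq = z.flatten(2) if direction in (0, 1) else z.transpose(2, 3).flatten(2)
    return seq.flip(-1) if direction in (1, 3) else seq

def merge(seqs, H, W):
    # invert each scan order, then sum the four branches back into 2-D
    out = 0
    for d, s in enumerate(seqs):
        s = s.flip(-1) if d in (1, 3) else s
        m = s.unflatten(2, (H, W)) if d in (0, 1) else s.unflatten(2, (W, H)).transpose(2, 3)
        out = out + m
    return out

def ss2d(z, s6):
    B, C, H, W = z.shape
    seqs = [s6(expand(z, d)) for d in range(4)]      # one S6 pass per direction
    return merge(seqs, H, W)

# usage with an identity "scan" just round-trips the feature map (times four)
y = ss2d(torch.randn(1, 8, 16, 16), s6=lambda seq: seq)
```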

3.2. HDAMNet

HDAMNet adopts a symmetric encoder–decoder architecture while innovatively redesigning the feature extraction and propagation mechanisms by incorporating SSM, as illustrated in Figure 1. Specifically, the encoder employs HDAMamba Blocks to replace traditional convolutional downsampling modules. Each HDAMamba Block integrates a HDCS mechanism, enabling effective multi-scale feature extraction from the input image down to the 1 / 8 resolution feature map. During the feature processing stage, the MRAFE module dynamically fuses features across multiple scales. Additionally, the LAA mechanism is integrated within the skip connections, enhancing feature propagation. Finally, the decoder progressively restores the feature map resolution using double convolutional upsampling, outputting the cloud detection probability map via a sigmoid activation function. This encoder–decoder design significantly enhances the network’s capability to detect and represent clouds across various scales.
To clarify the interaction flow among the core components, the architecture should be viewed hierarchically. As illustrated in the overall framework (Figure 1), the LAA mechanism is positioned at each skip connection to mediate feature fusion between the encoder and decoder. The encoder itself is built from a series of HDAMamba Blocks. The internal workflow of these blocks is detailed in Figure 2, which shows that input features are first processed by the HDCS mechanism to capture multi-scale information, and its output is then fed into the MRAFE module for adaptive feature fusion. This hierarchical design allows for specialized processing at different levels of the network.

3.3. HDAMamba Block

The HDAMamba Block, illustrated in Figure 2, is an efficient feature extraction module designed for cloud detection. It integrates the HDCS and MRAFE mechanisms to capture multi-scale semantic features while integrating both local and global contextual information.

3.3.1. HDCS

We propose the HDCS to comprehensively capture multi-scale cloud volumetric features and boundary details. The HDCS utilizes adaptive window sampling and dilated scanning strategies, efficiently perceiving cloud morphology across scales. Specifically, HDCS encompasses two key stages: multi-scale window partition (MSWP) and dilated cross scan (DCS). Figure 3 illustrates the overall framework of the HDCS.
HDCS begins by extracting multiple local regions from the input feature map according to a set of predefined window scales. Each scale is designed to cover critical positions such as the center and the four corners, allowing the network to acquire region-specific features at varying resolution levels. This strategy enhances the model’s capacity to capture spatial diversity and structural information across different scales. Given an input feature map $x \in \mathbb{R}^{B \times C \times H \times W}$, where $B$, $C$, $H$, and $W$ denote the batch size, number of channels, height, and width, respectively, the HDCS defines multiple scales for shifted windows, $r \in \{0.25, 0.5, 1.0\}$, and computes the corresponding sub-window dimensions $H_r$ and $W_r$ accordingly.
H_r = H \times r
W_r = W \times r
Five sub-regions $x_0, x_1, x_2, x_3, x_4$ are subsequently extracted from the feature map $x$ in a shifted window. These sub-regions are then processed by the DCS module to yield a corresponding set of features $F_{i,0}, F_{i,1}, F_{i,2}, F_{i,3}, F_{i,4}$. The spatial locations of these regions are determined by calculating the corresponding starting coordinates $(s_h, s_w)$. For each predefined scale ratio, this procedure yields five localized feature patches, which are subsequently employed in the DCS module to capture scale-specific variations in cloud morphology across multiple spatial resolutions.
s_h = \frac{H - H_r}{2}
s_w = \frac{W - W_r}{2}
F_i = \mathrm{MSWP}\{F_{i,0}, F_{i,1}, F_{i,2}, F_{i,3}, F_{i,4}\} \in \mathbb{R}^{5B \times C \times H_r \times W_r}
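A small sketch of this window-partition step is given below, assuming non-padded crops at the four corners and the center; the function name mswp and the stacking along the batch axis simply mirror the $5B \times C \times H_r \times W_r$ shape above and are not the authors' implementation.

```python
import torch

def mswp(x, r):
    """Extract center + four corner windows at scale r and stack them along the batch axis."""
    B, C, H, W = x.shape
    Hr, Wr = int(H * r), int(W * r)
    sh, sw = (H - Hr) // 2, (W - Wr) // 2                      # centered start offsets
    starts = [(sh, sw), (0, 0), (0, W - Wr), (H - Hr, 0), (H - Hr, W - Wr)]
    windows = [x[:, :, h0:h0 + Hr, w0:w0 + Wr] for h0, w0 in starts]
    return torch.cat(windows, dim=0)                           # (5B, C, H_r, W_r)

# e.g. for a 512x512 feature map and r = 0.5, each of the five windows is 256x256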
Motivated by the effectiveness of atrous convolutions [51] in expanding the receptive field while preserving resolution, we propose the DCS method, designed to reduce computational complexity in the SS2D and standardize feature dimensions. The DCS method achieves efficiency by employing sparse sampling with a dilation rate $d$ across the feature map $x \in \mathbb{R}^{B \times C \times H \times W}$, thus dividing it into spatial feature subsets $\{W_i\}_{i=1}^{4}$:
H_{\min} = H \times \min(\{0.25, 0.5, 1.0\})
d = \max\!\left(1, \frac{H_r}{H_{\min}}\right)
W_i \xleftarrow{\mathrm{DCS}} X[:, m::d, n::d]
\{\tilde{W}_i\}_{i=1}^{4} = \mathrm{HDCS}\left(\{W_i\}_{i=1}^{4}\right)
Y[:, m::d, n::d] \leftarrow \mathrm{merge}(\tilde{W}_i)
specifically, the sampling offsets are computed as $(m, n) = \left( \frac{1}{2} + \frac{1}{2} \sin \frac{\pi}{2}(i-2),\ \frac{1}{2} + \frac{1}{2} \cos \frac{\pi}{2}(i-2) \right)$, and the subsets are defined accordingly. The slicing operation $[:, m::d, n::d]$ indicates channel-wise extraction starting from position $m$ in height and $n$ in width, selecting elements at intervals of $d$.
In this scanning approach, the full scanning method is decomposed into two components: local sparse scanning and global sparse recombination. By implementing sparse sampling in the local receptive fields, DCS selectively examines a limited number of patches within the feature map, significantly reducing computational complexity. Specifically, with a dilation rate $d$, the spatial dimensions of the feature map reduce from $(C, H, W)$ to $(C, H/d, W/d)$, effectively decreasing the number of tokens processed in each scanning and merging operation from $N$ to $N/d^2$, thereby markedly enhancing feature extraction efficiency. Concurrently, DCS recombines processed patches globally to reconstruct spatial structure, balancing local detail capture with global contextual information, thus enriching the multi-scale representational capability of the features. Consequently, the efficiency of the HDCS method is significantly improved without sacrificing global integration, thereby making feature extraction across spatial dimensions more comprehensive. This design robustly supports the accurate capture of multi-scale cloud morphology while notably reducing computational resource requirements. Subsequently, these processed sequences are integrated and analyzed using the Mamba architecture to effectively model long-range dependencies.
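The dilated sampling and recombination can be sketched as follows, using $d = 2$ for illustration and a placeholder scan in place of the SS2D branch (batch and channel dimensions are kept explicit here); the write-back to the strided slice realizes the global recombination described above.

```python
import torch

def dcs(x, scan, d=2):
    """Split x into d*d sparse subsets, scan each, and recombine into the original layout."""
    y = torch.empty_like(x)
    for m in range(d):
        for n in range(d):
            # each subset holds roughly N / d^2 tokens, scanned independently
            y[:, :, m::d, n::d] = scan(x[:, :, m::d, n::d])
    return y                                               # globally recombined output

# usage: an identity "scan" simply round-trips the feature map
y = dcs(torch.randn(1, 8, 32, 32), scan=lambda t: t, d=2)
```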
By combining a multi-scale shifted windowing scheme with dilated scanning, this method expands the receptive field and finely discriminates multi-scale cloud structures without substantially increasing computational complexity. Compared to traditional fixed-stride convolutional or deformable convolution methods, HDCS achieves differential sampling across various resolutions and spatial locations, capturing long-range spatial dependencies through state space modeling. Thus, it enhances adaptability in detecting both sparsely distributed large clouds and densely clustered local clouds.

3.3.2. MRAFE

After completing the stitching and preliminary fusion of output features from the MSWP and DCS, this study introduces the multi-resolution adaptive feature extraction (MRAFE) module to achieve refined and robust feature fusion across multiple scales (see Figure 4).
To integrate features across different scales and highlight key semantic information, the proposed MRAFE module combines scale-specific channel attention with semantic attention after fusion, aiming to fully exploit the diversity and complementarity of multi-scale information. Specifically, the classical Squeeze-and-Excitation (SE) channel attention mechanism is applied independently to each scale of feature maps. We selected the SE mechanism due to its proven effectiveness and computational efficiency. It excels at dynamically recalibrating channel-wise feature responses by explicitly modeling interdependencies between channels, which is highly suitable for identifying the most salient feature maps at each scale before their fusion. Channel attention weights are computed through global adaptive pooling followed by a dimensionality reduction and expansion process using convolutional operations. These weights are then multiplied, channel-wise, with the original features. This step preserves the overall semantic representation of each scale while effectively suppressing redundant or noisy channels, thereby emphasizing critical local boundary details and global structural cues essential for accurate cloud detection. The channel-refined multi-scale features are subsequently concatenated along the channel dimension and passed through an additional semantic attention module, which consists of a sequence of 1 × 1 convolution, Batch Normalization, ReLU activation, and Sigmoid activation. This semantic attention mechanism further captures the intrinsic correlations of features across channels and spatial dimensions, facilitating deeper feature fusion. The resulting unified feature representation balances local and global discriminative information, providing robust input for the subsequent decoding stage. The formal definition of the MRAFE module is presented as follows:
z_i = \mathrm{GAP}(x_i) = \frac{1}{H \times W} \sum_{h=1}^{H} \sum_{w=1}^{W} x_i(:, :, h, w)
s_i = \sigma\!\left( W_{2,i} * \delta\!\left( W_{1,i} * z_i \right) \right)
\tilde{x}_i = x_i \otimes s_i
here, $x_i \in \mathbb{R}^{B \times C \times H \times W}$ denotes the input feature map at the $i$-th scale ($i = 1, 2, 3$), $r$ represents the channel dimensionality reduction factor, $\delta(\cdot)$ denotes the ReLU activation function, $\sigma(\cdot)$ signifies the Sigmoid activation function, $\otimes$ indicates element-wise multiplication, and $*$ represents the convolution operation. Additionally, $W_{1,i} \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_{2,i} \in \mathbb{R}^{C \times \frac{C}{r}}$.
X_{\mathrm{cat}} = \mathrm{Concat}(\tilde{x}_1, \tilde{x}_2, \tilde{x}_3) \in \mathbb{R}^{B \times 3C \times H \times W}
X_{\mathrm{out}} = \mathrm{TransConv}(X_{\mathrm{cat}}) \in \mathbb{R}^{B \times C \times H \times W}
It is noteworthy that the output features of MRAFE maintain consistency with individual inputs in spatial dimensions ($H \times W$) and channel count ($C$), enhancing the module’s portability.
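A hedged PyTorch sketch of this design is given below: per-scale SE channel attention, channel-wise concatenation, a 1 × 1 Conv–BN–ReLU–Sigmoid semantic attention gate, and a transposed-convolution projection back to $C$ channels. It assumes the three scale features have already been brought to a common spatial size, and the reduction factor r = 16 is a placeholder rather than a reported setting.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention applied independently to one scale."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                          # global average pooling
            nn.Conv2d(channels, channels // r, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=1), nn.Sigmoid())

    def forward(self, x):
        return x * self.gate(x)                               # channel recalibration

class MRAFE(nn.Module):
    """Per-scale SE attention, concatenation, semantic attention, channel projection."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.se = nn.ModuleList([SEBlock(channels, r) for _ in range(3)])
        self.sem_att = nn.Sequential(                         # 1x1 Conv-BN-ReLU-Sigmoid gate
            nn.Conv2d(3 * channels, 3 * channels, kernel_size=1),
            nn.BatchNorm2d(3 * channels), nn.ReLU(inplace=True), nn.Sigmoid())
        self.proj = nn.ConvTranspose2d(3 * channels, channels, kernel_size=1)

    def forward(self, xs):                                    # xs: three (B, C, H, W) maps
        refined = [se(x) for se, x in zip(self.se, xs)]
        cat = torch.cat(refined, dim=1)                       # (B, 3C, H, W)
        gated = cat * self.sem_att(cat)                       # semantic attention
        return self.proj(gated)                               # back to (B, C, H, W)
```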

3.4. LAA

Although skip connections in encoder–decoder architectures help preserve detail, they can introduce noise during feature fusion. To address this limitation, we introduce an LAA mechanism within the skip connections of HDAMNet. This module adaptively fuses multi-level features via a channel attention mechanism, effectively integrating coarse-scale semantic information with fine-grained details. Figure 5 illustrates the LAA in detail.
In particular, upper-level features typically possess higher spatial resolutions, whereas lower-level outputs contain deeper semantic abstraction but exhibit lower resolutions. To ensure spatial alignment, lower-level outputs undergo deconvolution, matching the dimensions of higher-level features. These two levels of features are then stacked along a new dimension, forming a multi-dimensional tensor containing scale-specific information. Subsequently, the LAA mechanism applies global statistical methods to weigh channels. Initially, channel dimensions of the multi-dimensional features are uniformly reduced through volumetric pooling. Then, a lightweight two-layer fully connected network computes attention coefficients for each channel. After normalization, these coefficients are multiplied element-wise with the original tensor, selectively enhancing or suppressing features from different scales. Consequently, fused features incorporating both detailed and global semantic information are generated.
Following attention weighting, the LAA mechanism splits the weighted upper- and lower-level features along the newly introduced dimension and combines them element-wise, forming a refined skip connection propagated to the subsequent decoding layers. This adaptive channel-weighting process strengthens the network’s capability in precise cloud boundary detection and comprehensive multi-scale feature modeling, while maintaining the global semantic information encoded in deeper layers. Formally, given a high-resolution feature $x_1 \in \mathbb{R}^{B \times C \times H \times W}$ and a low-resolution feature $x_2 \in \mathbb{R}^{B \times 2C \times \frac{H}{2} \times \frac{W}{2}}$, the output $y \in \mathbb{R}^{B \times C \times H \times W}$ is computed as follows:
\phi = \mathrm{stack}\!\left( x_1, \mathrm{TransConv}(x_2) \right) \in \mathbb{R}^{B \times C \times H \times W \times 2}
W_3 = \sigma\!\left( W_2 \, \delta\!\left( W_1 \, \mathrm{Pool}(\phi) \right) \right)
y = (W_3 \otimes \phi)[:, :, :, :, 1] + (W_3 \otimes \phi)[:, :, :, :, 2]
where $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$ are the weight matrices of the two-layer fully connected network, $\mathrm{Pool}(\cdot)$ denotes a global average pooling operation performed across the spatial dimensions, $\delta(\cdot)$ is the ReLU activation function, $\sigma(\cdot)$ is the Sigmoid activation function, $W_3 \in \mathbb{R}^{B \times C \times H \times W \times 2}$ is the calculated attention weight tensor, and $\otimes$ represents the element-wise multiplication operation.
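A sketch of the LAA gate following these equations is shown below. Reading $W_3$ as carrying a separate weight for each stacked level, the bottleneck outputs $2C$ sigmoid gates; the reduction factor, pooling choice, and transposed-convolution settings are our assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class LAA(nn.Module):
    """Layer-wise adaptive gating of a skip feature and an upsampled deeper feature."""
    def __init__(self, channels, r=16):
        super().__init__()
        # upsample the (B, 2C, H/2, W/2) deep feature to (B, C, H, W)
        self.up = nn.ConvTranspose2d(2 * channels, channels, kernel_size=2, stride=2)
        # two-layer bottleneck producing one sigmoid gate per channel and per level
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(inplace=True),
            nn.Linear(channels // r, 2 * channels), nn.Sigmoid())

    def forward(self, x1, x2):
        # x1: encoder skip (B, C, H, W); x2: deeper feature (B, 2C, H/2, W/2)
        phi = torch.stack((x1, self.up(x2)), dim=-1)          # (B, C, H, W, 2)
        pooled = phi.mean(dim=(2, 3, 4))                      # global pooling -> (B, C)
        w = self.fc(pooled).view(-1, phi.size(1), 1, 1, 2)    # (B, C, 1, 1, 2) gates
        weighted = w * phi                                    # recalibrated levels
        return weighted[..., 0] + weighted[..., 1]            # fused (B, C, H, W)
```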

3.5. Loss Function

In binary semantic segmentation tasks, it is common practice to combine the strengths of Binary Cross-Entropy Loss ( L BCE ) and Dice Loss ( L Dice ) to construct an effective composite loss function. L BCE measures the discrepancy between predicted values and ground truth labels, while Dice Loss, based on the Dice coefficient, is particularly effective for image segmentation tasks as it accounts for spatial overlap and boundary information. The L BCE and L Dice are defined as follows:
L_{\mathrm{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \lambda_i + (1 - y_i) \log (1 - \lambda_i) \right]
L_{\mathrm{Dice}} = 1 - \frac{2 \sum_{i=1}^{N} \lambda_i y_i}{\sum_{i=1}^{N} \lambda_i^2 + \sum_{i=1}^{N} y_i^2}
where $N$ denotes the number of samples, $y_i$ is the true label for the $i$-th sample, and $\lambda_i$ is the predicted probability that the $i$-th sample belongs to the positive class. Thus, the overall loss function can be presented as follows:
L_{\mathrm{all}} = L_{\mathrm{BCE}} + \rho L_{\mathrm{Dice}}
where $\rho$ is a hyperparameter that balances the contribution of the two loss components.
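The composite loss translates directly into code, as in the brief sketch below; pred is assumed to hold sigmoid probabilities, target holds binary cloud masks, and eps is a small stabilizer we add for numerical safety.

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(pred, target, rho=1.0, eps=1e-6):
    """Composite loss L_all = L_BCE + rho * L_Dice on probability maps."""
    bce = F.binary_cross_entropy(pred, target)
    inter = (pred * target).sum()
    dice = 1 - 2 * inter / (pred.pow(2).sum() + target.pow(2).sum() + eps)
    return bce + rho * dice
```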

4. Results

4.1. Datasets

To comprehensively evaluate the performance of our method, we employ three carefully selected datasets, each targeting specific challenges in cloud detection: HRC_WHU [31] focuses on multi-scale cloud clusters, CloudS_M24 [52] addresses the disambiguation of clouds from snow and ice, and WHU Cloud [53] emphasizes the differentiation between clouds and cloud shadows.

4.1.1. HRC_WHU Dataset

The High-Resolution Cloud Detection dataset (HRC_WHU) employed in this study comprises 150 high-resolution RGB images sourced from various regions around the world. To enhance the dataset’s accuracy and practical utility, reference cloud masks were manually annotated by remote sensing image interpretation experts at Wuhan University, ensuring the high quality and reliability of the annotations. Each image is in RGB format, with spatial resolutions ranging from 0.5 m to 15 m, providing rich detail for analysis and processing. Due to the high pixel resolution of the original remote sensing images and the limited memory capacity of GPUs, directly training models on full-size images is computationally expensive. Therefore, the images were cropped into patches of 512 × 512 pixels for training. In total, the dataset contains 2256 patches, with 1806 used for training and 450 for testing. The dataset primarily includes five major land cover types—water, vegetation, urban, snow and ice, and desert—selected to represent a broad spectrum of surface features and to enable the evaluation of cloud detection algorithms across diverse environments. Figure 6 illustrates several representative examples from the dataset.
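For reference, a minimal sketch of the patch preparation shared by the three datasets is shown below (non-overlapping 512 × 512 crops, with edge remainders simply discarded); the exact cropping and augmentation policies of the released datasets may differ.

```python
import numpy as np

def tile_image(image, patch=512):
    """Split an (H, W, ...) scene into non-overlapping patch x patch crops."""
    h, w = image.shape[:2]
    return [image[r:r + patch, c:c + patch]
            for r in range(0, h - patch + 1, patch)
            for c in range(0, w - patch + 1, patch)]
```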

4.1.2. CloudS_M24 Dataset

The CloudS_M24 dataset comprises 24 high-resolution remote sensing images with meter-level spatial resolution, collected between 2014 and 2018, and geographically distributed across Europe, Asia, and North America. The dataset focuses on scenes involving clouds, snow, and ice, making it particularly valuable for evaluating cloud detection performance in challenging conditions. These complex and diverse conditions provide an ideal testbed to assess the model’s ability to differentiate clouds from snow in challenging environments. To facilitate model training, the original images were cropped into patches of 512 × 512 pixels. In total, 1120 patches were used for training and 46 patches were used for testing. Figure 7 displays several representative examples from the CloudS_M24 dataset.

4.1.3. WHU Cloud Dataset

The dataset includes a wide variety of land cover types such as forests, urban areas, oceans, lakes, mountains, and coastal plains. The relatively low cloud coverage in the target images—approximately 5%—reflects common operational conditions in real-world remote sensing applications. The original scene sizes range from 7500 × 7700 to 7800 × 8000 pixels. Due to their high pixel resolution, the images were seamlessly cropped into smaller patches of 512 × 512 pixels to facilitate efficient model training and inference. After applying data augmentation techniques, the number of training samples was expanded to 2720. An additional 129 patches were used for testing. The dataset places particular emphasis on the distinction between clouds and cloud shadows, which is critical for evaluating model robustness in complex atmospheric conditions. To better evaluate the model’s generalization capability in identifying clouds, we reclassified cloud shadows as non-cloud in our experiments. Several representative examples from the WHU Cloud dataset are shown in Figure 8.

4.2. Comparative Methods

To comprehensively evaluate the performance of our proposed HDAMNet, we selected a diverse range of state-of-the-art semantic segmentation models from three major architectural categories: CNN-based, Transformer-based, and Mamba-based methods.

4.2.1. CNN-Based Methods

UNet [12]: A foundational encoder–decoder architecture with skip connections, widely recognized as a strong baseline in semantic segmentation due to its effectiveness in capturing multi-level features.
PSPNet [54]: This model introduces the pyramid pooling module to aggregate global context information at multiple scales, which is highly relevant for perceiving large and structurally varied cloud formations.
CloudNet [11]: A UNet-based architecture specifically designed for cloud detection in Landsat imagery, making it a highly relevant competitor for this task.
CDnet [30]: A CNN-based network focused on achieving a balance between accuracy and efficiency for cloud detection, serving as a benchmark for real-world application potential.
CloudSegNet [55]: An encoder–decoder network specifically proposed for cloud segmentation tasks, which provides another domain-specific baseline for comparison.

4.2.2. Transformer-Based Methods

UNetFormer [37]: A model that integrates a lightweight Transformer into a UNet-like architecture, designed for the efficient and effective semantic segmentation of remote sensing imagery.
Swin-Unet [38]: A pure Transformer model with a U-shaped architecture that uses shifted windows for hierarchical feature extraction, making it highly capable of capturing multi-scale visual patterns relevant to clouds.
CMTFNet [39]: A fusion network that combines the strengths of both CNNs and Transformers to leverage local texture information and global contextual dependencies simultaneously.
TransUNet [40]: A pioneering model that uses a Transformer as the encoder to capture global context and a CNN-based decoder to restore fine-grained details, setting a precedent for hybrid architectures.

4.2.3. Mamba-Based Methods

Samba [24]: Represents an early and important exploration of the Mamba architecture for semantic segmentation in remote sensing images, serving as a crucial baseline in this category.
VM-UNet [19] and Mamba-UNet [20]: These models integrate Visual Mamba Blocks into a U-Net architecture, representing direct architectural competitors for segmentation tasks in computer vision and medical imaging.
RS3Mamba [56]: A novel Mamba-based network designed specifically for remote sensing image segmentation, making it a state-of-the-art Mamba-based competitor for our task.

4.3. Implementation Details

HDAMNet was implemented using the PyTorch 2.1 framework. Training was conducted on a workstation equipped with an NVIDIA GeForce RTX 4090 GPU (24 GB) and an Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10 GHz. The batch size was set to 2, and the model was trained for 150 epochs. During inference, each image was input into the network at a resolution of 512 × 512 pixels. No post-processing techniques were required to obtain the final predictions.
During training, we employed the AdamW optimizer to update the network parameters. The initial learning rate $lr_0$ was set to 0.001, the weight decay coefficient was set to 0.01, the exponential decay rate $\beta_1$ for first-moment estimation was set to 0.9, and $\beta_2$ for second-moment estimation was set to 0.999. We adopted the Cosine Annealing Learning Rate strategy (CosineAnnealingLR) for learning rate scheduling. The learning rate at epoch $t$, denoted as $lr_t$, is computed as follows:
lr_t = \eta_{\min} + \frac{1}{2} \left( lr_0 - \eta_{\min} \right) \left( 1 + \cos \frac{t \pi}{T_{\max}} \right)
where $\eta_{\min}$ is the minimum learning rate, $t$ is the current epoch number, and $T_{\max}$ is the total number of epochs in the cosine annealing cycle.
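Expressed with standard PyTorch APIs, the reported configuration looks roughly as follows; model and train_one_epoch are placeholders, and eta_min = 0 is an assumption since the minimum learning rate is not stated in the text.

```python
import torch

# model is a placeholder for the instantiated network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                              betas=(0.9, 0.999), weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=150, eta_min=0)

for epoch in range(150):
    train_one_epoch(model, optimizer)   # placeholder for the actual training step
    scheduler.step()                    # cosine-annealed learning rate update
```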

4.4. Evaluation Metrics

To quantitatively evaluate the performance of our algorithm, we employ the following evaluation metrics: Jaccard Index, Precision, Recall, Specificity, F1-score, and Overall Accuracy (OA). The formal definitions of these metrics are provided below:
\mathrm{Jaccard\ Index} = \frac{TP}{TP + FN + FP}
\mathrm{Precision} = \frac{TP}{TP + FP}
\mathrm{Recall} = \frac{TP}{TP + FN}
\mathrm{Specificity} = \frac{TN}{TN + FP}
\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\mathrm{Overall\ Accuracy} = \frac{TP + TN}{TP + TN + FN + FP}
where T P , T N , F P , and F N are the total number of true positive, true negative, false positive, and false-negative pixels, respectively. These metrics—Jaccard, F1, and OA—are global indicators that reflect overall model performance.
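For completeness, the six metrics can be computed from pixel-level confusion counts as in the small helper below; this is an illustration, not the authors' evaluation script, and it assumes the image contains at least one positive and one negative pixel.

```python
import numpy as np

def metrics(pred, gt):
    """Compute the six pixel-level metrics from binary prediction and ground-truth masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt); tn = np.sum(~pred & ~gt)
    fp = np.sum(pred & ~gt); fn = np.sum(~pred & gt)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "Jaccard": tp / (tp + fn + fp),
        "Precision": precision,
        "Recall": recall,
        "Specificity": tn / (tn + fp),
        "F1": 2 * precision * recall / (precision + recall),
        "OA": (tp + tn) / (tp + tn + fp + fn),
    }
```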

4.5. Ablation Study of Key Components

To systematically dissect the functionality of each key component within HDAMNet and to quantify their respective contributions to the model’s overall efficacy, a comprehensive ablation study was conducted on the HRC_WHU dataset. The study was designed with a progressive framework, beginning with a baseline model and incrementally integrating our proposed modules to construct distinct experimental variants. Variant-I is established as the baseline, employing a standard UNet architecture. To specifically target the challenge of multi-scale feature representation, Variant-II replaces the baseline’s convolutional downsampling blocks with our proposed HDAMamba Blocks, which incorporate the specialized HDCS mechanism. Subsequently, Variant-III enhances this configuration by integrating the MRAFE module within the HDAMamba Blocks for refined feature fusion. To better isolate the contribution of the LAA module, we also designed Variant-IV, which applies only the LAA mechanism to the skip connections of the baseline model. Finally, the complete HDAMNet model is realized by augmenting Variant-III with the LAA mechanism at the skip connections. To ensure a fair comparison, all experiments were conducted under identical hyperparameter settings, with the results summarized in Table 1, where ✗ indicates that a component is not included in the model and ✓ indicates that it is included.
The results offer a clear stepwise validation of each module’s efficacy. The analysis commences with the baseline U-Net model (Variant-I), which established a performance benchmark with a Jaccard Index of 86.97%. The first major architectural modification involved replacing the baseline’s encoder with our custom HDAMamba Blocks (Variant-II), a change that yielded a substantial performance increase of 1.33%, elevating the Jaccard Index to 88.30%. This significant improvement is primarily attributable to the specialized HDCS mechanism within our blocks, which is engineered to capture the complex, multi-scale characteristics of clouds more effectively than a generic SSM implementation. Building on this robust foundation, the integration of the MRAFE module (Variant-III) provided a further enhancement, raising the Jaccard Index to 89.49%. This gain highlights MRAFE’s effectiveness in refining the rich features generated by HDCS while mitigating channel redundancy. Finally, the incorporation of the LAA mechanism at the skip connections culminated in our full HDAMNet model, which achieved the top performance with a Jaccard Index of 91.12%. This final 1.63% improvement underscores the critical role of LAA in achieving precise boundary delineation through its adaptive fusion of multi-level features.
Interestingly, Variant-IV shows a slight performance degradation (86.27% Jaccard) compared to the baseline itself. This result highlights the strong synergistic relationship among our proposed components. The LAA mechanism is specifically optimized to fuse the rich hierarchically structured features generated by the HDAMamba block, which contains HDCS and MRAFE. When applied to the relatively simpler features from a standard convolutional encoder, its sophisticated recalibration process may not find the same meaningful patterns to amplify, potentially introducing minor noise. This underscores the fact that the full power of LAA is unlocked when it works in concert with the HDAMamba encoder. In summary, the ablation study confirms that the proposed components—HDCS, MRAFE, and LAA—contribute synergistically, progressively enhancing the model’s capabilities in multi-scale feature extraction, adaptive fusion, and hierarchical detail preservation.
Specifically, the HDCS mechanism significantly boosts the network’s ability to capture fine-grained cloud boundaries and represent large-scale structures through multi-scale shifted window partitioning, dynamic convolution, and multi-resolution feature integration. The MRAFE module effectively mitigates channel redundancy and enhances semantic clarity during cross-level fusion. Meanwhile, the LAA mechanism adaptively emphasizes high-level semantic features and suppresses noise through channel attention, significantly boosting the model’s capability in precise cloud boundary detection and comprehensive multi-scale feature representation.

4.6. Comparison Test of the HRC_WHU Dataset

To comprehensively evaluate the effectiveness of the proposed model, we conducted a series of comparative experiments on the HRC_WHU dataset, benchmarking HDAMNet against state-of-the-art semantic segmentation models. These models include UNet, PSPNet, UNetFormer, Swin-Unet, CMTFNet, TransUNet, Samba, VM-UNet, Mamba-UNet, and RS3Mamba. Additionally, we compared HDAMNet with recent models specifically designed for cloud detection, such as CloudNet, CloudSegNet, and CDNet. To ensure fairness and reproducibility, all models were trained using their default parameter settings. Table 2 summarizes the experimental results, with the best-performing scores highlighted in bold and the second-best underlined for clarity. The results demonstrate that the proposed HDAMNet consistently ranks among the top across all evaluation metrics, achieving the best performance on key indicators for semantic segmentation, including the Jaccard Index, OA, and F1-score. Compared to other models, HDAMNet yields a minimum improvement of 0.72% in the Jaccard Index and at least 0.41% in the F1-score.
Figure 9 presents the semantic segmentation visualizations on the HRC_WHU dataset for the top six models ranked by Jaccard Index. Representative examples were selected from various background types, including water, vegetation, urban areas, snow and ice, and desert scenes. Segmentation results produced by different models at the same spatial location are highlighted with orange rectangular boxes to facilitate direct visual comparison. For instance, in the first row (desert background) and the fourth row (urban background), some SOTA models exhibit frequent omission of small-scale cloud clusters and struggle to capture the fine details of thin or fragmented clouds. In contrast, our model successfully identifies nearly all cloud structures, including those with ambiguous boundaries. This improvement is primarily attributed to the incorporation of the SSM, which enhances long-range spatial dependency modeling by replacing conventional convolutional downsampling modules in the encoder with HDAMamba Blocks, thereby significantly improving multi-scale cloud perception. In the third row (urban background), other models tend to produce false positives when encountering complex cloud formations, especially near cloud–background boundaries, where boundary fuzziness becomes evident. Our model demonstrates superior performance in such scenarios, largely owing to the HDCS mechanism and the MRAFE module. These components substantially expand the receptive field and enable the dynamic fusion of multi-scale features, thereby enhancing both detail representation and adaptability to varying cloud morphologies across scales.
Our model leverages the SSM to establish a novel paradigm for long-range spatial modeling in multi-scale cloud detection. The proposed HDCS and MRAFE modules enhance the extraction of fine-grained multi-scale features, while the LAA mechanism demonstrates strong effectiveness in detail refinement and boundary delineation. Comprehensive experimental results confirm that HDAMNet consistently delivers superior performance across diverse background conditions, with marked advantages in capturing multi-scale cloud features and achieving precise cloud boundary segmentation, significantly outperforming other comparative models.

4.7. Generalization Experiment of the CloudS_M24 Dataset

We employed the CloudS_M24 dataset to evaluate the generalization capability of the proposed method in cloud–snow coexistence scenarios. All parameters were set to their respective default values. The experimental results are presented in Table 3, where the best results are highlighted in bold and the second-best results are underlined. Since only two categories—cloud and non-cloud—were selected for comparison, most models demonstrated satisfactory performance.
Figure 10 displays the semantic segmentation visualizations of the top six models ranked by Jaccard Index on the CloudS_M24 dataset. We selected representative images with diverse backgrounds, all containing snow and ice, which closely resemble clouds in terms of reflectance characteristics and local texture. As a result, snow and ice serve as the primary noise source, increasing the complexity of cloud detection. To clearly illustrate the segmentation performance of each model, we marked the same image regions using orange rectangular boxes. In the first row, several comparative models exhibited missed or false detections when segmenting fine clouds. Benefiting from the multi-scale feature perception provided by the HDCS mechanism, our model accurately captured cloud details and effectively reduced false detections. In the second row, under a complex cloud–shadow background, other models produced blurry boundaries at the interface between cloud and shadow and struggled to identify small thin clouds. In contrast, our model accurately delineated boundary details thanks to the dynamic feature fusion enabled by the LAA mechanism, which adaptively integrates features across different levels. In the third row, the extensive snow and ice coverage introduces strong interference, making cloud identification difficult even for human observers. Leveraging the SSM, our model effectively captured long-range spatial dependencies, enabling the accurate segmentation of complex cloud structures and significantly reducing the false detection rate. In the fourth and fifth rows, HDAMNet performed especially well in detecting thin clouds and handling complex cloud–shadow boundaries. It was particularly effective in the upper-left region of the fifth image, which contains a mixture of cloud shadows, snow, and ice. This performance can be attributed to the MRAFE module, which dynamically fuses multi-scale features while balancing fine boundary refinement and overall segmentation accuracy.

4.8. Generalization Experiment of the WHU Cloud Dataset

Similarly, to assess the model’s generalization, we conducted experiments on the WHU Cloud dataset using default parameters. The experimental results are presented in Table 4, with the best scores highlighted in bold and the second-best underlined. On this dataset, HDAMNet achieved the highest scores for the key metrics of Jaccard Index and F1-score, demonstrating strong and competitive performance across all evaluation metrics.
Figure 11 displays the semantic segmentation visualizations of the top six models ranked by Jaccard Index on the WHU Cloud dataset. Compared with other methods, our method achieves more precise segmentation results, particularly around complex and irregular cloud boundaries, where false positives and false negatives are significantly reduced. This improvement can be attributed to the model’s multi-scale hierarchical design, which effectively captures both global contextual information and fine-grained local details. As a result, HDAMNet is able to better distinguish between thin clouds and background, as well as suppress noisy predictions in non-cloud regions, demonstrating its strong capability in handling challenging cloud detection scenarios.

5. Discussion

5.1. Analysis of Shifted Window Scale Settings

To effectively capture the diverse spatial patterns of clouds, which range from fine-grained local textures to extensive global structures, the design of an efficient multi-scale feature extraction mechanism is crucial. In our proposed HDCS mechanism, the shifted window scales directly influence the spatial receptive field and the granularity of extracted features at each resolution, which is particularly important for delineating cloud boundaries and modeling structural variations across scales. Therefore, systematically investigating different window scale combinations is essential to validate the effectiveness of our architectural design and to understand how multi-resolution information contributes to enhanced cloud detection performance.
As summarized in Table 5, our experiments demonstrate that employing multiple shifted window scales substantially improves model performance compared to using a single fixed scale. Specifically, the combination of scales {0.25, 0.50, 1.00} achieves the best overall results, with a Jaccard Index of 91.12%, Overall Accuracy of 97.42%, and F1 Score of 95.36%. This setting enables the model to simultaneously capture fine cloud boundary details, intermediate morphological variations, and global contextual structures, leading to more precise and robust cloud segmentation. In contrast, finer-scale combinations such as {0.10, 0.30, 0.50} and {0.10, 0.20, 0.30} exhibited declines in Jaccard Index (88.64% and 88.44%, respectively) and F1 scores, despite achieving relatively high Precision. These results suggest that overly fine partitions tend to overemphasize local details at the expense of global structure perception, resulting in reduced Recall and higher omission rates. Conversely, using fewer scales, such as {0.25, 0.25, 0.50}, still maintained a competitive Jaccard Index (90.82%), presenting a potential trade-off between the number of scales used and the resulting segmentation performance.
These findings highlight that multi-scale spatial perception is indispensable for robust cloud feature extraction. Smaller windows enhance the detection of fine-grained cloud boundaries, medium-sized windows capture regional morphological structures, and larger windows enable the modeling of long-range dependencies critical for identifying extensive sparse cloud formations. The optimal combination {0.25, 0.50, 1.00} allows the model to flexibly adapt to both dense and sparse cloud scenarios, enhancing its generalization capability across diverse atmospheric conditions. Furthermore, simply introducing more scales, such as {0.20, 0.40, 0.75}, does not necessarily lead to further performance improvements, underscoring the importance of selecting an optimal set of scales for feature richness.

5.2. Comparative Analysis of Core Modules: Mamba and Transformer

To further validate our architectural choice of Mamba over the more established Transformer, we conducted a direct comparative experiment. We created an experimental variant of HDAMNet by replacing the core Mamba component within each HDAMamba Block with a standard Transformer block, while keeping all other parameters and training settings identical. We then compared the overall segmentation accuracy of these two model variants, and analyzed the computational cost of their respective core blocks.
The results summarized in Table 6 are compelling. The model variant equipped with our Mamba-based block significantly outperforms the Transformer-based equivalent in overall segmentation accuracy, achieving a marked improvement in the Jaccard Index. This clear performance advantage is coupled with a substantial gain in the efficiency of the core block itself; the Mamba-based block requires only one-third of the computational operations (FLOPs) and exhibits a faster inference time. This analysis validates that, for the specific challenges of multi-scale cloud detection, our Mamba-based architectural choice provides a superior balance of both accuracy and efficiency.

6. Conclusions

To address the challenges of multi-scale feature extraction and complex cloud morphology modeling in remote sensing cloud detection, this paper proposes HDAMNet, a novel architecture based on the SSM. By integrating the HDCS mechanism into the encoder, the network’s receptive field is effectively expanded, enhancing its ability to capture fine-grained cloud boundaries. The MRAFE module further fuses semantic information across different scales, mitigating channel confusion and emphasizing critical features. Additionally, the LAA mechanism dynamically allocates feature weights at skip connections, balancing fine boundary details with global semantic information. Experimental results on the HRC_WHU, CloudS_M24, and WHU Cloud datasets demonstrate that HDAMNet achieves optimal or near-optimal performance across key metrics, including Jaccard Index, F1-score, and Overall Accuracy. The model exhibits notable advantages in handling small-scale clouds, complex cloud–shadow boundaries, and interference from snow and ice, offering a promising solution for high-precision cloud detection. In future work, we aim to further optimize the model by reducing its computational complexity and memory footprint while maintaining high accuracy.

Author Contributions

Conceptualization, Y.W. and L.Z.; methodology, Y.W.; software, Y.W.; validation, Y.W., X.Y., and Y.L.; formal analysis, R.J.; investigation, X.Y.; resources, Y.L.; data curation, L.Z.; writing—original draft preparation, Y.W.; writing—review and editing, L.Z.; visualization, Y.W.; supervision, L.Z.; project administration, L.Z.; funding acquisition, L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Innovation and Entrepreneurship Training Program for College Students, grant number 202410298058Z.

Data Availability Statement

The data and the code of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Verbesselt, J.; Hyndman, R.; Newnham, G.; Culvenor, D. Detecting Trend and Seasonal Changes in Satellite Image Time Series. Remote Sens. Environ. 2010, 114, 106–115. [Google Scholar] [CrossRef]
  2. Dutrieux, L.P.; Verbesselt, J.; Kooistra, L.; Herold, M. Monitoring Forest Cover Loss Using Multiple Data Streams, a Case Study of a Tropical Dry Forest in Bolivia. ISPRS J. Photogramm. Remote Sens. 2015, 107, 112–125. [Google Scholar] [CrossRef]
  3. Prasad, A.K.; Chai, L.; Singh, R.P.; Kafatos, M. Crop Yield Estimation Model for Iowa Using Remote Sensing and Surface Parameters. Int. J. Appl. Earth Obs. Geoinf. 2006, 8, 26–33. [Google Scholar] [CrossRef]
  4. Tralli, D.M.; Blom, R.G.; Zlotnicki, V.; Donnellan, A.; Evans, D.L. Satellite Remote Sensing of Earthquake, Volcano, Flood, Landslide and Coastal Inundation Hazards. ISPRS J. Photogramm. Remote Sens. 2005, 59, 185–198. [Google Scholar] [CrossRef]
  5. Zhang, Y.; Rossow, W.B.; Lacis, A.A.; Oinas, V.; Mishchenko, M.I. Calculation of Radiative Fluxes from the Surface to Top of Atmosphere Based on ISCCP and Other Global Data Sets: Refinements of the Radiative Transfer Model and the Input Data. J. Geophys. Res. Atmos. 2004, 109, D19105. [Google Scholar] [CrossRef]
  6. Zhu, Z.; Woodcock, C.E. Object-Based Cloud and Cloud Shadow Detection in Landsat Imagery. Remote Sens. Environ. 2012, 118, 83–94. [Google Scholar] [CrossRef]
  7. Irish, R.R.; Barker, J.L.; Goward, S.N.; Arvidson, T. Characterization of the Landsat-7 ETM+ Automated Cloud-Cover Assessment (ACCA) Algorithm. Photogramm. Eng. Remote Sens. 2006, 72, 1179–1188. [Google Scholar] [CrossRef]
  8. Hu, X.; Wang, Y.; Shan, J. Automatic Recognition of Cloud Images by Using Visual Saliency Features. IEEE Geosci. Remote Sens. Lett. 2015, 12, 1760–1764. [Google Scholar]
  9. Zhang, W.; Jin, S.; Zhou, L.; Xie, X.; Wang, F.; Jiang, L.; Zheng, Y.; Qu, P.; Li, G.; Pan, X. Multi-Feature Embedded Learning SVM for Cloud Detection in Remote Sensing Images. Comput. Electr. Eng. 2022, 102, 108177. [Google Scholar] [CrossRef]
  10. Zhang, H.; Huang, Q.; Zhang, L. Multi-Temporal Cloud Detection Based on Robust PCA for Optical Remote Sensing Imagery. Comput. Electron. Agric. 2021, 188, 106342. [Google Scholar] [CrossRef]
  11. Mohajerani, S.; Saeedi, P. Cloud-Net: An End-to-End Cloud Detection Algorithm for Landsat 8 Imagery. In Proceedings of the IGARSS 2019–2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 1029–1032. [Google Scholar]
  12. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer International Publishing: Cham, Switzerland, 2015. [Google Scholar]
  13. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar]
  14. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar]
  15. Ge, W.; Yang, X.; Jiang, R.; Shao, W.; Zhang, L. CD-CTFM: A Lightweight CNN-Transformer Network for Remote Sensing Cloud Detection Fusing Multiscale Features. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 4538–4551. [Google Scholar] [CrossRef]
  16. Zhang, B.; Zhang, Y.; Li, Y.; Wan, Y.; Yao, Y. CloudViT: A Lightweight Vision Transformer Network for Remote Sensing Cloud Detection. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
  17. Zhang, Z.; Xu, Z.; Liu, C.; Tian, Q.; Wang, Y. Cloudformer: Supplementary Aggregation Feature and Mask-Classification Network for Cloud Detection. Appl. Sci. 2022, 12, 3221. [Google Scholar] [CrossRef]
  18. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2024, arXiv:2312.00752. [Google Scholar]
  19. Ruan, J.; Li, J.; Xiang, S. VM-UNet: Vision Mamba UNet for Medical Image Segmentation. arXiv 2024, arXiv:2402.02491. [Google Scholar] [CrossRef]
  20. Wang, Z.; Zheng, J.-Q.; Zhang, Y.; Cui, G.; Li, L. Mamba-UNet: UNet-Like Pure Visual Mamba for Medical Image Segmentation. arXiv 2024, arXiv:2402.05079. [Google Scholar]
  21. Wang, Y.; Cao, L.; Deng, H. MFMamba: A Mamba-Based Multi-Modal Fusion Network for Semantic Segmentation of Remote Sensing Images. Sensors 2024, 24, 7266. [Google Scholar] [CrossRef]
  22. Zhao, S.; Chen, H.; Zhang, X.; Xiao, P.; Bai, L.; Ouyang, W. RS-Mamba for Large Remote Sensing Image Dense Prediction. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5633314. [Google Scholar] [CrossRef]
  23. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Liu, Y. VMamba: Visual State Space Model. Adv. Neural Inf. Process. Syst. 2024, 37, 103031–103063. [Google Scholar]
  24. Zhu, Q.; Cai, Y.; Fang, Y.; Yang, Y.; Chen, C.; Fan, L.; Nguyen, A. Samba: Semantic Segmentation of Remotely Sensed Images with State Space Model. Heliyon 2024, 10, e38495. [Google Scholar] [CrossRef] [PubMed]
  25. Ma, X.; Zhang, X.; Pun, M.-O. RS3Mamba: Visual State Space Model for Remote Sensing Image Semantic Segmentation. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
  26. Zhang, J.; Li, Y.; Yang, X.; Jiang, R.; Zhang, L. RSAM-Seg: A SAM-Based Model with Prior Knowledge Integration for Remote Sensing Image Semantic Segmentation. Remote Sens. 2025, 17, 590. [Google Scholar] [CrossRef]
  27. Zhang, L.; Sun, J.; Yang, X.; Jiang, R.; Ye, Q. Improving Deep Learning-Based Cloud Detection for Satellite Images with Attention Mechanism. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  28. Francis, A.; Sidiropoulos, P.; Muller, J.-P. CloudFCN: Accurate and Robust Cloud Detection for Satellite Imagery with Deep Learning. Remote Sens. 2019, 11, 2312. [Google Scholar] [CrossRef]
  29. Jeppesen, J.H.; Jacobsen, R.H.; Inceoglu, F.; Toftegaard, T.S. A Cloud Detection Algorithm for Satellite Imagery Based on Deep Learning. Remote Sens. Environ. 2019, 229, 247–259. [Google Scholar] [CrossRef]
  30. Yang, J.; Guo, J.; Yue, H.; Liu, Z.; Hu, H.; Li, K. CDnet: CNN-Based Cloud Detection for Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2019, 57, 6195–6211. [Google Scholar] [CrossRef]
  31. Li, Z.; Shen, H.; Cheng, Q.; Liu, Y.; You, S.; He, Z. Deep Learning Based Cloud Detection for Medium and High Resolution Remote Sensing Images of Different Sensors. ISPRS J. Photogramm. Remote Sens. 2019, 150, 197–212. [Google Scholar] [CrossRef]
  32. Hu, K.; Zhang, D.; Xia, M. CDUNet: Cloud Detection UNet for Remote Sensing Imagery. Remote Sens. 2021, 13, 4533. [Google Scholar] [CrossRef]
  33. Lu, C.; Xia, M.; Lin, H. Multi-Scale Strip Pooling Feature Aggregation Network for Cloud and Cloud Shadow Segmentation. Neural Comput. Appl. 2022, 34, 6149–6162. [Google Scholar] [CrossRef]
  34. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762. [Google Scholar]
  35. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.S.; et al. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890. [Google Scholar]
  36. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In Proceedings of the Advances in Neural Information Processing Systems, Red Hook, NY, USA, 14–16 December 2021; Volume 34, pp. 12077–12090. [Google Scholar]
  37. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like Transformer for Efficient Semantic Segmentation of Remote Sensing Urban Scene Imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
  38. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-Like Pure Transformer for Medical Image Segmentation. In Proceedings of the Computer Vision—ECCV 2022 Workshops, Tel Aviv, Israel, 23–27 October 2022; Karlinsky, L., Michaeli, T., Nishino, K., Eds.; Springer Nature: Cham, Switzerland, 2023; pp. 205–218. [Google Scholar]
  39. Wu, H.; Huang, P.; Zhang, M.; Tang, W.; Yu, X. CMTFNet: CNN and Multiscale Transformer Fusion Network for Remote-Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–12. [Google Scholar] [CrossRef]
  40. Chen, J.; Mei, J.; Li, X.; Lu, Y.; Yu, Q.; Wei, Q.; Luo, X.; Xie, Y.; Adeli, E.; Wang, Y.; et al. TransUNet: Rethinking the U-Net Architecture Design for Medical Image Segmentation through the Lens of Transformers. Med. Image Anal. 2024, 97, 103280. [Google Scholar] [CrossRef]
  41. Zhang, Z.; Xu, Z.; Liu, C.; Tian, Q.; Zhou, Y. Cloudformer V2: Set Prior Prediction and Binary Mask Weighted Network for Cloud Detection. Mathematics 2022, 10, 2710. [Google Scholar] [CrossRef]
  42. Zhang, Z.; Tan, S.; Zhou, Y. CloudformerV3: Multi-Scale Adapter and Multi-Level Large Window Attention for Cloud Detection. Appl. Sci. 2023, 13, 12857. [Google Scholar] [CrossRef]
  43. Zhang, H.; Zhu, Y.; Wang, D.; Zhang, L.; Chen, T.; Wang, Z.; Ye, Z. A Survey on Visual Mamba. Appl. Sci. 2024, 14, 5683. [Google Scholar] [CrossRef]
  44. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model. arXiv 2024, arXiv:2401.09417. [Google Scholar] [CrossRef]
  45. Ma, J.; Li, F.; Wang, B. U-Mamba: Enhancing Long-Range Dependency for Biomedical Image Segmentation. arXiv 2024, arXiv:2401.04722. [Google Scholar]
  46. Xing, Z.; Ye, T.; Yang, Y.; Liu, G.; Zhu, L. SegMamba: Long-Range Sequential Modeling Mamba for 3D Medical Image Segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2024, Marrakesh, Morocco, 6–10 October 2024; Linguraru, M.G., Dou, Q., Feragen, A., Giannarou, S., Glocker, B., Lekadir, K., Schnabel, J.A., Eds.; Springer Nature: Cham, Switzerland, 2024; pp. 578–588. [Google Scholar]
  47. Ye, Z.; Chen, T.; Wang, F.; Zhang, H.; Zhang, L. P-Mamba: Marrying Perona Malik Diffusion with Mamba for Efficient Pediatric Echocardiographic Left Ventricular Segmentation. arXiv 2024, arXiv:2402.08506. [Google Scholar] [CrossRef]
  48. Liu, J.; Yang, H.; Zhou, H.-Y.; Xi, Y.; Yu, L.; Li, C.; Liang, Y.; Shi, G.; Yu, Y.; Zhang, S.; et al. Swin-UMamba: Mamba-Based UNet with ImageNet-Based Pretraining. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2024, Marrakesh, Morocco, 6–10 October 2024; Linguraru, M.G., Dou, Q., Feragen, A., Giannarou, S., Glocker, B., Lekadir, K., Schnabel, J.A., Eds.; Springer Nature: Cham, Switzerland, 2024; pp. 615–625. [Google Scholar]
  49. Yang, Y.; Xing, Z.; Yu, L.; Huang, C.; Fu, H.; Zhu, L. Vivim: A Video Vision Mamba for Medical Video Segmentation. arXiv 2024, arXiv:2401.14168. [Google Scholar] [CrossRef]
  50. Gu, A.; Goel, K.; Ré, C. Efficiently Modeling Long Sequences with Structured State Spaces. arXiv 2022, arXiv:2111.00396. [Google Scholar] [CrossRef]
  51. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  52. Zhang, G.; Gao, X. Lightweight and High-Precision Cloud Detection Method for Cloud-Snow Coexistence Areas in High-Resolution Remote Sensing Imagery. Acta Geod Cartogr. Sin. 2023, 52, 93. [Google Scholar]
  53. Ji, S.; Dai, P.; Lu, M.; Zhang, Y. Simultaneous Cloud Detection and Removal From Bitemporal Remote Sensing Images Using Cascade Convolutional Neural Networks. IEEE Trans. Geosci. Remote Sens. 2021, 59, 732–748. [Google Scholar] [CrossRef]
  54. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  55. Dev, S.; Nautiyal, A.; Lee, Y.H.; Winkler, S. CloudSegNet: A Deep Network for Nychthemeron Cloud Image Segmentation. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1814–1818. [Google Scholar] [CrossRef]
  56. Chen, K.; Liu, C.; Shi, Z. RSMamba: Remote Sensing Image Classification with State Space Model. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
Figure 1. The overview of HDAMNet.
Figure 2. The structure of HDAMamba Block.
Figure 3. (a) Illustration of HDCS. (b) Details of DCS.
Figure 4. The structure of MRAFE.
Figure 5. The structure of LAA, illustrating the Layer-wise Adaptive Attention (LAA) mechanism with stacked multi-level features, channel-wise attention weighting, and adaptive fusion through element-wise operations.
Figure 6. The samples of the HRC_WHU dataset. The first row displays the cropped images. The second row displays the corresponding labels.
Figure 7. The samples of the CloudS_M24 dataset. The first row displays the cropped images. The second row displays the corresponding labels.
Figure 8. The samples of the WHU Cloud dataset. The first row displays the cropped images. The second row displays the corresponding labels.
Figure 9. Visual comparisons of different cloud detection methods in the HRC_WHU dataset. White area represents cloud, black area represents non-cloud, green area represents false-positive detection, and red area represents false-negative detection. (a) Images. (b) UNetFormer. (c) VM-UNet. (d) Samba. (e) Mamba-UNet. (f) RS3Mamba. (g) HDAMNet. (h) Label.
Figure 10. Visual comparisons of different cloud detection methods in the CloudS_M24 dataset. White area represents cloud, black area represents non-cloud, green area represents false-positive detection, and red area represents false-negative detection. (a) Images. (b) TransUNet. (c) RS3Mamba. (d) UNetFormer. (e) VM-UNet. (f) Samba. (g) HDAMNet. (h) Label.
Figure 11. Visual comparisons of different cloud detection methods in the WHU Cloud dataset. White area represents cloud, black area represents non-cloud, green area represents false-positive detection, and red area represents false-negative detection. (a) Images. (b) CMTFNet. (c) CDnet. (d) Mamba-UNet. (e) TransUNet. (f) UNet. (g) HDAMNet. (h) Label.
Table 1. Ablation for different modules on HRC_WHU dataset (in %).

Method      | Jaccard | OA    | F1-Score
Variant-I   | 86.97   | 96.20 | 93.03
Variant-II  | 88.30   | 96.85 | 93.79
Variant-III | 89.49   | 97.22 | 94.45
Variant-IV  | 86.27   | 96.02 | 92.63
HDAMNet     | 91.12   | 97.42 | 95.36

Table 2. Evaluation results of different models on HRC_WHU dataset. The best result is bold, and the second-best result is underlined (in %).

Network Type      | Method      | Jaccard | Precision | Recall | Specificity | OA    | F1-Score
CNN-based         | UNet        | 86.97   | 96.89     | 89.47  | 98.87       | 96.20 | 93.03
                  | PSPNet      | 87.34   | 95.59     | 91.00  | 98.34       | 96.26 | 93.24
                  | CloudNet    | 87.50   | 97.18     | 89.77  | 98.97       | 96.37 | 93.33
                  | CDnet       | 87.51   | 95.88     | 90.93  | 98.45       | 96.32 | 93.34
                  | CloudSegNet | 83.95   | 93.17     | 89.46  | 97.41       | 95.16 | 91.27
Transformer-based | UNetFormer  | 87.76   | 96.07     | 91.03  | 98.53       | 96.41 | 93.48
                  | Swin-Unet   | 85.67   | 96.11     | 88.75  | 98.58       | 95.80 | 92.28
                  | CMTFNet     | 86.81   | 95.80     | 90.24  | 98.44       | 96.12 | 92.24
                  | TransUNet   | 83.35   | 95.32     | 87.45  | 98.30       | 95.23 | 91.22
Mamba-based       | Samba       | 89.11   | 94.94     | 93.54  | 98.03       | 96.76 | 94.24
                  | VM-UNet     | 88.98   | 94.76     | 93.58  | 97.96       | 96.72 | 94.17
                  | Mamba-UNet  | 89.83   | 96.41     | 92.93  | 98.63       | 97.02 | 94.64
                  | RS3Mamba    | 90.40   | 95.52     | 94.39  | 98.25       | 97.16 | 94.95
                  | HDAMNet     | 91.12   | 97.04     | 93.73  | 98.87       | 97.42 | 95.36

Table 3. Evaluation results of different models on CloudS_M24 dataset. The best result is bold, and the second-best result is underlined (in %).

Network Type      | Method      | Jaccard | Precision | Recall | Specificity | OA    | F1-Score
CNN-based         | UNet        | 74.65   | 86.24     | 84.74  | 97.76       | 95.92 | 85.47
                  | PSPNet      | 78.22   | 89.97     | 85.70  | 98.42       | 96.61 | 87.78
                  | CloudNet    | 78.98   | 87.98     | 88.52  | 98.00       | 96.66 | 88.25
                  | CDnet       | 75.90   | 89.20     | 83.58  | 98.33       | 96.23 | 86.30
                  | CloudSegNet | 69.27   | 84.48     | 79.37  | 97.59       | 95.00 | 81.84
Transformer-based | UNetFormer  | 81.36   | 88.71     | 90.75  | 98.09       | 97.05 | 89.72
                  | Swin-Unet   | 68.71   | 87.59     | 76.12  | 98.22       | 95.08 | 81.45
                  | CMTFNet     | 78.58   | 91.53     | 84.74  | 98.70       | 96.72 | 88.00
                  | TransUNet   | 80.68   | 91.95     | 86.82  | 98.74       | 97.05 | 89.31
Mamba-based       | Samba       | 81.99   | 90.80     | 89.42  | 98.50       | 97.21 | 90.10
                  | VM-UNet     | 81.88   | 91.15     | 88.95  | 98.57       | 97.21 | 90.04
                  | Mamba-UNet  | 71.37   | 91.37     | 76.53  | 98.80       | 95.64 | 83.29
                  | RS3Mamba    | 80.83   | 89.60     | 89.20  | 98.29       | 97.00 | 89.40
                  | HDAMNet     | 84.29   | 92.61     | 90.37  | 98.81       | 97.61 | 91.47

Table 4. Evaluation results of different models on WHU Cloud dataset. The best result is bold, and the second-best result is underlined (in %).

Network Type      | Method      | Jaccard | Precision | Recall | Specificity | OA    | F1-Score
CNN-based         | UNet        | 62.16   | 72.94     | 80.79  | 98.95       | 98.33 | 76.66
                  | PSPNet      | 35.89   | 80.19     | 39.37  | 99.66       | 97.60 | 52.83
                  | CloudNet    | 53.00   | 72.35     | 66.47  | 99.10       | 97.99 | 69.28
                  | CDnet       | 58.89   | 75.17     | 73.12  | 99.15       | 98.26 | 74.13
                  | CloudSegNet | 41.79   | 87.69     | 44.39  | 99.78       | 97.89 | 58.94
Transformer-based | UNetFormer  | 41.87   | 66.55     | 53.04  | 99.06       | 97.49 | 59.03
                  | Swin-Unet   | 57.85   | 82.13     | 66.18  | 99.49       | 98.36 | 73.30
                  | CMTFNet     | 58.41   | 65.98     | 83.58  | 98.48       | 97.97 | 73.74
                  | TransUNet   | 61.50   | 69.96     | 83.57  | 98.73       | 98.22 | 76.16
Mamba-based       | Samba       | 53.30   | 63.65     | 76.63  | 98.46       | 97.72 | 69.54
                  | Mamba-UNet  | 60.74   | 79.47     | 72.03  | 99.35       | 98.42 | 75.57
                  | RS3Mamba    | 52.02   | 57.22     | 85.13  | 97.76       | 97.33 | 68.44
                  | HDAMNet     | 63.01   | 74.11     | 80.78  | 99.01       | 98.38 | 77.30

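The metrics reported in Tables 2–4 follow their standard pixel-wise definitions. As a minimal NumPy sketch (not the authors' exact evaluation code), the hypothetical helper below shows how Jaccard, Precision, Recall, Specificity, OA, and F1-Score can be computed from a predicted binary cloud mask and its ground-truth label.

```python
# Standard pixel-wise metric definitions for binary cloud masks;
# values are in [0, 1] and can be multiplied by 100 to match the tables.
import numpy as np

def cloud_metrics(pred: np.ndarray, label: np.ndarray) -> dict:
    """pred, label: boolean arrays of the same shape (True = cloud).
    Assumes both classes are present, so no denominator is zero."""
    tp = np.logical_and(pred, label).sum()
    tn = np.logical_and(~pred, ~label).sum()
    fp = np.logical_and(pred, ~label).sum()
    fn = np.logical_and(~pred, label).sum()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "Jaccard":     tp / (tp + fp + fn),
        "Precision":   precision,
        "Recall":      recall,
        "Specificity": tn / (tn + fp),
        "OA":          (tp + tn) / (tp + tn + fp + fn),
        "F1-Score":    2 * precision * recall / (precision + recall),
    }
```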
Table 5. Comparison of model performance under different shifted window scale settings. The results demonstrate that combining multiple window scales improves segmentation accuracy. The best result is bold, and the second-best result is underlined (in %).

Scale Setting      | Jaccard | Precision | Recall | Specificity | OA    | F1-Score
{0.25, 0.25, 0.25} | 89.52   | 97.13     | 91.95  | 98.93       | 96.95 | 94.47
{0.25, 0.25, 0.50} | 90.82   | 96.58     | 93.84  | 98.69       | 97.32 | 95.19
{0.10, 0.20, 0.30} | 88.44   | 96.30     | 91.55  | 98.61       | 96.61 | 93.87
{0.10, 0.30, 0.50} | 88.64   | 97.06     | 91.09  | 98.91       | 96.70 | 93.98
{0.20, 0.40, 0.75} | 89.81   | 96.74     | 92.62  | 98.77       | 97.03 | 94.63
{0.25, 0.50, 0.75} | 90.11   | 96.14     | 93.50  | 98.52       | 97.10 | 94.80
{0.25, 0.50, 1.00} | 91.12   | 97.04     | 93.73  | 98.87       | 97.42 | 95.36

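The scale settings in Table 5 appear to act as fractional window sizes. Assuming each fraction is applied to the feature-map side length to obtain one window size per hierarchy level, the purely hypothetical helper below illustrates the arithmetic; it is not part of the actual HDCS implementation described earlier in the paper.

```python
# Hypothetical illustration: map fractional scale settings to per-level
# window sizes, clamped to the valid range [1, feature_side].
def window_sizes(feature_side: int, scales=(0.25, 0.50, 1.00)) -> list[int]:
    return [max(1, min(feature_side, round(s * feature_side))) for s in scales]

# e.g. a 64x64 feature map with scales {0.25, 0.50, 1.00} -> windows of 16, 32 and 64
print(window_sizes(64))   # [16, 32, 64]
```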
Table 6. Comparison of HDAMNet with a Transformer-based equivalent.

Method            | Inference Time | FLOPs     | Jaccard | OA     | F1-Score
Transformer-based | 0.4571 s       | 571.695 M | 80.74%  | 94.05% | 89.34%
Mamba-based       | 0.3182 s       | 197.15 M  | 91.12%  | 97.42% | 95.36%