Article

FDEN: Frequency-Band Decoupling Detail Enhancement Network for High-Fidelity Water Boundary Segmentation

1 Shaanxi Key Laboratory of Earth Surface System and Environmental Carrying Capacity, College of Urban and Environmental Science, Northwest University, Xi’an 710127, China
2 State Key Laboratory of Integrated Services Networks, School of Telecommunications Engineering, Xidian University, Xi’an 710071, China
3 Guizhou Provincial Key Laboratory of Geographic State Monitoring of Watershed, School of Geography and Resources, Guizhou Education University, Guiyang 550018, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(17), 3062; https://doi.org/10.3390/rs17173062
Submission received: 14 July 2025 / Revised: 21 August 2025 / Accepted: 1 September 2025 / Published: 3 September 2025

Abstract

Accurate extraction of water bodies in remote sensing images is crucial for natural disaster prediction, aquatic ecosystem monitoring, and resource management. However, most existing deep-learning-based methods primarily operate in the raw pixel space of images and fail to leverage the frequency characteristics of remote sensing images, resulting in an inability to fully exploit the representational power of deep models when predicting mask images. This paper proposes a Frequency-Band Decoupling Detail Enhancement Network (FDEN) to achieve high-precision water body extraction. The FDEN begins with an initial decoupling and enhancement stage for frequency information. Based on this multi-frequency representation, we further propose a Multi-Band Detail-Aware Module (MDAM), designed to adaptively enhance salient structural cues for water bodies across frequency bands while effectively suppressing irrelevant or noisy components. Extensive experiments demonstrate that the FDEN model outperforms state-of-the-art methods in terms of its segmentation accuracy and robustness.

1. Introduction

Benefiting from continuous advancements in remote sensing technology, increasingly detailed and higher-resolution input images can now be obtained in lake water body segmentation. The primary goal of this technology is to accurately distinguish water bodies from adjacent land, thereby monitoring the expansion and contraction of water bodies to provide critical information for flood prevention and disaster mitigation, water resource management, climate change analysis, etc. [1,2,3,4,5].
To achieve this goal, traditional methods primarily rely on spectral-index-based threshold segmentation (such as the Normalized Difference Water Index and the Modified Normalized Difference Water Index) and morphological operations. However, these methods often depend on single spectral features or brightness differences, making it difficult to extract multi-dimensional features in complex scenes, which limits their segmentation capability in highly diverse and intricate environments. In recent years, some advanced architectures based on convolutional neural networks (CNNs) have largely overcome the limitations of traditional segmentation methods, achieving significant improvements in segmentation accuracy [6,7,8,9,10,11]. Among them, Ronneberger et al. [12] proposed the U-Net architecture, featuring a symmetric encoder–decoder structure that combined high-level semantic information with low-level spatial details, effectively enhancing the detection of small objects and blurry boundaries. Since then, researchers have developed and refined the U-Net framework from various perspectives. With the integration of technologies such as the Pyramid Pooling Module (PPM), residual connections, and attention mechanisms [13,14,15], the U-Net framework has seen improvements in multi-scale feature extraction, feature transfer, feature selection enhancement, etc. Despite these advancements, CNN-based methods may still struggle to capture global contextual information due to the limitations of convolutional receptive fields. The self-attention mechanism in Vision Transformers (ViTs) [16] effectively addresses this shortcoming, and a series of ViT-based segmentation models rapidly emerged [17,18,19,20,21], enabling efficient processing of both the global context and local features and thereby further improving the overall segmentation accuracy. However, most of these existing methods primarily operate in the original pixel space of images, with limited exploration of the frequency-domain characteristics of remote sensing imagery. This constraint hinders their ability to fully leverage the representational power of deep learning models for segmentation tasks.
It is worth noting that some challenging remotely sensed images usually present rich frequency components, where high-frequency information (e.g., watershed boundaries and small tributaries) helps to refine the segmentation results, whereas medium- and low-frequency components (e.g., major bodies of water and cloud cover) provide essential structural context. Both play a crucial role in addressing challenges like low contrast and over-segmentation. However, two key issues persist: (1) the difficulty of designing an optimal convolutional operator capable of simultaneously adapting to the distinct characteristics of low-frequency smoothness and high-frequency details and (2) the need to enhance structural saliency across different frequency bands further.
For the first issue, due to the different characteristics of low-frequency smoothing and high-frequency details, we choose to decouple the frequency bands of the inputs and utilize a multilevel feature extraction network to deal with these components separately through a “divide-and-conquer” strategy. For the second issue, we aim to improve the structural saliency of different frequency bands, which requires adaptive enhancement of the structural characteristics of each band during feature extraction.
Based on these considerations, we design a frequency-band decoupling detail enhancement framework for water body segmentation. Our approach first uses wavelet transform components to replace the original image inputs, constructing a feature extraction network that focuses on both low- and high-frequency characteristics. Based on this, we introduce the Multi-Band Detail-Aware Module (MDAM), which achieves accurate enhancement of the water body detail features and interference suppression through a multi-scale structural feature analysis and a dynamic weight adjustment mechanism, as shown in Figure 1. Experiments demonstrate that this framework outperforms existing water body segmentation models in both its detail perception and segmentation accuracy.
In summary, the contributions of this work are as follows:
  • We propose a novel water body segmentation framework, the Frequency-Band Decoupling Detail Enhancement Network (FDEN), to synergistically optimize the high-frequency edge information and low-frequency texture features in the frequency domain space, thereby improving the segmentation accuracy for lake water bodies.
  • We propose a Multi-Band Detail-Aware Module (MDAM), which further senses multi-band features and performs adaptive enhancement, effectively mitigating the problems of boundary blurring and missed detections of fine water bodies in remotely sensed water body images.
  • Experimental comparisons with state-of-the-art methods show that the proposed FDEN outperforms previous water body segmentation methods, and extensive ablation experiments demonstrate the effectiveness of each of our contributions.
This article is organized as follows: Section 2 briefly reviews related work, Section 3 describes the proposed model, Section 4 presents the datasets, experimental settings, and results, and Section 5 concludes this paper.

2. Related Work

2.1. Lake Water Body Segmentation

(1) Traditional Methods: The traditional lake segmentation methods typically include threshold segmentation [22,23], water index methods [24], and support vector machines [25]. Otsu [26] proposed a method for automatically determining the optimal threshold value based on a grayscale histogram, which is widely applied in image segmentation. Xu [23] introduced the Modified NDWI, enhancing the contrast between water bodies and other features by incorporating a short-wave infrared band. Lv et al. [27] developed a technique for segmenting water bodies in SAR images by combining GLCM-based features with support vector machines. Despite their speed, the traditional methods often lack sufficient segmentation accuracy.
(2) Deep Learning Methods: Recent advancements in computer vision have led to the development of numerous convolutional neural network (CNN)-based approaches that improve the accuracy of lake water body segmentation. For instance, Zhong et al. [28] developed a Noise Reduction Transformer Network with an interference attenuation module to filter out non-lake features. Wang et al. [29] proposed the Multi-Scale Lake Water Body Extraction Network, which utilizes depth-separable convolution for high-level feature extraction and includes a multi-scale densely connected module to expand the receptive field. Xiang et al. [30] introduced a Dense Pyramid Pool Module to effectively capture global contextual information, including outliers, by densely scaling the distribution. Qi et al. [31] proposed an innovative end-to-end framework that combines a geodesic active contour model with a Vision Transformer architecture, enhancing the segmentation performance. While these CNN-based methods have demonstrated promise, they still encounter challenges with capturing fine details, especially around complex boundaries. Our work aims to address this issue by improving the perception of different frequency bands.

2.2. The Wavelet Transform

The wavelet transform has been widely used to improve boundary localization and suppress noise thanks to its multi-scale decomposition and capacity to emphasize high-frequency components. Niedermeier et al. [32] applied wavelets in conjunction with active contours to segment coastlines from SAR images. They initially detected boundaries with an edge detection algorithm and then refined the segmentation by selecting local edges in the coastal region and propagating them across multiple wavelet scales. Similarly, Jung et al. [33] developed a robust watershed segmentation method based on wavelet transforms, where redundant wavelet transforms were employed to denoise the image and enhance the edges at multiple resolutions, leading to improved image gradients.

2.3. The Hessian Matrix

The Hessian matrix is another valuable tool for enhancing the geometric features in images by analyzing the second-order derivative information, and it has been widely applied in medical imaging. Frangi et al. [34] developed a method for vascular structure enhancement using the Hessian matrix. The proposed filter leverages the eigenvalues of the Hessian matrix to measure the shape characteristics of local structures, adjusting the eigenvalue weights to effectively enhance linear features such as blood vessels. Building on this, Zhang et al. [35] proposed a technique that combined the Hessian matrix with the level set method. The eigenvalues of the Hessian matrix help identify the geometric features of regions of interest, allowing the curves to converge accurately to target boundaries. This approach significantly improves the robustness and precision of the level set method, especially in noisy or poorly defined images.

3. The Proposed Method

Although stacking deep networks with small kernels can extract diverse physical features from remote sensing images, this approach is severely limited by the local consistency and pixel-wise independent processing of individual convolutions in CNNs. In addition, human visual attention is distributed unevenly across frequency bands; for example, human vision is more likely to notice impulse stimuli at image boundaries than smooth structures. Therefore, for feature extraction in remotely sensed images, we propose processing each frequency component in a divide-and-conquer manner and regulating the structural salience of different frequency bands. This approach not only enhances the extraction of key structural features but also reduces the loss of important information in low-frequency components. By emphasizing high-frequency details, our method aligns better with human visual perception.
In this section, we focus on the multi-band detail-aware mechanism and its working principle in our method. In addition, we introduce the DWT process and the prediction module in the framework, which provide a solid foundation for the effective enhancement of image details.

3.1. An Overview of Our Method

The overall framework of the FDEN is shown in Figure 2. The design of the FDEN fully utilizes the ability of frequency domain information to guide the spatial features. Specifically, we first use Discrete Wavelet Transform (DWT) to transform the input into wavelet space to provide a global frequency-band benchmark. This process can be expressed as follows:
F_e = \mathcal{E}(x)
where $\mathcal{E}$ denotes the wavelet transformation that converts the input image $x \in \mathbb{R}^{h \times w \times 3}$ into $F_e$, which contains a series of wavelet components. Subsequently, $F_e$ is passed to the feature encoding stage to extract the deep features of the image.
The encoding process of the FDEN plays a crucial role in achieving adaptive enhancement of the detail features. We fully combine the ability of multi-scale Hessian filtering to extract the boundary detail features in different frequency bands with the dynamic feature selection advantage of the attention mechanism to construct the Multi-Band Detail-Aware Module (MDAM). In terms of the feature transmission paths, the encoding process of the FDEN can also be divided into the multi-band detail feature perception path and the encoded feature transmission path. Features of these two paths on the same scale interact with each other through the attention mechanism, thus enabling detail feature enhancement. The encoding process can be described as
(D_a^N, D_f^N) = \mathrm{MDAM}(D_a^{N-1}, D_f^{N-1})
where $D_f^{N-1}$ and $D_a^{N-1}$ represent the inputs of the current MDAM, and $D_f^N$ and $D_a^N$ represent its outputs; $\mathrm{MDAM}(\cdot)$ denotes the module's mapping function. When $N = 1$, $D_f^0 = D_a^0 = F_e$. It is worth noting that the MDAM operates at the 64 × 64 and 32 × 32 resolutions; at the 128 × 128 resolution, we replace DAA with a simple multiplication form to achieve lower FLOPs.
After the feature encoding stage, the multi-scale feature maps $\{D_f^0, D_f^1, \ldots, D_f^N\}$ obtained from the MDAM processing are fed into the prediction module to generate the final segmentation results:
\hat{y} = P(D_f^0, D_f^1, \ldots, D_f^N)
where $P$ represents the mapping function of the prediction module, while $\hat{y}$ denotes the final segmented result.
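To make the data flow of Equations (1)-(3) concrete, the following minimal PyTorch sketch wires the three stages together. The class and argument names are illustrative assumptions rather than the authors' released code; the DWT stem, MDAM stages, and prediction module are placeholders filled in by the components described in the rest of this section.

```python
import torch.nn as nn

class FDENSkeleton(nn.Module):
    """Illustrative skeleton of the FDEN pipeline (Eqs. (1)-(3))."""
    def __init__(self, dwt, mdams, predictor):
        super().__init__()
        self.dwt = dwt                     # x -> F_e, Eq. (1)
        self.mdams = nn.ModuleList(mdams)  # stacked MDAM stages, Eq. (2)
        self.predictor = predictor         # {D_f^0..D_f^N} -> mask, Eq. (3)

    def forward(self, x):
        f_e = self.dwt(x)                  # frequency-band benchmark F_e
        d_a = d_f = f_e                    # D_a^0 = D_f^0 = F_e
        feats = [d_f]
        for mdam in self.mdams:            # (D_a^N, D_f^N) = MDAM(D_a^{N-1}, D_f^{N-1})
            d_a, d_f = mdam(d_a, d_f)
            feats.append(d_f)
        return self.predictor(feats)       # y_hat = P(D_f^0, ..., D_f^N)
```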

3.2. Discrete Wavelet Transform

Discrete Wavelet Transform (DWT) has been widely applied to low-level vision tasks. Inspired by previous works [36,37,38], we utilize the 2-D DWT with the Haar wavelet. The Haar wavelet consists of the low-pass filter $L$ and the high-pass filter $H$, as follows:
L = \frac{1}{\sqrt{2}}[1, 1]^T, \quad H = \frac{1}{\sqrt{2}}[1, -1]^T
We can obtain four subbands, which can be expressed as
X_{LL}, X_{LH}, X_{HL}, X_{HH} = \mathrm{DWT}(x)
where $X_{LL}, X_{LH}, X_{HL}, X_{HH} \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times C}$ represent the low-frequency component and the high-frequency components in the vertical, horizontal, and diagonal directions, respectively. These subbands are then concatenated along the channel dimension and passed through a convolutional layer to produce the transformed feature representation:
F_e = f(X_{LL}, X_{HL}, X_{LH}, X_{HH})
where $f$ represents the combined mapping function of the concatenation operation and the convolutional layer. Although concatenation followed by convolution blends the frequency-band features, this process does not entirely eliminate the individual frequency characteristics of each band. It enables the model to preserve and propagate key frequency-specific information to a certain extent, allowing each frequency band's unique structural details to still influence the subsequent feature extraction and enhancement stages.
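As a concrete illustration, the following is a minimal sketch of this stem, implementing the single-level Haar DWT as a strided depthwise convolution and the mapping $f$ as concatenation followed by a convolution. The fusion layer's channel width (out_ch=32) and kernel size are our own illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def haar_dwt(x):
    """Single-level 2-D Haar DWT of a (B, C, H, W) tensor (Eqs. (4)-(5)).
    Returns X_LL, X_LH, X_HL, X_HH, each of shape (B, C, H/2, W/2)."""
    b, c, h, w = x.shape
    lo = torch.tensor([1.0, 1.0], device=x.device) / 2 ** 0.5   # low-pass L
    hi = torch.tensor([1.0, -1.0], device=x.device) / 2 ** 0.5  # high-pass H
    # Four 2x2 kernels built as outer products of the 1-D taps.
    k = torch.stack([torch.outer(lo, lo),   # LL: approximation
                     torch.outer(lo, hi),   # LH: horizontal detail
                     torch.outer(hi, lo),   # HL: vertical detail
                     torch.outer(hi, hi)])  # HH: diagonal detail
    k = k.unsqueeze(1).repeat(c, 1, 1, 1)   # depthwise weights: (4C, 1, 2, 2)
    out = F.conv2d(x, k, stride=2, groups=c).view(b, c, 4, h // 2, w // 2)
    return out.unbind(dim=2)                # X_LL, X_LH, X_HL, X_HH

class DWTStem(nn.Module):
    """Concatenate the subbands and fuse with a conv, standing in for f in Eq. (6)."""
    def __init__(self, in_ch=3, out_ch=32):
        super().__init__()
        self.fuse = nn.Conv2d(4 * in_ch, out_ch, 3, padding=1)

    def forward(self, x):
        return self.fuse(torch.cat(haar_dwt(x), dim=1))  # F_e
```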
As shown in Figure 1, we present visualized images of the four subbands after the wavelet transform. The $LL$ subband captures the coarse information and overall contours of the image, while the $LH$ and $HL$ subbands highlight the variations in the horizontal and vertical directions, respectively. The $HH$ subband emphasizes the fine details and subtle variations in the image. In traditional convolutional neural networks, input images are directly processed through convolutional layers, which may result in the loss of some high-frequency information, leading to blurred edges or a loss of detail. The wavelet transform explicitly separates the low-frequency and high-frequency components, enabling the network to better focus on different levels of detail and thereby improving its ability to represent fine-grained features.

3.3. The Multi-Band Detail-Aware Module

3.3.1. Multi-Scale Hessian Filtering

The Hessian matrix [39,40,41], a common criterion for describing the structural characteristics of images at a specific point, is calculated using the second partial derivatives of the image in the horizontal and vertical directions. The eigenvalues of the Hessian matrix correspond to the principal curvatures at a point, where larger eigenvalues are typically associated with prominent structural transitions such as edges or ridges, while smaller eigenvalues correspond to smoother or more homogeneous regions. The Hessian matrix is calculated as follows:
H = \begin{bmatrix} \frac{\partial^2 I}{\partial x^2} & \frac{\partial^2 I}{\partial x \partial y} \\ \frac{\partial^2 I}{\partial x \partial y} & \frac{\partial^2 I}{\partial y^2} \end{bmatrix}
where $I$ represents the input image, and $(x, y)$ indicates the coordinate position of a pixel in the image. The terms $\partial^2 I / \partial x^2$ and $\partial^2 I / \partial y^2$ denote the second-order partial derivatives of the image with respect to the $x$ and $y$ directions, respectively, and $\partial^2 I / \partial x \partial y$ represents the mixed partial derivative with respect to $x$ and $y$. To enhance the computational efficiency, convolution operations are employed to calculate the second-order derivatives of the image. Therefore, Equation (7) can be simplified as follows:
H = \begin{bmatrix} G_{hh}(I) & G_{hv}(I) \\ G_{hv}(I) & G_{vv}(I) \end{bmatrix}
where $G_{hh}(\cdot)$, $G_{vv}(\cdot)$, and $G_{hv}(\cdot)$ are gradient-based convolution operators.
With the advancement of deep learning and convolutional neural networks (CNNs), nearly all gradient-based feature detection operators can be efficiently implemented on GPUs using predefined convolutional kernels. A more intuitive representation of this process is illustrated in Figure 3. The Hessian matrix of $D_f$ depends on its second-order derivatives, which are available using some specific filters:
G_{hh} = \begin{bmatrix} 0 & 0 & 0 \\ 1 & -2 & 1 \\ 0 & 0 & 0 \end{bmatrix}, \quad G_{hv} = \begin{bmatrix} 1 & 0 & -1 \\ 0 & 0 & 0 \\ -1 & 0 & 1 \end{bmatrix}
Here, $G_{hh}$ corresponds to the second-order derivative in the horizontal direction, while the vertical second-order derivative $G_{vv}$ can be obtained by transposing $G_{hh}$, i.e., $G_{vv} = G_{hh}^T$.
The feature mapping obtained after filtering the image using Equation (9) effectively highlights the feature information of the source image, enhancing both edges and linear structures. Let $H_{ker}(\cdot)$ denote the Hessian filter, where $ker$ represents the size of the convolution kernel used in the convolution operator. The image processing performed by the $H_3(\cdot)$ filter can be expressed as follows:
H_3(I) = G_{hh}(I) + G_{vv}(I) + G_{hv}(I)
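As a sketch of this GPU-friendly implementation, the snippet below instantiates $G_{hh}$, $G_{vv}$, and $G_{hv}$ as fixed convolution weights; the exact signs and normalization follow the standard discrete second-derivative operators and are our assumption where the paper does not spell them out.

```python
import torch
import torch.nn.functional as F

# Fixed second-derivative kernels (standard discrete operators).
G_HH = torch.tensor([[0., 0., 0.],
                     [1., -2., 1.],
                     [0., 0., 0.]])      # second derivative along x
G_VV = G_HH.T                            # second derivative along y (transpose)
G_HV = torch.tensor([[1., 0., -1.],
                     [0., 0., 0.],
                     [-1., 0., 1.]])     # mixed derivative

def hessian_h3(feat):
    """H_3(I) = G_hh(I) + G_vv(I) + G_hv(I), applied depthwise (Eq. (10)).
    By linearity, summing the three responses equals convolving once
    with the summed kernel."""
    c = feat.shape[1]
    k = (G_HH + G_VV + G_HV).to(feat).view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    return F.conv2d(feat, k, padding=1, groups=c)
```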
In order to achieve a multifaceted perception of high-frequency boundary details and subject contour information, we constructed a multi-scale Hessian filter with attentional focusing. In addition to the $H_3(\cdot)$ filter, it contains $H_5(\cdot)$ and $H_7(\cdot)$ filters to extract features using convolution kernels of different sizes. The filter $H_5(\cdot)$ can be expressed as follows:
H_5(I) = G_{5,hh}(I) + G_{5,vv}(I) + G_{5,hv}(I)
G_{5,hh} = [1, 0, -2, 0, 1], \quad G_{5,vv} = [1, 0, -2, 0, 1]^T, \quad G_{5,hv} = [1, 0, -1]^T [1, 0, -1]
The filter $H_7(\cdot)$ can be expressed in a similar form. Hessian kernels of different sizes naturally form a bandpass filter bank. Specifically, smaller-scale filters focus on inferring the lower-order maximum eigenvalue of the Hessian and exploit the gradients of closer neighboring pixels to generate finer information (e.g., edges and textures). In contrast, larger-scale filters focus on inferring the higher-order maximum eigenvalue of the Hessian and generate coarser information (e.g., contours). Multi-scale Hessian filters extract the higher-order derivative responses of the image at different scales, which can be viewed as extracting geometric variation information with different frequency features from multiple spatial scales, thus enhancing the model's ability to perceive structural details. As shown in Figure 3, we introduce a learnable multi-scale Hessian filtering mechanism, which can be represented as follows:
D_a = \mathrm{Concat}(\beta_3 \bar{H}_3(D_f), \; \beta_5 \bar{H}_5(D_f), \; \beta_7 \bar{H}_7(D_f))
where $\bar{H}(D_f)$ represents the Hessian features after mean aggregation along the channel dimension. The learnable weight parameters $\beta_3$, $\beta_5$, and $\beta_7$ combine the multi-scale Hessian feature maps. These feature maps differ in their ability to represent fine details and coarse structure, so their weights reflect the importance of each feature map in capturing different levels of detail.
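The following sketch puts the pieces of Equation (12) together: channel-averaged Hessian responses at kernel sizes 3, 5, and 7, weighted by learnable parameters. The 7-tap second-derivative row and the full-window mixed-derivative kernel are our simplifying assumptions, made by analogy with the 3- and 5-tap forms above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleHessian(nn.Module):
    """Sketch of Eq. (12): D_a = Concat(beta_k * mean-aggregated H_k(D_f))."""
    TAPS = {3: [1., -2., 1.],
            5: [1., 0., -2., 0., 1.],
            7: [1., 0., 0., -2., 0., 0., 1.]}  # 7-tap row assumed by analogy

    def __init__(self):
        super().__init__()
        self.betas = nn.Parameter(torch.ones(3))  # beta_3, beta_5, beta_7

    @staticmethod
    def _response(x, ker):
        n, c = ker, x.shape[1]
        t = torch.tensor(MultiScaleHessian.TAPS[ker], dtype=x.dtype, device=x.device)
        g_hh = torch.zeros(n, n, dtype=x.dtype, device=x.device)
        g_hh[n // 2] = t                           # d^2/dx^2 row kernel
        fd = torch.zeros(n, dtype=x.dtype, device=x.device)
        fd[0], fd[-1] = 1., -1.                    # first-derivative taps
        g_hv = torch.outer(fd, fd)                 # mixed-derivative kernel
        k = (g_hh + g_hh.T + g_hv).view(1, 1, n, n).repeat(c, 1, 1, 1)
        return F.conv2d(x, k, padding=n // 2, groups=c)

    def forward(self, d_f):
        maps = [b * self._response(d_f, k).mean(1, keepdim=True)  # beta_k * H-bar_k
                for b, k in zip(self.betas, (3, 5, 7))]
        return torch.cat(maps, dim=1)              # D_a
```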
As shown in Figure 1, we present the scaled Hessian features. It can be observed that Hessian filters at different scales vary in their effectiveness in capturing details and contours. This multi-scale approach is the core of our “divide-and-conquer” strategy, enabling multi-band information to be processed at different granularities. Specifically, smaller-scale Hessian feature filters focus on fine-grained details, while larger-scale filters emphasize coarse-grained structures. By fusing these multi-scale feature maps with learnable weights, the resulting detail-aware fusion map effectively captures the boundary details of the water body. This approach ensures that the frequency-specific information at each level is preserved and enhanced, leading to more accurate and robust segmentation results.

3.3.2. Detail-Aware Attention

The specific structure of Detail-Aware Attention (DAA) is shown in Figure 2. The purpose of DAA is to facilitate the interaction of information between the multi-scale Hessian features and the encoded features, thus further enhancing the network's ability to perceive different detail information. We use $H_{in}$ and $F_{in}$ to denote the input features of DAA, which represent the detail features obtained by Hessian filtering and the encoded features after deep encoding, respectively. In DAA, we use a linear projection of $H_{in}$ to construct $Q$ and $K$ and a linear projection of $F_{in}$ to construct $V$ and $L$. Thus, the computation in DAA can be expressed as
F_{out} = \mathrm{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V + \mathrm{CA}(L) + F_{in}
where $d_k$ is the number of columns in matrix $Q$.
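A minimal sketch of DAA is given below, assuming single-head attention over flattened spatial tokens and a squeeze-and-excitation-style gate for the channel-attention term CA(L); both choices, and the reduction ratio, are our assumptions where the figure leaves the details open.

```python
import torch
import torch.nn as nn

class DetailAwareAttention(nn.Module):
    """Sketch of Eq. (13): F_out = Softmax(QK^T / sqrt(d_k)) V + CA(L) + F_in."""
    def __init__(self, dim):
        super().__init__()
        self.to_qk = nn.Linear(dim, 2 * dim)  # Q, K from Hessian features H_in
        self.to_vl = nn.Linear(dim, 2 * dim)  # V, L from encoded features F_in
        self.ca = nn.Sequential(              # channel attention (SE-style gate)
            nn.Linear(dim, dim // 4), nn.ReLU(inplace=True),
            nn.Linear(dim // 4, dim), nn.Sigmoid())

    def forward(self, h_in, f_in):
        # h_in, f_in: (B, N, C) token sequences over flattened H x W positions.
        q, k = self.to_qk(h_in).chunk(2, dim=-1)
        v, l = self.to_vl(f_in).chunk(2, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        ca_l = self.ca(l.mean(dim=1, keepdim=True)) * l  # gate broadcast over tokens
        return attn @ v + ca_l + f_in
```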
The MDAM combines multi-scale Hessian filtering and cross-attention mechanisms to fully leverage the complementary nature of local features and the global structure in image processing. The Hessian matrix captures structural changes in images (such as edges and textures) through second-order derivatives, and filters of different scales extract detailed and structural information from the image. This weighted fusion method preserves the unique characteristics of each frequency band while providing the network with a multi-level perspective.
From a mathematical perspective, the eigenvalues of the Hessian matrix reveal the curvature of the image in different directions, and filters of different scales exhibit a different performance in capturing these eigenvalues. Through weighted fusion, we effectively integrate multi-scale information while preserving the diversity and independence of frequency-band information. The cross-attention mechanism optimizes the fusion of the Hessian features and CNN deep features further, achieving effective information integration in the frequency dimension. Theoretically, this fusion of multi-scale information not only aligns with the principle of frequency-band separation but also significantly enhances the model’s ability to perceive complex structures, particularly in tasks involving water body boundaries and detail segmentation, thereby improving the model’s robustness and accuracy.

3.4. The Prediction Module

In the prediction module, we adopt a progressive upsampling strategy, gradually reconstructing low-resolution deep features to the original input size through multi-level feature fusion and sub-pixel convolution operations. Specifically, we first perform cross-scale feature fusion. The multi-scale features from each MDAM output are downsampled via average pooling and then fused with the deepest layer’s features through concatenation and convolution operations. The fused features are first processed through a convolution layer, where the number of feature map channels is expanded, followed by upsampling via PixelShuffle operations to gradually restore the feature map resolution. After upsampling, the features are connected to the output of the next-deepest MDAM layer via skip connections. Further convolution and upsampling operations continue to enhance the resolution and expressive power of the feature maps until they are restored to the size of the original input image. This design enables the model to learn rich feature representations while minimizing the loss of detail information during feature extraction and learning.
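One step of this reconstruction can be sketched as follows; the channel widths and the placement of the skip connection are illustrative assumptions consistent with the description above.

```python
import torch
import torch.nn as nn

class UpStep(nn.Module):
    """One progressive upsampling step: expand channels, PixelShuffle to
    double the resolution, then fuse with the next-shallower MDAM output."""
    def __init__(self, in_ch, skip_ch, out_ch, scale=2):
        super().__init__()
        # Expand channels so PixelShuffle can trade them for spatial resolution.
        self.expand = nn.Conv2d(in_ch, out_ch * scale ** 2, 3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)
        self.fuse = nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1)

    def forward(self, x, skip):
        x = self.shuffle(self.expand(x))               # (B, out_ch, 2H, 2W)
        return self.fuse(torch.cat([x, skip], dim=1))  # skip connection
```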

3.5. The Loss Function

To effectively address class imbalances and boundary ambiguity in water body segmentation, we adopt a hybrid loss function that combines Binary Cross-Entropy (BCE) Loss and Dice Loss:
L_{total} = 0.5 \times L_{BCE} + L_{Dice}
where
L_{BCE}(x, y) = \frac{1}{N} \sum_{i=1}^{N} \left( \log(1 + e^{-x_i}) + (1 - y_i) x_i \right)
L_{Dice} = 1 - \frac{1}{N} \sum_{n=1}^{N} \frac{2 \sum_{i=1}^{M} p_{ni} y_{ni} + \varepsilon}{\sum_{i=1}^{M} p_{ni} + \sum_{i=1}^{M} y_{ni} + \varepsilon}
This combined objective balances pixel-level accuracy and region-level consistency, enhancing the segmentation of small and ambiguous water body boundaries.
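A minimal sketch of this objective, using PyTorch's BCE-with-logits and a per-sample Dice term with smoothing constant ε, is given below; the 0.5 weighting follows Equation (14).

```python
import torch
import torch.nn.functional as F

def hybrid_loss(logits, target, eps=1e-6):
    """L_total = 0.5 * L_BCE + L_Dice for (B, 1, H, W) logits and {0,1} targets."""
    bce = F.binary_cross_entropy_with_logits(logits, target.float())
    p = torch.sigmoid(logits).flatten(1)   # per-sample flattening
    y = target.float().flatten(1)
    dice = 1 - ((2 * (p * y).sum(1) + eps) /
                (p.sum(1) + y.sum(1) + eps)).mean()
    return 0.5 * bce + dice
```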

4. The Experimental Procedures and Analysis

In this section, we first present the datasets utilized for the experiments, the evaluation metrics employed to assess the model’s performance, and the details of the optimization process. Next, we compare the proposed FDEN with several state-of-the-art segmentation methods on the datasets. Finally, we conduct a series of ablation experiments to validate the contributions of the various components of our proposed method and discuss its limitations.

4.1. Benchmarks

4.1.1. Datasets

In this experiment, we utilized four datasets for training and testing: the Tibetan Plateau Dataset (TPD), the Gaofen Image Dataset (GID), the Wuhan Dense Labeling Dataset (WHDLD), and the Dense Labeling Remote Sensing Dataset (DLRSD). Here, we randomly selected 70% from the three datasets, TPD, GID, and WHDLD, to construct the training set, and the remaining portion was used for testing. This split strikes a balance between training diversity and evaluation robustness. In contrast, the DLRSD is used exclusively for testing, serving as an independent benchmark to evaluate the model’s generalization capability on unseen domains. Below, we provide a detailed introduction to each of these datasets.
(i) The Tibetan Plateau Dataset (TPD): The TPD was developed by Wang et al. [29] at Lanzhou University. It contains a total of 6774 images, each sized 256 × 256 pixels, with a depth of 24 bits and a DPI of 96. All images in this dataset feature lakes located on the Tibetan Plateau, which are surrounded by deserts or Gobi areas. The TPD is specifically tailored to the segmentation of lake water bodies, and it is also the primary dataset we focus on. For this experiment, we used 4743 images for training and 2031 images for testing.
(ii) The Gaofen Image Dataset (GID): The GID [42] is a large-scale land cover dataset collected using the Gaofen-2 satellite, known for its extensive coverage, wide distribution, and high spatial resolution. As shown in Figure 4, the images from the GID contain a variety of elements such as houses, roads, rivers, and vegetation, featuring complex terrain and environmental information, which can improve the segmentation accuracy of the models for irregular edges and complex structural shapes. We synthesized the large-scale classification set and the fine land cover classification set and then sliced the images into 256 × 256 chunks, resulting in a total of 31,500 images. Of these, 22,050 images were allocated for training, and 9450 images were reserved for testing.
(iii) The Wuhan Dense Labeling Dataset (WHDLD): The WHDLD [43] is a densely labeled dataset suitable for multi-label tasks such as remote sensing image retrieval (RSIR) and semantic segmentation. The main feature of the images in this dataset is that the contrast between the water body and the surrounding vegetation is very low, which can effectively test the segmentation accuracy and robustness of different methods in low-contrast scenes. It contains 4940 RGB images, each measuring 256 × 256 pixels. In our study, we used 3458 of these images for training and 1482 images for testing.
(iv) The Dense Labeling Remote Sensing Dataset (DLRSD): The DLRSD is widely used for remote sensing image classification tasks [44] and primarily consists of high-resolution remote sensing images. The DLRSD comprises 21 classes, with 100 images per class. We use river images from this dataset as part of the test set. As shown in Figure 4, these river images exhibit both low contrast and intricate boundaries, making them ideal for evaluating the generalization capability of the model.

4.1.2. Evaluation

The metrics chosen for evaluating the model’s performance include the Mean Intersection over Union (mIoU), Recall, Precision, F1 score, and boundary F1 score (BF1). Each metric is defined as follows:
(i) The Mean Intersection over Union (mIoU): The mIoU measures the overlap between the predicted and ground truth regions:
mIoU = \frac{1}{N} \sum_{i=1}^{N} \frac{|P_i \cap G_i|}{|P_i \cup G_i|}
where $N$ is the number of categories, and $P_i$ and $G_i$ denote the predicted and ground truth regions for class $i$.
(ii) Recall: Recall is the ratio of correctly predicted positive pixels to all actual positives:
\mathrm{Recall} = \frac{TP}{TP + FN}
(iii) Precision: Precision is the ratio of correctly predicted positives to all predicted positives:
\mathrm{Precision} = \frac{TP}{TP + FP}
(iv) F1 Score: The F1 score is the harmonic mean of Precision and Recall:
F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
(v) Boundary F1 score (BF1): The BF1 score evaluates the accuracy of boundary segmentation by computing the F1 score between the predicted and ground truth boundaries. The boundaries are obtained by applying morphological dilation (radius = 5) followed by an exclusive OR operation with the original masks. The score is calculated as
BF1 = \frac{2 \times \mathrm{Precision}_{boundary} \times \mathrm{Recall}_{boundary}}{\mathrm{Precision}_{boundary} + \mathrm{Recall}_{boundary}}
where $\mathrm{Precision}_{boundary} = \frac{TP}{TP + FP}$ and $\mathrm{Recall}_{boundary} = \frac{TP}{TP + FN}$ are computed using boundary pixels only.
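A sketch of this computation is shown below: boundaries are taken as the exclusive OR of each mask with its radius-5 dilation, and precision and recall are then evaluated over boundary pixels. The square structuring element and exact-pixel matching (no tolerance) are our assumptions where the text does not specify them.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def boundary(mask, radius=5):
    """Boundary band: (dilated mask) XOR (mask)."""
    struct = np.ones((2 * radius + 1, 2 * radius + 1), dtype=bool)
    return binary_dilation(mask, structure=struct) ^ mask

def boundary_f1(pred, gt, radius=5):
    """BF1 over boundary pixels of binary masks `pred` and `gt`."""
    bp = boundary(pred.astype(bool), radius)
    bg = boundary(gt.astype(bool), radius)
    tp = np.logical_and(bp, bg).sum()
    prec = tp / max(bp.sum(), 1)
    rec = tp / max(bg.sum(), 1)
    return 2 * prec * rec / max(prec + rec, 1e-12)
```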
These metrics together provide a comprehensive assessment of the lake water segmentation model’s performance, ensuring that the model maintains high detection rates while also achieving accurate predictions.

4.1.3. Optimization Settings

Our experiments were conducted on an NVIDIA RTX 3090 GPU using Python 3.9.19 and PyTorch 2.4.0. All comparison methods were evaluated under the same experimental settings and used exactly the same training and testing sets. We employed Stochastic Gradient Descent (SGD) as the optimizer, keeping its hyperparameters at their default settings. Four worker threads were used during training, with a batch size of 16. The initial learning rate was set to 0.001 and followed a cosine annealing schedule with a minimum learning rate of 0.00001; the number of training epochs was set to 100. The momentum parameter was set to 0.9, and the weight decay parameter to 0.0001. The learning rate was decreased by a factor of 0.1 in each scheduling cycle. Additionally, all reported statistics are derived from the experimental results of this study; no numbers are copied from previous papers for comparison.
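For reference, the settings above correspond roughly to the following training setup; `model` and `train_loader` are placeholders, the loss is the hybrid objective sketched in Section 3.5, and this is a sketch of the described configuration rather than the authors' script.

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-5)   # cosine annealing down to 1e-5

for epoch in range(100):
    for images, masks in train_loader:    # batch size 16, 4 worker threads
        optimizer.zero_grad()
        loss = hybrid_loss(model(images), masks)  # hybrid loss, Section 3.5
        loss.backward()
        optimizer.step()
    scheduler.step()
```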

4.2. Comparison with State-of-the-Art Methods

To comprehensively demonstrate the advantages of our proposed method, we compared it with 16 mainstream algorithms widely used in semantic segmentation tasks: PSPNet [13], DeepLabV3 [45], MultiResUNet [46], ResUNet++ [47], CENet [48], UNet++ [14], ResUNet-a [49], LANet [50], MANet [51], TransUNet [21], DFSNet [52], DensePPMUNet-a [30], GAC-ViT [31], U-KAN [53], EoMT [54], and HGBT [55]. Among these algorithms, UNet++, ResUNet-a, MultiResUNet, ResUNet++, DensePPMUNet-a, and U-KAN are primarily based on the U-Net architecture, inheriting the classical encoder–decoder design. TransUNet, MANet, and GAC-ViT enhance the modeling capabilities by integrating Transformer structures, which are effective for capturing global contextual dependencies in complex scenes. In particular, EoMT achieves efficient image segmentation by simplifying the ViT architecture and introducing a novel mask annealing strategy. HGBT enhances BiFormer [56] by introducing a hypergraph convolution module, a dual pooling module, and a feature aggregation module, effectively improving the modeling of complex spatial relationships and multi-scale features in high-resolution remote sensing images. DeepLabV3, PSPNet, CENet, LANet, and DFSNet rely on multi-scale feature extraction, contextual information capture, and pyramid aggregation modules to enhance the perception of global features. Comparison results on four datasets are shown in Table 1, Table 2, Table 3, and Table 4, respectively. Note that our training set consists of 70% randomly selected data from three datasets, TPD, GID, and WHDLD. Therefore, in the first three test sets, each method can achieve a good performance. Since DLRSD is not involved in the construction of the training set, the performance of each method on the DLRSD test set can fully reflect the generalization performance of these methods.

4.2.1. Visualization Results

The segmentation results of various methods on the TPD are shown in Figure 5. From this figure, we observe that methods such as U-KAN, DensePPMUNet-a, TransUNet, and ResUNet-a successfully capture the overall contours of the lakes; however, their segmentation at boundary details with high-frequency characteristics is not as good as that of our method. For example, in the first row of images, the FDEN more completely extracts small lake water bodies. In the third row, the segmentation results of the FDEN are more precise when handling complex shapes or boundaries that exhibit intricate details. LANet and MANet show improved sensitivity to small features within water bodies but are therefore prone to introducing noise. This is most evident in the results in the second row, where LANet and MANet mis-segment the waves as land.
The segmentation results on the WHDLD are shown in Figure 6. Our method is more effective for segmentation in low-contrast situations. For example, small pieces of water bodies in the boxes in the third and fourth rows of images are accurately extracted by the FDEN. These details fully demonstrate the enhancement of the FDEN for high-frequency details in remote sensing images. In contrast, although ResUNet-a, UNet++, and CENet can recognize the general region, they are still not as good as our method in terms of the completeness of segmentation. DensePPMUNet-a, on the other hand, is able to extract the global context prior with a dense range of scales, and the segmentation is much more coherent, but the segmentation is not as effective for such small regions with low contrast. In addition, the segmentation results of U-KAN are closest to those of our method, but our method is more accurate at complex boundaries, as can be seen in the image in the fifth row.

4.2.2. Quantitative Results

As shown in Table 1, on the TPD, our method achieves an mIoU of 95.28%, a Recall of 97.48%, a Precision of 97.63%, and an F1 score of 97.47%. From the quantitative results, all methods showed high values on the TPD. This can be attributed to the fact that most of the images in the TPD show clear and well-defined boundaries between the lake and the land. Among the U-Net-based models, U-KAN, ResUNet-a, and DensePPMUNet-a achieved relatively good results. They fully incorporate multi-scale as well as contextual information to achieve a more advanced performance. Compared with U-KAN, our method improves the IoU by 2.57% and the F1 score by 1.40%. Among the transformer-based models, TransUNet combines the local feature extraction capability of CNNs with the contextual information extraction capability of a ViT to obtain equally good segmentation results. Compared to TransUNet, our method improves the IoU by 2.67% and the F1 score by 1.41%.
Table 2 presents the segmentation results on the GID. The images in the GID are characterized by rich colors and complex regional structures, which increase the difficulty of accurately segmenting the edges of the images. Among the various comparison methods, models that consistently preserve low-level information through residual connections and fuse multi-scale information via skip connections generally exhibit a better performance. This highlights the importance of integrating information from different levels, especially in low-contrast images, as demonstrated by models such as UNet++ and DensePPMUNet-a. Among the transformer-based methods, HGBT, combined with a hypergraph, achieves synergistic modeling of global and local features and shows a good performance. The FDEN is substantially improved compared to all comparative methods. Compared to the next best method, U-KAN, ours improves the IoU by 5.63% and the F1 score by 4.08%.
Table 3 shows the quantitative results of the comparison methods on the WHDLD. The image elements in the WHDLD are mostly buildings, roads, and vegetation, with morphological structures that are mostly regular polygons or linear shapes, making this dataset well suited to testing the detail fidelity of model segmentation at edges. Table 3 shows that UNet++, LANet, TransUNet, DensePPMUNet-a, GAC-ViT, and U-KAN achieve a good performance on the WHDLD. Notably, LANet does not achieve a satisfactory performance on the GID. This is highly related to the morphological structure of the elements in the image. LANet enhances the embedding of contextual information by introducing the Patch Attention Module (PAM) on the one hand and proposes the Attention Embedding Module (AEM) on the other hand for the semantic information of low-level features. Thus, it is able to show a better performance in segmenting locally regular shapes. Our method improves the IoU by 9.90% and the F1 score by 6.04% compared to those of LANet.
Unlike other datasets, the DLRSD is only used as a test set in this experiment, so it can reflect the generalization ability of the model well. The river images in the DLRSD exhibit low contrast between water bodies and the land. The quantitative results of our method and the comparison methods on the DLRSD are given in Table 4. U-KAN shows a better segmentation performance among the comparison methods. Our method improves the IoU by 4.96% and the F1 score by 6.04% compared to those of U-KAN.
In addition to the four metrics of the mIoU, Recall, Precision, and F1 score, we also evaluated the boundary segmentation accuracy of each method using the metric of the boundary F1 score. As shown in Table 5, the FDEN achieves a BF1 of 0.6513 on the TPD, which is an improvement of 3.46% over that of the suboptimal U-KAN. On the DLRSD, the FDEN exceeds U-KAN by 5.15%. These results consistently and robustly validate the excellent performance of the FDEN in the water body boundary extraction task.
To assess the FDEN’s robustness across various challenging scenarios, we provide visual comparisons with several other high-performing models in Figure 7. Both scenes are sourced from the GID. The results clearly demonstrate that the FDEN excels in handling complex scenes, achieving superior boundary segmentation completeness and preserving finer details. This highlights the model’s ability to maintain a high-quality segmentation performance even in challenging and varied environments.

4.3. Ablation Experiments

In this subsection, we first analyze the trade-off between model performance and the number of parameters. To demonstrate the model’s capability to extract edge details and small targets, we then conduct a quantitative analysis of the influence and contributions of the wavelet transform and the MDAM through a series of experiments. Finally, we investigate the effect of wavelet transforms at different depths and Hessian matrices with various operators on the model performance. Our analysis primarily focuses on the experimental results from the TPD and the WHDLD.

4.3.1. The Model Parameters and the Computational Overhead

Typically, an increase in the computational complexity of the model also results in performance gains. The objective of this study is to find the optimal balance by analyzing the impact of the number of Conv Blocks in the MDAM and the number of channels on both the network complexity and performance. We denote B and C as the number of blocks and channels, respectively, across multiple experiments. The results are summarized in Table 6. As B and C increase, the segmentation performance of the network improves, but this also leads to a rise in the number of parameters and the overall complexity. It is observed that the rate of performance improvement diminishes when B and C reach a certain threshold. Therefore, we decided to accept a slight reduction in the performance gain in exchange for a lower computational complexity. The final configuration chosen was B = 10 and C = 32.
As shown in Table 7, our method strikes a good balance between computational efficiency and performance. It outperforms several existing models in terms of the inference time, achieving a faster processing speed of 3.75 ms, while maintaining a competitive performance with 6.97 M parameters and 12.79 G FLOPs. This demonstrates the efficiency of our approach for water body extraction, making it suitable for practical, real-time applications.

4.3.2. The Effectiveness of the Wavelet Transform

To investigate the impact of the wavelet transform on the model’s performance, we conducted an experiment where the wavelet transform module was replaced with a conventional convolutional layer and a pooling layer. All of the other settings were kept unchanged, and the model was retrained. The quantitative results, shown in Table 8, indicate that after substituting the wavelet transform module, the mIoU dropped to 91.47%, with a Recall of 94.02%, a Precision of 97.03%, and an F1 score of 95.36%. Our model showed an improved performance in these metrics when the wavelet transform was included, highlighting its significant influence on the overall segmentation performance. We believe that the wavelet transform has such a large impact because it provides a frequency-band benchmark for the FDEN, which improves the multi-band detail perception further.

4.3.3. The Effectiveness of the MDAM

In our study of the effectiveness of the MDAM, we focus on the role of DAA, a key module in the MDAM. We evaluate the impact of DAA by removing it completely, i.e., eliminating the guiding role of multi-scale Hessian filtering for encoded features, and then retraining the model while keeping all other settings unchanged. The results, as presented in Table 8, show that the model’s performance metrics declined across the board after the removal of DAA. This quantitative analysis confirms that multi-scale Hessian filtering is crucial for enhancing the segmentation accuracy of the model.
To further assess the effectiveness of the scaled Hessian filter in capturing fine edge details, we conducted six experiments using different filters: $G_{3,hh}$, $G_{3,vv}$, $G_{3,hv}$, $H_3(\cdot)$, $H_5(\cdot)$, and $H_7(\cdot)$. The results, summarized in Table 9, indicate that the $H_{ker}(\cdot)$ filter improves the model performance more significantly when the kernel size $ker$ is smaller. Among the various configurations, our learnable multi-scale Hessian filtering achieved the best segmentation results, demonstrating its superior ability to enhance the model's edge detail extraction.
In addition, to better understand the adaptive mechanism of multi-scale Hessian filtering, we analyze the dynamic evolution of the three learnable weights ($\beta_3$, $\beta_5$, and $\beta_7$) during the training process. As shown in Figure 8, these weights gradually converge around the 90th epoch. Although the absolute differences between them are relatively small, the consistent ordering $\beta_3 > \beta_5 > \beta_7$ indicates a stable preference of the network toward smaller kernels. This phenomenon may be partly attributed to the high complementarity among different Hessian scales for water body segmentation, which prevents the model from assigning overwhelming dominance to any single kernel. In addition, the diversity of sample styles in the dataset and the dynamics of the training process (e.g., the gradual decrease in the learning rate) may also contribute to the relatively close values of $\beta$. Nevertheless, the persistent trend that favors smaller kernels suggests that fine-scale filtering may contribute more to capturing high-frequency boundary details, while larger kernels are still preserved to ensure regional smoothness and continuity. Importantly, none of the scales are completely suppressed, which we consider evidence that the network benefits from a balanced yet preferential utilization of multi-scale Hessian features.

4.3.4. Discussion on Limitations

Although our model achieved encouraging results, we identified some limitations in the dataset and the model during the experiments. As shown in Figure 9, we have listed some scenarios with segmentation errors, and the incorrectly annotated parts are enlarged in the figure. Specifically,
Dataset limitations: Figure 9a shows the segmentation errors caused by geographical names obscuring the ground truth; Figure 9b shows errors where the model segments water bodies that were not annotated in the ground truth. These errors highlight the need for continuous dataset optimization, including improving the annotation accuracy and refining the dataset to reflect real-world conditions better.
Method limitations: Figure 9c,d show the segmentation errors in our method at extremely rare complex boundaries. The red-boxed area in Figure 9c exhibits complex gradient changes, and some regions have colors very similar to those in areas of land, leading our method to incorrectly segment portions of water bodies as land. The boundary regions in Figure 9d primarily exhibit extremely low contrast, with some land areas having colors very similar to those of lakes, resulting in segmentation errors. Figure 9c,d not only demonstrate the FDEN’s strong ability to perceive boundary details but also indicate that the FDEN still has room for improvement when faced with such highly misleading scenes. We believe that improving the FDEN’s ability to capture contextual information, possibly through the integration of advanced mechanisms such as transformer-based blocks, could mitigate these challenges and further refine the model’s performance in highly ambiguous scenarios.

5. Conclusions

In this paper, a novel lake water body segmentation network is proposed. To enhance the structural saliency of boundary details in remotely sensed images, we integrate a wavelet transform, multi-scale Hessian filtering, and an attention mechanism within an encoder–decoder framework. The wavelet transform decomposes the image into low- and high-frequency components, while the multi-scale Hessian filtering enhances the detail information across frequencies using specific differential operators. A learnable weighting mechanism is introduced to adaptively balance the contributions of features at different scales. This design effectively addresses the limited detail perception in previous water body segmentation models, thereby improving the overall segmentation accuracy and preserving fine-scale structures better. Extensive experiments on multiple benchmark datasets demonstrate that our method consistently outperforms the existing state-of-the-art techniques in both quantitative and qualitative evaluations. By enhancing the precision of segmentation, especially in challenging scenarios with complex boundaries and small water bodies, our approach enables more accurate and robust data extraction for long-term lake monitoring, ultimately supporting more informed decision-making in water conservation and environmental management.

Author Contributions

Conceptualization: S.W. and N.W.; methodology: K.G. and S.W.; writing—original draft: K.G.; writing—review and editing: S.W., N.W. and L.T.; visualization: S.W. and K.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Second Tibetan Plateau Scientific Expedition and Research Program, grant number 2024QZKK0400, and the China Postdoctoral Science Foundation, grant number 2022M722570.

Data Availability Statement

The data used in this study are all from publicly available datasets, as cited in the manuscript. The Tibetan Plateau Dataset (TPD): Available from the National Cryosphere Desert Data Center at https://www.ncdc.ac.cn/portal/metadata/b4d9fb27-ec93-433d-893a-2689379a3fc0, accessed on 1 July 2025. The Gaofen Image Dataset (GID): Available at https://github.com/CAPTAIN-WHU/GID, accessed on 1 July 2025. The Wuhan Dense Labeling Dataset (WHDLD): Available at https://sites.google.com/view/zhouwx/dataset#h.p_hQS2jYeaFpV0, accessed on 1 July 2025. The Dense Labeling Remote Sensing Dataset (DLRSD): Available at https://sites.google.com/view/zhouwx/dataset#h.p_hQS2jYeaFpV0, accessed on 1 July 2025.

Acknowledgments

We thank the associate editor and the reviewers for their useful feedback that improved this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cai, F.; Tang, B.H.; Sima, O.; Chen, G.; Zhang, Z. Fine Extraction of Plateau Wetlands Based on a Combination of Object-Oriented Machine Learning and Ecological Rules: A Case Study of Dianchi Basin. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 5364–5377. [Google Scholar] [CrossRef]
  2. Liu, Q.; Tian, Y.; Zhang, L.; Chen, B. Urban Surface Water Mapping from VHR Images Based on Superpixel Segmentation and Target Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 5339–5356. [Google Scholar] [CrossRef]
  3. Wang, S.; Peppa, M.V.; Xiao, W.; Maharjan, S.B.; Joshi, S.P.; Mills, J.P. A second-order attention network for glacial lake segmentation from remotely sensed imagery. ISPRS J. Photogramm. Remote Sens. 2022, 189, 289–301. [Google Scholar] [CrossRef]
  4. El-Alem, A.; Chokmani, K.; Laurion, I.; El-Adlouni, S.E.; Raymond, S.; Ratté-Fortin, C. Ensemble-based systems to monitor algal bloom with remote sensing. IEEE Trans. Geosci. Remote Sens. 2019, 57, 7955–7971. [Google Scholar] [CrossRef]
  5. Liu, Y.; Li, J.; Xiao, C.; Zhang, F.; Wang, S.; Yin, Z.; Wang, C.; Zhang, B. A classification-based, semianalytical approach for estimating water clarity from a hyperspectral sensor onboard the ZY1-02D satellite. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4206714. [Google Scholar] [CrossRef]
  6. Ma, H.; Yang, X.; Fan, R.; Han, W.; He, K.; Wang, L. Refined Water-Body Types Mapping Using a Water-Scene Enhancement Deep Models by Fusing Optical and SAR Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 17430–17441. [Google Scholar] [CrossRef]
  7. Huang, B.; Li, P.; Lu, H.; Yin, J.; Li, Z.; Wang, H. WaterDetectionNet: A New Deep Learning Method for Flood Mapping With SAR Image Convolutional Neural Network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 14471–14485. [Google Scholar] [CrossRef]
  8. Song, Y.; Rui, X.; Li, J. AEDNet: An Attention-Based Encoder–Decoder Network for Urban Water Extraction from High Spatial Resolution Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 1286–1298. [Google Scholar] [CrossRef]
  9. Thayammal, S.; Jayaraghavi, R.; Priyadarsini, S.; Selvathi, D. Analysis of water body segmentation from Landsat imagery using deep neural network. Wirel. Pers. Commun. 2022, 123, 1265–1282. [Google Scholar] [CrossRef]
  10. Kang, J.; Guan, H.; Peng, D.; Chen, Z. Multi-scale context extractor network for water-body extraction from high-resolution optical remotely sensed images. Int. J. Appl. Earth Obs. Geoinf. 2021, 103, 102499. [Google Scholar] [CrossRef]
  11. Kang, J.; Guan, H.; Ma, L.; Wang, L.; Xu, Z.; Li, J. WaterFormer: A coupled transformer and CNN network for waterbody detection in optical remotely-sensed imagery. ISPRS J. Photogramm. Remote Sens. 2023, 206, 222–241. [Google Scholar] [CrossRef]
  12. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  13. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  14. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: Redesigning Skip Connections to Exploit Multiscale Features in Image Segmentation. IEEE Trans. Med. Imaging 2020, 39, 1856–1867. [Google Scholar] [CrossRef] [PubMed]
  15. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
  16. Dosovitskiy, A. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  17. Cheng, B.; Schwing, A.; Kirillov, A. Per-pixel classification is not all you need for semantic segmentation. Adv. Neural Inf. Process. Syst. 2021, 34, 17864–17875. [Google Scholar]
  18. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
  19. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 205–218. [Google Scholar]
  20. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  21. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar] [CrossRef]
  22. Lu, S.; Wu, B.; Yan, N.; Wang, H. Water body mapping method with HJ-1A/B satellite imagery. Int. J. Appl. Earth Obs. Geoinf. 2011, 13, 428–434. [Google Scholar] [CrossRef]
  23. Xu, H. Modification of normalised difference water index (NDWI) to enhance open water features in remotely sensed imagery. Int. J. Remote Sens. 2006, 27, 3025–3033. [Google Scholar] [CrossRef]
  24. McFeeters, S.K. The use of the Normalized Difference Water Index (NDWI) in the delineation of open water features. Int. J. Remote Sens. 1996, 17, 1425–1432. [Google Scholar] [CrossRef]
  25. Cherkassky, V. The nature of statistical learning theory. IEEE Trans. Neural Netw. 1997, 8, 1564. [Google Scholar] [CrossRef] [PubMed]
  26. Otsu, N. A threshold selection method from gray-level histograms. Automatica 1975, 11, 23–27. [Google Scholar] [CrossRef]
  27. Lv, W.; Yu, Q.; Yu, W. Water extraction in SAR images using GLCM and support vector machine. In Proceedings of the IEEE 10th International Conference on Signal Processing Proceedings, Beijing, China, 24–28 October 2010. [Google Scholar]
  28. Zhong, H.F.; Sun, Q.; Sun, H.M.; Jia, R.S. NT-Net: A semantic segmentation network for extracting lake water bodies from optical remote sensing images based on transformer. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5627513. [Google Scholar] [CrossRef]
  29. Wang, Z.; Gao, X.; Zhang, Y.; Zhao, G. MSLWENet: A novel deep learning network for lake water body extraction of Google remote sensing images. Remote Sens. 2020, 12, 4140. [Google Scholar] [CrossRef]
  30. Xiang, D.; Zhang, X.; Wu, W.; Liu, H. Denseppmunet-a: A robust deep learning network for segmenting water bodies from aerial images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4202611. [Google Scholar] [CrossRef]
  31. Qi, H.; Kong, X.; Cheng, L.; Hu, J.; Gu, J. Addressing Fine-Grained Lake Water Body Extraction: A Hybrid Approach Combining Vision Transformer and Geodesic Active Contour. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4204614. [Google Scholar] [CrossRef]
  32. Niedermeier, A.; Romaneessen, E.; Lehner, S. Detection of coastlines in SAR images using wavelet methods. IEEE Trans. Geosci. Remote Sens. 2000, 38, 2270–2281. [Google Scholar] [CrossRef]
  33. Jung, C.R.; Scharcanski, J. Robust watershed segmentation using wavelets. Image Vis. Comput. 2005, 23, 661–669. [Google Scholar] [CrossRef]
  34. Frangi, A.F.; Niessen, W.J.; Vincken, K.L.; Viergever, M.A. Multiscale vessel enhancement filtering. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI’98: First International Conference, Cambridge, MA, USA, 11–13 October 1998; Proceedings 1. Springer: Berlin/Heidelberg, Germany, 1998; pp. 130–137. [Google Scholar]
  35. Zhang, K.; Zhang, L.; Song, H.; Zhou, W. Active contours with selective local or global segmentation: A new formulation and level set method. Image Vis. Comput. 2010, 28, 668–676. [Google Scholar] [CrossRef]
  36. Guo, T.; Seyed Mousavi, H.; Huu Vu, T.; Monga, V. Deep wavelet prediction for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 104–113. [Google Scholar]
  37. Xin, J.; Li, J.; Jiang, X.; Wang, N.; Huang, H.; Gao, X. Wavelet-based dual recursive network for image super-resolution. IEEE Trans. Neural Netw. Learn. Syst. 2020, 33, 707–720. [Google Scholar] [CrossRef] [PubMed]
  38. Bozorgasl, Z.; Chen, H. Wav-kan: Wavelet kolmogorov-arnold networks. arXiv 2024, arXiv:2405.12832. [Google Scholar] [CrossRef]
  39. Zhao, C.; Wu, S.; Wu, J. Parameter-Adaptive NLM Denoising Algorithm Based on Hessian Matrix. In Proceedings of the 2021 4th International Conference on Intelligent Autonomous Systems (ICoIAS), Wuhan, China, 14–16 May 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 216–221. [Google Scholar]
  40. Alhussein, M.; Aurangzeb, K.; Haider, S.I. An unsupervised retinal vessel segmentation using hessian and intensity based approach. IEEE Access 2020, 8, 165056–165070. [Google Scholar] [CrossRef]
  41. Yang, L.; Gai, M.; Wang, T.; Xing, M. Elaborated-Structure Awareness SAR Imagery Using Hessian-Enhanced TV Regularization. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5216016. [Google Scholar] [CrossRef]
  42. Tong, X.Y.; Xia, G.S.; Lu, Q.; Shen, H.; Li, S.; You, S.; Zhang, L. Land-cover classification with high-resolution remote sensing images using transferable deep models. Remote Sens. Environ. 2020, 237, 111322. [Google Scholar] [CrossRef]
  43. Shao, Z.; Zhou, W.; Deng, X.; Zhang, M.; Cheng, Q. Multilabel remote sensing image retrieval based on fully convolutional network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 318–328. [Google Scholar] [CrossRef]
  44. Shao, Z.; Yang, K.; Zhou, W. Performance evaluation of single-label and multi-label remote sensing image retrieval using a dense labeling dataset. Remote Sens. 2018, 10, 964. [Google Scholar] [CrossRef]
  45. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  46. Ibtehaz, N.; Rahman, M.S. MultiResUNet: Rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Netw. 2020, 121, 74–87. [Google Scholar] [CrossRef] [PubMed]
  47. Jha, D.; Smedsrud, P.H.; Riegler, M.A.; Johansen, D.; De Lange, T.; Halvorsen, P.; Johansen, H.D. Resunet++: An advanced architecture for medical image segmentation. In Proceedings of the 2019 IEEE International Symposium on Multimedia (ISM), San Diego, CA, USA, 9–11 December 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 225–230. [Google Scholar]
  48. Gu, Z.; Cheng, J.; Fu, H.; Zhou, K.; Hao, H.; Zhao, Y.; Zhang, T.; Gao, S.; Liu, J. Ce-net: Context encoder network for 2d medical image segmentation. IEEE Trans. Med. Imaging 2019, 38, 2281–2292. [Google Scholar] [CrossRef] [PubMed]
  49. Diakogiannis, F.I.; Waldner, F.; Caccetta, P.; Wu, C. ResUNet-a: A deep learning framework for semantic segmentation of remotely sensed data. ISPRS J. Photogramm. Remote Sens. 2020, 162, 94–114. [Google Scholar] [CrossRef]
  50. Ding, L.; Tang, H.; Bruzzone, L. LANet: Local attention embedding to improve the semantic segmentation of remote sensing images. IEEE Trans. Geosci. Remote Sens. 2020, 59, 426–435. [Google Scholar] [CrossRef]
  51. Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Su, J.; Wang, L.; Atkinson, P.M. Multiattention network for semantic segmentation of fine-resolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5607713. [Google Scholar] [CrossRef]
  52. Ma, Z.; Xia, M.; Weng, L.; Lin, H. Local feature search network for building and water segmentation of remote sensing image. Sustainability 2023, 15, 3034. [Google Scholar] [CrossRef]
  53. Li, C.; Liu, X.; Li, W.; Wang, C.; Liu, H.; Liu, Y.; Chen, Z.; Yuan, Y. U-kan makes strong backbone for medical image segmentation and generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 4652–4660. [Google Scholar]
  54. Kerssies, T.; Cavagnero, N.; Hermans, A.; Norouzi, N.; Averta, G.; Leibe, B.; Dubbelman, G.; de Geus, D. Your ViT is Secretly an Image Segmentation Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025. [Google Scholar]
  55. Jing, W.; Zhang, W.; Di, D.; Li, C.; Emam, M.; Mian, A. Hypergraph BiFormer for Semantic Segmentation of High-Resolution Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4406915. [Google Scholar] [CrossRef]
  56. Zhu, L.; Wang, X.; Ke, Z.; Zhang, W.; Lau, R.W. Biformer: Vision transformer with bi-level routing attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10323–10333. [Google Scholar]
  57. Liao, D.; Sun, J.; Deng, Z.; Zhao, Y.; Zhang, J.; Ou, D. A Lightweight Network for Water Body Segmentation in Agricultural Remote Sensing Using Learnable Kalman Filters and Attention Mechanisms. Appl. Sci. 2025, 15, 6292. [Google Scholar] [CrossRef]
  58. Gao, F.; Fu, M.; Cao, J.; Dong, J.; Du, Q. Adaptive Frequency Enhancement Network for Remote Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5619415. [Google Scholar] [CrossRef]
Figure 1. Visualization of the wavelet transform and detail-aware fusion in the first MDAM. Here, LL, LH, HL, and HH represent the average maps of the four subband features from the wavelet transform, and $H_3(\cdot)$, $H_5(\cdot)$, and $H_7(\cdot)$ denote the average maps of the scaled Hessian features for $ker$ = 3, 5, and 7, respectively. The attention maps are derived from these Hessian features and effectively capture the edge information in the image.
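To make the decomposition in Figure 1 concrete, the sketch below shows a single-level 2D discrete wavelet transform producing the four subbands LL, LH, HL, and HH. The Haar basis, the stride-2 convolution formulation, and the function name `haar_dwt2d` are illustrative assumptions; the paper does not specify its wavelet filters in this section.

```python
import torch
import torch.nn.functional as F

def haar_dwt2d(x: torch.Tensor):
    """Single-level Haar DWT (an assumed basis). x: (B, C, H, W), H and W even.
    Returns the LL, LH, HL, HH subbands, each of shape (B, C, H/2, W/2)."""
    b, c, h, w = x.shape
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])      # low-pass in both directions
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])    # detail subband
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])    # detail subband
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])    # diagonal detail subband
    kernels = torch.stack([ll, lh, hl, hh]).unsqueeze(1)   # (4, 1, 2, 2)
    out = F.conv2d(x.reshape(b * c, 1, h, w), kernels, stride=2)
    out = out.reshape(b, c, 4, h // 2, w // 2)
    return out.unbind(dim=2)                                # LL, LH, HL, HH

ll, lh, hl, hh = haar_dwt2d(torch.randn(1, 3, 256, 256))
```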
Figure 2. The pipeline of the proposed FDEN. We replace the original input with a frequency-domain representation and then perform coding optimization with stacked MDAMs. The green arrows represent the multi-band detail feature perception path, which transmits detail features extracted using Hessian kernels of different sizes. The black arrows represent the encoded feature transmission path, which transmits the main encoded features processed through the conv block and detail refinement.
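The following is a structural sketch of the two-path design described in the Figure 2 caption, with B stacked blocks and C channels as in the best Table 6 setting (B = 10, C = 32). The block internals, the sigmoid-gated fusion, and all names (`MDAMSketch`, `detail_proj`) are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MDAMSketch(nn.Module):
    """A hypothetical stand-in for one MDAM: a conv block on the main encoded
    path, modulated by features arriving on the multi-band detail path."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv_block = nn.Sequential(              # main encoded-feature path
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.detail_proj = nn.Conv2d(channels, channels, 1)  # detail path (assumed)

    def forward(self, feat, detail):
        # Assumed fusion rule: detail features gate the encoded features.
        refined = self.conv_block(feat) + torch.sigmoid(self.detail_proj(detail)) * feat
        return refined, detail                        # both paths continue onward

blocks = nn.ModuleList(MDAMSketch(32) for _ in range(10))  # B = 10 stacked MDAMs
feat = detail = torch.randn(1, 32, 128, 128)
for blk in blocks:
    feat, detail = blk(feat, detail)
```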
Figure 3. Multi-scale Hessian filtering: $H_{ker}(\cdot)$ represents Hessian filters at different scales, where $ker$ indicates the kernel size of the convolution operator used by the Hessian filters. $\beta_3$, $\beta_5$, and $\beta_7$ are learnable weight parameters.
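A hedged sketch of the multi-scale fusion in Figure 3: per-scale responses $|H_{ker}(\cdot)|$ for $ker$ = 3, 5, 7 are combined with the learnable weights $\beta_3$, $\beta_5$, $\beta_7$. Using learnable depthwise convolutions in place of fixed second-derivative taps, and softmax-normalizing the $\beta$ weights, are simplifying assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleHessian(nn.Module):
    """Fuse per-scale filter responses with learnable beta weights."""
    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # One depthwise operator per scale stands in for the fixed Hessian
        # (second-derivative) taps; here it is left learnable for brevity.
        self.filters = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2,
                      groups=channels, bias=False)
            for k in kernel_sizes)
        self.betas = nn.Parameter(torch.ones(len(kernel_sizes)))  # beta_3/5/7

    def forward(self, x):
        responses = [f(x).abs() for f in self.filters]  # one response per scale
        w = torch.softmax(self.betas, dim=0)            # normalized scale weights
        return sum(wi * r for wi, r in zip(w, responses))

y = MultiScaleHessian(32)(torch.randn(1, 32, 128, 128))
```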
Figure 4. (a,b) Samples from the GID; (c,d) samples from the DLRSD.
Figure 5. Visualized segmentation results for several scenarios in the TPD: (a) original image; (b) ground truth; (c) our method; (d) U-KAN; (e) DensePPMUNet-a; (f) TransUNet; (g) LANet; (h) MANet; (i) ResUNet-a; (j) UNet++; (k) CENet. Some details are marked with red rectangles.
Figure 6. Visualized segmentation results for several scenarios in the WHDLD: (a) original image; (b) ground truth; (c) our method; (d) U-KAN; (e) DensePPMUNet-a; (f) TransUNet; (g) LANet; (h) MANet; (i) ResUNet-a; (j) UNet++; (k) CENet. Some details are marked with red rectangles.
Figure 7. Comparison of visual results in complex scenes between FDEN and several high-performance models. White regions indicate the output results.
Figure 8. The weight evolution curves of the multi-scale Hessian kernels ($\beta_3$: 3 × 3 kernels; $\beta_5$: 5 × 5 kernels; $\beta_7$: 7 × 7 kernels).
Figure 9. Analysis of failure cases. (a) Mis-segmentation caused by text in the image; (b) omissions in the annotation of water areas in the dataset; (c) errors in segmentation at areas with dramatic color changes; (d) incomplete segmentation at low-contrast boundaries. Prediction represents the output of FDEN. The segmentation error regions are highlighted in red boxes.
Table 1. Quantitative evaluation on the TPD.

| Methods | mIoU (%) | Recall (%) | Precision (%) | F1 (%) |
|---|---|---|---|---|
| PSPNet [13] | 90.84 | 94.97 | 95.37 | 95.04 |
| DeepLabV3 [45] | 89.49 | 94.60 | 94.23 | 94.34 |
| MultiResUNet [46] | 90.96 | 92.02 | 93.45 | 95.09 |
| ResUNet++ [47] | 90.36 | 92.51 | 95.13 | 94.56 |
| CENet [48] | 91.98 | 95.36 | 96.17 | 95.65 |
| UNet++ [14] | 89.22 | 91.73 | 96.89 | 94.13 |
| ResUNet-a [49] | 91.85 | 95.30 | 96.11 | 95.59 |
| LANet [50] | 90.92 | 93.42 | 97.05 | 95.07 |
| MANet [51] | 84.88 | 86.89 | 96.31 | 91.66 |
| TransUNet [21] | 92.61 | 95.95 | 96.40 | 96.06 |
| DSFNet [52] | 89.19 | 93.27 | 95.22 | 94.09 |
| DensePPMUNet-a [30] | 91.48 | 95.57 | 95.45 | 95.39 |
| GAC-ViT [31] | 92.67 | 94.95 | 96.38 | 95.91 |
| U-KAN [53] | 92.71 | 96.17 | 96.22 | 96.07 |
| EoMT [54] | 92.37 | 94.16 | 97.59 | 95.73 |
| HGBT [55] | 92.68 | 95.10 | 97.28 | 96.06 |
| LKF-DCANet [57] | 91.80 | 94.78 | 96.60 | 95.57 |
| AFENet [58] | 94.56 | 96.78 | 97.60 | 97.10 |
| Ours | 95.28 | 97.48 | 97.63 | 97.47 |
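For reference, a minimal sketch of how the four reported metrics can be computed for a binary water mask, assuming mIoU averages the water and background IoUs and that recall, precision, and F1 are computed for the water class; the paper's exact averaging protocol (e.g., per-image vs. dataset-wide) may differ.

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray):
    """pred, gt: boolean arrays of the same shape (True = water)."""
    tp = np.logical_and(pred, gt).sum()     # water predicted as water
    fp = np.logical_and(pred, ~gt).sum()    # background predicted as water
    fn = np.logical_and(~pred, gt).sum()    # water predicted as background
    tn = np.logical_and(~pred, ~gt).sum()   # background predicted as background
    iou_water = tp / (tp + fp + fn + 1e-9)
    iou_bg = tn / (tn + fp + fn + 1e-9)
    miou = (iou_water + iou_bg) / 2         # assumed two-class mean IoU
    recall = tp / (tp + fn + 1e-9)
    precision = tp / (tp + fp + 1e-9)
    f1 = 2 * precision * recall / (precision + recall + 1e-9)
    return {name: 100 * v for name, v in
            [("mIoU", miou), ("Recall", recall),
             ("Precision", precision), ("F1", f1)]}
```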
Table 2. Quantitative evaluation on the GID.

| Methods | mIoU (%) | Recall (%) | Precision (%) | F1 (%) |
|---|---|---|---|---|
| PSPNet [13] | 52.58 | 71.93 | 56.43 | 60.18 |
| DeepLabV3 [45] | 52.74 | 76.08 | 57.24 | 67.75 |
| MultiResUNet [46] | 53.06 | 72.38 | 56.70 | 60.74 |
| ResUNet++ [47] | 40.10 | 50.56 | 41.89 | 43.04 |
| CENet [48] | 42.43 | 74.40 | 44.26 | 52.53 |
| UNet++ [14] | 59.57 | 76.61 | 69.76 | 70.97 |
| ResUNet-a [49] | 56.27 | 70.89 | 61.73 | 63.53 |
| LANet [50] | 48.02 | 66.40 | 54.51 | 56.72 |
| MANet [51] | 41.22 | 71.29 | 48.16 | 55.31 |
| TransUNet [21] | 57.44 | 72.82 | 61.97 | 64.22 |
| DSFNet [52] | 48.73 | 69.20 | 53.81 | 57.55 |
| DensePPMUNet-a [30] | 58.39 | 67.48 | 65.89 | 64.25 |
| GAC-ViT [31] | 64.37 | 79.66 | 63.34 | 67.45 |
| U-KAN [53] | 65.29 | 76.87 | 68.01 | 69.96 |
| EoMT [54] | 57.15 | 73.65 | 60.10 | 63.69 |
| HGBT [55] | 61.42 | 73.25 | 65.35 | 66.89 |
| LKF-DCANet [57] | 55.32 | 72.61 | 59.04 | 62.43 |
| AFENet [58] | 68.86 | 79.10 | 71.44 | 73.39 |
| Ours | 70.92 | 80.16 | 71.72 | 74.04 |
Table 3. Quantitative evaluation on the WHDLD.

| Methods | mIoU (%) | Recall (%) | Precision (%) | F1 (%) |
|---|---|---|---|---|
| PSPNet [13] | 78.78 | 85.18 | 90.91 | 87.65 |
| DeepLabV3 [45] | 69.89 | 86.57 | 78.55 | 82.17 |
| MultiResUNet [46] | 78.89 | 89.64 | 86.46 | 87.75 |
| ResUNet++ [47] | 73.30 | 80.21 | 85.89 | 81.99 |
| CENet [48] | 78.62 | 84.47 | 91.29 | 87.47 |
| UNet++ [14] | 80.55 | 85.12 | 93.60 | 89.04 |
| ResUNet-a [49] | 77.32 | 86.51 | 87.33 | 86.66 |
| LANet [50] | 80.55 | 86.78 | 91.41 | 88.80 |
| MANet [51] | 64.10 | 71.61 | 85.51 | 77.71 |
| TransUNet [21] | 80.17 | 89.66 | 88.15 | 88.49 |
| DSFNet [52] | 66.67 | 69.69 | 92.78 | 78.94 |
| DensePPMUNet-a [30] | 80.51 | 83.77 | 92.89 | 88.69 |
| GAC-ViT [31] | 82.18 | 88.36 | 92.13 | 89.04 |
| U-KAN [53] | 85.43 | 90.41 | 93.54 | 91.75 |
| EoMT [54] | 61.13 | 79.35 | 71.80 | 74.62 |
| HGBT [55] | 77.99 | 82.93 | 92.52 | 87.07 |
| LKF-DCANet [57] | 80.27 | 85.75 | 92.24 | 88.60 |
| AFENet [58] | 88.41 | 94.02 | 93.83 | 93.86 |
| Ours | 90.45 | 94.40 | 95.39 | 94.84 |
Table 4. Quantitative evaluation on the DLRSD.

| Methods | mIoU (%) | Recall (%) | Precision (%) | F1 (%) |
|---|---|---|---|---|
| PSPNet [13] | 41.19 | 48.32 | 69.94 | 55.92 |
| DeepLabV3 [45] | 37.76 | 42.37 | 77.26 | 54.07 |
| MultiResUNet [46] | 37.54 | 45.05 | 56.99 | 52.24 |
| ResUNet++ [47] | 36.04 | 42.72 | 61.87 | 48.19 |
| CENet [48] | 42.64 | 48.16 | 75.96 | 57.09 |
| UNet++ [14] | 34.63 | 41.75 | 67.63 | 50.36 |
| ResUNet-a [49] | 37.10 | 50.82 | 53.71 | 50.96 |
| LANet [50] | 39.18 | 49.24 | 62.52 | 54.15 |
| MANet [51] | 31.55 | 32.97 | 74.62 | 45.06 |
| TransUNet [21] | 41.92 | 47.20 | 74.60 | 56.21 |
| DSFNet [52] | 42.45 | 46.50 | 75.11 | 55.43 |
| DensePPMUNet-a [30] | 40.41 | 47.00 | 72.52 | 55.17 |
| GAC-ViT [31] | 39.84 | 48.19 | 76.80 | 58.36 |
| U-KAN [53] | 42.72 | 45.38 | 82.19 | 57.35 |
| EoMT [54] | 41.47 | 48.15 | 73.89 | 56.39 |
| HGBT [55] | 40.60 | 42.61 | 79.93 | 54.61 |
| LKF-DCANet [57] | 42.48 | 44.61 | 86.52 | 95.57 |
| AFENet [58] | 45.60 | 50.67 | 81.86 | 61.58 |
| Ours | 47.68 | 51.02 | 88.22 | 63.39 |
Table 5. Boundary F1 scores for the FDEN and the other methods on the four experimental water body segmentation datasets.

| Methods | TPD | GID | WHDLD | DLRSD |
|---|---|---|---|---|
| DeepLabV3 | 0.4802 | 0.1602 | 0.2730 | 0.2122 |
| LANet | 0.5692 | 0.2623 | 0.3543 | 0.2225 |
| MANet | 0.3923 | 0.0730 | 0.1816 | 0.1097 |
| DSFNet | 0.5353 | 0.2043 | 0.3309 | 0.2478 |
| ResUNet-a | 0.5586 | 0.2279 | 0.3582 | 0.2199 |
| TransUNet | 0.5461 | 0.2252 | 0.3687 | 0.2925 |
| DensePPMUNet-a | 0.5453 | 0.2544 | 0.4325 | 0.2864 |
| HGBT | 0.5904 | 0.2372 | 0.3899 | 0.2656 |
| U-KAN | 0.6167 | 0.3062 | 0.6080 | 0.2831 |
| Ours | 0.6513 | 0.3440 | 0.6239 | 0.3346 |
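A hedged sketch of a boundary F1 computation consistent with Table 5: boundary pixels of the prediction and ground truth are matched within a small pixel tolerance via dilation. The 1-pixel boundary extraction, the 2-pixel tolerance, and the use of `scipy.ndimage` are assumptions; boundary F1 implementations vary in these choices.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def boundary_f1(pred: np.ndarray, gt: np.ndarray, tol: int = 2) -> float:
    """pred, gt: boolean masks (True = water); returns a BF score in [0, 1]."""
    pb = pred ^ binary_erosion(pred)     # 1-px boundary of the prediction
    gb = gt ^ binary_erosion(gt)         # 1-px boundary of the ground truth
    struct = np.ones((2 * tol + 1, 2 * tol + 1), bool)   # tolerance window
    precision = (pb & binary_dilation(gb, struct)).sum() / max(pb.sum(), 1)
    recall = (gb & binary_dilation(pb, struct)).sum() / max(gb.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)
```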
Table 6. Ablation studies on the parameters B and C on the TPD and the WHDLD, evaluated using the mIoU. We assume the input image size is 256 × 256. #Params and #FLOPs denote the number of model parameters and the number of floating-point operations, respectively.

| B | C | #Params | #FLOPs | TPD | WHDLD |
|---|---|---|---|---|---|
| 5 | 32 | 4.56 M | 8.22 G | 94.34% | 87.49% |
| 10 | 32 | 6.97 M | 12.79 G | 95.28% | 90.45% |
| 15 | 32 | 9.38 M | 17.37 G | 95.39% | 90.61% |
| 10 | 16 | 1.95 M | 4.28 G | 94.05% | 88.71% |
| 10 | 12 | 1.21 M | 3.01 G | 93.24% | 85.43% |
Table 7. A comparative analysis of the computational overhead for the FDEN and other methods, including the parameter counts, FLOPs, and inference speed for each method.

| Methods | Params (M) | FLOPs (G) | Inference Time (ms) |
|---|---|---|---|
| PSPNet | 46.71 | 14.84 | 2.02 |
| DeepLabV3 | 5.81 | 6.6 | 1.13 |
| MANet | 44.38 | 36.6 | 4.72 |
| TransUNet | 105.32 | 33.4 | 6.31 |
| DensePPMUNet-a | 2.87 | 63.4 | 29.61 |
| GAC-ViT | 90.24 | 835.8 | 42.83 |
| U-KAN | 25.36 | 6.91 | 4.95 |
| AFENet | 20.24 | 6.41 | 5.42 |
| Ours | 6.97 | 12.79 | 3.75 |
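A minimal sketch of how the parameter counts and inference times in Table 7 could be measured, assuming a 256 × 256 input (as in Table 6), a CUDA device, and warm-up iterations before timing; FLOPs are typically obtained with a separate profiler (e.g., fvcore or thop), which is omitted here. The function name and defaults are illustrative.

```python
import time
import torch

def measure_overhead(model: torch.nn.Module, device: str = "cuda",
                     size=(1, 3, 256, 256), iters: int = 100):
    """Return (parameter count in M, mean inference time in ms)."""
    model = model.to(device).eval()
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    x = torch.randn(*size, device=device)
    with torch.no_grad():
        for _ in range(10):                  # warm-up before timing
            model(x)
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()             # wait for all kernels to finish
        t1 = time.perf_counter()
    return params_m, 1000.0 * (t1 - t0) / iters
```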
Table 8. Ablation experiments evaluating the effectiveness of the wavelet transform (WT) and Detail-Aware Attention (DAA) on the Tibetan Plateau Dataset.

| Methods | mIoU (%) | Recall (%) | Precision (%) | F1 (%) |
|---|---|---|---|---|
| w/o WT | 91.47 | 94.02 | 97.03 | 95.36 |
| w/o DAA | 92.12 | 95.16 | 96.58 | 95.74 |
| Ours | 95.28 | 97.48 | 97.63 | 97.47 |
Table 9. Ablation experiments on the use of Hessian filters on the TPD and the WHDLD.

| Filter | TPD mIoU (%) | TPD F1 (%) | WHDLD mIoU (%) | WHDLD F1 (%) |
|---|---|---|---|---|
| $G_{3,hh}$ | 92.90 | 96.16 | 85.89 | 92.13 |
| $G_{3,vv}$ | 93.10 | 96.28 | 86.20 | 92.34 |
| $G_{3,hv}$ | 93.50 | 96.49 | 86.63 | 92.59 |
| $H_3(\cdot)$ | 94.18 | 96.88 | 88.77 | 93.84 |
| $H_5(\cdot)$ | 93.92 | 96.73 | 87.63 | 93.17 |
| $H_7(\cdot)$ | 93.58 | 96.53 | 86.79 | 92.68 |
| Ours | 95.28 | 97.47 | 90.45 | 94.84 |