1. Introduction
Benefiting from continuous advances in remote sensing technology, increasingly detailed, higher-resolution images are now available for lake water body segmentation. The primary goal of this task is to accurately distinguish water bodies from adjacent land and thereby monitor the expansion and contraction of water bodies, providing critical information for flood prevention and disaster mitigation, water resource management, climate change analysis, etc. [1,2,3,4,5].
To achieve this goal, traditional methods primarily rely on spectral-index-based threshold segmentation (such as the Normalized Difference Water Index and the Modified Normalized Difference Water Index) and morphological operations. However, these methods often depend on single spectral features or brightness differences, making it difficult to extract multi-dimensional features in complex scenes, which limits their segmentation capability in highly diverse and intricate environments. In recent years, advanced architectures based on convolutional neural networks (CNNs) have largely overcome the limitations of traditional segmentation methods, achieving significant improvements in segmentation accuracy [6,7,8,9,10,11]. Among them, Ronneberger et al. [12] proposed the U-Net architecture, featuring a symmetric encoder–decoder structure that combines high-level semantic information with low-level spatial details, effectively enhancing the detection of small objects and blurry boundaries. Since then, researchers have developed and refined the U-Net framework from various perspectives. With the integration of technologies such as the Pyramid Pooling Module (PPM), residual connections, and attention mechanisms [13,14,15], the U-Net framework has seen improvements in multi-scale feature extraction, feature transfer, feature selection, etc. Despite these advancements, CNN-based methods may still struggle to capture global contextual information due to the limited receptive fields of convolutions. The self-attention mechanism in Vision Transformers (ViTs) [16] effectively addresses this shortcoming, and a series of ViT-based segmentation models rapidly emerged [17,18,19,20,21], enabling efficient processing of both the global context and local features and thereby further improving the overall segmentation accuracy. However, most of these existing methods operate primarily in the original pixel space of images, with limited exploration of the frequency-domain characteristics of remote sensing imagery. This constraint hinders their ability to fully leverage the representational power of deep learning models for segmentation tasks.
It is worth noting that challenging remotely sensed images usually present rich frequency components, where high-frequency information (e.g., watershed boundaries and small tributaries) helps to refine the segmentation results, whereas medium- and low-frequency components (e.g., major bodies of water and cloud cover) provide essential structural context. Both play a crucial role in addressing challenges such as low contrast and over-segmentation. However, two key issues persist: (1) the difficulty of designing an optimal convolutional operator capable of simultaneously adapting to the distinct characteristics of low-frequency smoothness and high-frequency details and (2) the need to further enhance structural saliency across different frequency bands.
For the first issue, given the different characteristics of low-frequency smoothness and high-frequency details, we decouple the frequency bands of the inputs and utilize a multilevel feature extraction network to handle these components separately through a “divide-and-conquer” strategy. For the second issue, we aim to improve the structural saliency of different frequency bands, which requires adaptively enhancing the structural characteristics of each band during feature extraction.
Based on these considerations, we design a frequency-band decoupling detail enhancement framework for water body segmentation. Our approach first uses wavelet transform components to replace the original image inputs, constructing a feature extraction network that attends to both low- and high-frequency characteristics. On this basis, we introduce the Multi-Band Detail-Aware Module (MDAM), which achieves accurate enhancement of the water body detail features and interference suppression through multi-scale structural feature analysis and a dynamic weight adjustment mechanism, as shown in Figure 1. Experiments demonstrate that this framework outperforms existing water body segmentation models in both detail perception and segmentation accuracy.
In summary, the contributions of this work are as follows:
We propose a novel water body segmentation framework, the Frequency-Band Decoupling Detail Enhancement Network (FDEN), to synergistically optimize the high-frequency edge information and low-frequency texture features in the frequency domain space, thereby improving the segmentation accuracy for lake water bodies.
We propose a Multi-Band Detail-Aware Module (MDAM), which further senses multi-band features and performs adaptive enhancement, effectively mitigating the problems of boundary blurring and missed detections of fine water bodies in remotely sensed water body images.
Experimental comparisons with state-of-the-art methods show that the FDEN outperforms previous water body segmentation methods, and extensive ablation experiments demonstrate the effectiveness of each of our contributions.
This article is organized as follows: Section 2 briefly reviews related works, Section 3 describes the proposed model, Section 4 presents the datasets and experimental settings, and Section 5 concludes this paper.
3. The Proposed Method
Although stacking deep networks with small kernels can extract diverse physical features from remote sensing images, this approach is severely limited by the local consistency and pixel-wise independent processing of individual convolutions in CNNs. In addition, human vision allocates attention unevenly across frequency bands; for example, it is more likely to notice impulse stimuli at image boundaries than smooth structures. Therefore, for feature extraction from remotely sensed images, we propose processing each component in a divide-and-conquer manner and regulating the structural salience of different frequency bands. This approach not only enhances the extraction of key structural features but also reduces the loss of important information in the low-frequency components. By emphasizing high-frequency details, our method better aligns with human visual perception.
In this section, we focus on the multi-band detail-aware mechanism and its working principle in our method. We also introduce the DWT process and the prediction module in the framework, which together provide a solid foundation for the effective enhancement of image details.
3.1. An Overview of Our Method
The overall framework of the FDEN is shown in Figure 2. The design of the FDEN fully utilizes the ability of frequency domain information to guide the spatial features. Specifically, we first use the Discrete Wavelet Transform (DWT) to transform the input into the wavelet space to provide a global frequency-band benchmark. This process can be expressed as follows:

$$X_{w} = \mathbf{W} X,$$

where $\mathbf{W}$ denotes the transformation matrix that converts the input image $X$ into $X_{w}$, which contains a series of wavelet components. Subsequently, $X_{w}$ is passed to the feature encoding stage to extract the deep features of the image.
The encoding process of the FDEN plays a crucial role in achieving adaptive enhancement of the detail features. We fully combine the ability of multi-scale Hessian filtering to extract boundary detail features in different frequency bands with the dynamic feature selection advantage of the attention mechanism to construct the Multi-Band Detail-Aware Module (MDAM). In terms of the feature transmission paths, the encoding process of the FDEN can be divided into a multi-band detail feature perception path and an encoded feature transmission path. Features of these two paths at the same scale interact through the attention mechanism, thus enabling detail feature enhancement. The encoding process can be described as

$$(D_{N}, E_{N}) = \mathrm{MDAM}(D_{N-1}, E_{N-1}),$$

where $D_{N-1}$ and $E_{N-1}$ represent the inputs of the current MDAM, $D_{N}$ and $E_{N}$ represent its outputs, and $\mathrm{MDAM}(\cdot)$ serves as the mapping function of the module. When $N = 1$, $D_{0} = E_{0} = X_{w}$. It is worth noting that the MDAM operates at the 64 × 64 and 32 × 32 resolutions; at the 128 × 128 resolution, we replace DAA with a simple multiplication to achieve lower FLOPs.
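For clarity, the following is a minimal, hypothetical PyTorch sketch of this two-path recurrence; `FDENEncoder` and its stage interface are our own names, and the internals of each stage (Hessian filtering and DAA) are described in Section 3.3.

```python
import torch.nn as nn

class FDENEncoder(nn.Module):
    """Hypothetical sketch of the two-path encoding recurrence
    (D_N, E_N) = MDAM(D_{N-1}, E_{N-1})."""

    def __init__(self, stages):
        super().__init__()
        self.stages = nn.ModuleList(stages)   # each stage behaves like an MDAM

    def forward(self, x_w):
        d = e = x_w                           # N = 1: both paths start from the wavelet features
        encoded = []
        for stage in self.stages:
            d, e = stage(d, e)                # detail path guides the encoded path
            encoded.append(e)                 # multi-scale features for the prediction module
        return encoded
```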
After the feature encoding stage, the multi-scale feature maps $\{E_{1}, \dots, E_{N}\}$ obtained from the MDAM are fed into the prediction module to generate the final segmentation result:

$$Y = \mathcal{P}(E_{1}, \dots, E_{N}),$$

where $\mathcal{P}(\cdot)$ represents the mapping function of the prediction module, while $Y$ denotes the final segmented result.
3.2. Discrete Wavelet Transform
Discrete Wavelet Transform (DWT) has been widely applied to low-level vision tasks. Inspired by previous works [36,37,38], we utilize the 2-D DWT with the Haar wavelet. The Haar wavelet consists of the low-pass filter $L$ and the high-pass filter $H$, as follows:

$$L = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \end{bmatrix}, \qquad H = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & -1 \end{bmatrix}.$$

Applying these filters along the two spatial dimensions, we can obtain four subbands, which can be expressed as

$$X_{LL} = L X L^{T}, \quad X_{LH} = L X H^{T}, \quad X_{HL} = H X L^{T}, \quad X_{HH} = H X H^{T},$$

where $X_{LL}$, $X_{LH}$, $X_{HL}$, and $X_{HH}$ represent the low-frequency component and the high-frequency components in the vertical, horizontal, and diagonal directions, respectively. These subbands are then concatenated along the channel dimension and passed through a convolutional layer to produce the transformed feature representation:

$$X_{w} = f\big(\mathrm{Concat}(X_{LL}, X_{LH}, X_{HL}, X_{HH})\big),$$

where $f$ represents the combined mapping function of the concatenation operation and the convolutional layer. Although the concatenation followed by the convolution blends the frequency-band features, this process does not entirely eliminate the individual frequency characteristics of each band. It enables the model to preserve and propagate key frequency-specific information to a certain extent, allowing each frequency band's unique structural details to still influence the subsequent feature extraction and enhancement stages.
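To make the decomposition concrete, the following is a minimal PyTorch sketch of a single-level Haar DWT followed by the concatenation-and-convolution mapping $f$; the class and variable names are ours, and the 1 × 1 mixing convolution is one plausible choice for $f$ rather than the paper's exact layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HaarDWT(nn.Module):
    """Minimal 2-D Haar DWT sketch: splits the input into LL, LH, HL, HH
    subbands, concatenates them along channels, and mixes them with a conv."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Haar analysis filters: low-pass L = [1, 1]/sqrt(2), high-pass H = [1, -1]/sqrt(2).
        l = torch.tensor([1.0, 1.0]) / 2 ** 0.5
        h = torch.tensor([1.0, -1.0]) / 2 ** 0.5
        # Outer products give the four separable 2x2 analysis kernels.
        kernels = torch.stack([
            torch.outer(l, l),   # LL: low-frequency approximation
            torch.outer(l, h),   # LH
            torch.outer(h, l),   # HL
            torch.outer(h, h),   # HH: diagonal detail
        ]).unsqueeze(1)          # shape (4, 1, 2, 2)
        self.register_buffer("kernels", kernels)
        # The mapping f: concatenation is implicit in the channel layout; a
        # 1x1 convolution then mixes the subbands (an assumption on our part).
        self.mix = nn.Conv2d(4 * in_channels, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Apply the four kernels to every channel (depthwise, stride 2 halves resolution).
        w = self.kernels.repeat(c, 1, 1, 1)             # (4c, 1, 2, 2)
        subbands = F.conv2d(x, w, stride=2, groups=c)   # (b, 4c, H/2, W/2)
        return self.mix(subbands)

# Usage: HaarDWT(3, 64)(torch.randn(1, 3, 256, 256)) -> tensor of shape (1, 64, 128, 128)
```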
As shown in Figure 1, we present visualized images of the four subbands after the wavelet transform. The $X_{LL}$ subband captures the coarse information and overall contours of the image, while the $X_{LH}$ and $X_{HL}$ subbands highlight the variations in the vertical and horizontal directions, respectively. The $X_{HH}$ subband emphasizes the fine details and subtle variations in the image. In traditional convolutional neural networks, input images are processed directly by convolutional layers, which may discard some high-frequency information, leading to blurred edges or a loss of detail. The wavelet transform explicitly separates the low- and high-frequency components, enabling the network to better focus on different levels of detail and thereby improving its ability to represent fine-grained features.
3.3. The Multi-Band Detail-Aware Module
3.3.1. Multi-Scale Hessian Filtering
The Hessian matrix [39,40,41], a common criterion for describing the structural characteristics of an image at a specific point, is calculated from the second partial derivatives of the image in the horizontal and vertical directions. The eigenvalues of the Hessian matrix correspond to the principal curvatures at a point: larger eigenvalues are typically associated with prominent structural transitions such as edges or ridges, while smaller eigenvalues correspond to smoother, more homogeneous regions. The Hessian matrix is calculated as follows:

$$\mathbf{H}(x, y) = \begin{bmatrix} I_{xx}(x, y) & I_{xy}(x, y) \\ I_{xy}(x, y) & I_{yy}(x, y) \end{bmatrix},$$

where $I$ represents the input image, and $(x, y)$ indicates the coordinate position of the pixel. The terms $I_{xx}$ and $I_{yy}$ denote the second-order partial derivatives of the image pixels with respect to the $x$ and $y$ directions, respectively, and $I_{xy}$ represents the mixed partial derivative with respect to $x$ and $y$. To enhance the computational efficiency, the convolution operation is employed to calculate the second-order derivatives of the image pixels. Therefore, Equation (7) can be simplified as follows:

$$\mathbf{H}(x, y) = \begin{bmatrix} (D_{xx} * I)(x, y) & (D_{xy} * I)(x, y) \\ (D_{xy} * I)(x, y) & (D_{yy} * I)(x, y) \end{bmatrix},$$

where $D_{xx}$, $D_{yy}$, and $D_{xy}$ are gradient-based convolution operators and $*$ denotes convolution.
With the advancement of deep learning and convolutional neural networks (CNNs), nearly all gradient-based feature detection operators can be implemented efficiently on GPUs using predefined convolutional kernels. A more intuitive representation of this process is illustrated in Figure 3. The Hessian matrix of $I$ depends on its second-order derivatives, which can be obtained using specific filters:

$$D_{xx} = \begin{bmatrix} 1 & -2 & 1 \end{bmatrix},$$

where $D_{xx}$ corresponds to the second-order derivative in the horizontal direction, while the vertical second-order derivative $D_{yy}$ can be obtained by transposing $D_{xx}$, i.e., $D_{yy} = D_{xx}^{T}$.

The feature mapping obtained by filtering the image using Equation (9) effectively highlights the feature information of the source image, enhancing both edges and linear structures. Let $\mathrm{Hes}_{k \times k}$ denote the Hessian filter, where $k$ represents the size of the convolution kernel used in the convolution operator. The image processing performed by the $\mathrm{Hes}_{k \times k}$ filter can be expressed as follows:

$$F_{k} = \mathrm{Hes}_{k \times k}(I).$$
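As an illustration, here is a minimal PyTorch sketch of such a single-scale Hessian filter built from predefined convolution kernels. The widened stencils for kernel sizes above 3 and the use of the maximum eigenvalue as the filter response are our assumptions, not necessarily the paper's exact choices.

```python
import torch
import torch.nn.functional as F

def hessian_response(img: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Sketch of a Hessian filter of kernel size k on img of shape (B, C, H, W):
    estimates second-order derivatives by convolution and returns the larger
    eigenvalue of the 2x2 Hessian per pixel (large values mark edges/ridges)."""
    c = img.shape[1]
    # Second-derivative stencil along x: [1, -2, 1] centered in a length-k kernel
    # (zero-padding the 3-tap stencil is one plausible way to widen the support).
    d2 = torch.zeros(k)
    m = k // 2
    d2[m - 1], d2[m], d2[m + 1] = 1.0, -2.0, 1.0
    dxx = d2.view(1, 1, 1, k).repeat(c, 1, 1, 1)
    dyy = dxx.transpose(2, 3)                      # D_yy = D_xx^T
    d1 = torch.zeros(k)
    d1[0], d1[-1] = 0.5, -0.5                      # first-derivative stencil
    dxy = torch.outer(d1, d1).view(1, 1, k, k).repeat(c, 1, 1, 1)

    pad = k // 2
    ixx = F.conv2d(img, dxx.to(img), padding=(0, pad), groups=c)
    iyy = F.conv2d(img, dyy.to(img), padding=(pad, 0), groups=c)
    ixy = F.conv2d(img, dxy.to(img), padding=pad, groups=c)

    # Eigenvalues of [[ixx, ixy], [ixy, iyy]]: mean +- sqrt((diff/2)^2 + ixy^2).
    mean = 0.5 * (ixx + iyy)
    delta = torch.sqrt(((ixx - iyy) * 0.5) ** 2 + ixy ** 2)
    return mean + delta                            # maximum-eigenvalue map
```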
In order to achieve a multifaceted perception of the high-frequency boundary details and subject contour information, we constructed a multi-scale Hessian filter containing attentional focusing. It contains $\mathrm{Hes}_{5 \times 5}$ and $\mathrm{Hes}_{7 \times 7}$ filters in addition to the $\mathrm{Hes}_{3 \times 3}$ filter to extract features using convolution kernels of different sizes. The filter $\mathrm{Hes}_{5 \times 5}$ can be expressed as follows:

$$F_{5} = \mathrm{Hes}_{5 \times 5}(I);$$

the filter $\mathrm{Hes}_{7 \times 7}$ can also be expressed in a similar form. Hessian kernels of different sizes naturally form a bandpass filter bank. Specifically, smaller-scale filters attend to the lower-order maximum eigenvalues of the Hessian and exploit the gradients of nearby pixels to generate finer information (e.g., edges and textures). In contrast, larger-scale filters attend to the higher-order maximum eigenvalues of the Hessian and generate coarser information (e.g., contours). Multi-scale Hessian filters extract the higher-order derivative responses of the image at different scales, which can be viewed as extracting geometric variation information with different frequency features from multiple spatial scales, thus enhancing the model's ability to perceive structural details. As shown in Figure 3, we introduce a learnable multi-scale Hessian filtering mechanism, which can be represented as follows:

$$F_{H} = \lambda_{1} \bar{F}_{3} + \lambda_{2} \bar{F}_{5} + \lambda_{3} \bar{F}_{7},$$

where $\bar{F}_{k}$ represents the Hessian features after mean aggregation along the channel dimension, and $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are learnable weight parameters that combine the multi-scale Hessian feature maps. These feature maps differ in their ability to represent fine details and coarse structure, so their weights reflect the importance of each feature map in capturing different levels of detail.
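Below is a minimal sketch of this learnable fusion, reusing the `hessian_response` function from the sketch above; the kernel sizes (3, 5, 7) and the uniform weight initialization are assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleHessian(nn.Module):
    """Sketch of learnable multi-scale Hessian fusion: fixed-scale Hessian
    responses are mean-aggregated over channels and combined with learnable
    scalar weights (lambda_1, lambda_2, lambda_3)."""

    def __init__(self, sizes=(3, 5, 7)):
        super().__init__()
        self.sizes = sizes
        self.weights = nn.Parameter(torch.ones(len(sizes)) / len(sizes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        maps = []
        for k in self.sizes:
            h = hessian_response(x, k)                 # per-channel response
            maps.append(h.mean(dim=1, keepdim=True))   # mean over channels
        stacked = torch.stack(maps, dim=0)             # (S, B, 1, H, W)
        w = self.weights.view(-1, 1, 1, 1, 1)
        return (w * stacked).sum(dim=0)                # detail-aware fusion map
```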
As shown in Figure 1, we present the scaled Hessian features. It can be observed that Hessian filters at different scales vary in their effectiveness in capturing details and contours. This multi-scale approach is the core of our “divide-and-conquer” strategy, enabling multi-band information to be processed at different granularities. Specifically, smaller-scale Hessian filters focus on fine-grained details, while larger-scale filters emphasize coarse-grained structures. By fusing these multi-scale feature maps with learnable weights, the resulting detail-aware fusion map effectively captures the boundary details of the water body. This approach ensures that the frequency-specific information at each level is preserved and enhanced, leading to more accurate and robust segmentation results.
3.3.2. Detail-Aware Attention
The specific structure of Detail-Aware Attention (DAA) is shown in Figure 2. The purpose of DAA is to facilitate the interaction between the multi-scale Hessian features and the encoded features, further enhancing the network's ability to perceive different kinds of detail information. We use $F_{D}$ and $F_{E}$ to denote the input features of DAA, which represent the detail features obtained by Hessian filtering and the encoded features after deep encoding, respectively. In DAA, we use linear projections of $F_{D}$ to construct $Q$ and $K$ and linear projections of $F_{E}$ to construct $V$ and $L$. Thus, the computation in DAA can be expressed as

$$\mathrm{DAA}(F_{D}, F_{E}) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V + L,$$

where $d_{k}$ is the number of columns in matrix $Q$.
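The following is a hedged PyTorch sketch of this cross-attention computation; treating $L$ as an additive linear branch on the encoded features is our reading of the formula, and the projection names are ours.

```python
import torch
import torch.nn as nn

class DetailAwareAttention(nn.Module):
    """Sketch of DAA as cross-attention: queries/keys from the Hessian detail
    features, values (and the residual branch L) from the encoded features."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.proj_l = nn.Linear(dim, dim)   # assumed additive linear branch L
        self.scale = dim ** -0.5            # 1 / sqrt(d_k)

    def forward(self, detail: torch.Tensor, encoded: torch.Tensor) -> torch.Tensor:
        # detail, encoded: (batch, tokens, dim), e.g. flattened H*W spatial tokens
        q, k = self.to_q(detail), self.to_k(detail)
        v, l = self.to_v(encoded), self.proj_l(encoded)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v + l                 # detail-guided enhancement of encoded features
```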
The MDAM combines multi-scale Hessian filtering and cross-attention mechanisms to fully leverage the complementary nature of local features and the global structure in image processing. The Hessian matrix captures structural changes in images (such as edges and textures) through second-order derivatives, and filters of different scales extract detailed and structural information from the image. This weighted fusion method preserves the unique characteristics of each frequency band while providing the network with a multi-level perspective.
From a mathematical perspective, the eigenvalues of the Hessian matrix reveal the curvature of the image in different directions, and filters of different scales differ in how well they capture these eigenvalues. Through weighted fusion, we effectively integrate multi-scale information while preserving the diversity and independence of the frequency-band information. The cross-attention mechanism further optimizes the fusion of the Hessian features and the CNN deep features, achieving effective information integration in the frequency dimension. Theoretically, this fusion of multi-scale information not only aligns with the principle of frequency-band separation but also significantly enhances the model's ability to perceive complex structures, particularly in tasks involving water body boundaries and detail segmentation, thereby improving the model's robustness and accuracy.
3.4. The Prediction Module
In the prediction module, we adopt a progressive upsampling strategy, gradually reconstructing low-resolution deep features to the original input size through multi-level feature fusion and sub-pixel convolution operations. Specifically, we first perform cross-scale feature fusion. The multi-scale features from each MDAM output are downsampled via average pooling and then fused with the deepest layer’s features through concatenation and convolution operations. The fused features are first processed through a convolution layer, where the number of feature map channels is expanded, followed by upsampling via PixelShuffle operations to gradually restore the feature map resolution. After upsampling, the features are connected to the output of the next-deepest MDAM layer via skip connections. Further convolution and upsampling operations continue to enhance the resolution and expressive power of the feature maps until they are restored to the size of the original input image. This design enables the model to learn rich feature representations while minimizing the loss of detail information during feature extraction and learning.
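To illustrate one step of this strategy, below is a minimal PyTorch sketch of a sub-pixel upsampling block with a skip connection; the channel sizes and block composition are assumptions rather than the exact FDEN decoder.

```python
import torch
import torch.nn as nn

class UpsampleBlock(nn.Module):
    """Sketch of one decoder step: expand channels with a convolution, double
    the resolution with PixelShuffle, then fuse the skip feature from the
    next-shallower MDAM stage."""

    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.expand = nn.Conv2d(in_ch, out_ch * 4, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(2)   # (out_ch*4, h, w) -> (out_ch, 2h, 2w)
        self.fuse = nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        x = self.shuffle(self.expand(x))    # sub-pixel convolution upsampling
        x = torch.cat([x, skip], dim=1)     # skip connection from the encoder
        return self.fuse(x)
```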
3.5. The Loss Function
To effectively address class imbalances and boundary ambiguity in water body segmentation, we adopt a hybrid loss function that combines Binary Cross-Entropy (BCE) Loss and Dice Loss:

$$\mathcal{L} = \mathcal{L}_{\mathrm{BCE}} + \mathcal{L}_{\mathrm{Dice}},$$

where

$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N_{p}}\sum_{i=1}^{N_{p}}\left[ y_{i} \log \hat{y}_{i} + (1 - y_{i})\log(1 - \hat{y}_{i}) \right], \qquad
\mathcal{L}_{\mathrm{Dice}} = 1 - \frac{2\sum_{i} y_{i} \hat{y}_{i} + \epsilon}{\sum_{i} y_{i} + \sum_{i} \hat{y}_{i} + \epsilon},$$

with $y_{i}$ the ground truth label of pixel $i$, $\hat{y}_{i}$ the predicted probability, $N_{p}$ the number of pixels, and $\epsilon$ a smoothing constant.
This combined objective balances pixel-level accuracy and region-level consistency, enhancing the segmentation of small and ambiguous water body boundaries.
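A minimal PyTorch sketch of this hybrid objective, assuming equal weighting of the two terms and a standard smoothed Dice formulation:

```python
import torch
import torch.nn as nn

class HybridLoss(nn.Module):
    """Sketch of L = L_BCE + L_Dice (equal weighting assumed)."""

    def __init__(self, smooth: float = 1.0):
        super().__init__()
        self.bce = nn.BCEWithLogitsLoss()
        self.smooth = smooth

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # logits, target: (B, 1, H, W); target contains 0/1 water-body masks
        bce = self.bce(logits, target)
        prob = torch.sigmoid(logits)
        inter = (prob * target).sum(dim=(1, 2, 3))
        union = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
        dice = 1 - (2 * inter + self.smooth) / (union + self.smooth)
        return bce + dice.mean()
```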
4. The Experimental Procedures and Analysis
In this section, we first present the datasets utilized for the experiments, the evaluation metrics employed to assess the model’s performance, and the details of the optimization process. Next, we compare the proposed FDEN with several state-of-the-art segmentation methods on the datasets. Finally, we conduct a series of ablation experiments to validate the contributions of the various components of our proposed method and discuss its limitations.
4.1. Benchmarks
4.1.1. Datasets
In this experiment, we utilized four datasets for training and testing: the Tibetan Plateau Dataset (TPD), the Gaofen Image Dataset (GID), the Wuhan Dense Labeling Dataset (WHDLD), and the Dense Labeling Remote Sensing Dataset (DLRSD). Here, we randomly selected 70% from the three datasets, TPD, GID, and WHDLD, to construct the training set, and the remaining portion was used for testing. This split strikes a balance between training diversity and evaluation robustness. In contrast, the DLRSD is used exclusively for testing, serving as an independent benchmark to evaluate the model’s generalization capability on unseen domains. Below, we provide a detailed introduction to each of these datasets.
(i) The Tibetan Plateau Dataset (TPD): The TPD was developed by Wang et al. [29] at Lanzhou University. It contains a total of 6774 images, each sized 256 × 256 pixels, with a depth of 24 bits and a DPI of 96. All images in this dataset feature lakes located on the Tibetan Plateau, which are surrounded by deserts or Gobi areas. The TPD is specifically tailored to the segmentation of lake water bodies, and it is also the primary dataset we focus on. For this experiment, we used 4743 images for training and 2031 images for testing.
(ii) The Gaofen Image Dataset (GID): The GID [42] is a large-scale land cover dataset collected using the Gaofen-2 satellite, known for its extensive coverage, wide distribution, and high spatial resolution. As shown in Figure 4, the images from the GID contain a variety of elements such as houses, roads, rivers, and vegetation, featuring complex terrain and environmental information, which can improve the segmentation accuracy of the models for irregular edges and complex structural shapes. We combined the large-scale classification set and the fine land cover classification set and then sliced the images into 256 × 256 chunks, resulting in a total of 31,500 images. Of these, 22,050 images were allocated for training, and 9450 images were reserved for testing.
(iii) The Wuhan Dense Labeling Dataset (WHDLD): The WHDLD [43] is a densely labeled dataset suitable for multi-label tasks such as remote sensing image retrieval (RSIR) and semantic segmentation. The main feature of the images in this dataset is the very low contrast between the water bodies and the surrounding vegetation, which can effectively test the segmentation accuracy and robustness of different methods in low-contrast scenes. It contains 4940 RGB images, each measuring 256 × 256 pixels. In our study, we used 3458 of these images for training and 1482 images for testing.
(iv) The Dense Labeling Remote Sensing Dataset (DLRSD): The DLRSD is widely used for remote sensing image classification tasks [44] and primarily consists of high-resolution remote sensing images. The DLRSD comprises 21 classes, with 100 images per class. We use the river images from this dataset as part of the test set. As shown in Figure 4, these river images exhibit both low contrast and intricate boundaries, making them ideal for evaluating the generalization capability of the model.
4.1.2. Evaluation
The metrics chosen for evaluating the model's performance are the Mean Intersection over Union (mIoU), Recall, Precision, F1 score, and boundary F1 score (BF1). Each metric is defined as follows:

(i) The Mean Intersection over Union (mIoU): The mIoU measures the overlap between the predicted and ground truth regions:

$$\mathrm{mIoU} = \frac{1}{N}\sum_{i=1}^{N} \frac{|P_{i} \cap G_{i}|}{|P_{i} \cup G_{i}|},$$

where $N$ is the number of categories, and $P_{i}$ and $G_{i}$ denote the predicted and ground truth regions for class $i$.

(ii) Recall: Recall is the ratio of correctly predicted positive pixels to all actual positives:

$$\mathrm{Recall} = \frac{TP}{TP + FN}.$$

(iii) Precision: Precision is the ratio of correctly predicted positives to all predicted positives:

$$\mathrm{Precision} = \frac{TP}{TP + FP}.$$

(iv) F1 Score: The F1 score is the harmonic mean of Precision and Recall:

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$

(v) Boundary F1 score (BF1): The BF1 score evaluates the accuracy of boundary segmentation by computing the F1 score between the predicted and ground truth boundaries. The boundaries are obtained by applying morphological dilation (radius = 5) followed by an exclusive OR operation with the original masks. The score is calculated as

$$\mathrm{BF1} = \frac{2 \times \mathrm{Precision}_{b} \times \mathrm{Recall}_{b}}{\mathrm{Precision}_{b} + \mathrm{Recall}_{b}},$$

where $\mathrm{Precision}_{b}$ and $\mathrm{Recall}_{b}$ are computed using boundary pixels only.
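For reference, here is a minimal NumPy/SciPy sketch of the boundary metric as we read it (dilation with a disk of radius 5, XOR with the mask, then precision/recall over boundary pixels); the exact boundary-matching details in the paper may differ.

```python
import numpy as np
from scipy import ndimage

def boundary_f1(pred: np.ndarray, gt: np.ndarray, radius: int = 5) -> float:
    """Sketch of BF1: boundaries are (dilated mask) XOR (mask), and
    precision/recall are computed over boundary pixels only."""
    yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    disk = (xx ** 2 + yy ** 2) <= radius ** 2          # disk structuring element

    def boundary(mask: np.ndarray) -> np.ndarray:
        return ndimage.binary_dilation(mask, structure=disk) ^ mask

    bp, bg = boundary(pred.astype(bool)), boundary(gt.astype(bool))
    tp = np.logical_and(bp, bg).sum()
    precision = tp / max(bp.sum(), 1)
    recall = tp / max(bg.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)
```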
These metrics together provide a comprehensive assessment of the lake water segmentation model’s performance, ensuring that the model maintains high detection rates while also achieving accurate predictions.
4.1.3. Optimization Settings
Our experiments were conducted on an NVIDIA RTX 3090 GPU using Python 3.9.19 and PyTorch 2.4.0. All comparison methods were evaluated under the same experimental settings and used the exact same training and testing sets. We employed the Stochastic Gradient Descent (SGD) algorithm as the optimizer, maintaining the hyperparameters at their default settings. Four worker threads were used during the training process, with a batch size set to 16. An initial learning rate of 0.001 was established and followed a cosine annealing scheduler, with a minimum learning rate of 0.00001, and the number of training epochs was set to 100. The momentum parameter was set to 0.9, and the weight decay parameter was set to 0.0001. The learning rate was decreased by 0.1 in each scheduling cycle. Additionally, all reported statistical data are derived from the experimental results of this study and do not include data from previous papers for comparison.
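These settings correspond to a standard training loop such as the following sketch; the model, loader, and loss are placeholders passed in by the caller, and the sketch covers the cosine annealing part of the schedule only.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

def train(model: nn.Module, train_loader: DataLoader, criterion: nn.Module,
          epochs: int = 100) -> None:
    """Reproduces the reported settings: SGD with momentum 0.9, weight decay
    1e-4, initial LR 1e-3, cosine annealing down to 1e-5 over 100 epochs."""
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                                momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs, eta_min=1e-5)
    for _ in range(epochs):
        for images, masks in train_loader:   # batch_size=16, num_workers=4
            optimizer.zero_grad()
            loss = criterion(model(images), masks)
            loss.backward()
            optimizer.step()
        scheduler.step()
```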
4.2. Comparison with State-of-the-Art Methods
To comprehensively demonstrate the advantages of our proposed method, we compared it with 16 mainstream algorithms widely used in semantic segmentation tasks: PSPNet [13], DeepLabV3 [45], MultiResUNet [46], ResUNet++ [47], CENet [48], UNet++ [14], ResUNet-a [49], LANet [50], MANet [51], TransUNet [21], DFSNet [52], DensePPMUNet-a [30], GAC-ViT [31], U-KAN [53], EoMT [54], and HGBT [55]. Among these algorithms, UNet++, ResUNet-a, MultiResUNet, ResUNet++, DensePPMUNet-a, and U-KAN are primarily based on the U-Net architecture, inheriting the classical encoder–decoder design. TransUNet, MANet, and GAC-ViT enhance the modeling capabilities by integrating Transformer structures, which are effective for capturing global contextual dependencies in complex scenes. In particular, EoMT achieves efficient image segmentation by simplifying the ViT architecture and introducing a novel mask annealing strategy. HGBT enhances BiFormer [56] by introducing a hypergraph convolution module, a dual pooling module, and a feature aggregation module, effectively improving the modeling of complex spatial relationships and multi-scale features in high-resolution remote sensing images. DeepLabV3, PSPNet, CENet, LANet, and DFSNet rely on multi-scale feature extraction, contextual information capture, and pyramid aggregation modules to enhance the perception of global features. The comparison results on the four datasets are shown in Table 1, Table 2, Table 3, and Table 4, respectively. Note that our training set consists of 70% randomly selected data from three datasets, the TPD, GID, and WHDLD; therefore, each method achieves a good performance on the first three test sets. Since the DLRSD is not involved in the construction of the training set, the performance of each method on the DLRSD test set fully reflects its generalization ability.
4.2.1. Visualization Results
The segmentation results of various methods on the TPD are shown in Figure 5. From this figure, we observe that methods such as U-KAN, DensePPMUNet-a, TransUNet, and ResUNet-a successfully capture the overall contours of the lakes; however, their segmentation of boundary details with high-frequency characteristics is not as good as that of our method. For example, in the first row of images, the FDEN extracts small lake water bodies more completely. In the third row, the segmentation results of the FDEN are more precise when handling complex shapes or boundaries with intricate details. LANet and MANet show improved sensitivity to small features within the water bodies but are therefore prone to introducing noise; this is most evident in the second row, where LANet and MANet mis-segment waves as land.
The segmentation results on the WHDLD are shown in Figure 6. Our method is more effective for segmentation in low-contrast situations. For example, the small patches of water in the boxes in the third and fourth rows are accurately extracted by the FDEN. These details fully demonstrate the FDEN's enhancement of high-frequency details in remote sensing images. In contrast, although ResUNet-a, UNet++, and CENet can recognize the general region, they are still not as good as our method in terms of segmentation completeness. DensePPMUNet-a, on the other hand, is able to extract a global context prior across a dense range of scales, producing much more coherent segmentation, but it is less effective for such small, low-contrast regions. In addition, the results of U-KAN are closest to those of our method, but our method is more accurate at complex boundaries, as can be seen in the fifth row.
4.2.2. Quantitative Results
As shown in Table 1, on the TPD, our method achieves an mIoU of 95.28%, a Recall of 97.48%, a Precision of 97.63%, and an F1 score of 97.47%. All methods show high values on the TPD, which can be attributed to the fact that most of its images exhibit clear, well-defined boundaries between the lakes and the land. Among the U-Net-based models, U-KAN, ResUNet-a, and DensePPMUNet-a achieve relatively good results; they fully incorporate multi-scale and contextual information to reach a more advanced performance. Compared with U-KAN, our method improves the IoU by 2.57% and the F1 score by 1.40%. Among the transformer-based models, TransUNet combines the local feature extraction capability of CNNs with the contextual information extraction capability of a ViT to obtain equally good segmentation results. Compared to TransUNet, our method improves the IoU by 2.67% and the F1 score by 1.41%.
Table 2 presents the segmentation results on the GID. The images in the GID are characterized by rich colors and complex regional structures, which increase the difficulty of accurately segmenting the edges of the images. Among the various comparison methods, models that consistently preserve low-level information through residual connections and fuse multi-scale information via skip connections generally exhibit a better performance. This highlights the importance of integrating information from different levels, especially in low-contrast images, as demonstrated by models such as UNet++ and DensePPMUNet-a. In the transformer-based method, HGBT combined with a hypergraph achieves the synergistic modeling of the global and local features and shows a good performance. The FDEN is substantially improved compared to all comparative methods. Compared to the next best, U-KAN, our method improves the IoU by 5.63% and the F1 score by 4.08%.
Table 3 shows the quantitative results of the comparison methods on the WHDLD. The image elements in the WHDLD are mostly buildings, roads, and vegetation, and their morphological structures are mostly regular polygons or linear shapes, which makes this dataset well suited to testing the detail fidelity of model segmentation at the edges. Table 3 shows that UNet++, LANet, TransUNet, DensePPMUNet-a, GAC-ViT, and U-KAN achieve a good performance on the WHDLD. Notably, LANet does not achieve a satisfactory performance on the GID, which is closely related to the morphological structure of the elements in the images. LANet enhances the embedding of contextual information by introducing the Patch Attention Module (PAM) and proposes the Attention Embedding Module (AEM) for the semantic information of low-level features; it is thus able to perform better when segmenting locally regular shapes. Our method improves the IoU by 9.90% and the F1 score by 6.04% compared to those of LANet.
Unlike the other datasets, the DLRSD is only used as a test set in this experiment, so it reflects the generalization ability of the models well. The river images in the DLRSD exhibit low contrast between the water bodies and the land. The quantitative results of our method and the comparison methods on the DLRSD are given in Table 4. U-KAN shows the best segmentation performance among the comparison methods. Our method improves the IoU by 4.96% and the F1 score by 6.04% compared to those of U-KAN.
In addition to the four metrics of the mIoU, Recall, Precision, and F1 score, we also evaluated the boundary segmentation accuracy of each method using the boundary F1 score. As shown in Table 5, the FDEN achieves a BF1 of 0.6513 on the TPD, an improvement of 3.46% over the suboptimal U-KAN; on the DLRSD, the FDEN exceeds U-KAN by 5.15%. These results consistently validate the excellent performance of the FDEN in the water body boundary extraction task.
To assess the FDEN's robustness across various challenging scenarios, we provide visual comparisons with several other high-performing models in Figure 7. Both scenes are sourced from the GID. The results clearly demonstrate that the FDEN excels in handling complex scenes, achieving superior boundary segmentation completeness and preserving finer details. This highlights the model's ability to maintain a high-quality segmentation performance even in challenging and varied environments.
4.3. Ablation Experiments
In this subsection, we first analyze the trade-off between model performance and the number of parameters. To demonstrate the model’s capability to extract edge details and small targets, we then conduct a quantitative analysis of the influence and contributions of the wavelet transform and the MDAM through a series of experiments. Finally, we investigate the effect of wavelet transforms at different depths and Hessian matrices with various operators on the model performance. Our analysis primarily focuses on the experimental results from the TPD and the WHDLD.
4.3.1. The Model Parameters and the Computational Overhead
Typically, an increase in the computational complexity of a model also brings performance gains. The objective of this study is to find the optimal balance by analyzing the impact of the number of Conv Blocks in the MDAM and the number of channels on both the network complexity and performance. We denote the number of blocks and channels as $N_{b}$ and $N_{c}$, respectively, across multiple experiments. The results are summarized in Table 6. As $N_{b}$ and $N_{c}$ increase, the segmentation performance of the network improves, but the number of parameters and the overall complexity also rise. We observe that the rate of performance improvement diminishes once $N_{b}$ and $N_{c}$ reach a certain threshold. Therefore, we accept a slight reduction in the performance gain in exchange for a lower computational complexity. The final configuration chosen was $N_{b} = 10$ and $N_{c} = 32$.
As shown in Table 7, our method strikes a good balance between computational efficiency and performance. It outperforms several existing models in terms of the inference time, achieving a faster processing speed of 3.75 ms, while maintaining a competitive performance with 6.97 M parameters and 12.79 G FLOPs. This demonstrates the efficiency of our approach for water body extraction, making it suitable for practical, real-time applications.
4.3.2. The Effectiveness of the Wavelet Transform
To investigate the impact of the wavelet transform on the model's performance, we conducted an experiment in which the wavelet transform module was replaced with a conventional convolutional layer and a pooling layer. All other settings were kept unchanged, and the model was retrained. The quantitative results, shown in Table 8, indicate that after substituting the wavelet transform module, the mIoU dropped to 91.47%, with a Recall of 94.02%, a Precision of 97.03%, and an F1 score of 95.36%. Our model showed an improved performance on these metrics when the wavelet transform was included, highlighting its significant influence on the overall segmentation performance. We believe the wavelet transform has such a large impact because it provides a frequency-band benchmark for the FDEN, which further improves multi-band detail perception.
4.3.3. The Effectiveness of the MDAM
In our study of the effectiveness of the MDAM, we focus on the role of DAA, a key module in the MDAM. We evaluate the impact of DAA by removing it completely, i.e., eliminating the guiding role of multi-scale Hessian filtering for the encoded features, and then retraining the model while keeping all other settings unchanged. The results, as presented in Table 8, show that the model's performance metrics declined across the board after the removal of DAA. This quantitative analysis confirms that multi-scale Hessian filtering is crucial for enhancing the segmentation accuracy of the model.
To further assess the effectiveness of the scaled Hessian filter in capturing fine edge details, we conducted six experiments using different filters: the single-scale filters $\mathrm{Hes}_{3 \times 3}$, $\mathrm{Hes}_{5 \times 5}$, and $\mathrm{Hes}_{7 \times 7}$ and their pairwise combinations. The results, summarized in Table 9, indicate that the Hessian filter improves the model performance more significantly when the kernel size $k$ is smaller. Among the various configurations, our learnable multi-scale Hessian filtering achieved the best segmentation results, demonstrating its superior ability to enhance the model's edge detail extraction.
In addition, to better understand the adaptive mechanism of multi-scale Hessian filtering, we analyze the dynamic evolution of the three learnable weights ($\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$) during training. As shown in Figure 8, these weights gradually converge around the 90th epoch. Although the absolute differences between them are relatively small, the consistent ordering $\lambda_{1} > \lambda_{2} > \lambda_{3}$ indicates a stable preference of the network toward smaller kernels. This phenomenon may be partly attributed to the high complementarity among the different Hessian scales for water body segmentation, which prevents the model from assigning overwhelming dominance to any single kernel. The diversity of sample styles in the dataset and the dynamics of the training process (e.g., the gradual decrease in the learning rate) may also contribute to the relatively close values of the weights. Nevertheless, the persistent trend favoring smaller kernels suggests that fine-scale filtering contributes more to capturing high-frequency boundary details, while larger kernels are preserved to ensure regional smoothness and continuity. Importantly, none of the scales is completely suppressed, which we take as evidence that the network benefits from a balanced yet preferential use of multi-scale Hessian features.
4.3.4. Discussion on Limitations
Although our model achieved encouraging results, we identified some limitations in the dataset and the model during the experiments. As shown in Figure 9, we list several scenarios with segmentation errors; the incorrectly annotated parts are enlarged in the figure. Specifically:
Dataset limitations: Figure 9a shows segmentation errors caused by geographical names obscuring the ground truth; Figure 9b shows errors where the model segments water bodies that were not annotated in the ground truth. These errors highlight the need for continuous dataset optimization, including improving the annotation accuracy and refining the dataset to better reflect real-world conditions.
Method limitations: Figure 9c,d show the segmentation errors of our method at extremely rare, complex boundaries. The red-boxed area in Figure 9c exhibits complex gradient changes, and some regions have colors very similar to those of the land, leading our method to incorrectly segment portions of the water bodies as land. The boundary regions in Figure 9d exhibit extremely low contrast, with some land areas having colors very similar to those of the lakes, resulting in segmentation errors. Figure 9c,d not only demonstrate the FDEN's strong ability to perceive boundary details but also indicate that the FDEN still has room for improvement in such highly misleading scenes. We believe that improving the FDEN's ability to capture contextual information, possibly through the integration of advanced mechanisms such as transformer-based blocks, could mitigate these challenges and further refine the model's performance in highly ambiguous scenarios.
5. Conclusions
In this paper, a novel lake water body segmentation network is proposed. To enhance the structural saliency of boundary details in remotely sensed images, we integrate a wavelet transform, multi-scale Hessian filtering, and an attention mechanism within an encoder–decoder framework. The wavelet transform decomposes the image into low- and high-frequency components, while the multi-scale Hessian filtering enhances the detail information across frequencies using specific differential operators. A learnable weighting mechanism is introduced to adaptively balance the contributions of features at different scales. This design effectively addresses the limited detail perception of previous water body segmentation models, thereby improving the overall segmentation accuracy and better preserving fine-scale structures. Extensive experiments on multiple benchmark datasets demonstrate that our method consistently outperforms the existing state-of-the-art techniques in both quantitative and qualitative evaluations. By enhancing the precision of segmentation, especially in challenging scenarios with complex boundaries and small water bodies, our approach enables more accurate and robust data extraction for long-term lake monitoring, ultimately supporting more informed decision-making in water conservation and environmental management.