Article

DMLU-Net: A Hybrid Neural Network for Water Body Extraction from Remote Sensing Images

1 School of Future Technology, China University of Geosciences, Wuhan 430074, China
2 School of Economics and Management, China University of Geosciences, Wuhan 430074, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(14), 7733; https://doi.org/10.3390/app15147733
Submission received: 13 June 2025 / Revised: 7 July 2025 / Accepted: 9 July 2025 / Published: 10 July 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

The delineation of aquatic features from satellite remote sensing data is vital for environmental monitoring and disaster early warning. However, existing water body detection models struggle with cross-scale feature extraction, often failing to resolve blurred boundaries, and they under-detect small water bodies in complex landscapes. To tackle these challenges, in this study, we present DMLU-Net, a U-shaped neural network integrated with a dynamic multi-kernel large-scale attention mechanism. The model employs a dynamic multi-kernel large-scale attention module (DMLKA) to enhance cross-scale feature capture; a spectral–spatial attention module (SSAM) in the skip connections to boost water region sensitivity; and a dynamic upsampling module (DySample) in the decoder to restore image details. DMLU-Net and six comparison models are tested on two publicly available Chinese remote sensing datasets. The results show that the F1-scores of DMLU-Net on the two datasets are 94.50% and 86.86%, and the IoU (Intersection over Union) values are 90.46% and 77.74%, both the best among the tested models. Notably, the model significantly reduces water boundary artifacts and improves overall prediction accuracy and small water body recognition, verifying its generalization ability and practical application potential in real-world scenarios.

1. Introduction

Water body extraction from remote sensing imagery is a key technology in water resource management, flood monitoring, environmental assessment, and other fields [1]. With the increase in global climate change and human activities, the distribution of and dynamic changes in water resources have increasingly significant impacts on ecosystems and human society. The accurate and efficient extraction of water body information not only facilitates the rational allocation and utilization of water resources but also provides important support for flood warnings, drought monitoring, and environmental protection [2]. Owing to its extensive coverage and high temporal resolution, remote sensing imagery has become the dominant data source for water body identification.
Early remote sensing-based water body extraction research, emerging in the 1980s, mainly used spectral attributes and basic threshold segmentation techniques [3,4,5]. Although simple and easy to use, these methods rely on manually set thresholds and struggle to distinguish the complex and variable land cover categories in high-resolution remote sensing images, so they often face challenges in practical scenarios [6]. With the development of machine learning techniques, many studies have used machine learning methods instead of spectral index methods for automated water body detection, including logistic regression [7], support vector machines [8], neural networks [6], and clustering [9,10]. Machine learning methods, by combining spectral, texture, and shape features, have improved the accuracy of water body extraction to a certain extent; among these methods, neural networks show superior performance [6]. However, in tasks involving complex water boundaries and multi-scale water body detection, the perception ability of ordinary neural networks represented by multi-layer perceptrons remains limited [11].
The rapid development of deep learning technology in recent years has opened new possibilities for remote sensing image extraction. Convolutional neural networks (CNNs), with their powerful feature extraction capabilities, have achieved remarkable results in remote sensing image classification, object detection, and segmentation tasks [12,13,14]. CNNs capture features through multi-layer convolution and assign classification labels to each pixel. On this basis, researchers have proposed various improved CNN models, such as FCN [15], U-Net [16], SegNet [17], the DeepLab series [18], and PSPNet [19,20]. In particular, encoder–decoder-structured CNN models such as U-Net are widely used in water body extraction tasks because they can capture both local and global features [21]. Traditional U-Net employs convolution and pooling to extract image semantic features; however, the limitation of its local receptive field impedes the comprehensive capture of global information, causing inferior performance in processing large-area water bodies. Moreover, although the skip connections of U-Net can transfer encoder features to the decoder, this simple concatenation method cannot effectively fuse and optimize features, easily leading to feature redundancy or loss and affecting recognition accuracy.
To capture global sequence semantic features, the attention mechanism widely used in natural language processing (NLP) has been applied to improve water body extraction models. For instance, Zhang et al. substituted the U-Net feature extractor with a MixFormer network and integrated an attention mechanism into the skip connection components, leading to a significant enhancement in water body extraction accuracy [22]. Zhang et al. optimized the network’s learning efficiency while preserving feature map detail information, which was achieved by enhancing the attention mechanism and integrating a spatial pyramid pooling module [23]. However, ordinary attention mechanisms (such as the SE module, CBAM, etc.) mostly focus on a single scale and cannot capture multi-scale feature information simultaneously. In practical applications, features of different scales are important for segmentation tasks, and models that only extract single-scale features often cannot balance the feature importance of multi-scale water bodies, leading to the missed detection of small water bodies or inaccurate boundaries [24,25]. Moreover, single-scale models require large model parameters and high computational complexity to maintain prediction accuracy, which easily limits their practicality.
To overcome the shortcomings of single-scale models, many studies have attempted to design more complex models to improve the abilities to focus on cross-scale features and detect small water body edges. Liu et al. used the ResNet50 network for feature extraction and added a squeeze-and-excitation module in the residual module to better capture cross-scale pixel feature relationships [26]. Hu et al. developed a novel deep learning network by combining ResNet50 and a multi-scale dense fusion (MDF) module, which enhanced feature representation through a position channel correlation module (DANet), significantly improving the accuracy and noise resistance of water body extraction [27]. The ability of DMAM-UNet to recognize fragmented water body features was enhanced by introducing a dual multiplication attention mechanism module (DMAM) and an atrous spatial pyramid pooling module (ASPP) [28]. These studies have proven the viability of multi-scale feature extraction approaches based on composite models.
Meanwhile, the sampling methods (such as bilinear interpolation and transposed convolution) of existing object detection models easily lead to information loss and blurred boundaries when restoring high-resolution feature maps, making it difficult to generate high-quality water body boundaries and resulting in a decrease in water body detection accuracy. Some scholars have considered improving the sampling method to alleviate this problem. An improved lightweight U-Net was proposed by An and Rui [29], who significantly reduced the model’s parameter count by reducing the number of downsampling layers and improving the bottleneck structure, but the effect of improving model accuracy was not significant, and there were still certain limitations in processing complex boundaries.
Based on the consideration of the above problems, in order to simultaneously enhance the overall detection accuracy and the ability to identify small water bodies, a new neural network based on a dynamic multi-kernel large-scale attention mechanism—DMLU-Net—is proposed. The main contributions of this model are summarized as follows:
(1) To accurately capture cross-scale remote sensing image features, inspired by research on multi-kernel mechanisms [30], a dynamic multi-kernel large-scale attention (DMLKA) module is designed. DMLKA uses multi-receptive field convolution kernels and a dynamic weight aggregation mechanism to enhance the model’s ability to capture multi-scale features from remote sensing imagery. This mechanism can effectively improve the model’s ability to express complex water body structures while maintaining computational efficiency.
(2) To enhance the contextual awareness of information transfer between the encoder and decoder and to better discriminate water bodies from non-aquatic features, a spectral–spatial attention mechanism module (SSAM) is used in the skip connection to refine features. The SSAM can reduce non-water background interference and enhance the recognition fidelity of small water features and boundary segments by fusing feature attention in the channel and spatial domains.
(3) To accurately restore the contours of complex water bodies, a DySample module [31] is utilized in the decoding process to restore and enhance the spatial resolution and local details of features. DySample realizes adaptive feature repositioning and sub-pixel alignment by learning spatial offset parameters, which can overcome the boundary-blurring problem caused by traditional interpolation upsampling, thereby preserving the continuity and integrity of complex structures, such as tributaries and lakes.
Compared with other water body extraction studies, our model simultaneously features an advanced multi-scale feature extraction module and an enhanced sampling module. Additionally, this is the first time these modules have been applied together to water body extraction. A systematic evaluation of two high-resolution public remote sensing datasets in China—GID and LoveDA [22]—is conducted to assess the effectiveness and generalization capacity of the proposed model in remote sensing applications. Experiments show that DMLU-Net demonstrates superior performance on the two public datasets, significantly outperforming mainstream methods in the water body detection field in recent years across multiple indicators, such as the IoU and F1-score, demonstrating good accuracy, robustness, and cross-scene adaptation abilities. The subsequent sections of this paper are structured as follows: Section 2 presents the DMLU-Net model, offering a detailed description of each component. Section 3 describes the public datasets, experimental configurations, and various comparative and ablation experiments conducted in the study. Finally, Section 4 summarizes the experimental findings and provides a comprehensive discussion.

2. Proposed Method

The proposed DMLU-Net model is introduced in this section. First, an overview of the model architecture is provided, followed by a detailed description of each module’s structure.

2.1. Architecture of DMLU-Net

As shown in Figure 1, the proposed DMLU-Net model adopts an encoder–bottleneck layer–decoder architecture, achieving accurate water body segmentation through the directional enhancement of feature maps. The encoder of the model consists of four levels of downsampling modules, each containing two 3 × 3 convolution kernels and max pooling operations. While compressing the spatial dimension (512 × 512 → 32 × 32), a spectral–spatial attention module (SSAM) is introduced in the skip connection path. This module strengthens the response of water-sensitive bands through the spectral branch (channel compression ratio of 16) and highlights boundary features with multi-scale spatial branches (3 × 3 convolutions with dilation rates of 1, 2, and 3), effectively alleviating the problems of band redundancy and edge blurring in multispectral data. In the decoding process, a DySample module is utilized to restore both image resolution and detailed information. Ultimately, the encoder and decoder feature maps are integrated via skip connections to produce the final water body extraction outcome.
The innovative design of DMLU-Net is mainly reflected in the DMLKA modules deployed at three levels, which balance the model’s ability to mine global and local features through flexible multi-scale convolution layers and dynamic weight design. The network realizes progressive feature optimization, from the global context to local details, through differentiated feature weights controlled by the temperature parameter T. The specific roles of DMLKA in the model are as follows:
  • The DMLKA module is deployed in the encoder’s terminal stage to pre-fuse 528-dimensional features (T = 0.3), uniformly considering water body features at multiple scales, maintaining the richness of feature information, and capturing the macroscopic water body morphology.
  • The pre-fused features are dimensionally upgraded using a convolution layer, and then the DMLKA module is used to perform shallow feature fusion (T = 0.2) of the 1056-dimensional high-dimensional features to extract key information.
  • After organizing the extracted feature information, the DMLKA module is used in the first layer of the decoder for deep feature enhancement (T = 0.1), emphasizing the focus on minor-scale features and enhancing the model’s ability to extract the boundary detail features of various objects in remote sensing imagery.
In the design of skip connections, to retain spatial details and enhance the representation ability of feature maps, a spectral–spatial attention module is designed. The SSAM combines spatial and channel attention mechanisms, suppressing irrelevant information and highlighting important features through attention expression. This module ensures that the network can focus on water body regions, ignoring noise and background interference, thereby improving the accuracy of segmentation results.
The DySample module in the decoder realizes sub-pixel-level feature alignment by learning offset parameters. Combined with the skip connection features optimized by the SSAM, it significantly reduces upsampling artifacts during resolution restoration. By upsampling the feature map, the model gradually restores the data scale to the original image resolution (512 × 512) and finally obtains the extraction result of the water body structure.

2.2. DMLKA Module

In the task of water body extraction from remote sensing imagery, multi-scale feature learning and attention mechanisms serve as critical factors for enhancing segmentation performance. Multi-scale features help the model to quickly obtain key information, reduce training difficulty, and minimize training costs; the attention mechanism helps the network to focus on key information and reduce or ignore the impact of irrelevant information on the final result. Previous studies in the field of object detection often use various attention mechanisms, including channel attention (CA) and self-attention (SA), to obtain more informative features. However, the CA and SA methods only compute attention in a fixed pattern and cannot further incorporate local information or long-range dependencies.
Inspired by research in the fields of object detection and visual attention [31,32], to further improve the model’s focus on important scale information and its anti-interference ability, we propose an innovative neural network module—dynamic multi-kernel large-scale attention (DMLKA)—by combining multi-scale large-kernel attention (MLKA) and a dynamic weight adjustment mechanism. DMLKA couples MLKA with a dynamic weight generator: it captures local details and long-range dependencies through multi-scale large-kernel convolution, and it adaptively adjusts the importance of features at different scales through the dynamic weight mechanism, thereby significantly improving the model’s perception and extraction of multi-scale water bodies in complex scenes. The structure of the DMLKA module is shown in Figure 2.
Specifically, DMLKA consists of four main functions: a normalization layer (layer normalization); large-kernel attention (LKA) to establish mutual dependencies; a multi-scale convolution mechanism to acquire cross-scale information; and a dynamic weight allocation mechanism.
Normalization layer (layer normalization): To ensure the distribution consistency of input features, we introduce layer normalization (LN) at the module entrance. The specific implementation is as follows:
$\hat{x} = \frac{x - u}{\sqrt{s + \varepsilon}} \times a + b$
Here, u and s are the mean and variance of each channel; a and b are learnable parameters; and ε is a numerical stability term. Each channel is independently normalized by the mean and standard deviation.
Large-kernel attention: As shown in Figure 2, for a given input feature map $X \in \mathbb{R}^{C \times H \times W}$ (where C, H, and W are the number of channels, the height, and the width, respectively), LKA performs convolution operations through a single depth-wise convolution layer and a three-layer cascaded convolution module to adaptively extract features. The three-layer cascaded convolution comprises a $(2d-1) \times (2d-1)$ depth-wise convolution ($f_{DW}$), a $\lceil k/d \rceil \times \lceil k/d \rceil$ depth-wise convolution with dilation $d$ ($f_{DWD}$), and a point-wise convolution ($f_{PW}$), where $k$ and $d$ are the convolution layer parameters. The three-layer cascade can be expressed as follows:
$\mathrm{LKA}(X) = f_{PW}(f_{DWD}(f_{DW}(X)))$
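For concreteness, the cascade above can be sketched in PyTorch as follows. This is a minimal sketch: the kernel setting $k = 21$, $d = 3$ and the final attention-style multiplication (common in VAN-style usage [32]) are assumptions, since the text does not fix them.

```python
import torch
import torch.nn as nn

class LKA(nn.Module):
    """Large-kernel attention per the equation above: LKA(X) = f_PW(f_DWD(f_DW(X))).

    With k = 21, d = 3 this gives a (2d-1)x(2d-1) = 5x5 depth-wise conv, then a
    ceil(k/d) x ceil(k/d) = 7x7 depth-wise conv with dilation d, then a 1x1
    point-wise conv. The attention multiplication at the end is an assumption.
    """
    def __init__(self, channels: int, k: int = 21, d: int = 3):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, kernel_size=2 * d - 1,
                            padding=d - 1, groups=channels)           # f_DW
        kd = -(-k // d)  # ceil(k/d)
        self.dwd = nn.Conv2d(channels, channels, kernel_size=kd,
                             padding=d * (kd - 1) // 2, dilation=d,
                             groups=channels)                         # f_DWD
        self.pw = nn.Conv2d(channels, channels, kernel_size=1)        # f_PW

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.pw(self.dwd(self.dw(x)))  # large-receptive-field attention map
        return attn * x                        # modulate the input features
```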
Multi-scale convolution mechanism: To learn the attention map of full-scale information, a multi-scale mechanism is used to improve the flexibility of the convolution layer. Assuming that the input feature is $X \in \mathbb{R}^{N \times C \times H \times W}$, the module first splits it into $n$ sub-features, each of size $(C/n) \times H \times W$. For the $i$-th group of features $X_i$, an $\mathrm{LKA}_i$ determined by the parameters $(k_i, d_i)$ is used to generate the corresponding feature map at that scale.
Dynamic weight allocation mechanism: After the data pass through the MLKA module, they are weighted by a variable weight layer. Specifically, adaptive global average pooling (AdaptiveAvgPool2d) is employed to compress feature maps into channel-wise statistical descriptors. On each branch, the initial weights are generated through two 1 × 1 convolution layers, and the initial weights are normalized using Softmax and T (a temperature parameter) to generate dynamic weight parameters w i for the corresponding branch. The calculation formula can be expressed as follows:
$w_i = \mathrm{softmax}\left(\frac{\mathrm{weight}(x)}{T}\right)$
Here, $w_i$ represents the dynamic weight corresponding to the $i$-th group of features $X_i$; $\mathrm{weight}(x)$ represents the generated initial weights; and $T$ is the temperature parameter, which controls the weight distribution (a smaller $T$ makes the weights more concentrated).
The dynamic weight allocation aggregation dynamically adjusts the weights of different branches according to the input features, enabling the model to automatically select the optimal local feature response by dynamically adjusting the attention map in different scenarios, thereby avoiding potential blocky artifacts and improving the feature representation ability. The temperature parameter T plays a key role in this process. T can adjust the output distribution characteristics of the Softmax function, thereby controlling the smoothness and sharpness of dynamic weights. Not only can this mechanism enhance the flexibility and robustness of the module to different input features and task requirements, but it can also effectively alleviate the instability of weight distribution in the early training stages, thereby improving the convergence performance and final effect of the model.
Specifically, when the temperature T is large (T = 1.0), the Softmax output tends to be more uniform, leading the network to average and integrate features across all scales. This behavior facilitates the utilization of multi-scale contextual information during the early stages of convergence and enhances the model’s stability and generalization capability. However, an excessively high temperature may hinder the model’s ability to emphasize the most relevant scale responses in specific scenarios, thereby limiting its capacity to accurately model target structures.
Conversely, when T is small, the Softmax distribution becomes more “sharp”, meaning that the model is biased toward features from a dominant scale. This enhances its sensitivity to the specific target scale and is particularly advantageous in scenarios where a single-scale representation of water bodies is prominent. Nevertheless, an overly low temperature may lead to overfitting or a reduced adaptability to scale variation, ultimately impairing generalization performance.
Finally, the outputs of MLKA are weighted and calculated according to the corresponding generated dynamic weights, and then they are concatenated with the results obtained using the residual network composed of a point convolution layer and the original data to obtain the total output of the DMLKA module. In the process of remote sensing image processing, the organic combination of each function in the DMLKA module supports the model’s ability to capture multi-scale features.
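To make the interplay of these pieces concrete, the following PyTorch sketch combines the multi-scale channel split, per-branch LKA (reusing the LKA class sketched above), and the temperature-scaled dynamic weighting of the equation for $w_i$. The branch scale settings $(k_i, d_i)$, the reduction used inside the weight generator, and the choice to derive weights from the input groups are assumptions not fixed by the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicMultiScaleWeighting(nn.Module):
    """Sketch of DMLKA's multi-scale split plus dynamic weight allocation.

    The input is split into n channel groups, each passed through its own
    LKA_i with assumed scales (k_i, d_i); per-branch weights come from
    AdaptiveAvgPool2d followed by two 1x1 convolutions, normalized by a
    temperature-scaled softmax as in the w_i equation above.
    """
    def __init__(self, channels: int, scales=((7, 2), (21, 3), (35, 4)), T: float = 0.3):
        super().__init__()
        n = len(scales)
        assert channels % n == 0, "channels must split evenly across branches"
        c = channels // n
        self.T = T
        self.branches = nn.ModuleList(LKA(c, k, d) for k, d in scales)
        # Per-branch initial weight generator: GAP -> 1x1 conv -> ReLU -> 1x1 conv.
        self.weight_gens = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(1),
                          nn.Conv2d(c, c // 4, 1), nn.ReLU(inplace=True),
                          nn.Conv2d(c // 4, 1, 1))
            for _ in scales)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        groups = torch.chunk(x, len(self.branches), dim=1)
        feats = [b(g) for b, g in zip(self.branches, groups)]
        # One logit per branch; softmax over branches, sharpened by small T.
        logits = torch.cat([wg(g) for wg, g in zip(self.weight_gens, groups)], dim=1)
        w = F.softmax(logits / self.T, dim=1)              # (N, n, 1, 1)
        feats = [f * w[:, i:i + 1] for i, f in enumerate(feats)]
        return torch.cat(feats, dim=1)
```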

2.3. SSAM

Shallow features produced by the encoder, despite containing rich spatial details, lack semantic information, are easily affected by noise, and predominantly contain non-water interferences. Referring to existing research ideas [33], we design the SSAM as a multi-attention combination to refine feature information and enhance the network’s anti-interference ability.
The structure of the SSAM is shown in Figure 3; it mainly includes two parts: spectral attention (channel dimension compression) and multi-scale spatial attention (parallel dilated convolution). Specifically, the spectral attention branch performs global pooling on the input feature map along the spatial dimensions (height H and width W) through a global average pooling layer to obtain channel information with a shape of N × C × 1 × 1 (N is the batch size). Then, two 1 × 1 convolution operations are used to downsample the channel count from C to C/16, and a nonlinear activation function, ReLU, is introduced. Thereafter, the channel count is recovered to C, and a spectral attention map is produced using the Sigmoid function.
The multi-scale spatial attention branch defines three parallel convolution branches, each using a 3 × 3 convolution kernel with different dilation rates (dilation = 1, dilation = 2, and dilation = 3). Then, the outputs of the three branches are concatenated along the channel dimension and fused through a 1 × 1 convolution layer. Finally, a spatial attention map is generated through the Sigmoid function.
The spectral and spatial attention maps are applied to the input feature map, and the original input is superimposed to achieve residual learning to obtain the final enhanced feature map.
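Under the description above, a minimal PyTorch sketch of the SSAM could look as follows. The single-channel output of each spatial branch and the multiplicative combination of the two attention maps before the residual addition are assumptions.

```python
import torch
import torch.nn as nn

class SSAM(nn.Module):
    """Spectral-spatial attention sketch: a channel (spectral) branch with
    reduction ratio 16 and three parallel 3x3 convs with dilation 1/2/3,
    combined residually with the input, per the description above."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.spectral = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # N x C x 1 x 1
            nn.Conv2d(channels, channels // reduction, 1),  # C -> C/16
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),  # C/16 -> C
            nn.Sigmoid())
        # Padding equals dilation, so each branch keeps the spatial size.
        self.spatial_branches = nn.ModuleList(
            nn.Conv2d(channels, 1, kernel_size=3, padding=r, dilation=r)
            for r in (1, 2, 3))
        self.fuse = nn.Sequential(nn.Conv2d(3, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ca = self.spectral(x)                               # spectral attention map
        sa = self.fuse(torch.cat([b(x) for b in self.spatial_branches], dim=1))
        return x * ca * sa + x                              # residual enhancement
```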

2.4. DySample Module

Traditional interpolation upsampling (such as bilinear upsampling) is prone to generating jagged or blurred edges when restoring water body boundaries, especially for highly fragmented and irregular water systems (such as meandering rivers). Therefore, we introduce a DySample sampling layer [31] to improve the restoration of water body contours.
As a lightweight dynamic upsampling module, DySample realizes efficient upsampling by dynamically generating sampling offsets to learn sampling positions. DySample uses the PixelShuffle operation to expand the H and W dimensions of the feature map while maintaining the C dimension, thereby increasing the spatial resolution of the feature map. This enables the model to capture broader global contextual information without complex convolution operations, helping to understand the relationship between different regions, thereby improving the accuracy of semantic segmentation. The ability to integrate the global context and local details renders DySample upsampling highly suitable for water body segmentation in remote sensing imagery, effectively preserving water body edge details and enhancing the completeness of small tributaries and boundary contours.
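The following is a minimal sketch of the dynamic-offset idea behind DySample, not the reference implementation from [31]: a 1 × 1 convolution predicts sampling offsets, PixelShuffle expands them to the target resolution, and grid_sample resamples the feature map at the shifted positions. The offset scaling factor is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicUpsample(nn.Module):
    """Illustrative dynamic upsampler in the spirit of DySample [31]."""
    def __init__(self, channels: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # Two offset channels (x, y) for each position of the upsampled grid.
        self.offset = nn.Conv2d(channels, 2 * scale * scale, kernel_size=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, _, h, w = x.shape
        H, W = h * self.scale, w * self.scale
        # (N, 2, H, W): learned offsets, kept small at initialization.
        offs = self.shuffle(self.offset(x)) * 0.25
        ys = torch.linspace(-1, 1, H, device=x.device)
        xs = torch.linspace(-1, 1, W, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack((gx, gy), dim=-1).expand(n, H, W, 2)
        # Normalize offsets into grid_sample's [-1, 1] coordinate range.
        grid = base + offs.permute(0, 2, 3, 1) / torch.tensor([W, H], device=x.device)
        return F.grid_sample(x, grid, align_corners=True)
```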

3. Experiments

3.1. Datasets

In this study, we conducted model testing and comparative experiments on two publicly available datasets: GID and LoveDA.

3.1.1. GID Dataset

The GID (Gaofen Image Dataset) is a large-scale high-resolution remote sensing land cover dataset developed from China’s Gaofen-2 satellite imagery. Comprising 150 high-resolution Gaofen-2 satellite images, it spans a geographical area exceeding 50,000 square kilometers. The image size is 7200 × 6800 pixels, and the spatial resolution is 0.8 m. It includes five land cover categories, namely, construction, farmland, forest, grassland, and water area, with the water body category containing lakes, ponds, rivers, paddy fields, etc. Only the water body category was retained, with the other categories designated as the background. The images were cropped to 512 × 512 image blocks, and images without water body information and images with water body information greater than 95% were removed. Finally, 5502 images were generated. The cropped images were subjected to image enhancement in the form of random rotation, Gaussian blur, and random adjustments of brightness and contrast, and they were randomly divided into training and test sets at a ratio of 8:2, with a random seed set to 42. Several images and their labels obtained from this dataset are presented in Figure 4. Among these labels, black represents non-water parts, with a label of 0, and white represents water parts, with a label of 1.
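A sketch of the augmentation and split pipeline described above, using the albumentations library for paired image/mask transforms; the probabilities and the rotation limit are assumptions, as the text does not specify them.

```python
import random
import albumentations as A

# Random rotation, Gaussian blur, and brightness/contrast jitter, as listed above.
aug = A.Compose([
    A.Rotate(limit=90, p=0.5),
    A.GaussianBlur(p=0.3),
    A.RandomBrightnessContrast(p=0.5),
])
# Usage: out = aug(image=image, mask=mask); out["image"], out["mask"]

def split_dataset(tiles, seed=42, train_ratio=0.8):
    """Randomly split the 512x512 tiles into train/test sets at 8:2 (seed 42)."""
    rng = random.Random(seed)
    tiles = tiles[:]
    rng.shuffle(tiles)
    cut = int(len(tiles) * train_ratio)
    return tiles[:cut], tiles[cut:]
```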

3.1.2. LoveDA Dataset

LoveDA is a high-resolution remote sensing surface cover dataset containing 5987 images and 166,768 labeled objects from three cities in China: Nanjing, Changzhou, and Wuhan. It covers two different geographical environments: urban and rural environments. The pixel size is 1024 × 1024, and the spatial resolution is 0.3 m; additionally, it includes three bands: red, green, and blue. The data come from the Google Earth platform. The images were cropped to 512 × 512 image blocks, and images without water body information were removed. Finally, 8538 images were generated. The processing method used to perform image enhancement operations and divide the dataset into training and test sets was the same as that used for the GID dataset, and a random seed was set to 42. Several images and their labels obtained from this dataset are presented in Figure 5.

3.2. Implementation Details

The experiments were implemented using the PyTorch framework, with two NVIDIA GeForce RTX 4090 GPUs used for training and testing. The operating system was Windows 10, the Python version was 3.11.10, the PyTorch version was 2.5.1, and CUDA 12.6 was used for GPU acceleration.

3.2.1. Evaluation Indicators

A total of five performance evaluation indicators were utilized: overall accuracy (OA), defined as the percentage of correctly classified pixels relative to the total pixels; precision (P), which measures the proportion of true-positive water body pixels among all predicted water body pixels; recall (R), which measures the proportion of ground-truth water body pixels that are correctly predicted; the F1-score (F1), the harmonic mean of precision and recall; and the Intersection over Union (IoU), defined as the average of the intersection over the union for the water body and background classes, reflecting the overlap between the prediction and the ground truth. In this study, the IoU most intuitively and concisely reflects the effect of binary water segmentation. The calculation formulas for the evaluation indicators are shown in Table 1.
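For reference, the five indicators can be computed from the binary confusion matrix as follows; this is a sketch consistent with the definitions above, with Table 1 as the authoritative source.

```python
import numpy as np

def water_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """OA, precision, recall, F1, and IoU for binary masks (1 = water)."""
    tp = np.sum((pred == 1) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    eps = 1e-12                                   # guard against empty classes
    oa = (tp + tn) / (tp + tn + fp + fn)
    p = tp / (tp + fp + eps)
    r = tp / (tp + fn + eps)
    f1 = 2 * p * r / (p + r + eps)
    iou_water = tp / (tp + fp + fn + eps)
    iou_bg = tn / (tn + fp + fn + eps)
    iou = (iou_water + iou_bg) / 2                # averaged over the two classes
    return dict(OA=oa, P=p, R=r, F1=f1, IoU=iou)
```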

3.2.2. Loss Function

The model proposed in this research is an end-to-end pixel-wise classification framework, and its loss function employs a mixed weighted loss, referred to as the combined loss. It combines the advantages of binary cross-entropy (BCEWithLogitsLoss) and dice loss, aiming to optimize both pixel-level classification accuracy and regional segmentation consistency. Specifically, by introducing adjustable weight parameters $\alpha$ and $\beta$, the contributions of the two losses are flexibly balanced. In addition, a smoothing term, smooth, is added to the dice loss calculation to prevent numerical instability. In our experiments, we set $\alpha$ = 1.0, $\beta$ = 0.7, and smooth = $10^{-6}$. The specific formulas are as follows:
1. BCE loss:
$L_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]$
where $N$ is the total number of samples; $y_i$ is the true label value; and $\hat{y}_i$ is the probability value predicted by the model (the output after Sigmoid activation).
2. Dice loss:
$L_{\mathrm{Dice}} = 1 - \frac{2 \times \mathrm{Intersection} + \mathrm{smooth}}{\mathrm{Union} + \mathrm{smooth}}$
This can be expanded as
$L_{\mathrm{Dice}} = 1 - \frac{2\sum (p \cdot t) + \mathrm{smooth}}{\sum p + \sum t + \mathrm{smooth}}$
Here, $p$ is the probability value predicted by the model; $t$ is the true mask; and smooth is the smoothing term used to prevent the denominator from being zero.
3. Combined loss:
$L_{\mathrm{combined}} = \alpha L_{\mathrm{BCE}} + \beta L_{\mathrm{Dice}}$
where $\alpha$ is the weight controlling the BCE loss, and $\beta$ is the weight controlling the dice loss.
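A compact PyTorch sketch of this combined loss, using the reported $\alpha$ = 1.0, $\beta$ = 0.7, and smooth = $10^{-6}$; the global-sum form of the dice term is an assumption, and a per-sample formulation would also match the equations above.

```python
import torch
import torch.nn as nn

class CombinedLoss(nn.Module):
    """Combined loss: alpha * BCE + beta * Dice, with a smoothing term.
    Expects raw logits; BCEWithLogitsLoss applies the sigmoid internally."""
    def __init__(self, alpha: float = 1.0, beta: float = 0.7, smooth: float = 1e-6):
        super().__init__()
        self.alpha, self.beta, self.smooth = alpha, beta, smooth
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        bce = self.bce(logits, target)
        p = torch.sigmoid(logits)                       # predicted probabilities
        inter = (p * target).sum()
        dice = 1 - (2 * inter + self.smooth) / (p.sum() + target.sum() + self.smooth)
        return self.alpha * bce + self.beta * dice
```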

3.2.3. Training Settings

To accelerate convergence during model training, the Adam optimizer was adopted, with an initial learning rate of 0.001 for model training. A learning rate adjustment strategy combining “Warmup” and “Cosine Annealing with Warm Restarts (CosAWR)” was designed to improve the convergence speed and stability of model training.
1. Warmup stage:
In the initial warmup_steps epochs of training, we used a linearly increasing learning rate scheduler, warmup_lr, gradually raising the learning rate from a value close to zero to the base learning rate base_lr. The specific formula is
$lr = \mathrm{base\_lr} \times \min\left(1.0, \frac{\mathrm{step}}{\mathrm{warmup\_steps}}\right)$
where step is the current training step; warmup_steps is the total number of steps in the Warmup stage; and base_lr is the base learning rate. This strategy helps the model avoid gradient explosion caused by a too-high learning rate in the early training stages and accelerates convergence.
2. Cosine Annealing with Warm Restarts (CosAWR) stage:
After the Warmup stage, we switched to the cosine annealing restart strategy, which helps the model jump out of local optima by periodically adjusting the learning rate. The CosineAnnealingWarmRestarts scheduler provided by PyTorch implements this operation, and its key parameters are the following:
  • $T_0$: the initial cycle length, set to 100 epochs;
  • $T_{\mathrm{mult}}$: the cycle-length multiplication coefficient, set to 2;
  • $\eta_{\mathrm{min}}$: the minimum learning rate, set to $10^{-7}$.
Whenever a cycle ends, the learning rate has decreased to $\eta_{\mathrm{min}}$ along a cosine curve, and a new cycle then restarts with the cycle length multiplied by $T_{\mathrm{mult}}$.
Through the two-stage learning rate strategy, the model can converge quickly in the early training stages and reduce the risk of falling into local optima through periodic learning rate adjustment in the subsequent stages. This strategy substantially enhances the model’s training stability and final performance in experimental validations.
Training cycle: The model was trained for a total of 300 epochs with an early stopping mechanism: if the test set performance did not improve for 10 consecutive epochs, training was stopped.
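The two-stage schedule maps onto PyTorch as sketched below; the warmup length and the stand-in model are assumptions, the training loop body is elided, and early stopping is omitted for brevity.

```python
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = nn.Conv2d(3, 1, 1)                       # stand-in for DMLU-Net
base_lr, warmup_epochs, total_epochs = 1e-3, 5, 300   # warmup length is assumed
optimizer = Adam(model.parameters(), lr=base_lr)
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=100, T_mult=2, eta_min=1e-7)

for epoch in range(total_epochs):
    if epoch < warmup_epochs:
        # Warmup: linearly raise the lr from near zero to base_lr.
        for g in optimizer.param_groups:
            g["lr"] = base_lr * min(1.0, (epoch + 1) / warmup_epochs)
    # ... one training epoch over the data loader goes here ...
    if epoch >= warmup_epochs:
        scheduler.step(epoch - warmup_epochs)    # CosAWR stage
```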

3.2.4. Comparative Models

To evaluate the superiority and effectiveness of the proposed DMLU-Net, we compared it with various mainstream semantic segmentation models using the five performance indicators described in Section 3.2.1. The specific models included U-Net [34], DeepLabv3+ [35], SwinUNet [36], TransUNet [37], MU-Net [22], and QTU-Net [21]. Among them, MU-Net and QTU-Net represent recent advances in water body recognition. Uniform experimental conditions were maintained for the training and testing of all networks, and the indicators of each batch were averaged to obtain the final indicator results.

3.3. Experimental Results

We conducted comparative experiments on the GID and LoveDA datasets. The random seed for model training was set to 42. The experimental results show that DMLU-Net achieved the best results in all evaluation indicators on both datasets.

3.3.1. Experimental Results on GID Dataset

The specific performance indicators of each model on the GID dataset are shown in Table 2. The bold values in Table 2 indicate the best results among all models. The results show that DMLU-Net performed outstandingly in terms of OA, P, R, the F1-score, and the IoU and that it was significantly superior to the other methods. Specifically, compared with the closest model, MU-Net, DMLU-Net achieved a 0.52% improvement in the IoU and a 0.86% improvement in the F1-score; compared with QTU-Net, it achieved a 2.15% improvement in the IoU. The experimental findings validate the model’s superior accuracy and robustness in complex water body extraction, particularly for small water bodies and fragmented boundaries.
To intuitively reflect the performance differences of the models, Figure 6 visually displays the prediction results of DMLU-Net and the other segmentation networks on partial remote sensing images. The first column in Figure 6 shows the original images, while the other columns show the extraction results of each model. The black area in the results represents the non-water body areas detected by the model, and the white area represents the water body areas. The parts with obvious differences are circled and framed with red dashed lines. From the extraction results of various water bodies in the figure, it can be seen that our method generally performs better than the other six methods in different scenarios. The extraction results of each model for surface water bodies in the fourth row of Figure 6 highlight the superiority of DMLU-Net. Its extraction results have clear water body boundary contours, with the fewest classification errors and missed small water bodies, and they are closest to the ground truth labels.
Specifically, Deeplabv3+ tends to exhibit over-segmentation in scenes of meandering and narrow tributaries, misclassifying the surrounding areas as water bodies. TransUNet, however, suffers from detection failures when identifying small tributaries with complex branch structures, primarily due to its inability to learn abundant global contextual features. Across diverse scenarios, although competing segmentation networks recognize water body fluctuations, DMLU-Net delivers more consistent and accurate segmentation outcomes. Moreover, in the scene of large-scale water bodies, the boundaries produced by DMLU-Net are the clearest. This indicates that DMLU-Net can effectively capture local details and global contextual information through the DMLKA module. The large-kernel convolution design in the DMLKA module enables the model to better perceive long-range dependencies, thus improving its performance. Relative to other network architectures, DMLU-Net excels in capturing sharper and more accurate water body boundaries, which is attributed to its multi-scale feature fusion mechanism.
To further validate the robustness and generalization capability of DMLU-Net, we conducted comparative experiments on the LoveDA dataset. The experimental results are presented in Table 3. The bold values in Table 3 indicate the best results among all models. On this dataset, DMLU-Net consistently outperformed the other water body recognition networks based on CNNs and Transformers across all evaluation metrics. Here, the Intersection over Union (IoU), a metric highly relevant for target detection tasks, is taken as an example: DMLU-Net achieved a 2.91% improvement in the IoU and a 2.30% increase in the F1-score compared to Deeplabv3+. Compared with QTU-Net, DMLU-Net showed a 1.96% improvement in the IoU and a 1.56% increase in the F1-score. These results demonstrate that the proposed method can accurately extract comprehensive surface water information with a high overall accuracy, outperforming the six other state-of-the-art algorithms compared. Furthermore, it exhibits strong generalization capabilities, making it suitable for various water body detection tasks.
Figure 7 presents a visualization of the prediction results produced by DMLU-Net and the other segmentation networks on the LoveDA dataset. As illustrated, DMLU-Net demonstrates superior performance in handling complex backgrounds and extracting multi-scale water bodies. For instance, the eighth and ninth rows of Figure 7 depict urban scenes featuring small rivers and ponds. The other models are more susceptible to interference from buildings and other ground objects, often leading to misclassification or missed detections. DMLU-Net effectively suppresses background noise and enhances the accuracy of water body extraction through the integration of its spectral–spatial attention module (SSAM) and dynamic upsampling module (DySample). The second and fifth rows in Figure 7 mainly demonstrate the model’s ability to restore small water bodies.
The test results on the two public datasets demonstrate that the method proposed in this study exhibits strong recognition capabilities. It is applicable to various types of water bodies, effectively distinguishes water–land boundaries, and accurately delineates small rivers and ponds. The integration of the DMLKA module enables the model to extract features from the original feature map through multiple channels and to adaptively allocate parameter weights, thereby capturing both local and global information more precisely. The introduction of dice loss enhances the model’s sensitivity to small targets and minority classes, mitigating overfitting to dominant categories and reducing the occurrence of missed detections. Furthermore, the inclusion of the spectral–spatial attention module (SSAM) allows the model to effectively utilize water-specific spectral features, thereby improving the accuracy and robustness of water body extraction. Finally, the DySample module facilitates the accurate restoration of feature information to the original resolution.

3.3.2. Ablation Experiments

To validate the effectiveness of each module, we conducted a series of ablation experiments. The baseline U-Net adopts an encoder–decoder architecture comprising four downsampling and four upsampling operations; its feature maps at the 1/8 and 1/16 scales capture only local features through 3 × 3 convolutions. Additional comparisons were made between DMLKA and common alternatives such as the squeeze-and-excitation network (SE), the convolutional block attention module (CBAM), and atrous spatial pyramid pooling (ASPP) [38,39,40]. The experimental results are summarized in Table 4 and Table 5.
As shown in Table 4, removing the DMLKA module leads to a 2.6% decrease in the IoU, indicating that the DMLKA module plays a crucial role in capturing global contextual information. Further removal of the SSAM results in an additional 1.92% decrease in the IoU, demonstrating its effectiveness in suppressing background noise and non-water features, thereby enhancing water body extraction accuracy. When DySample is removed, the IoU drops by another 0.85%, suggesting that this module contributes to edge refinement and improves the accuracy of water body segmentation. Additionally, comparing DMLKA against plain MLKA reveals that the incorporated dynamic weight generator significantly enhances the model’s IoU. As shown in Table 5, when the DMLKA module is replaced with the SE, CBAM, or ASPP module, the performance of the model declines. This further demonstrates the advantages of the proposed design.
In summary, although the ablation results do not show substantial changes in terms of overall accuracy (OA) and precision (P), the proposed DMLU-Net achieves the best performance in the IoU—the most critical metric for target detection—as well as in recall (R) and the F1-score, which confirms the effectiveness of each component module.

3.4. Discussion

3.4.1. Discussion on α and β in the Loss Function

In the loss function, the two hyperparameters α and β are introduced to control the weights of the cross-entropy loss and dice loss, respectively. By tuning the values of α and β on the GID dataset, it is observed that the model achieves optimal performance in terms of the IoU, F1-score, and recall when α = 1.0 and β = 0.7. Table 6 presents the effects of different α and β values on the experimental results. These findings suggest that dice loss plays a crucial role in addressing the imbalance between positive and negative samples, while cross-entropy loss facilitates model convergence.

3.4.2. Temperature Parameter T in DMLKA Module

During training, we experimented with different temperature values (T = 0.1, 0.2, 0.3, 0.5, and 1.0). The results demonstrated that a hierarchical temperature configuration further enhanced model performance: the optimal setup used T = 0.3 at the encoder output, T = 0.2 in the bottleneck layer, and T = 0.1 in the decoder. This coarse-to-fine temperature strategy enabled progressive refinement, shifting from multi-scale aggregation to key-scale focus, thereby improving the modeling of large-scale water body structures and fine edge details.

3.4.3. Model Complexity Analysis

In water body extraction from remote sensing images, computational efficiency and model size are critical for practical deployment. To assess the structural complexity of the proposed DMLU-Net, we compare it with six mainstream segmentation models by analyzing their parameter counts (in millions, M), computational costs (in billions of floating-point operations, FLOPs), and inference speed (in frames per second, FPS). The comparison results are shown in Figure 8.
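For reproducibility, parameter counts and FPS of the kind reported in Figure 8 can be measured along these lines; this is a sketch, and FLOPs require an external profiler (such as fvcore or thop), which is omitted here.

```python
import time
import torch

@torch.no_grad()
def complexity_report(model: torch.nn.Module, input_size=(1, 3, 512, 512), runs=100):
    """Count parameters (in M) and measure inference speed (FPS) on one device."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    for _ in range(10):                 # warm-up iterations before timing
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return params_m, runs / (time.time() - t0)
```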
As shown in the results, TransUNet, despite its strong global modeling capability, requires 105.9 M parameters and 168.7 G FLOPs, making it a computation-intensive model unsuitable for resource-constrained platforms (e.g., drones or satellite terminals). In contrast, DMLU-Net achieves competitive segmentation accuracy with only 37.1 M parameters (at 218.4 G FLOPs), striking a better balance between efficiency and performance among the models of similar precision.
Notably, while the computational cost of DMLU-Net (218.4 G FLOPs) exceeds that of SwinUNet (46.4 G) and MU-Net (172.7 G), its parameter count remains efficiently constrained at 37.1 M, significantly lower than that of Transformer-based models such as TransUNet (105.9 M). The inference speed of DMLU-Net (68.8 FPS) is not overly sacrificed for the improvement in accuracy, achieving a good balance between high accuracy and computational efficiency. This validates the effectiveness of our designed modules (DMLKA, SSAM, and DySample) in balancing feature representation and computational efficiency.
In conclusion, if high-precision segmentation is required, DMLU-Net leads overall in performance and is suitable for scenarios requiring high-quality results, achieving a trade-off between a lightweight architecture and high performance while maintaining segmentation accuracy. If efficient real-time inference is required, Deeplabv3+ can be chosen: although it has a large number of parameters, its FPS advantage is clear. If hardware resources are tight, SwinUNet can be employed, as its computational load and parameter count are low.

3.4.4. Model Limitations and Future Work

Although the proposed DMLU-Net achieved state-of-the-art performance in water body extraction from remote sensing images and demonstrated strong generalization abilities across multiple datasets, several aspects merit further investigation and optimization:
(1) Model compression potential: While DMLU-Net improves the extraction accuracy through its DMLKA module, SSAM, and DySample module, these enhancements come at the cost of increased parameters and computational overhead, particularly limiting its efficiency in processing high-resolution imagery. This limitation may make it difficult for the model to be applied to the rapid survey of large-scale areas in emergency scenarios. Future work could explore structural pruning, lightweight attention alternatives, or knowledge distillation to substantially reduce inference costs while preserving accuracy, thereby enhancing the feasibility of deployment on resource-constrained platforms (e.g., satellite terminals or mobile devices).
(2) The model’s adaptability to multi-source heterogeneous data remains to be systematically evaluated. While current experiments focus on single-sensor optical imagery, practical remote sensing applications typically involve multi-source (e.g., SAR, Landsat, and Sentinel) and multi-modal (e.g., hyperspectral and multi-temporal) data. DMLU-Net’s performance in cross-sensor, cross-resolution, and cross-modal scenarios requires further validation. For some common problems in optical images (such as SAR speckle noise and hyperspectral dimension redundancy), subsequent measures will be taken based on this study to alleviate them through inter-modal pre-training (initializing part of the network using public non-optical datasets) and targeted data augmentation (such as SAR speckle suppression and hyperspectral dimension reduction/clustering). In addition, another possible solution is to build an ensemble learner by using a model for optical images and the model proposed in this study. Future research will focus on enhancing the model’s generalization ability in the complex data environment of the real world.
(3) This study lacks a systematic exploration of uncertainty modeling and interpretability. Although DMLU-Net performs well in numerical evaluation, the model’s predictive capability for anomalous areas, fuzzy boundaries, or low-contrast water bodies is not fully demonstrated. Future research can introduce uncertainty quantification mechanisms (such as Monte Carlo dropout and Bayesian deep learning) to evaluate the reliability of model predictions, and it can combine interpretability methods (such as CAM/Grad CAM) to enhance the controllability and trustworthiness of the model in practical scenarios.
(4) Strong data dependence and high annotation costs. Current model training relies on large-scale, high-quality pixel-level annotated data, but the manual annotation of remote sensing images is costly and time-consuming. Future research could explore semi-supervised learning, pseudo-label generation, self-supervised contrastive learning, and similar approaches to reduce the reliance on manual annotations and enhance model adaptability in data-scarce scenarios.

4. Conclusions

To address the challenges of large-scale variations, indistinct boundaries, and background interference in water body extraction from remote sensing imagery, in this study, we propose DMLU-Net, a novel hybrid neural network architecture. Building upon the encoder–decoder framework of U-Net, the model incorporates three key components: a dynamic multi-kernel large-scale attention module (DMLKA), a spectral–spatial attention module (SSAM), and a dynamic upsampling module (DySample). This integrated approach significantly enhances the model’s perceptual capability and segmentation accuracy for water body regions.
The DMLKA module enhances the model’s ability to capture multi-scale contextual information by employing large-kernel convolutions and a temperature-regulated dynamic weighting mechanism. Meanwhile, the SSAM incorporates a selective spatial–spectral feature screening mechanism within skip connections, thereby improving the model’s detection accuracy for small water bodies and edge regions. Additionally, the DySample module achieves high-precision upsampling by learning sampling offsets, effectively mitigating the boundary artifacts and information loss inherent in conventional methods.
Experimental results on two public remote sensing datasets (GID and LoveDA) demonstrate that DMLU-Net achieves superior performance compared to existing state-of-the-art CNN- and Transformer-based models, with notable improvements in key metrics, including the IoU (90.46%) and F1-score (94.50%). These results validate the effectiveness and robustness of the proposed method in complex scenarios. Furthermore, ablation studies confirm the critical contribution of each module, substantiating their essential roles in enhancing overall model performance. A model complexity analysis reveals that the model achieves a trade-off between accuracy and computational efficiency when executed on GPUs. However, if this model is to be applied in emergency scenarios or deployed as a lightweight system, in-depth research on actual application scenarios is still required.
Future research will focus on making improvements in four key areas: (1) the model lightweight design, (2) multi-source remote sensing data adaptation, (3) uncertainty modeling, and (4) semi-supervised learning strategies. These improvements aim to enhance the model’s computational efficiency, generalization capability, and practical utility, thereby better supporting critical applications, including water resource monitoring, environmental protection, and disaster early warning systems.

Author Contributions

Conceptualization, Z.X. and M.L.; methodology, Z.X. and M.L.; software, Z.X. and M.L.; validation, Z.X.; formal analysis, Z.X.; investigation, Z.X. and M.L.; resources, H.G.; data curation, Z.X. and M.L.; writing—original draft preparation, Z.X. and M.L.; writing—review and editing, Z.X. and H.G.; visualization, Z.X.; supervision, H.G.; project administration, H.G.; funding acquisition, H.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The experiment was conducted based on publicly available datasets. The datasets can be obtained from https://x-ytong.github.io/project/GID.html (accessed on 8 July 2025) and https://github.com/Junjue-Wang/LoveDA (accessed on 8 July 2025).

Acknowledgments

The authors thank the anonymous reviewers and the editors for their valuable comments to improve our manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Dong, Z.; Liang, Z.; Wang, G.; Amankwah, S.O.Y.; Feng, D.; Wei, X.; Duan, Z. Mapping inundation extents in Poyang Lake area using Sentinel-1 data and transformer-based change detection method. J. Hydrol. 2023, 620, 129455. [Google Scholar] [CrossRef]
  2. Qiu, J.; Cao, B.; Park, E.; Yang, X.; Zhang, W.; Tarolli, P. Flood Monitoring in Rural Areas of the Pearl River Basin (China) Using Sentinel-1 SAR. Remote Sens. 2021, 13, 1384. [Google Scholar] [CrossRef]
  3. McFeeters, S.K. The use of the Normalized Difference Water Index (NDWI) in the delineation of open water features. Int. J. Remote Sens. 1996, 17, 1425–1432. [Google Scholar] [CrossRef]
  4. Li, K.; Wang, J.; Yao, J. Effectiveness of machine learning methods for water segmentation with ROI as the label: A case study of the Tuul River in Mongolia. Int. J. Appl. Earth Obs. Geoinf. 2021, 103, 102497. [Google Scholar] [CrossRef]
  5. Xu, H. Modification of normalised difference water index (NDWI) to enhance open water features in remotely sensed imagery. Int. J. Remote Sens. 2006, 27, 3025–3033. [Google Scholar] [CrossRef]
  6. Li, A.; Fan, M.; Qin, G.; Xu, Y.; Wang, H. Comparative Analysis of Machine Learning Algorithms in Automatic Identification and Extraction of Water Boundaries. Appl. Sci. 2021, 11, 10062. [Google Scholar] [CrossRef]
  7. Cheng, Q.; Varshney, P.K.; Arora, M.K. Logistic regression for feature selection and soft classification of remote sensing data. IEEE Geosci. Remote Sens. Lett. 2006, 3, 491–494. [Google Scholar] [CrossRef]
  8. Alimjan, G.; Sun, T.L.; Jumahun, H.; Guan, Y.; Zhou, W.T.; Sun, H.G. A Hybrid Classification Approach Based on Support Vector Machine and K-Nearest Neighbor for Remote Sensing Data. Int. J. Pattern Recognit. Artif. Intell. 2017, 31, 1750034. [Google Scholar] [CrossRef]
  9. Liang, G.; Zhao, X.; Zhao, J.; Zhou, F. Feature Selection and Mislabeled Waveform Correction for Water-Land Discrimination Using Airborne Infrared Laser. Remote Sens. 2021, 13, 3628. [Google Scholar] [CrossRef]
  10. Mahboob, M.; Genc, B. Evaluation of ISODATA Clustering Algorithm for Surface Gold Mining Using Satellite Data. In Proceedings of the 2019 International Conference on Electrical, Communication, and Computer Engineering (ICECCE), Swat, Pakistan, 24–25 July 2019. [Google Scholar]
  11. Nagaraj, R.; Sutha, K.L. Pixel Level Feature Extraction and Machine Learning Classification for Water Body Extraction. Arab. J. Sci. Eng. 2022, 48, 9905–9928. [Google Scholar] [CrossRef]
  12. Wang, Z.; Gao, X.; Zhang, Y.; Zhao, G. MSLWENet: A Novel Deep Learning Network for Lake Water Body Extraction of Google Remote Sensing Images. Remote Sens. 2020, 12, 4140. [Google Scholar] [CrossRef]
  13. Zhang, Z.; Lu, M.; Ji, S.; Yu, H.; Nie, C. Rich CNN Features for Water-Body Segmentation from Very High Resolution Aerial and Satellite Imagery. Remote Sens. 2021, 13, 1912. [Google Scholar] [CrossRef]
  14. Chen, S.; Liu, Y.; Zhang, C. Water-Body Segmentation for Multi-Spectral Remote Sensing Images by Feature Pyramid Enhancement and Pixel Pair Matching. Int. J. Remote Sens. 2021, 42, 5025–5043. [Google Scholar] [CrossRef]
  15. Zhou, Z. Classification of landscape architecture design based on dual-channel attention improved FCN. Syst. Soft Computing 2025, 7, 200280. [Google Scholar] [CrossRef]
  16. Copurkaya, C.; Meric, E.; Akbulut, F.P.; Catal, C. A multi-pretraining U-Net architecture for semantic segmentation. Signal Image Video Process. 2025, 19, 669. [Google Scholar] [CrossRef]
  17. Kumar, K.A.; Vanmathi, C. A hybrid parallel convolutional spiking neural network for enhanced skin cancer detection. Sci. Rep. 2025, 15, 11137. [Google Scholar] [CrossRef]
  18. Feng, Y.; Fan, Z.; Yan, Y.; Jiang, Z.; Zhang, S. MFAFNet: Multi-Scale Feature Adaptive Fusion Network Based on DeepLab V3+ for Cloud and Cloud Shadow Segmentation. Remote Sens. 2025, 17, 1229. [Google Scholar] [CrossRef]
  19. Li, Y.; Li, P.; Wang, H.; Gong, X.; Fang, Z. CAML-PSPNet: A Medical Image Segmentation Network Based on Coordinate Attention and a Mixed Loss Function. Sensors 2025, 25, 1117. [Google Scholar] [CrossRef] [PubMed]
  20. Sunwoo, H.; Lee, S.; Paik, W. A Software-Defined Sensor System Using Semantic Segmentation for Monitoring Remaining Intravenous Fluids. Sensors 2025, 25, 3082. [Google Scholar] [CrossRef]
  21. Wang, M.Z.; Li, C.S.; Yang, X.F.; Chu, D.H.; Zhou, Z.Q.; Lau, R.Y.K. QTU-Net: Quaternion Transformer-Based U-Net for Water Body Extraction of RGB Satellite Image. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5634816. [Google Scholar] [CrossRef]
  22. Zhang, Y.; Lu, H.; Ma, G.; Zhao, H.; Xie, D.; Geng, S.; Tian, W.; Sian, K.T.C.L.K. MU-Net: Embedding MixFormer into Unet to Extract Water Bodies from Remote Sensing Images. Remote Sens. 2023, 15, 3559. [Google Scholar] [CrossRef]
  23. Zhang, Q.; Hu, X.; Xiao, Y. A Novel Hybrid Model Based on CNN and Multi-Scale Transformer for Extracting Water Bodies from High Resolution Remote Sensing Images. ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci. 2023, 10, 889–894. [Google Scholar] [CrossRef]
  24. Lee, S.; Kim, D.-J.; Li, C.; Yoon, D.; Song, J.; Kim, J.; Kang, K.-M. A new model for high-accuracy monitoring of water level changes via enhanced water boundary detection and reliability-based weighting averaging. Remote Sens. Environ. 2024, 313, 114360. [Google Scholar] [CrossRef]
  25. Huang, B.; Li, P.; Lu, H.; Yin, J.; Li, Z.; Wang, H. WaterDetectionNet: A New Deep Learning Method for Flood Mapping With SAR Image Convolutional Neural Network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 14471–14485. [Google Scholar] [CrossRef]
  26. Liu, M.; Liu, J.; Hu, H. A Novel Deep Learning Network Model for Extracting Lake Water Bodies from Remote Sensing Images. Appl. Sci. 2024, 14, 1344. [Google Scholar] [CrossRef]
  27. Hu, H.; Fu, X.; Li, C.; Liu, M.; Feng, X. AMFF-LWBENet: A Novel Deep Learning Network Model for Extracting Lake Water Bodies From Remote Sensing Images. IEEE Access 2024, 12, 149001–149017. [Google Scholar] [CrossRef]
  28. Liu, Z.; Liu, X.; Yu, M.; Yang, X. DMAM-UNET:An Improved Unet Semantic Segmentation for Water Body Extraction from Remotely Sensed Image. In Proceedings of the IGARSS 2023-2023 IEEE International Geoscience and Remote Sensing Symposium, Pasadena, CA, USA, 16–21 July 2023. [Google Scholar] [CrossRef]
  29. An, S.; Rui, X. A High-Precision Water Body Extraction Method Based on Improved Lightweight U-Net. Remote Sens. 2022, 14, 4127. [Google Scholar] [CrossRef]
  30. Wang, L.; Shen, J.; Tang, E.; Zheng, S.; Xu, L. Multi-scale Attention Network for Image Super-Resolution. J. Vis. Commun. Image Represent. 2021, 80, 103300. [Google Scholar] [CrossRef]
  31. Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to Upsample by Learning to Sample. arXiv 2023, arXiv:2308.15085. [Google Scholar] [CrossRef]
  32. Guo, M.; Lu, C.; Liu, Z.; Cheng, M.; Hu, S. Visual Attention Network. arXiv 2022, arXiv:2202.09741. [Google Scholar] [CrossRef]
  33. Sun, G.; Pan, Z.; Zhang, A.; Jia, X.; Ren, J.; Fu, H. Large Kernel Spectral and Spatial Attention Networks for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
  34. He, Y.; Yao, S.; Yang, W.; Yan, H.; Zhang, L.; Wen, Z.; Zhang, Y.; Liu, T. An extraction method for glacial lakes based on Landsat-8 imagery using an improved U-Net network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 6544–6558. [Google Scholar] [CrossRef]
  35. Xiao, C.; Zhou, Z.; Hu, Y. A Lightweight Semantic Segmentation Model for Underwater Images Based on DeepLabv3+. J. Imaging 2025, 11, 162. [Google Scholar] [CrossRef] [PubMed]
  36. Pani, A.; Zedda, L.; Mura, D.A.; Loddo, A.; Di Ruberto, C. 3D-NASE: A Novel 3D CT Nasal Attention-Based Segmentation Ensemble. J. Imaging 2025, 11, 148. [Google Scholar] [CrossRef]
  37. Dang, M.; Zhou, X.; Huang, G.; Wang, X.; Zhang, T.; Tian, Y.; Ding, G.; Gao, H. Application Research on Contour Feature Extraction of Solidified Region Image in Laser Powder Bed Fusion Based on SA-TransUNet. Appl. Sci. 2025, 15, 2602. [Google Scholar] [CrossRef]
  38. Hu, J.; Shen, L.; Sun, G. Squeeze—and—Excitation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  39. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521. [Google Scholar] [CrossRef]
  40. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef]
Figure 1. Architecture of the DMLU-Net model.
Figure 2. Structure schematic of the DMLKA module.
Figure 3. Structure schematic of the SSAM.
Figure 4. Example images and their corresponding labels in the GID dataset.
Figure 5. Example images and their corresponding labels in the LoveDA dataset.
Figure 6. Example extraction results of each model on the GID dataset.
Figure 7. Example extraction results of each model on the LoveDA dataset.
Figure 8. Computational efficiency and model parameters of various models.
Table 1. Model evaluation metrics.

| Index | Formula |
| --- | --- |
| OA | $\frac{TP + TN}{TP + TN + FP + FN}$ |
| P | $\frac{TP}{TP + FP}$ |
| R | $\frac{TP}{TP + FN}$ |
| F1-score | $\frac{2 (P \times R)}{P + R}$ |
| IoU | $\frac{TP}{TP + FP + FN}$ |
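For readers who wish to reproduce these scores, the NumPy sketch below computes all five metrics from a pair of binary masks. It is a minimal illustration only: the function name `water_metrics` and the epsilon guard against empty masks are our own additions, not part of the authors' released code.

```python
import numpy as np

def water_metrics(pred: np.ndarray, label: np.ndarray, eps: float = 1e-12) -> dict:
    """Compute OA, P, R, F1, and IoU for binary masks (1 = water, 0 = background)."""
    pred, label = pred.astype(bool), label.astype(bool)
    tp = np.sum(pred & label)      # water pixels correctly detected
    tn = np.sum(~pred & ~label)    # background pixels correctly rejected
    fp = np.sum(pred & ~label)     # false alarms
    fn = np.sum(~pred & label)     # missed water pixels
    oa = (tp + tn) / (tp + tn + fp + fn + eps)
    p = tp / (tp + fp + eps)
    r = tp / (tp + fn + eps)
    f1 = 2 * p * r / (p + r + eps)
    iou = tp / (tp + fp + fn + eps)
    return {"OA": oa, "P": p, "R": r, "F1": f1, "IoU": iou}
```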
Table 2. Experimental results based on the GID dataset (all values in %).

| Method | OA | P | R | F1 | IoU |
| --- | --- | --- | --- | --- | --- |
| Unet | 95.70 | 95.81 | 90.84 | 93.51 | 87.52 |
| Deeplabv3+ | 95.05 | 92.06 | 93.24 | 93.08 | 86.54 |
| TransUNet | 95.57 | 91.57 | 92.14 | 92.82 | 86.53 |
| SwinUnet | 94.22 | 91.64 | 90.02 | 91.29 | 84.57 |
| MU-Net | 95.72 | 94.48 | 94.88 | 93.64 | 89.94 |
| QTU-Net | 95.87 | 93.61 | 94.56 | 93.85 | 88.31 |
| DMLU-Net | 96.21 | 94.29 | 95.19 | 94.50 | 90.46 |
Table 3. Experimental results based on the LoveDA dataset (all values in %).

| Method | OA | P | R | F1 | IoU |
| --- | --- | --- | --- | --- | --- |
| Unet | 94.50 | 88.76 | 79.02 | 82.99 | 72.19 |
| Deeplabv3+ | 95.97 | 89.50 | 81.24 | 84.56 | 74.53 |
| TransUNet | 94.60 | 89.36 | 82.67 | 85.88 | 75.26 |
| SwinUnet | 93.53 | 86.35 | 80.09 | 83.10 | 71.09 |
| MU-Net | 94.79 | 87.88 | 85.56 | 86.70 | 76.52 |
| QTU-Net | 96.07 | 89.13 | 82.83 | 85.30 | 75.46 |
| DMLU-Net | 96.42 | 89.59 | 85.31 | 86.86 | 77.44 |
Table 4. Comparison of ablation models' metrics (all values in %).

| Method | OA | P | R | F1 | IoU |
| --- | --- | --- | --- | --- | --- |
| Baseline + SSAM + D-up | 96.62 | 93.74 | 91.33 | 92.53 | 87.86 |
| Baseline + DMLKA + D-up | 95.18 | 94.37 | 93.32 | 93.88 | 88.54 |
| Baseline + DMLKA + SSAM | 95.62 | 94.48 | 94.46 | 94.44 | 89.61 |
| Baseline + DMLKA + SSAM + D-up | 96.21 | 94.29 | 95.19 | 94.50 | 90.46 |
Table 5. The impact of replacing the DMLKA module on model performance (all values in %).

| Method | OA | P | R | F1 | IoU |
| --- | --- | --- | --- | --- | --- |
| SE | 95.53 | 92.40 | 94.84 | 93.53 | 88.05 |
| ASPP | 95.46 | 93.14 | 93.91 | 93.45 | 87.89 |
| CBAM | 95.86 | 92.53 | 95.79 | 94.08 | 88.98 |
| DMLKA | 96.21 | 94.29 | 95.19 | 94.50 | 90.46 |
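Table 5 replaces DMLKA with three standard attention baselines. As a point of reference, a minimal PyTorch sketch of the SE block of Hu et al. [38] is given below; the reduction ratio of 16 follows the original paper, while where such a block is inserted within DMLU-Net for this comparison is not specified here.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Generic Squeeze-and-Excitation block (Hu et al. [38]), for illustration."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global spatial average per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)  # excitation: channel weights in (0, 1)
        return x * w  # reweight each feature map
```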
Table 6. The influence of α and β parameters on model performance indicators (all values in %).

| Parameters | OA | P | R | F1 | IoU |
| --- | --- | --- | --- | --- | --- |
| α = 1.0, β = 1.0 | 96.01 | 94.53 | 94.63 | 94.86 | 90.22 |
| α = 1.0, β = 0.5 | 95.92 | 94.38 | 94.68 | 94.83 | 90.16 |
| α = 1.0, β = 0.7 | 96.21 | 94.29 | 95.19 | 94.50 | 90.46 |
| α = 0.5, β = 1.0 | 96.66 | 94.81 | 94.79 | 94.75 | 90.03 |
| α = 0.7, β = 1.0 | 96.35 | 94.83 | 94.81 | 94.82 | 90.15 |
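Table 6 tunes two loss weights, α and β, whose exact definition is given earlier in the paper and is not reproduced in this back matter. Purely as an illustrative assumption, the sketch below treats them as weights on a binary cross-entropy term and a Dice term, a pairing commonly used in water body segmentation; the function `combined_loss` and this decomposition are hypothetical, not the authors' confirmed formulation.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits: torch.Tensor, target: torch.Tensor,
                  alpha: float = 1.0, beta: float = 0.7, eps: float = 1e-6) -> torch.Tensor:
    """Hypothetical compound loss: alpha * BCE + beta * Dice (target is a float mask in {0, 1})."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum()
    dice = 1 - (2 * inter + eps) / (prob.sum() + target.sum() + eps)
    return alpha * bce + beta * dice
```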