LU-Net: Lightweight U-Shaped Network for Water Body Extraction of Remote Sensing Images

Deng, Chengzhi; He, Ruqiang; Wu, Zhaoming; Sun, Xiaowei; Wang, Shengqian

doi:10.3390/w17182763


by Chengzhi Deng *, Ruqiang He, Zhaoming Wu, Xiaowei Sun and Shengqian Wang

Jiangxi Province Key Laboratory of Smart Water Conservancy, Jiangxi University of Water Resources and Electric Power, Nanchang 330099, China

* Author to whom correspondence should be addressed.

Water 2025, 17(18), 2763; https://doi.org/10.3390/w17182763

Submission received: 27 July 2025 / Revised: 11 September 2025 / Accepted: 14 September 2025 / Published: 18 September 2025

(This article belongs to the Section New Sensors, New Technologies and Machine Learning in Water Sciences)


Abstract

Deep learning-based water body extraction methods generally focus on maximizing accuracy while neglecting inference speed, which makes them difficult to use in real-time applications. To address this problem, this paper proposes a lightweight U-shaped network (LU-Net), which improves inference speed while maintaining comparable accuracy. To reduce inference latency, a lightweight decoder block (LDB) is designed, which employs a depthwise separable convolution structure to accelerate the decoding process. To enhance accuracy, a lightweight convolutional block attention module (LCBAM) is designed, which captures water-specific spectral and spatial characteristics through a dual-attention mechanism. To improve multi-scale water boundary extraction, a structurally re-parameterized multi-scale fusion prediction module (SRMFPM) is designed, which integrates multi-scale water boundary information through convolutions of different sizes. Comparative experiments are conducted on the GID and LoveDA datasets, with model performance assessed using the MIoU metric and inference latency. The results demonstrate that LU-Net achieves the lowest GPU latency of 3.1 ms and the second-lowest CPU latency of 36 ms in the experiments. On the GID dataset, LU-Net achieves an MIoU of 91.36%, outperforming the other tested methods. On the LoveDA dataset, LU-Net achieves the second-highest MIoU of 86.32% among the evaluated models, which is 0.08% lower than the top-performing CGNet. Considering both latency and MIoU, LU-Net offers a favorable trade-off between speed and accuracy among all compared networks on the GID and LoveDA datasets.

Keywords: water body extraction; remote sensing images; lightweight; multi-scale; attention module

1. Introduction

Efficient and accurate monitoring of water area variation is of paramount importance for disaster prevention and for the rational planning and utilization of water resources. In recent years, with the rapid advancement of remote sensing technology, remote sensing image-based water body extraction has become a crucial approach for monitoring dynamic changes in water area.

The thresholding method [1,2,3] is the most classical approach to water body extraction in remote sensing imagery; it distinguishes water bodies from other land cover by exploiting their different spectral characteristics. Although the thresholding method is effective for some remote sensing images, its degree of automation is limited by its reliance on manual experience in threshold selection. Furthermore, it generalizes poorly because it is highly sensitive to factors such as image preprocessing techniques and the geographic coverage of the remote sensing data. With the advancement of machine learning, numerous machine learning-based approaches have been introduced to address these problems, including support vector machines (SVMs) [4], principal component analysis (PCA) [5], and clustering methods [6]. These methods are more accurate than the traditional thresholding method. However, they exploit only shallow information in remote sensing images, which inherently limits their extraction performance and noise robustness when processing large-scale datasets.

Deep learning has advanced rapidly and is now widely applied in computer vision domains such as object detection, image classification, and semantic segmentation. It has also provided new solutions for water body extraction. Deep learning-based extraction methods can not only identify water body boundaries more accurately but also show superior robustness and generalization performance in complex scenes. In 2017, Isikdogan et al. [7] designed a convolutional neural network called Deep-WaterMap based on FCN, which effectively distinguishes water from land, snow, rain and clouds. In 2019, Li et al. [8] used the FCN model to extract water bodies from very high spatial resolution (VHR) images, and the results demonstrated that FCN has excellent water body extraction capabilities. In 2021, Yuan et al. [9] designed a multi-channel water body detection network (MC-WBDN) based on ResNet and DeepLabV3+ to make more effective use of multispectral information in remote sensing images. This network exhibited enhanced robustness against variations in lighting and weather conditions. In 2023, Lyu et al. [10] designed the successive attention fusion module (SAFM) and the joint attention module (JAM) to capture spatial details and abstract semantics at different levels simultaneously, and combined these modules with ResNet to propose the multi-scale sequential attention fusion network (MSAFNet).

In recent years, the transformer architecture [11,12], with its strong global information modeling capabilities, has become a research hotspot in image processing. It has demonstrated considerable potential and has gained increasing traction in water body extraction applications. In 2023, Chen et al. [13] proposed a double-branch parallel network by combining ResNet and Swin Transformer. This network leverages the deep feature extraction capabilities of ResNet and the global information fusion capabilities of Swin Transformer, thereby improving the accuracy of water body extraction. Zhao et al. [14] replaced the convolutional layers in the encoder of U-Net [15] with Vision Transformer and incorporated a convolutional block attention module (CBAM) in the decoder, proposing the ViTenc-Unet. This network has demonstrated outstanding performance in water body extraction tasks on the Qinghai–Tibet Plateau. Kang et al. [16] proposed the WaterFormer network by embedding Vision Transformer between two different CNNs, aiming to explore long-range dependencies between low-level spatial information and high-level semantic features.

Deep learning-based water body extraction methods demonstrate excellent performance in terms of accuracy, robustness, and generalization. However, their large parameter count and high computational complexity significantly hinder deployment practicality. This limitation prevents these networks from reaching their full performance potential on devices with constrained computational and storage resources or limited size and power consumption. Furthermore, the high latency of these networks hinders their applicability in real-time applications, such as flood disaster monitoring. Therefore, it is of great practical significance to design a lightweight water extraction network that balances accuracy and efficiency.

In contrast to existing water extraction networks, this study treats real-time inference capability as a primary constraint and pursues enhanced segmentation accuracy under this condition. We investigate the impact of network architectures on inference latency to design low-latency modules that improve extraction accuracy. On this basis, this paper proposes a lightweight U-shaped network (LU-Net) with an encoder–decoder architecture for water body extraction. The proposed network employs the lightweight RepViT [17] as the encoder for rapid feature extraction. Additionally, a lightweight convolutional block attention module (LCBAM) is designed to replace the squeeze-and-excitation (SE) [18] attention module in RepViT to enhance the network’s discriminative capability. A lightweight decoding block (LDB) is designed for rapid decoding, and a structurally re-parameterized multi-scale fusion prediction module (SRMFPM) is designed to improve water boundary extraction. The LDB, together with the SRMFPM, constitutes the decoder of LU-Net. Our work adapts the official codebase of RepViT [17]. The implementation of the network architecture is substantially modified by integrating the proposed LCBAM, LDB, and SRMFPM. Furthermore, the training pipeline is adjusted for our specific task.

The main contributions of this work are summarized as follows:

The CBAM [19] is introduced to simultaneously capture the importance of different channels and spatial regions in feature maps and enhance the network’s ability to identify water bodies. Specifically, the LCBAM is designed to mitigate the impact of CBAM on inference latency.
The LDB is designed to reduce network parameters and computational complexity for improved inference speed and deployment efficiency.
The SRMFPM is designed to capture multi-scale water body features for enhanced prediction accuracy of multi-scale water boundaries.

2. Methods

2.1. Overall Framework of LU-Net

The overall architecture of LU-Net, as illustrated in Figure 1, is primarily composed of three components: encoder, decoder, and skip connections.

2.1.1. Encoder of LU-Net

The encoder extracts multiple hierarchical features from input remote-sensing images through a series of convolution operations. It is built on a RepViT network with enhanced attention mechanisms and comprises a stem layer, RepViT blocks, and downsampling layers. The encoder structure is illustrated in the gray portion of Figure 1.

The stem layer, called Early Convolutions [20], consists of two 3 × 3 convolutions, each with a stride of 2. The first convolution employs 20 filters, while the second utilizes 40 filters. The stem layer can quickly reduce the input image resolution and thus decrease the overall computational complexity of the network.

The RepViT Block adopts a MetaFormer [21] architecture, consisting of a token mixer and a channel mixer. It serves to extract multi-level image features. The token mixer consists of three parallel branches and an LCBAM. The three parallel branches are a 3 × 3 depthwise (DW) convolution, a 1 × 1 DW convolution, and a residual connection. For simplicity, we refer to these three parallel branches collectively as a structurally re-parameterized [22] depthwise (SRDW) convolution. During the inference phase, the 3 × 3 SRDW convolution is transformed into an equivalent single 3 × 3 DW convolution using the structural re-parameterization technique. The channel mixer is a Feed-Forward Network (FFN), consisting of two 1 × 1 convolutional layers and a GELU activation function [23]. The first 1 × 1 convolutional layer expands the channel dimension of the input features and is followed by the GELU activation function. Subsequently, the second 1 × 1 convolutional layer compresses the feature dimensionality back to its original size.
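The branch merging described above can be illustrated with a small NumPy sketch. It is not taken from the RepViT codebase: it assumes the normalization layers have already been folded into the kernels and that biases are zero, and the helper names `depthwise_conv3x3` and `merge_srdw` are hypothetical. The point is only that the 1 × 1 depthwise branch and the identity branch both act on the centre tap of the 3 × 3 kernel, so they can be added into it.

```python
import numpy as np

def depthwise_conv3x3(x, kernel):
    """Depthwise 3x3 cross-correlation with zero padding 1.
    x: (C, H, W), kernel: (C, 3, 3)."""
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for di in range(3):
        for dj in range(3):
            out += kernel[:, di, dj][:, None, None] * xp[:, di:di + H, dj:dj + W]
    return out

def merge_srdw(k3, k1):
    """Fold a 3x3 DW kernel, a 1x1 DW kernel and an identity branch
    into a single 3x3 DW kernel (assumes BN already folded, no bias)."""
    merged = k3.copy()
    merged[:, 1, 1] += k1 + 1.0  # 1x1 branch and identity act on the centre tap
    return merged

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
k3 = rng.standard_normal((4, 3, 3))
k1 = rng.standard_normal(4)

# Training-time form: three parallel branches, summed.
y_train = depthwise_conv3x3(x, k3) + k1[:, None, None] * x + x
# Inference-time form: one depthwise convolution with the merged kernel.
y_infer = depthwise_conv3x3(x, merge_srdw(k3, k1))
```

Under these assumptions the two outputs agree up to floating-point error, which is why the extra branches add no inference cost.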

The downsample layer consists of a RepViT Block, a 3 × 3 DW convolution with a stride of 2, a 1 × 1 convolution, and an FFN. The downsample layer is utilized to downsample the image. It can extract more abstract semantic features for enhanced expressive capacity.

2.1.2. Decoder of LU-Net

The decoder performs upsampling to gradually restore the spatial resolution of the feature maps and finally maps these features to class confidence. It consists of five LDBs and one SRMFPM, as illustrated in the blue portion of Figure 1. The LDB reduces inference latency while preserving outstanding feature decoding capability through its efficient network architectures. The SRMFPM can extract multi-scale information, thereby enhancing the accuracy of water body edge extraction.

2.1.3. Skip Connections of LU-Net

Skip connections are used to directly transmit the features extracted by the encoder to corresponding decoder layers. This mechanism effectively mitigates shallow detail loss to enhance the extraction performance of small water bodies. Furthermore, the skip connections in LU-Net are implemented through element-wise addition rather than channel-wise concatenation. Compared to the channel-wise concatenation approach, this element-wise addition operation can preserve feature channel dimensions to reduce parameter count and computational complexity, which facilitates the lightweight design objective of the network.
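As a rough illustration of why addition is cheaper than concatenation, the sketch below counts the weights of a 1 × 1 convolution that follows the fusion step. The channel count of 80 and the helper names are hypothetical and chosen only for the example; the paper's actual layer widths are not assumed here.

```python
def fused_channels(c_enc, c_dec, mode):
    """Channel count after merging encoder and decoder features."""
    if mode == "add":
        if c_enc != c_dec:
            raise ValueError("element-wise addition needs equal channel counts")
        return c_dec
    if mode == "concat":
        return c_enc + c_dec
    raise ValueError(f"unknown mode: {mode}")

def pointwise_weights(c_in, c_out):
    """Number of weights in a 1x1 convolution (bias ignored)."""
    return c_in * c_out

c = 80  # hypothetical channel width, for illustration only
add_w = pointwise_weights(fused_channels(c, c, "add"), c)
cat_w = pointwise_weights(fused_channels(c, c, "concat"), c)
print(add_w, cat_w)
```

With equal encoder and decoder widths, concatenation doubles the input channels of the next layer and therefore doubles the weights of that pointwise convolution.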

2.2. Description of the Improvements

2.2.1. Lightweight Convolutional Block Attention Module (LCBAM)

In remote sensing imagery, water bodies exhibit highly complex and diverse spectral responses. Some dark water bodies show high spectral similarity to features like shadows, making them difficult to distinguish. Moreover, water bodies also exhibit high spatial variability. Not only do their shapes and boundaries vary significantly, but they also frequently contain internal interfering objects. These factors collectively pose challenges for accurate water body extraction. To address these issues, this study introduces the CBAM to fully leverage the unique spectral and spatial characteristics of water bodies for separating “water” from “non-water”. The CBAM comprises a channel attention module (CAM) and a spatial attention module (SAM). The CAM automatically learns the most discriminative feature channels, thereby enhancing the spectral distinction between water and non-water features and improving the accuracy of spectral discrimination in complex scenarios. The SAM guides the network to focus on critical regions rather than uniformly processing all pixels, thus suppressing interference from irrelevant background information and enhancing the quality of spatial segmentation.

The CBAM enhances segmentation accuracy but introduces significant latency, particularly due to the shared fully connected layer in the CAM. As the average-pooled features and max-pooled features share the same fully connected (FC) layer, the FC operations must be executed twice, which doubles the latency. Additionally, the two FC operations run sequentially, which makes it difficult to mitigate the latency increase by using parallel computing techniques. Table 1 shows the processing time of CAM and SAM when handling a 40 × 128 × 128 input under varying numbers of pooling operations. Doubling the pooling operations in CAM increases latency from 0.018 ms to 0.045 ms, a 2.5-fold increase. In contrast, SAM exhibits only a moderate latency increase from 0.018 ms to 0.033 ms under equivalent conditions. Therefore, we removed the max-pooling operation in the CAM to enhance CBAM’s compatibility with parallel computing architectures such as GPUs, aiming to reduce latency. The modified CBAM is named LCBAM.

The LCBAM consists of two sequential components, as shown in Figure 2. The CAM is enclosed in the yellow solid box, while the SAM is outlined in the purple dashed box.

The CAM comprises three components: an average-pooling operation, an FC layer, and a Sigmoid activation function. Average pooling is used to extract the channel-wise average feature vector $X_{avg}$. The FC layer consists of two linear layers and a ReLU activation function. The first linear layer projects the input feature vector into a lower-dimensional space, and the second layer maps it back to its original dimensionality. The reduction ratio $r$ is a hyperparameter that is set to 0.25 to keep the module lightweight. The ReLU function between the two linear layers provides the CAM with a nonlinear modeling capacity. The Sigmoid activation function maps the output of the FC layer to channel-wise attention weights $W$. The final output $Y$ is computed as the element-wise multiplication of the weights $W$ with the original input $X$. The mathematical expression for this process is as follows:

$$Y = F_{\times}(W, X) \tag{1}$$

$$W = \mathrm{Sigmoid}(\mathrm{FC}(X_{avg})) = \mathrm{Sigmoid}(Q_{2}\,\mathrm{ReLU}(Q_{1} X_{avg})) \tag{2}$$

$$X_{avg} = [x_{avg}^{0}, x_{avg}^{1}, \dots, x_{avg}^{C-1}] \tag{3}$$

$$x_{avg}^{k} = \frac{1}{HW} \sum_{i=0}^{H-1} \sum_{j=0}^{W-1} x_{ij}^{k} \tag{4}$$

where $X \in \mathbb{R}^{C \times H \times W}$ denotes the input to the CAM, $Y \in \mathbb{R}^{C \times H \times W}$ denotes the output, and $F_{\times}$ represents the element-wise multiplication. $W \in \mathbb{R}^{C \times 1}$ is the attention weight vector, FC refers to the fully connected layer, and $Q_{1} \in \mathbb{R}^{C/r \times C}$ and $Q_{2} \in \mathbb{R}^{C \times C/r}$ denote the first and second linear layers, respectively. $X_{avg} \in \mathbb{R}^{C \times 1}$ denotes the channel-wise average feature vector. $x_{avg}^{k} \in \mathbb{R}$ represents the spatial average value of the $k$-th channel in $X$, and $x_{ij}^{k} \in \mathbb{R}$ denotes the pixel value at the $i$-th row and $j$-th column of the $k$-th channel in $X$, $0 \leq k \leq C-1$. $C$, $H$, and $W$ are the number of channels, height, and width of the input $X$, respectively.
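Equations (1)–(4) can be traced in a few lines of NumPy. This is a minimal sketch, not the authors' implementation: biases are omitted, the weights are random, and the hidden width is assumed to be 0.25 × C (i.e., the reduction ratio is read as a multiplicative factor, which is an interpretation on our part). The function name `channel_attention` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, q1, q2):
    """Average-pooling-only channel attention.
    x: (C, H, W); q1: (C_hidden, C); q2: (C, C_hidden)."""
    x_avg = x.mean(axis=(1, 2))                     # Eqs. (3)-(4): one mean per channel
    w = sigmoid(q2 @ np.maximum(q1 @ x_avg, 0.0))   # Eq. (2): FC -> ReLU -> FC -> Sigmoid
    return w[:, None, None] * x, w                  # Eq. (1): rescale every channel

rng = np.random.default_rng(0)
C, H, W = 40, 16, 16
c_hidden = int(C * 0.25)  # assumption: hidden width = 0.25 * C
x = rng.standard_normal((C, H, W))
q1 = 0.1 * rng.standard_normal((c_hidden, C))
q2 = 0.1 * rng.standard_normal((C, c_hidden))

y, w = channel_attention(x, q1, q2)
print(y.shape, w.shape)
```

Because only one pooled vector passes through the FC layer, the layer is evaluated once per forward pass, which is the latency saving the LCBAM targets.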

The SAM comprises an average-pooling operation, a max-pooling operation, a 7 × 7 convolutional layer, and a Sigmoid activation function. The average-pooling and max-pooling operations are applied along the channel axis to extract the spatial average feature matrix $Y_{avg}$ and the spatial max feature matrix $Y_{max}$. The 7 × 7 convolutional layer consists of a 7 × 7 convolution operation and a batch normalization (BN) layer, which are utilized to fuse $Y_{avg}$ and $Y_{max}$. The Sigmoid function maps the fused features to the attention weight matrix $W$. The final output $Z$ is computed as the element-wise multiplication of the matrix $W$ with the input $Y$. The mathematical expression for this process is as follows:

$$Z = F_{\times}(W, Y) \tag{5}$$

$$W = \mathrm{Sigmoid}(\mathrm{Conv7}(\mathrm{Cat}(Y_{avg}, Y_{max}))) \tag{6}$$

$$Y_{avg} = \begin{bmatrix} y_{avg}^{0,0} & y_{avg}^{0,1} & \cdots & y_{avg}^{0,W-1} \\ \vdots & \vdots & \ddots & \vdots \\ y_{avg}^{H-1,0} & y_{avg}^{H-1,1} & \cdots & y_{avg}^{H-1,W-1} \end{bmatrix} \tag{7}$$

$$Y_{max} = \begin{bmatrix} y_{max}^{0,0} & y_{max}^{0,1} & \cdots & y_{max}^{0,W-1} \\ \vdots & \vdots & \ddots & \vdots \\ y_{max}^{H-1,0} & y_{max}^{H-1,1} & \cdots & y_{max}^{H-1,W-1} \end{bmatrix} \tag{8}$$

$$y_{avg}^{ij} = \frac{1}{C} \sum_{k=0}^{C-1} y_{k}^{ij} \tag{9}$$

$$y_{max}^{ij} = \mathrm{Max}(y_{0}^{ij}, y_{1}^{ij}, \dots, y_{C-1}^{ij}) \tag{10}$$

where $Y \in \mathbb{R}^{C \times H \times W}$ denotes the input to the SAM, $Z \in \mathbb{R}^{C \times H \times W}$ denotes the output, and $F_{\times}$ represents the element-wise multiplication. $W \in \mathbb{R}^{H \times W}$ is the attention weight matrix, $\mathrm{Conv7}$ is the 7 × 7 convolutional layer, $\mathrm{Cat}$ denotes concatenation along the channel axis, and $Y_{avg} \in \mathbb{R}^{1 \times H \times W}$ and $Y_{max} \in \mathbb{R}^{1 \times H \times W}$ denote the spatial average feature matrix and the spatial max feature matrix, respectively. $y_{avg}^{ij} \in \mathbb{R}$ and $y_{max}^{ij} \in \mathbb{R}$ denote the channel-wise average and maximum values of the input $Y$ at the $i$-th row and $j$-th column, respectively. $y_{k}^{ij} \in \mathbb{R}$ is the pixel value at the $i$-th row and $j$-th column of the $k$-th channel in $Y$, $0 \leq i \leq H-1$, $0 \leq j \leq W-1$. $C$, $H$, and $W$ are the number of channels, height, and width of the input $Y$, respectively.
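The spatial attention step can be sketched in the same style. The code below follows the textual description (channel-wise average and max pooling, a 7 × 7 convolution, then Sigmoid); it assumes the batch normalization layer is folded into the kernel, omits the bias, and uses a random kernel. `conv2d_same` and `spatial_attention` are hypothetical helper names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d_same(x, kernel):
    """Single-output-channel cross-correlation with zero 'same' padding.
    x: (C_in, H, W), kernel: (C_in, k, k) -> (H, W)."""
    _, h, w = x.shape
    k = kernel.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((h, w))
    for di in range(k):
        for dj in range(k):
            out += np.tensordot(kernel[:, di, dj], xp[:, di:di + h, dj:dj + w], axes=1)
    return out

def spatial_attention(y, kernel):
    """y: (C, H, W); kernel: (2, 7, 7), assumed to include the folded BN."""
    y_avg = y.mean(axis=0, keepdims=True)             # channel-wise average, (1, H, W)
    y_max = y.max(axis=0, keepdims=True)              # channel-wise maximum, (1, H, W)
    stacked = np.concatenate([y_avg, y_max], axis=0)  # Cat along the channel axis
    w = sigmoid(conv2d_same(stacked, kernel))         # (H, W) attention map
    return w[None] * y, w                             # rescale every pixel

rng = np.random.default_rng(0)
y = rng.standard_normal((40, 16, 16))
kernel = 0.1 * rng.standard_normal((2, 7, 7))
z, w = spatial_attention(y, kernel)
print(z.shape, w.shape)
```

The pooled input has only two channels regardless of C, so the 7 × 7 convolution stays cheap even for wide feature maps.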

2.2.2. Lightweight Decoding Block (LDB)

To achieve fast decoding and accelerate the inference speed of the network, a depthwise separable convolutional structure is employed to construct the LDB. Additionally, the RepViT’s multi-scale SRDW convolution is introduced to capture multi-scale features in remote-sensing water bodies, which can enhance the model’s decoding capability. Furthermore, a structurally re-parameterized dilated depthwise (SRADW) convolution is designed to increase the receptive field by improving the SRDW convolution. Increasing the receptive field facilitates the capture of richer contextual information, thereby enhancing segmentation reliability. The SRDW, SRADW convolution, and the depthwise separable convolutional structure are the core components of LDB.

Compared to standard convolution, the depthwise separable convolutional structure is more lightweight. For a standard 3 × 3 convolution that preserves the spatial resolution and channel dimensions of the input, the parameters (Params) and floating point operations (FLOPs) are given by:

$$\begin{aligned} \mathrm{Params} &= K_{h} \times K_{w} \times C_{in} \times C_{out} + C_{out} \\ &= 9 C_{in}^{2} + C_{in} \end{aligned} \tag{11}$$

$$\begin{aligned} \mathrm{FLOPs} &= K_{h} \times K_{w} \times C_{in} \times C_{out} \times H \times W \\ &= 9 C_{in}^{2} H W \end{aligned} \tag{12}$$

In contrast, for a 3 × 3 depthwise separable convolution under the same configuration, the Params and FLOPs are given by:

$$\begin{aligned} \mathrm{Params} &= K_{h} \times K_{w} \times C_{in} + C_{in} + C_{in} \times C_{out} + C_{out} \\ &= C_{in}^{2} + 11 C_{in} \end{aligned} \tag{13}$$

$$\begin{aligned} \mathrm{FLOPs} &= K_{h} \times K_{w} \times C_{in} \times H \times W + C_{in} \times C_{out} \times H \times W \\ &= C_{in}^{2} H W + 9 C_{in} H W \end{aligned} \tag{14}$$

where $K_{h} = K_{w} = 3$ denotes the convolutional kernel size, and $C_{in} = C_{out}$ represents the channel dimensions of the input and output. $H$ and $W$ are the height and width of the output, respectively.

According to the above equations, the Params and FLOPs of depthwise separable convolution are approximately 1/8 to 1/9 of those of standard convolution when the channel dimension significantly exceeds 11 (the ratio tends to 1/9 as the channel dimension grows). This reduction in parameters and computational cost improves model deployment flexibility and inference speed.
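The counts in Equations (11)–(14) are easy to evaluate for concrete sizes. The sketch below simply encodes those four formulas; the 40-channel, 128 × 128 configuration is borrowed from the Table 1 input size purely as an example, and the function names are hypothetical.

```python
def standard_conv_cost(c_in, c_out, h, w, k=3):
    """Params and FLOPs of a standard k x k convolution, Eqs. (11)-(12)."""
    params = k * k * c_in * c_out + c_out
    flops = k * k * c_in * c_out * h * w
    return params, flops

def separable_conv_cost(c_in, c_out, h, w, k=3):
    """Params and FLOPs of a k x k depthwise separable convolution, Eqs. (13)-(14)."""
    params = k * k * c_in + c_in + c_in * c_out + c_out
    flops = k * k * c_in * h * w + c_in * c_out * h * w
    return params, flops

std_params, std_flops = standard_conv_cost(40, 40, 128, 128)
sep_params, sep_flops = separable_conv_cost(40, 40, 128, 128)
print(std_params, sep_params)  # 14440 2040
print(std_flops, sep_flops)    # 235929600 32112640
```

For this example the separable layer needs 2040 parameters instead of 14,440; the saving grows with the channel count because the 11 C_in term becomes negligible next to C_in².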

The LDB is shown in Figure 3 and comprises three components. The first component, designed to fuse abstract semantic features, comprises a 3 × 3 SRDW convolution and a pointwise convolutional layer with residual connection. The 3 × 3 SRDW convolution consists of three parallel branches: a 3 × 3 DW convolution, a 1 × 1 DW convolution, and a residual connection. The pointwise convolutional layer, which consists of a pointwise convolution and a normalization layer, maintains identical channel dimensions between its input and output. The residual connection added to the pointwise convolutional layer facilitates gradient propagation, enabling the model to converge more easily during training.

The second component of the LDB is designed to not only fuse semantic features but also compress the channel dimensions. It includes a 3 × 3 SRADW convolution and a dimensionality-reduced pointwise convolutional layer. The dilation rate of the SRADW convolution is set to 2 to avoid the gridding effect. The pointwise convolutional layer is used to compress the feature channels and aggregate cross-channel information simultaneously.

The third component of the LDB is a 2 × 2 depthwise transposed convolution that upsamples the feature maps. It does not use the depthwise separable convolution structure because the depthwise transposed convolution quadruples the number of spatial positions in the feature maps, which would increase the computational overhead of a subsequent pointwise convolution. Such an increase is detrimental to the lightweight design of the network.

The GELU activation function is employed after pointwise convolutional layers in the LDB’s first two components to enhance the nonlinear expressive capability of the LDB. Compared to the widely adopted ReLU activation function, GELU partially mitigates the dead neuron issue. Additionally, its capability to generate negative values makes the model learn richer feature representations.
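A scalar comparison makes the difference between the two activations concrete. The sketch uses the exact GELU definition, x·Φ(x) with Φ the standard normal CDF; whether the network uses this exact form or a tanh approximation is not stated in the text, so treat that choice as an assumption.

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for value in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(value, relu(value), round(gelu(value), 4))
```

For negative inputs ReLU outputs exactly zero, whereas GELU returns a small negative value (about -0.159 at x = -1), so such units still pass a gradient.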

The LDB is transformed into a lighter structure during the inference phase via structural reparameterization techniques, further accelerating decoding speed. The modified structure is illustrated in the right part of Figure 3. Specifically, the 3 × 3 SRDW convolution is converted to a standard 3 × 3 depthwise convolution. The pointwise convolution with a residual connection is simplified to a standard pointwise convolution, and the 3 × 3 SRADW convolution is transformed into a standard 3 × 3 dilated depthwise convolution. The dimensionality-reduced pointwise convolution and the 2 × 2 depthwise transposed convolution remain unchanged.
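The simplification of the pointwise convolution with a residual connection follows from linearity: W·x + x = (W + I)·x. The NumPy sketch below checks this identity. It assumes the normalization layer is already folded into the weights, that there is no bias, and that the residual is added before the activation; the helper names are hypothetical.

```python
import numpy as np

def pointwise(x, weight):
    """1x1 convolution. x: (C_in, H, W), weight: (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', weight, x)

def merge_residual_pointwise(weight):
    """Fold an identity shortcut into a square pointwise weight matrix."""
    return weight + np.eye(weight.shape[0])

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 6, 6))
weight = rng.standard_normal((8, 8))

y_train = pointwise(x, weight) + x                          # training form: conv + shortcut
y_infer = pointwise(x, merge_residual_pointwise(weight))    # inference form: one conv
```

The merge is only possible because this layer keeps the channel count unchanged, which makes the weight matrix square.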

2.2.3. Structurally Re-Parameterized Multi-Scale Fusion Prediction Module (SRMFPM)

Water body boundaries in remote sensing imagery are inherently complex, morphologically variable, and exhibit strong multi-scale characteristics. Some transition zones between water bodies and surrounding land covers are sharp, clear, and spatially narrow, while others are gradual, blurred, and broad. Furthermore, the boundaries of large water bodies often extend over long distances and may display multiple transition types across different segments. This multi-scale variability poses significant challenges for water boundary extraction. To address these challenges, the SRMFPM is proposed, which employs a parallel multi-branch architecture with convolutional kernels of varying sizes to extract and integrate multi-scale water features, thereby enhancing the reliability of water edge prediction.

The structure of SRMFPM is shown in Figure 4 and consists of four convolutional layers with different scales, a residual connection, and a Leaky ReLU activation function. The four convolutional layers are designed with different kernel sizes, specifically 7 × 7, 5 × 5, 3 × 3, and 1 × 1. Each convolutional layer consists of a convolutional operation and a batch normalization layer. Multi-scale convolutional layers enable the model to capture potential relationships among pixels at different spatial ranges, thereby providing richer contextual information to improve classification accuracy. The maximum kernel size is set to 7 × 7 to maintain the model’s lightweight characteristics. Compared to 9 × 9 or larger convolutions, 7 × 7-sized convolutions demonstrate superior optimization in terms of computational efficiency. Additionally, the residual connection is connected in parallel with the four convolutional layers, preserving the original input information to enhance predictive capability. The introduced Leaky ReLU function can retain negative activations, which ensures effective gradient propagation and thereby improves the model’s training stability and performance.

During inference, the structural reparameterization techniques are used to transform the four convolutional layers and the residual connection into an equivalent 7 × 7 convolutional operation to achieve lower latency. The transformed SRMFPM, as illustrated in the lower section of Figure 4, achieves end-to-end mapping from input features to final classification results with only a single 7 × 7 convolution and a Leaky ReLU activation function.
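The multi-branch merge can be sketched by zero-padding each smaller kernel to 7 × 7 and summing. This is an illustrative NumPy version, not the authors' code: it assumes every batch normalization layer has been folded into its kernel, biases are omitted, and the input and output channel counts are equal so that the residual branch can be expressed as an identity kernel. `conv2d_same` and `merge_branches` are hypothetical names.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Cross-correlation with zero 'same' padding.
    x: (C_in, H, W), kernel: (C_out, C_in, k, k) -> (C_out, H, W)."""
    c_out, _, k, _ = kernel.shape
    _, h, w = x.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((c_out, h, w))
    for di in range(k):
        for dj in range(k):
            out += np.einsum('oc,chw->ohw', kernel[:, :, di, dj],
                             xp[:, di:di + h, dj:dj + w])
    return out

def merge_branches(kernels, channels):
    """Sum 7x7, 5x5, 3x3 and 1x1 kernels plus an identity branch into one 7x7 kernel."""
    merged = np.zeros((channels, channels, 7, 7))
    for kern in kernels:
        p = (7 - kern.shape[-1]) // 2
        merged += np.pad(kern, ((0, 0), (0, 0), (p, p), (p, p)))  # centre the small kernel
    idx = np.arange(channels)
    merged[idx, idx, 3, 3] += 1.0  # identity branch: 1 at the centre tap of each channel
    return merged

rng = np.random.default_rng(0)
channels = 2
x = rng.standard_normal((channels, 10, 10))
kernels = [rng.standard_normal((channels, channels, k, k)) for k in (7, 5, 3, 1)]

y_train = sum(conv2d_same(x, kern) for kern in kernels) + x       # five parallel branches
y_infer = conv2d_same(x, merge_branches(kernels, channels))       # one 7x7 convolution
```

Under these assumptions both forms give the same output, so the extra branches only cost time during training.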

3. Experimental Results and Discussion

3.1. Datasets

To evaluate the effectiveness of all models, experiments were conducted on the GID [24] and LoveDA [25] datasets. To apply the GID dataset to our experiments, we randomly selected 61 remote sensing images from its large-scale classification set. Each image was cropped into 512 × 512 patches, with the corresponding label maps reclassified into two categories: water bodies and non-water bodies. Furthermore, some images with excessive non-water body areas were excluded from the dataset to maintain a balance between positive and negative samples. The final dataset comprised 10,218 remote sensing images of size 512 × 512, divided into 8211 training images, 1003 validation images, and 1004 test images.

Following the processing approach of the GID dataset, 4191 images were selected from the LoveDA dataset, cropped into 512 × 512 patches, and further relabeled into two classes. Additionally, some images exhibiting excessive non-water coverage were discarded. The final curated dataset consists of 8694 512 × 512-sized remote sensing images, with 6816 images allocated to the training set, 939 to the validation set, and 939 to the test set.
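The preprocessing described in this subsection (cropping, binary relabeling, and discarding mostly non-water tiles) could look roughly like the NumPy sketch below. Several details are assumptions that the text does not specify: the tiles are taken without overlap, incomplete border tiles are dropped, the water class id (`water_id`) and the minimum water fraction (`min_water_fraction`) are hypothetical placeholders.

```python
import numpy as np

def tile_image(image, size=512):
    """Split an (H, W, ...) array into non-overlapping size x size tiles,
    dropping incomplete tiles at the right and bottom borders (assumption)."""
    h, w = image.shape[:2]
    return [image[i:i + size, j:j + size]
            for i in range(0, h - size + 1, size)
            for j in range(0, w - size + 1, size)]

def to_binary_mask(label, water_id):
    """Relabel a multi-class label map into water (1) and non-water (0)."""
    return (label == water_id).astype(np.uint8)

def keep_tile(mask, min_water_fraction):
    """Keep a tile only if enough of it is water (threshold is hypothetical)."""
    return mask.mean() >= min_water_fraction

image = np.zeros((1100, 1600, 4), dtype=np.uint8)  # dummy 4-channel scene
tiles = tile_image(image)
print(len(tiles), tiles[0].shape)
```

The same tiling function would be applied to the label map so that each image tile stays aligned with its mask.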

3.2. Evaluation Metrics

3.2.1. Metrics of Extraction Performance Evaluation

To evaluate the extraction performance of all models in this experiment, a set of metrics is employed, including Mean Intersection over Union (MIoU), overall accuracy (OA), Precision, Recall, and Kappa. The Kappa coefficient measures the agreement between predictions and ground truth. According to the Landis & Koch benchmark [26], Kappa values indicate: <0.00 (Poor), 0.00–0.20 (Slight), 0.21–0.40 (Fair), 0.41–0.60 (Moderate), 0.61–0.80 (Substantial), and 0.81–1.00 (Almost Perfect). Additionally, this study employs the Receiver Operating Characteristic (ROC) curve for a further evaluation of model performance, which plots the False Positive Rate (FPR) on the x-axis against the True Positive Rate (TPR) on the y-axis. A curve that hugs the upper-left corner indicates superior prediction performance.

The calculation formulas for these metrics are defined as follows:

$$\mathrm{MIoU} = \frac{1}{2} \left( \frac{TP}{FN + FP + TP} + \frac{TN}{FN + FP + TN} \right) \tag{15}$$

$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN} \tag{16}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{17}$$

$$\mathrm{TPR} = \mathrm{Recall} = \frac{TP}{TP + FN} \tag{18}$$

$$\mathrm{Kappa} = \frac{\mathrm{OA} - P_{e}}{1 - P_{e}} \tag{19}$$

$$P_{e} = \frac{(TP + FP)(TP + FN) + (TN + FP)(TN + FN)}{(TP + TN + FP + FN)^{2}} \tag{20}$$

$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{21}$$

where $TP$ refers to the number of pixels correctly classified as a water body; $TN$ is the number of pixels correctly predicted as a non-water body; $FP$ denotes the number of pixels incorrectly classified as a water body; and $FN$ represents the number of pixels incorrectly classified as a non-water body. Figure 5 provides a visual illustration of $TP$, $FN$, $FP$, and $TN$.
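Equations (15)–(21) translate directly into code. The sketch below computes all metrics from the four confusion counts; the counts in the example are made up for illustration and are not results from the paper, and the function name `water_metrics` is hypothetical.

```python
def water_metrics(tp, tn, fp, fn):
    """Binary water/non-water metrics from pixel counts, Eqs. (15)-(21)."""
    total = tp + tn + fp + fn
    oa = (tp + tn) / total
    pe = ((tp + fp) * (tp + fn) + (tn + fp) * (tn + fn)) / total ** 2
    return {
        "MIoU": 0.5 * (tp / (tp + fp + fn) + tn / (tn + fp + fn)),
        "OA": oa,
        "Precision": tp / (tp + fp),
        "Recall": tp / (tp + fn),
        "Kappa": (oa - pe) / (1.0 - pe),
        "FPR": fp / (fp + tn),
    }

# Made-up counts, for illustration only.
metrics = water_metrics(tp=40, tn=50, fp=5, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Sweeping the decision threshold on the predicted water probability and recording (FPR, Recall) at each step yields the points of the ROC curve.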

3.2.2. Metrics of Lightweight Performance Evaluation

Three metrics, Params, FLOPs, and Latency, are adopted to assess the lightweight performance of all models. Params refers to the total count of learnable parameters in the model. FLOPs represents the total number of multiply-accumulate operations during model inference. Latency denotes the time required for the model to complete a single inference. Additionally, to provide a more comprehensive evaluation of the model’s lightweight performance, the latency is measured separately on both GPU and CPU.
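One common way to obtain a per-inference latency is to time repeated calls after a warm-up phase. The harness below is a generic sketch using only the Python standard library; it is not the measurement protocol of this paper, and the warm-up count, run count, and use of the median are all assumptions. On a GPU, the device would additionally have to be synchronized before each timestamp is read.

```python
import statistics
import time

def measure_latency_ms(fn, warmup=10, runs=100):
    """Median wall-clock time of fn() in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn()  # warm-up: caches, lazy initialization, clock ramp-up
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)

# Stand-in workload; replace with a single forward pass of the model.
latency = measure_latency_ms(lambda: sum(range(10_000)), warmup=5, runs=20)
print(f"{latency:.3f} ms")
```

Reporting the median rather than the mean makes the figure less sensitive to occasional scheduling hiccups.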

3.3. Experimental Details

In our study, experiments were conducted on an NVIDIA RTX 4070 Ti SUPER GPU and an AMD 7500F CPU with the following configuration: a batch size of 32, a maximum of 100 epochs, and an initial learning rate of 0.001. The AdamW optimizer was used for training, with a weight decay coefficient of 0.025. The binary cross-entropy (BCE) loss was employed as the objective function for optimization.

3.4. Comparative Experiments

To verify the performance of our approach, we compared it with six networks: EfficientNet-U, ShuffleNetV2-U, MobileNetV3-U, ENet [27], ERFNet [28], and CGNet [29]. EfficientNet-U, ShuffleNetV2-U, and MobileNetV3-U all adopt a U-shaped architecture. Their respective encoders are EfficientNet-B0 [30], ShuffleNetV2 0.5× [31], and MobileNetV3-Large [32]. Their decoders share a unified progressive reconstruction architecture, in which each decoding stage sequentially applies two 3 × 3 convolutional layers for feature fusion, followed by a 2 × 2 transposed convolution layer to restore resolution. The three networks, ENet, ERFNet, and CGNet, are lightweight networks specifically designed for semantic segmentation.

3.4.1. Lightweight Performance Comparison Results

In the lightweight experiments, 4-channel images are used to measure latency and computational cost. The results are summarized in Table 2. Although LU-Net has more parameters and higher FLOPs than ENet, its latency is lower by 3.4 ms on the GPU and 11 ms on the CPU due to its more efficient network architecture. Compared to EfficientNet-U, LU-Net achieves latency reductions of 1.2 ms on the GPU and 29 ms on the CPU. Compared to ShuffleNetV2-U, which has the lowest FLOPs, LU-Net exhibits higher CPU latency but achieves lower GPU latency. Relative to MobileNetV3-U, which has comparable computational complexity and parameter count, LU-Net reduces the latency by 0.3 ms on the GPU and 3 ms on the CPU. Against CGNet, LU-Net provides latency reductions of 2.4 ms on the GPU and 32 ms on the CPU thanks to its fewer branches during inference. Furthermore, LU-Net achieves latency reductions of 0.8 ms on the GPU and 49 ms on the CPU in comparison to ERFNet.
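The latency differences quoted above follow directly from Table 2. The short sketch below recomputes them; the latency values are copied from Table 2, while the dictionary layout and the function name `latency_savings` are illustrative.

```python
# (GPU ms, CPU ms) per network, copied from Table 2.
LATENCY_MS = {
    "ENet": (6.5, 47),
    "EfficientNet-U": (4.3, 65),
    "ShuffleNetV2-U": (3.6, 29),
    "MobileNetV3-U": (3.4, 39),
    "CGNet": (5.5, 68),
    "ERFNet": (3.9, 85),
    "LU-Net": (3.1, 36),
}


def latency_savings(reference="LU-Net"):
    """Return {network: (gpu_saving, cpu_saving)} relative to the reference.

    Positive values mean the reference network is faster.
    """
    ref_gpu, ref_cpu = LATENCY_MS[reference]
    return {
        name: (round(gpu - ref_gpu, 1), cpu - ref_cpu)
        for name, (gpu, cpu) in LATENCY_MS.items()
        if name != reference
    }


if __name__ == "__main__":
    for name, (gpu, cpu) in latency_savings().items():
        print(f"{name:15s} GPU {gpu:+.1f} ms  CPU {cpu:+d} ms")
```

The only negative entry is the CPU value for ShuffleNetV2-U, the one case in which LU-Net is slower.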

3.4.2. Extraction Performance Comparison Results on the GID Dataset

Table 3 presents the experimental results of various networks on the GID dataset. The results demonstrate that LU-Net achieves the best performance on the three metrics that evaluate global classification quality and categorical consistency: MIoU at 91.36%, OA at 96.71%, and Kappa coefficient at 0.9083. Moreover, LU-Net also demonstrates strong performance on metrics that focus on positive-class classification. It achieves a Precision of 92.24%, second only to MobileNetV3-U, and a Recall of 94.04%, the highest among all networks.
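As a reminder of how these five metrics relate to one another, the sketch below computes them from the four entries of a binary confusion matrix. It assumes the standard definitions: MIoU as the mean of the water and non-water IoU, and Cohen's kappa for the Kappa coefficient. The function name `binary_metrics` and the toy counts are illustrative and are not taken from the experiments.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute MIoU, OA, Precision, Recall and Kappa from pixel counts."""
    n = tp + fp + fn + tn
    iou_water = tp / (tp + fp + fn)
    iou_background = tn / (tn + fp + fn)
    oa = (tp + tn) / n
    # Chance agreement expected from the row and column marginals.
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return {
        "MIoU": (iou_water + iou_background) / 2,
        "OA": oa,
        "Precision": tp / (tp + fp),
        "Recall": tp / (tp + fn),
        "Kappa": (oa - pe) / (1 - pe),
    }


if __name__ == "__main__":
    # Toy counts chosen for easy mental arithmetic.
    for name, value in binary_metrics(tp=40, fp=10, fn=10, tn=40).items():
        print(f"{name}: {value:.4f}")
```

MIoU, OA, and Kappa depend on both classes, whereas Precision and Recall are computed for the water class only, which is why the text treats the two groups separately.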

Figure 6 presents the ROC curves of various networks on the GID. As shown in the figure, the ROC curve of LU-Net is closest to the top-left corner compared to other networks. These results indicate the superior performance of LU-Net over other methods on the GID.
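For context, an ROC curve is obtained by sweeping a decision threshold over the predicted water probabilities and recording the false positive rate and true positive rate at each step. The sketch below is a minimal pure-Python version under the simplifying assumption that no two scores are tied; the function names and the toy scores are illustrative.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping a threshold over scores.

    Assumes no tied scores, for brevity.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    # Visit samples from the most to the least confident water prediction.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


if __name__ == "__main__":
    pts = roc_points([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
    print(pts, auc(pts))
```

A curve that hugs the top-left corner reaches a high true positive rate at a low false positive rate, which corresponds to a larger area under the curve.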

Compared to other U-shaped lightweight networks, such as ShuffleNetV2-U, LU-Net with a more efficient architecture achieves varying degrees of improvement in MIoU, OA, Recall, and Kappa, and is only slightly lower than MobileNetV3-U, by 0.09%, in Precision. Compared to the lightweight networks specifically designed for semantic segmentation, such as ERFNet, LU-Net achieves improvements across all metrics. Even compared to CGNet, the best performer among them, LU-Net improves on every metric: 0.36% higher in MIoU, 0.13% higher in OA, 0.0041 higher in Kappa, 0.58% higher in Precision, and 0.04% higher in Recall.

Figure 7 presents the visualization results for various networks on the GID dataset. LU-Net outperforms the other networks in identifying fine-scale and slender water bodies. This advantage primarily stems from its multi-scale architecture, in which multi-scale convolutions effectively capture the features of minute water bodies and thereby enhance recognition performance. For the slender water bodies in row 4, the upper region of row 7, and row 9, LU-Net produces the best identification results. MobileNetV3-U and EfficientNet-U are second best, while ERFNet, ShuffleNetV2-U, ENet, and CGNet show significant breaks in their results. For the small water bodies in rows 2 and 8, LU-Net achieves more complete segmentation. The other networks exhibit varying degrees of omission, with ERFNet and MobileNetV3-U missing nearly all of the small water bodies.

In addition to having a better recognition effect on fine-scale water bodies, the LU-Net also performs more effectively in identifying water bodies along image edges. As shown in rows 1 and 6, LU-Net achieves more complete segmentation of water bodies within the red circles compared to other networks.

Furthermore, LU-Net exhibits enhanced discrimination between water bodies and non-water regions due to the incorporation of LCBAM. As shown in the red-circled regions in the middle part of row 7, ERFNet, ShuffleNetV2-U, ENet, MobileNetV3-U, and EfficientNet-U misclassify the unmarked dark green background as water. In contrast, LU-Net avoids significant false detections. The LU-Net also demonstrates superior adaptability in some challenging scenarios. For algae-covered fishponds in row 3 and densely distributed small fishponds in row 5, LU-Net achieves more accurate and complete segmentation than other networks.

3.4.3. Extraction Performance Comparison Results on the LoveDA Dataset

Table 4 presents the results of various networks on the LoveDA dataset. The results indicate that LU-Net achieves competitive performance on LoveDA, with 86.32% MIoU, 95.58% OA, 0.8481 Kappa, 89.37% Precision, and 85.83% Recall. Compared to the top-performing CGNet, LU-Net shows only minor performance drops, with the largest decrease not exceeding 0.08%. Compared to EfficientNet-U, LU-Net exhibits 0.03% lower OA and 1.29% lower Precision but achieves 0.12% higher MIoU, 1.55% higher Recall, and 0.0017 higher Kappa. Overall, LU-Net is superior to EfficientNet-U in water extraction performance. Moreover, LU-Net outperforms the remaining four networks across all evaluation metrics.
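The per-metric gaps quoted above can be reproduced from Table 4 with a few lines of arithmetic. In the sketch below, the three rows are copied from Table 4; the names `LOVEDA`, `METRICS`, and `gap` are illustrative.

```python
METRICS = ("MIoU", "OA", "Precision", "Recall", "Kappa")

# Rows copied from Table 4 (percentages, except Kappa).
LOVEDA = {
    "EfficientNet-U": (86.20, 95.61, 90.66, 84.28, 0.8464),
    "CGNet": (86.40, 95.63, 89.44, 85.84, 0.8489),
    "LU-Net": (86.32, 95.58, 89.37, 85.83, 0.8481),
}


def gap(network, reference="LU-Net"):
    """Per-metric difference reference minus network, rounded to 4 decimals."""
    return {
        m: round(r - v, 4)
        for m, r, v in zip(METRICS, LOVEDA[reference], LOVEDA[network])
    }


if __name__ == "__main__":
    for other in ("CGNet", "EfficientNet-U"):
        print(other, gap(other))
```

Negative values mark metrics on which LU-Net trails the other network; against CGNet the largest shortfall is the 0.08% in MIoU.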

Figure 8 presents the ROC curves of various networks on the LoveDA dataset. As shown in the figure, LU-Net’s ROC curve ranks second only to CGNet in the low FPR range. As the FPR increases, LU-Net’s ROC curve outperforms other networks. These results demonstrate that LU-Net achieves the second-best performance on the LoveDA dataset, slightly inferior to CGNet.

Figure 9 presents the visualization results for various networks on the LoveDA dataset. The results indicate that LU-Net performs better in identifying water bodies than other networks.

For the reflective water surface and the small water body in row 1, LU-Net achieves the best recognition. For the elongated water body in the upper-left section of row 5, only LU-Net and CGNet achieve near-complete segmentation, while other networks, especially ShuffleNetV2-U and EfficientNet-U, show almost no detection. The LU-Net also achieves optimal extraction performance in some challenging scenarios. As illustrated by the semi-transparent water body in the upper red circle of row 6 and the algae-covered water body in row 7, LU-Net provides more complete segmentation than other networks.

Furthermore, LU-Net shows a greater capability to discriminate between water bodies and other land cover types due to the incorporation of LCBAM. For the farmland and wasteland in rows 2 and 4, only LU-Net achieved accurate segmentation of these regions, while other networks showed varying degrees of misclassification. For the building shadows in the red circle of row 3, LU-Net, ERFNet, EfficientNet-U, and CGNet produced no false positives, whereas the other three networks misclassified the shadows as water bodies.

Considering both computational efficiency and extraction accuracy, LU-Net achieves the best overall performance among all evaluated networks. Compared to ShuffleNetV2-U, LU-Net achieves lower GPU latency and better extraction performance on the two datasets, despite exhibiting higher CPU inference time. Compared to CGNet, LU-Net achieves faster inference speed and better extraction performance on the GID dataset, although it shows slightly lower accuracy on the LoveDA dataset. Moreover, LU-Net demonstrates significant advantages in both extraction performance and latency compared to the remaining networks.

3.5. Ablation Study

To analyze the effectiveness of the proposed modules, we first constructed a baseline network. Its encoder is a RepViT network, and its decoder comprises multiple blocks, each consisting of two 3 × 3 convolution layers and one 2 × 2 transposed convolution layer. Then, we developed various network variants by incrementally adding or replacing modules on this baseline. Finally, we evaluated various performance metrics for these variants. The ablation study follows the same experimental setup as the comparison experiments.

3.5.1. Performance Metrics Results

Table 5 presents the lightweight performance results of various network variants. Table 6 and Table 7 present the extraction performance results on the GID and LoveDA datasets, respectively.

The results show that introducing LDB reduces GPU latency by 0.5 ms and CPU latency by 7 ms for the baseline network. Moreover, LDB can enhance the network’s water extraction performance by leveraging its multi-scale feature fusion and large receptive field. On the GID dataset, the LDB increases the baseline’s MIoU, OA, and Kappa by 0.07%, 0.03%, and 0.0007, respectively. On the LoveDA dataset, the improvements in the same metrics are 0.24%, 0.09%, and 0.0030, respectively.
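The LDB is described earlier in the paper as built on a depthwise separable convolution structure. To illustrate why that structure tends to reduce Params and FLOPs, the sketch below compares the cost of a standard k × k convolution with that of a depthwise convolution followed by a 1 × 1 pointwise convolution. The channel counts and feature-map size are hypothetical, biases are ignored, and the functions are not the LDB's actual configuration.

```python
def standard_conv_cost(c_in, c_out, k, h, w):
    """Parameters and multiply-accumulates of a k x k convolution (no bias).

    h and w are the output height and width.
    """
    params = k * k * c_in * c_out
    return params, params * h * w


def depthwise_separable_cost(c_in, c_out, k, h, w):
    """Cost of a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    params = depthwise + pointwise
    return params, params * h * w


if __name__ == "__main__":
    # Hypothetical layer: 64 -> 64 channels, 3 x 3 kernel, 32 x 32 output.
    std = standard_conv_cost(64, 64, 3, 32, 32)
    sep = depthwise_separable_cost(64, 64, 3, 32, 32)
    print("standard :", std)
    print("separable:", sep)
    print(f"reduction: {std[0] / sep[0]:.2f}x")
```

For this hypothetical layer the separable form needs roughly one eighth of the parameters and multiply-accumulates, although measured latency also depends on memory access and hardware support.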

The LCBAM improves the network’s extraction performance through its superior ability to focus on critical features, despite increasing GPU latency by 0.5 ms and CPU latency by 3 ms. On the GID dataset, LCBAM improves MIoU by 0.11%, OA by 0.04%, Kappa by 0.0012, and Precision by 0.35%, although Recall drops by 0.17%. On the LoveDA dataset, the LCBAM boosts MIoU, OA, and Kappa by 0.23%, 0.10%, and 0.0029, respectively.

The SRMFPM also enhances segmentation performance through multi-scale prediction, although it slightly increases GPU latency by 0.1 ms and CPU latency by 2 ms. On the GID dataset, SRMFPM improves MIoU, OA, and Kappa by 0.06%, 0.02%, and 0.0007, respectively. Notably, it increases Precision by 0.09% with no drop in Recall. On the LoveDA dataset, SRMFPM improves MIoU, OA, Kappa, and Recall by 0.08%, 0.01%, 0.0009, and 0.57%, respectively, while only reducing Precision by 0.42%.
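The ablation attributes to each module the change between consecutive variants. The sketch below reproduces that bookkeeping for GPU latency, CPU latency, and GID MIoU; the numbers are copied from Tables 5 and 6, while the list layout and the function name `stepwise_deltas` are illustrative.

```python
# Each row: (module added, GPU latency in ms, CPU latency in ms, GID MIoU in %).
# Values copied from Tables 5 and 6.
STEPS = [
    ("(baseline)", 3.0, 38, 91.12),
    ("LDB", 2.5, 31, 91.19),
    ("LCBAM", 3.0, 34, 91.30),
    ("SRMFPM", 3.1, 36, 91.36),
]


def stepwise_deltas(steps):
    """Attribute to each added module the change relative to the previous row."""
    deltas = {}
    for previous, current in zip(steps, steps[1:]):
        module = current[0]
        deltas[module] = tuple(
            round(after - before, 2)
            for before, after in zip(previous[1:], current[1:])
        )
    return deltas


if __name__ == "__main__":
    for module, (gpu, cpu, miou) in stepwise_deltas(STEPS).items():
        print(f"{module:7s} GPU {gpu:+.1f} ms  CPU {cpu:+d} ms  MIoU {miou:+.2f}%")
```

Only the LDB lowers latency; LCBAM and SRMFPM trade a small latency increase for accuracy, which is consistent with the discussion above.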

3.5.2. Visualization Results Analysis

Figure 10 presents the visualization results of network variants on the GID dataset. The results show that progressive integration of the proposed modules into the baseline leads to noticeable improvements in fine-scale water body identification, false positive suppression, and edge extraction.

Replacing the baseline decoder blocks with LDBs significantly improves the identification of small water bodies, thanks to their multi-scale characteristics. As shown in the top red-circled regions of row 1, the small fish ponds are identified more completely after introducing LDBs. Likewise, the slender bridge in row 4 and the narrow river in row 5 are also detected with greater completeness.

The LCBAM enhances the model’s ability to capture subtle differences between the water and background by enhancing water body features. Thus, the integration of LCBAM into the baseline-LDB leads to a notable reduction in false detections. As shown in rows 2 and 3, the extraction results of baseline-LDB-LCBAM are superior to those of baseline-LDB, with almost no false detections in the paddy fields within the red-circled regions.

The proposed SRMFPM further enhances the network’s extraction performance. As shown in row 6, the identification of non-water regions is more accurate in LU-Net than in baseline-LDB-LCBAM. Additionally, SRMFPM enhances the accuracy and smoothness of water body edge segmentation. As illustrated in Figure 11, the edge segmentation results produced by LU-Net are closer to the ground truth image than those of baseline-LDB-LCBAM.

Figure 12 presents the visualization results of network variants on the LoveDA dataset. As the proposed modules are added progressively, the network demonstrates improved performance in water body extraction.

The integration of LDB enhances the performance in identifying algae-covered water bodies. As shown in rows 1 and 2, the baseline fails to extract many algae-covered water bodies, while the baseline-LDB captures much more of these areas. Moreover, LDB improves the network’s capability to extract small water bodies that are located near image boundaries. As shown in rows 4 and 6, the extraction performance of baseline-LDB is superior to that of the baseline network.

The LCBAM enhances the network’s segmentation accuracy by amplifying the feature differences between water and background. The baseline-LDB produces varying degrees of false detections for the agricultural land in row 3 and the forested area in row 5. However, the baseline-LDB-LCBAM significantly reduces these false positives.

The introduction of SRMFPM further improves the network’s extraction performance. As illustrated in row 5, LU-Net produces fewer false positives than baseline-LDB-LCBAM. Moreover, adding SRMFPM also improves the recognition of water body edges on the LoveDA dataset. As illustrated in Figure 13, the network with SRMFPM produces edge segmentation that is closer to the ground truth.

4. Conclusions

This research proposes a lightweight U-shaped water body extraction network named LU-Net, aiming to meet the requirements for real-time extraction and deployment convenience. The LU-Net adopts an encoder–decoder architecture, where RepViT serves as the encoder to enable efficient feature extraction, and the LDB is designed as the decoder to support fast decoding. Furthermore, two key components—LCBAM and SRMFPM—are designed to enhance this network’s discriminative capability and ability to identify water body boundaries. The experimental results demonstrate that LU-Net achieves the best or near-best extraction performance on the GID and LoveDA datasets. Moreover, LU-Net achieves the lowest latency on GPU and the second-lowest latency on CPU, indicating its efficiency across different computing platforms. Considering all aspects, LU-Net strikes the optimal trade-off between extraction accuracy and inference speed among all tested methods. However, this study was primarily validated on the GID and LoveDA datasets. Although they include diverse scenarios, this remains insufficient to comprehensively evaluate the model’s generalization capability and noise robustness across different sensors, seasons, and extreme terrain conditions. In the future, we plan to explore and improve LU-Net’s performance in such complex environments to meet the diverse requirements of real-world water body extraction tasks. Additionally, we will investigate LU-Net’s application potential in multispectral and hyperspectral remote sensing imagery to broaden its deployment scenarios and extend its performance boundaries.

Author Contributions

Conceptualization, C.D. and S.W.; methodology, R.H.; software, Z.W. and R.H.; validation, R.H. and X.S.; writing—original draft preparation, R.H. and C.D.; writing—review and editing, C.D. and Z.W.; supervision, S.W.; funding acquisition, C.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (T2225019) and the Jiangxi Natural Science Foundation of China (20252BAC250011).

Data Availability Statement

The data presented in this study are available upon request from the corresponding author. The data are not publicly available as the study is still in progress.

Acknowledgments

The authors would like to gratefully thank the Editor, Associate Editor and the Anonymous Reviewers for their outstanding comments and suggestions, which greatly helped us to improve the technical quality and presentation of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. McFeeters, S.K. The Use of the Normalized Difference Water Index (NDWI) in the Delineation of Open Water Features. Int. J. Remote Sens. 1996, 17, 1425–1432.
2. Xu, H. Modification of Normalised Difference Water Index (NDWI) to Enhance Open Water Features in Remotely Sensed Imagery. Int. J. Remote Sens. 2006, 27, 3025–3033.
3. Feyisa, G.L.; Meilby, H.; Fensholt, R.; Proud, S.R. Automated Water Extraction Index: A New Technique for Surface Water Mapping Using Landsat Imagery. Remote Sens. Environ. 2014, 140, 23–35.
4. Duan, Q.; Meng, L. Applicability of the Water Information Extraction Method Based on GF-1 Image. Remote Sens. Nat. Resour. 2015, 27, 79–84.
5. Yang, F.; Guo, J.; Tan, H.; Wang, J. Automated Extraction of Urban Water Bodies from ZY-3 Multi-Spectral Imagery. Water 2017, 9, 144.
6. Zhang, Y.; Liu, X.; Zhang, Y.; Ling, X.; Huang, X. Automatic and Unsupervised Water Body Extraction Based on Spectral-Spatial Features Using GF-1 Satellite Imagery. IEEE Geosci. Remote Sens. Lett. 2019, 16, 927–931.
7. Isikdogan, F.; Bovik, A.C.; Passalacqua, P. Surface Water Mapping by Deep Learning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 4909–4918.
8. Li, L.; Yan, Z.; Shen, Q.; Cheng, G.; Gao, L.; Zhang, B. Water Body Extraction from Very High Spatial Resolution Remote Sensing Data Based on Fully Convolutional Networks. Remote Sens. 2019, 11, 1162.
9. Yuan, K.; Zhuang, X.; Schaefer, G.; Feng, J.; Guan, L.; Fang, H. Deep-Learning-Based Multispectral Satellite Image Segmentation for Water Body Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 7422–7434.
10. Lyu, X.; Jiang, W.; Li, X.; Fang, Y.; Xu, Z.; Wang, X. MSAFNet: Multiscale Successive Attention Fusion Network for Water Body Extraction of Remote Sensing Images. Remote Sens. 2023, 15, 3121.
11. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929.
12. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. arXiv 2021, arXiv:2103.14030.
13. Chen, J.; Xia, M.; Wang, D.; Lin, H. Double Branch Parallel Network for Segmentation of Buildings and Waters in Remote Sensing Images. Remote Sens. 2023, 15, 1536.
14. Zhao, X.; Wang, H.; Liu, L.; Zhang, Y.; Liu, J.; Qu, T.; Tian, H.; Lu, Y. A Method for Extracting Lake Water Using ViTenc-UNet: Taking Typical Lakes on the Qinghai-Tibet Plateau as Examples. Remote Sens. 2023, 15, 4047.
15. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv 2015, arXiv:1505.04597.
16. Kang, J.; Guan, H.; Ma, L.; Wang, L.; Xu, Z.; Li, J. WaterFormer: A Coupled Transformer and CNN Network for Waterbody Detection in Optical Remotely-Sensed Imagery. ISPRS J. Photogramm. Remote Sens. 2023, 206, 222–241.
17. Wang, A.; Chen, H.; Lin, Z.; Han, J.; Ding, G. RepViT: Revisiting Mobile CNN From ViT Perspective. arXiv 2024, arXiv:2307.09283.
18. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
19. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the Computer Vision – ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 3–19.
20. Xiao, T.; Singh, M.; Mintun, E.; Darrell, T.; Dollar, P.; Girshick, R. Early Convolutions Help Transformers See Better. In Proceedings of the Advances in Neural Information Processing Systems, virtual, 6–14 December 2021; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 30392–30400.
21. Yu, W.; Luo, M.; Zhou, P.; Si, C.; Zhou, Y.; Wang, X.; Feng, J.; Yan, S. MetaFormer Is Actually What You Need for Vision. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10809–10819.
22. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-Style ConvNets Great Again. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13728–13737.
23. Hendrycks, D.; Gimpel, K. Gaussian Error Linear Units (GELUs). arXiv 2023, arXiv:1606.08415.
24. Tong, X.-Y.; Xia, G.-S.; Lu, Q.; Shen, H.; Li, S.; You, S.; Zhang, L. Land-Cover Classification with High-Resolution Remote Sensing Images Using Transferable Deep Models. Remote Sens. Environ. 2020, 237, 111322.
25. Wang, J.; Zheng, Z.; Ma, A.; Lu, X.; Zhong, Y. LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation. arXiv 2022, arXiv:2110.08733.
26. Landis, J.R.; Koch, G.G. The Measurement of Observer Agreement for Categorical Data. Biometrics 1977, 33, 159.
27. Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. arXiv 2016, arXiv:1606.02147.
28. Romera, E.; Álvarez, J.M.; Bergasa, L.M.; Arroyo, R. ERFNet: Efficient Residual Factorized ConvNet for Real-Time Semantic Segmentation. IEEE Trans. Intell. Transp. Syst. 2018, 19, 263–272.
29. Wu, T.; Tang, S.; Zhang, R.; Cao, J.; Zhang, Y. CGNet: A Light-Weight Context Guided Network for Semantic Segmentation. IEEE Trans. Image Process. 2021, 30, 1169–1179.
30. Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 24 May 2019; pp. 6105–6114.
31. Ma, N.; Zhang, X.; Zheng, H.-T.; Sun, J. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. arXiv 2018, arXiv:1807.11164.
32. Howard, A.; Sandler, M.; Chen, B.; Wang, W.; Chen, L.-C.; Tan, M.; Chu, G.; Vasudevan, V.; Zhu, Y.; Pang, R.; et al. Searching for MobileNetV3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October 2019–2 November 2019; pp. 1314–1324.

Figure 1. The overall architecture of the proposed LU-Net.

Figure 2. The structure of the LCBAM.

Figure 3. The structure of the LDB. B denotes the batch size and C represents the channel dimension. H and W are the height and width of the feature map. The norm layer and nonlinearity are omitted for simplicity.

Figure 4. The structure of the SRMFPM.

Figure 5. Confusion Matrix.

Figure 6. ROC curves of various networks on the GID dataset.

Figure 7. Visualization of results on the GID dataset for various networks: white represents the water class and black represents the non-water class; red-circled regions show major discrepancies in the results. (a) Image; (b) Ground truth; (c) CGNet; (d) ERFNet; (e) ShuffleNetV2-U; (f) ENet; (g) MobileNetV3-U; (h) EfficientNet-U; (i) LU-Net.

Figure 8. ROC curves of various networks on the LoveDA dataset.

Figure 9. Visualization of results on the LoveDA dataset for various networks: white represents the water class and black represents the non-water class; red-circled regions show major discrepancies in the results. (a) Image; (b) Ground truth; (c) CGNet; (d) ERFNet; (e) ShuffleNetV2-U; (f) ENet; (g) MobileNetV3-U; (h) EfficientNet-U; (i) LU-Net.

Figure 10. Visualization results on the GID dataset for network variants: white represents the water class and black represents the non-water class; red-circled regions show major discrepancies in the results. (a) Image; (b) Ground truth; (c) Baseline; (d) Baseline-LDB; (e) Baseline-LDB-LCBAM; (f) LU-Net (Baseline-LDB-LCBAM-SRMFPM).

Figure 11. Segmentation image slices on the GID dataset. (a) Ground truth; (b) Baseline-LDB-LCBAM; (c) LU-Net (Baseline-LDB-LCBAM-SRMFPM).

Figure 12. Visualization results on the LoveDA dataset for network variants: white represents the water class and black represents the non-water class; red-circled regions show major discrepancies in the results. (a) Image; (b) Ground truth; (c) Baseline; (d) Baseline-LDB; (e) Baseline-LDB-LCBAM; (f) LU-Net (Baseline-LDB-LCBAM-SRMFPM).

Figure 13. Segmentation image slices on the LoveDA dataset. (a) Ground truth; (b) Baseline-LDB-LCBAM; (c) LU-Net (Baseline-LDB-LCBAM-SRMFPM).

Table 1. Latency of CAM and SAM with Varying Pooling Configurations.

Attention Module	Pool	Latency on GPU (ms)
CAM	Avg	0.018
CAM	Avg + Max	0.045
SAM	Avg	0.018
SAM	Avg + Max	0.033

Table 2. Lightweight performance of different networks.

Module	Params (M)	FLOPs (G)	Latency on GPU (ms)	Latency on CPU (ms)
ENet	0.36	1.96	6.5	47
EfficientNet-U	5.88	3.41	4.3	65
ShuffleNetV2-U	1.14	1.66	3.6	29
MobileNetV3-U	3.51	2.04	3.4	39
CGNet	0.50	3.48	5.5	68
ERFNet	2.06	13.30	3.9	85
LU-Net	2.39	2.29	3.1	36

Table 3. Extraction performance of various networks on the GID dataset.

Module	MIoU (%)	OA (%)	Precision (%)	Recall (%)	Kappa
ENet	90.86	96.51	91.97	93.45	0.9026
EfficientNet-U	90.99	96.57	91.94	93.69	0.9040
ShuffleNetV2-U	90.63	96.43	91.69	93.30	0.9000
MobileNetV3-U	90.92	96.55	92.33	93.15	0.9033
CGNet	91.00	96.58	91.66	94.00	0.9042
ERFNet	90.39	96.35	91.78	92.77	0.8972
LU-Net	91.36	96.71	92.24	94.04	0.9083

Table 4. Extraction performance of various networks on the LoveDA dataset.

Module	MIoU (%)	OA (%)	Precision (%)	Recall (%)	Kappa
ENet	85.25	95.18	87.63	85.47	0.8348
EfficientNet-U	86.20	95.61	90.66	84.28	0.8464
ShuffleNetV2-U	85.21	95.23	89.18	83.78	0.8344
MobileNetV3-U	85.55	95.37	89.31	84.23	0.8383
CGNet	86.40	95.63	89.44	85.84	0.8489
ERFNet	85.62	95.37	88.42	85.14	0.8388
LU-Net	86.32	95.58	89.37	85.83	0.8481

Table 5. Lightweight performance of variants.

Module	Params (M)	FLOPs (G)	Latency on GPU (ms)	Latency on CPU (ms)
Baseline	4.895	4.481	3.0	38
Baseline-LDB	2.392	2.239	2.5	31
Baseline-LDB-LCBAM	2.392	2.241	3.0	34
Baseline-LDB-LCBAM-SRMFPM (LU-Net)	2.393	2.293	3.1	36

Table 6. Extraction performance of variants on the GID dataset.

Module	MIoU (%)	OA (%)	Precision (%)	Recall (%)	Kappa
Baseline	91.12	96.62	92.07	93.82	0.9057
Baseline-LDB	91.19	96.65	91.80	94.20	0.9064
Baseline-LDB-LCBAM	91.30	96.69	92.15	94.03	0.9076
Baseline-LDB-LCBAM-SRMFPM (LU-Net)	91.36	96.71	92.24	94.04	0.9083

Table 7. Extraction performance of variants on the LoveDA dataset.

Module	MIoU (%)	OA (%)	Precision (%)	Recall (%)	Kappa
Baseline	85.77	95.38	88.34	85.74	0.8413
Baseline-LDB	86.01	95.47	89.00	85.57	0.8443
Baseline-LDB-LCBAM	86.24	95.57	89.79	85.26	0.8472
Baseline-LDB-LCBAM-SRMFPM (LU-Net)	86.32	95.58	89.37	85.83	0.8481

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

