Article

A Lightweight Network for Water Body Segmentation in Agricultural Remote Sensing Using Learnable Kalman Filters and Attention Mechanisms

1
College of Information Engineering, Sichuan Agricultural University, Ya’an 625014, China
2
Key Laboratory of Investigation, Monitoring, Protection and Utilization for Cultivated Land Resources, Ministry of Natural Resources, Chengdu 610045, China
3
College of Resources, Sichuan Agricultural University, Chengdu 611130, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(11), 6292; https://doi.org/10.3390/app15116292
Submission received: 29 April 2025 / Revised: 30 May 2025 / Accepted: 2 June 2025 / Published: 3 June 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Precise identification of water bodies in agricultural watersheds is crucial for irrigation, water resource management, and flood disaster prevention. However, the spectral noise caused by complex light and shadow interference and water quality differences, combined with the diverse shapes of water bodies and the high computational cost of image processing, severely limits the accuracy of water body recognition in agricultural watersheds. This paper proposes a lightweight and efficient learnable Kalman filter and Deformable Convolutional Attention Network (LKF-DCANet). The encoder is built using a shallow Channel Attention-Enhanced Deformable Convolution module (CADCN), while the decoder combines a Convolutional Additive Token Mixer (CATM) and a learnable Kalman filter (LKF) to achieve adaptive noise suppression and enhance global context modeling. Additionally, a feature-based knowledge distillation strategy is employed to further improve the representational capacity of the lightweight model. Experimental results show that LKF-DCANet achieves an Intersection over Union (IoU) of 85.95% with only 0.22 M parameters on a public dataset. When transferred to a self-constructed UAV dataset, it achieves an IoU of 96.28%, demonstrating strong generalization ability. All experiments are conducted on RGB optical imagery, confirming that LKF-DCANet offers an efficient and highly versatile solution for water body segmentation in precision agriculture.

1. Introduction

With the advancement of remote sensing technology, it has become increasingly convenient to obtain high-resolution land surface images [1], providing ample data support for agricultural water conservancy surveys and the extraction of water body information. As a core technology in agricultural remote sensing, the identification of water bodies in remote sensing images is of crucial significance for applications such as agricultural production, farmland irrigation, and ecological protection [2]. Therefore, given the complex backgrounds of rural environments [3], developing a reliable and highly generalizable water body segmentation method is crucial for the implementation of precision agriculture and the ecological protection of river basins [4].
In the field of water body segmentation of remote sensing images, the research methods mainly include threshold algorithms, machine learning algorithms, and deep learning algorithms [5]. Threshold algorithms such as the Normalized Difference Water Index (NDWI) [6], the Modified Normalized Difference Water Index (MNDWI) [7], and the Automated Water Extraction Index (AWEI) [8] achieve the extraction of targets by analyzing the response characteristics of water bodies in specific spectral intervals. However, these methods suffer from high data acquisition costs and heavy reliance on manual parameter tuning, limiting their effectiveness in modern agricultural remote sensing applications. To overcome the limitations of threshold segmentation methods, machine learning methods such as Support Vector Machine (SVM) [9], Maximum Likelihood Classification (MLC) [10], Decision Tree (DT) [11], and Random Forest (RF) [12] have been introduced into the task of water body extraction. These methods, by learning the decision boundaries between categories from labeled training samples, exhibit better segmentation accuracy than traditional index methods in specific datasets. However, their limited capacity to extract deep features makes them inadequate for complex scenarios, such as water scale variations and vegetation occlusion in agricultural watersheds [1,3].
In addition to optical image-based methods, synthetic aperture radar (SAR) data has been widely used in water body segmentation research due to its weather-independent nature and the ability to highlight open water features through low backscatter signals [13]. SAR is particularly effective in flood mapping and large-scale hydrological applications. However, its application in small agricultural watersheds is limited. The spatial resolution of most SAR systems is insufficient to detect narrow streams and irrigation ditches. Additionally, specular reflections from smooth, non-water surfaces such as wet roads or bare soil may lead to misclassification [14]. These issues reduce the applicability of SAR-based segmentation methods in fine-scale agricultural scenarios.
Deep learning has made significant progress in the field of remote sensing image analysis, providing effective solutions for complex tasks such as water body extraction. Convolutional neural networks (CNNs) [15] are widely applied in semantic segmentation due to their powerful feature representation capabilities. Models such as the Fully Convolutional Network (FCN) [16], U-Net [17], Pyramid Scene Parsing Network (PSPNet) [18], and DeepLab [19,20,21] all adopt the encoder–decoder structure to preserve spatial details and improve segmentation accuracy. Deep High-Resolution Representation Learning for Visual Recognition (HRNet) [22] further integrates the feature information of each layer by connecting branches with different resolutions in parallel. However, when these models are directly applied to agricultural watershed identification, the extraction accuracy struggles to meet the practical requirements of agricultural remote sensing monitoring, owing to insufficient consideration of multi-scale semantic features and spatial correlations during feature extraction.
Aiming at the problem of insufficient multi-scale feature perception ability, researchers have proposed a variety of improvement schemes to address key challenges in water body identification, such as interference from complex backgrounds, the lack of connectivity of large-scale water areas, and rough edge positioning. Kang et al. [23] proposed MSCEnet, which effectively integrates multi-scale contextual information by embedding the Res2Net [24] module, dilated convolutions, and multi-kernel convolution units. Wang et al. [25] developed a water body extraction method for remote sensing images, which employs a dual attention module to enhance the global dependencies in both spatial and channel dimensions. To enhance the ability to extract dense scale features, Xiang et al. introduced an improved DensePPM module [26]. In addition, to address the challenge of extracting water body boundaries, Chen et al. [27] proposed a hybrid semantic segmentation strategy based on K-Net, which improves the segmentation accuracy of lakes by iteratively optimizing the details. Liu et al. [28] proposed MSFENet, which reduces the dependence on large-scale data by extracting multi-scale features of water bodies and combining contrastive learning. FWE-net [29] and MEC-net [30], through the channel attention mechanism, adaptively adjust the weights of feature channels and guide the model to focus on the edge areas of water bodies. Despite the progress achieved by the aforementioned methods in water body extraction, conventional convolutional neural networks still face limitations when addressing the unique challenges of agricultural watersheds—such as complex terrain, significant variations in water body scales, and diverse vegetation coverage. In recent years, the Transformer network has become an important technology for solving the problem of complex feature correlations in water body extraction due to its self-attention mechanism. Vision Transformer (ViT) [31] achieves powerful capture of global dependencies through image partitioning and sequence modeling. Zhong et al. [32] proposed a multi-level Transformer module that establishes long-range dependencies between the encoder and the decoder, enhancing the context modeling ability of the boundary information of lakes. With the advancement of general-purpose segmentation frameworks, vision-language pretraining models such as CLIP [33] and Segment Anything Model (SAM) [34] have demonstrated remarkable generalization capabilities across diverse segmentation tasks by leveraging large-scale cross-modal training. However, when applied to domain-specific tasks such as water body identification in agricultural watersheds, they still lack effective adaptation strategies.
Although Transformer-based architectures have demonstrated superior performance to traditional CNNs in semantic segmentation tasks, the computational cost of their self-attention mechanism increases quadratically with the sequence length [35], which severely restricts their application on intelligent agricultural devices. Therefore, there is an urgent need to develop lightweight solutions that meet practical requirements. To this end, various techniques, such as low-rank decomposition, model pruning, weight quantization, and knowledge distillation, have been proposed to reduce the redundancy existing in deep neural network architectures [36,37]. Zhou et al. [38], the proposers of UNet++, pointed out that, drawing on the designs of PSPNet [18] and FC-DenseNets [39], shallow downsampling can achieve better performance than deep networks under certain circumstances. EDANet [40], on the other hand, utilizes the asymmetric convolutional dense module to decompose the n × n convolution into n × 1 and 1 × n kernels, thus improving segmentation efficiency. Through a modular design, ESPNet [41,42] combines dilated convolutions with a spatial pyramid composed of grouped pointwise convolutions, achieving an inference speed of approximately 112 FPS with only 360,000 parameters.
Despite recent advances, current methods still face significant challenges in segmentation accuracy, computational efficiency, and generalization capability: (1) The optical properties of agricultural watersheds in remote sensing images are affected by factors such as water composition, surface reflection, and environmental conditions, often leading to misclassification and reduced segmentation accuracy [43]. (2) Natural water bodies in agricultural watersheds typically have irregular boundaries, fragmented distributions, and large-scale variations, while artificial ones are dense and intertwined. Current models struggle to capture both the semantic information of large-scale water bodies and the geometric details of small ones simultaneously [3]. (3) High-resolution datasets used for water body identification in agricultural watersheds are large in size, require costly annotations, and incur high computational overhead during training [5]. In addition, the scarcity of publicly available datasets restricts the generalization ability of the model.
To address the above challenges, this study aims to develop an efficient and generalizable segmentation framework for extracting water bodies in agricultural watershed environments from high-resolution optical remote sensing imagery. The goal is to improve segmentation accuracy while significantly reducing model complexity, enabling its deployment on resource-constrained agricultural devices. The main contributions are as follows:
  • We propose a learnable Kalman filter module, applied in the feature decoding stage, to enhance segmentation accuracy by stabilizing feature representations and suppressing complex noise through end-to-end optimization.
  • We propose a lightweight water body segmentation framework. By reducing the network depth and width, it achieves a favorable trade-off between accuracy and computational efficiency.
  • Our method demonstrates strong performance on public datasets and effectively generalizes to UAV-captured images via transfer learning.

2. Materials and Methods

2.1. Datasets

To verify the accuracy of the proposed model in water body recognition and its generalization ability in different scenarios, this study constructed an experimental system using multi-source heterogeneous data, which consists of two parts: a global satellite image dataset and a dataset collected autonomously by unmanned aerial vehicles (UAVs). All satellite and UAV images used in this study are based on optical remote sensing and contain three spectral bands: red, green, and blue (RGB). The RGB configuration was selected due to its wide availability across both satellite and UAV platforms, its high spatial resolution, and its compatibility with mainstream deep learning architectures.

2.1.1. GLH-Water Dataset Construction

This study constructed a benchmark testing platform based on the publicly available GLH-water dataset [44]. This dataset is composed of high-resolution satellite images distributed globally. Each image has a size of 12,800 × 12,800 pixels, a spatial resolution of 0.3 m, and a geographical coverage area of approximately 3686 square kilometers. The original data has been processed through a strict quality control process, and it contains precise semantic annotations of 40.96 billion pixels, covering diverse hydrological environments, including rivers, lakes, forest ponds, irrigated fields, bare land, and urban zones, as shown in Figure 1. To be compatible with the deep learning framework, we cut the original images into 4193 sub-tiles with a size of 1024 × 1024 pixels each. Then, we divided the dataset into a training set and a validation set at a ratio of 8:2. After standardized preprocessing of the images, the label pixel values were normalized to 0 (background) and 1 (water body). After model training was completed, we performed inference on each sub-tile of the held-out set and then reconstructed the full-size images by stitching the predictions together for the final evaluation.
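A minimal sketch of this tiling and label binarization step is given below; the handling of edge remainders and any overlap are assumptions for illustration rather than the exact procedure used.

```python
import numpy as np

def tile_image(image: np.ndarray, tile: int = 1024):
    """Cut an H x W (x C) array into non-overlapping tile x tile patches.

    Edge patches smaller than `tile` are simply discarded here; padding or
    overlapped tiling would be equally valid choices.
    """
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            patches.append(image[y:y + tile, x:x + tile])
    return patches

def binarize_label(label: np.ndarray) -> np.ndarray:
    """Normalize label pixel values to 0 (background) and 1 (water body)."""
    return (label > 0).astype(np.uint8)

# Example: one 12,800 x 12,800 scene yields (12800 // 1024)^2 = 144 full tiles.
scene = np.zeros((12800, 12800, 3), dtype=np.uint8)
print(len(tile_image(scene)))  # 144
```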

2.1.2. Self-Constructed UAV Dataset

To verify the generalizability of the model in complex scenarios, this study constructed a dataset of UAV aerial water-body images in the Dongfengqu Irrigation District, Sichuan Province, China (30.52° N, 104.23° E). The dataset covers various water-body landscapes, such as irrigation canals, farmland water areas, and artificial ponds, with a total survey area of approximately 3,521,585.7 m², as shown in Figure 2. Data were collected with a DJI Mavic 3E quadcopter (SZ DJI Technology Co., Ltd., Shenzhen, China). Its 4/3 CMOS RGB camera provides 20 million effective pixels and 4K video recording; flights were conducted at a height of 170 m, yielding a ground sampling distance of 5.01 cm/pixel and ensuring wide image coverage and rich detail. During preprocessing, 620 images with a size of 960 × 544 pixels were extracted from the UAV videos at an interval of 60 frames. The dataset was divided at a ratio of 9:1, and 10% of the samples were finely annotated.
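Frame extraction at a fixed interval is a standard preprocessing step; a minimal OpenCV sketch is shown below (paths, output format, and the helper name are illustrative).

```python
import os
import cv2  # OpenCV

def extract_frames(video_path: str, out_dir: str, step: int = 60) -> int:
    """Save every `step`-th frame of a UAV video as a PNG image."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    saved, idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{idx:06d}.png"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```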

2.2. Overview of the Proposed Framework

This paper proposes an efficient, lightweight water body segmentation framework with a dual-network structure: one is a deep network built on the U-shaped architecture, which takes MobileNetV2 as the backbone and embeds a learnable Kalman filter module; the other is LKF-DCANet, which uses a low channel count and a shallow CADCN to construct the encoder and integrates the CATM and LKF modules in the decoder. This design significantly reduces computational complexity while maintaining high segmentation performance. The lightweight network is trained with knowledge distillation to inherit semantic representations from the deep model. The framework thus balances computational complexity and segmentation accuracy in agricultural watershed identification, providing a reliable technical solution for agricultural water conservancy monitoring and water environment governance.

2.3. Deep Water Body Segmentation Network

To address complex noise conditions in agricultural watersheds, such as fog interference, surface shadows, and water eutrophication, we designed a deep U-shaped network architecture, employing MobileNetV2 [45] as the backbone for feature extraction and downsampling. In the decoder, a learnable Kalman filter module is integrated to enhance feature stability and noise suppression, as illustrated in Figure 3.

2.3.1. Learnable Kalman Filter Module

The Kalman filter is a recursive method for state estimation that combines prior predictions with current observations through a two-step process: prediction and update [46,47,48]. While effective in linear systems, it struggles with nonlinear scenarios and requires manual parameter tuning. In this work, we adapt the Kalman filter into a learnable, spatially aware module suitable for remote sensing image segmentation. To enhance the model's ability to cope with noise interference during feature decoding, we serialize the two-dimensional feature map and regard each row as a one-dimensional sequence. Within this sequence, the pixels are dynamically modeled by the LKF module in their spatially adjacent order. The Kalman module predicts the state $\hat{x}_{k|k-1}$ at the current position based on the state estimate $\hat{x}_{k-1|k-1}$ at the adjacent position and adaptively corrects the prediction by combining it with the actual observation value $z_k$ from the input feature map. Through this spatially ordered recursive mechanism, the model can effectively enhance its robustness against interference factors while maintaining the continuity of the local structure, providing a more stable and accurate feature representation for subsequent segmentation tasks.
To implement this recursive mechanism, we designed an LKF module by embedding the classical prediction-update structure of the Kalman filter into a deep neural network. Its core lies in transforming the state transition matrix $F_k$, the observation matrix $H_k$, the process noise covariance matrix $Q_k$, and the observation noise covariance matrix $R_k$, which rely on statistics and manual settings in the traditional Kalman filter, into matrices whose parameters can be learned and updated by the network. Through backpropagation, these parameters are dynamically optimized during training, enabling the filter to adaptively adjust the state estimation, as shown in Figure 4.
(1) State Prediction Phase in LKF
In our framework, the input feature map is serialized along the spatial dimension, and each pixel is processed sequentially based on its adjacent spatial positions. At each step, the module treats the input feature at the previous position $z_{k-1}$ as the current observation for initialization. During the prediction phase, the state estimation $\hat{x}_{k|k-1}$ and the associated error covariance $P_{k|k-1}$ are predicted via matrix operations, as shown in Equation (1). A learnable state transition matrix $F_k$ models the spatial evolution of features between adjacent positions, allowing the previous state estimation $\hat{x}_{k-1|k-1}$ to be propagated forward. Simultaneously, a learnable process noise covariance matrix $Q_k$ is introduced to account for uncertainties in the prediction:
$$\hat{x}_{k|k-1} = F_k \hat{x}_{k-1|k-1}, \qquad P_{k|k-1} = F_k P_{k-1|k-1} F_k^{T} + Q_k \quad (1)$$
where $\hat{x}_{k-1|k-1}$ denotes the updated state estimation at the adjacent previous position (initialized as a zero tensor), and $\hat{x}_{k|k-1}$ represents the predicted state at the current spatial position. $P_{k-1|k-1}$ is the posterior error covariance matrix at the previous position, initialized as $0.1I$, with $I$ being the identity matrix.
(2) State Update Phase in LKF
In the update phase, the module refines the predicted state using the actual feature observation $z_k$ at the current spatial position. The state is mapped to the observation space through the learnable observation matrix $H_k$ to ensure an effective connection between the predicted state and the actual observation. The learnable observation noise covariance matrix $R_k$ quantifies the noise level in the observation data and determines how strongly the observation influences the state update. The observation prediction error covariance $S_k$ then captures the combined uncertainty in the observation space, as shown in Equation (2).
$$S_k = H_k P_{k|k-1} H_k^{T} + R_k \quad (2)$$
Based on $S_k$, the Kalman gain $K_k$ measures the balance between the predicted state and the observed value and adjusts the contribution of the prediction error to the state update through the inverse matrix $S_k^{-1}$, as shown in Equation (3).
$$K_k = P_{k|k-1} H_k^{T} S_k^{-1} \quad (3)$$
By using the Kalman gain, the predicted state is corrected to obtain the updated state estimation, as shown in Equation (4):
$$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \left( z_k - H_k \hat{x}_{k|k-1} \right) \quad (4)$$
where $z_k$ represents the observation at the current position, that is, the input feature map; $z_k - H_k \hat{x}_{k|k-1}$ is the observation residual; and $\hat{x}_{k|k}$ is the updated state estimation, a more accurate output obtained by correcting the predicted state.
At the same time, the posterior estimation error covariance matrix is updated to $P_{k|k}$, as shown in Equation (5), which is used to measure the precision of the estimates. The update process fully accounts for the deviation between the observed data and the predicted values, effectively improving the accuracy of the state estimation.
$$P_{k|k} = \left( I - K_k H_k \right) P_{k|k-1} \quad (5)$$
This recursive optimization mechanism enables the LKF module to gradually receive the inputs of adjacent pixels in spatial order and learn the dynamic patterns of their state transitions. In areas with interference such as vegetation occlusion, shadow coverage, or water body reflections, the model can, based on the predicted state of the previous spatial position, combine the current observation information with the surrounding context to adaptively infer the true state of the current pixel and maintain the continuity of the regional structure, thus effectively repairing the disturbed features.
Different from traditional filtering methods that rely on prior noise statistical parameters, the LKF module conducts learnable modeling of the state transition matrix and the covariance matrix through end-to-end training, realizing a dynamic, data-driven adaptive optimization mechanism. In addition, the LKF module has a lightweight structure with a small number of parameters. It only depends on basic matrix operations and has good computational efficiency. It significantly improves the model’s ability to suppress spatial noise and enhances the stability and consistency of feature representation in complex remote sensing environments.
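To make the recursion above concrete, the following PyTorch sketch implements Equations (1)–(5) with trainable $F_k$, $H_k$, $Q_k$, and $R_k$ over row-serialized features; the parameterization details (diagonal noise covariances, a covariance $P$ shared across sequences) are simplifying assumptions for illustration rather than the exact implementation.

```python
import torch
import torch.nn as nn

class LearnableKalmanFilter(nn.Module):
    """Sketch of a learnable Kalman filter over a serialized feature map.

    Each row of a (B, C, H, W) feature map is treated as a 1-D sequence;
    pixels are visited left to right, and the C-dimensional feature vector
    at each position plays the role of the observation z_k. F, H, Q, R are
    trainable instead of hand-set statistics.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.F = nn.Parameter(torch.eye(channels))   # state transition matrix
        self.H = nn.Parameter(torch.eye(channels))   # observation matrix
        # Parameterize Q and R through their diagonals to keep them positive.
        self.q_diag = nn.Parameter(torch.full((channels,), -2.0))
        self.r_diag = nn.Parameter(torch.full((channels,), -2.0))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        Q = torch.diag(torch.nn.functional.softplus(self.q_diag))
        R = torch.diag(torch.nn.functional.softplus(self.r_diag))
        I = torch.eye(c, device=feat.device)

        # Observations: (B*H, W, C), one sequence per image row.
        z = feat.permute(0, 2, 3, 1).reshape(b * h, w, c)
        x = torch.zeros(b * h, c, device=feat.device)   # x̂_{0|0}
        P = 0.1 * I                                     # P_{0|0}, shared across sequences
        outputs = []
        for k in range(w):
            # Prediction: x̂_{k|k-1} = F x̂_{k-1|k-1}; P_{k|k-1} = F P Fᵀ + Q
            x_pred = x @ self.F.T
            P_pred = self.F @ P @ self.F.T + Q
            # Update with the observation z_k at the current position.
            S = self.H @ P_pred @ self.H.T + R
            K = P_pred @ self.H.T @ torch.linalg.inv(S)   # Kalman gain
            resid = z[:, k, :] - x_pred @ self.H.T        # observation residual
            x = x_pred + resid @ K.T
            P = (I - K @ self.H) @ P_pred
            outputs.append(x)
        out = torch.stack(outputs, dim=1)                 # (B*H, W, C)
        return out.reshape(b, h, w, c).permute(0, 3, 1, 2)
```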

2.3.2. Encoder and Decoder

The backbone network used in this study is MobileNetV2, whose core is composed of inverted residual blocks and depthwise separable convolutions. Each inverted residual block dynamically adjusts the number of channels through an expansion factor, successively generating multi-scale feature maps with 16, 24, 32, 64, 96, and 160 channels. Feature reuse is achieved through skip connections between various modules, enhancing the expressive ability of the model and providing a feature foundation that combines local details and global semantics for subsequent upsampling and multi-scale fusion, as illustrated in Figure 3.
After downsampling through the deep MobileNetV2 encoder, the model utilizes an upsampling module that combines bilinear interpolation with a 3 × 3 convolution to gradually restore the resolution of the feature maps and finally output the segmentation results. The upsampling module can dynamically adjust the input and output channels according to the feature scale. The concatenated multi-scale feature maps are further corrected and optimized through a 1 × 1 convolution and a learnable Kalman filter module, effectively achieving the fusion of global semantics and local details and enhancing the model’s ability to distinguish between water pollution and the interference of vegetation shadow occlusion.
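A minimal sketch of one such upsampling step is given below; channel counts and the exact placement of the 1 × 1 fusion convolution (after which the LKF module would be applied) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpBlock(nn.Module):
    """Sketch of a decoder upsampling step: bilinear upsampling, skip
    concatenation, a 3x3 convolution, and a 1x1 fusion convolution."""

    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.conv3 = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.fuse = nn.Conv2d(out_ch, out_ch, 1)

    def forward(self, x, skip):
        # Resize to the skip connection's spatial size, then fuse.
        x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear",
                          align_corners=False)
        x = torch.cat([x, skip], dim=1)
        return self.fuse(self.conv3(x))
```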

2.4. Lightweight Design of the Water Body Segmentation Network

In this study, we propose a lightweight water body segmentation network, LKF-DCANet. As shown in Figure 5, both the encoder and decoder of this feature extraction network adopt a streamlined design, retaining only the channels that are highly correlated with the target features. In this way, while maintaining multi-scale feature extraction capability, the number of parameters and the amount of computation are greatly reduced. Specifically, the encoder employs a low channel count and a shallow CADCN downsampling module, as shown in Figure 5c. In this module, the deformable convolution combines depthwise convolution with pointwise convolution through a separable convolution strategy, effectively reducing the number of parameters and memory consumption. In the decoder, the CATM module is introduced [49]. By simplifying the attention calculation, it achieves an efficient fusion of spatial and channel information. In addition, the network extensively employs low-parameter nonlinear activation functions, normalization layers, and parameter-free upsampling operations to preserve overall expressive ability without adding complexity, as shown in Figure 5b. To further enhance the performance of the lightweight model, we introduce a feature-based knowledge distillation strategy [36]. Through feature transfer, the lightweight model can effectively inherit the discriminative ability of the deep model, achieving high-precision segmentation while significantly reducing resource consumption.

2.4.1. Encoder Based on CADCN

To enhance the model’s ability to extract water bodies with complex shapes, highly variable sizes, and spatial distributions in agricultural watersheds, this study designed a lightweight backbone network. The core of this network lies in constructing an efficient CADCN feature extraction block. Specifically, the encoder adopts a four-layer downsampling structure. In each layer, the resolution of the feature map is gradually reduced by max pooling and 1 × 1 convolution. At the same time, the feature expression is enhanced by increasing the number of channels. The preliminary feature extraction module uses 3 × 3 convolution, batch normalization, and the GELU (Gaussian Error Linear Unit) activation function to map the three channels of the RGB image into 16 channels and achieve preliminary downsampling. Subsequently, multi-scale features are extracted layer by layer through four CADCNs. The number of output channels is 16, 32, 64, and 128 in sequence. Compared with the existing mainstream segmentation architectures, the number of network layers and channels is significantly reduced, thus effectively reducing the number of parameters.
The CADCN consists of a 1 × 1 convolution, a deformable convolution [50], batch normalization, a GELU activation function, and the channel attention sub-module (CA) of the Convolutional Block Attention Module (CBAM) [51]. The deformable convolution breaks the limitation of the fixed sampling grid by introducing dynamic offsets, enabling the convolution kernel to adaptively adjust its sampling positions according to the spatial distribution of the input features. In this way, it can more accurately capture the geometric features of water bodies in agricultural watersheds, as shown on the left side of Figure 5c, where the red boxes represent the dynamically sampled locations guided by learned offsets. Inspired by the multi-head self-attention mechanism, the deformable convolution module divides the spatial aggregation process into multiple groups; each group independently predicts sampling offsets and modulation scalars, enabling rich features to be learned from multiple subspaces, as shown in Figure 6. A separable convolution strategy is also adopted, in which the depthwise convolution is responsible for position-aware modulation and the pointwise convolution shares the projection weights among the sampling points, significantly reducing the number of parameters and memory overhead. Softmax normalization is applied across the sampling points so that the modulation scalars sum to 1, improving training stability across different model scales. The specific calculation is
$$y(p_0) = \sum_{g=1}^{G} \sum_{k=1}^{K} w_g \, m_{gk} \, x_g \!\left( p_0 + p_k + \Delta p_{gk} \right) \quad (6)$$
where $G$ is the number of aggregation groups and $K$ is the number of sampling points per group. $w_g$ denotes the shared projection weight of group $g$, and $m_{gk}$ is the modulation scalar of the $k$-th sampling point in group $g$. $x_g$ is the input feature map of group $g$, $p_0$ is the current position, $p_k$ is the $k$-th predefined grid sampling location, and $\Delta p_{gk}$ is its learned offset. $y(p_0)$ denotes the output feature at position $p_0$.
However, although deformable convolution alone performs well in adapting to spatial transformations, its ability to model differences among feature channels is relatively limited. We therefore introduce the channel attention sub-module (CA) of CBAM into the CADCN. The input is aggregated by both global average pooling and global max pooling; each pooled descriptor is passed through a dimensionality-reducing fully connected layer and a ReLU activation, the two branches are fused through a dimensionality-increasing fully connected layer, and the channel attention weights are finally obtained through Sigmoid normalization. Let $x \in \mathbb{R}^{C \times H \times W}$ be the input feature map; the channel attention weights are computed as
$$CA(x) = \sigma \!\left( W_1 \!\left( \delta \!\left( W_0 (\mathrm{AvgPool}(x)) \right) + \delta \!\left( W_0 (\mathrm{MaxPool}(x)) \right) \right) \right) \quad (7)$$
where $W_0$ and $W_1$ are the fully connected layers for dimensionality reduction and expansion, respectively, $\delta$ denotes the ReLU function, $\sigma$ is the Sigmoid function, and $\mathrm{AvgPool}(x)$ and $\mathrm{MaxPool}(x)$ denote global average and max pooling of the input $x$.
This feature extraction block has a strong ability to adapt to spatial transformations and can effectively extract the feature information among channels, significantly enhancing the feature representation. The shallow layers equipped with the CADCN are responsible for accurately capturing local edge details and fine textures of water bodies under complex agricultural watershed conditions. In contrast, the deeper layers focus on modeling global contextual distributions and semantic boundaries, ensuring a comprehensive understanding of the overall region.
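The following sketch illustrates one CADCN block built from torchvision's deformable convolution and a CBAM-style channel attention gate (equivalent to Equation (7) up to the linearity of $W_1$); the reduction ratio, single offset group, and other hyperparameters are assumptions for illustration, not the exact configuration used in LKF-DCANet.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class ChannelAttention(nn.Module):
    """CBAM-style channel attention gate (see Eq. (7))."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))    # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))     # global max pooling branch
        w = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * w

class CADCNBlock(nn.Module):
    """Sketch of a CADCN block: 1x1 projection, deformable 3x3 convolution
    with learned offsets and softmax-normalized modulation, BN + GELU,
    and channel attention."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, 1)
        self.offset = nn.Conv2d(out_ch, 2 * 9, 3, padding=1)   # (dx, dy) per point
        self.mask = nn.Conv2d(out_ch, 9, 3, padding=1)          # modulation scalars
        self.dcn = DeformConv2d(out_ch, out_ch, 3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.GELU()
        self.ca = ChannelAttention(out_ch)

    def forward(self, x):
        x = self.proj(x)
        offset = self.offset(x)
        # Softmax over the 9 sampling points so the modulation scalars sum to 1.
        mask = torch.softmax(self.mask(x), dim=1)
        x = self.act(self.bn(self.dcn(x, offset, mask)))
        return self.ca(x)
```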

2.4.2. Decoder with CATM and LKF Module

Due to the inherent limitations of shallow architectures, lightweight models often struggle to capture long-range semantic dependencies across regions, resulting in insufficient global context modeling. To address this issue, we introduce the CATM mechanism into the decoder part of the model to enhance the feature representation ability, as shown in Figure 7.
This mechanism combines multi-head self-attention with spatial-channel operations. It generates queries (Q), keys (K), and values (V) through independent linear transformations, thus capturing the global dependency information in the feature map, as shown in Equation (8).
$$Q = W_q x, \qquad K = W_k x, \qquad V = W_v x \quad (8)$$
Different from the traditional self-attention module, CATM adopts additive similarity calculation in the Query and Key branches and replaces the complex softmax normalization with the Sigmoid activation function. This effectively preserves the original feature dimensions, avoids the loss of information in the two-dimensional score vector, and significantly reduces the computational complexity and the number of parameters at the same time. This design improves the parallel processing ability and deployment efficiency of the network. Specifically, CATM defines the similarity function as the sum of the context scores of $Q \in \mathbb{R}^{N \times d}$ and $K \in \mathbb{R}^{N \times d}$, as shown in Equation (9).
$$\mathrm{Sim}(Q, K) = \Phi(Q) + \Phi(K), \quad \text{s.t.} \; \Phi(Q) = C(S(Q)) \quad (9)$$
where $\Phi$ represents the context mapping function, embodied as the Sigmoid-based channel attention $C \in \mathbb{R}^{N \times d}$ and spatial attention $S \in \mathbb{R}^{N \times d}$. Here, s.t. stands for "subject to", indicating that the integration process is constrained by both channel and spatial attention weights. Then, $\Gamma \in \mathbb{R}^{N \times d}$ is used to integrate the context information.
Finally, the optimized output features are obtained:
$$O = \Gamma \!\left( \Phi(Q) + \Phi(K) \right) V \quad (10)$$
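A compact sketch of such an additive token mixer is given below; the concrete operators chosen for the spatial attention $S$, channel attention $C$, and integration function $\Gamma$ are plausible instantiations of Equations (8)–(10), not the exact implementation of [49].

```python
import torch
import torch.nn as nn

class CATM(nn.Module):
    """Sketch of a convolutional additive token mixer (Eqs. (8)-(10))."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Conv2d(dim, dim, 1)
        self.k = nn.Conv2d(dim, dim, 1)
        self.v = nn.Conv2d(dim, dim, 1)
        # Spatial attention S(.): depthwise conv followed by a Sigmoid gate.
        self.spatial = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.Sigmoid(),
        )
        # Channel attention C(.): squeeze-excite-style Sigmoid gate.
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim, 1),
            nn.Sigmoid(),
        )
        # Gamma: context integration via depthwise + pointwise convolution.
        self.gamma = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.Conv2d(dim, dim, 1),
        )
        self.proj = nn.Conv2d(dim, dim, 1)

    def phi(self, x):
        x = x * self.spatial(x)      # S(x): spatial gating
        return x * self.channel(x)   # C(S(x)): channel gating

    def forward(self, x):
        q, k, v = self.q(x), self.k(x), self.v(x)
        sim = self.phi(q) + self.phi(k)     # additive similarity, Eq. (9)
        out = self.gamma(sim) * v           # context integration and weighting of V
        return self.proj(out)
```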
In the decoder stage, both the CATM and the LKF modules are integrated into the final upsampling layer of the U-shaped architecture. After upsampling and feature fusion, the CATM module employs an additive self-attention mechanism to enhance interactions between global contextual information and local spatial details, thereby improving segmentation accuracy. Simultaneously, the LKF module performs feature enhancement to suppress noise and stabilize spatial representations. This design significantly improves the decoder’s capability to recover fine-grained structures and delineate precise boundaries while maintaining the overall lightweight nature of the network.

2.4.3. Feature-Based Knowledge Distillation Strategy

Although the lightweight design significantly reduces the parameter scale and computational cost of the model, its feature expression ability is often limited, which leads to a decrease in segmentation accuracy in complex remote sensing scenarios. To improve the discriminative ability of the lightweight model, this paper proposes a feature-based knowledge distillation strategy [52] that guides the shallow model to learn richer intermediate semantic features from the deep model during the training phase, as shown in Figure 8.
Specifically, we constructed a distillation structure in which a deep network and a lightweight network run in parallel. The deep network adopts an architecture with a large number of parameters to obtain high-quality semantic representations, while the lightweight network, based on a shallow structure LKF-DCANet, learns the discriminative features of the deep network through multi-layer feature alignment. We select the encoder outputs of the 1st, 2nd, 3rd, and 4th layers of the lightweight network and align them with the feature maps of the 1st, 4th, 7th, and 12th layers of the deep network in the spatial dimension. The feature differences between the two at each layer are measured by the Mean Squared Error (MSE) loss function.
After forward propagation through the corresponding encoder downsampling stages of the deep model and the lightweight model, we traverse the paired feature maps, extract the feature maps $T_f^i$ and $S_f^i$ of the two models at the $i$-th aligned layer, and compute the Mean Squared Error (MSE) between them, as shown in Equation (11).
$$L_{feat} = \frac{1}{N} \sum_{i=1}^{N} \left\| S_f^i - T_f^i \right\|^2 \quad (11)$$
where $N$ is the total number of feature map pairs, and $S_f^i$ and $T_f^i$ represent the feature maps of the lightweight network and the deep network at the $i$-th layer, respectively.
Based on the feature distillation loss, we jointly model it with the water body segmentation task loss to construct the overall optimization objective $L$, as shown in Equation (12):

$$L = \alpha L_{main} + \beta L_{feat} \quad (12)$$

where $L_{main}$ is the loss term of the semantic segmentation task and $L_{feat}$ is the feature-alignment distillation loss. In the distillation experiments, the weighting coefficients $\alpha$ and $\beta$ were set to 0.7 and 0.3, respectively, determined through multiple experiments; this setting consistently achieved the best trade-off between segmentation accuracy and distillation effectiveness.
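The joint objective of Equations (11) and (12) can be sketched as follows; the snippet assumes the paired feature maps have already been spatially and channel-wise aligned, and the cross-entropy segmentation loss is an illustrative choice for $L_{main}$.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_feats, teacher_feats, seg_logits, labels,
                      alpha: float = 0.7, beta: float = 0.3):
    """Joint objective of Eq. (12): segmentation loss plus the
    feature-alignment MSE of Eq. (11) over paired encoder features."""
    l_main = F.cross_entropy(seg_logits, labels)
    l_feat = sum(F.mse_loss(s, t.detach())          # teacher features are frozen
                 for s, t in zip(student_feats, teacher_feats)) / len(student_feats)
    return alpha * l_main + beta * l_feat
```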
The optimization of the overall loss function enables the lightweight model to learn the discriminative ability of the deep model layer by layer. As a result, with an extremely low number of parameters and computational resource consumption, it can achieve accurate segmentation of the boundaries and shapes of water bodies in agricultural watersheds.

2.5. Training Strategy and Transfer Pipeline

In this study, we designed a comprehensive experimental workflow to validate the effectiveness, efficiency, and generalization ability of the proposed LKF-DCANet. The entire pipeline consists of three main stages: supervised pre-training on a large-scale satellite dataset, knowledge distillation for lightweight optimization, and cross-domain transfer learning on UAV imagery.
We first trained both the deep network and the lightweight LKF-DCANet on the publicly available GLH-water dataset [44]. The deep network adopts a U-shaped structure with MobileNetV2 as the backbone and integrates an LKF module in the decoder. The lightweight network utilizes a shallower encoder built with CADCN and a decoder that incorporates both the CATM and LKF modules.
To improve the segmentation performance of the lightweight model while maintaining computational efficiency, we adopted a feature-based knowledge distillation strategy. During training, the deep network provided intermediate feature representations, which the lightweight network learned to align at multiple layers via a Mean Squared Error (MSE) loss. This strategy enabled the lightweight model to inherit the discriminative capabilities of the deeper architecture, significantly enhancing segmentation accuracy.
To assess cross-domain generalization, we transferred the pre-trained lightweight model to a self-constructed UAV dataset collected from the Dongfengqu Irrigation District in Sichuan, China. We fine-tuned the model using only 10% of the labeled images. After adaptation, the model was employed to infer the remaining 90% of the unlabeled data, generating pseudo-labels for evaluation. The complete workflow of pretraining, distillation, fine-tuning, and pseudo-labeling is illustrated in Figure 9.
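The fine-tuning and pseudo-labeling stages of this pipeline can be sketched as follows; function and loader names, the number of epochs, and the loss are illustrative, not the exact training script.

```python
import torch

def fine_tune_and_pseudo_label(model, labeled_loader, unlabeled_loader,
                               pretrained_path: str, epochs: int = 50,
                               lr: float = 2e-4, device: str = "cuda"):
    """Sketch of the transfer pipeline: load GLH-water weights, fine-tune on
    the 10% labeled UAV split, then infer pseudo-labels for the rest."""
    model.load_state_dict(torch.load(pretrained_path, map_location=device))
    model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()

    model.train()
    for _ in range(epochs):                       # fine-tuning stage
        for img, mask in labeled_loader:
            img, mask = img.to(device), mask.to(device)
            opt.zero_grad()
            loss = loss_fn(model(img), mask)
            loss.backward()
            opt.step()

    model.eval()
    pseudo_labels = []                            # pseudo-labeling stage
    with torch.no_grad():
        for img in unlabeled_loader:
            pred = model(img.to(device)).argmax(dim=1)
            pseudo_labels.append(pred.cpu())
    return pseudo_labels
```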

3. Experiments and Results

3.1. Evaluation Metrics

Water body segmentation in farmland irrigation areas must be evaluated under scenarios such as long, narrow water bodies, fragmented distributions, and strong background interference (building occlusion, vegetation coverage, topographic shadows). In these scenes, the number of water-body pixels is usually much smaller than that of background pixels, resulting in a serious class imbalance between target and background. This imbalance undermines the effectiveness of traditional semantic segmentation metrics, such as Overall Accuracy (OA) and Mean Intersection over Union (mIoU), in reflecting the actual performance of water body segmentation.
To compensate for the evaluation bias caused by the excessively high proportion of background pixels, this study adopted a refined index system to comprehensively evaluate the accuracy of water body segmentation in remote sensing images. The evaluation indicators used in the experiment are as follows:
$$IoU = \frac{TP}{TP + FP + FN}$$

$$Precision = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}$$
where TP denotes the number of samples correctly classified as water body, FP refers to background samples incorrectly predicted as water body, and FN refers to water body samples misclassified as background.
$$F1\text{-}Score = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

$$Params = k_h \times k_w \times C_{in} \times C_{out}$$

$$FLOPs = \left( C_{in} \times k_w \times k_h + C_{in} \times k_w \times k_h - 1 \right) \times C_{out} \times W \times H$$
where $k_h$ and $k_w$ denote the kernel height and width, $C_{in}$ is the number of input channels, and $C_{out}$ is the number of output channels.
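These metrics can be computed directly from the binary prediction and ground-truth masks, as in the following sketch.

```python
import numpy as np

def water_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7):
    """Compute IoU, Precision, Recall, and F1 for binary water masks
    (1 = water body, 0 = background)."""
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    iou = tp / (tp + fp + fn + eps)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return {"IoU": iou, "Precision": precision, "Recall": recall, "F1": f1}
```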

3.2. Training Protocol

All experiments were conducted on a workstation equipped with an NVIDIA RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA) and implemented using the PyTorch 2.5.0 framework. The proposed water body segmentation frameworks are based on a U-shaped architecture. During training, the Adam optimizer was employed. For the deep network, the initial learning rate was set to 5 × 10−4, with a momentum of 0.9 and a weight decay of 1 × 10−4. For the lightweight network, the initial learning rate was set to 2 × 10−4, with the same momentum and a weight decay of 1 × 10−5. Both the deep network and the lightweight LKF-DCANet adopted a cosine annealing learning rate schedule, with the minimum learning rate set to 1% of the initial value. The batch size was fixed at 8 for both, and all models were trained for 500 epochs.
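The reported optimizer and schedule for the lightweight network can be expressed as in the sketch below (the deep network differs only in learning rate and weight decay); a 1 × 1 convolution and a dummy loader stand in for LKF-DCANet and the real data pipeline so the snippet is self-contained.

```python
import torch
import torch.nn as nn

# Stand-in model so the snippet runs standalone; swap in LKF-DCANet and the
# real dataloader in practice.
model = nn.Conv2d(3, 2, kernel_size=1)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4,
                             betas=(0.9, 0.999), weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=500, eta_min=2e-6)   # minimum LR = 1% of the initial value

dummy_loader = [(torch.randn(8, 3, 64, 64),
                 torch.randint(0, 2, (8, 64, 64)))]   # batch size 8

for epoch in range(2):                    # 500 epochs in the actual protocol
    for images, masks in dummy_loader:
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(images), masks)
        loss.backward()
        optimizer.step()
    scheduler.step()
```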

3.3. Experimental Results

To comprehensively evaluate the effectiveness of the proposed deep water body segmentation network, we compare it with a range of representative models on the publicly available GLH-water dataset. These include classic semantic segmentation networks such as DeepLabV3+, FCN8s, HRNet-48, PSPNet, and STDC-1446, which are widely used in natural scene segmentation tasks; water body-oriented models like MSResNet, MECNet, and MSCENet, which are specifically designed to capture spectral and structural features of water regions in remote sensing imagery; high-resolution segmentation methods such as MagNet, FCL, and ISDNet, which focus on preserving fine-grained boundaries in very high-resolution images; and PCL, which introduces a pyramid consistency loss to address visual inconsistency and topological discontinuity in ultra-large-scale satellite imagery. All models are trained under the same settings, and their performance is evaluated using IoU and F1-Score. The experimental results are shown in Table 1.
Among them, the proposed deep water body segmentation network achieves 88.86% in IoU and 94.10% in F1-Score. Compared with other methods, this represents an improvement of 6.6–44.19% in IoU and 3.83–32.35% in F1-Score. These results clearly validate the effectiveness of our architectural design in suppressing noise and enhancing segmentation accuracy under complex environmental conditions.

3.3.1. Ablation Experiments

To explore the impact of each proposed component, we performed an ablation study by incrementally adding modules to a baseline network. The configurations include combinations of CADCN, LKF, CATM, and their fusions. We compared model parameters, FLOPs, and IoU, as shown in Table 2.
The baseline model (MobileNetV2) achieved an IoU of 86.19%, with a computational cost of 136.06 GFLOPs and 6.644 M parameters. After integrating the LKF module, the IoU increased significantly to 88.86%, while the computation only slightly rose to 136.088 GFLOPs, indicating that the LKF module is both lightweight and effective in stabilizing feature representations. To reduce computational cost, we replaced the deep encoder with a shallow CADCN, resulting in a lightweight model with 0.10 M parameters and 72.91 GFLOPs. However, the IoU dropped to 81.62%, suggesting a loss of feature representation capacity under aggressive compression. To address this, we introduced the CATM module in the decoder. This addition slightly increased the parameter count to 0.22 M, reduced the computational load to 71.38 GFLOPs, and improved the IoU to 82.55%. These results demonstrate that the CATM module effectively compensates for the performance degradation caused by model compression.

3.3.2. Distillation Experiment

In practical applications, agricultural equipment is usually restricted by limited computational resources and strict parameter budgets. To address this issue, we not only compared the proposed deep segmentation model with mainstream methods but also evaluated the performance of the lightweight model and the effect after knowledge distillation. The results are shown in Table 3.
On the publicly available GLH-water dataset, DeepLabv3+ achieved an IoU of 79.80% with 5.81 million parameters and 211.47 GFLOPs, while PSPNet reached an IoU of 75.19% with 2.38 million parameters and 24.12 GFLOPs. In contrast, our lightweight model achieved a higher IoU of 82.55% using only 0.22 million parameters and 71.38 GFLOPs, demonstrating its suitability for deployment on resource-constrained platforms.
Furthermore, after applying feature-based knowledge distillation, the model retained the same architecture and computational cost, but the IoU improved to 85.95%, an increase of 3.40 percentage points. These results indicate that transferring discriminative feature representations from the deep network significantly enhances the performance of the lightweight model.
As illustrated in Figure 10, the red boxes highlight the qualitative differences in segmentation results after integrating various modules. The baseline model (MobileNetV2) struggles in regions with natural illumination variation, water quality differences, and vegetation occlusion, leading to noticeable misclassifications and blurred boundaries. After incorporating the LKF module, the model demonstrates enhanced noise resilience. The red-boxed regions show clearer and more continuous water boundaries, with a significant reduction in false positives, indicating that the LKF module effectively stabilizes feature representations and suppresses optical noise.
When compressing the model using the shallow CADCN encoder, segmentation accuracy degrades, particularly in narrow and fragmented water bodies. The red boxes reveal broken edges and increased prediction noise, reflecting reduced representational capacity under aggressive model simplification. By applying a feature-based knowledge distillation strategy (Distilled LKF-DCANet), the visual results show more coherent contours and fewer misclassified pixels in complex backgrounds.

3.3.3. Transfer Learning Experiment

To verify the cross-domain adaptability and generalization ability of the proposed model in complex farmland scenarios, this study designed a two-stage transfer learning framework. In the first stage, based on the lightweight model LKF-DCANet after knowledge distillation, the pretrained weights obtained on the public dataset GLH-water were transferred to the self-constructed UAV-acquired farmland water body dataset. In the second stage, only 10% of the labeled samples in the UAV dataset were used to fine-tune the weights of the LKF-DCANet, enabling it to quickly adapt to the brand-new farmland water body environment. Finally, predictions were made on the remaining 90% of the unlabeled data based on the fine-tuned model. The experimental results, as shown in Table 4, indicate that a segmentation accuracy of 96.28% is achieved on the self-constructed UAV dataset, enabling efficient knowledge transfer at a limited annotation cost.
The visual segmentation results are shown in Figure 11, which illustrates qualitative predictions across representative categories of agricultural surface water bodies. Specifically, the model demonstrates stable recognition performance for narrow linear structures such as drainage ditches and irrigation channels (Figure 11a), as well as spatially fragmented or irregular water bodies, including farm ponds and shallow seasonal reservoirs (Figure 11b,c). Furthermore, Figure 11d presents challenging shallow water bodies such as flooded fields and vegetated wetland patches, which often exhibit low spectral contrast and mixed contamination from surrounding terrain. These examples highlight the model’s capacity for fine-grained segmentation, even in visually complex environments, and its adaptability to diverse water body types commonly encountered in agricultural watersheds. Overall, the proposed LKF-DCANet achieves accurate boundary delineation and strong robustness against noise and appearance ambiguity.

4. Discussion

In this study, aiming at the key challenges that are widespread in water body segmentation for agricultural remote sensing, such as complex noise interference, the diversity of water body shapes, the limited computational resources of agricultural devices, and the scarcity of high-quality labeled datasets, a lightweight segmentation network, LKF-DCANet, that integrates multiple modules is proposed. While effectively improving segmentation accuracy, the network significantly reduces the parameter scale and computational cost, demonstrating excellent generalization performance and practical deployment potential. This section discusses the theoretical rationale of the model design, its structural versatility, and its scene adaptability, so as to further clarify the technical advantages and application value of the method.

4.1. Theoretical Rationale and Design Motivation

Remote sensing images, whether acquired from satellite platforms or low-altitude unmanned aerial vehicles, generally contain non-ideal factors such as illumination changes, haze interference, and vegetation occlusion. In response, this paper introduces the LKF module, which bridges the mechanism gap between traditional filtering models and deep learning and alleviates the bottleneck of traditional filters, namely their strong dependence on prior parameters and poor adaptability in nonlinear scenarios. The module sets the state transition matrix and the covariance matrices as trainable parameters, enabling adaptive correction of disturbed features during end-to-end training. This allows the model to continuously correct prediction errors in remote sensing scenes with spatial noise and observational uncertainties, enhancing the consistency and stability of its representation. At the same time, the matrix reasoning on which the module depends is essentially a highly parallel tensor operation, offering good computational efficiency and low memory usage. The experimental results further show that this module effectively enhances the stability of feature representation during the feature extraction stage; especially in typical interference areas such as vegetation occlusion and shadow coverage, it significantly improves the classification accuracy of the model.

4.2. Multi-Module Collaboration and Feature Enhancement

In order to address the challenges posed by complex water body morphologies in agricultural watersheds during the feature extraction stage, this study employed the CADCN. Compared with traditional convolutional compression methods, CADCN enhances the response intensity of key semantic regions while retaining the ability to perceive multi-scale structures, avoiding the degradation of semantic information caused by excessive pruning. However, shallow structures still have limitations in global context modeling and struggle to capture long-distance semantic dependency relationships across regions. To compensate for this deficiency, the CATM module is introduced in the decoder part, constructing a decoding mechanism that integrates local convolutional perception and additive information reconstruction. Experimental results show that, without significantly increasing the computational burden, the CATM module enhances the modeling performance for features such as wide spatial distribution and fragmented water body structures in remote sensing images. Through its collaborative effect with the CADCN in the encoder, the adaptability of the model to the recognition of water body boundaries and multi-scale morphologies is significantly improved.

4.3. Knowledge Distillation for Lightweight Accuracy Preservation

To address the problems that the lightweight model has limited expressive ability against complex remote sensing backgrounds and is prone to blurred boundaries and missed detection of small targets, this paper further introduces a feature-level knowledge distillation strategy. Different from traditional distillation methods based on output soft labels, this strategy semantically aligns the intermediate features at the encoder stage of the deep network, guiding the shallow model to enhance its discriminative ability for key areas while maintaining a compact structure. To verify the actual effect of this strategy in the task of remote sensing water body segmentation, this paper further conducts a visual comparison of the attention heatmaps of the deep teacher model and the lightweight model at different levels, as shown in Figure 12. The results show that the deep model retains richer spatial and semantic information in each layer, while the lightweight model, despite its compact structure, can still effectively capture the main contours of the water bodies and exhibits an attention distribution similar to that of the deep model. This verifies the significant role of this strategy in enhancing the integrity of semantic representation and segmentation accuracy and provides effective support for high-performance water body recognition under low-resource conditions.

4.4. Cross-Platform Generalization and Data Efficiency

To address the practical challenge of the shortage of high-quality annotated data for agricultural remote sensing, this paper constructs a transfer learning framework that combines pre-training and fine-tuning. By conducting preliminary training on a large-scale publicly available water body dataset and then transferring the learned knowledge to the agricultural water body image data captured by UAVs, high-precision recognition of agricultural watersheds can be achieved even with a very small number of annotated samples. The experimental results show that this strategy not only significantly improves the model’s cross-domain prediction ability but also effectively reduces the degree of data dependence in remote sensing application scenarios, enhancing the model’s general adaptability across multiple geographical regions and multiple observation platforms. The effectiveness of this transfer mechanism indicates that LKF-DCANet also has strong generalizability in scenarios with scarce data, providing a technical reference for the monitoring of water resources in wide-area farmland.

4.5. Computational Limitations and Future Directions

Although the proposed LKF-DCANet demonstrates strong performance, fast inference speed, and good generalization ability in small-scale, high-resolution UAV remote sensing scenarios, its effectiveness declines under certain complex real-world conditions. For example, in medium-resolution satellite imagery such as Sentinel-2 (10 m per pixel), detecting narrow water bodies such as irrigation canals and drainage ditches spanning only 1–3 pixels remains challenging due to insufficient spatial detail. In addition, on the GLH-water dataset, the model achieves an IoU of 85.95%, which is not extremely high but reflects a reasonable trade-off under the lightweight design constraints. Architectural simplifications, such as reduced channel dimensions and a shallower encoder, were intentionally introduced to minimize computational complexity, which inevitably leads to some performance gap compared with heavier models. Moreover, physical environments such as tree-shaded rivers, sediment-laden ponds, or temporarily dried-up ditches can cause severe spectral ambiguity, often resulting in missed detections or blurred boundaries.
To evaluate the computational feasibility of LKF-DCANet in real-world satellite applications, we tested its performance on a full Sentinel-2 scene (12,800 × 12,800 pixels) using an NVIDIA RTX 4090 GPU. The image was divided into 1024 × 1024 patches with a 128-pixel overlap to ensure seamless reconstruction and reduce edge artifacts. Each patch requires 71.38 GFLOPs, and the inference time per patch is approximately 30–40 ms. Processing the entire scene involves about 200 overlapping patches, followed by probabilistic stitching and boundary refinement during post-processing. As a result, the end-to-end processing time for a full Sentinel-2 scene is approximately 10–12 min, including GPU computation, memory I/O, and reconstruction. These findings highlight the practical limitations of deploying the model on ultra-high-resolution imagery, where data volume and memory overhead become significant despite its lightweight design (only 0.22 M parameters).
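The overlapped tiled inference described above can be sketched as follows; probability averaging in the overlap regions stands in for the probabilistic stitching and boundary refinement steps, which are simplified here.

```python
import torch

@torch.no_grad()
def sliding_window_inference(model, scene: torch.Tensor,
                             tile: int = 1024, overlap: int = 128,
                             num_classes: int = 2) -> torch.Tensor:
    """Overlapped tiled inference for a full scene (1, C, H, W): softmax
    probabilities are accumulated, averaged where tiles overlap, and argmaxed.
    Assumes the model and scene are on the same device and H, W >= tile."""
    _, _, h, w = scene.shape
    stride = tile - overlap
    probs = torch.zeros(1, num_classes, h, w, device=scene.device)
    count = torch.zeros(1, 1, h, w, device=scene.device)
    ys = list(range(0, max(h - tile, 0) + 1, stride))
    xs = list(range(0, max(w - tile, 0) + 1, stride))
    if ys[-1] + tile < h:
        ys.append(h - tile)      # cover the bottom border
    if xs[-1] + tile < w:
        xs.append(w - tile)      # cover the right border
    for y in ys:
        for x in xs:
            patch = scene[:, :, y:y + tile, x:x + tile]
            logits = model(patch)
            probs[:, :, y:y + tile, x:x + tile] += logits.softmax(dim=1)
            count[:, :, y:y + tile, x:x + tile] += 1
    return (probs / count).argmax(dim=1)
```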

5. Conclusions

This study proposed LKF-DCANet, an efficient water body segmentation framework that integrates a learnable Kalman filter, lightweight feature extraction, self-attention mechanisms, and feature distillation. The framework balances representation capacity with computational efficiency and demonstrates strong adaptability to fragmented water bodies and complex boundary structures. Experimental results confirm its high accuracy and low computational cost on both public remote sensing datasets and UAV imagery.
Future work will focus on enhancing the model’s robustness under extreme weather conditions, discontinuous water structures, and complex background interference. In addition, we will explore the integration of multi-modal remote sensing data and domain-adaptive training strategies to improve generalization in unseen agricultural regions. To further enhance deployment feasibility in large-scale satellite scenes, we also plan to incorporate hierarchical resolution processing, memory-efficient tile streaming, and patch aggregation mechanisms. These improvements aim to optimize the model’s parallelism and reduce latency, supporting real-time and low-power agricultural applications.

Author Contributions

Conceptualization, D.L. and J.S.; methodology, J.S.; software, D.L.; validation, Z.D., Y.Z. and D.O.; formal analysis, J.S.; investigation, D.L.; resources, J.S.; data curation, Z.D., Y.Z. and J.Z.; writing—original draft preparation, D.L.; writing—review and editing, J.S.; visualization, Z.D.; supervision, D.O.; project administration, D.O.; funding acquisition, J.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Open Fund of the Key Laboratory of Investigation, Monitoring, Protection and Utilization for Cultivated Land Resources, Ministry of Natural Resources (Grant Nos. CLRKL2024GP10 and CLRKL2024KP02); the Observation and Research Station of Land Ecology and Land Use in Chengdu Plain, Ministry of Natural Resources, P.R. China (Grant Nos. CDORS-2024-06 and CDORS-2024-08); the Natural Science Foundation of Sichuan, China (No. 2024NSFSC0075); and the Provincial Undergraduate Training Program on Innovation and Entrepreneurship (No. S202410626024).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data and algorithm code can be obtained at https://github.com/Estrellading/LKF-DCANet (accessed on 27 May 2025).

Acknowledgments

The authors sincerely thank the editor and anonymous reviewers for their constructive comments and suggestions, which have greatly improved the quality of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LKF-DCANet: Learnable Kalman Filter and Deformable Convolutional Attention Network
LKF: Learnable Kalman Filter
CADCN: Channel Attention-Enhanced Deformable Convolutional Network
CATM: Convolutional Additive Token Mixer
GELU: Gaussian Error Linear Unit
FLOPs: Floating Point Operations
Params: Parameters
IoU: Intersection over Union
F1: F1-Score
UAV: Unmanned Aerial Vehicle
CNN: Convolutional Neural Network
DJI: Da Jiang Innovations
GPU: Graphics Processing Unit
CUDA: Compute Unified Device Architecture

References

  1. Wen, D.; Huang, X.; Bovolo, F.; Li, J.; Ke, X.; Zhang, A.; Benediktsson, J.A. Change detection from very-high-spatial-resolution optical remote sensing images: Methods, applications, and future directions. IEEE Geosci. Remote Sens. Mag. 2021, 9, 68–101. [Google Scholar] [CrossRef]
  2. Knox, J.W.; Kay, M.G.; Weatherhead, E.K. Water regulation, crop production, and agricultural water management—Understanding farmer perspectives on irrigation efficiency. Agric. Water Manag. 2012, 108, 3–8. [Google Scholar] [CrossRef]
  3. Karpatne, A.; Khandelwal, A.; Chen, X.; Mithal, V.; Faghmous, J.; Kumar, V. Global Monitoring of Inland Water Dynamics: State-of-the-Art, Challenges, and Opportunities. In Computational Sustainability; Springer: Cham, Switzerland, 2016; pp. 121–147. [Google Scholar] [CrossRef]
  4. Lei, P.; Yi, J.; Li, S.; Li, Y.; Lin, H. Agricultural surface water extraction in environmental remote sensing: A novel semantic segmentation model emphasizing contextual information enhancement and foreground detail attention. Neurocomputing 2025, 617, 129110. [Google Scholar] [CrossRef]
  5. Li, Y.; Dang, B.; Zhang, Y.; Du, Z. Water body classification from high-resolution optical remote sensing imagery: Achievements and perspectives. ISPRS J. Photogramm. Remote Sens. 2022, 187, 306–327. [Google Scholar] [CrossRef]
  6. McFeeters, S.K. The use of the Normalized Difference Water Index (NDWI) in the delineation of open water features. Int. J. Remote Sens. 1996, 17, 1425–1432. [Google Scholar] [CrossRef]
  7. Xu, H. Modification of normalised difference water index (NDWI) to enhance open water features in remotely sensed imagery. Int. J. Remote Sens. 2006, 27, 3025–3033. [Google Scholar] [CrossRef]
  8. Feyisa, G.L.; Meilby, H.; Fensholt, R.; Proud, S.R. Automated Water Extraction Index: A new technique for surface water mapping using Landsat imagery. Remote Sens. Environ. 2014, 140, 23–35. [Google Scholar] [CrossRef]
  9. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  10. Deng, R.; Huang, J.F.; Wang, F.M. Research on extraction method of water body with DS spectral enhancement based on HJ-1 images. Spectrosc. Spectr. Anal. 2011, 31, 3064–3068. [Google Scholar] [CrossRef]
  11. Friedl, M.A.; Brodley, C.E. Decision tree classification of land cover from remotely sensed data. Remote Sens. Environ. 1997, 61, 399–409. [Google Scholar] [CrossRef]
  12. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  13. Guo, Z.; Wu, L.; Huang, Y.; Guo, Z.; Zhao, J.; Li, N. Water-body segmentation for SAR images: Past, current, and future. Remote Sens. 2022, 14, 1752. [Google Scholar] [CrossRef]
  14. Yang, S.; Wang, L.; Yuan, Y.; Fan, L.; Wu, Y.; Sun, W.; Yang, G. Recognition of small water bodies under complex terrain based on SAR and optical image fusion algorithm. Sci. Total Environ. 2024, 946, 174329. [Google Scholar] [CrossRef]
  15. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  16. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar] [CrossRef]
  17. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Part III. Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar] [CrossRef]
  18. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar] [CrossRef]
  19. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar] [CrossRef]
  20. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  21. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar] [CrossRef]
  22. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364. [Google Scholar] [CrossRef]
  23. Kang, J.; Guan, H.; Peng, D.; Chen, Z. Multi-scale context extractor network for water-body extraction from high-resolution optical remotely sensed images. Int. J. Appl. Earth Obs. Geoinf. 2021, 103, 102499. [Google Scholar] [CrossRef]
  24. Gao, S.H.; Cheng, M.M.; Zhao, K.; Zhang, X.Y.; Yang, M.H.; Torr, P. Res2net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 652–662. [Google Scholar] [CrossRef] [PubMed]
  25. Wang, B.; Chen, Z.; Wu, L.; Yang, X.; Zhou, Y. SADA-net: A shape feature Optimization and multiscale context information-based Water Body extraction method for high-resolution remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 1744–1759. [Google Scholar] [CrossRef]
  26. Xiang, D.; Zhang, X.; Wu, W.; Liu, H. Denseppmunet-a: A robust deep learning network for segmenting water bodies from aerial images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4202611. [Google Scholar] [CrossRef]
  27. Chen, C.; Wang, Y.; Yang, S.; Ji, X.; Wang, G. A K-Net-based hybrid semantic segmentation method for extracting lake water bodies. Eng. Appl. Artif. Intell. 2023, 126, 106904. [Google Scholar] [CrossRef]
  28. Liu, B.; Du, S.; Bai, L.; Ouyang, S.; Wang, H.; Zhang, X. Water extraction from optical high-resolution remote sensing imagery: A multi-scale feature extraction network with contrastive learning. GIScience Remote Sens. 2023, 60, 2166396. [Google Scholar] [CrossRef]
  29. Wang, J.; Wang, S.; Wang, F.; Zhou, Y.; Wang, Z.; Ji, J.; Xiong, Y.; Zhao, Q. FWENet: A deep convolutional neural network for flood water body extraction based on SAR images. Int. J. Digit. Earth 2022, 15, 345–361. [Google Scholar] [CrossRef]
  30. Zhang, Z.; Lu, M.; Ji, S.; Yu, H.; Nie, C. Rich CNN features for water-body segmentation from very high resolution aerial and satellite imagery. Remote Sens. 2021, 13, 1912. [Google Scholar] [CrossRef]
  31. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  32. Zhong, H.F.; Sun, Q.; Sun, H.M.; Jia, R.S. NT-Net: A semantic segmentation network for extracting lake water bodies from optical remote sensing images based on transformer. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5627513. [Google Scholar] [CrossRef]
  33. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021. [Google Scholar]
  34. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4015–4026. [Google Scholar] [CrossRef]
  35. Keles, F.D.; Wijewardena, P.M.; Hegde, C. On the computational complexity of self-attention. In Proceedings of the 34th International Conference on Algorithmic Learning Theory, PMLR, Singapore, 20–23 February 2023; pp. 597–619. [Google Scholar]
  36. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar] [CrossRef]
  37. Denil, M.; Shakibi, B.; Dinh, L.; Ranzato, M.A.; De Freitas, N. Predicting Parameters in Deep Learning. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013; p. 26. [Google Scholar] [CrossRef]
  38. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. Unet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE Trans. Med. Imaging 2019, 39, 1856–1867. [Google Scholar] [CrossRef]
  39. Jégou, S.; Drozdzal, M.; Vazquez, D.; Romero, A.; Bengio, Y. The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 11–19. [Google Scholar] [CrossRef]
  40. Lo, S.Y.; Hang, H.M.; Chan, S.W.; Lin, J.J. Efficient dense modules of asymmetric convolution for real-time semantic segmentation. In Proceedings of the 1st ACM International Conference on Multimedia in Asia, Beijing, China, 15–18 December 2019; pp. 1–6. [Google Scholar] [CrossRef]
  41. Watanabe, S.; Hori, T.; Karita, S.; Hayashi, T.; Nishitoba, J.; Unno, Y.; Soplin, N.E.; Heymann, J.; Wiesner, M.; Chen, N.; et al. ESPnet: End-to-end speech processing toolkit. arXiv 2018, arXiv:1804.00015. [Google Scholar]
  42. Mehta, S.; Rastegari, M.; Shapiro, L.; Hajishirzi, H. Espnetv2: A light-weight, power efficient, and general purpose convolutional neural network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9190–9200. [Google Scholar] [CrossRef]
  43. Nagaraj, R.; Kumar, L.S. Extraction of surface water bodies using optical remote sensing images: A review. Earth Sci. Inform. 2024, 17, 893–956. [Google Scholar] [CrossRef]
  44. Li, Y.; Dang, B.; Li, W.; Zhang, Y. Glh-water: A large-scale dataset for global surface water detection in large-size very-high-resolution satellite imagery. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 22213–22221. [Google Scholar] [CrossRef]
  45. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar] [CrossRef]
  46. Chui, C.K.; Chen, G. Kalman Filtering; Springer International Publishing: Berlin, Germany, 2017; pp. 19–26. [Google Scholar] [CrossRef]
  47. Revach, G.; Shlezinger, N.; Ni, X.; Escoriza, A.L.; Van Sloun, R.J.; Eldar, Y.C. KalmanNet: Neural network aided Kalman filtering for partially known dynamics. IEEE Trans. Signal Process. 2022, 70, 1532–1547. [Google Scholar] [CrossRef]
  48. Bai, Y.; Yan, B.; Zhou, C.; Su, T.; Jin, X. State of art on state estimation: Kalman filter driven by machine learning. Annu. Rev. Control 2023, 56, 100909. [Google Scholar] [CrossRef]
  49. Zhang, T.; Li, L.; Zhou, Y.; Liu, W.; Qian, C.; Hwang, J.N.; Ji, X. Cas-vit: Convolutional additive self-attention vision transformers for efficient mobile applications. arXiv 2024, arXiv:2408.03703. [Google Scholar] [CrossRef]
  50. Wang, W.; Dai, J.; Chen, Z.; Huang, Z.; Li, Z.; Zhu, X.; Hu, X.; Lu, T.; Lu, L.; Li, H.; et al. Internimage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14408–14419. [Google Scholar] [CrossRef]
  51. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar] [CrossRef]
  52. Heo, B.; Kim, J.; Yun, S.; Park, H.; Kwak, N.; Choi, J.Y. A comprehensive overhaul of feature distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1921–1930. [Google Scholar] [CrossRef]
  53. Guo, S.; Liu, L.; Gan, Z.; Wang, Y.; Zhang, W.; Wang, C.; Jiang, G.; Zhang, W.; Yi, R.; Ma, L.; et al. Isdnet: Integrating shallow and deep networks for efficient ultra-high resolution segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4361–4370. [Google Scholar] [CrossRef]
  54. Huynh, C.; Tran, A.T.; Luu, K.; Hoai, M. Progressive semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16755–16764. [Google Scholar] [CrossRef]
  55. Dang, B.; Li, Y. MSResNet: Multiscale residual network via self-supervised learning for water-body detection in remote sensing imagery. Remote Sens. 2021, 13, 3122. [Google Scholar] [CrossRef]
  56. Li, Q.; Yang, W.; Liu, W.; Yu, Y.; He, S. From contexts to locality: Ultra-high resolution image segmentation via locality-aware contextual correlation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 7252–7261. [Google Scholar] [CrossRef]
  57. Fan, M.; Lai, S.; Huang, J.; Wei, X.; Chai, Z.; Luo, J.; Wei, X. Rethinking bisenet for real-time semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9716–9725. [Google Scholar] [CrossRef]
Figure 1. Example images and corresponding segmentation masks from the GLH-water dataset.
Figure 2. Study area for UAV data collection in Dongfengqu Irrigation District, China.
Figure 3. Architecture of the deep network based on MobileNetV2.
Figure 4. Recursive spatial estimation pipeline based on the learnable Kalman filter.
Figure 5. Overall architecture of the proposed lightweight LKF-DCANet.
Figure 6. Illustration of 3 × 3 deformable convolution.
Figure 7. Structure of the Convolutional Additive Token Mixer Module (CATM).
Figure 8. Feature distillation structure between LKF-DCANet and the deep network via multi-level MSE loss.
Figure 9. Two-stage transfer learning framework based on pretrained LKF-DCANet.
Figure 10. Qualitative segmentation results of different model variants on GLH-water.
Figure 11. Qualitative prediction results on unlabeled UAV data after fine-tuning.
Figure 12. Visualization of attention maps at each downsampling stage for the lightweight and deep models. Warm colors (yellow to red) represent higher feature responses, while cool colors (blue) indicate lower responses.
Table 1. Performance comparison of different segmentation models in terms of IoU and F1-Score on GLH-water dataset [44].

| Methods | IoU (%) | F1-Score |
|---|---|---|
| MECNet [30] | 44.67 | 61.75 |
| ISDNet [53] | 53.04 | - |
| MagNet [54] | 62.77 | - |
| MSResNet [55] | 69.76 | 82.18 |
| FCN8s [16] | 73.66 | 84.83 |
| FCtL [56] | 74.92 | 85.66 |
| MSCENet [23] | 74.81 | 85.58 |
| PSP-Net [18] | 75.19 | 85.84 |
| STDC-1446 [57] | 75.82 | 86.25 |
| HRNet-48 [22] | 78.6 | 88.01 |
| DeepLab-V3+ [21] | 79.8 | 88.76 |
| PCL [44] | 82.26 | 90.27 |
| Our Deep Network | 88.86 | 94.10 |
Table 2. Ablation study of module combinations: parameters, FLOPs, and IoU on GLH-water.

| Model Name | Params (M) | FLOPs (G) | IoU (%) |
|---|---|---|---|
| MobileNetV2 [45] | 6.644 | 136.064 | 86.19 |
| MobileNetV2+LKF | 6.644 | 136.088 | 88.86 |
| CADCN+LKF | 0.10 | 72.908 | 81.62 |
| LKF-DCANet | 0.22 | 71.377 | 82.55 |
Table 3. Performance comparison between lightweight and distilled models on GLH-water dataset.

| Model Name | Params (M) | FLOPs (G) | IoU (%) |
|---|---|---|---|
| DeepLab-V3+ [21] | 5.81 | 211.47 | 79.8 |
| PSP-Net [18] | 2.38 | 24.12 | 75.19 |
| MobileNetV2+LKF | 6.644 | 136.088 | 88.86 |
| LKF-DCANet | 0.22 | 71.377 | 82.55 |
| Distilled LKF-DCANet | 0.22 | 71.377 | 85.95 |
Table 4. Transfer performance of the distilled LKF-DCANet on GLH-water and UAV datasets.

| Datasets | IoU (%) | F1-Score |
|---|---|---|
| GLH-Water [44] | 85.95 | 92.10 |
| Self-Constructed Dataset | 96.28 | 97.72 |
