1. Introduction
Water body segmentation holds significant application value for environmental governance and the advancement of unmanned surface vehicle (USV) technologies. In the context of environmental monitoring, real-time water detection enables hydrological disaster monitoring through water body segmentation [1,2,3]. For autonomous systems, the precise delineation of water bodies is essential for vessel navigation [4,5,6]. Recently, applications of water body segmentation have been further expanded to include ecosystem assessment [7,8] and climate change impact analysis [9].
Traditional sensors, including contact-based devices, radar, and sonar, face inherent limitations: contact sensors are restricted to static scenarios with short-range detection, while radar and sonar suffer from noise artifacts caused by fog, reflections, or turbulent waves. In contrast, optoelectronic cameras offer a cost-effective alternative but demand advanced vision algorithms to address complex environmental interference.
This study addresses the machine vision-based identification of water regions in outdoor scenes, which is critical for intelligent video surveillance in aquatic environments. Furthermore, water surface segmentation represents a quintessential case of complex image segmentation, providing insights into broader segmentation challenges. Existing water body segmentation methodologies are typically classified into image processing-based and machine learning-based frameworks. Image processing frameworks depend on empirical features (e.g., gradients, textures, edges), which exhibit limitations in generalizability and robustness. For instance, Zhao et al. [10] employed an adaptive-threshold Canny edge detection algorithm for river boundary detection. However, these frameworks demand a priori environmental knowledge and exhibit performance degradation under substantially varying conditions. With the proliferation of deep learning, machine learning-based frameworks have been deployed for water body segmentation. For example, Zhan et al. [11] developed an online learning framework leveraging Convolutional Neural Networks (CNNs) for water region detection in unknown navigation scenarios. Han et al. [12] pioneered the use of Fully Convolutional Networks (FCNs) for urban flood segmentation. Although these frameworks attain high accuracy, their computational complexity poses significant hardware demands.
Most state-of-the-art deep learning-based water segmentation frameworks require significant computational resources. Given practical deployment constraints and real-time inference requirements, this study proposes a network architecture that balances segmentation accuracy with computational efficiency. The primary challenge lies in reconciling segmentation accuracy with model parameter complexity and inference efficiency. Designing a lightweight yet accurate segmentation network specifically optimized for water region recognition therefore has significant academic and practical value.
The key contributions of this work are as follows:
- A Global–Local Feature (GLF) extraction module is proposed to enhance hierarchical feature representation within network depth and width constraints, thereby improving segmentation performance.
- A Gated Attention (GA) module is designed with skip connections to enable efficient feature utilization, enhancing segmentation accuracy and interference robustness.
- GFANet is implemented using a lightweight backbone integrated with the proposed modules. Under standardized training protocols, GFANet achieves segmentation accuracy comparable to complex models while demonstrating fewer parameters, lower computational complexity, and faster inference speed.
2. Related Work
Deep learning-driven image semantic segmentation has established itself as a prominent research domain in the past decade. Although deep learning-based segmentation has been widely adopted in autonomous driving for lane detection, its application to complex water body segmentation scenarios remains significantly underexplored, offering a promising research trajectory with substantial potential.
2.1. Feature-Enhanced Image Segmentation
The accurate segmentation of complex boundaries and ambiguous edges poses significant challenges. Continuous advancements in feature representation learning have improved segmentation performance. UNet [13] employs skip connections for multi-level feature fusion, integrating high-resolution shallow encoder features with deep decoder semantic features via channel-wise concatenation to preserve local details and global context. Traditional CNNs, constrained by limited receptive fields, face challenges in modeling long-range dependencies. PSPNet [14] addresses this limitation with a pyramid pooling module, which downsamples feature maps via multi-scale pooling, followed by bilinear upsampling and concatenation to enable explicit global context modeling. Concurrently, the DeepLab series [15] pioneered dilated convolutions, exponentially expanding receptive fields by adjusting dilation rates without parameter increase.
However, uncontrolled receptive field expansion risks local detail loss, introducing a trade-off between global and local feature representation. To resolve this, BiSeNet [16] introduced a dual-path architecture incorporating a feature fusion module with channel attention weighting to balance speed and accuracy. DANet [17] pioneered a dual-attention mechanism that captures channel correlations using covariance matrices and adaptively fuses outputs via learnable parameters, thereby improving segmentation precision for complex boundaries. CCNet [18] further optimized computational efficiency through a recursive criss-cross attention mechanism. This architecture aggregates contextual information along horizontal/vertical axes via two sequential criss-cross attention modules, concatenated with local feature maps to achieve superior segmentation performance.
These works demonstrate that integrating local and global semantic features is essential for segmenting complex boundaries. Building on this foundation, GFANet integrates two novel modules, Global–Local Feature (GLF) extraction and Gated Attention (GA), to enhance semantic segmentation performance.
2.2. Water Body Segmentation
Early water body segmentation was primarily based on handcrafted features and a priori knowledge. Rankin et al. [19] developed classification rules using color and texture analysis, though generalization remained limited by scene-specific lighting and terrain variations. Yao [20] proposed a hybrid framework integrating region growing and texture analysis: initial segmentation using brightness thresholds, followed by K-Means clustering on 9 × 9 image patches to identify water regions via minimal texture variance. However, shadow interference required stereo vision, thereby increasing computational complexity. To reduce manual intervention, Achar et al. [21] developed a self-supervised algorithm leveraging RGB, texture, and elevation features for patch-level binary classification. However, missegmentation persisted in complex boundary regions.
With the proliferation of CNNs, data-driven methodologies gained prominence. Elias et al. [22,23] pioneered encoder-decoder architectures for USV water detection, achieving real-time segmentation; however, single-scene training limited architectural generalization. Eltner et al. [24] integrated CNN segmentation with structure-from-motion (SfM) for 3D water level measurement, although cross-river applications necessitated manual parameter calibration. Blanch et al. [25] addressed this limitation by training a universal CNN on multi-basin heterogeneous datasets, achieving significant improvements in cross-regional river segmentation. Cao et al. [26] and Miao et al. [27] proposed high–low feature connection methods but neglected receptive field mismatches. Meanwhile, Cheng et al. [28] enhanced DeepLabv3+ for USV navigable area detection, improving the F1-score by 6.8% over baselines.
Recent advancements have focused on lightweight hybrid architectures to balance segmentation accuracy and computational efficiency. Kang et al. [29] developed WaterFormer, incorporating axial attention in the encoder to capture long-range coastline features while maintaining high-resolution outputs using a CNN decoder. Zhang et al. [30] developed MSF-Net, leveraging complementary infrared and visible-light data to enhance turbid water detection, thereby achieving a Dice coefficient of 91.5%.
Notwithstanding these advancements, key challenges remain: limited discriminative capability under complex lighting and turbid conditions, and the difficulty of balancing lightweight designs with high segmentation precision. This study therefore introduces a lightweight network with global–local feature extraction and gated attention mechanisms to balance robustness and real-time computational performance.
3. Materials and Methods
To address the speed–accuracy trade-off in practical applications, this study introduces GFANet (Global–Local Feature Attention Network), a lightweight architecture designed for real-time water surface semantic segmentation. GFANet employs an encoder–decoder architecture comprising three core components: (1) a backbone, (2) a Global–Local Feature (GLF) extraction module, and (3) a Gated Attention (GA) module.
3.1. Backbone
A lightweight and efficient backbone was developed to enhance inference speed. Ma et al. [31] developed ShuffleNetV2, demonstrating its superiority over competing lightweight architectures. While ShuffleNetV2 was proposed several years ago, its lightweight design principles remain a foundational guideline for efficient module architecture. Contemporaries such as MobileNet and EfficientNet continue to see widespread engineering applications, whereas extensive studies [32,33] have validated ShuffleNetV2’s superior hardware-friendly architecture and memory access efficiency compared to these counterparts. Therefore, to satisfy real-time inference requirements, ShuffleNetV2 was adopted as the backbone architecture.
The main design of ShuffleNetV2 includes the basic unit for feature representation, as shown in Figure 1a, and the down unit for downsampling, as shown in Figure 1b. In the basic unit, the input with c feature channels undergoes channel splitting into two branches. One branch preserves the identity mapping, while the other branch applies three convolutions with consistent input/output dimensions. The branch outputs are concatenated, maintaining channel count consistency. Channel shuffle is employed to fuse features and facilitate inter-branch communication. In the down unit, channel splitting is omitted, and the convolution stride is increased to double the output channels while halving the spatial resolution.
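For concreteness, the following minimal PyTorch sketch illustrates such a basic unit (channel split, an identity branch, a three-convolution branch, concatenation, and channel shuffle). Layer widths, normalization, and activation choices are assumptions rather than the exact GFANet implementation.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
    # Interleave channels so information flows between the two branches.
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class BasicUnit(nn.Module):
    """ShuffleNetV2-style basic unit: split -> (identity | 1x1 -> DW3x3 -> 1x1) -> concat -> shuffle."""
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False), nn.BatchNorm2d(half),
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x.chunk(2, dim=1)                      # channel split into two branches
        out = torch.cat((x1, self.branch(x2)), dim=1)   # concatenation keeps the channel count
        return channel_shuffle(out)                     # fuse features across branches
```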
Based on these two fundamental units and with reference to the design of ShuffleNetV2, we constructed the encoder backbone, as depicted in Figure 2. The encoder consists of four downsampling stages. The first stage involves feature transformation and extraction using a downsampling convolution, followed by max pooling. Subsequent feature extraction is carried out through three additional downsampling stages, which capture deeper semantic information. To ensure model efficiency and a lightweight design, each of the last three downsampling stages uses one down unit and three basic units. Feature maps from each downsampling stage are retained for skip connections.
Corresponding to the encoder, the decoder comprises four upsampling stages aimed at progressively restoring image resolution. The detailed architecture is illustrated in Figure 3. Stage 4 uses a 4× upsampling ratio, whereas the other stages employ 2× upsampling. The outputs of upsampling stages 1–3 are fused with the corresponding encoder feature maps (stages 3–1) via channel-wise concatenation, thereby supplementing deep semantic features with shallow spatial details. To keep the model lightweight, we apply 1 × 1 depthwise separable convolutions after the upsampling operations. This design significantly reduces the parameter count and computational overhead while preserving feature representation capability. Lastly, upsampling stage 4 restores the image to its original resolution and conducts pixel-wise semantic classification to accomplish the segmentation task.
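As an illustrative sketch (not the authors’ exact layer configuration), one decoder stage combining upsampling, skip concatenation, and a depthwise separable convolution could be written as follows; channel widths and normalization are assumptions.

```python
import torch
import torch.nn as nn

class UpStage(nn.Module):
    """One decoder stage: bilinear upsampling, channel-wise concatenation with the
    encoder skip feature, then a depthwise separable convolution to reduce channels."""
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int, scale: int = 2):
        super().__init__()
        self.up = nn.Upsample(scale_factor=scale, mode="bilinear", align_corners=False)
        fused = in_ch + skip_ch
        self.depthwise = nn.Conv2d(fused, fused, 3, padding=1, groups=fused, bias=False)
        self.pointwise = nn.Conv2d(fused, out_ch, 1, bias=False)   # 1x1 projection
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        x = torch.cat((self.up(x), skip), dim=1)   # supplement deep semantics with shallow detail
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```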
3.2. GLF
To improve the detailed information extraction and enhance the feature extraction capability of the network, we introduce the Global–Local Feature (GLF) extraction module. The architecture of this module is depicted in Figure 4.
The GLF module integrates local–global information, multi-scale features, attention mechanisms, and channel shuffling. These strategies enhance feature diversity and the expressive capacity of feature extraction, enabling the module to cope with the complex backgrounds common in such vision tasks and to compensate for the limited feature extraction capacity of a lightweight network. As a result, network training also becomes more stable.
Specifically, the module processes input features through dual-branch extraction and aggregation. The local branch uses four dilated convolutions with different rates, while the global branch applies self-attention.
The input features (x) pass through four parallel dilated convolutions. We specifically adopt dilation rates of 1, 2, 3, and 4 rather than alternatives such as 1, 3, 5, and 7, as this progressive configuration ensures smoother multi-scale feature coverage. Furthermore, the smaller dilation rates inherently reduce padding operations and computational overhead, enhancing the module’s efficiency. We concatenate their outputs to form the initial local features as follows:
Subsequently, a channel shuffle operation blends features from different convolutional layers, enhancing inter-channel information flow. A 1 × 1 convolutional layer then transforms the mixed features to match input dimensions, producing the final local feature output as follows:
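A minimal PyTorch sketch of this local branch, under the stated dilation rates and with assumed channel widths, could look like the following; it is an illustration rather than the authors’ exact implementation.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    # Reorder channels so that features from different branches are interleaved.
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class GLFLocalBranch(nn.Module):
    """Local branch of the GLF module: four parallel 3x3 convolutions with dilation
    rates 1-4, concatenation, channel shuffle, and a 1x1 projection back to the
    input channel count."""
    def __init__(self, channels: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d, bias=False)
            for d in (1, 2, 3, 4)
        ])
        self.project = nn.Conv2d(4 * channels, channels, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local = torch.cat([branch(x) for branch in self.branches], dim=1)  # initial local features
        local = channel_shuffle(local, groups=4)                           # blend features across branches
        return self.project(local)                                         # match the input dimensions
```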
The global branch applies self-attention to capture global features. This mechanism establishes global weighting across spatial positions, dynamically adjusting attention weights over all locations and significantly enhancing contextual awareness. Initially, the branch flattens the input to obtain the Query (Q), and extracts the Key (K) and Value (V) tensors through pyramid pooling operations. The implementation steps are as follows: in the pyramid pooling stage, we use four max pooling operations with strides of 2, 3, 5, and 6, and the pooled results are flattened and concatenated to obtain P. Because P contains a large amount of global information, we apply channel shuffle and pointwise convolution to enhance information interaction across groups and obtain K and V. In our design, K and V are equal.
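To illustrate this data path, the construction of Q, P, and the shared K/V tensor might be sketched as below; the pool strides follow the text, while the channel shuffle on P is omitted for brevity and the projection layer is an assumption.

```python
import torch
import torch.nn as nn

class PyramidKV(nn.Module):
    """Builds the attention inputs of the GLF global branch: Q is the flattened input,
    while K and V come from concatenated multi-scale max pooling (P) followed by a
    pointwise convolution."""
    def __init__(self, channels: int):
        super().__init__()
        self.pools = nn.ModuleList([nn.MaxPool2d(kernel_size=s, stride=s) for s in (2, 3, 5, 6)])
        self.proj = nn.Conv1d(channels, channels, kernel_size=1, bias=False)  # pointwise convolution

    def forward(self, x: torch.Tensor):
        q = x.flatten(2)                                                    # [B, C, H*W]
        p = torch.cat([pool(x).flatten(2) for pool in self.pools], dim=2)   # concatenated pooled tokens
        kv = self.proj(p)                                                   # K and V share this tensor
        return q, kv, kv
```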
After obtaining the attention parameters, we calculate the attention feature map. As shown in Equation (8), we compute the Q–K dot product and scale it by the square root of c (where c represents the feature dimension) to ensure numerical stability and mitigate gradient issues. Equation (8) applies softmax along the sequence dimension to convert the attention scores into probability distributions. These weights then generate the aggregated global context through a weighted summation of the Value (V) tensor, as shown in Equation (9). Finally, a convolutional linear transformation adjusts the feature dimensions, producing the refined global features in Equation (10).
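The equation bodies are not reproduced in this extract; in standard scaled dot-product form, consistent with the description above, the computation is expected to read:

```latex
A = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{c}}\right), \qquad
G = A\,V, \qquad
F_{\text{global}} = \operatorname{Conv}_{1\times 1}(G)
```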
The output of the Global–Local Feature (GLF) module is then obtained by combining the features from both branches through element-wise addition. This fusion integrates detailed spatial information from local features with global contextual semantics, producing a robust feature map optimized for segmentation tasks.
3.3. GA
In semantic segmentation models, skip connections preserve spatial details between the encoder and decoder, reducing information loss from downsampling and improving boundary localization. However, direct addition-based connections often propagate excessive low-level textures and noise, hindering effective feature discrimination.
To overcome these limitations, we propose the Gated Attention (GA) module. Its structure is illustrated in Figure 5. The GA module consists of two parts: convolutional attention and gating. The convolutional attention realizes multi-head attention through convolution operations, implemented as follows:
After the input is normalized, the number of channels is expanded threefold through a pointwise convolution, which is equivalent to concatenating Q, K, and V into a single tensor. Local features are then extracted through a depthwise separable convolution, and the output tensor is split into the attention parameters Q, K, and V via a chunk operation. Subsequently, the multi-head attention mechanism reshapes Q, K, and V from [B, dim, H, W] to [B, 4, dim/4, H*W]. After independently calculating the attention for each head and merging the results of the multiple heads, the total attention output is obtained, and the feature channels are adjusted through a pointwise convolution. Finally, the original input is added to the attention output through a residual connection.
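A compact PyTorch sketch of this convolutional multi-head attention with four heads is given below; the normalization layer and the scaling factor are assumptions, not the authors’ exact choices.

```python
import torch
import torch.nn as nn

class ConvMultiHeadAttention(nn.Module):
    """GA-style convolutional attention (sketch): a pointwise conv expands channels
    threefold into a fused Q/K/V tensor, a depthwise conv extracts local features,
    the tensor is chunked into Q, K, V, reshaped into 4 heads, attended, merged,
    projected, and added back to the input."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.heads = heads
        self.norm = nn.GroupNorm(1, dim)                                   # assumed normalization
        self.qkv = nn.Conv2d(dim, dim * 3, 1, bias=False)                  # pointwise expansion to Q, K, V
        self.dw = nn.Conv2d(dim * 3, dim * 3, 3, padding=1, groups=dim * 3, bias=False)
        self.proj = nn.Conv2d(dim, dim, 1, bias=False)                     # pointwise channel adjustment

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.dw(self.qkv(self.norm(x))).chunk(3, dim=1)
        # [B, dim, H, W] -> [B, heads, dim/heads, H*W]
        q = q.reshape(b, self.heads, c // self.heads, h * w)
        k = k.reshape(b, self.heads, c // self.heads, h * w)
        v = v.reshape(b, self.heads, c // self.heads, h * w)
        attn = torch.softmax(q @ k.transpose(-2, -1) / (h * w) ** 0.5, dim=-1)  # per-head attention
        out = (attn @ v).reshape(b, c, h, w)                                    # merge heads
        return x + self.proj(out)                                               # residual connection
```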
After the attention operation, we use a GLU (Gated Linear Unit) [34] for gating. This process is as follows: the original input is linearly transformed and divided into two branches, as shown in Equation (18). The first branch applies only convolution operations to extract features, while the second branch appends a GELU activation to generate gating weights. The outputs of the two branches are then multiplied element-wise to achieve gating.
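A corresponding sketch of the GLU-based gating is shown below; the expansion factor and kernel sizes are assumptions.

```python
import torch
import torch.nn as nn

class GatedFeedForward(nn.Module):
    """GLU-style gating (sketch): the input is projected into two branches; one
    extracts features with convolutions only, the other adds a GELU to produce
    gate weights, and the two branches are multiplied element-wise."""
    def __init__(self, dim: int, expansion: int = 2):
        super().__init__()
        hidden = dim * expansion
        self.feat = nn.Sequential(
            nn.Conv2d(dim, hidden, 1, bias=False),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
        )
        self.gate = nn.Sequential(
            nn.Conv2d(dim, hidden, 1, bias=False),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.GELU(),                                   # activation producing the gating weights
        )
        self.out = nn.Conv2d(hidden, dim, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out(self.feat(x) * self.gate(x))     # element-wise gated multiplication
```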
The output of the GA module is given in Equation (19). The GA module enhances the expressive power of low-level features by aggregating multiple enhancement results, reducing redundancy while preserving critical details.
3.4. GFANet
The lightweight backbone guarantees exceptionally high baseline inference speeds for the model. The integration of the GLF (Global–Local Feature) and GA (Gated Attention) modules results in a slight decrease in inference efficiency compared to the original backbone, but significantly enhances segmentation accuracy. This design allows GFANet to maintain competitive inference speeds while achieving high-precision performance, thereby achieving an optimal balance between computational efficiency and segmentation quality.
As illustrated in Figure 6, the complete GFANet architecture is realized by strategically integrating the GLF and GA modules into appropriate positions within the backbone network. GFANet’s primary objective is to achieve segmentation accuracy comparable to advanced semantic segmentation networks while significantly reducing memory usage and execution time.
4. Experiment
Comparative experiments were conducted between GFANet and representative semantic segmentation architectures. Three benchmark datasets widely used in water body segmentation were employed to evaluate GFANet’s performance against state-of-the-art approaches.
4.1. Experiment Configuration
Experiments were conducted on a Windows 10 OS with an Intel Core i5-12400 CPU (2.5 GHz, 32 GB RAM) and an NVIDIA GeForce RTX 2080 Ti GPU (11 GB VRAM). The deep learning framework comprised PyTorch 2.2.2 (Python 3.11.9), CUDA 12.1, and cuDNN 8.8.
To ensure fair comparison across experiments, we standardized the input resolution at 512 × 512 and trained all models for 150 epochs to ensure full convergence. To accommodate the GPU memory constraints of large models, we set the batch size to 8, balancing memory limits against training stability. We employed the Adam optimizer with default settings, as its adaptive learning rate is widely adopted in segmentation tasks. Following Smith’s methodology [35], we conducted learning rate sensitivity tests on GFANet, as shown in Figure 7, ultimately selecting 1 × 10⁻³ from the optimal range as the initial rate, with a minimum value of 1 × 10⁻⁴ (0.1 × the initial rate), and applied a cosine annealing schedule throughout training.
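The training schedule described above can be set up in PyTorch roughly as follows; the model here is only a stand-in placeholder.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Conv2d(3, 2, 1)                 # placeholder for GFANet
optimizer = Adam(model.parameters(), lr=1e-3)    # default betas, initial LR from the sensitivity test
scheduler = CosineAnnealingLR(optimizer, T_max=150, eta_min=1e-4)  # anneal to 0.1x the initial rate

for epoch in range(150):
    # ... one pass over the augmented training set, with loss.backward() and optimizer.step() ...
    scheduler.step()                             # cosine annealing applied throughout training
```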
We used scaling, flipping, translation, and color transformation as data augmentation techniques to enrich the dataset, dynamically expanding it to four times its original size during training. Augmentation was applied on the fly during training only, so the validation and test sets remained unmodified.
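For illustration, an augmentation pipeline covering these operations might look like the sketch below; the parameter ranges are assumptions, and in segmentation the geometric transforms must be applied identically to the mask (e.g., via paired functional transforms).

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(512, scale=(0.5, 1.0)),        # random scaling / cropping
    transforms.RandomHorizontalFlip(),                          # flipping
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # translation
    transforms.ColorJitter(0.2, 0.2, 0.2),                      # color transformation
    transforms.ToTensor(),
])
```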
4.2. Evaluation Metrics
This study employs seven evaluation metrics: mean Intersection over Union (mIoU), mean Pixel Accuracy (mPA), Accuracy, F1-score, GigaFLOPs (GFLOPs), parameter count (Params), and frames per second (FPS). Among these, mIoU quantifies the accuracy of segmented regions, mPA measures the proportion of correctly classified pixels per class, Accuracy provides the overall pixel-wise classification correctness, and F1-score balances precision and recall to handle class imbalance. GFLOPs quantify computational complexity, reflecting GPU computational load. Params quantify model complexity by parameter count, indicating memory footprint. FPS quantifies real-time performance by measuring the video stream processing rate. The mIoU, mPA, Accuracy, and F1-score are computed from the pixel-level confusion matrix, where TP (True Positive) denotes the number of pixels correctly predicted as positive, TN (True Negative) the number of pixels correctly predicted as negative, FP (False Positive) the number of pixels incorrectly predicted as positive, FN (False Negative) the number of pixels incorrectly predicted as negative, and N the number of categories.
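A standard formulation of these metrics, consistent with the definitions above (the article’s own equation numbering is omitted here), is:

```latex
\mathrm{mIoU} = \frac{1}{N}\sum_{i=1}^{N} \frac{TP_i}{TP_i + FP_i + FN_i}, \qquad
\mathrm{mPA} = \frac{1}{N}\sum_{i=1}^{N} \frac{TP_i}{TP_i + FN_i}, \qquad
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
F1 = \frac{2\,TP}{2\,TP + FP + FN}
```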
4.3. Ablation Studies of GFANet
We conducted an ablation study on GFANet to validate the effectiveness of our design. The model was trained and evaluated on the RIWA dataset, with the quantitative results demonstrating the contributions of each module, as shown in Table 1.
The results validate the feasibility and effectiveness of our design. Compared to the backbone network, integrating the GLF module yields improvements of +3.2% in mIoU, +3.39% in mPA and Accuracy, and +4.2 points in F1-score. These enhancements stem from GLF’s dual-branch multi-scale feature extraction, which significantly strengthens the model’s representational capacity, albeit with increased parameters and computation.
Furthermore, optimizing low-level feature expression through refined skip connections also boosts performance. The standalone GA module surpasses the backbone by +2.17% in mIoU, +2.55% in mPA, +1.92% in Accuracy, and +3.8 points in F1-score, while adding only 0.7 M parameters and 1.23 GFLOPs.
Crucially, the coordinated application of GLF and GA demonstrates synergistic effects, achieving compounded performance gains that confirm the modules’ complementary design principles.
4.4. Experiments of Lightweighting Comparison
To evaluate the performance of the proposed network, we compared it with various architectures: the large-scale CNNs UNet [13], DeepLabv3+ [15] (backbone: Xception), PSPNet [14] (backbone: ResNet50), and SegNet [36], as well as the lightweight networks LEDNet [37], SeaFormer [38], and LETNet [39].
To keep the analysis clear, the comparison of model computational cost is reported separately in this section; subsequent comparisons of inference quality therefore omit these columns. The sizes of each model, measured on test images with a resolution of 512 × 512, are shown in Table 2.
According to Table 2, GFANet has 4.41 M parameters, 7.15 GFLOPs, and approximately 155 FPS, making it significantly lighter than the large-scale CNNs. Compared with the other lightweight models, GFANet has approximately 3.5 M more parameters than LEDNet and LETNet, and 0.4 M more than SeaFormer. In terms of computational complexity, however, GFANet is higher only than LEDNet and lower than SeaFormer and LETNet, indicating that the internal complexity of our network is relatively low. Meanwhile, GFANet’s real-time inference speed is very fast: under identical conditions, it is almost as fast as LEDNet, the fastest model. Although SeaFormer and LETNet are also significantly faster than the large-scale networks, they lag noticeably behind GFANet in speed.
4.5. Experiments Based on the RIWA Dataset
All datasets were partitioned into training, validation, and test sets with a fixed ratio of 7:1:2. To ensure fairness, no pre-trained weights were used for any model. Comparative results, including evaluation metrics and model parameters, are presented in Table 3. GFANet attains an mIoU of 82.59% and an mPA of 90.05%, outperforming all baselines except SegNet, which leads by only 0.27% in mIoU and 0.41% in mPA, indicating a minimal accuracy gap. GFANet’s Accuracy is 0.44% lower than SegNet’s, whereas its F1-score is 0.1 points higher. This suggests that while GFANet demonstrates superior recognition capability under class imbalance, it shows marginally reduced detection precision for medium-sized objects compared to SegNet.
The segmentation results of representative architectures are visualized in Figure 8, with misclassified pixels marked in green (false water) and blue (false background). GFANet achieves performance comparable to the top-performing SegNet while outperforming the other architectures, exhibiting superior edge delineation precision. The example image accentuates this difference: all architectures except GFANet and SegNet exhibit significant errors at the riverbank edges. SeaFormer, a lightweight model with otherwise strong quantitative results, performs poorly in this scene. While SegNet misclassifies a small bottom-right water region as background, GFANet accurately segments the entire water area.
Although GFANet exhibits marginally inferior edge discrimination performance compared to SegNet, the discrepancy is negligible. GFANet attains nearly fourfold faster inference speed than SegNet, alongside significantly reduced computational complexity and parameter count. This highlights GFANet as a balanced architecture, achieving a favorable trade-off between minimal accuracy reduction and substantial inference speed gains.
4.6. Experiments Based on the WaterSeg Dataset
The WaterSeg dataset was proposed by Liang et al. [41] for water segmentation in videos and images within Video Object Segmentation (VOS) research. Because it follows the DAVIS dataset format, it has gained widespread adoption. The dataset can be downloaded from https://www.kaggle.com/datasets/gvclsu/water-segmentation-dataset (accessed on 11 March 2025).
GFANet maintains strong performance on the WaterSeg dataset (Table 4), achieving an mIoU of 88.49% and an mPA of 93.68%, marginally lower than PSPNet (<0.5%) and SeaFormer (0.04% in mIoU and 0.23% in mPA) while significantly outperforming the other architectures. Accuracy and F1-score follow the same pattern.
The visualization results on this dataset are presented in Figure 9. Under low-light conditions, all architectures exhibit significant edge detection performance degradation.
Under these scenarios, SegNet—previously the top performer—exhibits accuracy degradation, while PSPNet achieves the highest segmentation precision. GFANet maintains competitive performance, exhibiting comparable overall error to PSPNet. GFANet retains significant advantages in inference speed and model complexity over other architectures.
4.7. Experiments Based on the USVInland Dataset
The USVInland dataset [42], developed and released by Orca-Tech researchers, focuses on inland waterway scenarios.
The quantitative results on the USVInland dataset are presented in Table 5. All architectures achieve higher accuracy on USVInland, attributed to the dataset’s simpler image characteristics compared to the other benchmarks; under these conditions, the performance differences between architectures shrink. GFANet is slightly lower than PSPNet (<0.05%) on the first three metrics and slightly higher than SeaFormer (<0.1%). In terms of F1-score, GFANet achieves 98.1, while PSPNet and SeaFormer achieve 98.2, indicating that on this dataset both retain a slight edge in capturing fine details.
We selected images with rain interference to accentuate the differences between models. The results are visualized in Figure 10. Except for GFANet, PSPNet, and SeaFormer, all models produce large segmentation errors in the left-bank area. For distant details, SeaFormer demonstrates the best performance, which is consistent with the quantitative results.
Experiments on the three datasets (RIWA, WaterSeg, and USVInland) encompassed images under diverse lighting conditions: bright, dim, and partially occluded scenarios. The results reveal that GFANet performs best in high-brightness environments and degrades somewhat in low-light conditions. Nevertheless, GFANet consistently delivers more precise water area and boundary segmentation than competing architectures across diverse environments. This highlights GFANet’s transferability and robustness, enabling effective adaptation to diverse environmental conditions in water segmentation. Collectively, these results indicate that GFANet is well suited for real-world water surface semantic segmentation applications.
5. Conclusions
This study introduces GFANet (Global–Local Feature Attention Network), a real-time water surface semantic segmentation architecture leveraging global–local feature extraction and gated attention mechanisms. Comprehensive experiments were performed on three benchmark datasets: RIWA, WaterSeg, and USVInland. Evaluation covered four dimensions: segmentation accuracy, computational complexity, model size (parameters), and inference speed. The experimental results (detailed in Table 3 and Figure 7) reveal that GFANet achieves a favorable balance between segmentation precision and computational efficiency, maintaining high accuracy with significantly reduced parameters and computational complexity compared to conventional CNN architectures.
While this study simulated three common real-world environmental scenarios (bright, low-light, and rain-interfered conditions), practical deployment tests on unmanned surface vehicles (USVs) or surveillance devices were not included.
Future research will focus on deploying GFANet on unmanned surface vehicles (USVs) to address real-world operational challenges. This entails the systematic exploration of (1) hardware–software co-design for embedded implementation, (2) computational graph optimization for resource-constrained edge devices, and (3) dynamic video stream analysis with multisensor fusion extensions. These efforts aim to (1) resolve deployment-specific bottlenecks, (2) expand application scenarios in marine environmental monitoring and intelligent navigation systems, and (3) validate GFANet’s practical utility in real-world contexts.
Author Contributions
Conceptualization, L.J.; methodology, S.X.; writing—original draft preparation, S.X.; writing—review and editing, S.X. and L.J. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The authors have used publicly available data in this manuscript. The dataset links are mentioned in this paper.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Yadav, A.; Nascetti, A.; Ban, Y. Attentive Dual Stream Siamese U-Net for Flood Detection on Multi-Temporal Sentinel-1 Data. In Proceedings of the IEEE International Symposium on Geoscience and Remote Sensing, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 5222–5225. [Google Scholar]
- Lopez-Fuentes, L.; Rossi, C.; Skinnemoen, H. River segmentation for flood monitoring. In Proceedings of the IEEE International Conference on Big Data, Boston, MA, USA, 11–14 December 2017; pp. 3746–3749. [Google Scholar]
- Inoue, H.; Katayama, T.; Song, T.; Shimamoto, T. Semantic Segmentation of River Video for Smart River Monitoring System. In Proceedings of the 2024 IEEE 13th Global Conference on Consumer Electronics (GCCE), Kitakyushu, Japan, 29 October–1 November 2024; pp. 678–681. [Google Scholar]
- Yu, J.J.; Luo, W.T.; Xu, F.H.; Wei, C.Y. River boundary recognition algorithm for intelligent float-garbage ship. In Proceedings of the 7th International Conference on Electronic Design, Xi’an, China, 15–17 August 2018; pp. 29–34. [Google Scholar]
- Marine Robotics Lab. Real-time semantic segmentation for USV obstacle avoidance. In Autonomous Navigation Systems; Springer: Berlin, Germany, 2023; pp. 1023–1035. [Google Scholar]
- Shao, M.; Liu, X.; Zhang, T.; Zhang, Q.; Sun, Y.; Yuan, H.; Xiao, C. DMTN-Net: Semantic Segmentation Architecture for Surface Unmanned Vessels. Electronics 2024, 13, 4539. [Google Scholar] [CrossRef]
- UNEP. Global Coastal Ecosystem Monitoring, 2nd ed.; United Nations Press: New York, NY, USA, 2023; pp. 77–89. [Google Scholar]
- Song, H.; Wu, H.; Huang, J.; Zhong, H.; He, M.; Su, M.; Yu, G.; Wang, M.; Zhang, J. HA-Unet: A Modified Unet Based on Hybrid Attention for Urban Water Extraction in SAR Images. Electronics 2022, 11, 3787. [Google Scholar] [CrossRef]
- IPCC Technical Annex: CNN-Based Glacial Lake Outburst Flood Early Warning. Available online: https://www.ipcc.ch/report/ar6/ (accessed on 15 March 2023).
- Zhao, J.; Yu, H.; Gu, X.; Wang, S. The edge detection of river model based on self-adaptive Canny Algorithm and connected domain segmentation. In Proceedings of the 2010 8th World Congress on Intelligent Control and Automation, Jinan, China, 7–9 July 2010; pp. 1333–1336. [Google Scholar]
- Zhan, W.; Xiao, C.; Wen, Y.; Zhou, C.; Yuan, H.; Xiu, S.; Zhang, Y.; Zou, X.; Liu, X.; Li, Q. Autonomous Visual Perception for Unmanned Surface Vehicle Navigation in an Unknown Environment. Sensors 2019, 19, 2216. [Google Scholar] [CrossRef] [PubMed]
- Han, X.; Nguyen, C.; You, S.; Lu, J. Single Image Water Hazard Detection using FCN with Reflection Attention Units. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 105–120. [Google Scholar]
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the 2015 Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
- Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
- Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 32–341. [Google Scholar]
- Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
- Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 603–612. [Google Scholar]
- Rankin, A.L.; Matthies, L.H.; Huertas, A. Daytime water detection by fusing multiple cues for autonomous off-road navigation. In Transformational Science and Technology for the Current and Future Force; World Scientific: Bukit Timah, Singapore, 2006; pp. 177–184. [Google Scholar]
- Yao, T.; Xiang, Z.; Liu, J.; Xu, D. Multi-feature fusion based outdoor water hazards detection. In Proceedings of the 2007 International Conference on Mechatronics and Automation, Harbin, China, 5–8 August 2007; pp. 652–656. [Google Scholar]
- Achar, S.; Sankaran, B.; Nuske, S.; Scherer, S.; Singh, S. Self-supervised segmentation of river scenes. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; pp. 6227–6232. [Google Scholar]
- Elias, M.; Kehl, C.; Schneider, D. Photogrammetric water level determination using smartphone technology. Photogramm. Rec. 2019, 34, 198–223. [Google Scholar] [CrossRef]
- Elias, M.; Maas, H.G. Measuring water levels by handheld smartphones—A contribution to exploit crowdsourcing in the spatio-temporal densification of water gauging networks. Int. Hydrogr. Rev. 2022, 27, 9–22. [Google Scholar] [CrossRef]
- Eltner, A.; Bressan, P.; Akiyama, T.; Gonçalves, W.; Junior, J. Using deep learning for automatic water stage measurements. Water Resour. Res. 2021, 57, e2020WR027608. [Google Scholar] [CrossRef]
- Blanch, X.; Wagner, F.; Hedel, R.; Grundmann, J.; Eltner, A. Towards automatic real-time water level estimation using surveillance cameras. In Proceedings of the EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022; p. 3225. [Google Scholar]
- Cao, H.; Tian, Y.; Liu, Y.; Wang, R. Water body extraction from high spatial resolution remote sensing images based on enhanced u-net and multi-scale information fusion. Sci. Rep. 2024, 14, 16132. [Google Scholar] [CrossRef] [PubMed]
- Miao, R.; Ren, T.; Zhou, K.; Zhang, Y. A method of water body extraction based on multiscale feature and global context information. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2024, 17, 12138–12152. [Google Scholar] [CrossRef]
- Cheng, L.; Xiong, R.; Wu, J.R.; Yan, X.M.; Yang, C.L.; Zhang, Y.J.; He, Y.Z. Fast Segmentation Algorithm of USV Accessible Area Based on Attention Fast Deeplabv3. IEEE Sens. J. 2024, 24, 24168–24177. [Google Scholar] [CrossRef]
- Kang, J.; Guan, H.Y.; Ma, L.F.; Wang, L.Y.; Xu, Z.S.; Li, J. Waterformer: A coupled transformer and cnn network for waterbody detection in optical remotely-sensed imagery. ISPRS-J. Photogramm. Remote Sens. 2023, 206, 222–241. [Google Scholar] [CrossRef]
- Zhang, F.; Sun, X.; Ma, F.; Yin, Q. Superpixelwise likelihood ratio test statistic for polsar data and its application to built-up area extraction. ISPRS-J. Photogramm. Remote Sens. 2024, 209, 233–248. [Google Scholar] [CrossRef]
- Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
- Han, K.; Wang, Y.H.; Tian, Q.; Guo, J.Y.; Xu, C.J.; Xu, C. GhostNet: More Features From Cheap Operations. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1577–1586. [Google Scholar]
- Wan, A.; Dai, X.L.; Zhang, P.Z.; He, Z.J.; Tian, Y.D.; Xie, S.N. FBNetV2: Differentiable Neural Architecture Search for Spatial and Channel Dimensions. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 12962–12971. [Google Scholar]
- Dauphin, Y.N.; Fan, A.; Auli, M.; Grangier, D. Language Modeling with Gated Convolutional Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 933–941. [Google Scholar]
- Smith, L.N. Cyclical Learning Rates for Training Neural Networks. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 24–31 March 2017; pp. 464–472. [Google Scholar]
- Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
- Wang, Y.; Zhou, Q.; Liu, J.; Xiong, J.; Gao, G.W.; Wu, X.F.; Latecki, L.J. Lednet: A Lightweight Encoder-Decoder Network for Real-Time Semantic Segmentation. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 1860–1864. [Google Scholar]
- Wan, Q.; Huang, Z.; Lu, J.; Yu, G.; Zhang, L. Seaformer: Squeeze-enhanced axial transformer for mobile semantic segmentation. In Proceedings of the Eleventh International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Xu, G.A.; Li, J.C.; Gao, G.W.; Lu, H.M.; Yang, J.; Yue, D. Lightweight Real-Time Semantic Segmentation Network with Efficient Transformer and CNN. IEEE Trans. Intell. Transp. Syst. 2023, 23, 15897–15906. [Google Scholar] [CrossRef]
- Wagner, F.; Eltner, A.; Maas, H.; Yin, Q. River water segmentation in surveillance camera images: A comparative study of offline and online augmentation using 32 CNNs. Int. J. Appl. Earth Obs. Geoinf. 2023, 119, 103305. [Google Scholar] [CrossRef]
- Liang, Y.Q.; Jafari, N.; Luo, X.; Chen, Q.; Cao, Y.P.; Li, X. WaterNet: An adaptive matching pipeline for segmenting water with volatile appearance. Comput. Vis. Media 2020, 6, 65–78. [Google Scholar] [CrossRef]
- Cheng, Y.W.; Jiang, M.X.; Zhu, J.N.; Liu, Y.M. Are We Ready for Unmanned Surface Vehicles in Inland Waterways? The USVInland Multisensor Dataset and Benchmark. IEEE Robot. Autom. Lett. 2021, 6, 3964–3970. [Google Scholar] [CrossRef]