Article

A Feasible Domain Segmentation Algorithm for Unmanned Vessels Based on Coordinate-Aware Multi-Scale Features

Key Laboratory of Beibu Gulf Offshore Engineering Equipment and Technology, Beibu Gulf University, Qinzhou 535011, China
*
Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2025, 13(8), 1387; https://doi.org/10.3390/jmse13081387
Submission received: 19 June 2025 / Revised: 16 July 2025 / Accepted: 17 July 2025 / Published: 22 July 2025
(This article belongs to the Section Ocean Engineering)

Abstract

The accurate extraction of navigable regions from images of navigational waters plays a key role in ensuring on-water safety and the automation of unmanned vessels. Nonetheless, current methods encounter significant challenges in addressing fluctuations in water surface illumination, reflective disturbances, and surface undulations, among other disruptions, making it difficult to achieve rapid and precise boundary segmentation. To cope with these challenges, this paper proposes a coordinate-aware multi-scale feature network (GASF-ResNet) for water segmentation. The method integrates the Global Grouped Coordinate Attention (GGCA) module into the four downsampling branches of ResNet-50, enhancing the model’s ability to capture target features and improving feature representation. To expand the model’s receptive field and boost its capability to extract features of multi-scale targets, Atrous Spatial Pyramid Pooling (ASPP) is used. Combined with multi-scale feature fusion, this effectively enhances the expression of semantic information at different scales and improves the segmentation accuracy of the model in complex water environments. The experimental results show that the mean pixel accuracy (mPA) and mean intersection over union (mIoU) of the proposed method are 99.31% and 98.61% on the self-made dataset and 98.55% and 99.27% on the USVInland unmanned ship dataset, respectively, significantly better than those obtained by existing mainstream models. These results help overcome the background interference caused by water surface reflection and uneven lighting in aquatic environments and enable the accurate segmentation of water areas for the safe navigation of unmanned vessels, which is of great value for their stable operation in complex environments.

1. Introduction

As autopilot technology continues to advance, unmanned surface vehicles (USVs) are becoming increasingly vital for tasks such as floating waste removal [1] and on-water search and rescue operations [2]. Accurate feasible-domain recognition is crucial for ensuring the autonomous navigation of USVs. Unmanned vessels can carry visual sensors to segment safe navigable areas [3], and the accurate segmentation of navigable water boundaries is an important prerequisite for autonomously accomplishing tasks with unmanned vessels. However, the image segmentation of water scenes faces multiple challenges: on the one hand, lighting conditions vary significantly; on the other hand, the reflections of buildings, aquatic vegetation, and other objects on the water surface closely resemble the objects themselves. Together, these factors make it more difficult to accurately recognize water boundaries. Therefore, the accurate identification of water-land boundaries is crucial to enhancing the autonomous navigation capability of unmanned vessels [4,5].
Currently, traditional image segmentation algorithms are still widely applied in water segmentation tasks. An et al. [6] used one-dimensional Otsu thresholding to coarsely segment a Synthetic Aperture Radar (SAR) image and obtain the initial water region, and then used a Chan-Vese (CV) model to refine the segmentation. Compared with SAR images, water features in visible-light images are considerably more complex, which significantly increases the segmentation difficulty. Yu [7] realized water segmentation on grayscale images by first enhancing the original image with a Laplacian pyramid and then applying Otsu threshold segmentation to the enhanced image. However, this method still suffers from insufficient segmentation accuracy and relies on manual intervention to remove pseudo-target regions: the initial segmentation of the water region is poor, and only the manually corrected results match the actual water region, which is undesirable for the feasible-domain identification of unmanned boats. Qiu et al. [8] proposed determining the inland river shoreline position with a phase correlation algorithm, enhancing the precision of phase alignment through constrained feature extraction and determining the shoreline inclination angle using an affine method. Since most water image backgrounds are uniform, Li et al. [9] exploited this property and proposed an adaptive threshold segmentation method incorporating a one-dimensional uniformity metric and the Otsu algorithm. In view of the characteristics of water images, Yu et al. [10] designed a method based on Multi-Block Local Binary Patterns and the Hue-Intensity-Saturation color space with hue variance features to detect water regions in images. Santana et al. [11] proposed detecting water from the dynamic texture of water in video sequences by computing the entropy of the optical flow over multiple frames as a water feature, using the labeling results as a guide for manual annotation, and segmenting each video frame with the K-means clustering algorithm. However, the water boundaries obtained by these methods are not ideal, and their effectiveness degrades in the presence of water surface reflections.
As deep learning steadily advances [12,13], deep learning-driven semantic segmentation methods have seen widespread use in image segmentation. For example, Yu et al. applied convolutional neural networks (CNNs) to detect water bodies in remote sensing images [14]; however, the shallow CNN architecture limits detection performance. Fully convolutional networks (FCNs) replace fully connected layers with convolutional layers, enabling pixel-level image segmentation [15]. Isikdogan et al. proposed DeepWaterMap, a model based on the FCN framework that optimizes the connection scheme and reduces the number of parameters, making it more suitable for water body segmentation [16]. Although FCNs achieve feature extraction through successive downsampling, this can lead to a loss of information about small targets. For this reason, the U-Net [17] and SegNet [18] networks introduce a “U”-shaped encoder–decoder architecture, which fuses shallow details with high-level semantic features through skip connections. Feng et al. combined U-Net with a superpixel-based conditional random field (CRF) model to achieve better segmentation of small water areas [19]. However, this connection scheme adopts an equal-weight channel processing mechanism, which makes it difficult to adaptively highlight key feature regions. Xia et al. proposed DAU-Net, a densely connected network with skip connections, which reduces semantic differences through multi-scale feature fusion and thus enhances edge segmentation [20]. However, directly connecting low-level and high-level feature maps may confuse the feature representation, thereby reducing accuracy [21]. For water images with complex backgrounds and noise, the equal weighting of feature maps makes it difficult for the model to prioritize key areas. To address this issue, Woo et al. proposed the Convolutional Block Attention Module (CBAM), which sequentially applies channel and spatial attention to reweight feature maps [22]. Jonnala et al. proposed a multi-scale residual and attention-enhanced U-Net model (AER U-Net), which combines residual blocks, an attention mechanism, and dropout layers to improve the accuracy and generalization ability of water segmentation; it excels at large-scale water bodies but is not accurate enough when segmenting tiny targets [23]. To address the segmentation of tiny targets, Jonnala et al. proposed the DSIA U-Net model, which combines deep-shallow interaction mechanisms and an attention module to improve segmentation accuracy; however, in complex environments, the model’s performance falls short of expectations [24]. The Global Grouped Coordinate Attention (GGCA) [25] couples channel and spatial information more efficiently through simultaneous channel grouping and coordinate decomposition. In the spatial dimensions, the channels are grouped to capture overall information across both height and width, thereby strengthening feature representation. The weights associated with the feature maps can be dynamically fine-tuned to guide the model’s attention toward important areas, enhancing both the model’s effectiveness and the precision of attention distribution. Similarly, Li et al. 
proposed a boundary attention module (BA module) and combined it with an adaptive weight multi-task learning (AWML) model to capture useful boundary information [26]. However, the small convolutional kernels in a CNN can only attend to a small region of the image at each convolution and are less capable of recognizing objects of different sizes. CNN pooling operations and large convolution strides significantly reduce feature resolution, affecting small-scale feature extraction [27]. The Segment Anything Model (SAM) proposed by Kirillov et al. has general segmentation capabilities, but its performance on underwater images is limited by problems such as lighting and background interference [28]. To this end, Hong et al. proposed WaterSAM, which introduces LoRA and fine-tunes only the image encoder to reduce computational and annotation costs, performing better on small objects and fuzzy boundaries [29]. To improve the operating efficiency of SAM, Zhang et al. proposed EfficientViT-SAM, which uses an efficient encoder and knowledge distillation to significantly accelerate inference while maintaining segmentation accuracy [30]. Fu et al. proposed the Lite-SAM model, which uses a lightweight encoder, LiteViT, and an automatic prompt module, AutoPPN, to achieve low-parameter, high-speed, real-time full-image segmentation while significantly reducing resource consumption [31]. ASPP allows the model to obtain a larger receptive field by combining dilated convolution kernels with multiple dilation rates, retrieving contextual information for features of different sizes [32,33,34,35]. Another approach uses a feature pyramid network (FPN) to integrate multi-scale features through a top-down feature fusion mechanism; however, this structure suffers from the attenuation of detailed information during feature fusion, making it difficult to accurately recognize tiny targets or complex boundary features [36].
To summarize, traditional water segmentation methods are suitable for cases with limited data volume and computational resources, but they are easily affected by environmental factors such as lighting conditions, water reflections, and floating objects. For example, at the intersection of clear water and muddy water, the distinct color change poses a great challenge to traditional segmentation methods. Therefore, traditional segmentation methods do not meet the requirements of feasible-domain recognition for unmanned boats; instead, segmentation algorithms based on deep learning will be the main approach to feasible-domain recognition.
In this paper, a coordinate-aware multi-scale feature model is presented. The model employs ResNet-50 [37] for feature extraction and leverages its residual design to effectively address the vanishing-gradient problem caused by increasing network depth, allowing the model to capture multi-scale features in water-scene images more effectively. In addition, the GGCA attention mechanism is introduced into the model; GGCA can effectively capture illumination changes at different positions on the water surface through pooling operations in the H and W directions. Among them, the pooling operation in the H direction is the core component for suppressing reflection interference, distinguishing real objects from their reflections by learning positional relationships in the vertical direction. At the same time, GGCA adopts channel grouping so that the model can learn feature representations of the water surface under different lighting conditions. Furthermore, through attention weighting, the model can dynamically adjust its attention to features in different regions and thus focus on target features. In this paper, the receptive field of the model is extended by introducing the atrous spatial pyramid pooling module to enhance the contextual association of multi-scale features. Combined with the feature pyramid fusion mechanism, the multi-scale semantic representation ability is effectively improved, and the segmentation difficulties caused by water surface reflection, uneven illumination, and water surface ripples in the complex environment of inland rivers are effectively overcome. The proposed model significantly improves the segmentation accuracy of unmanned ships in complex aquatic environments, providing reliable technical support for unmanned ship intelligent systems.

2. General Model Architecture

The GASF-ResNet model proposed in this paper consists of two core components: an attention-enhanced encoder based on ResNet-50, which uses spatial global information to generate attention weights through the GGCA mechanism and achieves a dynamic balance between global context and local features, and a multi-scale fusion decoder with an ASPP module for cross-layer feature optimization. The architecture significantly improves the feature characterization capability for water-scene images. The ASPP module is configured with dilated convolution layers of different dilation rates: convolutions with small dilation rates focus on capturing local fine details, while convolutions with large dilation rates acquire global contextual semantic information. By concatenating the outputs of dilated convolutional layers with different dilation rates and then fusing the features through 1 × 1 convolutional layers, a feature map rich in both detail and contextual information is generated. The model realizes multi-scale feature fusion by means of a feature pyramid network (FPN); finally, it adjusts the channel dimensions using a 3 × 3 convolutional layer to output the final image segmentation results.
Given the many challenges posed by the complex and variable nature of inland water environments, the GASF-ResNet model plays an important role in handling changes in water surface illumination and reflections of buildings and vegetation on the shore. For changes in light on the water surface, GGCA is able to capture the pattern of illumination changes at different positions on the water surface through pooling operations in the H and W directions. In addition, GGCA processes the channels in groups so that the model can learn feature representations of the water surface under different lighting conditions. Through the sigmoid-activated attention weights, the model can dynamically adjust its attention to features in different regions, thereby enhancing spatial perception. ASPP also plays an important role here: it captures illumination changes from local to global scales through dilated convolutions with different dilation rates, and its large receptive field helps the model understand the overall light distribution on the water surface. Multi-scale feature fusion is also key: the low-level features retain surface texture and lighting detail, while the high-level features provide a semantic understanding of the water, and fusing features at different scales gives the model robustness to illumination variations. For the reflection problem, GGCA can perceive the positional relationship between a real object and its reflection on the water surface and then strengthen the water features through the attention mechanism to suppress reflection interference. The multi-scale characteristics of ASPP can adapt to the scale changes of reflections caused by perspective and water surface fluctuations, allowing the model to better delineate the waterfront boundary and distinguish the real shore boundary from the reflection boundary. At the same time, the low-level features capture the texture details of reflections, while the higher-level features infer their semantic meaning; by fusing features at different levels, the model is able to better understand the relationship between a reflection and the real object. Therefore, the GASF-ResNet model can still accurately identify water areas under complex lighting and water surface reflection conditions, segment the waterfront boundary more accurately, and reduce the probability of misclassifying reflections as non-water areas. The GASF-ResNet model addresses the core problems that disturb water segmentation in the inland river environment, such as water surface illumination changes, reflections, and water surface ripples, from different perspectives, forming a complementary and collaborative network model. The model structure of GASF-ResNet is shown in Figure 1.

2.1. Global Grouped Coordinate Attention Enhanced Encoder

When processing water images, the intricate background often makes the segmentation task susceptible to disturbances from factors such as lighting variations and reflections from the shoreline. To strengthen the network’s selective focus on target feature regions and effectively suppress unimportant background features, this paper presents an attention-enhanced encoder. This encoder introduces the GGCA attention module to enhance the target features at different scales extracted by the backbone network, as shown in Figure 1. GGCA consists of a spatial focus mechanism and a channel focus mechanism. The channel focus mechanism processes the input features by grouping them along the channel dimension, while the spatial focus mechanism enhances the model’s ability to focus on the target area along both the height and width dimensions. Traditional water segmentation techniques often struggle to efficiently capture the overall context of an image across both spatial axes (height and width), especially in complex visual scenarios, which hampers their feature representation. To overcome this limitation, this study adopts the GGCA module, which addresses the problem through three key design choices: first, directional pooling is employed to conduct global average pooling and global maximum pooling along the height and width axes, enabling the capture of bi-directional long-range features; second, a shared-convolution attention mechanism is designed to dynamically generate spatial weight maps for the adaptive enhancement of features; third, a channel grouping strategy is utilized to maintain the diversity of the feature representations.
The GGCA module introduced in this paper aims to cope with limitations in complex visual tasks. Traditional models struggle to effectively obtain global information across the spatial dimensions of height and width, which affects their feature representation. To improve this, a GGCA module integrating multidimensional global context and attention mechanisms is developed to significantly enhance the network’s feature learning capability. The module adopts a two-way structure: on one hand, global spatial information is efficiently extracted via global average pooling and maximum pooling operations conducted along the vertical and horizontal axes; on the other hand, by integrating shared convolution with an attention mechanism, a spatial attention map is generated to dynamically adjust feature weights, thereby highlighting key features and reducing noise interference. To ensure the diversity and richness of features, a grouping strategy is adopted to divide the input feature maps along the channel dimension. Compared with a single attention mechanism, the advantages of the GGCA module lie in multi-dimensional global information fusion, a dynamic feature enhancement mechanism, and grouped feature processing, which respectively capture spatial features comprehensively through bi-directional pooling, realize the accurate localization and enhancement of key features, and ensure the richness and distinctiveness of the feature representation. In complex visual tasks, traditional models often find it difficult to cope with changes in water surface light, reflections of buildings and vegetation on the shore, and interference from other obstacles in the complex environment of inland rivers, which significantly affects the accuracy of water area identification and segmentation. Water surface reflections often produce strong horizontal patterns (mirrored objects), and attention in the H direction can suppress the interference of these reflections; in addition, grouped processing allows different groups of channels to handle different types of reflections separately. GGCA can adapt spatially to different lighting conditions: the directional patterns produced by different lighting angles can be captured by the attention mechanisms in the H and W directions, and the channel grouping can focus on extracting illumination-invariant features. Experimental results show that the module significantly improves the feature expression ability and multi-scale information learning of the model in complex inland river scenes, effectively overcoming the influence of surface reflections and lighting changes on water segmentation and thus improving the segmentation results. The implementation process of the attention mechanism is shown in Figure 2.
First, for the input feature map,
$X \in \mathbb{R}^{B \times C \times H \times W}$
This feature map is partitioned into G groups based on the number of channels, with each group comprising C/G channels. In this context, B denotes the batch size, C represents the number of channels, and H and W correspond to the height and width of the feature map, respectively. The grouped feature map can be expressed as
$X \in \mathbb{R}^{B \times G \times \frac{C}{G} \times H \times W}$
Global average pooling and global maximum pooling are applied to the grouped feature maps along the height and width dimensions, respectively.
$X_{h,\mathrm{avg}} = \mathrm{AvgPool}(X) \in \mathbb{R}^{B \times G \times \frac{C}{G} \times H \times 1}$
$X_{h,\mathrm{max}} = \mathrm{MaxPool}(X) \in \mathbb{R}^{B \times G \times \frac{C}{G} \times H \times 1}$
$X_{w,\mathrm{avg}} = \mathrm{AvgPool}(X) \in \mathbb{R}^{B \times G \times \frac{C}{G} \times 1 \times W}$
$X_{w,\mathrm{max}} = \mathrm{MaxPool}(X) \in \mathbb{R}^{B \times G \times \frac{C}{G} \times 1 \times W}$
For each grouped feature map, a shared convolutional layer is applied to process the features. This shared convolutional layer comprises two 1 × 1 convolutional layers, a batch normalization layer, and a ReLU activation function, which are used to reduce and subsequently recover the channel dimensions.
$Y_{h,\mathrm{avg}} = \mathrm{Conv}(X_{h,\mathrm{avg}}), \quad Y_{h,\mathrm{max}} = \mathrm{Conv}(X_{h,\mathrm{max}}), \quad Y_{w,\mathrm{avg}} = \mathrm{Conv}(X_{w,\mathrm{avg}}), \quad Y_{w,\mathrm{max}} = \mathrm{Conv}(X_{w,\mathrm{max}})$
Attention weights in the height and width dimensions are produced by aggregating the outputs of the convolutional layers and then applying a Sigmoid activation function.
$A_h = \sigma(Y_{h,\mathrm{avg}} + Y_{h,\mathrm{max}}) \in \mathbb{R}^{B \times G \times \frac{C}{G} \times H \times 1}, \quad A_w = \sigma(Y_{w,\mathrm{avg}} + Y_{w,\mathrm{max}}) \in \mathbb{R}^{B \times G \times \frac{C}{G} \times 1 \times W}$
The input feature maps are modulated by the attention weights to generate the output feature maps.
$O = X \times A_h \times A_w \in \mathbb{R}^{B \times C \times H \times W}$
Here, the attention weights $A_h$ and $A_w$ are broadcast along the remaining spatial dimension so that they match the dimensions of the input feature map.
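To make the grouping and directional pooling above concrete, the following PyTorch sketch implements the computation described by these equations. The number of groups, the channel reduction ratio inside the shared convolution, and the use of mean/amax as the pooling operators are illustrative assumptions rather than details confirmed by the paper.

```python
import torch
import torch.nn as nn

class GGCA(nn.Module):
    """Sketch of Global Grouped Coordinate Attention (group count and reduction are assumptions)."""
    def __init__(self, channels, groups=8, reduction=16):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        c_per_group = channels // groups
        hidden = max(c_per_group // reduction, 4)
        # Shared 1x1 conv stack: reduce then restore the per-group channel dimension.
        self.shared_conv = nn.Sequential(
            nn.Conv2d(c_per_group, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, c_per_group, kernel_size=1, bias=False),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        b, c, h, w = x.shape
        g = self.groups
        xg = x.view(b * g, c // g, h, w)          # group along the channel dimension
        # Height-direction descriptors: pool over W, keep H.
        x_h_avg = xg.mean(dim=3, keepdim=True)    # (B*G, C/G, H, 1)
        x_h_max = xg.amax(dim=3, keepdim=True)
        # Width-direction descriptors: pool over H, keep W.
        x_w_avg = xg.mean(dim=2, keepdim=True)    # (B*G, C/G, 1, W)
        x_w_max = xg.amax(dim=2, keepdim=True)
        # Shared convolution, fusion of avg/max branches, sigmoid activation.
        a_h = self.sigmoid(self.shared_conv(x_h_avg) + self.shared_conv(x_h_max))
        a_w = self.sigmoid(self.shared_conv(x_w_avg) + self.shared_conv(x_w_max))
        out = xg * a_h * a_w                      # broadcast over W and H, respectively
        return out.view(b, c, h, w)
```

Because $A_h$ has a width of 1 and $A_w$ a height of 1, the element-wise multiplication broadcasts them across the missing axis, which realizes the re-weighting in the last equation without explicitly expanding the tensors.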

2.2. Multi-Scale Feature Fusion Decoder

In images of water bodies, the backgrounds commonly feature trees, structures, bridges, and various other objects, and the separation between these elements and the water is frequently not distinctly defined. If segmentation relies only on the high-level semantic features extracted by the model, the complex background is prone to interfere with the model’s judgment, which in turn causes detail regions to be segmented incorrectly or missed. In this paper, we design a multi-scale feature fusion decoder, implemented with ASPP and the FPN working in conjunction. The ASPP module enlarges the receptive field via dilated convolution, extracting multi-level contextual information that covers both local details and global context. Subsequently, multi-scale feature fusion is performed with the FPN to generate richer feature maps. In coping with illumination changes on the water surface, ASPP can capture lighting patterns at different scales, while multi-scale feature fusion improves the robustness of the model to lighting changes by combining texture and semantic information. When dealing with the reflection problem, ASPP can handle the scale changes and complex shapes of reflections, and multi-scale feature fusion identifies reflection regions more accurately by combining detail and semantic information. Figure 3 illustrates this multi-scale fusion decoder. The water-scene image passes through the four stages of the encoder to produce the feature maps $F = \{F_1, F_2, F_3, F_4\}$ as the input to the decoder, where each residual block stage performs downsampling and convolution on the input image, so that the resolution of the output feature map $F_i$ becomes {1/4, 1/8, 1/16, 1/32} of the original image and the number of output channels becomes {256, 512, 1024, 2048}. Before multi-scale fusion, the feature maps $F_i$ are processed by the ASPP module to capture contextual information at different scales. The ASPP module is shown in the dashed box at the left of Figure 3. The module contains five branches: one 1 × 1 convolution; three 3 × 3 dilated convolutions with dilation rates of 6, 12, and 18, employed for multi-scale feature extraction; and one global average pooling branch. In this structure, the 1 × 1 convolution layer adjusts the channel count of the input features, the three dilated convolutions with varying dilation rates capture multi-scale features, and the global average pooling branch is passed through a 1 × 1 convolutional layer to match the number of channels. The pooled features are then expanded back to the original spatial dimensions by bilinear interpolation, and the features extracted from the five branches are finally concatenated along the channel dimension and fused by a 3 × 3 convolution to recover the number of channels. Moreover, each feature map $F_i$ passes through a 1 × 1 convolutional layer to set the number of channels to 512, so that the feature maps $P = \{P_1, P_2, P_3, P_4\}$ can be fused. Then, starting from the highest-level feature $P_4$, the higher-level features are upsampled by a factor of two and fused with the lower-level features via pixel-wise addition, yielding the fused features $C = \{C_1, C_2, C_3\}$.
Finally, the results of each level of the FPN are upsampled by factors of {1, 2, 4, 8} to reach one-fourth of the original image’s size. After a 3 × 3 convolution is applied to each output feature, they are combined in the channel dimensions to obtain the fused feature map.
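The following PyTorch sketch mirrors the decoder just described: ASPP with dilation rates 6, 12, and 18 plus an image-level pooling branch, 1 × 1 lateral convolutions to 512 channels, top-down pixel-wise addition, and upsampling of every level to 1/4 scale before channel concatenation. The internal ASPP width, the placement of the per-level 3 × 3 convolutions, and the two-class segmentation head are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """ASPP sketch: 1x1 conv, three dilated 3x3 convs (rates 6/12/18), image pooling, 3x3 fusion."""
    def __init__(self, in_ch, out_ch, mid_ch=256):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, mid_ch, 1),
            nn.Conv2d(in_ch, mid_ch, 3, padding=6, dilation=6),
            nn.Conv2d(in_ch, mid_ch, 3, padding=12, dilation=12),
            nn.Conv2d(in_ch, mid_ch, 3, padding=18, dilation=18),
        ])
        self.image_pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                        nn.Conv2d(in_ch, mid_ch, 1))
        self.fuse = nn.Conv2d(5 * mid_ch, out_ch, 3, padding=1)

    def forward(self, x):
        size = x.shape[2:]
        feats = [b(x) for b in self.branches]
        feats.append(F.interpolate(self.image_pool(x), size=size,
                                   mode="bilinear", align_corners=False))
        return self.fuse(torch.cat(feats, dim=1))


class MultiScaleFusionDecoder(nn.Module):
    """ASPP + FPN fusion sketch for encoder features F1..F4 at 1/4, 1/8, 1/16, 1/32 scale."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), mid_ch=512, num_classes=2):
        super().__init__()
        self.aspp = nn.ModuleList([ASPP(c, c) for c in in_channels])
        self.lateral = nn.ModuleList([nn.Conv2d(c, mid_ch, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(mid_ch, mid_ch, 3, padding=1)
                                     for _ in in_channels])
        self.head = nn.Conv2d(mid_ch * len(in_channels), num_classes, 3, padding=1)

    def forward(self, feats):
        # ASPP on each stage, then 1x1 lateral convs to a common width of 512.
        p = [lat(aspp(f)) for f, aspp, lat in zip(feats, self.aspp, self.lateral)]
        # Top-down pathway: upsample the higher level and add it to the level below.
        for i in range(len(p) - 2, -1, -1):
            p[i] = p[i] + F.interpolate(p[i + 1], size=p[i].shape[2:],
                                        mode="bilinear", align_corners=False)
        # Smooth each level, bring it to 1/4 of the input resolution, and concatenate.
        target = feats[0].shape[2:]
        outs = [F.interpolate(s(pi), size=target, mode="bilinear", align_corners=False)
                for s, pi in zip(self.smooth, p)]
        return self.head(torch.cat(outs, dim=1))
```

With ResNet-50 stage outputs of 256/512/1024/2048 channels, this sketch returns a two-channel logit map at 1/4 of the input resolution, which would then be upsampled to full resolution for the final prediction.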

2.3. Model Loss Function

This study uses the standard cross-entropy loss function in the water segmentation task to measure the difference between the predicted probability distribution and the true label distribution. For the “water-land” binary classification segmentation task, the loss function for a single pixel is defined as:
$L = -y \log(\hat{p}) - (1 - y) \log(1 - \hat{p})$
where $y \in \{0, 1\}$ denotes the true label (1 for water, 0 for land) and $\hat{p} \in [0, 1]$ denotes the probability that the model predicts the pixel to be water.
When $y = 1$ (water pixels), the loss reduces to $-\log(\hat{p})$: if $\hat{p}$ is close to 1 (correct prediction), the loss approaches 0; if $\hat{p}$ is too small (the pixel is misjudged as land), the loss increases sharply, forcing the model to strengthen its ability to distinguish water features.
When $y = 0$ (land pixels), the loss reduces to $-\log(1 - \hat{p})$, which likewise suppresses the predicted probability of land pixels being misclassified as water.
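A minimal sketch of this per-pixel binary cross-entropy, assuming the network outputs a single-channel water logit map (all variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def water_land_loss(logits, labels):
    """Per-pixel binary cross-entropy for the water/land task.
    logits: (B, 1, H, W) raw water scores; labels: (B, 1, H, W) with 1 = water, 0 = land."""
    p_hat = torch.sigmoid(logits)      # predicted water probability
    eps = 1e-7                         # numerical guard against log(0)
    loss = -(labels * torch.log(p_hat + eps) + (1 - labels) * torch.log(1 - p_hat + eps))
    return loss.mean()

# The same quantity is available in a numerically safer built-in form as:
#   F.binary_cross_entropy_with_logits(logits, labels)

logits = torch.randn(1, 1, 4, 4)                     # dummy prediction map
labels = torch.randint(0, 2, (1, 1, 4, 4)).float()   # dummy ground truth
print(water_land_loss(logits, labels))
```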

3. Experimental Data and Processing

3.1. Overview of the Experimental Dataset

Datasets play a crucial role in deep learning tasks, determining the training quality, generalization ability, and accuracy of models. The swift advancement in autonomous driving relies heavily on the availability of extensive datasets for research, training, and learning. For the visual perception task of unmanned ships, datasets are equally indispensable. Although some datasets for unmanned ship scene segmentation and recognition have appeared in recent years, the data size and application scenarios of existing datasets are still very limited. As a result, they cannot provide enough learning and training samples for research into related algorithms. To address this problem, this section constructs a water surface scene segmentation dataset and a water surface scene matching dataset through independent data collection based on the requirements of the actual application environment of unmanned boats. These datasets can provide richer and more diverse data resources for the field of unmanned boat visual perception, thus promoting the progress of unmanned boat technology. In addition, there are some datasets obtained through aerial or underwater photography for the visual perception research of unmanned aerial vehicles and unmanned underwater vehicles. However, there is still a lack of research on scene-matching datasets for unmanned ships, for the following reasons. On the one hand, unmanned ships are not as widespread as unmanned vehicles and drones; most of the industry demand comes from maritime, water conservancy, and river management units, and, for safety reasons, the relevant management and approval processes are extremely strict, making it difficult to carry out experiments. On the other hand, even when safety conditions allow, there is great uncertainty in assembling the hull and transporting it to the water for pre-experiments. In particular, sea routes are wider and more complex than the routes traveled by ground vehicles, so it is difficult to obtain repeated data at the same location. Currently, countries around the world are actively promoting smart water transportation programs. Scene matching, as an important basis for unmanned ship environment perception, can not only greatly reduce operating costs but can also significantly improve the safety of shipping, especially in tasks such as entering and leaving ports, fixed-route cruising, and patrolling.
In this study, two datasets were used: a self-collected dataset of different river channel scenes on the campus of Beibu Gulf University, and USVInland, the first real-world inland-waterway unmanned surface vehicle dataset, jointly released by Orca Tech, Tsinghua University, and Northwestern Polytechnical University. The self-collected dataset contains a total of 800 images, with 640, 80, and 80 images in the training, test, and validation sets, respectively. The collected raw data were preprocessed through frame extraction and deduplication, the feasible domain was annotated, the data were augmented according to the dataset and network training situation, and the feasible-domain identification dataset was constructed. The USVInland dataset is the first dataset of unmanned inland vessels collected under multi-sensor and multi-weather conditions in real-world scenarios (https://orca-tech.cn/datasets/USVInland/Waterline, accessed on 7 July 2025). The dataset contains data from different river scenarios, and the published data cover three tasks: SLAM, stereo matching, and waterline segmentation. The USVInland waterline segmentation subset contains a total of 700 images; in this study, 518 and 182 images were selected as the training and test sets, respectively, and 91 images were randomly selected from the test set for validation. Owing to limited computing power, the image resolution of both the self-collected and USVInland datasets is too large for direct input to the model, and the number of images in each dataset is small. To increase the diversity of the data and improve the generalization ability of the model to unknown environments, the data are preprocessed during model training: before being input to the model, each image is cropped to half of its original size, and the cropped images are randomly rotated, randomly flipped, Gaussian blurred, and subjected to other image enhancement techniques, expanding the number of images to three times the original and thereby meeting the data requirements of model training.
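The sketch below illustrates one way to apply the augmentations listed above (half-size crop, random flip, random rotation, Gaussian blur) identically to an image and its label mask using torchvision; the crop placement, rotation range, blur kernel, and probabilities are assumptions, and each training image would be passed through it repeatedly to reach the threefold expansion.

```python
import random
import torchvision.transforms.functional as TF
from PIL import Image

def augment_pair(image, mask):
    """Apply the same geometric augmentation to image and mask; blur only the image.
    Crop placement, rotation range, and blur kernel are illustrative assumptions."""
    w, h = image.size
    top, left = random.randint(0, h // 2), random.randint(0, w // 2)
    image = TF.crop(image, top, left, h // 2, w // 2)   # crop to half the original size
    mask = TF.crop(mask, top, left, h // 2, w // 2)
    if random.random() < 0.5:                           # random horizontal flip
        image, mask = TF.hflip(image), TF.hflip(mask)
    angle = random.uniform(-10, 10)                     # small random rotation
    image, mask = TF.rotate(image, angle), TF.rotate(mask, angle)
    if random.random() < 0.5:                           # photometric blur, image only
        image = TF.gaussian_blur(image, kernel_size=5)
    return image, mask

# Dummy example: a grey image and an all-water mask.
img = Image.new("RGB", (640, 480), (128, 128, 128))
msk = Image.new("L", (640, 480), 1)
aug_img, aug_msk = augment_pair(img, msk)
print(aug_img.size, aug_msk.size)
```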
In order to enhance the statistical robustness of the experimental results, this study used k-fold cross-validation to evaluate the performance of the self-made dataset. The number of cross validations was K = 5. The dataset was divided into five non-overlapping subsets through stratified sampling. In each experiment, four subsets were used as training sets and one subset was used as a validation set. Five experiments were performed in rotation. Finally, the average values of mPA and mIoU of the five results were taken as the model performance indicators.
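A sketch of this 5-fold protocol and the reported mean/confidence-interval computation is given below; `train_and_evaluate` is a hypothetical placeholder for the full MMSegmentation training and validation run (here it returns dummy scores so the script executes), and plain rather than stratified fold splitting is used for brevity.

```python
import numpy as np
from sklearn.model_selection import KFold

def train_and_evaluate(train_idx, val_idx):
    # Hypothetical placeholder for a full training/validation run; returns (mPA, mIoU).
    return 99.1, 98.4

image_ids = np.arange(800)                 # indices of the 800-image self-made dataset
kf = KFold(n_splits=5, shuffle=True, random_state=0)
mpa_scores, miou_scores = [], []
for train_idx, val_idx in kf.split(image_ids):
    pa, iou = train_and_evaluate(train_idx, val_idx)
    mpa_scores.append(pa)
    miou_scores.append(iou)

for name, scores in (("mPA", mpa_scores), ("mIoU", miou_scores)):
    mean = np.mean(scores)
    half = 1.96 * np.std(scores, ddof=1) / np.sqrt(len(scores))   # 95% CI half-width
    print(f"{name}: {mean:.2f} [{mean - half:.2f}, {mean + half:.2f}]")
```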

3.2. Experimental Environment and Parameter Settings

In this paper, the proposed model is implemented in the MMSegmentation framework on the Ubuntu 20.04.3 operating system, and the experimental runtime environment includes PyTorch 1.11.0, mmcv-full 1.7.1, and Python 3.8. The experimental hardware configuration includes an Intel(R) Xeon(R) Platinum 8255C CPU (Intel Corporation, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA) with 24 GB of memory. CUDA 11.0 and cuDNN 8.0.5 were also installed to accelerate deep learning and parallel computing tasks.
In the training process, the model optimization is carried out using the Stochastic Gradient Descent (SGD) method, the momentum factor is 0.9, and the weight decay coefficient is 0.0005. The Polynomial Decay learning rate scheduling strategy is used, resulting in the following:
$lr = lr_{base} \times \left(1 - \frac{iter}{max\_iter}\right)^{power}$
where $lr_{base}$, $iter$, and $max\_iter$ represent the initial learning rate of 0.01, the current training iteration, and the maximum number of iterations of 20,000, respectively, and $power$ is the decay exponent.
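For reference, the sketch below wires these settings into a PyTorch training loop; the tiny placeholder model and the decay exponent value of 0.9 (a common MMSegmentation default) are assumptions, since the text does not state the exponent.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 2, kernel_size=1)        # placeholder standing in for GASF-ResNet
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0005)

base_lr, max_iter, power = 0.01, 20000, 0.9   # power = 0.9 is an assumed value

def poly_lr(iteration):
    """lr = lr_base * (1 - iter / max_iter) ** power"""
    return base_lr * (1 - iteration / max_iter) ** power

for it in range(max_iter):
    for group in optimizer.param_groups:      # apply the poly-decayed learning rate
        group["lr"] = poly_lr(it)
    # ... forward pass, loss.backward(), optimizer.step(), optimizer.zero_grad() ...
```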

3.3. Experimental Evaluation Indicators

The evaluation metrics selected in this paper are the mean intersection over union (mIoU), mean pixel accuracy (mPA), and mean F1 score (mF1), where the larger the value, the better the segmentation result. In a binary classification scenario, there are typically four possible outcomes:
(1)
TP (True Positive): The actual case is positive, and the model correctly predicts it as positive.
(2)
FP (False Positive): The model incorrectly predicts a positive case when the actual case is negative.
(3)
FN (False Negative): The model incorrectly predicts a negative case when the actual case is positive.
(4)
TN (True Negative): The actual case is negative, and the model correctly predicts it as negative.
In semantic segmentation, the mIoU is computed separately for each class to express the intersection ratio of the true label and the predicted result, and then, the IOUs of all classes are averaged as follows:
$mIoU = \frac{1}{k+1} \sum_{i=0}^{k} \frac{P_{ii}}{\sum_{j=0}^{k} P_{ij} + \sum_{j=0}^{k} P_{ji} - P_{ii}}$
where $k+1$, $i$, $j$, and $P_{ij}$ denote the number of categories contained in the dataset (the additional 1 being the background category), the true label category of a pixel, the predicted label category of a pixel, and the total number of pixels whose true label is $i$ and predicted label is $j$, respectively. The mPA is the mean, over all categories, of the proportion of correctly predicted pixels in each category, obtained as follows:
$mPA = \frac{1}{k+1} \sum_{i=0}^{k} \frac{P_{ii}}{\sum_{j=0}^{k} P_{ij}}$
The F1 indicator is the harmonic mean of precision and recall, with higher values indicating better predictive power. Both precision and recall can be calculated directly from the confusion matrix. Therefore, the segmentation results are evaluated using a confusion matrix, an example of which is shown in Figure 4.
The formulas for the F1 indicator are shown in (10)–(12). Precision, also known as the accuracy rate, is the proportion of correct predictions among the samples predicted as positive by the classifier; the larger the value, the better the prediction ability. Recall is the proportion of all positive samples that are correctly predicted to be positive by the classifier; the higher the value, the better the prediction ability.
$\mathrm{Precision} = \frac{TP}{TP + FP}$
$\mathrm{Recall} = \frac{TP}{TP + FN}$
$\mathrm{F1\ score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
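All of the metrics above can be derived from a single confusion matrix; the sketch below computes mIoU, mPA, and mF1 for the two-class water/land case (the brute-force matrix construction is written for clarity, not speed).

```python
import numpy as np

def segmentation_metrics(pred, label, num_classes=2):
    """pred, label: integer arrays of the same shape with values in [0, num_classes)."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(label.ravel(), pred.ravel()):
        cm[t, p] += 1                                   # rows: true class, cols: predicted class
    tp = np.diag(cm).astype(np.float64)
    iou = tp / (cm.sum(axis=1) + cm.sum(axis=0) - tp)   # per-class intersection over union
    pa = tp / cm.sum(axis=1)                            # per-class pixel accuracy
    precision = tp / cm.sum(axis=0)
    recall = tp / cm.sum(axis=1)
    f1 = 2 * precision * recall / (precision + recall)
    return {"mIoU": iou.mean(), "mPA": pa.mean(), "mF1": f1.mean()}

# Example: two tiny 2x2 maps where one land pixel is misclassified as water.
pred = np.array([[1, 1], [0, 1]])
label = np.array([[1, 1], [0, 0]])
print(segmentation_metrics(pred, label))
```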

4. Experimental Results and Analysis

4.1. Evaluation of Various Models During the Training Process

To examine how the new approach compares with existing methods throughout the training process, the validation set is evaluated after every 1000 iterations. Figure 5 displays the IoU scores for both land and water classes, with the x-axis indicating the iteration count (in increments of 1000) and the y-axis representing the IoU values. Figure 5a–c correspond to the IoU for the water region, the IoU for the land region, and the mIoU at each validation stage, respectively. From the figures, it can be observed that the U-Net, DMNet, and FCN models show large fluctuations in the early stages, and their IoU values decrease noticeably in some iteration stages, but their performance gradually stabilizes as the training progresses.
In contrast, the GASF-ResNet model attains the top performance, with the IoU score remaining at a high level with minimal fluctuations and stabilizing after the 17,000th iteration. This indicates that the proposed model has good robustness while achieving high recognition accuracy. As training progresses over more iterations, the IoU scores of all models keep improving, and by the end of training, the IoU values of the proposed model reach 98.43% and 98.79% for the water and land categories, respectively.
Table 1 shows the results of k-fold cross-validation. The training environment, optimizer, and learning rate strategy were kept consistent across all folds. K-fold cross-validation effectively reduces the impact of data partitioning bias on the results of a single experiment. Over the five folds on the self-made dataset, the mPA mean is 99.14% with a confidence interval of [99.10, 99.18], and the mIoU mean is 98.47% with a confidence interval of [98.43, 98.51], which fully reflects the stability and generalization ability of the model and provides statistical support for the reliability of its performance.

4.2. Analysis of Comparative Experimental Outcomes

To appraise the segmentation capability of the proposed network on waterside images, we performed training and validation on the same dataset and conducted a comparative evaluation against several well-established segmentation methods, including FCN, U-Net, DeepLabV3, and UPerNet, using the mean intersection over union (mIoU) and the mean pixel accuracy (mPA) as the main evaluation metrics. All methods were assessed under uniform experimental conditions and consistent parameter configurations. The models were trained over multiple iterations on the dataset, with the best-performing weights on the validation set saved for comparison. The comparative analysis of the proposed network and the traditional algorithms is detailed in Table 2.
Based on the data presented in Table 2, DeepLabv3 exhibits a high mean intersection over union (mIoU) among the compared models. That approach employs a deep convolutional neural network (CNN) as its core architecture, consisting of numerous convolutional layers. In addition, DeepLabv3 efficiently captures multi-scale features through dilated convolution and the ASPP module, which also increases the number of model parameters. In contrast, the GASF-ResNet multi-scale feature extraction network proposed in this paper improves the segmentation effect by fusing multi-scale information with spatial pyramid pooling while keeping the number of model parameters lower than that of DeepLabv3. In summary, the proposed approach improves the mean intersection over union and mean pixel accuracy by 0.22% and 0.43%, respectively, relative to DeepLabv3. Compared with widely used networks such as FCN, U-Net, and UPerNet, it achieves improvements of 4.36%, 4.26%, and 4% in the mIoU metric, respectively, with superior performance in both the mean intersection over union and pixel accuracy measures compared with other existing methods.
From Figure 6, it can be calculated that the F1 of the background area is 97.98%, and the F1 of the water area is 98.04%. Therefore, the model performs well in the task of water segmentation and has high recognition accuracy and few misjudgments.
To substantiate the efficacy of the proposed segmentation technique, four representative images from the dataset were selected for qualitative analysis, with the visual segmentation results of the different networks on these images presented in Figure 7. Each column features three sections: the original image, the segmentation outcome of each technique, and a magnified detail view. The model classifies the original images into two types: red regions represent water bodies, while unmarked regions denote land or obstacles. In the simple scenarios of water areas 1 and 4, the FCN network can largely segment the water areas correctly. However, in the complex water areas 2 and 3, its segmentation is relatively rough: some water areas are not properly segmented, and there is also obvious over-segmentation. This is because the FCN network lacks global contextual information and easily misidentifies non-water areas as water. Moreover, due to the lack of an attention mechanism to guide feature learning, the segmentation boundaries of the FCN network are not accurate enough, leading to a serious loss of detailed information. In addition, its limited receptive field makes it unable to effectively process multi-scale features and accurately distinguish between reflective water surfaces and real water. In the scenarios of water areas 1 and 4, the U-Net network, which is commonly used for medical image segmentation, shows strong detail capture capabilities and can accurately identify water boundaries. However, when dealing with water surface reflections and subtle features, U-Net slightly underperforms, with significant over-segmentation or omissions. This is mainly due to the lack of an adaptive feature selection mechanism in U-Net, which is prone to segmentation errors in reflective areas of the water surface. In addition, its skip connections may transmit noisy features, resulting in the mishandling of subtle features. Because U-Net mainly relies on local feature fusion, it cannot understand the global water distribution, which also limits its effectiveness in complex water scenarios.
In contrast, DeepLabv3 enhances the segmentation ability of objects at different scales by introducing dilated convolutions to expand the receptive field and performs well in dealing with water surface reflections. However, when working with edge details, DeepLabv3 shows noticeable missing segmentation. This is because the dilated convolution not only expands the receptive field but also leads to a certain degree of spatial information loss, which makes the water boundary not clear enough. Overall, the GASF-ResNet proposed in this paper demonstrates excellent performance in all water domain scenarios by fusing multi-scale features and efficiently integrating contextual information. GASF-ResNet is able to better capture the details of the waterfront junction through the two-way attention of GGCA, ensuring the integrity of the boundary. It can effectively solve the problem of visual confusion caused by water surface reflection and lighting changes. Specifically, the global average pooling in GASF-ResNet is used to extract the features of the whole water area, while the two-way attention weights can suppress the interference of the reflection region. In addition, the grouping mechanism allows different groups of channels to focus on different features, allowing for global modeling and reducing the impact of local lighting changes on the model. In complex waters 2 and 3, it is able to correctly divide out the surface area while avoiding the problems of over-segmentation and omission. When dealing with reflections and details on the water surface, it accurately captures details and precisely segments complex boundaries. This not only results in the highest average intersection ratio and average pixel accuracy but also keeps the number of model parameters relatively low.
In addition, to verify the generalization ability of the model, GASF-ResNet was trained and tested on the USVInland dataset, with the experimental environment and training parameters of all methods kept consistent with those used for the self-made dataset. Table 3 shows the performance comparison between the proposed network model and the classic methods. From the results in Table 3, it can be seen that the proposed method outperforms the other methods in terms of both mean intersection over union and mean pixel accuracy.

4.3. Findings and Evaluation from Ablation Studies

In order to verify the effectiveness of the proposed GASF-ResNet in the task of water segmentation in the complex environment of inland rivers, ablation experiments on the different modules were carried out. A total of seven sets of experiments, ①~⑦, were designed to evaluate the respective roles of the attention enhancement module GGCA, the multi-scale feature fusion FPN, and Atrous Spatial Pyramid Pooling (ASPP) in the model. The settings and parameters across all experiments were kept uniform, with the results summarized in Table 4. Specifically, experiment ⑦ demonstrates the performance of the full GASF-ResNet model on the dataset, achieving a mean intersection over union of 98.61% and a mean pixel accuracy of 99.31%.
(1)
In experiment ①, the model encoder’s attention enhancement module was taken out, resulting in reductions of 1.2% and 0.57% in the mIoU and mPA, respectively, when compared with the complete model in experiment ⑦. It is indicated that the more accurate acquisition of key target features is facilitated by the attention enhancement module, while interference from background noise is mitigated.
(2)
In experiment ②, the multi-scale feature fusion step was omitted, and only the ASPP module was used for segmenting the high-level semantic features. It was shown that the model’s capacity to detect targets at various scales was diminished, leading to decreases of 0.93% and 0.59% in the mIoU and mPA compared to the complete model ⑦. This indicates that multi-scale feature fusion is important for enhancing target feature representation and improving the segmentation effect.
(3)
In experiment ③, the ASPP module was eliminated, and only features from the backbone network were employed for direct multi-scale fusion. The model's performance experienced a slight decline, with decreases of 0.8% and 0.57% in the mIoU and mPA, respectively, compared to the full model ⑦. This suggests that the ASPP component is vital in expanding the receptive field and improving the extraction of multi-scale contextual features.
(4)
In experiment ④, the multi-scale feature aggregation and ASPP components were excluded, with segmentation relying solely on high-level semantic features. Reductions of 1.06% and 0.68% in the mIoU and mPA, respectively, were observed compared to the complete model ⑦. This suggests that the combined utilization of both modules can greatly enhance segmentation effectiveness, particularly in the detection and handling of fine boundary details. Since GGCA belongs to the encoder part, there is no need to consider its compatibility with the decoder.
(5)
In experiment ⑤, the attention enhancement module and the ASPP module in the model encoder are omitted, and the multi-scale feature fusion is performed directly, which shows that the model performance decreases by 1.22% and 0.76% in the mIoU and mPA, respectively, when compared with the full model ⑦. This indicates that an important role is played by these two modules in enhancing the key feature representation and improving the overall feature extraction capability.
(6)
In experiment ⑥, the attention enhancement component within the encoder part of the model and the multi-scale feature integration were removed, with segmentation carried out solely after the ASPP module. Compared to experiment ⑦, decreases of 1.38% and 0.86% in the mIoU and mPA were observed, respectively. This suggests that the attention enhancement component of the encoder and multi-scale feature aggregation aid in capturing multi-dimensional global context, enlarging the network’s receptive field and emphasizing key features, thereby enhancing segmentation performance.

5. Conclusions

Focusing on the problem that existing semantic segmentation methods are easily disturbed by background noise when processing water images with changing water surface illumination and shore reflections, resulting in the inaccurate segmentation of small targets such as water boundaries or water plants, this paper proposes the multi-spatial-dimensional attention mechanism model GASF-ResNet, which utilizes ResNet-50 as the primary backbone for feature extraction integrated with an attention enhancement component. The attention map is produced by aggregating the global context along the spatial axes (height and width) and assigning weights to the input feature map to improve feature expression. This enables the model to rapidly concentrate on the target area within the feature map and reduces the impact of background noise. For the problems of water surface reflection and lighting change, the grouping scheme of the GGCA module allows different channel groups to deal with different types of reflections separately. GGCA can adapt spatially to different lighting conditions: the directional patterns generated by different lighting angles can be captured by the attention mechanisms in the H and W directions, and the channel grouping can focus on extracting illumination-invariant features. Furthermore, the FPN is employed to build a multi-scale aggregation component, which, together with spatial pyramid pooling, enlarges the receptive field and facilitates the integration of various semantic features. In this way, the model effectively overcomes the problems of water surface reflection, uneven illumination, water surface ripples, and ambiguous water boundaries in the complex environment of rivers. The experimental outcomes indicate that the mean pixel accuracy (mPA) and mean intersection over union (mIoU) of the proposed method are 99.31% and 98.61% on the self-made dataset and 98.55% and 99.27% on the USVInland unmanned ship dataset, respectively, which are significantly better than the existing mainstream models, while the model's parameter count and computational requirements do not increase significantly, thereby validating the effectiveness and practicality of the proposed enhancements. Future research on feasible-domain water segmentation will focus on improving performance under adverse weather, developing models that can operate stably in such conditions, and enhancing the adaptability of models to complex environments by simulating different weather scenarios, thereby improving segmentation accuracy under harsh conditions such as strong winds, heavy rain, or dense fog. Exploring multi-modal data fusion, combining optical images, radar signals, underwater acoustic data, and other modalities, will further improve the overall perception of the water surface environment. In addition, combining deep learning techniques with dedicated hardware will be a core approach to improving segmentation efficiency and enabling real-time applications.

Author Contributions

Software, Z.Z.; validation, Z.Z.; writing—original draft, Z.Z. and N.W.; formal analysis, Y.W. and H.L.; conceptualization, N.W.; investigation, N.W.; resources, W.L.; writing—review and editing, Z.Z.; supervision, N.W.; methodology, N.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data is available upon request.

Conflicts of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as potential conflicts of interest.

References

  1. Ruangpayoongsak, N.; Sumroengrit, J.; Leanglum, M. A Floating Waste Scooper Robot On Water Surface. In Proceedings of the 2017 17th International Conference on Control, Automation and Systems (ICCAS), Jeju, Republic of Korea, 18–21 October 2017; pp. 1543–1548. [Google Scholar]
  2. Mendonca, R.; Marques, M.M.; Marques, F.; Lourenco, A.; Pinto, E.; Santana, P.; Coito, F.; Lobo, V.; Barata, J. A cooperative multi-robot team for the surveillance of shipwreck survivors at sea. In Proceedings of the OCEANS 2016 MTS/IEEE Monterey, Monterey, CA, USA, 19–23 September 2016; pp. 1–6. [Google Scholar]
  3. Wang, W.; Gheneti, B.; Mateos, L.A.; Duarte, F.; Ratti, C.; Rus, D. Roboat: An Autonomous Surface Vehicle for Urban Waterways. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 6340–6347. [Google Scholar]
  4. Bae, I.; Hong, J. Survey on the Developments of Unmanned Marine Vehicles: Intelligence and Cooperation. Sensors 2023, 23, 4643. [Google Scholar] [CrossRef] [PubMed]
  5. Gatesichapakorn, S.; Takamatsu, J.; Ruchanurucks, M. ROS Based Autonomous Mobile Robot Navigation Using 2D LiDAR and RGB-D Camera. In Proceedings of the 2019 First International Symposium on Instrumentation, Control, Artificial Intelligence, and Robotics (ICA-SYMP), Bangkok, Thailand, 16–18 January 2019; pp. 151–154. [Google Scholar] [CrossRef]
  6. An, C.-J.; Chen, Z.-P. SAR image watershed segmentation algorithm based on Otsu and improved CV model. Signal Process. 2011, 27, 221–225. [Google Scholar]
  7. Yu, S.F. A Watershed Segmentation Extraction Method for Bridge Recognition. Electro-Opt. Control. 2011, 18, 72–75. [Google Scholar]
  8. Qiu, X.; Chen, S.; Huang, Y. An Algorithm for Identification of Inland River Shorelines based on Phase Correlation Algorithm. In Proceedings of the 2019 Chinese Automation Congress (CAC), Hangzhou, China, 22–24 November 2019; IEEE: New York, NY, USA; pp. 2047–2053. [Google Scholar]
  9. Li, N.; Lv, X.; Xu, S.; Li, B.; Gu, Y. An improved water surface images segmentation algorithm based on the Otsu method. J. Circuits Syst. Comput. 2020, 29, 2050251. [Google Scholar] [CrossRef]
  10. Yu, J.; Lin, Y.; Zhu, Y.; Xu, W.; Hou, D.; Huang, P.; Zhang, G. Segmentation of river scenes based on water surface reflection mechanism. Appl. Sci. 2020, 10, 2471–2489. [Google Scholar] [CrossRef]
  11. Santana, P.; Ca, M.R.; Barata, J. Water detection with segmentation guided dynamic texture recognition. In Proceedings of the 2012 IEEE International Conference on Robotics and Biomimetics (ROBIO), Guangzhou, China, 11–14 December 2012; IEEE: New York, NY, USA; pp. 1836–1841. [Google Scholar]
  12. Lyu, X.; Jiang, W.; Li, X.; Fang, Y.; Xu, Z.; Wang, X. MSAFNet: Multiscale Successive Attention Fusion Network for Water Body Extraction of Remote Sensing Images. Remote Sens. 2023, 15, 3121. [Google Scholar] [CrossRef]
  13. Zhang, J.T.; Gao, J.T.; Liang, J.S.; Wu, Y.Q.; Li, B.; Zhai, Y.; Li, X.M. Efficient Water Segmentation with Transformer and Knowledge Distillation for USVs. J. Mar. Sci. Eng. 2023, 11, 901. [Google Scholar] [CrossRef]
  14. Yu, L.; Wang, Z.; Tian, S.; Ye, F.; Ding, J.; Kong, J. Convolutional Neural Networks for Water Body Extraction From Landsat Imagery. Int. J. Comput. Intell. Appl. 2017, 16, 1750001. [Google Scholar] [CrossRef]
  15. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  16. Isikdogan, F.; Bovik, A.C.; Passalacqua, P. Surface Water Mapping by Deep Learning. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2017, 10, 4909–4918. [Google Scholar] [CrossRef]
  17. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the 2015 International Conference on Medical Image Computing and Computer-Assisted Intervention, LNCS 9351, Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland; pp. 234–241. [Google Scholar]
  18. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  19. Feng, W.; Sui, H.; Huang, W.; Xu, C.; An, K. Water Body Extraction from Very High-Resolution Remote Sensing Imagery Using Deep U-Net and a Superpixel-Based Conditional Random Field Model. IEEE Geosci. Remote Sens. Lett. 2019, 16, 618–622. [Google Scholar] [CrossRef]
  20. Xia, M.; Cui, Y.; Zhang, Y.; Xu, Y.; Liu, J. DAU-Net: A Novel Water Areas Segmentation Structure for Remote Sensing Image. Int. J. Remote Sens. 2021, 42, 2594–2621. [Google Scholar] [CrossRef]
  21. Duan, L.; Hu, X. Multiscale Refinement Network for Water-Body Segmentation in High-Resolution Satellite Imagery. IEEE Geosci. Remote Sens. Lett. 2020, 17, 686–690. [Google Scholar] [CrossRef]
  22. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  23. Jonnala, N.S.; Siraaj, S.; Prastuti, Y.; Chinnababu, P.; Babu, B.P.; Bansal, S.; Upadhyaya, P.; Prakash, K.; Faruque, M.R.I.; Al-Mugren, K.S. AER U-Net: Attention-enhanced multi-scale residual U-Net structure for water body segmentation using Sentinel-2 satellite images. Sci. Rep. 2025, 15, 16099. [Google Scholar] [CrossRef] [PubMed]
  24. Jonnala, N.S.; Bheemana, R.C.; Prakash, K.; Bansal, S.; Jain, A.; Pandey, V.; Faruque, M.R.I.; Al-Mugren, K.S. DSIA U-Net: Deep shallow interaction with attention mechanism UNet for remote sensing satellite images. Sci. Rep. 2025, 15, 549. [Google Scholar] [CrossRef]
  25. Gao, Y.; Yu, H. Lightweight wheat fertility identification model based on improved Vision Transformer. J. Anhui Inst. Sci. Technol. 2024, 1–10. Available online: http://kns.cnki.net/kcms/detail/34.1300.N.20241213.0846.002.html (accessed on 10 May 2025).
  26. Li, A.; Jiao, L.; Zhu, H.; Li, L.; Liu, F. Multitask Semantic Boundary Awareness Network for Remote Sensing Image Segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5400314. [Google Scholar] [CrossRef]
  27. Aleissaee, A.A.; Kumar, A.; Anwer, R.M.; Khan, S.; Cholakkal, H.; Xia, G.S.; Khan, F.S. Transformers in Remote Sensing: A Survey. Remote Sens. 2023, 15, 1860. [Google Scholar] [CrossRef]
  28. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2023, Paris, France, 1–6 October 2023; pp. 4015–4026. [Google Scholar]
  29. Hong, Y.; Zhou, X.; Hua, R.; Lv, Q.; Dong, J. WaterSAM: Adapting SAM for Underwater Object Segmentation. J. Mar. Sci. Eng. 2024, 12, 1616. [Google Scholar] [CrossRef]
  30. Zhang, Z.; Cai, H.; Han, S. Efficientvit-sam: Accelerated segment anything model without performance loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024, Seattle, WA, USA, 16–22 June 2024; pp. 7859–7863. [Google Scholar]
  31. Fu, J.; Yu, Y.; Li, N.; Zhang, Y.; Chen, Q.; Xiong, J.; Yin, J.; Xiang, Z. Lite-sam is actually what you need for segment everything. In Proceedings of the European Conference on Computer Vision, Milano, Italy, 29 September–4 October 2024; Springer Nature: Cham, Switzerland, 2024; pp. 456–471. [Google Scholar]
  32. Chen, L.C.; Papandreou, G.; Kokkinos, I. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. arXiv 2016, arXiv:1412.7062. [Google Scholar] [CrossRef]
  33. Chen, L.C.; Papandreou, G.; Kokkinos, I. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 40, 834–848. [Google Scholar] [CrossRef]
  34. Chen, L.C.; Papandreou, G.; Schroff, F. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar] [CrossRef]
  35. Chen, L.C.; Zhu, Y.; Papandreou, G. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar] [CrossRef]
  36. Xiao, T.; Liu, Y.; Zhou, B.; Jiang, Y.; Sun, J. Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 418–434. [Google Scholar]
  37. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
Figure 1. GASF-ResNet structure.
Figure 2. Flow of attention mechanism realization.
Figure 3. Multi-scale feature integration decoder.
Figure 4. Confusion matrix plot.
Figure 5. Evaluation of various models on the validation set. (a) Intersection over union (IoU) for water regions. (b) IoU for land. (c) Mean intersection over union (mIoU).
Figure 6. Confusion matrix results of the GASF-ResNet model on the self-made dataset.
Figure 7. Comparison of the segmentation effect of different network structures.
Table 1. K-fold cross-validation results of GASF-ResNet on the self-made dataset.

| Cross-Validation Fold | mPA (%) | mIoU (%) |
|---|---|---|
| 1 | 99.18 | 98.51 |
| 2 | 99.12 | 98.45 |
| 3 | 99.15 | 98.43 |
| 4 | 99.10 | 98.50 |
| 5 | 99.17 | 98.47 |
| Average | 99.14 | 98.47 |
| Standard deviation | 0.032 | 0.031 |
| 95% confidence interval | [99.10, 99.18] | [98.43, 98.51] |
Table 2. Performance comparison with classic models on the self-made dataset.

| Model | PA (Water)/% | PA (Land)/% | IoU (Water)/% | IoU (Land)/% | mPA/% | mIoU/% | mF1/% | Parameters/M |
|---|---|---|---|---|---|---|---|---|
| FCN | 96.95 | 96.99 | 93.9 | 94.6 | 96.97 | 94.25 | 95.79 | 47.105 |
| U-Net | 97.01 | 97.05 | 93.61 | 94.68 | 97.03 | 94.35 | 95.86 | 28.991 |
| DeepLabv3 | 98.5 | 99.26 | 97.66 | 98.22 | 98.88 | 97.94 | 97.96 | 65.72 |
| Uper-Net | 97 | 97.34 | 94.29 | 94.94 | 97.17 | 94.61 | 96.06 | 64.042 |
| Dmnet | 97.3 | 98.02 | 95.3 | 95.86 | 97.66 | 95.58 | 97.76 | 50.803 |
| GASF-ResNet | 99.1 | 99.52 | 98.43 | 98.79 | 99.31 | 98.61 | 98.01 | 62.15 |
Table 3. Performance comparison with classic models on the USVInland dataset.

| Model | PA (Water)/% | PA (Land)/% | IoU (Water)/% | IoU (Land)/% | mPA/% | mIoU/% | mF1/% | Parameters/M |
|---|---|---|---|---|---|---|---|---|
| FCN | 97.10 | 97.32 | 94.38 | 95.85 | 97.21 | 95.12 | 95.72 | 47.105 |
| U-Net | 97.20 | 97.30 | 94.46 | 95.91 | 97.25 | 95.19 | 95.85 | 28.991 |
| DeepLabv3 | 99.00 | 99.04 | 97.94 | 98.43 | 99.02 | 98.18 | 97.98 | 65.72 |
| Uper-Net | 96.50 | 97.86 | 94.3 | 95.79 | 97.18 | 95.05 | 96.17 | 64.042 |
| Dmnet | 97.97 | 97.99 | 95.88 | 96.91 | 97.98 | 96.40 | 97.30 | 50.803 |
| GASF-ResNet | 99.00 | 99.54 | 98.37 | 98.74 | 99.27 | 98.55 | 98.00 | 62.15 |
Table 4. Results of ablation experiments (× denotes a disabled module among GGCA, multi-scale feature fusion, and ASPP).

| Module Combination (GGCA, Multi-Scale Feature Fusion, ASPP) | mPA/% | mIoU/% |
|---|---|---|
| one module disabled (×) | 98.74 | 97.41 |
| one module disabled (×) | 98.72 | 97.68 |
| one module disabled (×) | 98.81 | 97.81 |
| two modules disabled (××) | 98.63 | 97.55 |
| two modules disabled (××) | 98.55 | 97.39 |
| two modules disabled (××) | 98.45 | 97.23 |
| full model | 99.31 | 98.61 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
