3.1. General Flowchart of the Proposed Methodology
The overall process is illustrated in Figure 9a: first, data are collected in maritime environments using an onboard camera, and a dataset is constructed through data processing (frame extraction and annotation). Then, the USVS-Net model is trained offline and deployed onto the edge computing platform of the unmanned surface vehicle (USV). During actual operation, the camera continuously captures real-time images, which are processed by the deployed model to generate pixel-level semantic segmentation results, providing environmental perception support for autonomous navigation of the USV.
Although recent semantic segmentation methods have achieved considerable progress in multi-scale contextual modeling and attention mechanisms, a large proportion of studies still tend to emphasize the "dominant/salient" features of an image, while neglecting boundary details, less salient textures, and information from non-dominant channels. This imbalance prevents the network from fully exploiting the diverse discriminative cues contained in the source image. In maritime scenes, for instance, single-frame vision-based models, such as the lightweight eWaSR variant in the WaSR series [6], primarily focus on strongly salient water–obstacle boundaries and large objects. Under complex sea conditions featuring strong reflections, wave disturbances, and small-scale obstacles, fine-grained details and weak texture cues are often ignored, leading to boundary misclassification and missed detections of small targets. To overcome the limitation of focusing solely on salient regions, one line of research explicitly incorporates boundary or contour priors into the network through boundary-aware branches or loss functions. Empirical results show that these approaches can significantly enhance the discriminability of pixels in transition areas; however, they still exhibit instability under severe noise or large inter-class scale variations [17]. Another line of work introduces temporal context to smooth single-frame noise and surface perturbations, but when inter-frame appearance changes drastically or small targets appear only momentarily, non-salient features may still be suppressed [7]. Recently, some studies have explored frequency-domain perspectives, arguing that neglected high-frequency details (e.g., edges and fine textures) are crucial for accurate segmentation. Consequently, attention weighting in Fourier or wavelet domains has been proposed to compensate for the bias of emphasizing salient responses only in the spatial domain. Nevertheless, how to achieve effective complementarity between spatial and frequency representations without amplifying noise remains an open problem requiring more careful design and validation [18].
Based on these observations, we argue that relying solely on dominant features tends to amplify salient regions and suppress complementary information (such as fine-grained boundaries, less salient textures, and high-frequency cues), which leads to performance bottlenecks in small-object segmentation, complex boundaries, and heavily disturbed maritime environments. Therefore, this study emphasizes cross-scale, cross-channel, and cross-domain (spatial/frequency) collaborative modeling at the network level, aiming to systematically enhance the capture of “non-salient but crucial” information without introducing excessive computational cost.
Therefore, this paper proposes USVS-Net, a multi-scale interactive segmentation network based on the DeepLabV3+ framework, for feasible domain segmentation of unmanned surface vehicles. The structure of the network is shown in Figure 9b. Although using Xception as the backbone of DeepLabV3+ improves accuracy, it suffers from high computational complexity, large memory occupation, and long training time, which limit its application in resource-constrained scenarios. Therefore, in USVS-Net, we use the MobileNetV2 network as the backbone feature extraction network to overcome these drawbacks. To address the shortcoming of MobileNetV2 in feature extraction, namely that its depthwise convolutions and downsampling operations tend to cause the loss of image details and thus reduce segmentation accuracy, this paper designs the Global Channel-Spatial Attention (GCSA) module. This module strengthens the model's ability to understand global semantics by establishing long-range associations between features, which effectively alleviates the attenuation of detail information while improving feature representativeness. However, the channel shuffling operation in GCSA loses position information, which is important for generating spatial attention maps. Therefore, we integrate a Coordinate Attention (CA) module after GCSA to reduce the loss of position information and to further enhance the feature representation capability of the network through this module's multi-directional spatial feature perception.
In the feature fusion stage, conventional networks adopt ASPP, which can capture multi-scale contextual information, but ASPP has several drawbacks for this task. The frequency-domain features of the feasible domain image of the unmanned surface vehicle describe details such as edges and lines, which are crucial for pixel-by-pixel classification. However, because of its fixed dilation rates, ASPP cannot effectively differentiate useful frequency-domain information (e.g., edges) from useless information (e.g., noise), which may lead to noise amplification; its multi-branch structure increases computational complexity and memory occupation, reducing the efficiency of frequency-domain information extraction; and its atrous (dilated) convolutions may introduce the gridding effect, which disrupts the continuity of frequency-domain features and affects detail modeling. As a result, ASPP struggles to distinguish useful from useless frequency-domain information, limiting its performance in pixel-by-pixel classification. To address these problems, we propose the Median-Enhanced Channel and Spatial attention block (MECS), which has a strong frequency-domain feature capture capability, and place it after the atrous convolutions of ASPP to enhance the frequency-domain feature extraction for the input image, forming the MECS-ASPP structure.
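To make the placement concrete, the following is a minimal PyTorch-style sketch of how a MECS block could be attached behind each atrous branch of ASPP. It is an illustration rather than the authors' implementation: the class name `MECSASPP`, the dilation rates (6, 12, 18), the omission of the image-pooling branch, and the injected `mecs_factory` (which should return the MECS module of Section 3.2.2) are all assumptions.

```python
import torch
import torch.nn as nn

class MECSASPP(nn.Module):
    """Sketch of MECS-ASPP: a MECS attention block (Section 3.2.2) is appended to each
    atrous branch of ASPP. `mecs_factory(channels)` must return that attention module;
    the dilation rates and the omitted image-pooling branch are simplifications."""
    def __init__(self, in_ch, out_ch, mecs_factory, rates=(6, 12, 18)):
        super().__init__()
        def conv_bn_relu(k, dilation=1):
            pad = 0 if k == 1 else dilation
            return [nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=dilation, bias=False),
                    nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
        branches = [nn.Sequential(*conv_bn_relu(1))]                       # 1x1 branch
        for r in rates:                                                    # atrous branches + MECS
            branches.append(nn.Sequential(*conv_bn_relu(3, r), mecs_factory(out_ch)))
        self.branches = nn.ModuleList(branches)
        self.project = nn.Sequential(
            nn.Conv2d(out_ch * len(branches), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```

For a quick shape check, `MECSASPP(320, 256, mecs_factory=lambda c: nn.Identity())` instantiates the structure with a no-op attention block in place of MECS.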
In USVS-Net, CBL blocks (convolution, batch normalization, and activation function) and Upsample operations are used. Because these structures contain channel reduction and expansion operations, they can disrupt channel information and thus affect the segmentation accuracy of the network. Therefore, we adopt the cSE module and Triplet Attention to integrate channel information and realize cross-dimension interaction, which in turn strengthens the channel feature representation.
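For reference, a minimal sketch of the cSE (channel squeeze-and-excitation) gate in its commonly used form is given below; the reduction ratio of 16 is an assumption, and Triplet Attention follows its original published design and is not reproduced here.

```python
import torch
import torch.nn as nn

class ChannelSE(nn.Module):
    """Standard channel squeeze-and-excitation (cSE) gate:
    global average pooling -> bottleneck MLP -> Sigmoid -> channel re-weighting."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w            # channels are re-weighted, not reordered
```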
The workflow of the network is as follows:
Firstly, the image data are input to the MobileNetV2 backbone network for multilayer feature extraction; according to the difference in feature abstraction level, shallow detail features are defined as primary semantic features and deep abstract features as secondary semantic features, and cross-layer information complementarity is realized through a feature fusion mechanism. This hierarchical feature delineation preserves the original details of the image while capturing high-level semantic associations.
Secondly, the secondary semantic features are processed by GCSA, CA, MECS-ASPP, CBL, and Upsample to become the "deep features". The "deep features" are processed by cSE and fused with the primary semantic features; the fused information is processed by CBL and an activation function and then multiplied element-by-element with the primary semantic features to form the "deep processing features". To fully utilize the semantic information, the primary semantic features, the "deep processing features", and the "deep features" are fused again; the fused features are processed by CBL and an activation function and then fused with the primary semantic features once more. The resulting features are processed by Triplet Attention and upsampled again.
Finally, the segmentation result is obtained with the same resolution as the input image.
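The fusion flow can be summarized by the following schematic sketch. It is a reading of the workflow above rather than the authors' code: all sub-modules are injected as placeholders, the unspecified fusion operations are assumed to be channel concatenation followed by a projection (`fuse1`, `fuse2`), the gating activation before the element-wise product is assumed to be a Sigmoid, and the last fusion with the primary features is assumed to be additive.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class USVSFusionDecoder(nn.Module):
    """Schematic sketch of the fusion workflow of USVS-Net (Figure 9b), not the exact code.
    gcsa, ca, mecs_aspp, cbl1-3, cse, triplet, fuse1 and fuse2 are any nn.Modules with
    matching channel counts; fuse1/fuse2 stand for the unspecified fusion operation."""
    def __init__(self, gcsa, ca, mecs_aspp, cbl1, cbl2, cbl3, cse, triplet, fuse1, fuse2):
        super().__init__()
        self.gcsa, self.ca, self.aspp = gcsa, ca, mecs_aspp
        self.cbl1, self.cbl2, self.cbl3 = cbl1, cbl2, cbl3
        self.cse, self.triplet = cse, triplet
        self.fuse1, self.fuse2 = fuse1, fuse2       # e.g. concat followed by a 1x1 conv block

    def forward(self, primary, secondary):
        # deep branch: attention + multi-scale context, upsampled to the primary resolution
        deep = self.cbl1(self.aspp(self.ca(self.gcsa(secondary))))
        deep = F.interpolate(deep, size=primary.shape[-2:], mode="bilinear", align_corners=False)
        # first fusion: channel-recalibrated deep features merged with the primary features
        fused = self.fuse1(torch.cat([self.cse(deep), primary], dim=1))
        deep_proc = torch.sigmoid(self.cbl2(fused)) * primary   # "deep processing features" (Sigmoid gate assumed)
        # second fusion: primary, deep-processing and deep features combined again
        fused2 = self.fuse2(torch.cat([primary, deep_proc, deep], dim=1))
        out = torch.relu(self.cbl3(fused2)) + primary           # fused with the primary features again (assumed additive)
        # cross-dimension refinement; the final upsampling to input resolution follows as described
        return self.triplet(out)
```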
3.2. Attention Mechanism Structure
3.2.1. Global Channel-Spatial Attention Module (GCSA Module)
Motivations: Pixel-level classification helps to distinguish similar objects and recognize small feature differences. However, existing methods often achieve this by increasing computational complexity. In this paper, the Global Channel-Spatial Attention (GCSA) module is proposed, as shown in Figure 10. This module contains channel attention, channel shuffling, and spatial attention. Our motivation for designing this module, and the mechanism by which it improves pixel classification accuracy, is as follows:
(1) For the channel attention submodule: In traditional convolutional neural networks, the dependencies between channels are often ignored, although these relationships are crucial for capturing global information. Neglecting channel dependencies may lead to underutilization of feature map information, which in turn weakens the representation of global features. This submodule optimizes channel interactions through a multilayer perceptron (MLP) in four steps: first, dimensional rearrangement adjusts the input features to H × W × C format; second, the first MLP layer compresses the number of channels to 1/4, and ReLU activation filters the effective features; third, the second MLP layer restores the original channel dimensions to retain the key information; finally, the channel attention weight map is generated and multiplied with the original features element by element. By first compressing and then expanding the channel dimension, the network learns globally associated features across channels while automatically suppressing redundant information.
(2) About channel shuffling: Although channel attention enhances the representation of feature maps, it may not sufficiently break the constraints between channels, resulting in insufficient mixing of information, which limits the effectiveness of the feature representation. To solve this problem, we introduce a channel shuffling operation. Specifically, we divide the enhanced feature map into several groups (e.g., four groups, each containing a quarter of all the channels), transpose the group and channel axes to disrupt the original channel order, and finally restore the map to its original shape (a minimal sketch is given after this list). This operation promotes the flow of information between channels and enhances the diversity of features, thereby improving model performance.
(3) About the spatial attention submodule: Relying solely on channel attention and channel shuffling operations may not fully utilize the spatial information, especially when capturing the local and global features of an image; ignoring the spatial dimension will lose many important details. Therefore, we use two 7 × 7 convolutional layers in the spatial attention module to process spatial information. First, the input feature maps are compressed to a quarter of the original number of channels by the first 7 × 7 convolutional layer and nonlinearly transformed by batch normalization and ReLU activation function; then, the second 7 × 7 convolutional layer restores the number of channels to the original dimensions and batch normalization is performed again. This design effectively captures the spatial dependencies. Finally, the spatial attention map is generated by the Sigmoid function and multiplied element-by-element with the feature map after channel shuffling to obtain the final output feature map.
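A minimal sketch of the channel shuffling operation described in (2) is given below; the choice of four groups follows the example in the text.

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int = 4) -> torch.Tensor:
    """Channel shuffling: split the channels into groups, transpose the group and
    channel axes to interleave them, then flatten back to the original shape."""
    b, c, h, w = x.shape
    assert c % groups == 0, "channel count must be divisible by the number of groups"
    x = x.view(b, groups, c // groups, h, w)   # group the channels
    x = x.transpose(1, 2).contiguous()         # interleave channels across groups
    return x.view(b, c, h, w)

# example: shuffle a feature map with 64 channels using 4 groups
feat = torch.randn(1, 64, 32, 32)
shuffled = channel_shuffle(feat, groups=4)
```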
Figure 10.
Structure of the GCSA module.
The specific workflow of the module is as follows:
First, the input feature map consists of multiple channels, each with spatial dimensions H × W. In the channel attention module, the input feature map is transposed from its original C × H × W shape to W × H × C. Next, the first fully connected layer of the MLP reduces the number of channels to a small fraction of the original (e.g., 1/4) and introduces nonlinearity through the ReLU activation function. Subsequently, the second fully connected layer restores the number of channels to the original size. After a reverse transposition, the feature map is restored to its original C × H × W shape, and the channel attention map is generated by a Sigmoid activation function. Finally, the input feature map is multiplied element-by-element with the generated channel attention map to obtain the enhanced feature map. The whole process can be represented by Equation (5).
$F' = \sigma\left(\mathrm{MLP}(F)\right) \otimes F \quad (5)$

$F'$: the enhanced feature map, $\sigma$: the Sigmoid function, $\otimes$: the element-by-element multiplication, $F$: the original input feature map.
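A minimal PyTorch sketch of this channel attention branch, following the description above, could look as follows; the class name and the use of `nn.Linear` for the two MLP layers are assumptions.

```python
import torch
import torch.nn as nn

class GCSAChannelAttention(nn.Module):
    """Channel attention branch of GCSA: put channels last, squeeze to C/4 with an MLP,
    restore C, then gate the original input with a Sigmoid attention map (Eq. (5))."""
    def __init__(self, channels: int, ratio: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // ratio),
            nn.ReLU(inplace=True),
            nn.Linear(channels // ratio, channels))

    def forward(self, x):                       # x: (B, C, H, W)
        y = x.permute(0, 2, 3, 1)               # channels last so the MLP acts per position
        y = self.mlp(y)
        y = y.permute(0, 3, 1, 2)               # back to (B, C, H, W)
        return x * torch.sigmoid(y)             # element-wise weighting of the input
```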
In the spatial attention module, the input feature map first passes through a 7 × 7 convolutional layer, which reduces the number of channels to one quarter of the original, achieving feature dimensionality reduction. The feature map is then normalized by a batch normalization (BN) layer to reduce internal covariate shift and help the model train more stably, and a ReLU activation function applies a nonlinear transformation to enhance the model's expressiveness. The feature map then passes through a second 7 × 7 convolutional layer, which restores the number of channels to the original size, followed by another batch normalization layer. Finally, a spatial attention map is generated by the Sigmoid activation function to represent the importance of each spatial location in the feature map. The feature map after channel shuffling is multiplied element-by-element with the spatial attention map to obtain the final output feature map containing spatial information. The whole process can be described by Equation (7).
$F_{out} = \sigma\!\left(\mathrm{BN}\!\left(f_{2}^{7\times 7}\!\left(\mathrm{ReLU}\!\left(\mathrm{BN}\!\left(f_{1}^{7\times 7}(F_{s})\right)\right)\right)\right)\right) \otimes F_{s} \quad (7)$

$F_{out}$: the feature map after spatial attention processing, $F_{s}$: the feature map after channel shuffling, $f^{7\times 7}$: a 7 × 7 convolution. Because the GCSA module enhances pixel-level attention and the Coordinate Attention integrated after it reduces the loss of position information, the feature representation of the network is improved.
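A corresponding sketch of the spatial attention branch, under the same assumptions, is shown below.

```python
import torch
import torch.nn as nn

class GCSASpatialAttention(nn.Module):
    """Spatial attention branch of GCSA: a 7x7 conv squeezes the channels to C/4,
    a second 7x7 conv restores them, and a Sigmoid map re-weights the shuffled features."""
    def __init__(self, channels: int, ratio: int = 4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels // ratio, kernel_size=7, padding=3, bias=False),
            nn.BatchNorm2d(channels // ratio),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // ratio, channels, kernel_size=7, padding=3, bias=False),
            nn.BatchNorm2d(channels))

    def forward(self, x_shuffled):              # feature map after channel shuffling
        attn = torch.sigmoid(self.body(x_shuffled))
        return x_shuffled * attn                # element-wise weighting (Eq. (7))
```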
3.2.2. Median-Enhanced Channel and Spatial Attention Module (MECS Module)
Motivations: Frequency-domain features of feasible domain images can describe details such as edges and lines, but some of the frequency-domain information may be mixed with image noise, so it is particularly important to distinguish useful from useless frequency-domain information. Therefore, this paper proposes the MECS module shown in Figure 11, which includes channel attention and spatial attention. This module can effectively capture and fuse features at different scales. We design this module mainly for the following reasons:
(1) Channel feature reinforcement module: In traditional methods, the channel attention mechanism mostly relies on mean-value calculation and peak (maximum) extraction to obtain global feature information, but these strategies are prone to bias on noisy data, especially when there are obvious interference signals in the feature map, which affects the accuracy of the feature analysis. To address this problem, we integrate median filtering into the channel attention computation, combining it with the original mean and peak extraction to form a channel weight computation scheme with stronger anti-interference ability. Median filtering is a mature technique in image denoising; by selecting the median value, it can effectively eliminate the interference of abnormal data while maintaining the integrity of key features, significantly improving the robustness of the attention mechanism.
(2) Multi-scale spatial perception module: A conventional single-size convolutional kernel often struggles to comprehensively capture spatial information of different granularities when processing image features, which limits the model's feature recognition ability in complex scenes. For this reason, we develop a hierarchical multi-kernel convolutional scheme: first, a 5 × 5 base convolutional layer performs initial feature extraction; the base features are then fed in parallel into multiple depthwise separable convolutional layers of different kernel sizes to capture small-scale details and large-scale contour features; finally, the features output from the branches are summed, and a 1 × 1 convolution generates the spatial weight distribution map. By establishing a pixel-level correspondence between the original feature map and the dynamically generated weight map, enhanced features that integrate multi-dimensional spatial information are obtained. This hierarchical processing can simultaneously capture feature changes in different directions and at different scales, which significantly improves the model's ability to recognize targets of diverse shapes.
Figure 11.
Structure of the MECS module.
The specific workflow of the module is as follows:
For channel attention, three pooling operations are first performed on the input feature map: global average pooling (AvgPool), global maximum pooling (MaxPool), and global median pooling (MedianPool), yielding three pooling results, each of size C × 1 × 1, where C is the number of channels. Each pooling result is then fed separately into a shared multilayer perceptron (MLP) that contains two 1 × 1 convolutional layers and a ReLU activation function. The first convolutional layer shrinks the number of channels from C to C/r, where r is the compression ratio; the second convolutional layer then restores the number of channels to the original C. Three attention maps are obtained by mapping the output values to the range [0, 1] via the Sigmoid activation function. These three attention maps are summed element-wise to obtain the final channel attention map. Finally, this channel attention map is multiplied element-by-element with the original feature map to obtain the weighted feature map. The whole process can be described by Equations (7) and (8).
$M_{c}(F) = \sigma\!\left(\mathrm{MLP}\!\left(\mathrm{AvgPool}(F)\right)\right) + \sigma\!\left(\mathrm{MLP}\!\left(\mathrm{MaxPool}(F)\right)\right) + \sigma\!\left(\mathrm{MLP}\!\left(\mathrm{MedianPool}(F)\right)\right) \quad (7)$

$F' = M_{c}(F) \otimes F \quad (8)$

$\sigma$: the Sigmoid activation function, $\otimes$: the element-by-element multiplication operation.
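A minimal sketch of this median-enhanced channel attention is given below; the compression ratio r = 16 and the use of `torch.median` over the flattened spatial dimensions to realize global median pooling are assumptions.

```python
import torch
import torch.nn as nn

class MECSChannelAttention(nn.Module):
    """Median-enhanced channel attention: average, max and median global pooling feed a
    shared 1x1-conv MLP; the three Sigmoid maps are summed and used to weight the input."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False))

    def forward(self, x):                                    # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg = x.mean(dim=(2, 3), keepdim=True)               # global average pooling
        mx = x.amax(dim=(2, 3), keepdim=True)                # global max pooling
        med = x.flatten(2).median(dim=2).values.view(b, c, 1, 1)  # global median pooling
        attn = sum(torch.sigmoid(self.mlp(p)) for p in (avg, mx, med))
        return x * attn                                      # channel re-weighting (Eqs. (7)-(8))
```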
For spatial attention, the input feature map is first passed through a 5 × 5 depthwise convolutional layer that extracts low-level features; the output of this layer has the same size as the input. The output of the initial convolutional layer is then passed through multiple depthwise convolutional layers of different kernel sizes (e.g., 1 × 1, 7 × 7) to further extract multi-scale features, and the outputs of these convolutional layers are summed element-wise to form a fused feature map. Finally, the fused feature map is passed through a 1 × 1 convolutional layer to generate the final spatial attention map. As expressed in Equations (9) and (10), the generated attention map is multiplied element-wise with the channel-weighted feature map to obtain the final output feature map.

$M_{s}(F') = \mathrm{Conv}^{1\times 1}\!\left(\sum_{i=1}^{n} \mathrm{Conv}_{i}\!\left(\mathrm{Conv}^{5\times 5}(F')\right)\right) \quad (9)$

$F'' = M_{s}(F') \otimes F' \quad (10)$

where $n$ denotes the number of depthwise convolution branches and $\mathrm{Conv}$ denotes the convolution operation.
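A corresponding sketch of the multi-scale spatial attention branch follows; the kernel sizes (1, 7) mirror the examples given above, and since the text does not specify a Sigmoid on the 1 × 1 projection output, none is applied here, so both choices are assumptions about the exact design.

```python
import torch
import torch.nn as nn

class MECSSpatialAttention(nn.Module):
    """Multi-scale spatial attention: a 5x5 depthwise conv extracts base features, parallel
    depthwise convs of different kernel sizes add multi-scale context, and a 1x1 conv
    produces the spatial attention map that re-weights the channel-attended features."""
    def __init__(self, channels: int, kernel_sizes=(1, 7)):
        super().__init__()
        self.base = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes])
        self.project = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                                     # x: channel-attended feature map
        base = self.base(x)
        fused = sum(branch(base) for branch in self.branches)  # element-wise sum of the branches
        attn = self.project(fused)                            # spatial attention map (Eq. (9))
        return x * attn                                       # final weighting (Eq. (10))
```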
Pixel-level feature recognition is enhanced because the MECS module extracts global statistical information and uses multi-scale depthwise convolutions to capture both representative features and subtle hidden features at different scales.