3.1. General Flowchart of the Proposed Methodology
The overall process is illustrated in Figure 9a: first, data are collected in maritime environments using an onboard camera, and a dataset is constructed through data processing (frame extraction and annotation). Then, the USVS-Net model is trained offline and deployed onto the edge computing platform of the unmanned surface vehicle (USV). During actual operation, the camera continuously captures real-time images, which are processed by the deployed model to generate pixel-level semantic segmentation results, providing environmental perception support for autonomous navigation of the USV.
Although recent semantic segmentation methods have achieved considerable progress in multi-scale contextual modeling and attention mechanisms, a large proportion of studies still tend to emphasize the "dominant/salient" features of an image, while neglecting boundary details, less salient textures, and information from non-dominant channels. This imbalance prevents the network from fully exploiting the diverse discriminative cues contained in the source image. In maritime scenes, for instance, single-frame vision-based models, such as the lightweight eWaSR variant in the WaSR series [6], primarily focus on strongly salient water–obstacle boundaries and large objects. Under complex sea conditions featuring strong reflections, wave disturbances, and small-scale obstacles, fine-grained details and weak texture cues are often ignored, leading to boundary misclassification and missed detections of small targets. To overcome the limitation of focusing solely on salient regions, one line of research explicitly incorporates boundary or contour priors into the network through boundary-aware branches or loss functions. Empirical results show that these approaches can significantly enhance the discriminability of pixels in transition areas; however, they still exhibit instability under severe noise or large inter-class scale variations [17]. Another line of work introduces temporal context to smooth single-frame noise and surface perturbations, but when inter-frame appearance changes drastically or small targets appear only momentarily, non-salient features may still be suppressed [7]. Recently, some studies have explored frequency-domain perspectives, arguing that neglected high-frequency details (e.g., edges and fine textures) are crucial for accurate segmentation. Consequently, attention weighting in Fourier or wavelet domains has been proposed to compensate for the bias of emphasizing salient responses only in the spatial domain. Nevertheless, how to achieve effective complementarity between spatial and frequency representations without amplifying noise remains an open problem requiring more careful design and validation [18].
Based on these observations, we argue that relying solely on dominant features tends to amplify salient regions and suppress complementary information (such as fine-grained boundaries, less salient textures, and high-frequency cues), which leads to performance bottlenecks in small-object segmentation, complex boundaries, and heavily disturbed maritime environments. Therefore, this study emphasizes cross-scale, cross-channel, and cross-domain (spatial/frequency) collaborative modeling at the network level, aiming to systematically enhance the capture of “non-salient but crucial” information without introducing excessive computational cost.
Therefore, this paper proposes USVS-Net, a multi-scale interactive segmentation network based on the DeepLabV3+ framework, for feasible domain segmentation of unmanned surface vehicles. The structure of the network is shown in Figure 9b. Although using Xception as the backbone of DeepLabV3+ improves accuracy, it suffers from high computational complexity, large memory occupation, and long training time, which limit its application in resource-constrained scenarios. Therefore, in USVS-Net, we use the MobileNetV2 network as the backbone feature extraction network to overcome these drawbacks. To address the shortcoming of MobileNetV2 in feature extraction, namely that its depthwise convolutions and downsampling operations tend to cause the loss of image details and thus reduce segmentation accuracy, this paper designs the Global Channel-Spatial Attention (GCSA) module. This module strengthens the model's ability to understand global semantics by establishing long-range associations between features, which effectively alleviates the attenuation of detail information while improving feature representativeness. However, the channel shuffling operation in GCSA loses position information, which is important for generating spatial attention maps. Therefore, we integrate a Coordinate Attention (CA) module after GCSA to reduce the loss of position information and to further enhance the feature representation capability of the network through this module's multi-directional spatial feature perception.
In the feature fusion stage, conventional networks adopt ASPP, which can capture multi-scale contextual information, but ASPP has several drawbacks for this task. The frequency-domain features of the feasible domain image of the unmanned surface vehicle describe details such as edges and lines, which are crucial for pixel-by-pixel classification. However, because of its fixed dilation rates, ASPP cannot effectively differentiate useful frequency-domain information (e.g., edges) from useless information (e.g., noise), which may lead to noise amplification; its multi-branch structure increases computational complexity and memory occupation, reducing the efficiency of frequency-domain information extraction; and its atrous (dilated) convolutions may introduce the gridding effect, which disrupts the continuity of frequency-domain features and affects detail modeling. As a result, ASPP struggles to distinguish useful from useless frequency-domain information, limiting its performance in pixel-by-pixel classification. To address these problems, we propose the Median-Enhanced Channel and Spatial attention block (MECS), which has a strong frequency-domain feature capture capability, and place it after the atrous convolutions of ASPP to enhance the frequency-domain feature extraction for the input image, forming the MECS-ASPP structure.
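To make the placement concrete, the following is a minimal PyTorch-style sketch of how a MECS block could be attached behind each atrous branch of ASPP. It is an illustration rather than the authors' implementation: the class name `MECSASPP`, the dilation rates (6, 12, 18), the omission of the image-pooling branch, and the injected `mecs_factory` (which should return the MECS module of Section 3.2.2) are all assumptions.

```python
import torch
import torch.nn as nn

class MECSASPP(nn.Module):
    """Sketch of MECS-ASPP: a MECS attention block (Section 3.2.2) is appended to each
    atrous branch of ASPP. `mecs_factory(channels)` must return that attention module;
    the dilation rates and the omitted image-pooling branch are simplifications."""
    def __init__(self, in_ch, out_ch, mecs_factory, rates=(6, 12, 18)):
        super().__init__()
        def conv_bn_relu(k, dilation=1):
            pad = 0 if k == 1 else dilation
            return [nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=dilation, bias=False),
                    nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
        branches = [nn.Sequential(*conv_bn_relu(1))]                       # 1x1 branch
        for r in rates:                                                    # atrous branches + MECS
            branches.append(nn.Sequential(*conv_bn_relu(3, r), mecs_factory(out_ch)))
        self.branches = nn.ModuleList(branches)
        self.project = nn.Sequential(
            nn.Conv2d(out_ch * len(branches), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```

For a quick shape check, `MECSASPP(320, 256, mecs_factory=lambda c: nn.Identity())` instantiates the structure with a no-op attention block in place of MECS.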
In USVS-Net, CBL blocks (convolution, batch normalization, and activation function) and Upsample operations are used. Because these structures contain channel reduction and expansion operations, they can disrupt channel information and thus affect the segmentation accuracy of the network. Therefore, we adopt the cSE module and Triplet Attention to integrate channel information and realize cross-dimension interaction, which in turn strengthens the channel feature representation.
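For reference, a minimal sketch of the cSE (channel squeeze-and-excitation) gate in its commonly used form is given below; the reduction ratio of 16 is an assumption, and Triplet Attention follows its original published design and is not reproduced here.

```python
import torch
import torch.nn as nn

class ChannelSE(nn.Module):
    """Standard channel squeeze-and-excitation (cSE) gate:
    global average pooling -> bottleneck MLP -> Sigmoid -> channel re-weighting."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w            # channels are re-weighted, not reordered
```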
The workflow of the network is as follows:
Firstly, the image data are input to the MobileNetV2 backbone network for multilayer feature extraction; according to the difference in feature abstraction level, shallow detail features are defined as primary semantic features and deep abstract features as secondary semantic features, and cross-layer information complementarity is realized through a feature fusion mechanism. This hierarchical feature delineation preserves the original details of the image while capturing high-level semantic associations.
Secondly, the secondary semantic features are processed by GCSA, CA, MECS-ASPP, CBL, and Upsample to become the "deep features". The "deep features" are processed by cSE and fused with the primary semantic features; the fused information is processed by CBL and an activation function and then multiplied element-by-element with the primary semantic features to form the "deep processing features". To fully utilize the semantic information, the primary semantic features, the "deep processing features", and the "deep features" are fused again; the fused features are processed by CBL and an activation function and then fused with the primary semantic features once more. The resulting features are processed by Triplet Attention and upsampled again.
Finally, the segmentation result is obtained with the same resolution as the input image.
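The fusion flow can be summarized by the following schematic sketch. It is a reading of the workflow above rather than the authors' code: all sub-modules are injected as placeholders, the unspecified fusion operations are assumed to be channel concatenation followed by a projection (`fuse1`, `fuse2`), the gating activation before the element-wise product is assumed to be a Sigmoid, and the last fusion with the primary features is assumed to be additive.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class USVSFusionDecoder(nn.Module):
    """Schematic sketch of the fusion workflow of USVS-Net (Figure 9b), not the exact code.
    gcsa, ca, mecs_aspp, cbl1-3, cse, triplet, fuse1 and fuse2 are any nn.Modules with
    matching channel counts; fuse1/fuse2 stand for the unspecified fusion operation."""
    def __init__(self, gcsa, ca, mecs_aspp, cbl1, cbl2, cbl3, cse, triplet, fuse1, fuse2):
        super().__init__()
        self.gcsa, self.ca, self.aspp = gcsa, ca, mecs_aspp
        self.cbl1, self.cbl2, self.cbl3 = cbl1, cbl2, cbl3
        self.cse, self.triplet = cse, triplet
        self.fuse1, self.fuse2 = fuse1, fuse2       # e.g. concat followed by a 1x1 conv block

    def forward(self, primary, secondary):
        # deep branch: attention + multi-scale context, upsampled to the primary resolution
        deep = self.cbl1(self.aspp(self.ca(self.gcsa(secondary))))
        deep = F.interpolate(deep, size=primary.shape[-2:], mode="bilinear", align_corners=False)
        # first fusion: channel-recalibrated deep features merged with the primary features
        fused = self.fuse1(torch.cat([self.cse(deep), primary], dim=1))
        deep_proc = torch.sigmoid(self.cbl2(fused)) * primary   # "deep processing features" (Sigmoid gate assumed)
        # second fusion: primary, deep-processing and deep features combined again
        fused2 = self.fuse2(torch.cat([primary, deep_proc, deep], dim=1))
        out = torch.relu(self.cbl3(fused2)) + primary           # fused with the primary features again (assumed additive)
        # cross-dimension refinement; the final upsampling to input resolution follows as described
        return self.triplet(out)
```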
3.2. Attention Mechanism Structure
3.2.1. Global Channel-Spatial Attention Module (GCSA Module)
Motivations: Pixel-level classification helps to distinguish similar objects and recognize small feature differences. However, existing methods often achieve this by increasing computational complexity. In this paper, the Global Channel-Spatial Attention (GCSA) module is proposed, as shown in Figure 10. This module contains channel attention, channel shuffling, and spatial attention. Our motivation for designing this module, and the mechanism by which it improves pixel classification accuracy, is as follows:
(1) For the channel attention submodule: In traditional convolutional neural networks, the dependencies between channels are often ignored, although these relationships are crucial for capturing global information. Neglecting channel dependencies may lead to underutilization of feature map information, which in turn weakens the representation of global features. This submodule optimizes channel interactions through a multilayer perceptron (MLP) in four steps: first, dimensional rearrangement adjusts the input features to H × W × C format; second, the first MLP layer compresses the number of channels to 1/4, and ReLU activation filters the effective features; third, the second MLP layer restores the original channel dimensions to retain the key information; finally, the channel attention weight map is generated and multiplied with the original features element by element. By first compressing and then expanding the channel dimension, the network learns globally associated features across channels while automatically suppressing redundant information.
(2) About channel shuffling: Although channel attention enhances the representation of feature maps, it may not sufficiently break the constraints between channels, resulting in insufficient mixing of information, which limits the effectiveness of the feature representation. To solve this problem, we introduce a channel shuffling operation. Specifically, we divide the enhanced feature map into several groups (e.g., four groups, each containing a quarter of all the channels), transpose the group and channel axes to disrupt the original channel order, and finally restore the map to its original shape (a minimal sketch is given after this list). This operation promotes the flow of information between channels and enhances the diversity of features, thereby improving model performance.
(3) About the spatial attention submodule: Relying solely on channel attention and channel shuffling operations may not fully utilize the spatial information, especially when capturing the local and global features of an image; ignoring the spatial dimension will lose many important details. Therefore, we use two 7 × 7 convolutional layers in the spatial attention module to process spatial information. First, the input feature maps are compressed to a quarter of the original number of channels by the first 7 × 7 convolutional layer and nonlinearly transformed by batch normalization and ReLU activation function; then, the second 7 × 7 convolutional layer restores the number of channels to the original dimensions and batch normalization is performed again. This design effectively captures the spatial dependencies. Finally, the spatial attention map is generated by the Sigmoid function and multiplied element-by-element with the feature map after channel shuffling to obtain the final output feature map.
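A minimal sketch of the channel shuffling operation described in (2) is given below; the choice of four groups follows the example in the text.

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int = 4) -> torch.Tensor:
    """Channel shuffling: split the channels into groups, transpose the group and
    channel axes to interleave them, then flatten back to the original shape."""
    b, c, h, w = x.shape
    assert c % groups == 0, "channel count must be divisible by the number of groups"
    x = x.view(b, groups, c // groups, h, w)   # group the channels
    x = x.transpose(1, 2).contiguous()         # interleave channels across groups
    return x.view(b, c, h, w)

# example: shuffle a feature map with 64 channels using 4 groups
feat = torch.randn(1, 64, 32, 32)
shuffled = channel_shuffle(feat, groups=4)
```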
Figure 10.
Structure of the GCSA module.
The specific workflow of the module is as follows:
First, the input feature map consists of multiple channels, each with spatial dimensions H × W. In the channel attention module, the input feature map is transposed from its original C × H × W shape to W × H × C. Next, the first fully connected layer of the MLP reduces the number of channels to a small fraction of the original (e.g., 1/4) and introduces nonlinearity through the ReLU activation function. Subsequently, the second fully connected layer restores the number of channels to the original size. After a reverse transposition, the feature map is restored to its original C × H × W shape, and the channel attention map is generated by a Sigmoid activation function. Finally, the input feature map is multiplied element-by-element with the generated channel attention map to obtain the enhanced feature map. The whole process can be represented by Equation (5).
$F' = \sigma\left(\mathrm{MLP}(F)\right) \otimes F \quad (5)$

$F'$: the enhanced feature map, $\sigma$: the Sigmoid function, $\otimes$: the element-by-element multiplication, $F$: the original input feature map.
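A minimal PyTorch sketch of this channel attention branch, following the description above, could look as follows; the class name and the use of `nn.Linear` for the two MLP layers are assumptions.

```python
import torch
import torch.nn as nn

class GCSAChannelAttention(nn.Module):
    """Channel attention branch of GCSA: put channels last, squeeze to C/4 with an MLP,
    restore C, then gate the original input with a Sigmoid attention map (Eq. (5))."""
    def __init__(self, channels: int, ratio: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // ratio),
            nn.ReLU(inplace=True),
            nn.Linear(channels // ratio, channels))

    def forward(self, x):                       # x: (B, C, H, W)
        y = x.permute(0, 2, 3, 1)               # channels last so the MLP acts per position
        y = self.mlp(y)
        y = y.permute(0, 3, 1, 2)               # back to (B, C, H, W)
        return x * torch.sigmoid(y)             # element-wise weighting of the input
```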
In the spatial attention module, the input feature map first passes through a 7 × 7 convolutional layer, which reduces the number of channels to one quarter of the original, achieving feature dimensionality reduction. The feature map is then normalized by a batch normalization (BN) layer to reduce internal covariate shift and help the model train more stably, and a ReLU activation function applies a nonlinear transformation to enhance the model's expressiveness. The feature map then passes through a second 7 × 7 convolutional layer, which restores the number of channels to the original size, followed by another batch normalization layer. Finally, a spatial attention map is generated by the Sigmoid activation function to represent the importance of each spatial location in the feature map. The feature map after channel shuffling is multiplied element-by-element with the spatial attention map to obtain the final output feature map containing spatial information. The whole process can be described by Equation (7).
$F_{out} = \sigma\!\left(\mathrm{BN}\!\left(f_{2}^{7\times 7}\!\left(\mathrm{ReLU}\!\left(\mathrm{BN}\!\left(f_{1}^{7\times 7}(F_{s})\right)\right)\right)\right)\right) \otimes F_{s} \quad (7)$

$F_{out}$: the feature map after spatial attention processing, $F_{s}$: the feature map after channel shuffling, $f^{7\times 7}$: a 7 × 7 convolution. Because the GCSA module enhances pixel-level attention and the Coordinate Attention integrated after it reduces the loss of position information, the feature representation of the network is improved.
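A corresponding sketch of the spatial attention branch, under the same assumptions, is shown below.

```python
import torch
import torch.nn as nn

class GCSASpatialAttention(nn.Module):
    """Spatial attention branch of GCSA: a 7x7 conv squeezes the channels to C/4,
    a second 7x7 conv restores them, and a Sigmoid map re-weights the shuffled features."""
    def __init__(self, channels: int, ratio: int = 4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels // ratio, kernel_size=7, padding=3, bias=False),
            nn.BatchNorm2d(channels // ratio),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // ratio, channels, kernel_size=7, padding=3, bias=False),
            nn.BatchNorm2d(channels))

    def forward(self, x_shuffled):              # feature map after channel shuffling
        attn = torch.sigmoid(self.body(x_shuffled))
        return x_shuffled * attn                # element-wise weighting (Eq. (7))
```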
3.2.2. Median-Enhanced Channel and Spatial Attention Module (MECS Module)
Motivations: Frequency-domain features of feasible domain images can describe details such as edges and lines, but some of the frequency-domain information may be mixed with image noise, so it is particularly important to distinguish useful from useless frequency-domain information. Therefore, this paper proposes the MECS module shown in Figure 11, which includes channel attention and spatial attention. This module can effectively capture and fuse features at different scales. We design this module mainly for the following reasons:
(1) Channel feature reinforcement module: In traditional methods, the channel attention mechanism mostly relies on mean-value calculation and peak (maximum) extraction to obtain global feature information, but these strategies are prone to bias on noisy data, especially when there are obvious interference signals in the feature map, which affects the accuracy of the feature analysis. To address this problem, we integrate median filtering into the channel attention computation, combining it with the original mean and peak extraction to form a channel weight computation scheme with stronger anti-interference ability. Median filtering is a mature technique in image denoising; by selecting the median value, it can effectively eliminate the interference of abnormal data while maintaining the integrity of key features, significantly improving the robustness of the attention mechanism.
(2) Multi-scale spatial perception module: A conventional single-size convolutional kernel often struggles to comprehensively capture spatial information of different granularities when processing image features, which limits the model's feature recognition ability in complex scenes. For this reason, we develop a hierarchical multi-kernel convolutional scheme: first, a 5 × 5 base convolutional layer performs initial feature extraction; the base features are then fed in parallel into multiple depthwise separable convolutional layers of different kernel sizes to capture small-scale details and large-scale contour features; finally, the features output from the branches are summed, and a 1 × 1 convolution generates the spatial weight distribution map. By establishing a pixel-level correspondence between the original feature map and the dynamically generated weight map, enhanced features that integrate multi-dimensional spatial information are obtained. This hierarchical processing can simultaneously capture feature changes in different directions and at different scales, which significantly improves the model's ability to recognize targets of diverse shapes.
Figure 11.
Structure of the MECS module.
The specific workflow of the module is as follows:
For channel attention, three pooling operations are first performed on the input feature map: global average pooling (AvgPool), global maximum pooling (MaxPool), and global median pooling (MedianPool), yielding three pooling results, each of size C × 1 × 1, where C is the number of channels. Each pooling result is then fed separately into a shared multilayer perceptron (MLP) that contains two 1 × 1 convolutional layers and a ReLU activation function. The first convolutional layer shrinks the number of channels from C to C/r, where r is the compression ratio; the second convolutional layer then restores the number of channels to the original C. Three attention maps are obtained by mapping the output values to the range [0, 1] via the Sigmoid activation function. These three attention maps are summed element-wise to obtain the final channel attention map. Finally, this channel attention map is multiplied element-by-element with the original feature map to obtain the weighted feature map. The whole process can be described by Equations (7) and (8).
$M_{c}(F) = \sigma\!\left(\mathrm{MLP}\!\left(\mathrm{AvgPool}(F)\right)\right) + \sigma\!\left(\mathrm{MLP}\!\left(\mathrm{MaxPool}(F)\right)\right) + \sigma\!\left(\mathrm{MLP}\!\left(\mathrm{MedianPool}(F)\right)\right) \quad (7)$

$F' = M_{c}(F) \otimes F \quad (8)$

$\sigma$: the Sigmoid activation function, $\otimes$: the element-by-element multiplication operation.
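A minimal sketch of this median-enhanced channel attention is given below; the compression ratio r = 16 and the use of `torch.median` over the flattened spatial dimensions to realize global median pooling are assumptions.

```python
import torch
import torch.nn as nn

class MECSChannelAttention(nn.Module):
    """Median-enhanced channel attention: average, max and median global pooling feed a
    shared 1x1-conv MLP; the three Sigmoid maps are summed and used to weight the input."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False))

    def forward(self, x):                                    # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg = x.mean(dim=(2, 3), keepdim=True)               # global average pooling
        mx = x.amax(dim=(2, 3), keepdim=True)                # global max pooling
        med = x.flatten(2).median(dim=2).values.view(b, c, 1, 1)  # global median pooling
        attn = sum(torch.sigmoid(self.mlp(p)) for p in (avg, mx, med))
        return x * attn                                      # channel re-weighting (Eqs. (7)-(8))
```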
For spatial attention, the input feature map is first passed through a 5 × 5 depthwise convolutional layer that extracts low-level features; the output of this layer has the same size as the input. The output of the initial convolutional layer is then passed through multiple depthwise convolutional layers of different kernel sizes (e.g., 1 × 1, 7 × 7) to further extract multi-scale features, and the outputs of these convolutional layers are summed element-wise to form a fused feature map. Finally, the fused feature map is passed through a 1 × 1 convolutional layer to generate the final spatial attention map. As expressed in Equations (9) and (10), the generated attention map is multiplied element-wise with the channel-weighted feature map to obtain the final output feature map.

$M_{s}(F') = \mathrm{Conv}^{1\times 1}\!\left(\sum_{i=1}^{n} \mathrm{Conv}_{i}\!\left(\mathrm{Conv}^{5\times 5}(F')\right)\right) \quad (9)$

$F'' = M_{s}(F') \otimes F' \quad (10)$

where $n$ denotes the number of depthwise convolution branches and $\mathrm{Conv}$ denotes the convolution operation.
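A corresponding sketch of the multi-scale spatial attention branch follows; the kernel sizes (1, 7) mirror the examples given above, and since the text does not specify a Sigmoid on the 1 × 1 projection output, none is applied here, so both choices are assumptions about the exact design.

```python
import torch
import torch.nn as nn

class MECSSpatialAttention(nn.Module):
    """Multi-scale spatial attention: a 5x5 depthwise conv extracts base features, parallel
    depthwise convs of different kernel sizes add multi-scale context, and a 1x1 conv
    produces the spatial attention map that re-weights the channel-attended features."""
    def __init__(self, channels: int, kernel_sizes=(1, 7)):
        super().__init__()
        self.base = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes])
        self.project = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                                     # x: channel-attended feature map
        base = self.base(x)
        fused = sum(branch(base) for branch in self.branches)  # element-wise sum of the branches
        attn = self.project(fused)                            # spatial attention map (Eq. (9))
        return x * attn                                       # final weighting (Eq. (10))
```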
Pixel-level feature recognition is enhanced because the MECS module extracts global statistical information and uses multi-scale depthwise convolutions to capture both representative features and subtle hidden features at different scales.