A Lightweight Detection Algorithm for Unmanned Surface Vehicles Based on Multi-Scale Feature Fusion

Abstract: Lightweight detection methods are frequently used for unmanned-system sensing; however, in complicated water surface environments they suffer from insufficient feature fusion and decreased accuracy. This paper proposes a lightweight surface target detection algorithm with multi-scale feature fusion augmentation in order to improve the poor detection accuracy of lightweight detection algorithms in the mission environment of unmanned surface vehicles (USVs). Based on the popular one-stage lightweight YOLOv7-Tiny target detection algorithm, a lightweight extraction module is first designed by introducing a multi-scale residual module to reduce the number of parameters and the computational complexity while improving accuracy. The Mish and SiLU activation functions are used to enhance network feature extraction. Second, the path aggregation network employs coordinate convolution to strengthen spatial information perception. Finally, a dynamic head based on the attention mechanism improves the representation ability of the object detection head without additional computational overhead. According to the experimental findings, the proposed model has 22.1% fewer parameters than the original model and 15% fewer GFLOPs, achieves a 6.2% improvement in mAP@0.5 and a 4.3% rise in mAP@0.5:0.95, and satisfies real-time criteria. The proposed lightweight water surface detection approach thus offers a lighter model, a simpler computational architecture, higher accuracy, and broad generalizability, and it performs better in a variety of difficult water surface circumstances.


Introduction
Artificial intelligence has helped the fields of computer vision and image understanding flourish. It has also sparked technological advancements in unmanned systems. Compared with other types of conventional marine equipment, unmanned surface vehicles (USVs) are distinguished by their low maintenance costs, low energy consumption, and long periods of continuous operation [1][2][3]. Additionally, USVs can take the place of people in difficult and hazardous tasks. Research into USV technology is therefore a response to the needs of human ocean exploration.
The ability to perceive the environment and identify targets is one of the core technologies of USVs, and artificial intelligence technology should give the USV a thorough understanding of its surroundings. At present, USV sensing primarily employs a photoelectric pod to collect optical image data and a LiDAR system to collect point cloud data. The point cloud data produced by LiDAR are, however, limited in scope and lacking in detail, and direct processing of the 3D point cloud requires powerful computational hardware [4]. Optical images of the water surface, by contrast, are simple to collect, contain rich color and texture information, and have well-established processing techniques [5]; using optoelectronic equipment to obtain the target's appearance attributes is therefore a crucial part of how USVs perceive their surroundings.
Deep learning-based object detection techniques are now widely used. They can be broadly divided into two categories: two-stage methods and one-stage methods. Two-stage detection requires two steps: candidate regions are generated first, and the candidate boxes are then classified and regressed, as in R-CNN [6], Fast R-CNN [7], and Faster R-CNN [8]. One-stage detection, such as SSD [9] and YOLO [10][11][12][13], uses a convolutional neural network to extract the target's feature information and then performs sampling and classification-regression operations on the corresponding feature maps using anchor boxes with various aspect ratios. Although two-stage methods achieve high accuracy, real-time requirements are challenging to meet. One-stage methods, on the other hand, are far faster and better suited to real-time detection needs. One-stage techniques are also actively being improved; two examples are YOLOv7 [14] and YOLOv8 [15], which combine the benefits of precision and speed.
Although many of the aforementioned detection techniques exist, a USV carries only limited computing power, so lightweight models are preferred. However, a lightweight model has fewer feature extraction operations and fewer parameters, and therefore somewhat lower accuracy. In practice, USVs are subject to problems such as image blurring caused by wind and waves rocking the boat, and reduced visibility due to backlighting, rain, and fog, all of which degrade image quality and detection results. The algorithm's performance is therefore adversely affected in complex settings. Research on lightweight target detection for USVs has mainly introduced lightweight convolutions, which typically increase speed while significantly reducing accuracy and robustness, and many strategies overlook the combination of more useful features under lightweight operation. To address these problems, this research proposes a lightweight water surface target detection algorithm with multi-scale feature fusion augmentation to improve the detection accuracy of lightweight detection algorithms. Because of its excellent performance, the YOLOv7-Tiny lightweight model is chosen as the baseline in this study to explore the incorporation of more powerful multi-scale features of water surface targets on a lighter basis. The paper's main contributions are as follows: (1) A multi-scale feature extraction module is designed to enhance the network's ability to extract target features; meanwhile, the Mish and SiLU activation functions replace the original activation functions to improve the learning ability of the network. (2) Coordinate convolution is used in the path aggregation network to improve the fusion of information from multi-scale feature maps in the up-sampling step, and a dynamic head is used in the prediction process to effectively combine spatial information, multi-scale features, and task awareness. (3) A target detection technique with fewer parameters and reduced computing costs is proposed for USVs; it outperforms leading lightweight algorithms in a variety of complicated scenarios on water and fully satisfies timeliness requirements. In addition, a number of model-improvement comparison experiments are provided as references for the investigation of water surface target detection techniques.
The paper is organized as follows: Section 2 analyzes the approaches employed and reviews current pertinent research. Section 3 provides a thorough explanation of the proposed techniques. Section 4 presents the experimental findings, along with a comparison of the various approaches and a summary. Section 5 concludes the paper.

The YOLOv7-Tiny Detection Framework
YOLOv7 was proposed by the team of Alexey Bochkovskiy, the author of YOLOv4, on 20 August 2022. Its performance on the COCO dataset is excellent, and its model accuracy and detection speed are first in the interval from 5 to 160 FPS. YOLOv7-Tiny, on the other hand, is a lighter version of YOLOv7; its network structure is shown in Figure 1, and the structure of each module is shown in Figure 2.


Backbone Network
The backbone network of YOLOv7-Tiny consists of CBL modules, MCB modules, and MP modules; its structure is shown in Figure 1. The CBL module consists of a convolutional layer, a batch normalization layer, and a LeakyReLU layer. With the convolutional kernel size set to 1, it changes the number of channels in the feature map; with the kernel size set to 3, a stride of 1 is mainly used to extract features, while a stride of 2 is used to downsample. The MCB is an efficient network structure with two main branches, which enable the network to extract more feature information and gain robustness by controlling the shortest and longest gradient paths. One branch passes through a single CBL module; the other branch first goes through a CBL module that changes the number of channels and then through two further CBL modules, each of which outputs one feature. Finally, the four output features are concatenated and fed to a last CBL module. Backbone downsampling starts with two convolutions of stride 2, followed by maximum pooling (MP) modules of stride 2, with each downsampling step halving the feature map size.
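The downsampling schedule described above can be sketched as a simple shape trace. The 640 × 640 input size and the number of MP stages are assumptions for illustration; channel widths are deliberately omitted:

```python
# Sketch of the YOLOv7-Tiny backbone downsampling path (spatial shapes only).
# Assumes a 640x640 input and three MP+MCB stages feeding the head;
# these are illustrative assumptions, not figures from the paper.

def downsample(size: int, stride: int = 2) -> int:
    """Each stride-2 convolution or MP module halves the feature map."""
    return size // stride

def backbone_shapes(input_size: int = 640):
    shapes = []
    size = downsample(downsample(input_size))  # two stride-2 convolutions
    for _ in range(3):                         # three MP + MCB stages
        size = downsample(size)                # MP module of stride 2
        shapes.append(size)
    return shapes  # the three scales fed to the head

print(backbone_shapes())  # [80, 40, 20]
```

The three resulting scales match the three prediction sizes used later in the head.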

Head Network
YOLOv7-Tiny's head network adds SPPCSP, MCB, and CBL modules on top of the path aggregation network (PaNet) to achieve better multi-scale feature fusion. The SPPCSP module has two branches, one of which contains only one CBL module, while the other is more complex: it first passes through a CBL module, then performs max pooling with kernel sizes of 13, 9, and 5, stacks the results, and passes through another CBL module. The result is then channel-fused with the first branch, and the fused output is fed into a last CBL module to obtain the output.
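The pooling-and-stack core of SPPCSP can be sketched as follows. The CBL convolutions are deliberately omitted so only the stride-1 pooling branches and the channel stacking are visible; stride 1 with "same" padding (an assumption consistent with the size-preserving behavior described above) keeps the spatial size unchanged:

```python
import numpy as np

# Sketch of the SPPCSP pooling/stacking structure described above.
# Convolutions (CBL modules) are omitted; only the three stride-1
# max-pooling branches and the channel concatenation are shown.

def maxpool_same(x: np.ndarray, k: int) -> np.ndarray:
    """Stride-1 max pooling with 'same' padding on a (c, h, w) map."""
    pad = k // 2
    c, h, w = x.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)),
                constant_values=-np.inf)
    out = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            out[:, i, j] = xp[:, i:i + k, j:j + k].max(axis=(1, 2))
    return out

def sppcsp_pooling(x: np.ndarray) -> np.ndarray:
    # identity branch plus pooling kernels of 13, 9 and 5, stacked on channels
    pooled = [x] + [maxpool_same(x, k) for k in (13, 9, 5)]
    return np.concatenate(pooled, axis=0)  # channel count grows 4x

x = np.random.randn(4, 8, 8)
y = sppcsp_pooling(x)
assert y.shape == (16, 8, 8)
```

In the full module, a CBL both precedes and follows this stack before fusion with the single-CBL branch.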

Prediction Network
The IDetect head serves as the YOLOv7-Tiny network's output. After the MCB modules extract features at three different scales, CBL modules gather the features and adjust the number of channels. To predict targets of various sizes, feature maps with channel counts of 64, 128, and 256 are output at three different scales.

The Mish and SiLU Activation Function
Because of its limited number of parameters and calculations, the lightweight model performs fewer feature extraction operations. Without raising deployment costs, the model can learn and perform better when an appropriate activation function is used. To circumvent the difficulty of establishing a consistent link between positive and negative input values, LeakyReLU [23] is replaced with the Mish and SiLU activation functions [24,25], defined as

Mish(x) = x · tanh(ln(1 + e^x)),
SiLU(x) = x · σ(x) = x / (1 + e^(−x)).

The replacement activation functions attain a small minimum below zero, which self-stabilizes and buffers the weights. Because Mish and SiLU are smooth and differentiable everywhere, gradient calculation is easier, which improves the feature extraction network's performance. As seen in Figure 3, both functions have a lower bound but no upper bound, and their gradient approaches 1 for large inputs, which prevents the slow convergence caused by a vanishing gradient during network training. LeakyReLU is not truncated in the negative interval, but compared with it, Mish and SiLU are smoother, add more nonlinear expressiveness, and enhance the model's capacity for learning. In this study, the MCB-SM module uses these two activation functions, as shown in Figure 4; they are also used in the neck, head, and the modules designed in this paper.
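The properties discussed above can be checked against minimal scalar reference implementations of the standard definitions of the two functions:

```python
import math

# Standard scalar definitions of Mish and SiLU, plus LeakyReLU for contrast.

def softplus(x: float) -> float:
    return math.log1p(math.exp(x))

def mish(x: float) -> float:
    # Mish(x) = x * tanh(softplus(x))
    return x * math.tanh(softplus(x))

def silu(x: float) -> float:
    # SiLU(x) = x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def leaky_relu(x: float, slope: float = 0.01) -> float:
    # LeakyReLU's negative branch never flattens out
    return x if x >= 0 else slope * x

# Both Mish and SiLU are smooth, bounded below by a small negative
# minimum, and unbounded above, matching the description in the text.
print(mish(0.0), silu(0.0))  # both are exactly 0 at the origin
```

Sampling the negative axis confirms the small bounded minimum (around −0.31 for Mish), while large positive inputs pass through nearly unchanged.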


Design of Res2Block
Targets on the water surface come in a wide variety of kinds, sizes, and aspect ratios. Understanding the observed object and the surrounding environmental context requires multi-scale feature information. The Multi Concat Block (MCB) is a crucial feature extraction module of YOLOv7-Tiny and is used repeatedly in the backbone and neck to aggregate useful features. However, the convolutions over its multiple stacked features require a large number of parameters and considerable computation, and redundant features can be mixed into the fusion. This section introduces the more compact Res2Block, which aims to enable a more thorough fusion of multi-scale water surface target properties. Most existing methods represent multi-scale features in a layer-wise manner; Res2Block instead represents multi-scale features at a granular level and increases the range of receptive fields of each network layer. Compared with the original structure, it can combine multi-scale feature information to help the network better understand contextual semantics and perceive the boundaries and areas of target objects.
As a result, the Res2Block module in this study is built on the architectural concept of the Res2Net [26] network. The Res2Net paper proposes building the feature extraction structure by replacing one set of convolutions with several smaller sets of convolutions and linking the filter groups in a hierarchical residual-like manner. The resulting neural network module is named the Res2Net module (R2M), since it contains residual-like connections within a single residual block.
Figure 5 shows the difference between the Bottleneck block, commonly used in existing network structures, and the R2M module. After a CBS module, R2M divides the feature map uniformly into subsets denoted by X_i, where i ∈ {1, 2, . . . , s}. Each feature subset X_i has the same spatial size but 1/s of the channels. Each X_i, except X_1, has a corresponding convolution with a 3 × 3 filter denoted by K_i, and we denote the output of K_i(·) by Y_i. The feature subset X_i is added to the output of K_{i−1}(·) and then fed into K_i(·). To reduce parameters while allowing s to grow, the convolution of X_1 is omitted. Thus, Y_i can be written as Equation (4):

Y_i = X_i,                 i = 1;
Y_i = K_i(X_i),            i = 2;        (4)
Y_i = K_i(X_i + Y_{i−1}),  2 < i ≤ s.

Each K_i(·) may receive feature information from all preceding feature splits X_n (n ≤ i), and each time a feature split passes through a 3 × 3 convolution, the output gains a larger receptive field than X_i.
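The split-and-add routing of Equation (4) can be sketched as follows. In the real module each K_i is a 3 × 3 convolution; the elementwise tanh used here is only a stand-in so the routing itself is easy to see:

```python
import numpy as np

# Sketch of the R2M split-and-add scheme of Equation (4).
# K is a placeholder for the per-split 3x3 convolutions of the paper.

def r2m_splits(x: np.ndarray, s: int = 4, K=None) -> np.ndarray:
    """x: feature map of shape (c, h, w) with c divisible by s."""
    if K is None:
        K = np.tanh                       # stand-in for a 3x3 conv
    xs = np.split(x, s, axis=0)           # X_1 ... X_s, each with c/s channels
    ys = [xs[0]]                          # Y_1 = X_1 (no conv, saves parameters)
    ys.append(K(xs[1]))                   # Y_2 = K_2(X_2)
    for i in range(2, s):
        ys.append(K(xs[i] + ys[-1]))      # Y_i = K_i(X_i + Y_{i-1})
    return np.concatenate(ys, axis=0)

x = np.random.randn(8, 4, 4)
y = r2m_splits(x, s=4)
assert y.shape == x.shape
```

Because each split receives the previous split's output, later splits see progressively larger effective receptive fields, which is the multi-scale effect described above.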
The Res2Net module processes features in a multi-scale way through splitting, facilitating the extraction of global and local information. Its output contains various numbers and combinations of receptive-field scales. All splits are interconnected, and the 1 × 1 convolutions together with the cascading connections allow the features to be processed effectively; omitting the convolution of the first split decreases the number of parameters, which can also be viewed as feature reuse. The parameter s serves as a scale-dimension control. Figure 6 illustrates how the Res2Block is further designed in this study to incorporate multi-scale features more effectively.
After a convolution with the SiLU activation function, the number of channels is halved by a split operation. The feature map with the halved number of channels is passed through a Res2Net module and then stacked with the other two halved branches to form a feature map with 1.5 times the number of channels. Finally, the result is output after another convolution with the SiLU activation function. Compared with the MCB module in the original network, this design reduces the computationally intensive stacking of 3 × 3 feature extraction convolutions and introduces richer multi-scale information. In this paper, the MCB modules with 128 and 256 downsampling channels in the backbone and neck are replaced by the designed Res2Block, because these channel counts correspond to larger numbers of parameters. It is demonstrated that the proposed Res2Block reduces the number of parameters and the computational effort while improving the detection accuracy of water surface targets.

Neck Combined with CoordConv
When performing water surface target detection tasks, USVs frequently face challenging spatial conditions such as backlight, background interference, upwelling, rain, and fog. In place of standard convolution, coordinate convolution (CoordConv) [27] is introduced in this paper. For multi-scale targets, combining CoordConv within the path aggregation network (PaNet) can significantly lessen the loss of spatial information during feature fusion. Compared with normal convolution, CoordConv adds coordinate channels that give the convolution access to Cartesian spatial information. It allows the network to learn to choose between full translational invariance and varying degrees of translational dependence, depending on the task, without sacrificing the computational and parametric efficiency of regular convolution. (Translational invariance means that a network's output is the same regardless of how its inputs are translated in image space.) As the transformation between high-level spatial latents and pixels becomes easier to learn, the network can more accurately capture the target's spatial information and lessen the interference caused by position and angle transformations of multi-scale targets. The principle is illustrated in Figure 7.
CoordConv can be implemented as a simple extension of standard convolution by adding two channels filled with coordinate information. The operation of adding the two coordinates i and j is depicted in Figure 7b. Specifically, the i coordinate channel is an h × w matrix with 0 in the first row, 1 in the second row, 2 in the third row, and so on; the j coordinate channel is filled with constants along the columns in the same way. The i and j coordinate values are then scaled linearly into the range [−1, 1], and the two coordinates can finally be combined into a third additional channel, the r coordinate, with Equation (5):

r = sqrt((i − h/2)² + (j − w/2)²).  (5)

While enhancing the perception of spatial information, CoordConv allows a flexible, learned choice of translation invariance. The principle is similar to residual connectivity and enhances the generalization capability of the model to a certain extent. Neglecting bias parameters, a standard convolution with kernel size k and c channels contains c²k² weights, whereas a CoordConv layer contains (c + d)ck² weights: d = 2 when the i and j coordinate channels are added, and d = 3 when the r coordinate channel is added as well.
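The extra coordinate channels described above can be constructed directly. In this sketch the radial channel is computed from the already-normalized coordinates, a closely related form of Equation (5) that differs only in centering and scaling:

```python
import numpy as np

# Building the extra CoordConv channels: an i (row) channel, a j (column)
# channel, both rescaled linearly to [-1, 1], and an optional radial
# channel computed here from the normalized coordinates.

def coord_channels(h: int, w: int, with_r: bool = True) -> np.ndarray:
    i = np.tile(np.arange(h, dtype=np.float32)[:, None], (1, w))  # rows 0,1,2,...
    j = np.tile(np.arange(w, dtype=np.float32)[None, :], (h, 1))  # columns 0,1,2,...
    i = i / (h - 1) * 2.0 - 1.0   # linear rescale to [-1, 1]
    j = j / (w - 1) * 2.0 - 1.0
    chans = [i, j]
    if with_r:
        chans.append(np.sqrt(i ** 2 + j ** 2))  # radial channel (cf. Equation (5))
    return np.stack(chans)  # (2 or 3, h, w); concatenated to the conv input

c = coord_channels(5, 5)
assert c.shape == (3, 5, 5)
```

In a CoordConv layer these channels are concatenated with the input feature map before an otherwise ordinary convolution, which is where the (c + d)ck² weight count comes from.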
YOLOv7-Tiny uses PaNet in the neck for multi-scale feature extraction and fusion. In this paper, CoordConv is introduced into the neck and head, replacing the normal convolutions in the up-sampling part of PaNet and all convolutions in the head, thereby introducing the additional channels of coordinate information. The experimental results show that the improved network effectively combines spatial information with an almost negligible increase in the number of parameters and enhances the fusion of multi-scale target features.

Head Combined with Dynamic Head
Water surface target detection faces the challenges of position changes, angle switches, and scale changes. The YOLOv7-Tiny detection head does not combine multi-scale information, spatial information, and task information well. This paper therefore adopts the dynamic head (DyHead) [28] to enhance the adaptability of the original model to water surface detection tasks. The method unifies the object detection head with attention by coherently combining multiple self-attention mechanisms: scale awareness between feature levels, spatial awareness between spatial locations, and task awareness within output channels. The scale-aware attention module operates only on the level dimension; depending on the scale of each object, it learns the relative importance of the different semantic levels in order to enhance the features of each object at the proper level. The spatial-aware attention module operates on the height-width spatial dimensions and acquires coherent spatial representations for discrimination. The task-aware attention module is deployed on the channels; depending on the convolutional kernel responses of the objects, it directs different feature channels to serve different tasks (such as classification, box regression, and center/keypoint learning). The first two improvements of this paper, multi-scale feature extraction and spatial information augmentation, are further combined effectively here.
The principle of DyHead is shown in Figure 8. The outputs of the L different levels of the feature pyramid are rescaled to a common size and concatenated as F ∈ R^{L×H×W×C}, where L is the number of pyramid levels and H, W, and C are the height, width, and number of channels of the intermediate-level features, respectively. Further defining S = H × W yields the 3D tensor F ∈ R^{L×S×C}. For this tensor, the general form of applying attention is given in Equation (6):

W(F) = π(F) · F,  (6)

where π(·) denotes the attention function, which in practice is encoded by one fully connected layer. However, as the network deepens, learning the attention function directly over such a high-dimensional tensor becomes computationally expensive. DyHead therefore divides the attention function into three parts, as shown in Figure 8, each focusing on only one perspective, as in Equation (7):

W(F) = π_C(π_S(π_L(F) · F) · F) · F,  (7)

where π_L(·), π_S(·), and π_C(·) are the three attention functions applied to the L, S, and C dimensions, respectively. Scale-aware attention is applied first, fusing semantic information at different scales, as shown in Equation (8):

π_L(F) · F = σ( f( (1/(SC)) Σ_{S,C} F ) ) · F,  (8)

where f(·) is a linear function approximated by a 1 × 1 convolutional layer and σ is a hard sigmoid function.
Second, considering the high tensor dimensionality in π_S(·), the spatial-aware attention module is decomposed into two steps: sparse attention learning with deformable convolution, followed by aggregation across levels at the same spatial location, as shown in Equation (9):

π_S(F) · F = (1/L) Σ_{l=1}^{L} Σ_{k=1}^{K} w_{l,k} · F(l; p_k + Δp_k; c) · Δm_k,  (9)

where K is the number of sparsely sampled locations, p_k + Δp_k is a location shifted by the self-learned spatial offset Δp_k to focus on a discriminative region, and Δm_k is the self-learned importance scalar at location p_k; both are learned from the input features at the median level of F. Finally, task-aware attention is deployed. It dynamically switches channels of the function on and off to support different tasks, as shown in Equation (10):

π_C(F) · F = max( α¹(F) · F_c + β¹(F), α²(F) · F_c + β²(F) ),  (10)

where F_c is the feature slice of the c-th channel and [α¹, α², β¹, β²]^T is a hyper function that learns to control the activation thresholds. In this paper, the original detection head is replaced by DyHead, and the number of channels in the prediction output is adjusted to 128. This improvement allows the detection head to capture more detailed information about the target and thus predict it more accurately.
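The scale-aware step of Equation (8) is the simplest of the three and can be sketched in a few lines. Here the 1 × 1 convolution f is replaced by an identity stand-in, and the hard sigmoid uses one common parameterization (clip((x + 1)/2, 0, 1)); both are simplifying assumptions:

```python
import numpy as np

# Simplified numpy sketch of the scale-aware attention of Equation (8):
# pi_L(F) . F = hard_sigmoid(f(mean over S and C of F)) * F, per level.
# f (a 1x1 conv in the paper) is replaced by an identity stand-in here.

def hard_sigmoid(x: np.ndarray) -> np.ndarray:
    # one common parameterization of the hard sigmoid
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def scale_aware_attention(F: np.ndarray) -> np.ndarray:
    """F: tensor of shape (L, S, C) as defined in the text."""
    level_summary = F.mean(axis=(1, 2), keepdims=True)  # average over S and C
    weights = hard_sigmoid(level_summary)               # one weight per level
    return weights * F                                  # reweight each level

F = np.random.randn(3, 16, 8)   # L=3 levels, S=H*W=16, C=8
out = scale_aware_attention(F)
assert out.shape == F.shape
```

Because each per-level weight lies in [0, 1], the module can only attenuate or keep a level's features, learning which semantic levels matter for each object scale.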


The Proposed Model
The improved network is shown in Figure 10, and the altered components are marked with red boxes in Figure 11. In the backbone, the activation function of the two consecutive stride-2 downsampling convolutions at the network input is replaced with Mish, and all MCB modules with at least 128 output channels are replaced with R2B; the numbers of output channels remain unchanged throughout. In PaNet, SiLU replaces the activation functions of the stride-2 downsampling convolutions. CoordConv, with its three additional channels of coordinate information, takes the place of regular convolution in the up-sampling layers and the detection heads. Finally, DyHead is included, with the number of output prediction channels set to 128.

Experiments
To validate the effectiveness and superiority of the proposed model in a challenging water surface detection environment, the platform and the parameters of the experiments are configured as follows.

Experimental Environment and Parameter Setting
The platform of this experiment is listed in Table 1, and the experimental parameters are set as shown in Table 2.

In order to increase the generalization capability of the target detection model, data enhancement is generally performed prior to training the neural network; common methods include scaling, translation, rotation, and color variation. Due to the complex working environment of unmanned boats, this paper combines scaling, translation, flipping, color changing, and mosaic data enhancement techniques to increase the diversity of the data. The mosaic method randomly crops and scales four random images from the dataset and then randomly arranges and stitches them into a single image. If there are small images or blank sections, they are grayed out to ensure the size is the same as the original input size. The results of the data enhancement are shown in Figure 12.
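The stitching step of mosaic augmentation can be sketched as follows. Images are single-channel nested lists for illustration, the gray-fill value of 114 is an assumption borrowed from common YOLO implementations, and the bounding-box remapping that real mosaic augmentation also performs is omitted:

```python
# Minimal sketch of mosaic augmentation: four images are cropped/scaled and
# stitched into the quadrants of one canvas; uncovered pixels are grayed out.
# Images here are HxW lists of ints (single channel) for illustration.

GRAY = 114  # common gray-fill value in YOLO implementations (assumed)

def mosaic(imgs, size):
    """Stitch four images (cropped to size/2 x size/2) into a size x size canvas."""
    half = size // 2
    canvas = [[GRAY] * size for _ in range(size)]
    offsets = [(0, 0), (0, half), (half, 0), (half, half)]  # four quadrants
    for img, (oy, ox) in zip(imgs, offsets):
        for i, row in enumerate(img[:half]):
            for j, px in enumerate(row[:half]):
                canvas[oy + i][ox + j] = px
    return canvas

tiles = [[[k] * 2 for _ in range(2)] for k in range(4)]  # four 2x2 images
m = mosaic(tiles, 4)
print(m[0])  # [0, 0, 1, 1] -> top-left from tile 0, top-right from tile 1
print(m[3])  # [2, 2, 3, 3] -> bottom row from tiles 2 and 3
```

A single mosaic sample thus exposes the detector to four different contexts and scales at once, which is why it suits the multi-scale water surface targets here.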

Introduction to USV and Datasets
Validating the performance of data-driven deep network algorithms is generally performed on large, publicly available datasets. However, at this stage, there are no large publicly available datasets suitable for water surface target detection; a single dataset has limited scenarios, and its training results are not sufficient to illustrate the learning capability of the model. In this paper, we extracted part of the images from SeaShip7000 [29] and the Water Surface Object Detection Dataset (WSODD) [30], together with realistic and reliable data from a USV with a photoelectric pod device. Some of the actual data were collected from the USV platform "North Ocean" in the waters of Tai Lake in Suzhou, Lushun in Dalian, and Tiger Beach in Dalian. Figures 13-15 show the maps of the three test areas and their surroundings, respectively. The three test sites have different environmental characteristics: Suzhou Tai Lake has the least wind and waves but contains interference targets such as fishing nets and small flags on the water surface; the Dalian Lushun sea is more open, with typical targets such as channel buoys, lighthouses, and dykes; and Dalian Tiger Beach is the most open but has the most wind and waves, with more typical targets such as fishing boats. The presence of interference targets such as fishing nets and small flags requires detection algorithms with better accuracy and differentiation capabilities. The wakes of USVs on the water's surface are more violent when the wind and waves are high, and wave interference is more likely to be generated. Higher requirements are therefore placed on the recognition of multiple targets on the water surface and on robustness.

The data cover a variety of realistic and complex scenarios, such as backlighting, fog, wave disturbance, target clustering, and background disturbance, as shown in Figure 16. In order to create the experimental dataset, an 8:2 ratio of the training set to the validation set was established, with 6824 images comprising the training set and 1716 images comprising the validation set. Dividing the sets this way ensured that the number of target labels for each category was proportional to the distribution of the dataset. The images captured by the USV are mainly used to complement the relatively underrepresented categories in the dataset. The label distribution in Figure 17 shows that enough labels of each category are present in both the training and validation sets.

Figure 18 shows the "North Ocean" USV. The "North Ocean" USV platform sensing system used in this paper consists of a marine radar, a laser lidar, an optoelectronic camera, an inertial measurement unit (IMU), a global positioning system (GPS), and an industrial personal computer (IPC), as shown in Figure 19. The sensing computer is equipped with an 8-core i7-7700T CPU and an NVIDIA RTX 2080 GPU with 7981 MB of memory. The apparatus used to acquire visible RGB images is a high-precision photoelectric video reconnaissance instrument outfitted with a color CCD white-light camera. The camera's maximum resolution is 1920 × 1080; it can output video images as network-encoded streams and automatically control its aperture.
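The 8:2 split that keeps each category's label count proportional can be sketched as a per-class shuffle-and-cut. The class names and sizes below are illustrative, not the paper's actual categories:

```python
# Sketch of a stratified 8:2 train/validation split: shuffle within each
# class, then cut at 80%, so every category keeps the same proportion in
# both sets. Class names and counts here are illustrative assumptions.

import random

def stratified_split(samples_by_class, train_ratio=0.8, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility
    train, val = [], []
    for cls, samples in samples_by_class.items():
        samples = samples[:]          # avoid mutating the caller's list
        rng.shuffle(samples)
        cut = int(len(samples) * train_ratio)
        train += [(s, cls) for s in samples[:cut]]
        val += [(s, cls) for s in samples[cut:]]
    return train, val

data = {"boat": list(range(100)), "buoy": list(range(40))}
train, val = stratified_split(data)
print(len(train), len(val))  # 112 28
```

Splitting per class rather than globally is what prevents a rare category (such as buoys) from landing almost entirely in one set.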

Evaluation Metrics
Precision (P) is defined as the proportion of correctly detected positive samples among all samples detected as positive; the higher the precision, the lower the probability of false detection, so it is also called the accuracy. Recall (R) is defined as the proportion of correctly detected positive samples among all true positive samples. The formulae for precision and recall are shown in Equations (11) and (12): P = TP/(TP + FP) and R = TP/(TP + FN), where TP, FP, and FN denote true positives, false positives, and false negatives, respectively. These formulae give the precision and recall at different confidence thresholds, from which the P-R curve is plotted. The area enclosed by the P-R curve and the coordinate axes is the average precision (AP), calculated as shown in Equation (13): AP = ∫₀¹ P(R) dR. In practice, obtaining the average precision by direct integration is cumbersome, so interpolated sampling is usually adopted, as shown in Equation (14): AP = (1/11) Σ P_interp(r) for r ∈ {0, 0.1, 0.2, ..., 1}, where P_interp(r) is the maximum precision at any recall no less than r. To examine the extent of the model's lightweighting, the experiments also use the number of parameters of the network model and the number of floating-point operations (GFLOPs), both of which are negatively correlated with the lightness of the model.
The lighter the model, the lower these two parameters are, and the more favorable the model will be for deployment on USVs.
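The metrics in Equations (11)-(14) can be sketched numerically. The 11-point interpolation shown here is one common variant of Equation (14); the exact sampling grid used by the authors is an assumption:

```python
# Sketch of the evaluation metrics: precision P = TP/(TP+FP),
# recall R = TP/(TP+FN), and AP approximated by 11-point interpolation
# over the precision-recall curve instead of the exact integral.

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def ap_11_point(pr_points):
    """pr_points: list of (recall, precision) pairs. 11-point interpolated AP."""
    total = 0.0
    for r_t in [i / 10 for i in range(11)]:        # r = 0, 0.1, ..., 1.0
        # interpolated precision: best precision at recall >= r_t
        candidates = [p for r, p in pr_points if r >= r_t]
        total += max(candidates) if candidates else 0.0
    return total / 11

print(precision(80, 20))  # 0.8
print(recall(80, 20))     # 0.8
print(ap_11_point([(0.0, 1.0), (0.5, 0.8), (1.0, 0.6)]))
```

mAP@0.5 then averages this AP over all classes at an IoU threshold of 0.5, and mAP@0.5:0.95 additionally averages over IoU thresholds from 0.5 to 0.95.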

Experimental Results and Analysis
The training results and mAP metrics statistics are shown in Table 3. The method in this paper shows an increase in mAP for each category of target compared to the base YOLOv7-Tiny model. A comparison of the precision-recall curves of the method in this paper and the baseline model is shown in Figure 20. The figure shows that the method in this paper covers a larger area under the curve than the baseline model, which means it is more accurate.

A confusion matrix was utilized to evaluate the accuracy of the proposed model's results. Each column of the confusion matrix represents the predicted proportions of each category, while each row represents the true proportions of the respective category in the data, as depicted in Figure 21. The dark blue portion of the confusion matrix lies on the diagonal, with a high accuracy of over 83% for the correct prediction of each category. The results also show very little category confusion, no more than 17%. This demonstrates that the proposed model has a strong learning capability and that the labels assigned to the data in this study are reasonable. The main cause of the poorer detection results is the influence of the background (background FP), whose incorrect predictions notably include cargo ship and boat labels. Given the large quantity of these two labels and the intricacy of the contexts in which they appear, this further illustrates that the data environment is not homogeneous but richly diversified and complex. Figure 22 displays the training result curves. For the same 300 training epochs, the loss of our method decreases more quickly. It is important to note that this strategy considerably raises the recall rate. Accordingly, our technique not only increases accuracy but also learns surface target properties more effectively, lowers the likelihood of missed detections, and identifies more targets in complex aquatic environments.
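The row-normalized confusion matrix described above can be sketched as follows; the two-class toy labels are illustrative, not the paper's categories:

```python
# Sketch of the confusion matrix used in Figure 21: rows are true classes,
# columns are predicted classes, and each row is normalized so the diagonal
# gives the per-class accuracy.

def confusion_matrix(true_labels, pred_labels, num_classes):
    m = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, pred_labels):
        m[t][p] += 1
    return m

def row_normalize(m):
    out = []
    for row in m:
        s = sum(row)
        out.append([v / s if s else 0.0 for v in row])
    return out

cm = confusion_matrix([0, 0, 0, 1, 1], [0, 0, 1, 1, 1], num_classes=2)
norm = row_normalize(cm)
print(norm[0])  # class 0: two thirds correct, one third confused with class 1
print(norm[1])  # [0.0, 1.0]
```

A strong diagonal after normalization is exactly the "over 83% per class" pattern reported above; off-diagonal mass marks inter-class confusion.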

Comparison with Other Popular Models
In order to verify the superiority of the proposed model, this paper also compares it with other mainstream lightweight target detection models. In addition to the YOLO series models, this paper also refers to other lightweight models, such as MobileNetv3 [31], GhostNetv2 [32], ShuffleNetv2 [33], PP-PicoDet [34], and FasterNet [35], which are combined with YOLOv7-Tiny for the comparison experiments. To reflect deployment conditions, the platform used for the comparison is the industrial personal computer of the "North Ocean" USV platform. The experimental results are shown in Table 4. Contrasting the models shows that YOLOv8 has the highest precision and recall, while the parameter count of the method in this study is considerably more modest. Other types of models, such as Faster-RCNN, RetinaNet [36], and CenterNet [37], can also perform well. However, their models are too complex to be deployed on USVs and do not meet the requirements for detection speed. The model combined with GhostNetv2 has the fewest GFLOPs. Although YOLOX-Tiny has the fastest detection speed and the fewest parameters, its recall is substantially lower than that of the baseline model, making it unsuitable for detecting targets on the water's surface. Although this method's GFLOPs are slightly greater than those of other lightweight approaches, its mAP is substantially improved, and it is more effective at fusing multi-scale water surface target features. Although the method in this paper does not have the advantage of speed, it fully satisfies the 30 fps input requirement of the USV-equipped optoelectronic pods and can easily achieve real-time detection.

Ablation Experiments
To verify the effectiveness of each method proposed in this paper, ablation experiments were conducted, and the results are shown in Table 5. The ablation experiments show that the network learning capability is effectively improved by replacing the LeakyReLU activation function with the Mish and SiLU activation functions. R2B improves accuracy by better integrating multi-scale features; it is lighter and more suitable for water surface target detection than the original MCB, reducing the network model's parameters by 1.22 M. The addition of CoordConv to the neck incorporates more feature information, and the increase in the number of parameters and computations is almost negligible. After adopting DyHead, the number of prediction channels is set to 128, which effectively improves accuracy while slightly reducing the parameters, though it also brings some increase in inference time.
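For reference, the three activation functions compared in this paper (the LeakyReLU default of YOLOv7-Tiny, plus the Mish and SiLU replacements) can be written out explicitly; the 0.1 negative slope for LeakyReLU is an assumed default:

```python
# The three activation functions compared in this paper (see Figure 3):
# LeakyReLU is the YOLOv7-Tiny default; Mish and SiLU are the smooth
# replacements used in the backbone and PaNet, respectively.

import math

def leaky_relu(x, slope=0.1):          # slope value is an assumption
    return x if x >= 0 else slope * x

def silu(x):                           # x * sigmoid(x)
    return x / (1 + math.exp(-x))

def mish(x):                           # x * tanh(softplus(x))
    return x * math.tanh(math.log1p(math.exp(x)))

print(leaky_relu(-1.0))       # -0.1
print(round(silu(1.0), 4))    # 0.7311
print(round(mish(1.0), 4))    # 0.8651
```

Unlike LeakyReLU, both Mish and SiLU are smooth and non-monotonic near zero, which is often credited with the better gradient flow observed in the ablation.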

Comparative Analysis of Visualization Results
Some visualizations of the detection results on the test set are shown in Figure 23. As the figure shows, the method presented in this work is better able to learn multi-scale target features. For instance, angle fluctuations and intraclass variances have an impact on the detection of multi-scale ships with significant aspect-ratio variations; however, this method is more successful in identifying and capturing the target information of the ship. Additionally, this approach works better in challenging aquatic conditions, as shown in the red boxes, such as foggy weather, overlapping targets, small targets, and light and darkness effects. The heat maps are drawn by Grad-CAM [38]. Gradient-weighted Class Activation Mapping (Grad-CAM) uses the gradients of any target concept flowing into the final convolutional layer to produce a coarse localization map highlighting the regions in the image that are important for predicting the concept. Our method efficiently captures spatial information on the deep feature maps. For instance, when determining the type of ship, the more reliable bow and stern structures are given more consideration than the intermediate hull. Additionally, it incorporates more environmental information and concentrates on targets that are easily missed. This visualization effectively illustrates what DyHead's attention has learned.

Experiments in Generalization Ability
Another dataset, the Singapore Maritime Dataset (SMD) [39], was prepared to validate the applicability of the model to multi-scene surface target detection tasks. SMD is a video dataset of sea scenes containing numerous multi-scale ship targets, with footage taken on deck and ashore, mainly as consecutive video frames. Sample images are shown in Figure 25. In this paper, a frame-extraction method is used to create a home-made dataset to validate the generalization capability of the model to different scenes. In this section, several models are also selected for comparative analysis. The experimental platform and parameters were kept consistent, and the results are compared across multiple models in Figure 26. The line graphs show that our method and YOLOv5s perform best on SMD, with similar metrics in every respect, and both perform significantly better than the base model. The method presented in this study, however, is computationally lighter, converges more quickly, and has fewer parameters. When combined with the PicoDet network, the model also outperforms the baseline, although the issue that the PicoDet model is too computationally expensive still exists.
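The frame-extraction step used to build the SMD subset can be sketched as keeping every n-th frame, so near-duplicate consecutive frames do not dominate the dataset. The stride of 30 (roughly one frame per second for typical video) is an assumption, not a value reported by the authors:

```python
# Sketch of frame extraction from a video: keep every `stride`-th frame so
# that near-identical consecutive frames do not dominate the training set.
# The stride of 30 is an assumed value for illustration.

def sample_frames(num_frames, stride=30):
    """Indices of the frames to keep from a video with num_frames frames."""
    return list(range(0, num_frames, stride))

kept = sample_frames(100, stride=30)
print(kept)  # [0, 30, 60, 90]
```

In practice the selected indices would drive a video reader that decodes and saves only those frames, together with their annotations.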
Figure 27 displays the outcomes of partial detection on the test set. Comparing the contents of the red boxes in the figure shows that the method proposed in this study is also effective in different scenarios. Compared to other lightweight approaches, its detection boxes show fewer errors and misses and are more accurate. In conclusion, the method presented in this work generalizes more broadly than existing lightweight methods and is appropriate for a variety of applications.

Conclusions and Discussion
With the development of deep learning, more and more research has focused on the field of water surface target detection. In this paper, a lightweight detection method for USVs is investigated, enhancing multi-scale feature fusion of surface targets while ensuring sufficient detection speed. Most previous studies have used a combination of different lightweight convolutional approaches and attention mechanisms, and these operations can significantly reduce detection accuracy. In this paper, we combine the characteristics of multi-scale water surface targets and focus on fusing more effective features with fewer convolution operations. The capture and fusion of valid feature information are enhanced by mapping multi-scale features to residuals and combining them with spatial information enhancement. Multiple attention-aware fusions in the detection head are then used to further create an algorithm suitable for water surface target detection.
This paper presents a lightweight multi-scale feature-enhanced detection method for surface target detection on USVs that achieves a balance of efficiency and accuracy. The proposed model has 22.1% fewer parameters than the original model, 15% fewer GFLOPs, a 6.2% improvement in mAP@0.5, and a 4.3% rise in mAP@0.5:0.95, and it satisfies the real-time criteria. Compared with the original YOLOv7-Tiny model and other lightweight methods, it has obvious advantages in terms of missed and wrong detections in sophisticated scenes, combines accuracy with real-time performance, and is more suitable for water surface target detection. Its generalization ability in different water scenarios also has a clear advantage over the original model. This paper also combines other lightweight methods and designs other improved models for comparative experiments, providing a valuable reference for the re-examination of USV lightweight detection.
Due to practical constraints, no experiments were conducted in real detection missions. Future research should consider conducting sea trials to verify the practical effectiveness of the method and should further reduce the computational effort of the model, making it less demanding to deploy. Additionally, because of the equipment's restricted viewing range and the distribution of the dataset, the experimental data contained too few small, hard-to-detect targets. Improving the accuracy on small targets will be one of the goals of future research.

Figure 2. Detailed view of the YOLOv7-Tiny modules.

Figure 3. Comparison of three activation functions.

Figure 5. Comparison of different structures: (a) Bottleneck; (b) Res2Net. The Res2Net module splits and processes features in a multi-scale way, facilitating the extraction of global and local information. The output of the Res2Net module contains various numbers and combinations of receptive-field scales. All splits are interconnected, allowing features to be processed more effectively before the splits are recombined by a 1×1 convolution.
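The hierarchical splitting described in the Figure 5 caption can be sketched as follows. An identity transform stands in for the 3×3 convolutions of the real block, so the numbers only trace how each split's receptive field accumulates:

```python
# Sketch of Res2Net-style multi-scale splitting: channels are split into s
# groups; from the second group on, each group is transformed after adding
# the previous group's output, so later groups see ever larger receptive
# fields. The identity `transform` stands in for the real 3x3 convolutions.

def res2net_forward(groups, transform=lambda g: g):
    outputs = [groups[0]]                 # first split passes through untouched
    prev = None
    for g in groups[1:]:
        inp = g if prev is None else [a + b for a, b in zip(g, prev)]
        prev = transform(inp)             # 3x3 conv in the real block
        outputs.append(prev)
    return outputs                        # concatenated, then fused by 1x1 conv

splits = [[1, 1], [2, 2], [3, 3], [4, 4]]   # s = 4 channel groups
print(res2net_forward(splits))  # [[1, 1], [2, 2], [5, 5], [9, 9]]
```

The accumulating sums show why the last split effectively sees the largest receptive field: it has passed through every preceding transform in the chain.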

Figure 8. An illustration of our Dynamic Head approach.

θ(·) is a hyperfunction that learns to control the boundaries of the activation function. It is implemented similarly to dynamic ReLU: it first performs global average pooling over the L × S dimensions to reduce dimensionality, then employs a normalization layer, two fully connected layers, and a shifted sigmoid function to normalize the output to [-1, 1]. Figure 9 illustrates the DyHead network structure used with YOLOv7-Tiny.
function that learns to control boundary of the activation function.()  is impleme similarly to dynamic ReLU, which first performs global averaging pooling in the L dimension to reduce the dimensionality, then employs a normalization layer, two linked layers, and a shift-ed S-shaped function to normalize the output to [-1, 1].Figu illustrates the DyHead network structure used with the YOLOv7-Tiny.
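The steps above can be sketched in plain Python. This is a minimal toy version of the θ(•) hyperfunction, not the paper's implementation: the pooled features are reduced to a scalar, the normalization layer is a placeholder, and the two fully connected layers are represented by scalar weights `w1` and `w2` (all hypothetical names); the shifted sigmoid is assumed to be 2·σ(x) − 1, one common form that maps the output into [−1, 1].

```python
import math

def shifted_sigmoid(x):
    # One common "shifted S-shaped function": 2 * sigmoid(x) - 1,
    # which maps any real input into [-1, 1].
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

def theta(feature_map, w1, w2):
    """Toy sketch of the hyperfunction theta(.) in the dynamic-ReLU style.

    feature_map: flattened list of L*S scalar activations for one channel.
    w1, w2: scalar stand-ins for the two fully connected layers.
    Returns a control value in [-1, 1].
    """
    # 1) Global average pooling over the L x S dimension.
    pooled = sum(feature_map) / len(feature_map)
    # 2) Normalization layer (placeholder identity here; e.g. LayerNorm in practice).
    normed = pooled
    # 3) Two fully connected layers with a ReLU in between (scalar toy case).
    hidden = max(0.0, w1 * normed)
    out = w2 * hidden
    # 4) Shifted sigmoid normalizes the output to [-1, 1].
    return shifted_sigmoid(out)
```

In the real DyHead, the pooling and linear layers operate on tensors and produce per-channel control values; the scalar version only illustrates the data flow.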

Figure 9. The detailed configuration of Dynamic Head.

Figure 10. Structure diagram of the improved network.

Figure 11. Detailed view of the improved network modules.

Figure 13. Suzhou Tai Lake watershed map and surroundings.

Figure 14. Sea map and surroundings of Lushun Sea in Dalian.

Figure 15. Sea map and surroundings of Tiger Beach in Dalian.


The data cover a variety of realistic and complex scenarios, such as backlighting, fog, wave disturbance, target clustering, and background disturbance, as shown in Figure 16. In order to create the experimental dataset, the training and validation sets were split at an 8:2 ratio, with 6824 images comprising the training set and 1716 images comprising the test set. The split ensured that the number of target labels for each category was proportional to the distribution of the dataset. The images captured by the USV are mainly used to supplement the categories with relatively few samples. The label distribution shown in Figure 17 confirms that enough labels of each class are present in both the training and validation sets.
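A split that keeps each category's label count proportional, as described above, can be sketched as a simple stratified shuffle. This is an illustrative stand-alone sketch, not the authors' tooling; `stratified_split` and its parameters are hypothetical names, samples are assumed to be single-label `(image_id, label)` pairs, and a real detection dataset would additionally handle multi-label images.

```python
import random
from collections import defaultdict

def stratified_split(samples, ratio=0.8, seed=0):
    """Split (image_id, label) pairs so each label keeps ~ratio in the train set.

    Shuffles within each label group, then cuts each group at the given ratio,
    so the per-category label counts stay proportional across the two sets.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)
    train, val = [], []
    for group in by_label.values():
        rng.shuffle(group)           # randomize within the category
        cut = int(len(group) * ratio)  # e.g. 8:2 split when ratio=0.8
        train.extend(group[:cut])
        val.extend(group[cut:])
    return train, val
```

For example, 10 "boat" and 5 "buoy" samples at `ratio=0.8` yield 8 + 4 training samples and 2 + 1 validation samples, preserving the category proportions.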

Figure 17. The instances information statistics of our dataset.

Figure 18. The "North Ocean" USV platform and trial surroundings.


Figure 19. Hardware structure diagram of the sensing system of the "North Ocean" USV platform.


Figure 20. Comparison of the Precision-Recall curves.


Figure 21. The confusion matrix of the proposed model.

Figure 22. Comparison of visual graphs of the training process.


Figure 24 compares the deep network attention heat maps of the detection outcomes. The heat maps are drawn with Grad-CAM [38]. Gradient-weighted Class Activation Mapping (Grad-CAM) uses the gradients of any target concept flowing into the final convolutional layer to produce a coarse localization map highlighting the regions of the image that are important for predicting the concept. Our method efficiently captures spatial information on the deep feature map. For instance, when determining the type of ship, the more reliable bow and stern structures are given more consideration than the intermediate hull. Additionally, it combines more environmental information and concentrates on targets that are easily missed. This figure effectively illustrates the aware attention learned by DyHead.
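The Grad-CAM weighting just described can be sketched in a few lines. This is a minimal illustrative version, not the cited implementation: the feature maps and their gradients (which a framework would supply via backpropagation hooks) are assumed to be given as nested lists, and resizing the coarse map back to image resolution is omitted.

```python
def grad_cam(activations, gradients):
    """Minimal sketch of the Grad-CAM localization map.

    activations: list of K feature maps, each an HxW list of lists (A_k).
    gradients:   gradients of the class score w.r.t. each feature map (same shape).
    Returns the coarse HxW map ReLU(sum_k alpha_k * A_k), where alpha_k is the
    global average of the k-th gradient map (the neuron importance weight).
    """
    h, w = len(activations[0]), len(activations[0][0])
    # alpha_k: global-average-pooled gradients.
    alphas = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    cam = [[0.0] * w for _ in range(h)]
    for alpha, fmap in zip(alphas, activations):
        for i in range(h):
            for j in range(w):
                cam[i][j] += alpha * fmap[i][j]
    # ReLU keeps only features with a positive influence on the target class.
    return [[max(0.0, v) for v in row] for row in cam]
```

In practice the resulting map is upsampled to the input image size and overlaid as a heat map, which is how figures like Figure 24 are produced.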

Figure 24. Comparison of the deep attention heat map. (a) Baseline; (b) Ours, which combines a variety of aware attention.

Figure 26. Comparison of the results on SMD. (a) Comparison of the mAP; (b) Comparison of the precision; (c) Comparison of the recall.

Figure 27 displays the outcomes of the partial detection on the test set. Comparing the contents of the red boxes in the figure shows that the method proposed in this study is also effective in different scenarios. Compared with other lightweight approaches, it produces fewer erroneous and missed detection boxes and is more accurate. In conclusion, the method presented in this work generalizes better than existing lightweight methods and is appropriate for a variety of applications.

Table 3. Comparison of mAP before and after improvement.

Table 4. Comparison of popular models on the sensing IPC for USV.

Table 5. The results of ablation experiments.