2.3. Convolution Optimization
The backbone network of the YOLOv13n algorithm relies on four convolutional layers for image feature extraction, and the resulting feature maps feed all subsequent networks. However, in real-world UUV operation scenarios, underwater images are degraded by water scattering and absorption, which leads to widespread blurring. Traditional strided convolutions (stride greater than 1) skip spatial positions during downsampling and can therefore lose information, making it difficult to fully extract effective target features from blurred underwater images. This severely impacts subsequent detection performance. To address this issue, this section draws on the core idea of SPD-Conv [29] and performs targeted optimization of the four key convolutional layers in the backbone network. The strategy is as follows: each original stride-2 convolutional layer is changed to stride 1, and an SPD layer is embedded after each convolutional layer to enhance the network’s feature extraction capability for blurred underwater images.
The core function of the SPD layer is to downsample feature maps without losing feature information, which fundamentally distinguishes it from the discard-based downsampling approach of traditional strided convolution. Specifically, for an input feature map $X$ of size $C \times H \times W$, the SPD layer uniformly partitions it into $s^2$ non-overlapping sub-feature maps according to a predefined scaling factor $s$. These sub-feature maps are then concatenated along the channel dimension to generate a new feature map $X'$. Compared to the original feature map $X$, the spatial dimensions of $X'$ are reduced to $\frac{H}{s} \times \frac{W}{s}$, while the number of channels is expanded by a factor of $s^2$. This transformation mechanism ensures that fine-grained information from the original feature map is fully preserved, fundamentally avoiding the loss of effective features caused by strided sampling in traditional convolution operations.
As shown in Figure 3, this process illustrates the optimized implementation of a single convolutional layer in the backbone network. It primarily involves two key steps: fine-grained feature capture and lossless transformation.
The purpose of the fine-grained feature capture step is to eliminate the interference caused by traditional convolutional strided sampling and fully preserve the fine-grained feature information in blurred underwater images. This is primarily achieved by replacing the original stride-2 convolution with a stride-1 $3 \times 3$ convolution. The output size of the convolutional layer is given by Equation (1):
$$O = \left\lfloor \frac{I - K + 2P}{S} \right\rfloor + 1 \tag{1}$$
where $O$ represents the size of the output feature map, $I$ denotes the spatial size of the input feature map, $K$ is the kernel size, $P$ stands for the padding amount, and $S$ indicates the convolution stride. In this paper, the kernel size is set to 3, padding is 1, and the stride is 1. Substituting these values into Equation (1) yields $O = I$. This result verifies that the spatial size of the output from a $3 \times 3$ convolution with a stride of 1 matches the spatial size of the input. The operation therefore traverses the input feature map of size $I \times I$ pixel by pixel, fully capturing the local details of underwater targets.
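The arithmetic behind Equation (1) can be checked with a short sketch. The function name and the input size of 640 are illustrative assumptions and do not come from the paper; only the kernel size, padding, and stride values are taken from the text.

```python
def conv_output_size(in_size: int, kernel: int, padding: int, stride: int) -> int:
    """Spatial output size of a convolution along one axis (floor division)."""
    return (in_size - kernel + 2 * padding) // stride + 1


# Illustrative input size (assumed); kernel 3 and padding 1 follow the text.
same = conv_output_size(640, kernel=3, padding=1, stride=1)
halved = conv_output_size(640, kernel=3, padding=1, stride=2)
print(same, halved)
```

With stride 1 the spatial size is unchanged, whereas the original stride-2 setting halves it, which is where spatial positions are skipped.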
The purpose of the lossless transformation step is to meet the downsampling requirements of subsequent networks while giving spatial features a richer channel-wise representation, thereby avoiding redundancy or loss of feature information. The core requirement is to compress the spatial dimensions effectively while preserving feature details to the greatest extent. The scale factor is the key parameter of the SPD layer for this purpose, as it directly determines the compression ratio of the input feature map. To downsample at all, the scale factor must be at least 2. With a smaller factor, the feature map is not reduced, the dimensional compatibility required by subsequent networks cannot be met, and the result is feature redundancy and reduced computational efficiency. As the minimum value that fulfills the downsampling requirement, a factor of 2 applies the mildest compression and thereby best preserves critical information that is inherently fragile in blurred underwater targets, such as weak edges and local textures, preventing excessive aggregation and loss. Therefore, the lossless transformation step is implemented by an SPD layer with a scale factor of 2. The spatial partitioning formula for the SPD layer is shown in Equation (2):
$$\mathrm{SPD}(X, s) = \left\{ X_{i,j} = X[:,\ i:H:s,\ j:W:s] \;\middle|\; i, j \in \{0, 1, \dots, s-1\} \right\}, \quad X \in \mathbb{R}^{C \times H \times W} \tag{2}$$
where $\mathrm{SPD}(\cdot)$ is the spatial splitting operation with a scaling factor of $s$, and the output is a collection of $s^2$ non-overlapping sub-feature maps. $C$ is the number of channels in the input feature map, $H$ is the height of the input feature map, and $W$ is the width of the input feature map. In the context of this paper, the input feature map size is $C \times H \times W$. With a scaling factor of 2, it can be derived that the input feature map is split into 4 sub-feature maps, each of size $C \times \frac{H}{2} \times \frac{W}{2}$, and all sub-feature maps cover the complete spatial area of the original feature map without overlap.
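The splitting and channel-wise concatenation described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `space_to_depth`, the channel-first layout, and the toy input are all hypothetical.

```python
import numpy as np


def space_to_depth(x: np.ndarray, scale: int = 2) -> np.ndarray:
    """Rearrange a (C, H, W) map into (C * scale**2, H // scale, W // scale)."""
    c, h, w = x.shape
    if h % scale or w % scale:
        raise ValueError("H and W must be divisible by the scale factor")
    # Sub-map (i, j) keeps every scale-th pixel starting at row i, column j.
    subs = [x[:, i::scale, j::scale] for i in range(scale) for j in range(scale)]
    # Concatenating along the channel axis keeps every input value exactly once.
    return np.concatenate(subs, axis=0)


x = np.arange(2 * 4 * 4, dtype=np.float32).reshape(2, 4, 4)  # toy input (assumed)
y = space_to_depth(x, scale=2)
print(x.shape, "->", y.shape)
```

Because the output holds exactly the same values as the input, only rearranged, the downsampling discards nothing, in contrast to a stride-2 convolution.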
After dual processing of fine-grained feature capture and lossless transformation, the generated feature maps can maximally achieve the extraction and complete preservation of target feature information. Such a convolution optimization strategy can provide richer, detailed information for underwater blurred targets, facilitating subsequent network target identification.
2.4. MBACA
In the feature enhancement stage of the YOLOv13n algorithm, attention mechanisms are often introduced to enable the algorithm to enhance the representation of useful features while suppressing interference from irrelevant features, thereby providing high-quality feature support for subsequent multi-scale feature fusion in the network. Addressing the issue of complex background interference affecting target features in underwater images captured during actual UUV operations, this section draws on the core idea of “multi-branch dynamic weighting” from the Selective Kernel Attention (SKA) [30] mechanism to propose the Multi-Branch Adaptive Calibration Attention (MBACA) mechanism.
As shown in Figure 4, SKA achieves dynamic weighted fusion of multi-scale features through a three-stage operation of “Split-Fuse-Select”. First, the Split operation generates multiple branches with different kernel sizes to capture feature information under varying receptive fields. Next, the Fuse operation performs element-wise addition of the features from each branch, extracts channel statistics via global average pooling, and then generates compact features through fully connected layers to guide weight allocation. Finally, in the Select operation, the softmax function is utilized to generate adaptive weights for each branch. The features from different branches are weighted and fused to output the final features.
Features processed by SKA can adaptively aggregate multi-scale features from different receptive fields, thereby enhancing the representation ability of targets. However, in UUV operational scenarios, due to the large-scale variations of underwater targets and complex background interference in captured images, attention mechanisms require richer branch coverage and more comprehensive interference suppression capabilities. Therefore, this section proposes MBACA, aiming to achieve sufficient coverage of multi-scale underwater target features and effective suppression of background interference.
As shown in Figure 5, MBACA primarily adopts a hierarchical strategy of “multi-branch feature extraction, four-dimensional weighting, and feature screening fusion” to enhance target features while suppressing background interference. Specifically, the process is as follows: First, eight sets of branches with different kernel sizes are used to cover multi-scale underwater targets, adapting to the significant size variations of objects in underwater images. Next, weighting operations are applied across four dimensions: branch channels, spatial regions, global channels, and effective branches. This enables spatial focus on target regions, redundancy suppression in background channels, and independent optimization of branch features. Finally, a Fuse operation integrates the effective features from each branch after four-dimensional weighting. Combined with a residual connection, these features are fused with the original input, effectively preserving essential target information while efficiently mitigating interference from complex underwater backgrounds.
Assume the input feature map is $X \in \mathbb{R}^{B \times C \times H \times W}$, where $B$ is the batch size, $C$ is the number of channels in the input features, $H$ is the height of the input feature map, and $W$ is the width of the input feature map.
First, the input feature map $X$ is processed using 8 sets of branches to generate multi-scale branch features $F_i$, $i = 1, \dots, 8$. To avoid excessive computational overhead when expanding the branches, the convolutional components of these eight branches incorporate depthwise separable convolutions. Depthwise separable convolutions decompose standard convolutions into depthwise and pointwise operations, significantly reducing the number of parameters compared to standard convolutions with the same kernel size. This prevents the eight-branch structure from adding too much computational cost. Considering the characteristics of underwater targets, four kernel sizes are selected for the eight-branch structure: 1, 3, 5, and 7. Specifically, the 1 × 1 kernel captures the semantic features of small targets and retains fine-grained information, mitigating the effect of underwater image blur on target detection. The 3 × 3, 5 × 5, and 7 × 7 kernels progressively capture the local texture details of medium to large targets, overall contour features, and contextual associations, compensating for feature loss and fragmentation caused by underwater image blur and complex backgrounds. Each branch includes convolution, batch normalization, and SiLU activation. The specific operations are shown in Equation (3):
$$F_i = \mathrm{SiLU}\big(\mathrm{BN}\big(\mathrm{PWConv}\big(\mathrm{DWConv}_{k_i}(X)\big)\big)\big), \quad i = 1, \dots, 8 \tag{3}$$
where $k_i \in \{1, 3, 5, 7\}$, $\mathrm{BN}(\cdot)$ denotes batch normalization, $\mathrm{DWConv}(\cdot)$ denotes depthwise separable convolution, and $\mathrm{PWConv}(\cdot)$ denotes 1 × 1 pointwise convolution. The 8 multi-scale features are stacked into a tensor $F \in \mathbb{R}^{B \times 8 \times C \times H \times W}$.
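The parameter saving from depthwise separable convolutions mentioned above can be made concrete with a small count. The channel width of 256 is an assumed, illustrative value, bias terms are ignored, and the function names are hypothetical.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a depthwise k x k convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out  # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


channels = 256  # illustrative channel width (assumed, not from the paper)
for k in (3, 5, 7):
    std = standard_conv_params(channels, channels, k)
    sep = depthwise_separable_params(channels, channels, k)
    print(f"k={k}: standard={std}, separable={sep}, ratio={std / sep:.1f}")
```

The saving grows with the kernel size, which is why the larger 5 × 5 and 7 × 7 branches benefit most from the decomposition.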
Next, four-dimensional weighting is applied to the branch feature tensor.
Branch Channel Dimension: This achieves independent optimization of channels for each branch. Based on the global average feature of all branches, channel weights are assigned to each branch feature individually. The specific operation is shown in Equation (4):
Here, the first fully connected layer maps the input features to an intermediate dimension, the second fully connected layer maps the intermediate features back to the original dimension, and $\sigma$ is the sigmoid activation function. The weighted branch features are given by Equation (5):
where $\odot$ denotes the Hadamard product.
Spatial Region Dimension: This focuses on target spatial regions and suppresses background noise. Based on the original input feature map, spatial attention is computed to generate a global spatial weight map. The specific operation is shown in Equation (6):
The weighted branch features are given by Equation (7):
Global Channel Dimension: This suppresses redundancy in background channels through high-weight channel mixing enhancement. Based on the feature fused from all branch features, channel statistics are first obtained via global average pooling. Then, the global channel weights for the 8 branches are generated. The specific operation is shown in Equation (8):
where $T$ represents a learnable temperature coefficient. Specifically, $T$ is defined as a trainable parameter, with its initial value set to 0.8 to accommodate the low signal-to-noise ratio characteristics of underwater scenarios. During training, a range constraint of [0.05, 2.0] is imposed on $T$ to prevent excessive concentration or dispersion of branch weights.
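The effect of the temperature and its range constraint can be illustrated with a short sketch. It assumes that the branch weights are produced by a temperature-scaled softmax, which is a common construction but is not confirmed by the text; the function name and the example scores are hypothetical. Only the initial value 0.8 and the range [0.05, 2.0] come from the paper.

```python
import math

T_MIN, T_MAX = 0.05, 2.0  # range constraint stated in the text


def branch_weights(scores, temperature=0.8):
    """Temperature-scaled softmax over branch scores (assumed form of the weighting)."""
    t = min(max(temperature, T_MIN), T_MAX)  # clamp T to the allowed range
    scaled = [s / t for s in scores]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


scores = [0.1, 0.4, 0.2, 0.9, 0.3, 0.5, 0.0, 0.7]  # hypothetical scores for 8 branches
for t in (0.05, 0.8, 2.0):
    w = branch_weights(scores, t)
    print(f"T={t}: largest branch weight = {max(w):.3f}")
```

A small temperature concentrates the weight on one branch, while a large one spreads it almost evenly, which is the behavior the range constraint is meant to bound.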
Valid Branch Dimension: This filters high-confidence branch features. Based on the global feature, selection weights for the 8 branches are generated to screen the branches. The specific operation is shown in Equation (9):
The weighted branch features are given by Equation (10):
Finally, the 8 branch features after four-dimensional weighting are summed according to the global channel weights, and the final enhanced feature is generated by adding a lightweight residual connection. The specific operation is shown in Equation (11):
MBACA enhances the feature representation of underwater targets in three ways: multi-branch coverage of multi-scale target features, four-dimensional weighting that focuses on target regions and suppresses background redundancy, and a lightweight residual connection that preserves effective information. Together, these reduce the interference of complex backgrounds on the features and ultimately improve the algorithm’s detection accuracy in underwater scenarios.
The integration position of the MBACA attention mechanism needs to be determined in accordance with the feature propagation patterns in underwater images. From bottom to top, the backbone network produces low-level texture features (B3), mid-level semantic features (B4), and high-level global features (B5). Background noise affects underwater images throughout the entire feature propagation process, and low-level features contain highly redundant detail, so introducing the attention mechanism too early may waste computational resources. In addition, high-level features have the most direct impact on the final detection results. It is therefore preliminarily assumed that the MBACA mechanism should be placed after the final layer (B5) of the backbone network. This placement lets the network focus on targets through global features while avoiding the redundancy associated with low-level features. This hypothesis will be verified through subsequent experiments.
2.5. Upsampling Optimization
The neck network structure of the YOLOv13n algorithm continues the bidirectional feature aggregation logic of PANet, with the upsampling module serving as the core component for multi-scale feature fusion. This is implemented using the nn.Upsample interface with nearest-neighbor interpolation. However, in the actual operational scenarios of UUVs, underwater images used for target detection often suffer from blurred object edge information. Due to the pixel-replication characteristic of nearest-neighbor interpolation, this method tends to amplify noise and introduce additional interference, thereby reducing the effectiveness of feature fusion and subsequently degrading the localization accuracy of target detection. To address this issue, this paper introduces the CARAFE [31] upsampling module to optimize the upsampling operation in the YOLOv13n algorithm.
As shown in Figure 6, the CARAFE module primarily employs a two-stage strategy of “kernel prediction and content-aware reassembly” to achieve feature upsampling. This approach not only ensures an increase in the spatial resolution of features after upsampling but also preserves target details and suppresses noise interference through content-adaptive mechanisms. Specifically:
First, for the input feature map with dimensions $C \times H \times W$, a channel compressor reduces its channel count from $C$ to $C_m$, thereby lowering computational costs while retaining key semantic information. Subsequently, the compressed features enter the kernel prediction module, where a content encoder performs local context encoding on the reduced-dimension features to generate kernel parameters of dimension $\sigma^2 k_{up}^2 \times H \times W$. Here, $\sigma$ represents the upsampling factor, and $k_{up}$ denotes the reassembly kernel size. A kernel normalizer then applies a softmax operation to each reassembly kernel, producing location-specific dynamic weights to adapt to the semantic differences between targets and backgrounds in underwater images.
Next, in the content-aware reassembly module, for any position in the output feature map, the corresponding source position in the input feature map is first located, and the local neighborhood features at this position are extracted. The neighborhood features are then combined with the dynamic weights output by the kernel prediction module through a reassembly operation, which performs a weighted summation to aggregate effective features within the neighborhood. This enables high-weight focusing on target regions and adaptive suppression of background noise.
Finally, through the aforementioned reassembly operation, feature aggregation is completed for all positions, outputting a high-resolution feature map with dimensions $C \times \sigma H \times \sigma W$. This process achieves precise retention of edge details for underwater targets through dynamic kernels, thereby addressing the issue of noise interference introduced by nearest-neighbor interpolation methods.
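The reassembly stage described above can be sketched in NumPy as follows. This is a simplified illustration, not the authors' or the reference implementation: the channel compressor and content encoder are omitted, random logits stand in for the predicted kernels, zero padding at the borders is an assumption, and the function name is hypothetical.

```python
import numpy as np


def carafe_reassemble(x, kernel_logits, scale=2, k_up=5):
    """Content-aware reassembly.

    x:             (C, H, W) input feature map
    kernel_logits: (k_up * k_up, scale * H, scale * W) unnormalized kernels,
                   one per output position
    """
    c, h, w = x.shape
    r = k_up // 2
    # Kernel normalizer: softmax over the k_up * k_up entries of each kernel.
    z = kernel_logits - kernel_logits.max(axis=0, keepdims=True)
    weights = np.exp(z)
    weights /= weights.sum(axis=0, keepdims=True)
    padded = np.pad(x, ((0, 0), (r, r), (r, r)))  # zero padding (assumed)
    out = np.zeros((c, scale * h, scale * w), dtype=x.dtype)
    for oi in range(scale * h):
        for oj in range(scale * w):
            i, j = oi // scale, oj // scale  # source position in the input
            patch = padded[:, i:i + k_up, j:j + k_up]  # neighborhood centered at (i, j)
            kernel = weights[:, oi, oj].reshape(k_up, k_up)
            out[:, oi, oj] = (patch * kernel).sum(axis=(1, 2))  # weighted sum
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6, 6))
logits = rng.standard_normal((25, 12, 12))  # stand-in for the content encoder output
y = carafe_reassemble(x, logits)
print(y.shape)
```

When a kernel puts all of its weight on the center entry, the operation reduces to nearest-neighbor replication, so nearest-neighbor interpolation can be viewed as a fixed special case of this content-dependent weighting.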
After replacing the upsampling module in the neck of YOLOv13n with the CARAFE module, the model is able to adaptively generate sampling kernels based on the weights of local features during upsampling. This allows for more accurate restoration of target edges and texture information while avoiding the introduction of additional noise interference, thereby improving the subsequent network’s ability to recognize targets.
2.6. Loss Function Optimization
The bounding box regression loss of the YOLOv13n algorithm adopts the CIoU calculation method. In real-world UUV operation scenarios, dense multi-target environments are very common, where multiple targets interfere with each other, leading to frequent overlap and occlusion of bounding boxes across different targets. This results in confusion in the spatial correspondence between predicted and ground truth boxes, significantly increasing errors in IoU calculation. CIoU assigns equal loss weights to all samples without distinguishing between easy and hard examples. In complex multi-target interference scenarios, the model tends to focus on easy samples with simple backgrounds and no occlusion, while severely lacking regression accuracy for hard samples such as ambiguous targets in overlapping areas. This ultimately leads to increased bounding box localization errors and overall detection performance degradation. To address this issue, this section integrates the power-weighting concept of Alpha-IoU [32] with the hard-sample focusing mechanism of Focal-IoU [33] to design an Alpha-Focal-IoU loss function tailored for underwater target detection. Through dual optimization, this approach enhances bounding box localization accuracy in dense multi-target scenarios.
Alpha-IoU’s core innovation lies in introducing an adjustable power parameter $\alpha$. By applying a power transformation to the IoU and loss penalty terms, it achieves gradient re-weighting for samples with different overlap levels, amplifying the IoU differences among samples, thereby addressing the difficulty of distinguishing truly matched samples from mismatched ones in dense multi-target overlapping scenarios. Its core formulas are given in Equations (12) and (13):
where $|B \cap B^{gt}|$ denotes the overlapping area between the predicted box $B$ and the ground-truth box $B^{gt}$, and $|B \cup B^{gt}|$ denotes the union area of the predicted and ground-truth boxes, so that $\mathrm{IoU} = |B \cap B^{gt}| / |B \cup B^{gt}|$.
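As a minimal numerical illustration of the power transformation, the sketch below computes the IoU of two axis-aligned boxes and the basic power form $1 - \mathrm{IoU}^{\alpha}$ of the Alpha-IoU loss. The additional penalty terms are deliberately left out, the value $\alpha = 3$ is used only as an example, and the function names and box coordinates are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def alpha_iou_loss(iou, alpha=3.0):
    """Basic power form of the loss, without penalty terms."""
    return 1.0 - iou ** alpha


gt = (0.0, 0.0, 2.0, 2.0)  # hypothetical ground-truth box
for pred in [(0.0, 0.0, 2.0, 1.0), (1.0, 1.0, 3.0, 3.0)]:
    iou = box_iou(pred, gt)
    print(f"IoU={iou:.3f}  plain loss={1 - iou:.3f}  alpha loss={alpha_iou_loss(iou):.3f}")
```

Raising the IoU to a power greater than 1 changes how steeply the loss responds at different overlap levels, which is the re-weighting effect the paragraph refers to.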
Focal-IoU’s core innovation lies in introducing a modulation factor controlled by the parameter $\gamma$, which down-weights the loss of easy samples and up-weights the loss of hard samples. It addresses the issue of equal weighting for all samples in dense multi-target underwater scenes. Its core formula is given in Equation (14):
This section addresses the issues of low IoU discriminability and insufficient learning of hard samples in CIoU by integrating Alpha-IoU and Focal-IoU to optimize CIoU and design the Alpha-Focal-IoU loss function. Specifically, a power transformation is applied to the IoU term and penalty term of CIoU to amplify IoU differences in multi-target overlapping scenes; and the loss values are differentially weighted through the parameter γ, reducing the weight of easy samples and increasing the weight of hard samples. The core formulas are given in Equation (15) through Equation (18):
The Alpha-Focal-IoU loss function enables the algorithm, in underwater multi-target overlapping scenarios, not only to quickly distinguish between true matching boxes and interfering boxes, but also to focus on the bounding box regression of occluded targets. It effectively addresses the issue of reduced bounding box regression accuracy in underwater scenes caused by the original CIoU’s low IoU discriminability and equal sample weighting.
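The combined behavior can be illustrated with a short sketch. Every formula in it is an assumption made for illustration: the modulation factor is taken as $(1 - \mathrm{IoU})^{\gamma}$, which matches the described down-weighting of easy samples but may differ from the paper's actual equations, and a single scalar `penalty` stands in for the CIoU penalty terms. The function name and the parameter values are hypothetical.

```python
def alpha_focal_iou_loss(iou, penalty=0.0, alpha=3.0, gamma=0.5):
    """Sketch of a power-transformed IoU loss with a focal-style weight.

    Assumed form: (1 - IoU)**gamma * (1 - IoU**alpha + penalty**alpha).
    `penalty` is a hypothetical non-negative stand-in for the CIoU penalty terms.
    """
    base = 1.0 - iou ** alpha + penalty ** alpha  # power transform on IoU and penalty
    weight = (1.0 - iou) ** gamma  # small for easy (high-IoU) samples
    return weight * base


easy, hard = 0.9, 0.3  # hypothetical IoU values for an easy and a hard sample
for name, iou in (("easy", easy), ("hard", hard)):
    unweighted = 1.0 - iou ** 3.0
    print(f"{name}: unweighted={unweighted:.3f}  weighted={alpha_focal_iou_loss(iou):.3f}")
```

Under these assumptions the weighting widens the gap between the hard and the easy sample, so the hard sample contributes a larger share of the total regression loss.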