MBACA-YOLO: A High-Precision Underwater Target Detection Algorithm for Unmanned Underwater Vehicles

Han, Chuang; Chen, Shanshan; Shen, Tao; Guo, Chengli

doi:10.3390/machines14020231


¹ School of Measurement-Control Technology and Communications Engineering, Harbin University of Science and Technology, Harbin 150080, China

² Sunny Group Co., Ltd., Yuyao 315400, China

* Author to whom correspondence should be addressed.

Machines 2026, 14(2), 231; https://doi.org/10.3390/machines14020231

Submission received: 25 January 2026 / Revised: 12 February 2026 / Accepted: 14 February 2026 / Published: 15 February 2026

(This article belongs to the Section Vehicle Engineering)


Abstract

This paper addresses the issue of low detection accuracy in underwater optical images for unmanned underwater vehicles (UUVs) during practical operations, caused by factors such as uneven lighting, blur, complex backgrounds, and target occlusion. To enhance the autonomous perception and control capabilities of UUVs, a high-precision algorithm named MBACA-YOLO is proposed based on the YOLOv13n model. Firstly, the convolutional layers in the backbone network of YOLOv13n are optimized by replacing stride-2 convolutions with stride-1 and embedding SPD layers to enable richer feature extraction. Secondly, the newly proposed MBACA attention mechanism is integrated into the final layer of the backbone network, enhancing effective features and suppressing background noise interference. Thirdly, traditional upsampling in the neck network is replaced with CARAFE upsampling to mitigate noise pollution. Finally, an Alpha-Focal-CIoU loss function is designed to improve the accuracy of bounding box regression for underwater targets. To validate the algorithm’s effectiveness, experiments were conducted on the URPC dataset with the following evaluation protocol: 640 × 640 input resolution, batch size 1, FP32 precision, and standard NMS. All results are from a single random seed with 300 epochs of training. The proposed MBACA-YOLO algorithm outperforms the baseline YOLOv13n model, improving mAP@0.5 and mAP@0.5:0.95 by 3.1% and 2.8% respectively, while adding only 0.49M parameters and 1.0 GFLOPs, with an FPS drop of just 2 frames. This makes it an efficient, deployable perception solution for automated Unmanned Underwater Vehicles (UUVs), significantly advancing intelligent underwater systems.

Keywords:

underwater target detection; attention mechanism; YOLOv13; UUV

1. Introduction

In the study of marine resources, UUVs serve as a core exploration instrument, contributing indispensable and critical capabilities to the precise surveying and sustainable development and utilization of marine resources. Since 1957, when the University of Washington developed the first Autonomous Underwater Vehicle (AUV) for Arctic exploration [1], UUVs have evolved from early models with a diving depth of 3650 m and an endurance of 5.5 h [2] to systems capable of diving depths of 6000 m and achieving an endurance of 8500 km [3]. As a key tool in ocean exploration, UUVs can perform precise surveys of seabed resources through target detection technologies without direct human involvement. This not only saves significant human and material resources but also overcomes the safety risks and endurance limitations of traditional exploration methods.

As a core technological enabler for UUVs to achieve autonomous operations and safe navigation, underwater target detection initially relied primarily on sonar technology for target identification and localization. Classic sonar techniques fall into three main categories: Side-Scan Sonar (SSS) [4], Multibeam Echo Sounder (MBES) [5], and Synthetic Aperture Sonar (SAS) [6]. As research has deepened and the technology has advanced, deep learning-based algorithms have gradually become the cornerstone of underwater target detection for UUVs.

The historical development of deep learning has evolved through multiple stages. It began with the McCulloch–Pitts (MP) neuron model proposed in 1943 [7], entered a systematic research era with the introduction of deep networks and layer-wise pre-training strategies by Hinton et al. in 2006 [8], and surged into a research boom with the proposal of AlexNet in 2012 [9]. Technologies such as ResNet [10], GAN [11], Transformer [12], and YOLO [13] subsequently emerged, driving the iterative optimization of deep learning models. The implementation process of deep learning-based underwater target detection algorithms involves several steps. First, the backbone network extracts target features from images layer by layer. Next, structures such as feature pyramid networks (FPN) and path aggregation networks (PANet) are used to fuse multi-scale features. Then, within a single-stage or two-stage detection framework, the fused feature maps are fed into detection heads. The model predicts the bounding box coordinates, confidence scores, and class probabilities of targets. A loss function calculates the discrepancy between predicted values and ground truth to update model parameters through backpropagation. Finally, non-maximum suppression (NMS) is applied to the raw detection results to eliminate duplicate bounding boxes. Based on a confidence threshold, valid detection boxes are filtered, ultimately outputting target categories and precise locations.

In recent years, researchers worldwide have conducted multi-dimensional optimization to address the issue of insufficient recognition accuracy in underwater target detection algorithms based on deep learning implementation processes. Li et al. [14], building upon the SSD-MV2 target detection model, proposed a lightweight feature extraction module with selective channels (SEB) and a feature extraction module with deformable convolutional kernels and selective channels (SDB). These modules effectively enhanced the algorithm’s detection accuracy for targets in underwater optical images. Liu and Wang [15], focusing on the problem of overlapping underwater targets, designed a deconvolution feature pyramid structure based on Faster R-CNN, which significantly improved the algorithm’s detection accuracy for marine benthic organisms. Gu et al. [16], addressing the issue of blurred sonar and optical images, proposed an improved zero-shot neural architecture search (NAS) method based on maximum entropy principles, applied to RT-DETR. The resulting NAS-DETR demonstrated advanced performance in underwater target detection. Zhai et al. [17], targeting unique interference sources in underwater environments, introduced AquaFuse-Net. This algorithm achieved enhanced perceptual accuracy and robustness through comprehensive optimization of the backbone network, neck network, and loss function. Wang et al. [18], concentrating on feature extraction in the backbone network, designed a multi-scale high-frequency information enhancement module and a multi-scale gated channel information optimization module. These improvements strengthened the backbone network’s ability to extract high-frequency information, effectively mitigating distortion issues caused by scattering and color shifts in the low-frequency background information of underwater images. Moreover, self-supervised learning has also shown promising prospects in underwater-related anomaly detection tasks. Han et al. [19], through their research on self-supervised multi-transformation learning for time series anomaly detection, offer valuable insights for tackling the challenges of limited labeled data and complex interference in underwater environments.

Among various target detection algorithms, the YOLO series has become a mainstream solution for underwater target detection tasks due to its end-to-end detection pipeline, efficient inference speed, excellent detection accuracy, and continuous version iterations. Research on YOLO-based improved underwater target detection algorithms has advanced rapidly. Domestic and international researchers have conducted extensive targeted studies on optimizing the YOLO algorithm to address the challenges of the underwater environment.

Feng and Jin [20] proposed the CEH-YOLO algorithm to tackle issues such as low contrast and small targets in underwater images. By improving the feature extraction capability and useful feature expression of the algorithm’s network, they effectively enhanced the average precision of underwater target detection. Liu et al. [21], addressing the blurriness and occlusion characteristics of underwater optical image targets, introduced the Bi2F-YOLO algorithm based on YOLOv7 [22]. By designing a Bi-level Routing Attention (BRA) mechanism and incorporating Partial Convolution (PConv), they effectively resolved object ambiguity and improved detection accuracy. Yi et al. [23], focusing on the issue of missed detection for small underwater targets, proposed the USSTD-YOLOv8n algorithm based on YOLOv8n [24]. Through optimizations in upsampling and convolutional operations, the algorithm achieved stronger feature extraction capabilities. Zhang et al. [25], targeting the specific scenario of small underwater targets, introduced the CSAF-YOLO algorithm. They optimized the YOLOv11 [26] algorithm in aspects such as multi-scale feature fusion, dynamic kernel modulation, attention mechanisms, and detection heads, significantly improving target detection performance in small target scenarios. Liang et al. [27], addressing challenges such as low visibility and overlapping biological occlusions underwater, enhanced the robustness and accuracy of underwater target detection by incorporating dynamic receptive fields into the C2f feature enhancement network, re-parameterizing the feature pyramid in the feature fusion network, and integrating perceptual attention mechanisms based on YOLOv8n.

Based on the research trends from both domestic and international scholars, it is evident that underwater optical images captured in practical UUV operation scenarios often suffer from issues such as uneven lighting, blurring, complex backgrounds, and target occlusion. These scenarios interfere with the layered processing networks of object detection algorithms. Specifically, for the backbone network, image blurring and low contrast cause the attenuation of high-frequency detail features. Coupled with pixel shifts induced by light refraction, the receptive fields of conventional convolutions struggle to accurately capture effective target features, leading to weakened feature representation capability. For the neck network, during multi-scale feature fusion, background clutter and refractive distortion exacerbate semantic confusion across features of different scales. Traditional upsampling methods based on pixel-level interpolation further amplify noise, reducing the effectiveness of feature fusion. For the head network, target occlusion and uneven lighting introduce ambiguity in bounding box localization. The morphological distortion of targets caused by light refraction increases errors in classification confidence calculation and reduces the precision of bounding box regression, ultimately degrading detection performance. To further enhance the detection accuracy of underwater target detection algorithms, it is essential to thoroughly address these practical challenges. By building upon the fundamental network architecture of target detection algorithms, targeted optimizations are required to strengthen the algorithm’s adaptability to complex underwater environments. Therefore, this paper proposes a high-precision underwater target detection algorithm named MBACA-YOLO, based on YOLOv13n [28], with the following specific improvement strategies:

(1): To address the issue of insufficient feature extraction capability in the backbone network caused by underwater image blurring, this study draws inspiration from the core concept of SPD-Conv and optimizes the structure of four convolutional layers in the YOLOv13n backbone network: the original stride-2 convolutional layers are adjusted to stride-1, and an SPD layer is embedded after each convolutional layer. This approach not only enables the network to extract richer feature information but also leverages the SPD layer’s ability to convert spatial dimensions into channel dimensions, thereby avoiding the introduction of additional computational overhead.
(2): To address the challenges of complex background interference and significant variations in target sizes in underwater images, the MBACA attention mechanism newly proposed in this paper is embedded into the final layer of the backbone network. This attention mechanism integrates the triple advantages of spatial attention, branch-specific channel calibration, and branch-level attention. By combining a dual-branch feature extraction structure composed of standard convolution and depthwise separable convolution, it adaptively enhances the effective feature responses of underwater targets during feature propagation while specifically suppressing invalid feature interference caused by background clutter. Consequently, it significantly improves the discriminative quality of the features output by the backbone network, laying a high-quality feature foundation for subsequent detection tasks.
(3): To address the issues of feature confusion and declining bounding box regression accuracy caused by overlapping multiple targets in underwater images, this paper proposes an improved solution: the traditional upsampling in the neck network is replaced with CARAFE upsampling to suppress noise interference. Additionally, an Alpha-Focal-CIoU loss function is designed, which enhances the differentiation of IoU discrepancies through Alpha exponentiation and focuses on overlapping hard cases via the Focal mechanism, thereby significantly improving the accuracy of bounding box regression for underwater targets.

2. Materials and Methods

2.1. YOLOv13

YOLOv13, released in 2025, is the latest iteration of the YOLO series and was jointly developed by Tsinghua University, Shenzhen University, and several other institutions. Its core breakthrough lies in leveraging hypergraph computation and end-to-end feature collaboration to overcome the limitations of previous YOLO algorithms, which were only capable of capturing local or pairwise correlations. As illustrated in Figure 1, while maintaining the classic Backbone-Neck-Head framework, YOLOv13 innovatively integrates three key modules. First, it introduces the Hypergraph Adaptive Correlation Enhancement (HyperACE) mechanism, which treats multi-scale feature pixels as hypergraph vertices and dynamically explores many-to-many high-order semantic correlations through learnable hyperedges, enabling global feature fusion. Second, it incorporates the Full-Path Adaptive Distillation (FullPAD) paradigm, which breaks unidirectional information flow bottlenecks to enhance gradient propagation efficiency and improve the retention of small object details. Third, it replaces traditional large-kernel convolutions with lightweight depthwise separable convolution modules, striking a balance between accuracy and efficiency. Similar to earlier YOLO models, YOLOv13 offers multiple scaled versions (n, s, l, x). This study primarily focuses on optimizing the YOLOv13n version.

2.2. MBACA-YOLO Overall Architecture

The overall architecture of the high-precision target detection algorithm MBACA-YOLO proposed in this paper for UUVs is shown in Figure 2, primarily consisting of four key components: convolutional optimization, attention mechanism embedding, upsampling optimization, and loss function optimization. Specifically, convolutional optimization involves replacing the four stride-2 convolutional layers in the YOLOv13n backbone network with stride-1 convolutional layers and adding an SPD (Space-to-Depth) layer [29] after each of them. Attention mechanism embedding refers to integrating the newly proposed MBACA attention mechanism into the final layer of the backbone network. Upsampling optimization entails replacing the two Upsample modules in the neck network with Content-Aware ReAssembly of Features (CARAFE) upsampling modules. Loss function optimization involves substituting the original CIoU loss function with the proposed Alpha-Focal-CIoU loss function.

2.3. Convolution Optimization

The backbone network of the YOLOv13n algorithm relies on four convolutional layers for image feature extraction, with the final output feature maps providing crucial support for subsequent networks. However, in real-world UUV operation scenarios, underwater images are often affected by factors such as water scattering and absorption, leading to widespread blurring issues. Traditional convolution operations, due to their larger stride steps, can easily cause information loss during feature downsampling, making it difficult to fully extract effective features of targets from blurred underwater images. This severely impacts subsequent detection performance. To address this issue, this section draws inspiration from the core idea of SPD-Conv [29] and performs targeted optimization on the four key convolutional layers in the backbone network. The specific optimization strategy is as follows: adjusting the original stride-2 convolutional layers to stride-1, while embedding an SPD layer after each convolutional layer to enhance the network’s feature extraction capability for blurred underwater images.

The core function of the SPD layer is to downsample feature maps without losing feature information, which fundamentally distinguishes it from the discard-based downsampling approach of traditional strided convolution. Specifically, for an input feature map $X$, the SPD layer uniformly partitions it into $\mathit{scale}^{2}$ non-overlapping sub-feature maps according to a predefined scaling factor $\mathit{scale}$. These sub-feature maps are then concatenated along the channel dimension to generate a new feature map $X'$. Compared to the original feature map $X$, each spatial dimension of $X'$ is reduced to $1/\mathit{scale}$ of its original size, while the number of channels is expanded by a factor of $\mathit{scale}^{2}$. This transformation mechanism ensures that fine-grained information from the original feature map is fully preserved, fundamentally avoiding the loss of effective features caused by strided sampling in traditional convolution operations.

As shown in Figure 3, this process illustrates the optimized implementation of a single convolutional layer in the backbone network. It primarily involves two key steps: fine-grained feature capture and lossless transformation.

The purpose of the fine-grained feature capture step is to eliminate the interference caused by traditional convolutional strided sampling and fully preserve the fine-grained feature information in blurred underwater images. This is primarily achieved by adjusting the original stride-2 convolution to a stride-1 $3 \times 3$ convolution. The calculation formula for the output size of the convolutional layer is shown in Equation (1):

$$O = \frac{I - K + 2P}{S} + 1 \tag{1}$$

where $O$ represents the size of the output feature map, $I$ denotes the spatial size of the input feature map, $K$ is the kernel size, $P$ stands for the padding amount, and $S$ indicates the convolution stride. In this paper, the kernel size is set to 3, padding is 1, and the stride is 1. Substituting these values into Equation (1) yields $O = (I - 3 + 2)/1 + 1 = I$. This result verifies that the spatial size of the output from a $3 \times 3$ convolution with a stride of 1 matches the spatial size of the input. This demonstrates that this operation can traverse the input feature map of size $I$ pixel by pixel, thereby fully capturing the local details of underwater targets.
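As a quick numerical check of Equation (1), the sketch below evaluates the output size for the stride-2 and stride-1 settings. The function name `conv_output_size` and the 640-pixel input side are illustrative choices, and the floor division is an assumption that reflects the usual integer rounding of convolution layers, which Equation (1) leaves implicit.

```python
def conv_output_size(i: int, k: int, p: int, s: int) -> int:
    """Spatial output size of a convolution, following Equation (1).

    Floor division is an assumption: it matches the common integer
    rounding when (i - k + 2p) is not divisible by s.
    """
    return (i - k + 2 * p) // s + 1


# Original backbone setting: 3x3 kernel, padding 1, stride 2 halves the map.
strided = conv_output_size(640, k=3, p=1, s=2)

# Optimized setting: 3x3 kernel, padding 1, stride 1 keeps every position.
dense = conv_output_size(640, k=3, p=1, s=1)

print(strided, dense)  # 320 640
```

With stride 1 the convolution no longer downsamples, so the resolution reduction has to come from the SPD layer that follows it.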

The purpose of the lossless transformation step is to meet the downsampling requirements of subsequent networks while endowing spatial features with richer expressive dimensions, thereby avoiding redundancy or loss of feature information. The core requirement is to achieve effective spatial compression while preserving feature details to the greatest extent. As a key parameter of the SPD layer in realizing this function, the scale factor directly determines the compression ratio of the input feature map. From the perspective of the fundamental need for downsampling, the scale factor must be at least 2 to effectively streamline the feature map. If it is less than 2, the required dimensional compatibility for subsequent networks cannot be met, leading to feature redundancy and reduced computational efficiency. As the minimum effective value that fulfills the downsampling requirement, a factor of 2 offers the mildest compression degree, thereby maximizing the preservation of critical information—such as weak edges and local textures—inherently fragile in underwater blurry targets, preventing excessive aggregation and loss. Therefore, the lossless transformation step is primarily achieved by employing an SPD layer with a scale factor of 2. The spatial partitioning formula for the SPD layer is shown in Equation (2):

$$P_{k}(X) = \left\{ X_{i,j} \mid X_{i,j} \in \mathbb{R}^{C \times \frac{H}{k} \times \frac{W}{k}} \right\}, \quad \forall i, j \in \{1, 2, \dots, k\} \tag{2}$$

where $P_{k}(\cdot)$ is the spatial splitting operation with a scaling factor of $k$, and the output is a collection of $k^{2}$ non-overlapping sub-feature maps. $C$ is the number of channels in the input feature map, $H$ is the height of the input feature map, and $W$ is the width of the input feature map. In the context of this paper, the input feature map size is $C_{in} \times S \times S$, where $S$ here denotes the spatial side length rather than the stride of Equation (1). With a scaling factor of 2, it can be derived that the input feature map is split into 4 sub-feature maps, each of size $C_{in} \times \frac{S}{2} \times \frac{S}{2}$, and all sub-feature maps cover the complete spatial area of the original feature map without overlap.
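The partition in Equation (2) can be illustrated with a minimal space-to-depth sketch on nested Python lists. The function name `space_to_depth` and the strided-slicing rule (sub-map $(i, j)$ keeps rows $i, i+k, \dots$ and columns $j, j+k, \dots$) are assumptions chosen for illustration; the paper does not specify the exact pixel-to-sub-map assignment or the concatenation order.

```python
def space_to_depth(x, k=2):
    """Rearrange a C x H x W nested list into (k*k*C) x (H/k) x (W/k).

    Sub-map (i, j) keeps the rows i, i+k, ... and the columns j, j+k, ...;
    this strided slicing is an assumed partition that satisfies the
    non-overlap property of Equation (2).
    """
    height, width = len(x[0]), len(x[0][0])
    if height % k or width % k:
        raise ValueError("H and W must be divisible by the scale factor")
    out = []
    for i in range(k):
        for j in range(k):
            for channel in x:
                out.append([row[j::k] for row in channel[i::k]])
    return out


# One channel, 4 x 4 map holding the distinct values 0..15.
x = [[[r * 4 + c for c in range(4)] for r in range(4)]]
y = space_to_depth(x, k=2)

print(len(y), len(y[0]), len(y[0][0]))  # 4 2 2
print(y[0])  # [[0, 2], [8, 10]]
```

Every input value appears exactly once in the output, which is the sense in which the downsampling discards nothing: the spatial side is halved while the channel count grows fourfold.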

After dual processing of fine-grained feature capture and lossless transformation, the generated feature maps can maximally achieve the extraction and complete preservation of target feature information. Such a convolution optimization strategy can provide richer, detailed information for underwater blurred targets, facilitating subsequent network target identification.

2.4. MBACA

In the feature enhancement stage of the YOLOv13n algorithm, attention mechanisms are often introduced to enable the algorithm to enhance the representation of useful features while suppressing interference from irrelevant features, thereby providing high-quality feature support for subsequent multi-scale feature fusion in the network. Addressing the issue of complex background interference affecting target features in underwater images captured during actual UUV operations, this section draws on the core idea of “multi-branch dynamic weighting” from the Selective Kernel Attention (SKA) [30] mechanism to propose the Multi-Branch Adaptive Calibration Attention (MBACA) mechanism.

As shown in Figure 4, SKA achieves dynamic weighted fusion of multi-scale features through a three-stage operation of “Split-Fuse-Select”. First, the Split operation generates multiple branches with different kernel sizes to capture feature information under varying receptive fields. Next, the Fuse operation performs element-wise addition of the features from each branch, extracts channel statistics via global average pooling, and then generates compact features through fully connected layers to guide weight allocation. Finally, in the Select operation, the softmax function is utilized to generate adaptive weights for each branch. The features from different branches are weighted and fused to output the final features.

Features processed by SKA can adaptively aggregate multi-scale features from different receptive fields, thereby enhancing the representation ability of targets. However, in UUV operational scenarios, due to the large-scale variations of underwater targets and complex background interference in captured images, attention mechanisms require richer branch coverage and more comprehensive interference suppression capabilities. Therefore, this section proposes MBACA, aiming to achieve sufficient coverage of multi-scale underwater target features and effective suppression of background interference.

As shown in Figure 5, MBACA primarily adopts a hierarchical strategy of “multi-branch feature extraction, four-dimensional weighting, and feature screening fusion” to enhance target features while suppressing background interference. Specifically, the process is as follows: First, eight branches spanning four kernel sizes are used to cover multi-scale underwater targets, adapting to the significant size variations of objects in underwater images. Next, weighting operations are applied across four dimensions: branch channels, spatial regions, global channels, and effective branches. This enables spatial focus on target regions, redundancy suppression in background channels, and independent optimization of branch features. Finally, a Fuse operation integrates the effective features from each branch after four-dimensional weighting. Combined with a residual connection, these features are fused with the original input, effectively preserving essential target information while efficiently mitigating interference from complex underwater backgrounds.

Assume the input feature map is $X \in \mathbb{R}^{BS \times C \times H \times W}$, where $BS$ is the batch size, $C$ is the number of channels in the input features, $H$ is the height of the input feature map, and $W$ is the width of the input feature map.

First, the input feature map $X$ is processed using 8 branches to generate multi-scale branch features $\{X_{1}, X_{2}, \dots, X_{8}\}$. To avoid excessive computational overhead when expanding the branches, each standard convolution branch is paired with a depthwise separable convolution branch of the same kernel size. Depthwise separable convolutions decompose standard convolutions into depthwise and pointwise operations, significantly reducing the number of parameters compared to standard convolutions with the same kernel size. This effectively prevents the eight-branch structure from adding too much computational cost. Considering the characteristics of underwater targets, four kernel sizes are selected for the eight-branch structure: 1, 3, 5, and 7. Specifically, the 1 × 1 kernel captures the global semantic features of small targets and retains fine-grained global information, mitigating interference from underwater blurry images on target detection. The 3 × 3, 5 × 5, and 7 × 7 kernels progressively capture the local texture details of medium to large targets, overall contour features, and contextual associations, effectively compensating for feature loss and fragmentation caused by underwater image blur and complex backgrounds. Each branch includes convolution, batch normalization, and SiLU activation. The specific operations are shown in Equation (3):

$$\begin{cases} X_{2i-1} = \mathrm{SiLU}(\mathrm{BN}(\mathrm{Conv}_{k_{i}}(X))) \\ X_{2i} = \mathrm{SiLU}(\mathrm{BN}(\mathrm{PointConv}(\mathrm{DepthConv}_{k_{i}}(X)))) \end{cases} \quad i = 1, 2, 3, 4 \tag{3}$$

where $k_{i} \in \{1, 3, 5, 7\}$ is the kernel size of the $i$-th branch pair, $\mathrm{BN}$ denotes batch normalization, $\mathrm{DepthConv}_{k_{i}}$ denotes the depthwise convolution with kernel size $k_{i}$, and $\mathrm{PointConv}$ denotes 1 × 1 pointwise convolution, so that their composition forms a depthwise separable convolution. The 8 multi-scale features are stacked into a tensor $\mathit{feats} \in \mathbb{R}^{8 \times BS \times C \times H \times W}$.
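To make the parameter argument concrete, the following sketch counts the weights of a standard convolution and of its depthwise separable counterpart for the four kernel sizes. The channel width of 256 is a hypothetical value rather than one reported in the paper, and biases and batch normalization parameters are omitted.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a k x k standard convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    return k * k * c_in + c_in * c_out


channels = 256  # hypothetical channel width, not taken from the paper
for k in (1, 3, 5, 7):
    std = standard_conv_params(channels, channels, k)
    sep = depthwise_separable_params(channels, channels, k)
    print(f"k={k}: standard={std}, separable={sep}, ratio={sep / std:.3f}")
```

Under these assumptions the saving grows with the kernel size, while for the 1 × 1 kernel the separable form is marginally larger than the standard one, since the pointwise step alone already equals a standard 1 × 1 convolution.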

Next, four-dimensional weighting is applied to the branch feature tensor $\mathit{feats}$.

Branch Channel Dimension: This achieves independent optimization of channels for each branch. Based on the global average feature $\mathit{global\_feat}$ of all branches, channel weights $M_{ci}$ are assigned to each branch feature individually. The specific operation is shown in Equation (4):

$$M_{ci} = \sigma(\mathrm{Linear}_{2}(\mathrm{SiLU}(\mathrm{Linear}_{1}(\mathit{global\_feat})))), \quad i = 1, 2, \dots, 8 \tag{4}$$

$\mathrm{Linear}_{1}$ represents the first fully connected layer, which maps the input features of dimension $C$ to an output dimension of $d$. $\mathrm{Linear}_{2}$ represents the second fully connected layer, which maps the intermediate features of dimension $d$ back to dimension $C$. Here $d = \max(32, C/16)$, and $\sigma$ is the sigmoid activation function. The weighted branch features are given by Equation (5):

$$\mathit{feats}'_{i} = \mathit{feats}_{i} \odot M_{ci} \tag{5}$$

where $\odot$ denotes the Hadamard product.
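The gating in Equations (4) and (5) can be sketched in plain Python as follows. Several details are assumptions made only for illustration: `global_feat` is treated as a length-$C$ vector of pooled statistics, the weights are random stand-ins for trained parameters, biases are omitted, and $C/16$ is taken as integer division.

```python
import math
import random


def silu(v: float) -> float:
    return v / (1.0 + math.exp(-v))


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def linear(weights, vec):
    """Matrix-vector product; weights is a list of rows (bias omitted)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def channel_gate(global_feat, w1, w2):
    """Equation (4): sigmoid(Linear2(SiLU(Linear1(global_feat))))."""
    hidden = [silu(h) for h in linear(w1, global_feat)]
    return [sigmoid(o) for o in linear(w2, hidden)]


random.seed(0)
c = 64                # hypothetical channel count
d = max(32, c // 16)  # bottleneck width; integer division is assumed
w1 = [[random.gauss(0, 0.1) for _ in range(c)] for _ in range(d)]  # C -> d
w2 = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(c)]  # d -> C
global_feat = [random.random() for _ in range(c)]  # stand-in pooled statistics

m_c = channel_gate(global_feat, w1, w2)
feat = [random.random() for _ in range(c)]     # one spatial position of one branch
weighted = [f * m for f, m in zip(feat, m_c)]  # Equation (5), Hadamard product

print(d, len(m_c), all(0.0 < m < 1.0 for m in m_c))  # 32 64 True
```

Because the sigmoid keeps every weight strictly between 0 and 1, the gate can only attenuate channels, never amplify them. The lower bound of 32 on $d$ also means that, for small channel counts, the bottleneck is wider than $C/16$ would suggest.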

Spatial Region Dimension: This focuses on target spatial regions and suppresses background noise. Based on the original input feature map $X$, spatial attention is computed to generate a global spatial weight map $M_{s}$. The specific operation is shown in Equation (6):

$$M_{s} = \sigma(\mathrm{Conv}_{1 \times 1}(\mathrm{SiLU}(\mathrm{BN}(\mathrm{Conv}_{3 \times 3}(X))))) \tag{6}$$

The weighted branch features are given by Equation (7):

$$\mathit{feats}''_{i} = \mathit{feats}'_{i} \odot M_{s} \tag{7}$$

Global Channel Dimension: This achieves redundancy suppression in background channels through high-weight channel mixing enhancement. Based on the fused feature $U$ from all branch features, where $U = \sum_{i=1}^{8} X_{i}$, channel statistics $S$ are first obtained via global average pooling. Then, the global channel weights $M_{c}$ for the 8 branches are generated. The specific operation is shown in Equation (8):

$$M_{c} = \mathrm{softmax}\left(\frac{1}{T} \cdot \mathrm{Linear}_{2}(\mathrm{Linear}_{1}(S))\right) \tag{8}$$

where $T$ represents a learnable temperature coefficient. Specifically, $T$ is defined as a trainable parameter, with its initial value set to 0.8 to accommodate the low signal-to-noise ratio characteristics of underwater scenarios. During training, a range constraint of [0.05, 2.0] is imposed on $T$ to prevent excessive concentration or dispersion of branch weights.
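The effect of the temperature in Equation (8) can be seen in a small sketch. It assumes the softmax is taken over a single vector of eight branch logits and that the range constraint is enforced by clamping; the logit values are made up, and how the constraint is actually implemented in training is not stated in the text.

```python
import math


def clamp(value: float, low: float = 0.05, high: float = 2.0) -> float:
    """Keep the temperature inside the range stated in the text."""
    return max(low, min(high, value))


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / T, as in Equation (8), for one logit vector."""
    t = clamp(temperature)
    scaled = [z / t for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.2, 1.0, -0.5, 0.7, 0.0, 0.3, -0.1, 0.9]  # made-up scores, 8 branches
sharp = softmax_with_temperature(logits, 0.8)  # initial value from the text
flat = softmax_with_temperature(logits, 2.0)   # upper bound of the range

print(round(max(sharp), 3), round(max(flat), 3))
```

A smaller $T$ concentrates the weight on the strongest branch and a larger $T$ spreads it more evenly, which is why bounding $T$ on both sides limits both extremes.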

Valid Branch Dimension: This filters high-confidence branch features. Based on the global feature $\mathit{global\_feat}$, selection weights $M_{b}$ for the 8 branches are generated to screen the branches. The specific operation is shown in Equation (9):

$$M_{b} = \mathrm{softmax}(\mathrm{Linear}_{2}(\mathrm{SiLU}(\mathrm{Linear}_{1}(\mathit{global\_feat})))) \tag{9}$$

The weighted branch features are given by Equation (10):

$$\mathit{feats}''' = \mathit{feats}'' \odot M_{b}^{\mathrm{permute}(1, 0, 2, 3, 4)} \tag{10}$$

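Finally, the 8 branch features after four-dimensional weighting are summed according to the global channel weights, and the final enhanced feature $Y$ is generated by combining a lightweight residual connection. The specific operation is shown in Equation (11):

$$Y = \sum_{i=1}^{8} \left(M_{c,i} \cdot \mathit{feats}'''_{i}\right) + \mathrm{BN}(\mathrm{Conv}_{1 \times 1}(X)) \tag{11}$$

where $M_{c,i}$ is the weight that the global channel weights $M_{c}$ of Equation (8) assign to branch $i$, and $\mathit{feats}'''_{i}$ is the $i$-th branch of $\mathit{feats}'''$.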

MBACA enhances the feature representation of underwater targets through multi-branch coverage of multi-scale underwater target features, four-dimensional weighting to focus on target regions and suppress background redundancy, and lightweight residual connections to preserve effective information. This simultaneously reduces interference from complex backgrounds on the features, ultimately improving the algorithm’s detection accuracy in underwater scenarios. The integration position of the MBACA attention mechanism needs to be determined in accordance with the feature propagation patterns in underwater images. The backbone network, from bottom to top, corresponds to low-level texture features (B3), mid-level semantic features (B4), and high-level global features (B5). Considering that underwater images are affected by background noise interference throughout the entire feature propagation process, low-level features contain high redundancy in details. Introducing the attention mechanism too early may lead to computational resource wastage. Additionally, high-level features have the most direct impact on the final detection results. Therefore, it is preliminarily assumed that the MBACA mechanism should be placed after the final layer (B5) of the backbone network. This ensures that the network can focus on targets through global features while avoiding the redundancy associated with low-level features. The reasonableness of this hypothesis will be verified through subsequent experiments.

2.5. Upsampling Optimization

The neck network structure of the YOLOv13n algorithm continues the bidirectional feature aggregation logic of PANet, with the upsampling module serving as the core component for multi-scale feature fusion. This is implemented using the nn.Upsample interface with nearest-neighbor interpolation. However, in the actual operational scenarios of UUVs, underwater images used for target detection often suffer from blurred object edge information. Due to the pixel-replication characteristic of nearest-neighbor interpolation, this method tends to amplify noise and introduce additional interference, thereby reducing the effectiveness of feature fusion and subsequently degrading the localization accuracy of target detection. To address this issue, this paper introduces the CARAFE [31] upsampling module to optimize the upsampling operation in the YOLOv13n algorithm.

As shown in Figure 6, the CARAFE module primarily employs a two-stage strategy of “kernel prediction and content-aware reassembly” to achieve feature upsampling. This approach not only ensures an increase in the spatial resolution of features after upsampling but also preserves target details and suppresses noise interference through content-adaptive mechanisms. Specifically:

First, for the input feature map

X

with dimensions

H \times W \times C

, a channel compressor reduces its channel count from

C

to

C_{m}

, thereby lowering computational costs while retaining key semantic information. Subsequently, the compressed features enter the kernel prediction module, where a content encoder performs local context encoding on the reduced-dimension features to generate kernel parameters of dimension

σ^{2} \times k_{up}^{2}

. Here,

σ

represents the upsampling factor, and

k_{up}

denotes the reassembly kernel size. A kernel normalizer then applies a softmax operation to each

k_{up} \times k_{up}

reassembly kernel, producing location-specific dynamic weights

w_{i}^{'}

to adapt to the semantic differences between targets and backgrounds in underwater images.

Next, in the content-aware reassembly module, for any position

l^{'}

in the output feature map, the corresponding source position

l

in the input feature map is first located, and the local neighborhood features

N(X_{l}, k_{up})

at this position are extracted. The neighborhood features are then combined with the dynamic weights

w_{i}^{'}

output by the kernel prediction module through a reassembly operation, which performs a weighted summation to aggregate effective features within the neighborhood. This enables high-weight focusing on target regions and adaptive suppression of background noise.

Finally, through the aforementioned reassembly operation, feature aggregation is completed for all positions, outputting a high-resolution feature map

X^{'}

with dimensions

σ H \times σ W \times C

. This process achieves precise retention of edge details for underwater targets through dynamic kernels, thereby addressing the issue of noise interference introduced by nearest-neighbor interpolation methods.

After replacing the upsampling module in the neck of YOLOv13n with the CARAFE module, the model is able to adaptively generate sampling kernels based on the weights of local features during upsampling. This allows for more accurate restoration of target edges and texture information while avoiding the introduction of additional noise interference, thereby improving the subsequent network’s ability to recognize targets.
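The reassembly stage described above can be illustrated with a minimal single-channel sketch in plain Python. It is not the authors' implementation: the learned channel compressor and content encoder are omitted, and the unnormalized kernel scores (`raw_kernels`, a hypothetical name) are assumed to be given already in the upsampled layout, one list of k_up × k_up scores per output position. Positions outside the input map are treated as zeros, which is also an assumption.

```python
import math

def softmax(values):
    """Kernel normalizer: numerically stable softmax over one reassembly kernel."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def carafe_reassemble(x, raw_kernels, sigma, k_up):
    """Content-aware reassembly for a single channel.

    x           : H x W input map (list of lists of floats)
    raw_kernels : (sigma*H) x (sigma*W) grid; each cell holds k_up*k_up
                  unnormalized scores for that output position (assumed given)
    sigma       : integer upsampling factor
    k_up        : odd reassembly kernel size
    """
    h, w = len(x), len(x[0])
    r = k_up // 2
    out = [[0.0] * (sigma * w) for _ in range(sigma * h)]
    for oy in range(sigma * h):
        for ox in range(sigma * w):
            # Source position l in the input for output position l'.
            iy, ix = oy // sigma, ox // sigma
            weights = softmax(raw_kernels[oy][ox])
            acc = 0.0
            for n, wgt in enumerate(weights):
                yy = iy + n // k_up - r
                xx = ix + n % k_up - r
                # Neighbors outside the map contribute zero (zero padding).
                if 0 <= yy < h and 0 <= xx < w:
                    acc += wgt * x[yy][xx]
            out[oy][ox] = acc
    return out
```

When one score dominates at the kernel center, the result approaches nearest-neighbor replication; when all scores are equal, each output is a local average. The learned kernels move between these extremes depending on the local content.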

2.6. Loss Function Optimization

The bounding box regression loss of the YOLOv13n algorithm adopts the CIoU calculation method. In real-world UUV operation scenarios, dense multi-target environments are common: multiple targets interfere with each other, and the bounding boxes of different targets frequently overlap and occlude one another. This confuses the spatial correspondence between predicted and ground-truth boxes and significantly increases errors in IoU calculation. CIoU assigns equal loss weights to all samples without distinguishing between easy and hard examples. In complex multi-target interference scenarios, the model therefore tends to focus on easy samples with simple backgrounds and no occlusion, while regression accuracy remains poor for hard samples such as ambiguous targets in overlapping areas. This ultimately increases bounding box localization errors and degrades overall detection performance. To address this issue, this section integrates the power-weighting concept of Alpha-IoU [32] with the hard-sample focusing mechanism of Focal-IoU [33] to design an Alpha-Focal-CIoU loss function tailored for underwater target detection. Through this dual optimization, the approach enhances bounding box localization accuracy in dense multi-target scenarios.

Alpha-IoU’s core innovation lies in introducing an adjustable power parameter

α

. By applying a power transformation to the IoU and loss penalty terms, it achieves gradient re-weighting for samples with different overlap levels, amplifying the IoU differences among samples, thereby addressing the difficulty of distinguishing truly matched samples from mismatched ones in dense multi-target overlapping scenarios. Its core formulas are given in Equations (12) and (13):

\mathrm{IoU}_{α} = \left(\frac{\mathrm{inter}}{\mathrm{union} + e}\right)^{α}

(12)

L_{α\text{-IoU}} = 1 - \mathrm{IoU}_{α}

(13)

where

inter

denotes the overlapping area between the predicted and ground-truth boxes,

union

denotes the union area of the predicted and ground-truth boxes,

e = 1 \times 10^{- 7}

, and

α > 0

.
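As a minimal sketch of Equations (12) and (13), the following plain-Python functions compute the power-transformed IoU and the corresponding loss for two axis-aligned boxes. The corner format `(x1, y1, x2, y2)` and the function names are assumptions made for illustration; the default α = 3.0 simply mirrors the value selected later in the sensitivity analysis.

```python
def iou_alpha(pred, gt, alpha=3.0, eps=1e-7):
    """Eq. (12): IoU raised to the power alpha.

    Boxes are (x1, y1, x2, y2) with x2 > x1 and y2 > y1 (assumed format).
    """
    # Width and height of the overlap region, clipped at zero when disjoint.
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_p + area_g - inter
    return (inter / (union + eps)) ** alpha

def alpha_iou_loss(pred, gt, alpha=3.0, eps=1e-7):
    """Eq. (13): L = 1 - IoU_alpha."""
    return 1.0 - iou_alpha(pred, gt, alpha, eps)
```

For example, a prediction `(0, 0, 2, 2)` against a ground truth `(1, 0, 3, 2)` has a plain IoU of 1/3, which becomes about 1/27 under α = 3, while a perfectly matching box stays close to 1. The gap between well-matched and poorly matched boxes therefore widens as α grows.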

Focal-IoU’s core innovation lies in introducing the

γ

modulation factor, which down-weights the loss of easy samples and up-weights the loss of hard samples. It addresses the issue of equal weighting for all samples in dense multi-target underwater scenes. Its core formula is given in Equation (14):

L_{\text{Focal-IoU}} = (1 - \mathrm{IoU}) \cdot \mathrm{IoU}^{γ}

(14)

where

γ \geq 0

.

This section addresses the issues of low IoU discriminability and insufficient learning of hard samples in CIoU by integrating Alpha-IoU and Focal-IoU into CIoU to design the Alpha-Focal-CIoU loss function. Specifically, a power transformation is applied to the IoU term and penalty term of CIoU to amplify IoU differences in multi-target overlapping scenes, and the loss values are differentially weighted through the parameter γ, reducing the weight of easy samples and increasing the weight of hard samples. The core formulas are given in Equations (15) to (18):

\mathrm{IoU}_{α} = \left(\frac{\mathrm{inter}}{\mathrm{union} + e}\right)^{α}

(15)

v = \frac{4}{π^{2}} \left(\arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w}{h}\right)^{2}

(16)

\mathrm{CIoU}_{α} = \mathrm{IoU}_{α} - α \cdot \frac{ρ^{2}(b, b^{gt})}{c^{2}} - α \cdot v

(17)

L_{\text{Alpha-Focal-CIoU}} = (1 - \mathrm{CIoU}_{α}) \cdot (\mathrm{CIoU}_{α})^{γ}

(18)

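Here, ρ(b, b^{gt}) denotes the Euclidean distance between the center points of the predicted and ground-truth boxes, c is the diagonal length of the smallest box enclosing both, and w, h and w^{gt}, h^{gt} are the widths and heights of the predicted and ground-truth boxes, respectively. The Alpha-Focal-CIoU loss function enables the algorithm, in underwater multi-target overlapping scenarios, not only to quickly distinguish between true matching boxes and interfering boxes, but also to focus on the bounding box regression of occluded targets. It effectively addresses the issue of reduced bounding box regression accuracy in underwater scenes caused by the original CIoU’s low IoU discriminability and equal sample weighting.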
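A minimal plain-Python sketch of Equations (15) to (18) for a single pair of boxes is given below. Several points are assumptions rather than details confirmed by the text: boxes use the corner format `(x1, y1, x2, y2)`; the factor α in Equation (17) is read as the same power parameter as in Equation (15); and CIoU_α is clamped at zero before the power γ is applied, because the penalty terms can make it negative, in which case a non-integer power would be undefined.

```python
import math

def alpha_focal_ciou_loss(pred, gt, alpha=3.0, gamma=2.0, eps=1e-7):
    """Sketch of Eqs. (15)-(18) for boxes given as (x1, y1, x2, y2)."""
    w, h = pred[2] - pred[0], pred[3] - pred[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]

    # Eq. (15): power-transformed IoU.
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    union = w * h + wg * hg - inter
    iou_a = (inter / (union + eps)) ** alpha

    # Eq. (16): aspect-ratio consistency term.
    v = (4.0 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(w / h)) ** 2

    # Squared center distance rho^2 and squared enclosing-box diagonal c^2.
    rho2 = (((pred[0] + pred[2]) - (gt[0] + gt[2])) ** 2
            + ((pred[1] + pred[3]) - (gt[1] + gt[3])) ** 2) / 4.0
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Eq. (17): alpha read as the power parameter (assumption).
    ciou_a = iou_a - alpha * rho2 / c2 - alpha * v

    # Eq. (18): clamp at zero before the power (assumption, see lead-in).
    return (1.0 - ciou_a) * max(ciou_a, 0.0) ** gamma
```

With the clamp, any pair of boxes whose CIoU_α is not positive receives zero loss from this expression; how such cases are handled in the actual training code is not specified in the text.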

3. Results and Discussion

3.1. Dataset

The URPC dataset was adopted in this study to validate the effectiveness of the MBACA-YOLO algorithm. This dataset originated from the 2020 Underwater Robot Professional Competition for object grasping, and the original images and annotation files were provided by the competition organizing committee. To meet the training requirements of the YOLO model, Python scripts were used to uniformly convert the original data format to the YOLO format. The 5543 images were then divided into training, validation, and test sets by a stratified random split in a ratio of 0.7:0.15:0.15, with NumPy’s random seed set to 42.
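A seeded stratified split of this kind can be sketched as follows. This is an illustration under stated assumptions, not the authors' script: it uses Python's standard `random` module instead of NumPy, and it stratifies on a caller-supplied `key` function, since the text does not say which attribute of a multi-object image was used for stratification.

```python
import random
from collections import defaultdict

def stratified_split(items, key, ratios=(0.7, 0.15, 0.15), seed=42):
    """Split items into train/val/test while preserving the share of each stratum.

    key(item) returns the stratum label; which label to use for detection
    images (e.g. a dominant class) is a hypothetical choice left to the caller.
    """
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)

    train, val, test = [], [], []
    for label in sorted(groups):
        group = groups[label]
        rng.shuffle(group)
        n_train = round(ratios[0] * len(group))
        n_val = round(ratios[1] * len(group))
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```

Because rounding is applied per stratum, the resulting set sizes match the 0.7:0.15:0.15 ratio only approximately.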

The URPC dataset highly simulates the complex environments encountered in actual UUV operational scenarios. It contains four types of target marine organisms: sea cucumbers, sea urchins, scallops, and starfish. The dataset also includes common real-world challenging scenarios in UUV operations, such as uneven lighting, complex backgrounds, image blurring, and object occlusion. This effectively supports the verification of the algorithm’s detection performance under conditions close to real-world operations. A sample set of the dataset is shown in Figure 7.

3.2. Experimental Environment and Hyperparameter Settings

All experiments in this study were conducted on the same server. The hardware configuration of the server is as follows: the processor utilizes a 16 vCPU Intel(R) Xeon(R) Gold 6430, manufactured by Intel Corporation, Santa Clara, California, USA; the graphics card employs an RTX 4090 with 24GB of video memory. The software environment is built on a Linux operating system, with Python 3.12 as the core development language. The training and optimization of the YOLO model were implemented based on the PyTorch 2.8.0 deep learning framework, and GPU parallel computing acceleration was achieved using CUDA 12.8. All experiments in this study were trained with a unified set of hyperparameter settings. No pre-trained weights were used, while basic data augmentation techniques (including HSV color jitter, horizontal flipping, and mosaic augmentation) were applied during training. To ensure the reproducibility and stability of the results, a fixed random seed (seed = 0) was set for the entire training process. The detailed hyperparameter configuration is shown in Table 1.

During the training process of the YOLOv13n algorithm, the settings of the hyperparameters—optimizer and learning rate (LR)—are critical. The optimizer determines the direction and step size of model parameter updates, serving as a core factor in balancing convergence speed and fitting stability. The learning rate controls the magnitude of parameter updates, directly influencing the model’s ability to fully learn underwater target features and its generalization capability. Therefore, based on the hyperparameter settings in Table 1, we further investigate the configurations of the optimizer and learning rate within the YOLOv13n framework. Specifically, we select two representative optimizers: SGD (fixed gradient update) and Adam (adaptive gradient update), and two learning rate strategies: a fixed learning rate (SGD LR = 0.01, Adam LR = 0.001) and a cosine annealing strategy (with cosine annealing scheduler enabled). By pairing these options, four comparative experimental groups are formed. The experimental results are presented in Table 2. The findings indicate that when using the SGD optimizer with a fixed learning rate (LR = 0.01), the YOLOv13n algorithm demonstrates superior performance after training on the URPC dataset. This suggests that the optimal settings for the optimizer and learning rate in this study should be: optimizer = “SGD” and LR = 0.01.
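For reference, the two learning rate strategies compared above can be written as simple schedule functions. The sketch assumes a single cosine decay over the full 300 epochs and a final learning rate `lr_final`; the final value is hypothetical, because the text does not report the scheduler's lower bound.

```python
import math

def fixed_lr(epoch, total_epochs, lr0):
    """Fixed strategy: the learning rate stays at lr0 for every epoch."""
    return lr0

def cosine_annealing_lr(epoch, total_epochs, lr0, lr_final):
    """Cosine annealing: smooth decay from lr0 (epoch 0) to lr_final (last epoch)."""
    progress = epoch / total_epochs
    return lr_final + 0.5 * (lr0 - lr_final) * (1.0 + math.cos(math.pi * progress))

# Example with the SGD setting from the text (lr0 = 0.01, 300 epochs);
# lr_final = 1e-4 is an assumed value used only for illustration.
schedule = [cosine_annealing_lr(e, 300, 0.01, 1e-4) for e in range(301)]
```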

3.3. Experimental Evaluation Metrics

To comprehensively evaluate the detection performance of the MBACA-YOLO algorithm, this study employs the following accuracy metrics: Precision (P) and Recall (R) [34] are selected to assess the model’s capability in reducing false positives and minimizing missed detections, and Mean Average Precision (mAP) is adopted as a comprehensive indicator of overall model performance. Real-time performance and computational cost are evaluated using the number of parameters, GFLOPs, inference latency (processing time per single image), and frames per second (FPS).

(1): P

Precision is primarily calculated by analyzing the samples predicted by the model as positive, determining the proportion among them that are actually positive. Its calculation formula is shown in Equation (19):

P = \frac{TP}{TP + FP}

(19)

Here,

TP

represents the number of true positives, i.e., the count of samples correctly predicted as positive by the model;

FP

represents the number of false positives, i.e., the count of samples incorrectly predicted as positive by the model. A higher precision indicates greater accuracy of the model in predicting positive classes and fewer negative samples being misclassified as positive.

(2): R

Recall is primarily calculated by analyzing the samples that are actually positive, determining the proportion that the model correctly identifies. Its calculation formula is shown in Equation (20):

R = \frac{TP}{TP + FN}

(20)

Here,

TP

represents the number of true positives, i.e., the count of samples correctly predicted as positive by the model;

FN

represents the number of false negatives, i.e., the count of samples incorrectly predicted as negative by the model. A higher recall indicates a stronger ability of the model to identify positive classes and fewer missed positive samples.
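Equations (19) and (20) translate directly into code. The sketch below returns 0.0 when a denominator is zero, which is a convention assumed here rather than one stated in the text; the counts in the example are made up for illustration.

```python
def precision(tp, fp):
    """Eq. (19): share of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp, fn):
    """Eq. (20): share of actual positives that are detected."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

# Hypothetical counts: 80 correct detections, 20 false alarms, 25 missed targets.
p = precision(80, 20)   # 80 / 100
r = recall(80, 25)      # 80 / 105
```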

(3): mAP

mAP comprehensively evaluates the precision and recall of the model’s prediction results. Its data support is based on the Precision-Recall (PR) curve obtained from training. The PR curve plots recall on the horizontal axis and precision on the vertical axis. The area under this curve is calculated to determine the average precision for each category. Subsequently, the average precision across all categories is averaged to obtain the mean average precision for the entire dataset. This metric is typically calculated based on the Intersection over Union (IoU) threshold. In object detection tasks, commonly used thresholds include 0.5 and 0.5:0.95. Specifically, when the IoU between a predicted bounding box output by the model and the ground truth bounding box exceeds 0.5, the prediction is considered a valid match. The average precision calculated under this single threshold is denoted as mAP@0.5. Meanwhile, mAP@0.5:0.95 adopts a more comprehensive evaluation approach by selecting ten graded IoU thresholds at intervals of 0.05 within the range of 0.5 to 0.95. The average precision corresponding to each threshold is calculated separately, and the mean of these ten results is taken, ultimately recorded as mAP@0.5:0.95. In object detection experiments, these two metrics are typically used together to comprehensively evaluate the overall detection performance of the algorithm. The formulas for calculating mAP are shown in Equations (21) and (22):

AP = \int_{0}^{1} p(r) \, dr

(21)

mAP = \frac{1}{N} \sum_{i = 1}^{N} AP_{i}

(22)
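A minimal sketch of Equations (21) and (22) follows. The integral in Equation (21) is approximated from a finite list of PR points using a monotone precision envelope and a rectangle sum, which is one common convention; the exact interpolation used by the evaluation code in this study is not stated, so this choice is an assumption. The function names are illustrative.

```python
def average_precision(recalls, precisions):
    """Approximate Eq. (21) from PR points sorted by increasing recall."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision non-increasing from right to left (envelope of the curve).
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangle areas over each recall step.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

def mean_ap(ap_per_class):
    """Eq. (22): mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

# The ten IoU thresholds used for mAP@0.5:0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

def map_50_95(ap_table):
    """ap_table[t] holds the per-class AP values at IOU_THRESHOLDS[t]."""
    return sum(mean_ap(row) for row in ap_table) / len(ap_table)
```

In this sketch, mAP@0.5 corresponds to `mean_ap(ap_table[0])`, and mAP@0.5:0.95 averages the same quantity over all ten thresholds.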

(4): FPS

FPS, as a key metric for evaluating the real-time detection capability of a model, reflects the number of images the model can process per second. In this paper, the FPS value of the model is calculated by statistically averaging the inference latency per individual image.

(5): Parameters

Parameters (Params) reflects the total scale of model parameters and serves as a key metric for evaluating storage resource utilization. In this paper, the parameter count of the model is calculated by summing the total number of parameters across all network layers.

(6): GFLOPs

Giga floating-point operations (GFLOPs) denotes the number of floating-point operations, in billions, that the model requires for one forward pass, and serves as a key metric for evaluating computational complexity. In this paper, the GFLOPs value of the model is calculated by summing the total number of floating-point operations during the forward propagation of the network and converting it into units of billions of operations.
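To make the layer-wise summation concrete, the sketch below counts the parameters and forward-pass operations of a single standard convolution. Counting one multiply-accumulate as two floating-point operations is an assumed convention (tools differ on this), and the layer sizes in the example are hypothetical.

```python
def conv2d_params(c_in, c_out, k, bias=False):
    """Weights of a k x k convolution, plus one bias per output channel if used."""
    return k * k * c_in * c_out + (c_out if bias else 0)

def conv2d_gflops(c_in, c_out, k, h_out, w_out):
    """Forward-pass cost in billions of operations (1 multiply-add = 2 ops, assumed)."""
    macs = k * k * c_in * c_out * h_out * w_out
    return 2 * macs / 1e9

# Hypothetical layer: 3 x 3 convolution, 16 -> 32 channels, 80 x 80 output map.
n_params = conv2d_params(16, 32, 3)
cost = conv2d_gflops(16, 32, 3, 80, 80)
```

Summing such per-layer values over the whole network gives the model-level Params and GFLOPs figures.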

(7): Inference Latency

Inference Latency measures the time required for the model to process a single image, serving as a key metric for evaluating real-time responsiveness per sample. In this paper, the inference latency of the model is calculated by recording the total time from inputting a single image into the model to obtaining the detection output, and then averaging the results over multiple test runs.
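A simple timing harness along these lines is sketched below. The callable `infer`, the warm-up count, and the injectable `clock` are illustrative assumptions. Deriving FPS as the reciprocal of the mean latency is also an assumption, since the text only says that FPS is obtained from the averaged per-image latency.

```python
import time

def measure_latency_ms(infer, images, warmup=3, clock=time.perf_counter):
    """Mean per-image latency in milliseconds for a callable `infer`."""
    # Warm-up runs are executed but not timed.
    for img in images[:warmup]:
        infer(img)
    per_image = []
    for img in images:
        start = clock()
        infer(img)
        per_image.append((clock() - start) * 1000.0)
    return sum(per_image) / len(per_image)

def fps_from_latency(latency_ms):
    """Frames per second implied by a mean per-image latency."""
    return 1000.0 / latency_ms
```

When timing a GPU model, the device should be synchronized before each clock reading, because kernel launches are asynchronous; otherwise the measured time may exclude part of the computation.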

3.4. Analysis of Algorithm Training Process

To compare the performance of the proposed MBACA-YOLO algorithm and the baseline algorithm over the training cycle, this section conducts a dynamic analysis of the training process. Figure 8 presents the mAP@0.5 curves over the training epochs for both the baseline algorithm YOLOv13n and the proposed MBACA-YOLO algorithm throughout the entire training process.

The experimental results indicate that during epochs 0–50, the mAP@0.5 of both algorithms shows a rapid upward trend, with MBACA-YOLO demonstrating a faster increase. This suggests that the improved algorithm exhibits stronger learning capability for target features in the early stages of training. During epochs 50–100, the growth of the curves gradually slows and stabilizes, yet MBACA-YOLO still maintains a leading trend, indicating that the improved algorithm retains higher feature learning capability in the mid-training phase. After 100 epochs, the curves of both algorithms tend to stabilize, indicating that both algorithms have achieved convergence. This also confirms that the study utilized a sufficient number of training epochs to ensure the stability of experimental results from a single run, thereby reliably reflecting the actual performance of the proposed algorithm. Notably, MBACA-YOLO demonstrates higher convergence accuracy, reflecting the superior performance of the improved algorithm. Through the analysis of the training processes of both algorithms, the effectiveness of the proposed MBACA-YOLO improvement strategy in enhancing target detection accuracy and convergence efficiency is fully validated.

3.5. Comparative Experiments on Multiple Positions of the Attention Mechanism

To determine the optimal integration position for the MBACA attention mechanism, this section integrates the attention mechanism after the second layer, the sixth layer, and the final layer of the backbone network, conducting three sets of comparative experiments. These experiments correspond to the low-level feature layer (B3), mid-level feature layer (B4), and high-level feature layer (B5), respectively. The experimental results are shown in Table 3. The results indicate that integrating the MBACA attention mechanism after the B5 layer significantly outperforms integration after the B3 and B4 layers in terms of accuracy, complexity, and processing speed. This validates the earlier hypothesis: integrating MBACA after the B5 layer, i.e., the final layer of the backbone network, enables the algorithm to achieve optimal performance.

3.6. Comparative Experiment on the Effectiveness of Attention Mechanisms

To verify the performance advantages of the proposed MBACA attention mechanism for underwater object detection tasks, this section selects typical attention mechanisms, including SE [35], CBAM [36], ECA [37], CA [38], SimA [39], and SKA [30], for comparative experiments with MBACA. All attention mechanisms are applied to the same position in the YOLOv13n algorithm model. The experimental results are shown in Table 4.

The experimental results show that, under the same configuration, the algorithm incorporating the proposed MBACA attention mechanism demonstrates optimal performance in underwater object detection tasks. Compared to algorithms integrated with attention mechanisms such as SE, CBAM, ECA, CA, SimA, and SKA, its mAP@0.5 is improved by 0.9%, 0.5%, 1.1%, 1.7%, 0.9%, and 0.7%, respectively, achieving a leading position in detection accuracy. In terms of model complexity, compared to the baseline model YOLOv13n, the MBACA attention mechanism results in an increase of only 0.49 M parameters and 1.0 GFLOPs. This increment is significantly smaller than that of other attention mechanisms such as SE, CBAM, CA, and SKA. Regarding inference speed, MBACA increases inference latency by only 0.9 ms, and FPS decreases slightly from 56 to 54, far outperforming other attention mechanisms. This indicates that while achieving optimal accuracy, the increases in model complexity and inference latency brought by MBACA remain within controllable limits. This outcome confirms that, compared to typical attention mechanisms, the MBACA attention mechanism is better suited for the YOLOv13n algorithm. It enables more efficient processing of feature information in underwater object detection tasks, effectively enhancing useful features while suppressing irrelevant interference.

3.7. Heatmap-Based Attention Mechanism Effectiveness Validation

To more intuitively demonstrate the enhancement effect of the proposed MBACA attention mechanism on target features in feature maps, this section selects representative images covering all target categories. The original YOLOv13n algorithm and the improved algorithm embedded with the MBACA attention mechanism are respectively applied to generate heatmaps at the output features of the last layer of the backbone network using the Grad-CAM (Gradient-weighted Class Activation Mapping) method. The comparative analysis validates the focusing and enhancement effects of the MBACA attention mechanism on key target features. The experimental results are shown in Figure 9. In the heatmaps, the color depth represents the algorithm’s attention intensity to feature regions, with blue indicating low-attention areas and red representing high-attention areas. The results show that compared to the baseline YOLOv13n algorithm, the algorithm embedded with the attention mechanism exhibits more complete and deeper red response regions in the target distribution areas. This experiment fully verifies that the MBACA attention mechanism effectively enhances target-related features in the feature maps.

3.8. Sensitivity Analysis of Alpha-Focal-CIoU Hyperparameters

To determine the optimal values of α and γ in the Alpha-Focal-CIoU loss function for adapting to the fuzzy, overlapping, and multi-scale characteristics of underwater object detection, this section employs a controlled variable approach to conduct a hyperparameter sensitivity analysis experiment. The experiment evaluates the impact of different parameter values on the model’s detection performance and ultimately identifies the optimal hyperparameter combination. The results of Analysis Experiment 1 are presented in Table 5. Here, γ is set to the classic optimal value of 2.0 as recommended in the original Focal-IoU paper [33], while α is sequentially assigned values across the core range of Alpha-IoU: 1.0, 2.0, 3.0, 4.0, and 5.0. The results indicate that the loss function achieves the best performance when α is set to 3.0.

The experimental results of Analysis Experiment 2 are presented in Table 6. In this experiment, α is set to the classic default value of 1.0 from the original Alpha-IoU paper [32], while γ is assigned values across the core focusing range of Focal Loss: 0.5, 1.0, 1.5, 2.0, and 2.5. The results indicate that the loss function achieves optimal performance when γ is set to 2.0.

The hyperparameter sensitivity analysis in this section demonstrates that for the Alpha-Focal-CIoU loss function, the algorithm achieves optimal performance when the hyperparameters are set to: α = 3.0 and γ = 2.0.

3.9. Ablation Study

To validate the effectiveness of the MBACA-YOLO algorithm proposed in this paper, an ablation study was conducted on the URPC dataset, with YOLOv13n as the baseline model, to evaluate the individual contribution of each improved module. The experimental results are presented in Table 7 and Table 8, where a checkmark (√) indicates the inclusion of the corresponding module. In terms of accuracy metrics, compared to the baseline model YOLOv13n, the MBACA-YOLO model increased precision from 82.4% to 84.1%, a 1.7% improvement; recall from 76.0% to 78.5%, a 2.5% improvement; mAP@0.5 from 82.4% to 85.5%, a 3.1% improvement; and mAP@0.5:0.95 from 47.3% to 50.1%, a 2.8% improvement. Specifically, after applying convolutional optimization to the YOLOv13n baseline model, mAP@0.5 increased from 82.4% to 83.5%, a gain of 1.1%, while mAP@0.5:0.95 rose from 47.3% to 48.3%, an improvement of 1.0%. With the subsequent introduction of CARAFE upsampling, mAP@0.5 further improved from 83.5% to 84.1%, an increase of 0.6%, and mAP@0.5:0.95 increased from 48.3% to 48.9%, also rising by 0.6%. After integrating the MBACA attention mechanism, mAP@0.5 increased from 84.1% to 85.0%, a gain of 0.9%, and mAP@0.5:0.95 rose from 48.9% to 49.5%, improving by 0.6%. Finally, by combining the Alpha-Focal-CIoU loss function, mAP@0.5 increased from 85.0% to 85.5%, a gain of 0.5%, while mAP@0.5:0.95 rose from 49.5% to 50.1%, an improvement of 0.6%. In terms of model complexity and inference speed, after convolutional optimization, the model’s parameter count increased by 0.09 M, GFLOPs increased by 0.1, inference latency improved by 0.9 ms, while FPS remained largely unchanged.
With the addition of the CARAFE module on top of the convolutional optimization, the model’s parameter count increased by 0.28 M compared to the baseline, GFLOPs increased by 0.6, inference latency decreased by 0.4 ms, and FPS still showed no significant change. Upon further integrating the MBACA attention mechanism, the model’s parameter count increased by 0.49 M compared to the baseline, GFLOPs increased by 1.0, inference latency increased by 1.6 ms, and FPS slightly decreased to 54. Finally, after introducing the Alpha-Focal-CIoU loss function, the model’s parameter count and GFLOPs remained unchanged, with only a slight increase in inference latency and a minor decrease in FPS to 53. Overall, compared to the baseline model YOLOv13n, the increases in metrics measuring model complexity and speed for MBACA-YOLO remain within a controllable range, achieving a favorable balance between accuracy improvement and efficiency cost.

3.10. Comparative Experiment

To comprehensively validate the performance advantages of the proposed MBACA-YOLO algorithm as a UUV target detection algorithm, a comparative experiment was conducted between the proposed model and seven other algorithms. The selected algorithms include the representative two-stage target detection algorithm Faster R-CNN [40], the earliest single-stage target detection algorithm SSD [41], the representative Transformer-based single-stage algorithm RT-DETR [42], the classic YOLO series algorithms YOLOv5n [43] and YOLOv8n [24], and other improved underwater target detection algorithms such as COT-YOLO [44] and CSAF-YOLO [25]. COT-YOLO and CSAF-YOLO are also improved algorithms based on the YOLO series. Specifically, compared to COT-YOLO, which employs a CoT module and decoupled head improvements, the proposed MBACA-YOLO adopts a collaborative design of SPD-Conv and the MBACA attention mechanism, demonstrating superior performance in feature extraction and interference suppression. CSAF-YOLO relies on stacking multi-branch attention and feature fusion modules to enhance its underwater detection capability. In contrast, the proposed MBACA-YOLO integrates SPD-Conv and MBACA attention into a streamlined process. This architectural simplification not only avoids the redundant computation caused by CSAF-YOLO’s multi-module stacking but also maintains comparable detection accuracy. All experiments were performed on the same server using the URPC dataset. The results are presented in Table 9.

The experimental results show that, compared to Faster R-CNN, SSD, RT-DETR, YOLOv5n, YOLOv8n, YOLO-U, COT-YOLO, and YOLOv13n, the mAP@0.5 of the MBACA-YOLO model increased by 10.9%, 23.8%, 4.0%, 5.6%, 4.1%, 11.9%, 1.1%, and 3.1%, respectively, and its mAP@0.5:0.95 increased by 9.4%, 15.2%, 4.9%, 8.0%, 6.5%, 8.1%, 2.8%, and 2.8%, respectively. In terms of model complexity and speed metrics, the parameter count of the MBACA-YOLO model is only 2.94 M, a significant lightweight advantage over models such as Faster R-CNN, SSD, and RT-DETR. Its GFLOPs value is 7.2, which is significantly lower than those of Faster R-CNN and RT-DETR, and only slightly higher than that of the baseline model YOLOv13n. Regarding inference efficiency, MBACA-YOLO has an inference latency of 20.3 ms and an FPS of 53. Although slightly lower than YOLOv13n, it still outperforms other underwater-improved algorithms like CSAF-YOLO, maintaining favorable real-time performance while achieving accuracy improvements. These comparative experimental results indicate that the algorithm proposed in this paper not only demonstrates superior detection accuracy compared to the baseline model but also maintains leading advantages in both detection accuracy and real-time performance when compared to mainstream target detection algorithms and other improved YOLO algorithms of the same type in UUV real-world operational scenarios.

3.11. Cross-YOLO Version Generalization Experiments

To verify the generalizability of the proposed improvement strategies in this paper, this section transfers the convolutional optimization, MBACA attention mechanism, upsampling optimization, and Alpha-Focal-CIoU loss function to mainstream YOLO algorithms such as YOLOv5n, YOLOv8n, YOLOv9n [45], and YOLOv10n [46]. The aim is to validate whether the improved algorithms with these strategies perform better in underwater object detection tasks compared to their original versions. The experimental configuration remains consistent with the previous section, conducted on the same server. The experimental results are shown in Table 10. In terms of accuracy: For YOLOv5n, mAP@0.5 increased from 79.9% to 83.2%, a gain of 3.3%, and mAP@0.5:0.95 improved from 42.1% to 45.2%, a rise of 3.1%. For YOLOv8n, mAP@0.5 increased from 81.4% to 84.2%, a gain of 2.8%, and mAP@0.5:0.95 improved from 43.6% to 46.1%, a rise of 2.5%. For YOLOv9n, mAP@0.5 increased from 81.8% to 84.6%, a gain of 2.8%, and mAP@0.5:0.95 improved from 44.2% to 46.8%, a rise of 2.6%. For YOLOv10n, mAP@0.5 increased from 82.1% to 85.0%, a gain of 2.9%, and mAP@0.5:0.95 improved from 44.8% to 47.5%, a rise of 2.7%. These results indicate that the proposed improvement strategies achieve stable performance gains across all versions, with YOLOv5n showing the most significant improvement. In terms of model complexity and detection speed, the increases in model complexity and reductions in detection speed for all algorithms after introducing the improvement strategies remain within acceptable limits. This demonstrates that the proposed improvement strategies enhance detection accuracy across versions without significantly increasing computational overhead, while still meeting real-time detection requirements.

3.12. Visualization Results Analysis

To more intuitively demonstrate the detection accuracy of the proposed algorithm in real-world UUV target detection tasks, both the improved MBACA-YOLO model and the baseline YOLOv13n model were evaluated on the test set of the URPC dataset. Detection results from typical challenging underwater scenarios (uneven lighting, complex backgrounds, image blur, color distortion, low contrast, and target occlusion) were selected for targeted analysis.

As shown in Figure 10, the first row illustrates that under uneven lighting conditions, the proposed MBACA-YOLO model successfully filters out non-target objects that were incorrectly identified by the baseline YOLOv13n model. The second row in the figure demonstrates that MBACA-YOLO successfully identifies a sea urchin target that was missed by YOLOv13n.

As shown in Figure 11, the first row illustrates that in scenarios with complex backgrounds, the MBACA-YOLO model successfully filters out a scallop target that was incorrectly identified by YOLOv13n. The second row in the figure shows that MBACA-YOLO successfully identifies a scallop target that was missed by YOLOv13n.

As shown in Figure 12, the first row illustrates that under blurry image conditions, the MBACA-YOLO model successfully filters out two targets that were incorrectly identified by YOLOv13n. The second row in the figure demonstrates that MBACA-YOLO successfully identifies a sea urchin target that was missed by YOLOv13n.

As shown in Figure 13, the first row illustrates that in scenarios with target occlusion, the MBACA-YOLO model successfully filters out a sea cucumber target that was incorrectly identified by YOLOv13n. The second row in the figure shows that MBACA-YOLO successfully identifies a scallop target that was missed by YOLOv13n.

As shown in Figure 14, the first row demonstrates that in a color-distorted scenario, the MBACA-YOLO model successfully filters out a scallop target that was incorrectly recognized by YOLOv13n. The second row of the figure shows that MBACA-YOLO successfully identifies a scallop target that was missed by YOLOv13n.

As shown in Figure 15, the first row illustrates that in a low-contrast scenario, the MBACA-YOLO model successfully filters out a sea cucumber target that was incorrectly recognized by YOLOv13n. The second row of the figure demonstrates that MBACA-YOLO successfully identifies a sea cucumber target that was missed by YOLOv13n.

The visualization results indicate that in the challenging scenarios encountered during actual UUV operations (uneven lighting, complex backgrounds, image blur, target occlusion, color distortion, and low contrast), the proposed MBACA-YOLO algorithm produces fewer false detections and missed detections than the baseline YOLOv13n in the examples shown.

3.13. Cross-Dataset Generalization Validation

To verify the generalization ability of the proposed MBACA-YOLO algorithm beyond the URPC dataset, which has few target categories and relatively uniform target sizes, this section conducts a generalization validation on the RUOD dataset [47]. RUOD contains ten target categories (holothurian, echinus, scallop, starfish, fish, corals, diver, cuttlefish, turtle, and jellyfish) and covers small, medium, and large targets, so it allows a more comprehensive evaluation of generalization in underwater detection scenarios. The experimental setup remains consistent with the previous configurations, with YOLOv13n serving as the baseline algorithm.

(1): Comparative experiments on detection performance across different categories.

Both YOLOv13n and the proposed MBACA-YOLO were trained separately on the RUOD dataset to evaluate whether MBACA-YOLO performs better across the target categories in RUOD. The results are presented in Table 11. Compared with YOLOv13n, MBACA-YOLO improves mAP@0.5:0.95 for all ten target categories and mAP@0.5 for nine of them; the exception is jellyfish, whose mAP@0.5 drops from 0.795 to 0.773. The size of the gain varies widely, from 0.001 in mAP@0.5 for fish to 0.193 for corals. These results suggest that the proposed algorithm generally outperforms the baseline on target categories beyond those in URPC, supporting its generalization capability across multiple target categories.
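A per-category comparison of this kind is easy to check mechanically. The sketch below, with values copied from Table 11, lists every category and metric for which MBACA-YOLO scores lower than YOLOv13n; the helper name `regressions` and the data layout are illustrative assumptions rather than part of the paper's code.

```python
# Values copied from Table 11 (fractions in [0, 1]).
# Field order: YOLOv13n mAP@0.5, MBACA-YOLO mAP@0.5,
#              YOLOv13n mAP@0.5:0.95, MBACA-YOLO mAP@0.5:0.95.
TABLE_11 = {
    "holothurian": (0.764, 0.862, 0.488, 0.677),
    "echinus":     (0.856, 0.889, 0.488, 0.594),
    "scallop":     (0.844, 0.891, 0.569, 0.582),
    "starfish":    (0.843, 0.851, 0.568, 0.619),
    "fish":        (0.790, 0.791, 0.503, 0.513),
    "corals":      (0.779, 0.972, 0.603, 0.751),
    "diver":       (0.899, 0.958, 0.795, 0.908),
    "cuttlefish":  (0.944, 0.957, 0.832, 0.844),
    "turtle":      (0.945, 0.995, 0.846, 0.889),
    "jellyfish":   (0.795, 0.773, 0.583, 0.602),
}


def regressions(table):
    """Return (category, metric) pairs where MBACA-YOLO is below the baseline."""
    found = []
    for category, (base50, ours50, base5095, ours5095) in table.items():
        if ours50 < base50:
            found.append((category, "mAP@0.5"))
        if ours5095 < base5095:
            found.append((category, "mAP@0.5:0.95"))
    return found


print(regressions(TABLE_11))  # [('jellyfish', 'mAP@0.5')]
```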

(2): Analysis of multi-scale target detection results.

Both the YOLOv13n algorithm and the proposed MBACA-YOLO algorithm were applied to detect targets of different scales, aiming to verify whether the MBACA-YOLO algorithm can demonstrate superior performance across targets of varying sizes. The turtle was selected as a representative of large-sized targets, the jellyfish as a representative of medium-sized targets, and the fish as a representative of small-sized targets. The detection results are shown in Figure 16. The results indicate that the proposed MBACA-YOLO algorithm achieves higher detection confidence for large, medium, and small-sized targets compared to the YOLOv13n algorithm. These findings validate the generalization capability of MBACA-YOLO in detecting targets across different scales.

4. Conclusions

To address the issues of missed and false detections caused by uneven lighting, blur, complex backgrounds, and target occlusion in underwater optical images during actual UUV operations, this paper proposes a high-precision target detection algorithm named MBACA-YOLO. This provides a robust and lightweight technical solution for UUV underwater intelligent perception systems. First, based on the concept of SPD-Conv, convolution optimization was applied to four convolutional layers in the backbone network, enhancing the algorithm’s feature extraction capability. Next, the traditional upsampling module in the neck network was replaced with the CARAFE upsampling module, significantly reducing interference introduced during the upsampling process. Then, the newly proposed MBACA attention mechanism module was integrated into the final layer of the backbone network to assist the algorithm in processing extracted features—amplifying useful feature expressions while suppressing interference. Finally, the CIoU loss function was redesigned by combining the Alpha-IoU and Focal-IoU mechanisms, resulting in the proposed Alpha-Focal-CIoU loss function, which addresses the decline in bounding box regression accuracy.
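For readers unfamiliar with SPD-Conv [29], the space-to-depth (SPD) step can be sketched in a few lines. This is a minimal illustration on nested Python lists, not the authors' implementation: the function name `space_to_depth`, the nested-list tensor layout (C x H x W), and the output channel ordering are all assumptions made for the example. In SPD-Conv the rearranged tensor is then passed to a stride-1 convolution, which is omitted here.

```python
def space_to_depth(x, scale=2):
    """Rearrange a C x H x W nested list into (C*scale^2) x (H/scale) x (W/scale).

    Each output channel is one of the scale*scale strided sub-grids of an
    input channel, so spatial resolution is reduced without discarding any
    values (unlike a stride-2 convolution or pooling layer).
    """
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    if height % scale or width % scale:
        raise ValueError("H and W must be divisible by scale")
    out = []
    for dy in range(scale):          # row offset of the sub-grid
        for dx in range(scale):      # column offset of the sub-grid
            for c in range(channels):
                out.append([row[dx::scale] for row in x[c][dy::scale]])
    return out


# One 4 x 4 channel becomes four 2 x 2 channels.
feature_map = [[[0, 1, 2, 3],
                [4, 5, 6, 7],
                [8, 9, 10, 11],
                [12, 13, 14, 15]]]
rearranged = space_to_depth(feature_map)
print(len(rearranged), len(rearranged[0]), len(rearranged[0][0]))  # 4 2 2
print(rearranged[0])  # [[0, 2], [8, 10]]
```

Because every input value survives the rearrangement, fine detail from small or blurry targets remains available to the following convolution, which is the motivation the SPD-Conv design gives for avoiding strided downsampling.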

To comprehensively evaluate the performance of MBACA-YOLO, ablation studies, comparative experiments, and visualization analysis were conducted on the URPC dataset. Compared with the baseline YOLOv13n, MBACA-YOLO improves mAP@0.5 by 3.1 percentage points and mAP@0.5:0.95 by 2.8 percentage points, reducing false positives and missed detections. This is achieved at a cost of 0.49M additional parameters and 1.0 additional GFLOPs, with FPS falling from 56 to 53 (Table 8), so neither model complexity nor detection speed changes substantially. The model shows better detection accuracy and robustness than the baseline in complex underwater scenarios. The algorithm can be integrated into UUV perception systems to provide technical support for autonomous UUV operations, and it shows promise for improving the automation and intelligence of underwater electromechanical equipment in engineering applications.

Author Contributions

Conceptualization, C.H. and T.S.; methodology, S.C.; software, S.C. and C.G.; validation, S.C., T.S. and C.G.; formal analysis, C.H., T.S. and C.G.; investigation, S.C.; resources, C.H. and C.G.; data curation, S.C. and T.S.; writing—original draft preparation, S.C.; writing—review and editing, C.H. and T.S.; visualization, S.C.; project administration, C.G.; funding acquisition, C.H. and T.S. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Program for Young Talents of Basic Research in Universities of Heilongjiang Province (YQJH2024077).

Data Availability Statement

Data are contained within the article; the code is confidential.

Conflicts of Interest

Chuang Han and Chengli Guo were employed by Sunny Group Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Liu, P.; Qin, L.; Li, G.; Hou, D.; Zhu, Z. DVL-based Autonomous Navigation Method for Unmanned Undersea Vehicles. J. Unmanned Undersea Syst. 2023, 31, 373–380.
2. Anonymous. Special Purpose Underwater Research Vehicle (SPURV) Technical Report; University of Washington Applied Physics Laboratory: Seattle, WA, USA, 1957.
3. Wang, Z.; Qu, X.; Li, H. Research on cluster application and key technologies of autonomous underwater glider. CAAI Trans. Intell. Syst. 2024, 19, 1341–1350.
4. Burguera, A.; Oliver, G. High-Resolution Underwater Mapping Using Side-Scan Sonar. PLoS ONE 2016, 11, e0146396.
5. Glenn, W.H. Sea-Beam: A multiple-beam echo-sounder system for sea-floor mapping. Mar. Technol. Soc. J. 1970, 4, 19–23.
6. Brown, D.H. Application of the Synthetic Aperture Concept to High Resolution Sonar; National Academy of Sciences, National Research Council, Mine Advisory Committee: Washington, DC, USA, 1969.
7. McCulloch, W.S.; Pitts, W. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 1943, 5, 115–133.
8. Hinton, G.E.; Osindero, S.; Teh, Y.W. A fast learning algorithm for deep belief nets. Neural Comput. 2006, 18, 1527–1554.
9. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing System, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105.
10. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
11. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, PQ, Canada, 8–13 December 2014; pp. 2672–2680.
12. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
13. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788.
14. Li, B.; Huang, H.; Liu, J.; Liu, Z.; Wei, L. Underwater Optical Image Interested Object Detection Model Based on Improved SSD. J. Electron. Inf. Technol. 2022, 44, 3372–3378.
15. Liu, Y.; Wang, S.N. A quantitative detection algorithm based on improved faster R-CNN for marine benthos. Ecol. Inform. 2021, 61, 101228.
16. Gu, X.T.; Tang, S.Y.; Cao, Y.M.; Yu, C.D. Underwater object detection in sonar imagery with detection transformer and zero-shot neural architecture search. arXiv 2025, arXiv:2505.06694.
17. Zhai, Z.L.; Wang, K.K.; Su, S.Y. AquaFuse-net: An accuracy-efficiency balanced framework for underwater target detection with multi-scale feature enhancement. J. Supercomput. 2026, 82, 16.
18. Wang, W.L.; Yu, Z.B.; Huang, M.X. Refining features for underwater object detection at the frequency level. Front. Mar. Sci. 2025, 12, 1544839.
19. Han, H.; Fan, H.; Huang, X.; Han, C. Self-supervised multi-transformation learning for time series anomaly detection. Expert Syst. Appl. 2024, 253, 124339.
20. Feng, J.; Jin, T. CEH-YOLO: A composite enhanced YOLO-based model for underwater object detection. Ecol. Inform. 2024, 82, 102758.
21. Liu, X.; Zhao, K.; Liu, C.; Chen, L. Bi2F-YOLO: A novel framework for underwater object detection based on YOLOv7. Intell. Mar. Technol. Syst. 2025, 3, 9.
22. Wang, J.; Yeh, K.; Liao, H. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 746–755.
23. Yi, W.; Yang, J.; Yan, L. Research on underwater small target detection technology based on single-stage USSTD-YOLOv8n. IEEE Access 2024, 12, 69633–69641.
24. Ultralytics. YOLOv8: Real-Time Object Detection at 1000 FPS. Ultralytics Technical Report. 2023. Available online: https://docs.ultralytics.com/yolov8/ (accessed on 25 December 2023).
25. Zhang, H.R.; Feng, W.M.; Yang, L.X.; Ma, Y.J. CSAF-YOLO: An improved underwater small target detection algorithm based on YOLO11. J. Comput. Appl. 2026. Available online: https://link.cnki.net/urlid/51.1307.TP.20260108.1256.004 (accessed on 15 January 2026).
26. Jocher, G.; Qiu, J. Ultralytics YOLO11. 2024. Available online: https://github.com/ultralytics/ultralytics (accessed on 10 October 2024).
27. Liang, X.; Zhao, J.; Yu, H. Lightweight Underwater Target Detection Algorithm Based on YOLOv8. Infrared Technol. 2024, 46, 1015–1024.
28. Lei, M.; Li, S.; Wu, Y.; Hu, H. YOLOv13: Real-Time Object Detection with Hypergraph-Enhanced Adaptive Visual Perception. arXiv 2025, arXiv:2506.17733.
29. Sunkara, R.; Luo, T. No more strided convolutions or pooling: A new CNN building block for low-resolution images and small objects. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), Grenoble, France, 19–23 September 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 443–459.
30. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective Kernel Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 510–519.
31. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. CARAFE: Content-aware reassembly of features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 27–36.
32. He, J.; Erfani, S.; Ma, X.; Bailey, J.; Chi, Y.; Hua, X.-S. Alpha-IoU: A family of power intersection over union losses for bounding box regression. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–14 December 2021; Volume 34.
33. Han, J.; Gong, Y.; Wang, Q.; Zhang, Z.; Li, S. Focaler-IoU: More focused intersection over union loss. arXiv 2024, arXiv:2401.10525.
34. Bono, F.M.; Radicioni, L.; Cinquemani, S. A novel approach for quality control of automated production lines working under highly inconsistent conditions. Eng. Appl. Artif. Intell. 2023, 122, 106149.
35. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7132–7141.
36. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
37. Wang, B.; Peng, X.; Wang, Q.; Li, C.; Hou, H.; Cheng, M.-M. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11531–11539.
38. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 13713–13722.
39. Koohpayegani, S.A.; Pirsiavash, H. SimA: Simple softmax-free attention for vision transformers. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 2607–2616.
40. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, PQ, Canada, 7–12 December 2015; pp. 91–99.
41. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37.
42. Zeng, Z.; Qian, Y.; Zhang, X.; Cao, Y.; Han, K. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 20328–20337.
43. Ultralytics. YOLOv5: Real-Time Object Detection with Incremental Improvements. Ultralytics Technical Report, 2020. Available online: https://docs.ultralytics.com/yolov5/ (accessed on 20 December 2023).
44. Li, J.; Han, Y.; Dong, C.; Wang, Y.; Wang, J.; Zhang, Y.; Yang, S. YOLO-U: An underwater object detection algorithm based on structural reparameterization and dual attention mechanism. J. Shanghai Ocean. Univ. 2025, 34, 696–706.
45. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616.
46. Wang, A.; Chen, H.; Liu, L.H.; Chen, K.; Lin, Z.J.; Han, J.G.; Ding, G.G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458.
47. Fu, C.; Liu, R.; Fan, X. Rethinking general underwater object detection: Dataset, challenges and solutions. Neurocomputing 2023, 527, 243–256.

Figure 1. YOLOv13 Overall Architecture.

Figure 2. MBACA-YOLO Overall Architecture.

Figure 3. Optimization Implementation Process of a Single Convolutional Layer.

Figure 4. SKA Implementation Process.

Figure 5. MBACA Implementation Process.

Figure 6. CARAFE Implementation Process.

Figure 7. Samples of the URPC Dataset in Different Scenarios.

Figure 8. mAP@0.5 variation curves during the training process of the two algorithms.

Figure 9. Comparison of Heatmaps Before and After Embedding the MBACA Attention Mechanism.

Figure 10. Detection Results Under Uneven Lighting Conditions.

Figure 11. Detection Results Under Complex Background Conditions.

Figure 12. Detection Results Under Blurry Image Conditions.

Figure 13. Detection Results Under Target Occlusion Conditions.

Figure 14. Detection Results Under Color-Distorted Conditions.

Figure 15. Detection Results Under Low Contrast Conditions.

Figure 16. Comparative analysis of detection results for targets of different scales.

Table 1. Experimental Hyperparameter Settings.

Parameter Name	Parameter Setting
Image Size	640 × 640
Epochs	300
Batch Size	32
Weight Decay	0.0005

Table 2. Experimental Results of Different Combinations of Optimizers and Learning Rate Strategies.

Parameter Combinations	P (%)	R (%)	mAP@0.5 (%)	mAP@0.5:0.95 (%)
SGD + Fixed LR	82.4	76.0	82.4	47.3
SGD + Cosine Annealing LR	81.9	75.5	81.8	46.8
Adam + Fixed LR	81.5	74.9	81.2	46.2
Adam + Cosine Annealing LR	80.7	74.1	80.5	45.6

Table 3. Comparative Experiments on Multiple Positions of the Attention Mechanism.

Integration Position of MBACA	P (%)	R (%)	mAP@0.5 (%)	mAP@0.5:0.95 (%)	Params (M)	GFLOPs	Inference Latency (ms)	FPS
Behind the B3 Layer	81.2	75.1	82.3	46.8	2.58	6.2	21.3	47
Behind the B4 Layer	82.7	76.8	83.8	48.1	2.61	6.4	20.1	50
Behind the B5 Layer	83.5	78.3	85.0	49.5	2.94	7.2	20.2	54

Table 4. Comparative Experimental Results of Different Attention Mechanisms.

Model	P (%)	R (%)	mAP@0.5 (%)	mAP@0.5:0.95 (%)	Params (M)	GFLOPs	Inference Latency (ms)	FPS
YOLOv13n	82.4	76.0	82.4	47.3	2.45	6.2	18.6	56
YOLOv13n+SE	83.0	77.9	84.1	48.9	3.01	8.0	21.3	41
YOLOv13n+CBAM	82.1	78.0	84.5	49.3	2.97	7.4	22.5	40
YOLOv13n+ECA	83.1	76.9	83.9	48.5	3.17	7.4	19.2	42
YOLOv13n+CA	82.7	77.5	83.3	48.2	2.99	7.8	20.7	42
YOLOv13n+SimA	82.0	78.6	84.1	48.5	3.25	7.9	21.0	41
YOLOv13n+SKA	81.8	77.4	84.3	48.7	3.02	7.6	23.1	41
YOLOv13n+MBACA	83.5	78.3	85.0	49.5	2.94	7.2	20.2	54

Table 5. Experimental Results for Incremental Values of α with γ = 2.0.

α Values	P (%)	R (%)	mAP@0.5 (%)	mAP@0.5:0.95 (%)
1.0	82.5	77.2	83.8	48.2
2.0	83.6	78.0	84.9	49.5
3.0	84.1	78.5	85.5	50.1
4.0	83.2	77.8	84.2	49.0
5.0	82.1	77.0	83.5	48.0

Table 6. Experimental Results for Incremental Values of γ with α = 1.0.

γ Values	P (%)	R (%)	mAP@0.5 (%)	mAP@0.5:0.95 (%)
0.5	82.0	76.8	83.2	47.5
1.0	82.3	77.0	83.5	48.0
1.5	82.4	77.1	83.7	48.1
2.0	82.5	77.2	83.8	48.2
2.5	82.2	77.0	83.6	48.0

Table 7. Ablation Experiment Results—Precision Metrics.

YOLOv13n	Convolution Optimization	CARAFE	MBACA	Alpha-Focal-CIoU	P (%)	R (%)	mAP@0.5 (%)	mAP@0.5:0.95 (%)
√					82.4	76.0	82.4	47.3
√	√				82.5	76.8	83.5	48.3
√	√	√			82.6	77.9	84.1	48.9
√	√	√	√		83.5	78.3	85.0	49.5
√	√	√	√	√	84.1	78.5	85.5	50.1
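The cumulative rows of Table 7 can be turned into step-by-step gains for each added component. The sketch below does this arithmetic; the list `STAGES` and the helper `marginal_gains` are illustrative names, and the values are copied from Table 7. Because the components were added in a fixed order, each difference is the gain of that component on top of the preceding ones, not an order-independent contribution.

```python
# Cumulative configurations from Table 7: (label, mAP@0.5 %, mAP@0.5:0.95 %).
STAGES = [
    ("YOLOv13n baseline", 82.4, 47.3),
    ("+ Convolution Optimization", 83.5, 48.3),
    ("+ CARAFE", 84.1, 48.9),
    ("+ MBACA", 85.0, 49.5),
    ("+ Alpha-Focal-CIoU", 85.5, 50.1),
]


def marginal_gains(stages):
    """Return (label, d_mAP@0.5, d_mAP@0.5:0.95) for each added component,
    in percentage points relative to the previous row."""
    gains = []
    for (_, prev50, prev5095), (label, cur50, cur5095) in zip(stages, stages[1:]):
        gains.append((label, round(cur50 - prev50, 1), round(cur5095 - prev5095, 1)))
    return gains


for label, d50, d5095 in marginal_gains(STAGES):
    print(f"{label:28s} mAP@0.5 {d50:+.1f}  mAP@0.5:0.95 {d5095:+.1f}")

# The step gains sum to the overall improvement over the baseline.
total50 = round(sum(g[1] for g in marginal_gains(STAGES)), 1)
total5095 = round(sum(g[2] for g in marginal_gains(STAGES)), 1)
print(total50, total5095)  # 3.1 2.8
```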

Table 8. Ablation Experiment Results—Model Complexity and Speed Metrics.

YOLOv13n	Convolution Optimization	CARAFE	MBACA	Alpha-Focal-CIoU	Params (M)	GFLOPs	Inference Latency (ms)	FPS
√					2.45	6.2	18.6	56
√	√				2.54	6.3	17.7	56
√	√	√			2.73	6.8	18.2	56
√	√	√	√		2.94	7.2	20.2	54
√	√	√	√	√	2.94	7.2	20.3	53

Table 9. Comparative Experiment Results.

Model	P (%)	R (%)	mAP@0.5 (%)	mAP@0.5:0.95 (%)	Params (M)	GFLOPs	Inference Latency (ms)	FPS
Faster R-CNN	82.5	48.5	74.6	40.7	42.0	160.0	20.0	50
SSD	69.7	62.5	61.7	34.9	19.0	85.0	32.3	31
RT-DETR	83.1	73.9	81.5	45.2	28.0	90.0	19.6	51
YOLOv5n	84.0	73.1	79.9	42.1	1.9	4.5	21.3	47
YOLOv8n	83.2	72.9	81.4	43.6	2.2	5.0	20.4	49
COT-YOLO	77.9	69.4	73.7	42.0	3.1	7.2	18.9	53
CSAF-YOLO	83.8	78.1	85.0	48.5	3.2	7.6	23.0	30
YOLOv13n	82.4	76.0	82.4	47.3	2.45	6.2	18.6	56
MBACA-YOLO (Ours)	84.1	78.5	85.5	50.1	2.94	7.2	20.3	53

Table 10. Experimental results before and after improvements for each algorithm version.

Model	P (%)	R (%)	mAP@0.5 (%)	mAP@0.5:0.95 (%)	Params (M)	GFLOPs	Inference Latency (ms)	FPS
YOLOv5n	84.0	73.1	79.9	42.1	1.9	4.5	21.3	47
MBACA-YOLOv5n	86.3	75.5	83.2	45.2	2.42	5.6	23.0	44
YOLOv8n	83.2	72.9	81.4	43.6	2.2	5.0	20.4	49
MBACA-YOLOv8n	85.1	74.8	84.2	46.1	2.71	6.1	22.0	46
YOLOv9n	82.8	73.5	81.8	44.2	2.3	5.3	19.6	51
MBACA-YOLOv9n	84.7	75.3	84.6	46.8	2.77	6.2	20.7	48
YOLOv10n	82.5	73.8	82.1	44.8	2.35	5.5	19.2	52
MBACA-YOLOv10n	84.8	76.0	85.0	47.5	2.86	6.6	20.5	49

Table 11. Per-category results on the RUOD dataset for the YOLOv13n and MBACA-YOLO algorithms (values are fractions between 0 and 1).

Target Category	YOLOv13n mAP@0.5	MBACA-YOLO mAP@0.5	YOLOv13n mAP@0.5:0.95	MBACA-YOLO mAP@0.5:0.95
holothurian	0.764	0.862	0.488	0.677
echinus	0.856	0.889	0.488	0.594
scallop	0.844	0.891	0.569	0.582
starfish	0.843	0.851	0.568	0.619
fish	0.790	0.791	0.503	0.513
corals	0.779	0.972	0.603	0.751
diver	0.899	0.958	0.795	0.908
cuttlefish	0.944	0.957	0.832	0.844
turtle	0.945	0.995	0.846	0.889
jellyfish	0.795	0.773	0.583	0.602
