1. Introduction
With the continuous advancement of global marine resource exploitation and strategic security demands, underwater object detection technology has become a core pillar of marine exploration, military security, and emergency rescue operations [1,2]. From shipwreck recovery and underwater infrastructure inspection to submarine tracking and drowning victim search missions, rapid and precise object recognition capabilities are critical to mission success. However, underwater environments inherently suffer from significant optical attenuation, turbid water, and complex lighting conditions, rendering traditional optical imaging techniques inadequate for acquiring clear images. Synthetic Aperture Sonar (SAS) [3,4,5], an underwater imaging technology that virtually synthesizes a very large aperture through the motion of the sonar platform, plays an important role in underwater object detection. In SAS, a small sonar array continuously transmits and receives sound waves while moving; by accurately recording the platform position and echo phase, the received signals along the track are coherently superimposed to synthesize the equivalent of an ultra-long physical aperture. SAS thereby achieves long-range, high-resolution imaging whose sharpness does not degrade with distance, making it especially suitable for the fine detection of mines, pipelines, sunken ships, and other targets [6]. Sonar technology, leveraging the strong penetration capability of sound waves and exceptional resistance to environmental interference, has emerged as an irreplaceable solution for underwater detection. By actively emitting acoustic waves and analyzing their echoes, sonar systems can effectively reconstruct the contours, textures, and spatial distribution of underwater objects [7,8], providing foundational data for object detection and localization in complex scenarios [9]. As the volume of underwater sonar image data requiring rapid detection continues to grow, the demand for both speed and accuracy in detection methods has intensified. Moreover, challenges such as acoustic multipath reflections, speckle artifacts induced by environmental noise, and shadow zones behind objects caused by side-scan sonar characteristics [10,11] further complicate high-accuracy object detection.
The development of object detection technology for SAS images can be divided into two major phases: traditional image processing and deep learning. Traditional methods primarily rely on physical models and signal processing techniques [4,12,13,14], with typical workflows including manual feature extraction steps such as threshold segmentation, edge detection, and morphological filtering. For instance, object regions are segmented by setting grayscale thresholds, edge contours are extracted using Sobel or Canny operators, and object localization is completed via region growing or template matching. However, most of these methods depend on physical models, signal processing approaches, and expert-designed feature rules [12,13,15,16], exhibiting poor adaptability to the speckle noise, shadow artifacts, and low-contrast objects prevalent in sonar images. This results in low detection efficiency and insufficient generalization. Consequently, with the advent of deep learning, traditional sonar image processing techniques have gradually been phased out in favor of deep learning-based methods [17,18,19,20,21,22], which now dominate sonar image analysis.
Advancements in deep learning, particularly architectures such as Convolutional Neural Networks (CNNs) [23] and Transformers [24], have demonstrated exceptional performance in processing image and video data and are commonly applied to tasks in computer vision and natural language processing [25]. Deep learning excels at learning invariant features directly from data, offering superior robustness and reducing reliance on manual intervention; deep learning-based image processing methods therefore hold significant advantages over traditional approaches. Current CNN-based techniques for sonar images fall into single-stage and two-stage detection frameworks. Representative single-stage models include the YOLO series [2,26,27,28,29] and the Single Shot MultiBox Detector (SSD) [30], while two-stage detection is exemplified by Faster R-CNN [31,32] and Mask R-CNN [33]. Single-stage methods generate object classification results and localization coordinates directly from global image features; their architectural design eliminates region-proposal generation, enabling more efficient inference than two-stage methods. However, such methods usually show a slight degradation in detection accuracy, a limitation that mainly stems from their end-to-end prediction mechanism having to process dense spatial samples simultaneously, which significantly increases computational resource consumption [34]. Both one-stage and two-stage deep learning methods have been applied to underwater object detection in SAS images; specifically, YOLO [35,36], Faster R-CNN [37], and SSD [38] are commonly used for SAS image target detection.
The introduction of Transformer technology has opened new optimization pathways for sonar image processing. Its core strength lies in establishing long-range dependencies among image pixels through self-attention mechanisms, significantly enhancing suppression of speckle noise and reverberation artifacts in sonar images. For sonar video sequence analysis, Transformers can model spatiotemporal features concurrently, leveraging temporal attention weights to align multi-frame object trajectories and mitigate missed detections of dynamic objects.
Although these models have significantly advanced object detection technology, they still face numerous challenges when processing underwater image data, including water scattering and noise interference. Underwater sonar images typically exhibit low contrast between objects and backgrounds, along with blurred edges. Additionally, acoustic multipath reflections introduce speckle noise and artifacts into the images, as shown in Figure 1. Furthermore, the substantial size variation of underwater objects in imagery makes it difficult for models to adapt to multi-scale data, making the handling of multi-scale variation another critical challenge.
In this paper, a high-accuracy underwater object detection algorithm (HAUOD) for SAS images is proposed, building on YOLOv8. First, a dedicated preprocessing module and data enhancement strategy are designed to address the low background contrast and blurred edges of underwater sonar images. Then, the C2fD module, which fuses differential features, is proposed to strengthen the model's ability to capture details. Furthermore, an underwater multi-scale contextual attention mechanism is designed to further improve the model's sensitivity to weak objects. Experimental results on the Sonar Common Target Dataset (SCTD) show that HAUOD achieves 95.1% recognition accuracy, surpassing the YOLOv8n baseline by 8.3 percentage points, with stronger detection consistency for small underwater objects and heightened adaptability to varied marine conditions. Compared with YOLOv8s, the proposed HAUOD algorithm achieves higher accuracy with a considerably smaller model size and nearly half the computational complexity. Moreover, HAUOD achieves higher accuracy than YOLOv10 and YOLOv11 while maintaining an appropriate model size and computational efficiency. These results demonstrate that the proposed method offers significant advantages in balancing computational efficiency and accuracy compared with mainstream detection models. The main contributions of this paper are as follows:
(1) To address the challenges of low contrast, noise interference, and blurred edges in Synthetic Aperture Sonar (SAS) images, a high-accuracy underwater object detection algorithm, named HAUOD, is proposed. HAUOD first processes data through an image preprocessing module, then enhances the C2f module with the novel C2fD design, and finally introduces the Underwater Multi-scale Contextual Attention mechanism (UWA) to achieve efficient and robust underwater object detection.
(2) The SAS images are preprocessed and the data augmentation is optimized. The strategy combines three preprocessing methods, Contrast Limited Adaptive Histogram Equalization (CLAHE), non-local means denoising, and frequency-domain band-pass filtering, into a three-level cascade that enhances image contrast and mitigates edge blur. CLAHE effectively enhances local image contrast, non-local means denoising suppresses noise, and frequency-domain band-pass filtering sharpens object edges by emphasizing regions of interest. Although traditional Mosaic data augmentation works well for optical images, applying it directly to low-contrast, noise-corrupted sonar images risks distorting object features; dedicated SAS image preprocessing is therefore introduced before Mosaic augmentation to ensure stable retention of object information.
(3) To tackle edge blurring in underwater sonar images, the C2fD module is proposed. C2fD uses optimized spatial differencing to extract object edge information, addressing edge blurring and missing texture in sonar images, and then applies an Enhanced Efficient Channel Attention (Enhanced ECA) mechanism and a lightweight feature fusion strategy to prevent imbalance between base features and edge features. This effectively enhances the model's object recognition ability in low-contrast, heavily noise-corrupted underwater environments. In addition, the multi-scale feature extraction and fusion strategy handles the significant variation in underwater target size while balancing detection speed and accuracy.
(4) The Underwater Multi-scale Contextual Attention Mechanism, named UWA, is designed to enhance the effectiveness of the C2fD module. It integrates adaptive noise suppression and hierarchical dilated convolution groups to capture multi-scale contextual features. Channel-spatial dual-dimensional attention collaboration is then applied to amplify responses in object regions. Finally, dynamic gated residual fusion balances contributions from original and enhanced features, significantly improving sensitivity to faint objects.
2. The Proposed HAUOD Algorithm
2.1. Overall Architecture
The proposed HAUOD model is built upon the YOLOv8 architecture, with the overall algorithmic framework illustrated in Figure 2. To address the challenges of low contrast, speckle noise, and low-frequency background interference in sonar images, a multi-strategy sonar preprocessing module is designed. This module first applies adaptive CLAHE [39] enhancement to the luminance channel in the LAB color space to optimize local contrast while suppressing the artifacts caused by global equalization. Subsequently, non-local means denoising [40] is employed to reduce speckle noise while preserving object edge structures. Finally, frequency-domain band-pass filtering [41] is introduced to mitigate low-frequency background interference. This three-stage cascaded pipeline effectively optimizes contrast, suppresses artifacts, and minimizes low-frequency disturbances in sonar images [42].
The proposed C2fD architecture addresses the essential requirements of contour capture and underwater object scale adaptation by selectively substituting specific C2f components within the core network. This upgrade enables synergistic feature enhancement and noise suppression while refining the preprocessed high-quality features. The C2fD module incorporates a spatial differential feature extraction component that explicitly computes horizontal and vertical gradients using fixed Sobel operators [43] to amplify edge responses. An Enhanced Efficient Channel Attention (Enhanced ECA) [44] mechanism is introduced, replacing 1D convolution with fully connected layers and optimizing global channel interactions through a reduction ratio of four. This enhancement improves small object detection accuracy and prevents missed detections caused by underwater object size variations. A dynamic fusion [45,46] module is further implemented to adaptively weight and combine base features with differential features, preventing gradient vanishing.
To address potential information loss during backbone network integration, the Underwater Multi-scale Contextual Attention (UWA) mechanism is developed. This attention mechanism employs hierarchical dilated convolution groups to model multi-scale contextual information and integrates channel-spatial dual-dimensional attention to suppress background interference. Dynamic gated residual fusion ensures balanced noise suppression and feature preservation, effectively resolving missed detections caused by multi-scale object distributions, high-frequency scattering noise, and background interference while maintaining contextual integrity. Overall, these technical enhancements greatly improve model robustness when operating in challenging underwater environments.
2.2. SonarMosaic Strategy
Aiming at the low contrast, high-frequency noise interference, and complex background characteristics of sonar images, this paper proposes a three-stage cascaded preprocessing strategy that enhances object detectability through cooperative enhancement and noise suppression. The preprocessing module contains three stages: adaptive CLAHE enhancement, non-local means denoising, and frequency-domain band-pass filtering. The specific steps are shown in Figure 3.
Objects in underwater sonar imagery are usually characterized by low contrast, high noise, and blurred edges, so traditional histogram equalization introduces additional noise when enhancing contrast, causing loss of detail in the object area or over-amplification of background information. Traditional histogram equalization enhances image contrast by adjusting the distribution of pixel values, but because it equalizes over the entire image, it easily leads to over-enhancement or loss of detail in local areas. In contrast, CLAHE employs a local contrast enhancement strategy: it segments the image into multiple small regions, computes local histogram equalization for each, and applies a contrast limit to prevent over-enhancement, effectively improving the gradient strength of the object region while reducing the effect of noise. The method adaptively adjusts contrast within different local regions, achieving uniform luminance enhancement while avoiding the artifacts of global histogram equalization. The BGR image is first converted to LAB color space, the CLAHE transform is applied only to the luminance channel (L) to avoid affecting the color information, and the enhanced luminance channel is then merged with the original A and B channels and converted back to BGR format. However, CLAHE enhancement also makes noise more visible, so non-local means denoising follows to form a synergistic "enhancement-purification" process. Non-local means denoising weights pixels by neighborhood similarity, preserving image edges and structural information while suppressing high-frequency noise.
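As an illustration, the LAB-space CLAHE step can be sketched with OpenCV as follows; the clip limit and tile grid are illustrative defaults, not values reported in this paper:

```python
import cv2

def clahe_lab(img_bgr, clip_limit=2.0, tile_grid=(8, 8)):
    """Apply CLAHE to the luminance channel only, leaving color untouched.

    A minimal sketch of the first preprocessing stage; clip_limit and
    tile_grid are illustrative assumptions, not the paper's settings.
    """
    lab = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid)
    l_eq = clahe.apply(l)  # local, contrast-limited equalization on L only
    return cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)
```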
Formally, the denoised value of pixel $p$ is
$$\tilde{v}(p) = \frac{1}{Z(p)} \sum_{q \in \Omega} w(p, q)\, v(q), \qquad w(p, q) = \exp\!\left(-\frac{\|P(p) - P(q)\|_2^2}{h^2}\right)$$
where $\tilde{v}(p)$ is the denoised value of the pixel, $Z(p) = \sum_{q} w(p, q)$ is the normalization factor that ensures the weights sum to 1, $P(p)$ and $P(q)$ are the pixel blocks in the neighborhoods of pixels $p$ and $q$, and $h$ is the decay coefficient, which controls how fast the similarity weights fall off. The three BGR channels are processed simultaneously to avoid the color distortion caused by single-channel denoising, and multi-threaded optimization greatly reduces the per-image denoising time.
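A minimal sketch of this denoising stage, using OpenCV's colored non-local means implementation (the filter strength and window sizes shown are assumptions, not the paper's settings):

```python
import cv2

def nlm_denoise(img_bgr, h=10, template=7, search=21):
    """Second stage: non-local means denoising on all three BGR channels.

    h is the decay coefficient controlling how fast similarity weights fall
    off; template/search are the patch and search-window sizes. The values
    here are illustrative assumptions.
    """
    return cv2.fastNlMeansDenoisingColored(img_bgr, None, h, h, template, search)
```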
Traditional mean filtering [47] and Gaussian filtering [48] denoise by averaging pixels over a local region, but they also lose image detail and blur object boundaries. Non-local means denoising overcomes these limitations by exploiting similarity between pixels and computing weighted averages over similar patches, retaining object edges and structural information while denoising effectively, which makes it especially suitable for the complex backgrounds of underwater sonar images. Frequency-domain band-pass filtering then attacks the noise from a different dimension, forming a dual noise-reduction mechanism of spatial-domain purification and frequency-domain focusing that significantly improves the object signal-to-noise ratio.
Figure 3 illustrates the implementation workflow of the spectral band-pass filtering, where a two-dimensional discrete Fourier transform (DFT) is applied to the noise-reduced input f(x,y). This frequency-domain transformation enables selective signal filtering through amplitude-spectrum manipulation:
$$F(u,v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x,y)\, e^{-j 2\pi \left( \frac{ux}{M} + \frac{vy}{N} \right)}$$
The frequency-domain component F(u,v) represents the complex spectral value at coordinates (u,v), while f(x,y) is the spatial-domain intensity at position (x,y), with M and N specifying the image's vertical and horizontal resolutions, respectively.
Then, a band-pass filter H(u,v) is designed to construct the frequency mask, defined as
$$H(u,v) = \begin{cases} 1, & D(u,v) \le R \\ 0, & \text{otherwise} \end{cases}, \qquad D(u,v) = \sqrt{(u - u_0)^2 + (v - v_0)^2}$$
where $(u_0, v_0)$ is the center of the frequency domain and R is a preset radius that controls the range of retained frequencies. Filtering and inverse conversion first multiply the frequency-domain image with the mask to obtain the filtered spectrum:
$$G(u,v) = F(u,v) \cdot H(u,v)$$
The inverse discrete Fourier transform (IDFT) subsequently restores the data to the spatial domain:
$$f'(x,y) = \mathcal{F}^{-1}\{G(u,v)\} = \frac{1}{MN} \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} G(u,v)\, e^{j 2\pi \left( \frac{ux}{M} + \frac{vy}{N} \right)}$$
where G(u,v) is the filtered frequency-domain image and $\mathcal{F}^{-1}$ denotes the inverse Fourier transform.
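The filtering chain above (DFT, mask multiplication, IDFT) can be sketched with NumPy as follows; the inner and outer radii are illustrative assumptions standing in for the preset radius R:

```python
import numpy as np

def bandpass_filter(gray, r_low=5, r_high=60):
    """Third stage: ideal band-pass filtering in the frequency domain.

    Suppresses low-frequency background (inside r_low) and very high
    frequencies (outside r_high). The radii are assumptions for the sketch.
    """
    f = np.fft.fftshift(np.fft.fft2(gray.astype(np.float32)))
    h_, w_ = gray.shape
    cy, cx = h_ // 2, w_ // 2
    y, x = np.ogrid[:h_, :w_]
    d = np.sqrt((y - cy) ** 2 + (x - cx) ** 2)  # distance to spectrum center
    mask = (d >= r_low) & (d <= r_high)         # annular band-pass mask H(u, v)
    g = f * mask                                # G(u, v) = F(u, v) * H(u, v)
    out = np.real(np.fft.ifft2(np.fft.ifftshift(g)))
    return np.clip(out, 0, 255).astype(np.uint8)
```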
This process preserves object information in the frequency domain while effectively suppressing noise and low-frequency background interference. Because traditional Mosaic augmentation suffers from severe noise superposition and object distortion when directly splicing low-quality sonar images, the three-level preprocessing is embedded before Mosaic to ensure the clarity and object integrity of the spliced sub-images, yielding the SonarMosaic module. This module preserves the diversity of the original data while further strengthening object features and anti-interference ability through preprocessing, which greatly improves the feature extraction accuracy of the subsequent network.
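A simplified sketch of the SonarMosaic idea, chaining the three preprocessing sketches above before stitching four images into a 2 × 2 mosaic (label/box remapping and random scaling are omitted; the fixed layout is an assumption):

```python
import cv2
import numpy as np

def sonar_mosaic(images, size=640):
    """Hypothetical SonarMosaic sketch: preprocess four sonar images with
    the three-stage cascade, then stitch them into one 2x2 mosaic sample."""
    def preprocess(img):
        img = clahe_lab(img)                      # stage 1: local contrast
        img = nlm_denoise(img)                    # stage 2: NLM denoising
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        return cv2.cvtColor(bandpass_filter(gray), cv2.COLOR_GRAY2BGR)

    canvas = np.zeros((size * 2, size * 2, 3), dtype=np.uint8)
    for k, img in enumerate(images[:4]):
        tile = cv2.resize(preprocess(img), (size, size))
        r, c = divmod(k, 2)
        canvas[r * size:(r + 1) * size, c * size:(c + 1) * size] = tile
    return canvas
```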
2.3. The Proposed C2fD Module
To improve the detection performance of YOLOv8 on the SCTD sonar image dataset, this paper presents the C2fD module, an improvement of the original C2f module. Through an optimized spatial difference module, dynamic feature fusion, and enhanced efficient channel attention, C2fD strengthens the model's ability to detect low-contrast, small, and edge-ambiguous objects while maintaining computational efficiency. The specific workflow is shown in Figure 4.
Underwater sonar images suffer from blurred object edges, low contrast, and background noise interference, and a traditional convolutional neural network (CNN) relies on adaptively learned convolutional kernels to extract edge features implicitly, which is susceptible to noise contamination and leads to feature confusion. This paper therefore proposes an explicit spatial gradient enhancement module (OptimizedSpatialDifference) that significantly improves the model's ability to represent sonar image edges through a three-phase operation of channel compression, bi-directional gradient extraction, and feature expansion. Its core design is as follows:
First, channel compression and noise suppression are implemented by mean pooling the input feature map along the channel dimension:
$$F_{avg}(h, w) = \frac{1}{C} \sum_{c=1}^{C} F(c, h, w)$$
This cross-channel information fusion suppresses the channel-specific noise caused by scattering in the sonar image while reducing the computational cost to 1/C of the original input. Experiments show that this step reduces the intensity of the background-noise response. After channel compression, bi-directional gradient feature extraction explicitly enhances the horizontal and vertical edge responses: Sobel convolution kernels with fixed weights compute the horizontal and vertical gradients.
The horizontal Sobel kernel is
$$K_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}$$
and the vertical Sobel kernel is
$$K_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}$$
The smoothing property of the Sobel operator additionally suppresses high-frequency noise interference. Here $F_{avg} \in \mathbb{R}^{1 \times H \times W}$ denotes the single-channel feature map after channel-wise average pooling; C represents the channel count, H the vertical dimension, W the horizontal spatial extent, and G the bidirectional (horizontal and vertical) gradient components generated by the convolution.
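For concreteness, the channel mean pooling and fixed-weight Sobel gradient extraction can be sketched in PyTorch as below; the class name and exact wiring are illustrative, not the authors' implementation:

```python
import torch
import torch.nn as nn

class SpatialDifferenceSketch(nn.Module):
    """Sketch of the explicit gradient branch: channel mean pooling followed
    by fixed-weight Sobel convolutions. The paper's OptimizedSpatialDifference
    additionally expands and reweights the resulting features."""
    def __init__(self):
        super().__init__()
        kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        ky = kx.t().contiguous()  # vertical Sobel is the transpose of horizontal
        self.register_buffer("weight", torch.stack([kx, ky]).unsqueeze(1))  # (2,1,3,3)

    def forward(self, x):                              # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)              # channel mean pooling -> (B,1,H,W)
        return nn.functional.conv2d(avg, self.weight, padding=1)  # (B,2,H,W): Gx, Gy
```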
In the feature enhancement stage, the dual-channel gradient features maintain dimensional consistency with the original input channel number through the channel dimension expansion strategy, which lays the structural foundation for multi-feature fusion. The process constructs a composite feature expression space by hierarchically splicing the horizontal and vertical gradient features with the original features. The fused feature vectors are then input to the subsequent convolutional layers for deep feature extraction. In order to optimize the feature quality, the OptimizedSpatialDifference module introduces a dynamic weight allocation mechanism, which achieves selective suppression of redundant feature responses through adaptive threshold constraints, thus enhancing the feature contribution of effective gradient information.
The dynamic feature fusion module achieves context-aware integration of core and complementary characteristics through attention-driven weight adaptation. This approach has lower computational complexity than conventional feature pyramid networks while improving precision for small-scale objects in sonar images; the full implementation process is detailed in Figure 5.
Given the base features $F_{base}$ and differential features $F_{diff}$, hybrid features are first generated by element-wise summation:
$$F_{hybrid} = F_{base} + F_{diff}$$
This operation fuses the complementary information of the two feature types and provides context-aware signals for weight learning. A squeeze-and-excitation network then generates two-channel attention weights through successive channel compression, weight prediction, and weight normalization:
First, channel compression reduces the computational cost by compressing the channel dimension to C/8 through a 1 × 1 convolution:
$$F_s = \mathrm{Conv}_{1\times 1}^{C \to C/8}(F_{hybrid})$$
Weight prediction then applies another 1 × 1 convolution to generate a two-channel weight map:
$$W = \mathrm{Conv}_{1\times 1}^{C/8 \to 2}(F_s)$$
Finally, weight normalization applies the Softmax function [49] along the channel dimension so that the weights at each spatial location satisfy $w_{base}(x, y) + w_{diff}(x, y) = 1$:
$$[w_{base},\ w_{diff}] = \mathrm{Softmax}(W)$$
Feature selection is then achieved with the spatially adaptive weights, and a residual connection is introduced to strengthen the gradient flow:
$$F_{out} = w_{base} \ast F_{base} + w_{diff} \ast F_{diff} + 0.3 \cdot F_{base}$$
where $F_{out}$ denotes the fused output of the original features $F_{base}$ and $F_{diff}$, $w_{base}$ and $w_{diff}$ are the weights generated by convolution, ∗ denotes channel-by-channel multiplication, $w_{base} \ast F_{base}$ is the base feature weighting term, $w_{diff} \ast F_{diff}$ is the differential feature weighting term, and $0.3 \cdot F_{base}$ is the residual term; the residual coefficient of 0.3 was determined by grid search and effectively prevents gradient vanishing.
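A plausible PyTorch sketch of this fusion step, following the C/8 compression, two-channel softmax weights, and 0.3 residual coefficient described above (the intermediate activation is an assumption):

```python
import torch
import torch.nn as nn

class DynamicFusionSketch(nn.Module):
    """Sketch of the dynamic fusion step: squeeze-excite style weight
    prediction over the hybrid features, softmax-normalized per location,
    plus the 0.3-scaled residual from the paper's grid search."""
    def __init__(self, c):
        super().__init__()
        self.squeeze = nn.Conv2d(c, c // 8, kernel_size=1)   # channel compression to C/8
        self.predict = nn.Conv2d(c // 8, 2, kernel_size=1)   # two-channel weight map

    def forward(self, base, diff):
        hybrid = base + diff                                  # element-wise sum
        w = torch.softmax(self.predict(torch.relu(self.squeeze(hybrid))), dim=1)
        w_base, w_diff = w[:, 0:1], w[:, 1:2]                 # weights sum to 1 per pixel
        return w_base * base + w_diff * diff + 0.3 * base     # weighted terms + residual
```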
Because spatial differencing may also strengthen noise edges, and the dynamic feature fusion module only resolves invalid responses, a channel attention component employing holistic channel weighting is needed to mitigate interference in the frequency-band responses.
Although the traditional efficient channel attention mechanism [44,50] achieves lightweight channel attention, it uses one-dimensional convolution for local channel interaction, which makes it difficult to model global channel relationships effectively and leaves it sensitive to the high-frequency noise in sonar images. To this end, this paper proposes Enhanced ECA. The operational workflow is detailed in Figure 6, with three critical enhancements forming the framework's core advancements:
The first is global channel interaction modeling. This part abandons the one-dimensional convolution of the original ECA and uses fully connected layers to construct the channel interaction path. The input features are first aggregated by global average pooling:
$$z_c = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} F(c, h, w)$$
and the pooled vector is reduced along the channel dimension by a fully connected layer:
$$z' = W_1 z, \qquad W_1 \in \mathbb{R}^{(C/r) \times C}$$
where r is the adjustable compression ratio, set to 4 in the C2fD module to balance the computational overhead. The channel dimension is then recovered by upscaling:
$$z'' = W_2\, \mathrm{SiLU}(z'), \qquad W_2 \in \mathbb{R}^{C \times (C/r)}$$
The global parameter sharing of the fully connected layers enables Enhanced ECA to capture remote cross-channel dependencies and enhances feature differentiation for low-contrast objects. Secondly, a hybrid SiLU-Sigmoid activation strategy is adopted to avoid the gradient-vanishing problem of the original ECA, which uses only the Sigmoid function in the output layer [51]. The first step introduces the SiLU activation function after dimensionality reduction [52]:
$$\mathrm{SiLU}(x) = x \cdot \sigma(x)$$
Its smooth, non-monotonic property improves the gradient flow and mitigates deep-network training instability, while the output layer retains the Sigmoid function:
$$w = \sigma(z'')$$
where σ is the Sigmoid function that normalizes the attention weights, ensuring normalized weights and enhancing interpretability. The adjustable compression-ratio mechanism is then introduced to explicitly control the channel compression rate via the reduction parameter r, whose effect on the two fully connected layers can be expressed as a parameter count of
$$\frac{C^2}{r} + \frac{C^2}{r} = \frac{2C^2}{r}$$
By adjusting the value of r, a flexible trade-off can be made between model lightweighting and feature expression capability. When r increases, the computation amount is reduced, but the high-frequency details may be lost; when r decreases, more channel interaction information is retained to improve the detection accuracy of small objects.
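Putting the three enhancements together, a PyTorch sketch of Enhanced ECA might look as follows; it reflects the description above rather than the authors' exact code:

```python
import torch
import torch.nn as nn

class EnhancedECASketch(nn.Module):
    """Sketch of Enhanced ECA: global average pooling, fully connected
    reduction by ratio r (4 in C2fD), SiLU in the hidden layer, and a
    Sigmoid at the output for normalized attention weights."""
    def __init__(self, c, r=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(c, c // r),   # global channel interaction, C -> C/r
            nn.SiLU(),              # smooth non-monotonic activation
            nn.Linear(c // r, c),   # recover channel dimension
            nn.Sigmoid(),           # normalized attention weights
        )

    def forward(self, x):                          # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))            # GAP -> (B, C)
        return x * w.unsqueeze(-1).unsqueeze(-1)   # channel-wise reweighting
```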
The C2fD module achieves the double optimization of space and channel dimension through the cascade design of “explicit edge extraction-adaptive feature fusion-global channel purification”. With the complementary functions of each sub-module and the synergistic tuning of parameters, the robustness of the model to low-contrast objects, fuzzy edges and noise interference in underwater sonar images is significantly improved, which lays the core foundation for the breakthrough of HAUOD’s overall performance.
2.4. Underwater Multi-Scale Attention Mechanism
After preprocessing and C2fD processing, the image may still not be fully handled: a small amount of multi-scale object distribution, residual noise, and low signal-to-noise ratio may remain. The UnderwaterAttention mechanism is therefore designed so that the HAUOD model can fully process the entire image, while the attention mechanism preserves contextual coherence and avoids feature loss. The module contains adaptive noise suppression, a hierarchical dilated convolution group, dual-dimensional attention synergy, and dynamic gated residual fusion; the overall flow is shown in Figure 7 and Figure 8.
Among them, the adaptive noise suppression module suppresses the residual high-frequency scattering noise in the C2fD output through a grouped-convolution dual-path noise reduction unit, providing initially purified feature maps for the subsequent modules and reducing the interference of noise with the multi-scale context modeling. The unit is divided into two parts, shallow noise separation and deep noise suppression. The shallow noise separation is
$$F_1 = \mathrm{GELU}\big(\mathrm{GConv}_{3\times 3}(F)\big)$$
where grouped convolution (GConv) decouples channel correlation and enhances local noise-pattern learning, and the GELU activation function [53] provides a smooth non-linear mapping. The deep noise suppression is
$$F_2 = \mathrm{Conv}_{1\times 1}(F_1)$$
Experiments show that this module reduces high-frequency noise energy while preserving object edge integrity. To capture contextual information at different scales, parallel branches with multi-granularity receptive fields are constructed to realize hierarchical dilated convolution, generating features that contain multi-granularity contextual information and providing rich spatial and semantic cues for the subsequent dual-dimensional attention:
$$F_{d_i} = \mathrm{Conv}_{3\times 3}^{(d_i)}(F_2), \qquad i = 1, 2, 3$$
The outputs of all levels are spliced along the channel dimension:
$$F_{ms} = \mathrm{Concat}\big(F_{d_1}, F_{d_2}, F_{d_3}\big)$$
Through experimental optimization, the dilation rates [d1, d2, d3] are set to [1, 2, 3], which enlarges the receptive-field coverage for small objects while maintaining computational efficiency. To further suppress background noise and enhance the object-region response, cross-dimensional feature enhancement is realized through the synergy of channel and spatial attention, which suppresses noisy channels and discrete noise points to increase the saliency of object regions. The channel attention branch is
$$A_c = \sigma\Big(W_2\, \mathrm{Hardswish}\big(W_1\, \mathrm{GAP}(F_{ms})\big)\Big)$$
The Hardswish activation function [54] balances non-linearity and gradient stability, and the compression ratio of 4 controls the computational complexity. The spatial attention branch is
$$A_s = \sigma\Big(\mathrm{Conv}_{7\times 7}\big([\mathrm{AvgPool}(F_{ms});\ \mathrm{MaxPool}(F_{ms})]\big)\Big)$$
Large-kernel (7 × 7) convolution enhances the perception of spatial continuity and suppresses discrete noise-point responses. The final enhanced features are
$$F_{enh} = F_{ms} \odot A_c \odot A_s$$
Since excessive noise removal can lead to a loss of detail, a dynamic gated residual fusion module is introduced to balance noise suppression with object-feature retention; it weighs the information contribution of the original features against that of the enhanced features through a learnable gating coefficient:
$$F_{out} = \alpha \cdot F_{enh} + (1 - \alpha) \cdot F$$
where α is a learnable parameter. Experiments show that α automatically adjusts within 0.38 to 0.67 across different water-depth scenarios, realizing adaptive control of the feature-enhancement intensity.
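Assembling the four sub-modules, a hedged PyTorch sketch of the UWA mechanism is given below; layer widths, group counts, and the gating parameterization are assumptions consistent with, but not taken from, the text:

```python
import torch
import torch.nn as nn

class UWASketch(nn.Module):
    """Sketch of UWA: grouped-conv dual-path denoising, dilated convolution
    branches (rates 1/2/3), channel + spatial attention, and a learnable
    gated residual. Assumes the channel count c is divisible by 4."""
    def __init__(self, c):
        super().__init__()
        self.denoise = nn.Sequential(                       # adaptive noise suppression
            nn.Conv2d(c, c, 3, padding=1, groups=4), nn.GELU(),
            nn.Conv2d(c, c, 1))
        self.branches = nn.ModuleList(                      # hierarchical dilated convs
            nn.Conv2d(c, c // 3, 3, padding=d, dilation=d) for d in (1, 2, 3))
        cm = (c // 3) * 3
        self.ca = nn.Sequential(                            # channel attention, r = 4
            nn.Linear(cm, cm // 4), nn.Hardswish(),
            nn.Linear(cm // 4, cm), nn.Sigmoid())
        self.sa = nn.Sequential(                            # 7x7 spatial attention
            nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())
        self.proj = nn.Conv2d(cm, c, 1)                     # project back to C channels
        self.alpha = nn.Parameter(torch.tensor(0.5))        # learnable gate (assumption)

    def forward(self, x):
        f = self.denoise(x)
        ms = torch.cat([b(f) for b in self.branches], dim=1)
        w_c = self.ca(ms.mean(dim=(2, 3))).unsqueeze(-1).unsqueeze(-1)
        s_in = torch.cat([ms.mean(1, keepdim=True), ms.amax(1, keepdim=True)], dim=1)
        enhanced = self.proj(ms * w_c * self.sa(s_in))
        a = torch.sigmoid(self.alpha)                       # keep the gate in (0, 1)
        return a * enhanced + (1 - a) * x                   # dynamic gated residual
```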
3. Results
3.1. Experimental Environments
All experiments were implemented on a Linux platform using the PyTorch framework (version 2.0.0) with CUDA toolkit 11.8 for hardware acceleration. The hardware comprised an Intel Xeon Platinum 8362 CPU (2.80 GHz base frequency), 50 GB of system memory, and an NVIDIA RTX 3090 GPU. Throughout all experiments, the input resolution was fixed at 640 × 640 pixels and the batch size at 10.
3.2. Dataset
To ensure the robustness of the method, the experimental validation uses the Sonar Common Target Dataset (SCTD) [55,56], a benchmark resource in underwater sensing for evaluating the effectiveness of computational methods. The SCTD dataset currently contains three types of typical objects, namely shipwrecks, aircraft wrecks, and human victims, in 497 high-resolution sonar images collected from side-scan sonar, forward-looking sonar, and interferometric synthetic aperture sonar, covering a variety of imaging devices and scenarios. Shipwrecks, aircraft, and humans account for 76.9%, 15.4%, and 7.7% of the dataset, respectively. Manual annotation was used to minimize errors, and annotation files are provided in both Pascal VOC and MS COCO formats.
3.3. Evaluation Metrics
In order to evaluate the merits of the HAUOD model, we used mean Average Precision (mAP50), precision and recall as evaluation metrics, giving the following definitions of these metrics:
In object detection, average precision (AP) is the principal criterion for quantifying single-category detection performance, obtained by systematically analyzing how well the model discriminates positive from negative samples. In a binary foreground-background configuration, the mean average precision (mAP) coincides with AP because there is only one category. The evaluation rests on three core quantities: True Positives (TPs) count correctly localized objects whose bounding boxes align with ground-truth annotations, reflecting detection fidelity; False Positives (FPs) cover background regions classified as objects or misidentified objects, characterizing superfluous detections; and False Negatives (FNs) count annotated objects that go undetected, indicating the severity of omissions. These quantities combine through the precision-recall calculus, where precision measures detection purity,
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
and recall measures completeness,
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
together establishing a multidimensional framework for comprehensive model assessment.
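These definitions translate directly into code; a tiny helper for the two base ratios (AP and mAP additionally integrate precision over the recall axis per class):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN).

    Minimal helper mirroring the definitions above, guarding against
    empty denominators.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```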
3.4. Performance Analysis
To investigate the impact of the integration position of the C2fD module in the backbone network on model performance, this study systematically adjusts the embedding level of the module and compares the detection precision and computational efficiency of the HAUOD model under the different configurations. The experiments use the SCTD dataset as the benchmark, with mean average precision (mAP), floating-point operations (FLOPs), parameter size (Params), recall (R), precision (P), and multi-scale average precision under strict intersection-over-union thresholds (mAP50-95) as the core evaluation metrics. The integration position is iteratively adjusted so that HAUOD is optimal in model performance, complexity control, and feature extraction capability.
As shown in Table 1, overall model performance is optimal when the C2fD module is integrated at layers 1 and 4 of the backbone network (Model 5, integration location [1, 0, 0, 1]). Its mAP reaches 94.3%, a 7.5% accuracy improvement over the baseline YOLOv8n, while FLOPs increase by only 0.8 G. In addition, the recall (89.0%) and precision (95.2%) of Model 5 are significantly better than the other configurations, suggesting that this integration strategy effectively balances computational complexity and detection efficiency while enhancing the expression of object features. Notably, the mAP50-95 of Model 5 reaches 69.3%, 7.8 percentage points higher than the baseline (61.5%), verifying the benefit of multi-level feature fusion for detection robustness under strict IoU thresholds.
Integration location indicates where C2fD is integrated in the backbone; for example, [1, 0, 0, 0] means replacing the first C2f module in the backbone with the C2fD module, and so on.
3.5. Ablation Experiments
This study employs YOLOv8n as the foundational framework of the HAUOD architecture to systematically evaluate the effectiveness of the individual enhancement components. Through experimental validation on the SCTD dataset, the three technical innovations, SonarMosaic-based data augmentation, the C2fD structural module, and the Underwater Multi-scale Contextual Attention (UWA) mechanism, are integrated progressively, and their influence on underwater object detection capability is examined comprehensively. The experimental results are shown in Table 2. After introducing the SonarMosaic module into the baseline YOLOv8n, mAP50 improves from 86.8% to 89.8%, an improvement of 3 percentage points. Recall (R) improves from 0.721 to 0.851, indicating that the preprocessing significantly enhances the feature expression of low-contrast objects and reduces missed detections. mAP50-95 improves by 3.3 percentage points (61.5% → 64.8%), verifying that frequency-domain band-pass filtering combined with CLAHE enhancement facilitates multi-scale detection. Adding the C2fD module further improves mAP50 by 4.5 percentage points to 94.3%, and precision from 93.3% to 95.2%; this module enhances the perception of blurred-edge objects through explicit spatial difference feature extraction and dynamic fusion. The model achieves an mAP50-95 of 69.3%, demonstrating that its multi-scale feature integration successfully alleviates the recognition inconsistencies arising from size discrepancies among underwater objects. After introducing the UWA module, the full HAUOD model reaches an mAP50 of 95.1% and a recall (R) of 0.897, verifying the synergistic effect of multi-scale contextual modeling and noise suppression. mAP50-95 improves from 69.3% to 70.0%, demonstrating that the UWA's hierarchical dilated convolution group enhances detection robustness under strict IoU thresholds. Precision reaches its highest value (96.3%), showing that the dynamic gated residual mechanism effectively balances noise suppression and object-feature retention.
3.6. Performance Comparison
To assess the effectiveness of HAUOD, we compared five YOLO models, namely YOLOv5n, YOLOv8n, YOLOv8s, YOLOv10n, and YOLOv11n [57]. HAUOD achieves an mAP of 95.1%, exceeding the baseline YOLOv8n (86.8%) by 8.3 percentage points and significantly outperforming the other models. Its mAP50-95 (70.0%) improves on YOLOv8n (65.9%) by 4.1 percentage points, indicating greater robustness under strict IoU thresholds. Its recall (89.7%) and precision (96.3%) both lead, verifying the dual suppression effect of the multi-strategy preprocessing and attention mechanisms on missed and false detections. As many deep learning-based image detection methods exist, we further compared HAUOD with Faster R-CNN and SSD to test its effectiveness. The experimental results are shown in Table 3.
3.7. Detection Results
The detection results obtained with the models in the comparison test are shown in Figure 9. In summary, the HAUOD model has clear advantages over the other models in underwater sonar image detection.
To verify the rationality and effectiveness of the SonarMosaic module, the processing results of this module are visualized and compared with the initial images, as shown in Figure 10. To test the preprocessing results more intuitively, the Unified quality Assessment method for Sonar Imaging and Processing (UASIP) [58] is adopted to evaluate performance. Under the UASIP metric, the preprocessed image obtains a quality score higher than that of the initial image, verifying the effectiveness of the preprocessing module.
4. Discussion
This research presents the HAUOD framework, which achieves marked performance enhancements in sonar-based underwater object recognition through collaborative optimization of integrated system components. The evaluation outcomes demonstrate that relative to the original YOLOv8 framework, HAUOD achieves significant performance enhancements: mean Average Precision (mAP) increases to 95.1% (+8.3%) while mAP50-95 rises to 70.0% (+4.1%), with corresponding recall and precision rates reaching 0.897 and 96.3%, respectively. These improvements confirm its enhanced effectiveness in low-contrast conditions, noisy underwater environments, and multi-scale object detection scenarios, surpassing the baseline model’s capabilities across critical operational parameters.
The three-level cascaded preprocessing (CLAHE enhancement, non-local means denoising, and frequency-domain band-pass filtering), together with data enhancement, significantly improves the object signal-to-noise ratio; in particular, the frequency-domain band-pass filtering effectively highlights small objects and blurred-edge features by suppressing low-frequency background interference. Combined with the SonarMosaic data enhancement strategy, the model's adaptability to complex noise environments is strengthened while preserving object integrity. Experiments show that introducing the preprocessing module alone improves mAP by 3 percentage points, verifying the effectiveness of the joint spatial-frequency domain optimization strategy.
The dynamic feature fusion mechanism of the C2fD module uses explicit spatial differential feature extraction to strengthen the edge response through the Sobel operator, while the dynamic weight assignment mechanism (DynamicFusion) mitigates the interference of noisy edges by adaptively fusing the base features with the differential features. The Enhanced ECA module uses global channel interaction modeling with the SiLU–Sigmoid hybrid activation strategy to further suppress the invalid response of the noisy channel. The ablation experiments show that the introduction of the C2fD module leads to an additional 4.5% mAP enhancement, confirming its key role in feature enhancement and noise suppression.
The Underwater Multi-scale Contextual Attention (UWA) mechanism captures multi-granularity contextual information through hierarchical dilated convolution groups and combines it with dual-dimensional (channel-spatial) attention to enhance the object-region response. The dynamic gated residual mechanism (α = 0.38-0.67) adaptively balances the contributions of the original and enhanced features and significantly improves detection sensitivity for small objects in complex backgrounds. Experiments show that the UWA module further improves mAP by 0.8% and exhibits especially strong robustness under strict IoU thresholds (mAP50-95).