A High-Accuracy Underwater Object Detection Algorithm for Synthetic Aperture Sonar Images

Su, Jiahui; Xu, Deyin; Qiu, Lu; Xu, Zhiping; Lin, Lixiong; Zheng, Jiachun

doi:10.3390/rs17132112

Open AccessArticle

A High-Accuracy Underwater Object Detection Algorithm for Synthetic Aperture Sonar Images

by

Jiahui Su

¹,

Deyin Xu

¹,

Lu Qiu

¹,

Zhiping Xu

^1,2,*

,

Lixiong Lin

¹ and

Jiachun Zheng

¹

School of Ocean Information Engineering, Jimei University, Xiamen 361021, China

²

Key Laboratory of Underwater Acoustic Communication and Marine Information Technology (Xiamen University), Ministry Education, Xiamen 361005, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(13), 2112; https://doi.org/10.3390/rs17132112

Submission received: 11 May 2025 / Revised: 8 June 2025 / Accepted: 18 June 2025 / Published: 20 June 2025

(This article belongs to the Special Issue Ocean Remote Sensing Based on Radar, Sonar and Optical Techniques (Second Edition))

Download

Browse Figures

Versions Notes

Abstract

Underwater object detection with Synthetic Aperture Sonar (SAS) images faces many problems, including low contrast, blurred edges, high-frequency noise, and missed small objects. To improve these problems, this paper proposes a high-accuracy underwater object detection algorithm for SAS images, named the HAUOD algorithm. First, considering SAS image characteristics, a sonar preprocessing module is designed to enhance the signal-to-noise ratio of object features. This module incorporates three-stage processing for image quality optimization, and the three stages include collaborative adaptive Contrast Limited Adaptive Histogram Equalization (CLAHE) enhancement, non-local mean denoising, and frequency-domain band-pass filtering. Subsequently, a novel C2fD module is introduced to replace the original C2f module to strengthen perception capabilities for low-contrast objects and edge-blurred regions. The proposed C2fD module integrates spatial differential feature extraction, dynamic feature fusion, and Enhanced Efficient Channel Attention (Enhanced ECA). Furthermore, an underwater multi-scale contextual attention mechanism, named UWA, is introduced to enhance the model’s discriminative ability for multi-scale objects and complex backgrounds. The proposed UWA module combines noise suppression, hierarchical dilated convolution groups, and dual-dimensional attention collaboration. Experiments on the Sonar Common object Detection dataset (SCTD) demonstrate that the proposed HAUOD algorithm achieves superior performance in small object detection accuracy and multi-scenario robustness, attaining a detection accuracy of

95.1 %

, which is

8.3 %

higher than the baseline model (YOLOv8n). Compared with YOLOv8s, the proposed HAUOD algorithm can achieve

6.2 %

higher accuracy with only

50.4 %

model size, and reduce the computational complexity by half. Moreover, the HAUOD method exhibits significant advantages in balancing computational efficiency and accuracy compared to mainstream detection models.

Keywords:

synthetic aperture sonar; underwater object detection; underwater image preprocessing; feature fusion

1. Introduction

With the continuous advancement of global marine resource exploitation and strategic security demands, underwater object detection technology has become a core pillar in marine exploration, military security, and emergency rescue operations [1,2]. From shipwreck recovery and underwater infrastructure inspection to submarine tracking and drowning victim search missions, rapid and precise object recognition capabilities are critical to mission success. However, underwater environments inherently suffer from significant optical attenuation, turbid water, and complex lighting conditions, rendering traditional optical imaging techniques inadequate for acquiring clear images. Synthetic Aperture Sonar (SAS) [3,4,5], which is an underwater imaging technology of virtual synthesis of super large aperture through the movement of sonar platform, plays an important role in underwater object detection. In SAS technology, the small sonar array is used to continuously transmit and receive sound waves while moving. By accurately recording the platform position and echo phase, the received signals along the way are coherently superimposed and processed to synthesize the effect equivalent to the ultra-long physical aperture. It can achieve long-distance, high-resolution and clear imaging without blurring with distance, especially suitable for fine detection of mines, pipelines, sunken ships and other targets [6]. Sonar technology, leveraging the strong penetration capability of sound waves and exceptional resistance to environmental interference, has emerged as an irreplaceable solution for underwater detection. By actively emitting acoustic waves and analyzing their echoes, sonar systems can effectively reconstruct the contours, textures, and spatial distribution of underwater objects [7,8], providing foundational data for object detection and localization in complex scenarios [9]. As the volume of underwater sonar image data requiring rapid detection continues to grow, the demand for both speed and accuracy in detection methods has intensified. Moreover, challenges such as acoustic wave multipath reflections, environmental noise-induced speckle artifacts in images, and shadow zones behind objects caused by side-scan sonar characteristics [10,11] further complicate high-accuracy object detection.

The development of object detection technology for SAS images can be divided into two major phases: traditional image processing and deep learning. Traditional methods primarily rely on physical models and signal processing techniques [4,12,13,14], with typical workflows including manual feature extraction steps such as threshold segmentation, edge detection, and morphological filtering. For instance, object regions are segmented by setting grayscale thresholds, edge contours are extracted using Sobel or Canny operators, and object localization is completed via region growing or template matching. However, most of these methods depend on physical models, signal processing approaches, and expert-designed feature rules [12,13,15,16], exhibiting poor adaptability to speckle noise, shadow artifacts, and low-contrast objects prevalent in sonar images. This results in low detection efficiency and insufficient generalization. Consequently, with the advent of deep learning, traditional sonar image processing techniques have gradually been phased out, giving way to deep learning-based methods [17,18,19,20,21,22], which dominate sonar image analysis.

Advancements in deep learning, particularly architectures such as Convolutional Neural Networks (CNN) [23] and Transformers [24], have demonstrated exceptional performance in processing image and video data, commonly applied to tasks in computer vision and natural language processing [25]. Deep learning excels at learning invariant features directly from data, offering superior robustness and reducing reliance on manual intervention. Thus, deep learning-based image processing methods hold significant advantages over traditional approaches. Current CNN-based techniques for sonar images are categorized into single-stage and two-stage detection frameworks. Representative single-stage models include the YOLO series [2,26,27,28,29] and single shot multibox detector (SSD) [30], while two-stage detection is exemplified by Faster R-CNN [31,32] and Mask R-CNN [33]. In the field of object detection, single-stage detection methods generate object classification results and localization coordinates directly from global image features, and their architectural design eliminates the generation of region candidate frames, a feature that enables them to achieve a more efficient inference process compared to two-stage methods. However, such methods usually show a slight degradation in detection accuracy, and this performance limitation mainly stems from the fact that their end-to-end prediction mechanism needs to process dense spatial samples simultaneously, which leads to a significant increase in computational resource consumption [34]. Moreover, the deep learning methods, including one-stage and two-stage methods, are used for SAS image underwater object detection. Specifically, YOLO [35,36], Faster R-CNN [37] and SSD [38] are usually considered for SAS image target detection.

The introduction of Transformer technology has opened new optimization pathways for sonar image processing. Its core strength lies in establishing long-range dependencies among image pixels through self-attention mechanisms, significantly enhancing suppression of speckle noise and reverberation artifacts in sonar images. For sonar video sequence analysis, Transformers can model spatiotemporal features concurrently, leveraging temporal attention weights to align multi-frame object trajectories and mitigate missed detections of dynamic objects.

Although these models and algorithmic applications have significantly advanced object detection technology, they still face numerous challenges when processing underwater image data, including water scattering and noise interference. Underwater sonar images typically exhibit low contrast between objects and backgrounds, along with blurred edges. Additionally, acoustic wave multipath reflections introduce speckle noise and artifacts in the images, as shown in Figure 1. Furthermore, the substantial size variations of underwater objects in imagery pose difficulties for models to adapt effectively to multi-scale data, making the handling of multi-scale variations of an image another critical challenge.

In this paper, a high-accuracy underwater object detection algorithm (HAUOD) for SAS images is proposed. This paper improves on the basis of YOLOv8. Firstly, a special preprocessing module and data enhancement strategy are designed for the problem of low background contrast and fuzzy edges of underwater sonar images. Then, the C2fD module fused with differential features is proposed to ensure that the model has a stronger ability to capture details. Furthermore, an underwater multi-scale context attention mechanism is designed to further improve the sensitivity of the model to weak objects. Experimental results on the Sonar Common Target Dataset (SCTD) reveal that the HAUOD methodology achieves 95.1% recognition accuracy, surpassing the YOLOv8n reference model by an 8.3% absolute improvement. The solution demonstrates strengthened detection consistency for diminutive underwater objects and heightened adaptability to varied marine conditions. Compared with YOLOv8s, the proposed HAUOD algorithm can achieve

6.2 %

higher accuracy with only

36 %

model size, and reduce nearly half of the computational complexity. Moreover, the HAUOD method can achieve higher accuracy than YOLOv10 and YOLOv11 with an appropriate model size and computational efficiency. These demonstrate that the proposed method can exhibit significant advantages in balancing computational efficiency and accuracy compared to mainstream detection models. The main contributions of this paper are as follows:

(1) To address the challenges of low contrast, noise interference, and blurred edges in SAS images, a high-accuracy underwater object detection algorithm for Synthetic Aperture Sonar (SAS) images is proposed, named the HAUOD algorithm. HAUOD first processes data through an image preprocessing module, then enhances the C2f module with the novel C2fD design, and finally introduces the Underwater Multi-scale Contextual Attention Mechanism, named UWA, to achieve efficient and robust underwater object detection.

(2) The SAS images are pre-processed and optimized for data enhancement. This strategy combines the three preprocessing methods of Contrast Limited Adaptive Histogram Equalization (CLAHE) enhancements, non-local mean denoising, and frequency domain bandpass filtering to form a three-level cascade, so as to enhance the image contrast and solve the problem of edge blur. CLAHE effectively enhances local image contrast, while non-local mean denoising suppresses noise, and frequency-domain band-pass filtering sharpens object edges by emphasizing regions of interest. Although traditional Mosaic data augmentation works well for optical images, its direct application to low-contrast, noise-corrupted sonar images risks distorting object features. Thus, dedicated SAS image preprocessing is introduced before Mosaic augmentation to ensure stable retention of object information.

(3) To tackle edge blurring in underwater sonar images, the C2fD module is proposed. The C2fD module uses the optimized spatial difference to extract the objects’ edge information to deal with the problems of edge blurring and texture missing in sonar image, and then uses the Enhanced Efficient Channel Attention (Enhanced ECA) mechanism and lightweight feature fusion strategy to prevent the imbalance between basic features and edge features, which effectively enhances the object recognition ability of the model in the underwater environment with low contrast and serious noise interference. In addition, the multi-scale feature extraction and fusion strategy can be used to deal with the significant change in underwater target size and balance the detection speed and accuracy.

(4) The Underwater Multi-scale Contextual Attention Mechanism, named UWA, is designed to enhance the effectiveness of the C2fD module. It integrates adaptive noise suppression and hierarchical dilated convolution groups to capture multi-scale contextual features. Channel-spatial dual-dimensional attention collaboration is then applied to amplify responses in object regions. Finally, dynamic gated residual fusion balances contributions from original and enhanced features, significantly improving sensitivity to faint objects.

2. The Proposed HAUOD Algorithm

2.1. Overall Architecture

The proposed HAUOD model in this study is built upon YOLOv8 architecture, with the overall algorithmic framework illustrated in Figure 2. To address the challenges of low contrast, speckle noise, and low-frequency background interference in sonar images, a multi-strategy sonar preprocessing module is designed. This module first applies adaptive CLAHE [39] enhancement to the luminance channel in the LAB color space to optimize local contrast while suppressing artifacts caused by global equalization. Subsequently, non-local mean denoising [40] is employed to reduce speckle noise while preserving object edge structures. Finally, frequency-domain band-pass filtering [41] is introduced to mitigate low-frequency background interference. This three-stage cascaded processing pipeline effectively optimizes contrast, suppresses artifacts, and minimizes low-frequency disturbances in sonar images [42].

2.2. SonarMosaic Strategy

The proposed C2fD architecture addresses essential requirements for contour characteristic capture and underwater object scale adaptation through selective substitution of specific C2f components within the core network architecture. This upgrade enables synergistic feature enhancement and noise suppression while refining the preprocessed high-quality features. The C2fD module incorporates a spatial differential feature extraction component that explicitly computes horizontal and vertical gradients using fixed Sobel operators [43] to amplify edge responses. An Enhanced Efficient Channel Attention (Enhanced ECA) [44] mechanism is introduced, replacing 1D convolution with fully connected layers and optimizing global channel interactions through a reduction ratio of four. This enhancement improves small object detection accuracy and prevents missed detections caused by underwater object size variations. A dynamic fusion [45,46] module is further implemented to adaptively weight and combine base features with differential features, preventing gradient vanishing.

To address potential information loss during backbone network integration, the Underwater Multi-scale Contextual Attention (UWA) mechanism is developed. This attention mechanism employs hierarchical dilated convolution groups to model multi-scale contextual information and integrates channel-spatial dual-dimensional attention to suppress background interference. Dynamic gated residual fusion ensures balanced noise suppression and feature preservation, effectively resolving missed detections caused by multi-scale object distributions, high-frequency scattering noise, and background interference while maintaining contextual integrity. Overall, these technical enhancements greatly improve model robustness when operating in challenging underwater environments.

Aiming at the low-contrast, high-frequency noise interference and complex background characteristics of sonar images, this paper proposes a three-stage cascade preprocessing strategy to enhance object detectability through cooperative enhancement and noise suppression. The preprocessing module contains three stages: Adaptive CLAHE enhancement, non-local mean denoising and frequency domain bandpass filtering. The specific realization steps are shown in Figure 3.

Underwater sonar detects objects that are usually characterized by low contrast, high noise and blurred object edges, which makes the traditional histogram equalization method introduce more noise when enhancing the contrast, resulting in the loss of details in the object area or over-amplification of background information. Traditional histogram equalization methods enhance image contrast by adjusting the distribution of pixel values, but the method performs equalization over the entire image range, which can easily lead to over-enhancement or loss of details in local areas. In contrast, CLAHE employs a local contrast enhancement strategy, which effectively improves the gradient strength of the object region while reducing the effect of noise by segmenting the image into multiple small regions, calculating the local histogram equalization separately, and preventing over-enhancement by using contrast limitation. The method adaptively adjusts the contrast within different local regions to achieve uniform luminance enhancement while avoiding artifacts from global histogram equalization. The BGR image is first converted to LAB color space. And then the CLAHE transform is applied only to the luminance channel (L) to avoid unnecessary influence on the color information. Finally, the enhanced luminance channel is merged with the original A and B channels and converted back to BGR format. However, the CLAHE-enhanced image will make the noise more obvious at the same time, so the non-local mean denoising is designed to form a synergistic process of “enhancement-purification”. Non-local mean denoising is based on pixel neighborhood similarity weighted denoising, which is able to preserve image edges and structural information, suppressing high frequency noise while reducing noise interference.

\bar{I} (x) = \frac{1}{Z (x)} \sum_{y \in R} e^{- \frac{| | I (N_{x}) - I (N_{y}) {| |}_{2, c o l o r e d}^{2}}{h^{2}}} I (y)

(1)

where

\bar{I} (x)

is the denoised value of the pixel,

Z (x)

is the normalization factor used to ensure that the sum of weights is 1,

I (N_{x})

and

I (N_{y})

are the pixel blocks in the neighborhood of the pixel sum, and h is the decay coefficient, which controls the decay speed of the similarity weights. Simultaneous processing of BGR three-channel information to avoid color distortion caused by single-channel denoising and the use of multi-threaded optimization greatly reduced the single-image denoising time.

Traditional mean filtering [47] and Gaussian filtering [48] mainly denoise by calculating the pixel mean value through the local region, but these methods also lead to the loss of image details and blurring of the object boundary while denoising. The non-local mean denoising method overcomes the limitations of traditional filters by utilizing the similarity between pixels and calculating similar pixel values through weighted averaging, which is able to retain the object edges and structural information while effectively denoising and is especially suitable for complex background processing of underwater sonar images. At the same time, the frequency domain band-pass filtering can deal with the noise from different dimensions, forming a dual noise reduction mechanism of “air purification-frequency domain focus”, which significantly improves the object signal-to-noise ratio. Figure 3 illustrates the implementation workflow of spectral bandpass filtering, where two-dimensional DFT processing is applied to the noise-reduced input f(x,y). This frequency domain transformation enables selective signal filtering through amplitude spectrum manipulation:

F (u, v) = \sum_{x = 0}^{M - 1} \sum_{y = 0}^{N - 1} f (x, y) e^{- j 2 π} \frac{u x}{M} + \frac{v y}{N})

(2)

The frequency domain component F(u,v) represents complex spectral values at coordinates (u,v), while f(x,y) corresponds to spatial domain intensity values at position (x,y), with M and N specifying the image’s vertical and horizontal resolutions, respectively.

Then, a bandpass filter H(u,v) is designed to construct the frequency mask, which is defined as

H (u, v) = \{\begin{matrix} 1, R \geq \sqrt{{(u - u_{0})}^{2} + {(v - v_{0})}^{2}} \\ 0, R < \sqrt{{(u - u_{0})}^{2} + {(v - v_{0})}^{2}} \end{matrix}

(3)

where

(u_{0}, v_{0})

is the center of the frequency domain, and R is a preset radius that controls the range of frequencies retained. Then, filtering and inverse conversion first multiply the frequency domain image with the mask to obtain the filtered frequency domain image:

F_{f i l} (u, v) = F (u, v) * H (u, v)

(4)

The conversion process subsequently employs inverse discrete Fourier transformation (IDFT) to restore the data to its original null domain configuration:

f_{f i l} (u, v) = F^{- 1} [F_{f i l} (u, v)]

(5)

where

F_{f i l} (u, v)

is the filtered frequency-domain image,

F^{- 1}

denotes the Fourier inverse transform symbol.

This process preserves the object information in the frequency domain while effectively suppressing the noise and low-frequency interference in the background. Since the traditional Mosaic has serious problems of noise superposition and object distortion when directly splicing low-quality sonar images, three-level preprocessing is embedded in front of the Mosaic to ensure the clarity and object integrity of the spliced subgraphs, which results in the SonarMosaic module. This module not only ensures the diversity of the original data but also further strengthens the object features and anti-interference ability in the image through preprocessing, which greatly improves the feature extraction accuracy of the subsequent network.

2.3. The Proposed C2fD Module

In order to improve the detection performance of YOLOv8 on SCTD sonar image datasets, this paper presents the C2fD module by improving the original C2f module. This module enhances the model’s ability to detect low-contrast, small and edge-ambiguous objects while maintaining computational efficiency by optimizing the optimized spatial difference module, dynamic feature fusion and enhanced efficient channel attention. The specific workflow is shown in Figure 4.

Aiming at the problems of blurred object edges, low contrast, and background noise interference in underwater sonar images, the traditional convolutional neural network (CNN) relies on the adaptive convolutional kernel to implicitly extract the edge features, which is susceptible to noise contamination, leading to feature confusion. In this paper, we propose an explicit spatial gradient enhancement module (OptimizedSpatialDifference), which significantly improves the model’s ability to model the edge features of sonar images through a three-phase operation of channel compression, bi-directional gradient extraction and feature expansion. Its core design is as follows:

Implement channel compression and noise suppression, and perform channel dimension mean pooling on the input feature map:

X_{p o o l} = \frac{1}{C} \sum_{i = 1}^{C} X_{i} \in R^{1 \times H \times W}

(6)

This section suppresses channel-specific noise due to scattering in the sonar image by cross-channel information fusion while reducing the computational effort to 1/C of the original input. Experiments show that this step results in a reduction in the intensity of the background noise response. In addition to the input feature channel compression, a bi-directional gradient feature extraction method was used to explicitly enhance the horizontal and vertical edge response. A Sobel convolution kernel with fixed weights was used to compute the horizontal and vertical gradients:

G = C o n v_{3 \times 3} (X_{p o o l}; W_{s o b e l}) \in R^{2 \times H \times W}

(7)

The horizontal Sobel nucleus is:

W_{h} = [\begin{matrix} - 1 & 0 & 1 \\ - 2 & 0 & 2 \\ - 1 & 0 & 1 \end{matrix}]

(8)

The vertical Sobel nucleus is:

W_{p} = [\begin{matrix} - 1 & - 2 & - 1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{matrix}]

(9)

The introduction of the Sobel operator can also be utilized to suppress high-frequency noise interference using its smoothing properties. The

X_{p o o l}

in this operation represents the single-channel feature map after pooling across channel averages. The symbol C represents channel count, H indicates vertical dimension size, W corresponds to horizontal spatial extent, while G signifies bidirectional gradient components (horizontal and vertical) generated through convolutional filter operations.

In the feature enhancement stage, the dual-channel gradient features maintain dimensional consistency with the original input channel number through the channel dimension expansion strategy, which lays the structural foundation for multi-feature fusion. The process constructs a composite feature expression space by hierarchically splicing the horizontal and vertical gradient features with the original features. The fused feature vectors are then input to the subsequent convolutional layers for deep feature extraction. In order to optimize the feature quality, the OptimizedSpatialDifference module introduces a dynamic weight allocation mechanism, which achieves selective suppression of redundant feature responses through adaptive threshold constraints, thus enhancing the feature contribution of effective gradient information.

The dynamic feature fusion module achieves context-aware integration of core and complementary characteristics through attention-driven weight adaptation. This approach demonstrates reduced computational complexity relative to conventional feature pyramid networks while enhancing precision for small-scale object identification in sonar images, with the full implementation process detailed in Figure 5.

Given the base and differential features, the hybrid features are first generated by element-by-element summation:

F_{m} = F_{b a s e} + F_{d i f f} \in R^{C \times H \times W}

(10)

This operation initially fuses the complementary information of the two types of features to provide context-aware signals for weight learning. Then, a compression-excitation network is used to generate two-channel attention weights, which are successively achieved through the process of channel compression, weight prediction, and weight normalization:

First, the channel compression reduces the computational effort by compressing the channel dimension to C/8 through 1 × 1 convolution:

F_{c o m} = R e L U (C o n v_{1 \times 1} (F_{m})) \in R^{C / 8 \times H \times W}

(11)

Weight prediction is then performed, and 1 × 1 convolution is again used to generate a two-channel weight map:

W_{r a w} = C o n v_{1 \times 1} (F_{c o m}) \in R^{2 \times H \times W}

(12)

Finally, weight normalization is performed, and the Softmax function [49] is applied along the channel dimensions to ensure that the weight of each spatial location satisfies

W_{0} + W_{1} = 1

:

W = S o f t m a x (W_{r a w}, d i m = 1) \in {[0, 1]}^{2 \times H \times W}

(13)

Feature selection is achieved by spatially adaptive weights, and residual connectivity is introduced to enhance the gradient flow:

F_{f} = W_{0} * F_{b a s e} + W_{1} * F_{d i f f} + 0.3 * F_{b a s e}

(14)

where

F_{m}

is denoted as the sum of the original features

F_{b a s e}

as well as

F_{d i f f}

,

W_{r a w}

is the original weights generated by convolution, ∗ denotes the channel-by-channel multiplication,

W_{0} * F_{b a s e}

is the base feature weighting term,

W_{1} * F_{d i f f}

is the differential feature weighting term, and

0.3 * F_{b a s e}

is the residual term, and the residual coefficient of 0.3 is determined by a grid search, which effectively prevents gradient vanishing.

Because the spatial difference may strengthen the noise edge, but the dynamic feature fusion module can only solve the problem of invalid response, the spectral attention component employs holistic spectral weighting to mitigate interference in frequency band responses.

Although the traditional effective channel attention mechanism [44,50] can achieve lightweight channel attention, it uses one-dimensional convolution for local channel interaction, which makes it difficult to effectively model the global channel relationship and is sensitive to high-frequency noise in sonar images. To this end, this paper proposes Enhanced ECA. The operational workflow is detailed in Figure 6, with three critical enhancements forming the framework’s core advancements:

The first is the global channel interaction modeling. This part abandons the one-dimensional convolution operation of the original ECA and uses the fully connected layer to construct the channel interaction path for channel dimensionality reduction in the global average pooled features, and the formula is the global average pooled features:

F_{d o w n} = W_{d o w n} \cdot F_{a v g}, W_{d o w n} \in R^{C / r \times c}

(15)

where r is the adjustable compression ratio, set to 4 in the C2fD module to balance the computational overhead. The channel dimension is then recovered by upscaling:

F_{u p} = W_{u p} \cdot S i L U (F_{d o w n}), W_{u p} \in R^{C \times C / r}

(16)

The global parameter sharing mechanism of the fully connected layer enables Enhanced ECA to capture cross-channel remote dependencies and enhance feature differentiation for low-contrast objects. Secondly, a hybrid SiLU-Sigmoid activation strategy is adopted to avoid the problem that the original ECA only uses the Sigmoid function in the output layer [51], which is prone to cause gradient vanishing. The first step is to introduce SiLU activation function after dimensionality reduction [52]:

S i L U (x) = x \cdot σ (x)

(17)

Its smooth non-monotonic property improves the gradient flow and mitigates deep network training instability. But the output layer retains the Sigmoid function.

A = σ (F_{u p})

(18)

where

σ (x)

is a Sigmoid function that normalizes the attentional weights. Its role is to ensure the normalization of the attention weights and enhance the interpretability. Then the adjustable compression ratio mechanism is introduced to explicitly control the channel compression rate using the REDUCTION parameter, whose effect can be expressed as:

F L O P s α \frac{2 C^{2}}{r}

(19)

By adjusting the value of r, a flexible trade-off can be made between model lightweighting and feature expression capability. When r increases, the computation amount is reduced, but the high-frequency details may be lost; when r decreases, more channel interaction information is retained to improve the detection accuracy of small objects.

The C2fD module achieves the double optimization of space and channel dimension through the cascade design of “explicit edge extraction-adaptive feature fusion-global channel purification”. With the complementary functions of each sub-module and the synergistic tuning of parameters, the robustness of the model to low-contrast objects, fuzzy edges and noise interference in underwater sonar images is significantly improved, which lays the core foundation for the breakthrough of HAUOD’s overall performance.

2.4. Underwater Multi-Scale Attention Mechanism

After pre-processing and C2fD module processing, the image may not be able to be completely processed and there may still be a small part of the multi-scale object distribution, residual noise and low signal-to-noise ratio; for this reason, the design of the UnderwaterAttention mechanism is used to ensure that the HAUOD model can completely process all of the image. And the attention mechanism can ensure the coherence of the context to avoid feature loss. The module contains adaptive noise suppression, hierarchical null convolution group, two-dimensional attention synergy and dynamic gated residual fusion, and the overall flow is shown in Figure 7 and Figure 8.

Among them, the adaptive noise suppression module suppresses the residual high-frequency scattering noise in the C2fD output through the grouped convolution-dual-path noise reduction unit, which provides the initial purified feature maps for the subsequent modules, and reduces the interference of the noise on the multi-scale context modeling. The designed grouped convolution-dual path noise reduction unit is specifically divided into two parts: shallow noise separation and deep noise suppression, where the shallow noise separation is:

F_{p r e} = G E L U (C o n v_{3 \times 3} (X; g r o u p s = 4)) \in R^{C / 2 \times H \times W}

(20)

Grouped convolution is used to decouple channel correlation and enhance local noise pattern learning, and the GELU activation function [53] provides smooth non-linear mapping. Depth noise suppression is:

F_{d} = B N (C o n v_{3 \times 3} (F_{p r e})) \in R^{C \times H \times W}

(21)

Experiments show that this module enables high-frequency noise energy reduction while preserving object edge integrity. To capture contextual information at different scales, a parallel branch of multi-granularity sensory fields is constructed to achieve hierarchical null convolution, generating features that contain multi-granularity contextual information, providing rich spatial and semantic information for subsequent two-dimensional attention:

F_{d i l a 1} = R e L U (B N (C o n v_{3 \times 3} (F_{d}, dilation = 1)))

(22)

F_{d i l a 2} = R e L U (B N (C o n v_{3 \times 3} (F_{d}, dilation = 2)))

(23)

F_{d i l a 3} = R e L U (B N (C o n v_{3 \times 3} (F_{d}, dilation = 3)))

(24)

Splice the output of each level along the channel dimension:

F_{c o n t e x t} = C o n c a t (F_{d i l a 1}, F_{d i l a 2}, F_{d i l a 3}) \in R^{3 C / 4 \times H \times W}

(25)

Through experimental optimization, the null rate [d1, d2, d3] is set to [1, 2, 3], which enhances the small object’s receptive field coverage while maintaining the computational efficiency. In order to further suppress the background noise and enhance the object region response, cross-dimensional feature enhancement is realized by utilizing the synergy of attention in both channel and spatial dimensions, and the saliency of the object region is enhanced by suppressing the noisy channel and discrete noise points through cross-dimensional attention. The channel attention branching formula is:

w_{c} = σ (C o n v_{1 \times 1} (H a r d s w i s h (C o n v_{1 \times 1} (G A P (F_{c o n t e x t})))) \in {[0, 1]}^{3 C / 4}

(26)

The Hardswish activation function [54] is used to balance non-linearity and gradient stability, and the compression ratio reduction of 4 controls the computational complexity. The spatial attention branching formula is:

W_{s} = σ (C o n v_{7 \times 7} (F_{c o n t e x t} * W_{c})) \in {[0, 1]}^{1 \times H \times W}

(27)

Large kernel convolution (7 × 7) enhances spatial continuity perception and suppresses discrete noise point responses. The final enhanced features are:

F_{a t t} = (F_{c o n t e x t} * W_{c}) * W_{s}

(28)

Since excessive noise removal can potentially lead to a loss of details, in order to balance noise suppression with object feature retention, a dynamic gated residual fusion module is introduced, which balances the information contribution of the original features with that of the augmented features by means of learnable gating coefficients:

F_{f u s i o n} = C o n v_{1 \times 1} (C o n c a t (F_{a}, F_{a t}))

(29)

Y = X * λ + F_{f u s i o n} * (1 - λ) λ \in [0.38, 0.67]

(30)

where

λ

is a learnable parameter. Experiments show that its automatic adjustment range is 0.38 to 0.67 in different water depth scenarios to realize adaptive feature-enhanced intensity control.

3. Results

3.1. Experimental Environments

This study implemented all computational procedures on a Linux-based platform employing the PyTorch framework (version 2.0.0) for model development and the CUDA toolkit 11.8 for hardware acceleration. The experimental setup utilized an Intel Xeon Platinum 8362 central processing unit operating at 2.80 GHz base frequency, complemented by 50 gigabytes of system memory and an NVIDIA RTX 3090 graphical processing unit. Throughout the experimental iterations, fixed-dimensional parameters were maintained with input tensors standardized at 640 × 640 pixel resolution processed in groups of 10 samples per computational batch.

3.2. Dataset

To ensure the robustness of the method, the experimental validation phase utilizes the Sonar Common Object Detection Dataset (SCTD) [55,56], a benchmark resource in the field of underwater sensing that can be used to evaluate the effectiveness of the computational method. This dataset currently contains three types of typical objects: underwater wrecks, wrecked airplane wrecks, and victims. The SCTD dataset is described as follows.

The SCTD dataset currently contains three types of typical objects, namely, underwater shipwrecks, wrecked airplane wrecks, and victims, and contains 497 high-resolution sonar images collected from side-scan sonar, forward-looking sonar, and interferometric synthetic aperture sonar, which cover a variety of imaging equipment and scenarios. In this dataset, shipwrecks, airplanes, and humans account for 76.9%, 15.4%, and 7.7% of the total dataset, respectively. In addition, manual annotation was used to minimize errors, and annotation files in both Pascal VOC and MS COCO formats are provided.

3.3. Evaluate Metrics

In order to evaluate the merits of the HAUOD model, we used mean Average Precision (mAP50), precision and recall as evaluation metrics, giving the following definitions of these metrics:

m A P 50 = \frac{1}{N} \sum_{i = 1}^{N} A P_{i}

(31)

A P = \int_{0}^{1} P (R) d R

(32)

P r e c i sin = \frac{T P}{T P + F P}

(33)

R e c a l l = \frac{T P}{T P + F N}

(34)

Within object detection frameworks, average precision (AP) functions as the principal quantification criterion for evaluating single-category detection performance, obtained through systematic analysis of model discrimination accuracy between positive and negative specimen groups. Under binary classification configurations involving foreground–background differentiation, the mean average precision (mAP) metric inherently aligns with AP values due to categorical singularity. The evaluation architecture incorporates three core operational components: True Positives (TPs) quantify accurately localized objects through bounding box alignment with ground truth annotations, reflecting detection fidelity; False Positives (FPs) encompass erroneous classifications of background regions as objects or misidentifications of authentic objects, characterizing superfluous detection artifacts; while False Negatives (FNs) enumerate undetected objects within annotated datasets, serving as omission severity indicators. These metrics undergo synergistic integration via precision-recall calculus, where precision measures detection purity (TP/(TP + FP)) and recall evaluates completeness (TP/(TP + FN)), collectively establishing a multidimensional performance quantification framework for comprehensive model assessment.

3.4. Performance Analysis

In order to investigate the impact of the integration position of the C2fD module in the backbone network on the model performance, this study compares and analyzes the detection precision and computing efficacy of the HAUOD model under different configurations by systematically adjusting the embedding level of the module. The experiments use the SCTD dataset as the benchmark platform, and the mean average precision (mAP), floating-point operations (FLOPs), parameter sizes (Params), recall rate (R), checking accuracy (P), and multi-scale average precision under strict intersection and merger ratio thresholds (mAP50-95) are selected as the core evaluation metrics. In order to make HAUOD optimal in terms of model performance, control complexity, and enhancement of feature extraction capability, the integration position is iteratively adjusted to optimize the performance.

As shown in Table 1, the comprehensive model performance is optimal when the C2fD module is integrated in layers 1 and 4 of the backbone network (Model 5, integration location [1, 0, 0, 1]). Its mAP reaches 94.3%, which is a 7.5% improvement in accuracy over the baseline model YOLOv8n, but the Flops are only improved by 0.8 G. In addition, the recall (89.0%) and precision (95.2%) of Model 5 are significantly better than the other configurations, suggesting that the integration strategy effectively balances computational complexity and detection efficiency while enhancing the expression of object features. Notably, the mAP50-95 of Model 5 reaches 69.3%, which is 7.8 percentage points higher than that of the baseline model (61.5%), verifying the facilitating effect of multi-level feature fusion on the robustness of detection under strict IoU thresholds.

Integration location indicates the location of C2fD integration in the backbone; for example, [1, 0, 0, 0] means replacing the C2fD module with the first C2f module in the backbone, and so on.

3.5. Ablation Experiments

This study employs YOLOv8n as the foundational framework within the HAOOD architecture to systematically evaluate the effectiveness of individual enhancement components. Through experimental validation on the SCTD dataset, the research methodically integrates three technical innovations: SonarMosaic-based data augmentation techniques, the C2fD structural module, and the Underwater Multi-scale Contextual Attention (UWA) mechanism. The investigation comprehensively examines how these progressive modifications influence object detection capabilities in aquatic environments. The experimental results are shown in Table 2. After introducing the SonarMosaic module in the baseline model YOLOv8n, the mAP50 is improved from 86.8% to 0.898%, which is an improvement of 3 percentage points. The recall (R) improves from 0.721 to 0.851, indicating that the preprocessing significantly enhances the feature expression of low-contrast objects and reduces the missed detections. The mAP50-95 improves by 3.3 percentage points (61.5% → 64.8%), verifying the facilitation of the frequency-domain band-pass filtering with CLAHE enhancement for multi-scale detection. The addition of the C2fD module further improves the mAP50 to 94.3% by 4.5 percentage points, and the accuracy (Precision) from 93.3% to 95.2%. This module enhances the perception of fuzzy edge objects through explicit spatial difference feature extraction and dynamic fusion. The model achieves an mAP50-95 score of 69.3%, demonstrating that its multi-scale feature integration approach successfully alleviates recognition inconsistencies arising from size discrepancies among underwater objects. After the introduction of the UWA module, the mAP50 of the model HAUOD reaches 95.1% and the recall (R) is improved to 0.897, which verifies the synergistic effect of multi-scale contextual modeling and noise suppression. The mAP50-95 is improved from 69.3% to 70.0%, which demonstrates that the UWA enhances the robustness of detection under the stringent IoU threshold by the hierarchical null convolutional group. The accuracy (96.3%) reaches the highest value, demonstrating that the dynamic gating residual mechanism effectively balances noise suppression and object feature retention.

3.6. Performance Comparison

To assess the effectiveness of HAUOD, we compared five YOLO models, namely YOLOv5n, YOLOv8n, YOLOv8s, YOLOv10n, and YOLOv11n [57], from which it can be seen that HAOUD has a mAP of 95.1%, which is significantly better than that of the baseline model YOLOv8n (86.8%) by 8.3 percentage points and is significantly better than the other models. The mAP50-95 metric (70.0%) is improved by 4.1 percentage points compared to YOLOv8n (65.9%), which indicates that it is more robust under strict IoU thresholds. The recall rate (89.7%) and precision rate (96.3%) are both leading, verifying the dual suppression effect of multi-strategy preprocessing and attention mechanisms on leakage and false detection. At present, there are many images detection methods based on deep learning. In order to further test the effectiveness of HAUOD, we compared the performance of Faster R-CNN and SSD. The experimental results are shown in Table 3.

3.7. Detection Results

The detection results are obtained according to the models of the comparison test, and the specific results are shown in Figure 9. In summary, the HAUOD model has very obvious advantages over other models in the detection of underwater sonar images.

In order to verify the rationality and effectiveness of the SonarMosaic module, we can visualize the processing results of this module and compare the visualization results with the initial image, as shown in Figure 10. And in order to more intuitively test the pre-processing results, we adopt the Unified quality Assessment method for Sonar Imaging and Processing (UASIP) [58] to evaluate the performances. With the UASIP method, the initial image’s quality score is

55.78

, and the pre-processing image’s quality score is

59.02

, which is higher than the initial image’s. The effectiveness of the preprocessing module is verified through the above test results.

4. Discussion

This research presents the HAUOD framework, which achieves marked performance enhancements in sonar-based underwater object recognition through collaborative optimization of integrated system components. The evaluation outcomes demonstrate that relative to the original YOLOv8 framework, HAUOD achieves significant performance enhancements: mean Average Precision (mAP) increases to 95.1% (+8.3%) while mAP50-95 rises to 70.0% (+4.1%), with corresponding recall and precision rates reaching 0.897 and 96.3%, respectively. These improvements confirm its enhanced effectiveness in low-contrast conditions, noisy underwater environments, and multi-scale object detection scenarios, surpassing the baseline model’s capabilities across critical operational parameters.

The three-level cascade preprocessing (CLAHE enhancement, non-local mean denoising, and frequency domain band-pass filtering) with multi-level preprocessing and data enhancement significantly improves the object signal-to-noise ratio; in particular, the frequency domain band-pass filtering effectively highlights the small objects and fuzzy edge features by suppressing the low-frequency background interference. Combined with the SonarMosaic data enhancement strategy, the model’s adaptability to complex noise environments is enhanced while preserving object integrity. Experiments show that only the introduction of the preprocessing module can improve the mAP by 3%, which verifies the effectiveness of the joint air-frequency domain optimization strategy.

The dynamic feature fusion mechanism of the C2fD module uses explicit spatial differential feature extraction to strengthen the edge response through the Sobel operator, while the dynamic weight assignment mechanism (DynamicFusion) mitigates the interference of noisy edges by adaptively fusing the base features with the differential features. The Enhanced ECA module uses global channel interaction modeling with the SiLU–Sigmoid hybrid activation strategy to further suppress the invalid response of the noisy channel. The ablation experiments show that the introduction of the C2fD module leads to an additional 4.5% mAP enhancement, confirming its key role in feature enhancement and noise suppression.

The Underwater Multi-scale Contextual Attention (UWA) mechanism captures multi-granularity contextual information through hierarchical null convolution sets and combines it with two-dimensional attention (channel-space synergy) to enhance the object region response. The dynamic gating residual mechanism (

λ

= 0.38–0.67) adaptively balances the contributions of original and enhanced features and significantly improves the detection sensitivity of small objects in complex backgrounds. Experiments show that the UWA module further improves the mAP by 0.8% and especially exhibits stronger robustness under strict IoU thresholds (mAP50-95).

5. Conclusions

In this paper, a high-accuracy underwater object detection model based on the YOLOv8 model is proposed to address the challenges of low contrast, fuzzy edges, high-frequency noise, and small object leakage detection in underwater SAS images. Through the collaborative optimization of multi-modules, the model demonstrates a significant advantage in complex underwater scenarios. Experiments on the SCTD dataset show that the HAUOD algorithm achieves a mAP of 95.1%, which is 8.3% points higher than that of the baseline model YOLOv8n (86.8%), and the recall and precision reach 89.7% and 96.3%, respectively. Compared with mainstream models (e.g., YOLOv5s, YOLOv8s), HAUOD achieves a balance between precision and efficiency while maintaining a lightweight design (FLOPs of only 10.5 G), providing a viable solution for edge device deployment. Subsequent research efforts will prioritize enhancing computational responsiveness under challenging operational conditions.

Author Contributions

Conceptualization, J.S., D.X., L.Q., Z.X. and J.Z.; data curation, J.S. and L.Q.; formal analysis, J.S., D.X. and Z.X.; funding acquisition, Z.X. and J.Z.; investigation, J.S., L.Q., Z.X., L.L. and J.Z.; methodology, J.S., D.X., Z.X. and L.L.; project administration, Z.X., L.L. and J.Z.; resources, Z.X., L.L. and J.Z.; software, J.S., D.X. and L.Q.; supervision, Z.X., L.L. and J.Z.; validation, J.S., D.X., L.Q. and Z.X.; visualization, J.S.; writing—original draft, J.S., D.X., L.Q., Z.X. and L.L.; writing—review and editing, J.S., D.X., L.Q., Z.X., L.L. and J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Fujian Provincial Natural Science Foundation of China (No. 2024J01120); the National Science Foundation of Xiamen, China (No. 3502Z202372013); the Open Project of the Key Laboratory of Underwater Acoustic Communication and Marine In-formation Technology (Xiamen University) of the Ministry of Education, China (No. UAC202304); the Fujian Province Young and Middle-aged Teacher Education Research Project (No. JAT220182); the Jimei University Startup Research Project (No. ZQ2022015); the Scientific Research Foundation of Jimei University (No. ZP2023010); Xiamen Ocean and Fishery Development Special Fund Project (No. 21CZB013HJ15); and the Xiamen Key Laboratory of Marine Intelligent Terminal R&D and Application (No. B2024008).

Data Availability Statement

The original contributions presented in the study are included in the article. Further inquiries can be directed to the corresponding author. The datasets used in this paper are included in the papers referenced.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Xu, Z.; Xu, D.; Lin, L.; Song, L.; Song, D.; Sun, Y.; Chen, Q. Integrated object detection and communication for synthetic aperture radar images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 294–307. [Google Scholar] [CrossRef]
Zhang, X.; Tan, C.; Ying, W. An imaging algorithm for multireceiver synthetic aperture sonar. Remote Sens. 2019, 11, 672. [Google Scholar] [CrossRef]
Zhang, X.; Yang, P.; Sun, H. Frequency-domain multireceiver Synthetic Aperture Sonar imagery with Chebyshev polynomials. Electron. Lett. 2022, 58, 995–998. [Google Scholar] [CrossRef]
Anitha, U.; Malarkkan, S.; Jebaselvi, G.A.; Narmadha, R. Sonar image segmentation and quality assessment using prominent image processing techniques. Appl. Acoust. 2019, 148, 300–307. [Google Scholar] [CrossRef]
Tan, C.; Zhang, X.; Yang, P.; Sun, M. A novel sub-bottom profiler and signal processor. Sensors 2019, 19, 5052. [Google Scholar] [CrossRef] [PubMed]
Zhang, X.; Yang, P. Back projection algorithm for multi-receiver synthetic aperture sonar based on two interpolators. J. Mar. Sci. Eng. 2022, 10, 718. [Google Scholar] [CrossRef]
Ferguson, B.G.; Wyber, R.J. Application of acoustic reflection tomography to sonar imaging. J. Acoust. Soc. Am. 2005, 117, 2915–2928. [Google Scholar] [CrossRef]
Zhang, X.; Yang, P. An improved imaging algorithm for multi-receiver SAS system with wide-bandwidth signal. Remote Sens. 2021, 13, 5008. [Google Scholar] [CrossRef]
Kim, J.; Song, S.; Yu, S.C. Denoising auto-encoder based image enhancement for high resolution sonar image. In Proceedings of the 2017 IEEE Underwater Technology (UT), Busan, Republic of Korea, 21–24 February 2017; pp. 1–5. [Google Scholar]
Chai, Y.; Yu, H.; Xu, L.; Li, D.; Chen, Y. Deep learning algorithms for sonar imagery analysis and its application in aquaculture: A review. IEEE Sens. J. 2023, 23, 28549–28563. [Google Scholar] [CrossRef]
Zhang, X.; Yang, P.; Wang, Y.; Shen, W.; Yang, J.; Wang, J.; Ye, K.; Zhou, M.; Sun, H. A novel multireceiver SAS RD processor. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4203611. [Google Scholar] [CrossRef]
Cervenka, P.; De Moustier, C. Sidescan sonar image processing techniques. IEEE J. Ocean. Eng. 2002, 18, 108–122. [Google Scholar] [CrossRef]
Chang, Y.C.; Hsu, S.K.; Tsai, C.H. Sidescan sonar image processing: Correcting brightness variation and patching gaps. J. Mar. Sci. Technol. 2010, 18, 1. [Google Scholar] [CrossRef]
Zhang, X.; Yang, P.; Feng, X.; Sun, H. Efficient imaging method for multireceiver SAS. IET Radar. Sonar Navig. 2022, 16, 1470–1483. [Google Scholar] [CrossRef]
Tian, H.; Guo, S.; Zhao, P.; Gong, M.; Shen, C. Design and implementation of a real-time multi-beam sonar system based on FPGA and DSP. Sensors 2021, 21, 1425. [Google Scholar] [CrossRef] [PubMed]
Zhang, Z.; Wu, R.; Li, D.; Lin, M.; Xiao, S.; Lin, R. Image stitching and target perception for autonomous underwater vehicle-collected side-scan sonar images. Front. Mar. Sci. 2024, 11, 1418113. [Google Scholar] [CrossRef]
Sun, S.; Xu, Z.; Cao, X.; Zheng, J.; Yang, J.; Jin, N. A high-performance and lightweight maritime target detection algorithm. Remote Sens. 2025, 17, 1012. [Google Scholar] [CrossRef]
Zheng, J.; Zhao, S.; Xu, Z.; Zhang, L.; Liu, J. Anchor boxes adaptive optimization algorithm for maritime object detection in video surveillance. Front. Mar. Sci. 2023, 10, 1290931. [Google Scholar] [CrossRef]
Yu, S. Sonar image target detection based on deep learning. Math. Probl. Eng. 2022, 2022, 5294151. [Google Scholar] [CrossRef]
Neupane, D.; Seok, J. A review on deep learning-based approaches for automatic sonar target recognition. Electronics 2020, 9, 1972. [Google Scholar] [CrossRef]
Zhu, P.; Isaacs, J.; Fu, B.; Ferrari, S. Deep learning feature extraction for target recognition and classification in underwater sonar images. In Proceedings of the 2017 IEEE 56th Annual Conference on Decision and Control (CDC), Melbourne, Australia, 12–15 December 2017; pp. 2724–2731. [Google Scholar]
Zhang, X.; Yang, P.; Wang, Y.; Shen, W.; Yang, J.; Ye, K.; Zhou, M.; Sun, H. LBF-based CS algorithm for multireceiver SAS. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1502505. [Google Scholar] [CrossRef]
Bouvrie, J. Notes on Convolutional Neural Networks; Center for Biological and Computational Learning: Cambridge, MA, USA, 2006; pp. 38–44. Available online: https://web-archive.southampton.ac.uk/cogprints.org/5869/1/cnn_tutorial.pdf (accessed on 22 November 2006).
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A survey of convolutional neural networks: Analysis, applications, and prospects. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 6999–7019. [Google Scholar] [CrossRef] [PubMed]
Jocher, G.; Stoken, A.; Borovec, J.; Changyu, L.; Hogan, A.; Diaconu, L.; Poznanski, J.; Yu, L.; Rai, P.; Ferriday, R.; et al. ultralytics/YOLOv5: v3. 0; Zenodo: Geneva, Switzerland, 2020. [Google Scholar]
Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOx: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
Hussain, M. YOLOv5, YOLOv8 and YOLOv10: The go-to detectors for real-time vision. arXiv 2024, arXiv:2407.02988. [Google Scholar]
Sohan, M.; Sai Ram, T.; Rami Reddy, C.V. A review on YOLOv8 and its advancements. In Proceedings of the International Conference on Data Intelligence and Cognitive Informatics, Tirunelveli, India, 18–20 December 2024; pp. 529–545. [Google Scholar]
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
Jie, F.; Pingbo, W. Target detection in sonar image based on Faster RCNN. In Proceedings of the 2020 International Conference on Information Science, Parallel and Distributed Systems (ISPDS), Xi’an, China, 14–16 August 2020; pp. 25–30. [Google Scholar]
Carbó Mestre, P. Vessel Detection in Synthetic Aperture Radar Images Using Faster R-CNN Models: Advanced Monitoring Techniques to Improve Fisheries Management. 2023. Available online: https://openaccess.uoc.edu/server/api/core/bitstreams/5a5e48b3-39a9-49f2-a7e2-b10fc4aecf20/content (accessed on 20 June 2023).
Fan, Z.; Xia, W.; Liu, X.; Li, H. Detection and segmentation of underwater objects from forward-looking sonar based on a modified mask RCNN. Signal Image Video Process. 2021, 15, 1135–1143. [Google Scholar] [CrossRef]
Zhang, X. An efficient method for the simulation of multireceiver SAS raw signal. Multimed. Tools Appl. 2024, 83, 37351–37368. [Google Scholar] [CrossRef]
Xiong, C.; Lian, S.; Chen, W. An ensemble method for automatic real-time detection, evaluation and position of exposed subsea pipelines based on 3D real-time sonar system. J. Civ. Struct. Health Monit. 2023, 13, 485–504. [Google Scholar] [CrossRef]
Zhao, L.; Fu, L.; Yun, Q.; Qiao, Y.; Jin, J.; Zhu, X. Advanced Underwater Object Detection in Sonar Images: An Exploration Utilizing Innovative Detector and Transfer Learning. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5019967 (accessed on 13 November 2024).
Williams, D.P.; Brown, D.C. Spectral partitioning of synthetic aperture sonar imagery for improved ATR. IEEE Geosci. Remote Sens. Lett. 2025, 22, 1501605. [Google Scholar] [CrossRef]
Heng, Z.; Shuping, H.; Jingfeng, X.; Yaohui, H.; Yubo, H. A review of intelligent detection methods for underwater targets in sonar images. In Proceedings of the 2023 IEEE 7th Information Technology and Mechatronics Engineering Conference (ITOEC), Chongqing, China, 4–6 March 2023; Volume 7, pp. 1462–1468. [Google Scholar]
Pizer, S.M. Adaptive histogram equalization and its variations. Comput. Graph. Image Process. 1977, 6, 184–195. [Google Scholar] [CrossRef]
Goossens, B.; Luong, H.; Pizurica, A.; Philips, W. An improved non-local denoising algorithm. In Proceedings of the 2008 International Workshop on Local and Non-Local Approximation in Image Processing (LNLA 2008), Lausanne, Switzerland, 23 August 2008; pp. 143–156. [Google Scholar]
Jain, P.; Tyagi, V. Spatial and frequency domain filters for restoration of noisy images. IETE J. Educ. 2013, 54, 108–116. [Google Scholar] [CrossRef]
Zhang, X.; Yang, P.; Sun, M. Experiment results of a novel sub-bottom profiler using synthetic aperture technique. Curr. Sci. 2022, 122, 00113891. [Google Scholar] [CrossRef]
Vairalkar, M.K.; Nimbhorkar, S. Edge detection of images using Sobel operator. Int. J. Emerg. Technol. Adv. Eng 2012, 2, 291–293. [Google Scholar]
Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
Zhou, K.; Zhang, M.; Wang, H.; Tan, J. Ship detection in SAR images based on multi-scale feature extraction and adaptive feature fusion. Remote Sens. 2022, 14, 755. [Google Scholar] [CrossRef]
Zhang, X.; Yang, P.; Cao, D. Synthetic aperture image enhancement with near-coinciding Nonuniform sampling case. Comput. Electr. Eng. 2024, 120, 109818. [Google Scholar] [CrossRef]
Griffin, L.D. Mean, median and mode filtering of images. Proc. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 2000, 456, 2995–3004. [Google Scholar] [CrossRef]
D’Haeyer, J.P. Gaussian filtering of images: A regularization approach. Signal Process. 1989, 18, 169–181. [Google Scholar] [CrossRef]
Franke, M.; Degen, J. The Softmax Function: Properties, Motivation, and Interpretation. 2023. Available online: https://osf.io/preprints/psyarxiv/vsw47_v1 (accessed on 28 September 2023).
Zhang, X.; Yang, P.; Dai, X. Focusing multireceiver SAS data based on the fourth-order legendre expansion. Circuits Syst. Signal Process. 2019, 38, 2607–2629. [Google Scholar] [CrossRef]
Han, J.; Moraga, C. The influence of the sigmoid function parameters on the speed of backpropagation learning. In Proceedings of the International Workshop on Artificial Neural Networks, Perth, WA, Australia, 27 November–1 December 1995; pp. 195–201. [Google Scholar]
Jocher, G.; Stoken, A.; Borovec, J.; Changyu, L.; Hogan, A.; Chaurasia, A.; Diaconu, L.; Ingham, F.; Colmagro, A.; Ye, H.; et al. ultralytics/YOLOv5: v4. 0-nn. SiLU () Activations, Weights & Biases Logging, PyTorch Hub Integration; Zenodo: Geneva, Switzerland, 2021. [Google Scholar]
Lee, M. Gelu activation function in deep learning: A comprehensive mathematical analysis and performance. arXiv 2023, arXiv:2305.12073. [Google Scholar]
Pydimarry, S.A.; Khairnar, S.M.; Palacios, S.G.; Sankaranarayanan, G.; Hoagland, D.; Nepomnayshy, D.; Nguyen, H.P. Evaluating Model Performance with Hard-Swish Activation Function Adjustments. arXiv 2024, arXiv:2410.06879. [Google Scholar]
Ning, M.; Tang, J.; Wu, H.; Zhong, H.; Ma, M. An efficient subswath-subband Chirp Z-Transform algorithm for multiple-receiver SAS considering the differential range curvature. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5207319. [Google Scholar] [CrossRef]
Zhang, P.; Tang, J.; Zhong, H.; Ning, M.; Liu, D.; Wu, K. Self-trained target detection of radar and sonar images using automatic deep learning. IEEE Trans. Geosci. Remote Sens. 2021, 60, 4701914. [Google Scholar] [CrossRef]
Khanam, R.; Hussain, M. YOLOv11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar]
Cai, B.; Chen, W.; Zhang, J.; Ur Rehman Junejo, N.; Zhao, T. Unified no-reference quality assessment for sonar imaging and processing. IEEE Trans. Geosci. Remote Sens. 2025, 63, 3524835. [Google Scholar] [CrossRef]

Figure 1. Sound waves are reflected multiple times to produce an artifact sonar image.

Figure 2. The proposed HAUOD algorithm.

Figure 3. SonarMosaic multi-strategy sonar preprocessing flow.

Figure 4. C2fD module workflow.

Figure 5. Dynamic feature fusion process.

Figure 6. Enhanced ECA workflow diagram.

Figure 7. Underwater multi-scale attention mechanism.

Figure 8. Specific process of each part of the attention mechanism.

Figure 9. The above figure shows the results using different models on the SCTD dataset. From top to bottom and from left to right are HAUOD (

97 %

), YOLOv5n (

91 %

), YOLOv8n (

88 %

), YOLOv8s (

93 %

), YOLOv10n (

92 %

), and YOLOv11n (

89 %

).

Figure 9. The above figure shows the results using different models on the SCTD dataset. From top to bottom and from left to right are HAUOD (

97 %

), YOLOv5n (

91 %

), YOLOv8n (

88 %

), YOLOv8s (

93 %

), YOLOv10n (

92 %

), and YOLOv11n (

89 %

).

Figure 10. Sonar image visualization before and after preprocessing. (a) Initial image (Quality Score is

55.78

), (b) preprocessed image (quality score is

59.02

).

Figure 10. Sonar image visualization before and after preprocessing. (a) Initial image (Quality Score is

55.78

), (b) preprocessed image (quality score is

59.02

).

Table 1. Performance comparison of C2fD in backbone networks at different integration positions.

Model	Integration Location	mAP (%)	FLOPs (G)	Params (m)	R (%)	P (%)	mAP50-95 (%)
1	[1, 0, 0, 0]	91.9	8.4	3.017	80.2	93.4	65.9
2	[0, 1, 0, 0]	90.6	8.8	3.100	86.1	83.9	66.7
3	[0, 0, 1, 0]	92.0	8.9	3.382	83.1	95.0	68.9
4	[0, 0, 0, 1]	94.0	8.6	3.690	88.3	94.0	68.3
5	[1, 0, 0, 1]	94.3	9.0	3.701	89.0	95.2	69.3
6	[0, 1, 0, 1]	91.1	9.4	3.784	78.2	95.4	67.1
7	[0, 0, 1, 1]	90.0	9.5	4.066	78.3	94.2	66.1
8	[1, 0, 1, 1]	92.6	9.8	4.077	84.2	91.1	68.4
9	[0, 1, 1, 1]	91.0	10.2	4.161	82.0	90.2	65.8
10	[1, 1, 1, 1]	89.1	10.6	4.172	77.4	92.8	67.0

The optimal value of each indicator is bolded and underlined.

Table 2. Ablation experiments.

SonarMosaic	C2fD	UWA	mAP50 (%)	R (%)	Precision (%)	mAP50-95 (%)
×	×	×	86.8	72.1	86.9	61.5
✓	×	×	89.8	85.1	93.3	64.8
×	✓	×	91.5	82.6	94.8	67.4
×	×	✓	89.7	78.0	92.5	64.2
✓	✓	×	94.3	89.0	95.2	69.3
✓	✓	✓	95.1	89.7	96.3	70.0

The bolded and underlined parts represent the optimal values for each indicator.

Table 3. Model comparison between multiple models vs. HAUOD.

Model	mAP(%)	FLOPs (G)	Params (m)	R (%)	P (%)	FPS	mAP50-95 (%)
YOLOv5n	87.7	7.1	2.504	75.0	92.0	86.12	64.9
YOLOv8n	86.8	8.2	3.011	72.1	86.9	81.32	61.5
YOLOv8s	88.9	28.4	11.127	75.5	91.2	72.36	66.3
YOLOv10n	83.6	8.2	2.696	67.9	88.4	85.09	60.5
YOLOv11n	90.0	6.3	2.583	85.8	89.5	85.61	62.2
Faster R-CNN	81.6	90.8	41.31	66.3	82.9	20.47	50.1
SSD	91.1	137.8	24.38	81.9	91.6	48.71	60.9
HAUOD	95.1	10.5	5.612	89.7	96.3	75.34	70.0

The bolded and underlined parts represent the optimal values for each indicator.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Su, J.; Xu, D.; Qiu, L.; Xu, Z.; Lin, L.; Zheng, J. A High-Accuracy Underwater Object Detection Algorithm for Synthetic Aperture Sonar Images. Remote Sens. 2025, 17, 2112. https://doi.org/10.3390/rs17132112

AMA Style

Su J, Xu D, Qiu L, Xu Z, Lin L, Zheng J. A High-Accuracy Underwater Object Detection Algorithm for Synthetic Aperture Sonar Images. Remote Sensing. 2025; 17(13):2112. https://doi.org/10.3390/rs17132112

Chicago/Turabian Style

Su, Jiahui, Deyin Xu, Lu Qiu, Zhiping Xu, Lixiong Lin, and Jiachun Zheng. 2025. "A High-Accuracy Underwater Object Detection Algorithm for Synthetic Aperture Sonar Images" Remote Sensing 17, no. 13: 2112. https://doi.org/10.3390/rs17132112

APA Style

Su, J., Xu, D., Qiu, L., Xu, Z., Lin, L., & Zheng, J. (2025). A High-Accuracy Underwater Object Detection Algorithm for Synthetic Aperture Sonar Images. Remote Sensing, 17(13), 2112. https://doi.org/10.3390/rs17132112

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A High-Accuracy Underwater Object Detection Algorithm for Synthetic Aperture Sonar Images

Abstract

1. Introduction

2. The Proposed HAUOD Algorithm

2.1. Overall Architecture

2.2. SonarMosaic Strategy

2.3. The Proposed C2fD Module

2.4. Underwater Multi-Scale Attention Mechanism

3. Results

3.1. Experimental Environments

3.2. Dataset

3.3. Evaluate Metrics

3.4. Performance Analysis

3.5. Ablation Experiments

3.6. Performance Comparison

3.7. Detection Results

4. Discussion

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI