1. Introduction
With the continuous growth of marine resource exploration and underwater environmental monitoring, acoustic target detection, as a core approach to underwater perception, has become increasingly important in deep-sea exploration. Whereas optical and electromagnetic waves attenuate severely in water, sonar systems exploit the propagation characteristics of sound in the medium, combining high acoustic wave velocity with sub-decibel attenuation over kilometer-scale distances, and have therefore become an irreplaceable technology for underwater target detection [1]. However, in practical scenarios, underwater acoustic channels are strongly affected by coupled multi-physical-field effects [2]: marine ambient noise, narrowband harmonic noise induced by platform mechanical vibrations, and multipath reverberation caused by seafloor/seawater interface reflections. These composite disturbances produce characteristic degradations in sonar images: joint spatio-temporal deterioration of the signal-to-noise ratio, geometric distortion caused by beamforming anomalies, and broadening of the edge spread function at target scatterers. Together, these effects severely constrain feature extraction and classification accuracy for underwater targets.
In recent years, systematic academic research has produced significant progress on the technical challenges of underwater acoustic target detection. Roe et al. [3] developed a methodology employing Support Vector Machines (SVMs) with an integrated highlight-shadow detection framework, advancing feature interpretation in synthetic aperture sonar (SAS) imagery. Lopera et al. [4] proposed a cascade-based solution for identifying man-made linear objects (MLOs) on the seabed through multi-scale feature fusion: joint noise suppression and edge enhancement via anisotropic diffusion filtering, coupled with precise highlight/shadow segmentation using fuzzy morphological operators. At the feature engineering level, they constructed a 32-dimensional hybrid feature vector incorporating morphological topology parameters, regional statistics, contour moment descriptors, and minimum enclosing ellipse features, and integrated it with a Bayesian probabilistic classifier using Markov Chain Monte Carlo (MCMC) methods, demonstrating a substantial improvement in average classification accuracy on both field-collected and simulated datasets. Fakiris et al. [5] pioneered the application of Independent Component Analysis (ICA) to side-scan sonar (SSS) target detection, establishing a feature decoupling mechanism based on blind source separation that effectively suppresses seabed reverberation; their framework achieved a 77% recall rate for automatic identification of small targets (diameter < 0.5 m) in complex seabed sediment environments.
In the field of sonar image target detection, traditional computer vision methods based on coupled feature extractors and classifiers historically dominated. However, as marine observation scenarios grow increasingly heterogeneous (e.g., mixed seabed substrates and dynamic hydroacoustic perturbations), the inherent constraints of manual feature engineering, particularly its limited target representation capacity and generalization performance, have become apparent. A pivotal transition occurred in 2012, when AlexNet achieved breakthrough performance on the ImageNet Large Scale Visual Recognition Challenge with its deep convolutional neural network (CNN) architecture, propelling a paradigm shift toward deep learning in underwater optical and acoustic image interpretation. This approach uses end-to-end training to autonomously extract hierarchical abstract features from large-scale annotated datasets, overcoming the intrinsic limitations of conventional methods, including subjective feature design and weak environmental adaptability.
At present, deep learning-based target detection follows three paradigms: one-stage, two-stage, and transformer-based detection. Two-stage models first extract candidate boxes to identify regions of interest (RoIs) and then perform recognition and localization; they are accurate but slow, making them ill-suited to real-time detection. This limitation drove the development of one-stage algorithms such as YOLO, whose accuracy on public benchmarks has gradually approached that of two-stage algorithms as the family has evolved.
YOLO has been widely adopted across many fields, including sonar image target detection. Zheng Linhan's ScEMA-YOLOv8 model for underwater sonar target detection uses an EMA attention mechanism and an SPPFCSPC pooling module to better extract features from blurred targets, and adds detection layers and residual connections to improve the detection and localization of small targets; however, it is not optimized for targets with scarce features, and simply adding a detection layer is not enough to handle scenes with drastic changes in target scale [6]. Xie Guohao and Chen Zhe's DA-YOLOv7 model for underwater sonar image target recognition introduces modules such as an omnidirectional convolutional channel prior convolutional attention efficient layer aggregation network, spatial pyramid pooling channel shuffle, and a ghost shuffle convolutional enhanced layer aggregation network, which reduce computational load and improve the capture of local features and critical information; it nonetheless lacks optimization for strong noise interference in sonar images and for small-target detection [7]. Meng Junxia's team [8] used CycleGAN for data augmentation and integrated a global attention mechanism into the feature extraction stage of YOLOv8, achieving some engineering success.
Based on the above-mentioned challenges in current sonar target detection and the strategies used by previous researchers, this study carries out the following work:
- (1) The experiments collected a sonar dataset with two target categories. To address data scarcity, ADA-StyleGAN3, which excels at generating high-quality sonar images, was introduced. The implementation had two stages: first, pre-training models on high-fidelity synthetic data from ADA-StyleGAN3; then, fine-tuning the network parameters on real underwater data. Experiments showed that this approach effectively improved detection accuracy.
- (2) FASFFHead, a detection head with four scales, is proposed. It uses ASFF for feature fusion, forming a cross-scale adaptive weighting mechanism, and its differentiable spatial pyramid pooling for multi-level feature integration better preserves the semantic information of tiny targets.
- (3) The Edge-Contour Attention Mechanism (EEA) proposed in this study incorporates a dual-branch gradient guidance architecture that substantially enhances gradient responses at target edges through differentiated gradient information processing. It integrates tightly with multi-scale deformable convolution, enabling adaptive focusing on critical contour regions while effectively suppressing boundary reverberation noise.
- (4) Through structural reorganization and optimization of the SC_Conv convolution module, this study constructs a cross-module collaborative architecture that integrates multi-level feature interaction and fusion with the C3K2 module, forming a feature extraction unit with a dual-path compression mechanism. The jointly optimized architecture achieves high-fidelity feature representation via a channel-spatial co-attention mechanism, suppressing inter-channel correlation redundancy and spatial information redundancy while improving feature extraction precision.
- (5) Focaler-IoU was adopted to replace the original loss function. By focusing on different regression samples, it enhances detection performance across tasks and addresses shortcomings of existing bounding box regression methods for various objects, improving detection in different scenarios (see the sketch after this list).
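As a reference for item (5), below is a minimal sketch of the Focaler-IoU remapping; the interval bounds d and u are tunable, and the values shown are illustrative defaults rather than the settings used in this study.

```python
import torch

def focaler_iou(iou: torch.Tensor, d: float = 0.0, u: float = 0.95) -> torch.Tensor:
    """Focaler-IoU remapping: IoU values below d are clipped to 0, values
    above u to 1, and values in [d, u] are rescaled linearly, so the
    regression loss focuses on a chosen difficulty range of samples."""
    return ((iou - d) / (u - d)).clamp(min=0.0, max=1.0)

# The bounding box regression loss then uses 1 - focaler_iou(iou)
# in place of the plain 1 - iou term.
```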
2. Data
2.1. Image Dataset Collection
The experiment was carried out in a lake in Wuhan, China, with complex bottom terrain, a depth of 5–15 m, and a muddy bottom that effectively prevents target burial. Two typical underwater targets were deployed: a suspended one and a bottom-resting one. The suspended target is a hollow sphere, 1 m in diameter, with holes. The bottom-resting target is a cylinder with an oval head, 2 m long and 0.5 m in diameter. A synthetic aperture sonar operating at 220–260 kHz with a 3 cm × 2 cm imaging resolution was used for target detection and recognition.
Figure 1 presents a sonar image. The left and right sections show acoustic wave detection and imaging. The central shaded area indicates signal loss due to an unreceived bottom echo. Three red-marked regions show bottom-resting targets.
Figure 2 shows a suspended-target sonar image with one red-marked suspended target.
In the experiment, the sonar collected single-channel data. The collected sonar images were converted to pseudo-color to enhance visual interpretation, highlight details, and distinguish different intensity ranges. After rigorous classification and filtering, a dataset of 784 raw sonar images (3600 × 1600 resolution) was built, consisting of 421 images containing 545 bottom-resting targets, 366 images containing 397 suspended targets, and 30 images without any targets.
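A minimal sketch of the pseudo-color conversion step is shown below, assuming the raw single-channel intensities are min-max normalized and mapped through an OpenCV colormap; the specific colormap used in this work is not stated, so COLORMAP_JET here is purely illustrative.

```python
import cv2
import numpy as np

def to_pseudocolor(gray: np.ndarray) -> np.ndarray:
    """Map a single-channel sonar intensity image to a 3-channel
    pseudo-color image so that different intensity ranges are
    visually distinguishable."""
    # Rescale to 8-bit before applying the colormap.
    gray8 = cv2.normalize(gray, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.applyColorMap(gray8, cv2.COLORMAP_JET)
```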
2.2. Image Dataset Generation
Faced with insufficient training data (only 784 original sonar images), this study employs data augmentation to expand the dataset. Traditional image augmentation methods include geometric transformations such as flipping, rotation, scaling, and translation; color transformations such as brightness and contrast adjustment; and adding noise or applying filtering (a simple pipeline is sketched after this paragraph). However, these conventional approaches expand data only through linear transformations or simple combinations and cannot generate novel features or semantic content beyond the original data distribution. They may even compromise image authenticity (e.g., salt-and-pepper noise and excessive sharpening) and cause models to learn invalid features. Compared to such methods, Generative Adversarial Networks (GANs) [9,10,11] offer distinct advantages: through unsupervised adversarial optimization, GANs adapt particularly well to the characteristics of sonar imagery, namely low-resolution textures, high-noise environments, and specialized underwater terrain features. Although existing studies have explored multi-scale generation networks [12] and simulation-integrated synthesis methods [13], three critical challenges persist. First, sonar image generation must integrate physical propagation characteristics: traditional DCGAN struggles to reproduce realistic noise patterns, Pix2Pix [14,15] requires impractically large paired datasets, and CycleGAN [16,17,18] lacks physical modeling of acoustic wave propagation. Second, multi-dimensional domain gaps exist between synthetic and real data, spanning background features, sensor parameters, and material properties, which compromise model generalization at deployment. Third, current generative models fail to balance semantic consistency with physical fidelity.
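For concreteness, the traditional augmentations listed above can be expressed as a simple pipeline; the torchvision parameters below are illustrative only and are not settings evaluated in this study.

```python
from torchvision import transforms

# Classic geometric and color augmentations; each operation only
# re-samples the existing data distribution.
traditional_aug = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])
```

Such pipelines cannot synthesize genuinely new target appearances, which is what motivates the GAN-based route described next.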
To address these challenges, we introduce the ADA-StyleGAN3 method. ADA-StyleGAN3, proposed by Li Liang's team, is a few-shot sonar image generation network that integrates an adaptive discriminator augmentation (ADA) strategy into StyleGAN3 [19]. Compared to conventional StyleGAN3, this approach resolves the leakage of traditional data augmentation into the generator, ensuring high stability during dataset generation under limited samples while effectively preventing overfitting.
Despite these efforts, generative networks such as GANs, which synthesize complex scenes autonomously by learning data distributions, inevitably exhibit distribution discrepancies between synthetic and real data. These discrepancies manifest in background, lighting, perspective, and other contextual variations. When a model is trained on both real and synthetic data and tested on real data, such differences degrade the detector's generalization and introduce generalization error. Minimizing the gap between the model's error on synthetic data (the source domain) and on real data (the target domain) is therefore crucial. To this end, this study adopts a transfer learning approach based on model parameter knowledge transfer, aiming to reduce generalization error and the model's reliance on extensive real-world data. Specifically, the ADA-StyleGAN3 generative network synthesizes large-scale approximate data for pre-training under the limited sample size of the sonar dataset, and transfer training is then conducted on the pre-trained model using the original real dataset to enhance generalization.
Building on this framework, this study employs ADA-StyleGAN3 to synthesize sonar images. The workstation runs Ubuntu 18.04.6, with Python as the programming language and PyTorch (v1.10.0) used to deploy and train the deep learning network, supported by two NVIDIA A30 GPUs with 24 GB of memory each. Training uses the stylegan3-t configuration with the Adam optimizer; the generator's initial learning rate is 0.003 and the discriminator's is 0.0015. The batch size is 16, the discriminator is trained for 1000 kimg, and the ADA target value is 0.4. Finally, 900 synthetic samples with ground-truth labels were obtained after a rigorous manual screening and labeling process: 450 images each of suspended and bottom-resting targets, with one target per image.
4. Experiment Results and Analysis
4.1. Experimental Results
In this experiment, the sonar image dataset was collected and organized from tests on Mulan Lake in Wuhan, comprising two categories: bottom-resting and suspended targets. It contains 544 training images and 120 images each for validation and testing. The experiments ran on Windows 11 with an NVIDIA RTX 2000 Ada Generation Laptop GPU (24 GB VRAM), using the PyTorch 2.0.0+cu118 deep learning framework and Python 3.10.
First, ADA-StyleGAN3 was used to generate 450 images for each target type, and these were used to pre-train the SFE-YOLO network for 100 epochs with an initial learning rate of 0.01 and an SGD momentum of 0.937. The real dataset was then used to fine-tune the pre-trained weights for 150 epochs with the same parameters. The specific experimental environment and configuration are shown in Table 1.
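As a sketch of this two-stage schedule, the Ultralytics training API could be driven as follows; SFE-YOLO is a custom network, so the model definition file and dataset YAML names here are hypothetical placeholders, not artifacts shipped with this paper.

```python
from ultralytics import YOLO

# Stage 1: pre-train on the ADA-StyleGAN3 synthetic images.
model = YOLO("sfe-yolo.yaml")  # hypothetical SFE-YOLO model definition
model.train(data="synthetic_sonar.yaml", epochs=100,
            optimizer="SGD", lr0=0.01, momentum=0.937)

# Stage 2: fine-tune the pre-trained weights on the real sonar dataset.
model = YOLO("runs/detect/train/weights/best.pt")
model.train(data="real_sonar.yaml", epochs=150,
            optimizer="SGD", lr0=0.01, momentum=0.937)
```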
The model was evaluated using precision, recall, the F1 score, average precision (AP), mean average precision (mAP), and parameter count. Precision quantifies the proportion of true-positive predictions among all positive predictions. After setting an IoU threshold, predicted bounding boxes are matched with ground-truth boxes: a prediction is a true positive (TP) if it meets the IoU threshold and the predicted category is correct; a prediction of an object not present in the ground truth, or with the wrong category, is a false positive (FP); and a ground-truth object the model fails to detect is a false negative (FN). Precision is defined in Equation (14). Recall quantifies the proportion of correctly identified positive instances relative to the total number of actual positive samples; a higher recall indicates that the model detects a larger fraction of positive samples, demonstrating better detection performance. Its calculation is formally defined in Equation (15).
The F1 score is a critical metric in classification tasks, serving as a comprehensive evaluation of a model’s precision and completeness in detecting all targets. It harmonizes precision (accuracy of positive predictions) and recall (ability to identify all relevant instances), with values bounded between 0 (worst) and 1 (optimal). The calculation method is formally defined in Equation (16). Average Precision (AP) evaluates the model’s performance for individual categories, while mean Average Precision (mAP) assesses its overall effectiveness across all categories. The mAP is computed as the arithmetic mean of AP values over all categories. Their calculation methods are formally defined in Equations (17) and (18), respectively. Finally, FPS measures the detection speed of algorithms. The larger the FPS value, the faster the model detection speed and the better the real-time performance. The calculation method is shown in Equation (19).
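Equations (14)–(19) are not reproduced in this section; under the usual conventions, these metrics take the following standard forms (restated here for reference, which the paper's equations are expected to match):

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F1 = \frac{2 \cdot P \cdot R}{P + R},
```
```latex
AP = \int_0^1 P(R)\, dR, \qquad
mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i, \qquad
FPS = \frac{N_{\mathrm{frames}}}{T_{\mathrm{total}}}.
```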
The proposed SFE-YOLO model was evaluated on the self-collected dataset to assess its performance. After 150 training epochs, the model converged. As illustrated in Figure 12, box_loss quantifies the discrepancy between predicted and ground-truth bounding boxes, with lower values indicating higher detection accuracy; cls_loss measures the divergence between predicted and ground-truth class labels, where lower values reflect improved classification precision; and dfl_loss formulates continuous coordinate prediction as discrete probability distribution prediction, enabling more accurate coordinate localization, with lower values signifying enhanced prediction fidelity.
As shown in Figure 13, in the SFE-YOLO model, C denotes bottom-resting targets (objects settled on the bottom) and F denotes suspended targets (objects floating in the water column without settling). In the YOLOv11n network, Class 1 designates bottom-resting targets and Class 2 suspended targets. Comparing the left and right subfigures, SFE-YOLO demonstrates substantial improvements in precision and completeness under full-target detection conditions relative to the baseline model, and its performance curves are smoother, reflecting superior stability and robustness in dynamic environments.
As illustrated in Figures 14–16, the comparison between SFE-YOLO and the baseline YOLOv11n across three critical metrics (precision, recall, and mAP50) demonstrates that the enhanced model achieves substantial performance improvements in all evaluated dimensions. These results validate the effectiveness of the proposed architectural modifications in advancing both detection robustness and generalizability.
The final experimental results are visualized in Figure 17. The first column shows the original images, the second column the baseline detection results, and the third column the SFE-YOLO detection results. The first row shows that the baseline YOLOv11n produces substantially more false alarms for bottom-resting targets than SFE-YOLO. The second and third rows indicate that the baseline underperforms the improved model in detecting feature-ambiguous bottom-resting targets and yields less confident predictions for suspended targets. Notably, the fourth row highlights the enhanced model's superior localization accuracy.
4.2. Ablation Experiment
This paper uses YOLOv11n as the baseline for ablation experiments on the improved algorithm (all experiments are performed on the model pre-trained with the ADA-StyleGAN3-generated dataset), verifying the effectiveness of each improvement for sonar target detection by sequentially adding the individual modifications to the model.
First, as shown in Table 2, replacing the original detection head with the four-head FASFF detection head increases GFLOPs significantly and the parameter count by about 1.4 M, while Pre, R, and mAP50 increase by only 0.2%, 2.8%, and 0.4%, respectively. This is because sonar target detection is not simply small-scale target detection: sonar images suffer from low resolution and strong interference, and relying solely on FASFF's feature fusion and small-target detection head adds computational complexity that can harm performance. FASFFHead is nevertheless crucial. Used alone, its gain is modest relative to the added computation, but combined with our network it significantly enhances feature fusion and detection stability; as experiments 6 and 8 show, FASFFHead brings a qualitative improvement. Moreover, Table 3 indicates that models with more parameters than ours, but without FASFFHead, do not match its performance.
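To make the cross-scale adaptive weighting concrete, a generic ASFF-style fusion can be sketched as below. This is a simplified illustration assuming the three input feature maps have already been resized and projected to a common shape; it is not the exact FASFFHead implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASFFFusion(nn.Module):
    """Generic ASFF-style fusion of three feature maps of shape
    (B, C, H, W); the module learns per-pixel fusion weights."""
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convs produce one weight-logit map per input level.
        self.weight_convs = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in range(3)]
        )

    def forward(self, x0, x1, x2):
        # Stack the per-level weight logits: (B, 3, H, W).
        logits = torch.cat(
            [w(x) for w, x in zip(self.weight_convs, (x0, x1, x2))], dim=1
        )
        a = F.softmax(logits, dim=1)  # weights sum to 1 at every pixel
        return a[:, 0:1] * x0 + a[:, 1:2] * x1 + a[:, 2:3] * x2
```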
Second, when the EEA attention mechanism or the C3K2_Sc module is added separately, the parameter count decreases relative to the YOLOv11n baseline while Pre, R, and mAP50 improve significantly. This indicates that both modules reduce data redundancy and show greater sensitivity to sonar targets and stronger feature extraction for sonar images. When the two improvements are added simultaneously, groups 5, 6, and 7 all perform better, with Pre and mAP50 above 80% in each group; notably, the precision of group 7 reaches 88.4%, 12.2% higher than the baseline.
Finally, integrating the four improvements into the SFE-YOLO model yields a final precision of 92%, a recall of 90.3%, and an mAP50 of 89.7%, 12.7% higher than the baseline. This demonstrates the superior performance of SFE-YOLO in sonar target detection. In conclusion, the improved network achieves significant gains on all evaluation metrics while keeping the parameter count and computational cost within a practical, controllable range.
4.3. Comparative Experiment
For comparative analysis, this study evaluates multiple object detection networks, including Faster R-CNN, YOLOv11s, YOLOv7-tiny, YOLOv11n, YOLOv5s, and YOLOv8n, using precision (P), recall (R), mAP@0.5, parameter count, and FPS as evaluation metrics. The performance of the improved YOLOv11 network and the baseline models on the test set is summarized in Table 3 (all results in Table 3 were obtained by training directly on real data). Faster R-CNN, a two-stage algorithm, has a larger parameter size and computational load than the one-stage algorithms, resulting in slower detection; moreover, on a dataset dominated by small objects, its mAP@0.5 is 5.2% lower than that of the algorithm presented in this paper. Compared with the lightweight detectors YOLOv11n and YOLOv8n, our algorithm shows a slight increase in parameter size but a significant improvement on all detection metrics. Against the heavier YOLOv11s and YOLOv5s, it has roughly one-third of their parameter size yet achieves higher precision, recall, and FPS. SFE-YOLO thus surpasses the other models on multiple metrics while maintaining a relatively small parameter size, and with an FPS of 17 it satisfies general real-time detection requirements. The SFE-YOLO algorithm therefore enhances small-target detection accuracy in sonar images while effectively balancing detection accuracy and speed.
Table 3. Evaluation metric comparison of SFE-YOLO and other models.
Method | Pre/% | R/% | mAP50/% | Para/(M) | FPS
---|---|---|---|---|---
Faster R-CNN | 83.2 | 82.1 | 82.7 | 41.0 | 4
YOLOv11s | 85.3 | 86.3 | 88.1 | 9.4 | 15
YOLOv7-tiny | 81.3 | 80.5 | 80.9 | 6.1 | 22
YOLOv8n | 78.2 | 77.5 | 77.8 | 3.0 | 38
YOLOv5s | 84.5 | 83.7 | 84.1 | 9.1 | 16
YOLOv11n | 75.1 | 74.9 | 77.0 | 2.6 | 36
SFE-YOLO | 90.4 | 89.2 | 87.9 | 3.7 | 17
To further demonstrate the advantages of the SFE-YOLO algorithm and of the pre-training strategy using ADA-StyleGAN3-generated data, we conducted comparative experiments combining the two. The experimental results are shown in Table 4.
As shown in Table 4, given the scarcity of sonar image data, the strategy of first pre-training on generated data and then conducting transfer training on real data is effective, improving the final model's performance for both the baseline YOLOv11n and SFE-YOLO networks.
For YOLOv11n, this strategy boosts accuracy by 1.1%, recall by 0.7%, and mAP50 by 0.2%. For SFE-YOLO, accuracy increases by 1.6%, recall by 1.1%, and mAP50 by 1.8%. These results confirm that pre-training on synthetic data and then transfer training on real data reduces model generalization error. It also decreases the model’s reliance on large amounts of real data and improves the recognition accuracy of models trained on small-sample sonar data.