1. Introduction
With the continuous growth of marine resource exploration and underwater environmental monitoring, acoustic target detection, as a core approach to underwater perception, has become increasingly important in deep-sea exploration. Whereas optical and electromagnetic waves attenuate severely in water, sonar systems exploit the propagation characteristics of sound in the medium, combining high acoustic wave velocity with sub-decibel attenuation over kilometer-scale distances, and have therefore become an irreplaceable technology for underwater target detection [1]. However, in practical scenarios, underwater acoustic channels are strongly affected by coupled multi-physical-field effects [2]: marine ambient noise, narrowband harmonic noise induced by platform mechanical vibrations, and multipath reverberation caused by seafloor/seawater interface reflections. These composite disturbances produce characteristic degradations in sonar images: joint spatio-temporal deterioration of the signal-to-noise ratio, geometric distortion caused by beamforming anomalies, and broadening of the edge spread function at target scatterers. Together, these effects severely constrain feature extraction and classification accuracy for underwater targets.
In recent years, systematic academic research has produced significant progress on the technical challenges of underwater acoustic target detection. Roe et al. [3] developed a methodology employing Support Vector Machines (SVMs) with an integrated highlight-shadow detection framework, advancing feature interpretation in synthetic aperture sonar (SAS) imagery. Lopera et al. [4] proposed a cascade-based solution for identifying man-made linear objects (MLOs) on the seabed through multi-scale feature fusion: joint noise suppression and edge enhancement via anisotropic diffusion filtering, coupled with precise highlight/shadow segmentation using fuzzy morphological operators. At the feature engineering level, they constructed a 32-dimensional hybrid feature vector incorporating morphological topology parameters, regional statistics, contour moment descriptors, and minimum enclosing ellipse features, and integrated it with a Bayesian probabilistic classifier using Markov Chain Monte Carlo (MCMC) methods, demonstrating a substantial improvement in average classification accuracy on both field-collected and simulated datasets. Fakiris et al. [5] pioneered the application of Independent Component Analysis (ICA) to side-scan sonar (SSS) target detection, establishing a feature decoupling mechanism based on blind source separation that effectively suppresses seabed reverberation; their framework achieved a 77% recall rate for automatic identification of small targets (diameter < 0.5 m) in complex seabed sediment environments.
In the field of sonar image target detection, traditional computer vision methods based on coupled feature extractors and classifiers historically dominated. However, as marine observation scenarios grow increasingly heterogeneous (e.g., mixed seabed substrates and dynamic hydroacoustic perturbations), the inherent constraints of manual feature engineering, particularly its limited target representation capacity and generalization performance, have become apparent. A pivotal transition occurred in 2012, when AlexNet achieved breakthrough performance on the ImageNet Large Scale Visual Recognition Challenge with its deep convolutional neural network (CNN) architecture, propelling a paradigm shift toward deep learning in underwater optical and acoustic image interpretation. This approach uses end-to-end training to autonomously extract hierarchical abstract features from large-scale annotated datasets, overcoming the intrinsic limitations of conventional methods, including subjective feature design and weak environmental adaptability.
At present, deep learning-based target detection follows three paradigms: one-stage, two-stage, and transformer-based detection. Two-stage models first extract candidate boxes to identify regions of interest (RoIs) and then perform recognition and localization; they are accurate but slow, making them ill-suited to real-time detection. This limitation drove the development of one-stage algorithms such as YOLO, whose accuracy on public benchmarks has gradually approached that of two-stage algorithms as the family has evolved.
YOLO has been widely adopted across many fields, including sonar image target detection. Zheng Linhan's ScEMA-YOLOv8 model for underwater sonar target detection uses an EMA attention mechanism and an SPPFCSPC pooling module to better extract features from blurred targets, and adds detection layers and residual connections to improve the detection and localization of small targets; however, it is not optimized for targets with scarce features, and simply adding a detection layer is not enough to handle scenes with drastic changes in target scale [6]. Xie Guohao and Chen Zhe's DA-YOLOv7 model for underwater sonar image target recognition introduces modules such as an omnidirectional convolutional channel prior convolutional attention efficient layer aggregation network, spatial pyramid pooling channel shuffle, and a ghost shuffle convolutional enhanced layer aggregation network, which reduce computational load and improve the capture of local features and critical information; it nonetheless lacks optimization for strong noise interference in sonar images and for small-target detection [7]. Meng Junxia's team [8] used CycleGAN for data augmentation and integrated a global attention mechanism into the feature extraction stage of YOLOv8, achieving some engineering success.
Based on the above-mentioned challenges in current sonar target detection and the strategies used by previous researchers, this study carries out the following work:
- (1) The experiments collected a sonar dataset with two target categories. To address data scarcity, ADA-StyleGAN3, which excels at generating high-quality sonar images, was introduced. The implementation had two stages: first, pre-training models on high-fidelity synthetic data from ADA-StyleGAN3; then, fine-tuning the network parameters on real underwater data. Experiments showed that this approach effectively improved detection accuracy.
- (2) FASFFHead, a detection head with four scales, is proposed. It uses ASFF for feature fusion, forming a cross-scale adaptive weighting mechanism, and its differentiable spatial pyramid pooling for multi-level feature integration better preserves the semantic information of tiny targets.
- (3) The Edge-Contour Attention Mechanism (EEA) proposed in this study incorporates a dual-branch gradient guidance architecture that substantially enhances gradient responses at target edges through differentiated gradient information processing. It integrates tightly with multi-scale deformable convolution, enabling adaptive focusing on critical contour regions while effectively suppressing boundary reverberation noise.
- (4) Through structural reorganization and optimization of the SC_Conv convolution module, this study constructs a cross-module collaborative architecture that integrates multi-level feature interaction and fusion with the C3K2 module, forming a feature extraction unit with a dual-path compression mechanism. The jointly optimized architecture achieves high-fidelity feature representation via a channel-spatial co-attention mechanism, suppressing inter-channel correlation redundancy and spatial information redundancy while improving feature extraction precision.
- (5) Focaler-IoU was adopted to replace the original loss function. By focusing on different regression samples, it enhances detection performance across tasks and addresses shortcomings of existing bounding box regression methods for various objects, improving detection in different scenarios (see the sketch after this list).
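As a reference for item (5), below is a minimal sketch of the Focaler-IoU remapping; the interval bounds d and u are tunable, and the values shown are illustrative defaults rather than the settings used in this study.

```python
import torch

def focaler_iou(iou: torch.Tensor, d: float = 0.0, u: float = 0.95) -> torch.Tensor:
    """Focaler-IoU remapping: IoU values below d are clipped to 0, values
    above u to 1, and values in [d, u] are rescaled linearly, so the
    regression loss focuses on a chosen difficulty range of samples."""
    return ((iou - d) / (u - d)).clamp(min=0.0, max=1.0)

# The bounding box regression loss then uses 1 - focaler_iou(iou)
# in place of the plain 1 - iou term.
```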
2. Data
2.1. Image Dataset Collection
The experiment was carried out in a lake in Wuhan, China, with complex bottom terrain, a depth of 5–15 m, and a muddy bottom that effectively prevents target burial. Two typical underwater targets were deployed: a suspended one and a bottom-resting one. The suspended target is a hollow sphere, 1 m in diameter, with holes. The bottom-resting target is a cylinder with an oval head, 2 m long and 0.5 m in diameter. A synthetic aperture sonar operating at 220–260 kHz with a 3 cm × 2 cm imaging resolution was used for target detection and recognition.
Figure 1 presents a sonar image. The left and right sections show acoustic wave detection and imaging. The central shaded area indicates signal loss due to an unreceived bottom echo. Three red-marked regions show bottom-resting targets.
Figure 2 shows a suspended-target sonar image with one red-marked suspended target.
In the experiment, the sonar collected single-channel data. The collected sonar images were converted to pseudo-color to enhance visual interpretation, highlight details, and distinguish different intensity ranges. After rigorous classification and filtering, a dataset of 784 raw sonar images (3600 × 1600 resolution) was built, consisting of 421 images containing 545 bottom-resting targets, 366 images containing 397 suspended targets, and 30 images without any targets.
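A minimal sketch of the pseudo-color conversion step is shown below, assuming the raw single-channel intensities are min-max normalized and mapped through an OpenCV colormap; the specific colormap used in this work is not stated, so COLORMAP_JET here is purely illustrative.

```python
import cv2
import numpy as np

def to_pseudocolor(gray: np.ndarray) -> np.ndarray:
    """Map a single-channel sonar intensity image to a 3-channel
    pseudo-color image so that different intensity ranges are
    visually distinguishable."""
    # Rescale to 8-bit before applying the colormap.
    gray8 = cv2.normalize(gray, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.applyColorMap(gray8, cv2.COLORMAP_JET)
```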
2.2. Image Dataset Generation
Faced with insufficient training data (only 784 original sonar images), this study employs data augmentation to expand the dataset. Traditional image augmentation methods include geometric transformations such as flipping, rotation, scaling, and translation; color transformations such as brightness and contrast adjustment; and adding noise or applying filtering (a simple pipeline is sketched after this paragraph). However, these conventional approaches expand data only through linear transformations or simple combinations and cannot generate novel features or semantic content beyond the original data distribution. They may even compromise image authenticity (e.g., salt-and-pepper noise and excessive sharpening) and cause models to learn invalid features. Compared to such methods, Generative Adversarial Networks (GANs) [9,10,11] offer distinct advantages: through unsupervised adversarial optimization, GANs adapt particularly well to the characteristics of sonar imagery, namely low-resolution textures, high-noise environments, and specialized underwater terrain features. Although existing studies have explored multi-scale generation networks [12] and simulation-integrated synthesis methods [13], three critical challenges persist. First, sonar image generation must integrate physical propagation characteristics: traditional DCGAN struggles to reproduce realistic noise patterns, Pix2Pix [14,15] requires impractically large paired datasets, and CycleGAN [16,17,18] lacks physical modeling of acoustic wave propagation. Second, multi-dimensional domain gaps exist between synthetic and real data, spanning background features, sensor parameters, and material properties, which compromise model generalization at deployment. Third, current generative models fail to balance semantic consistency with physical fidelity.
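For concreteness, the traditional augmentations listed above can be expressed as a simple pipeline; the torchvision parameters below are illustrative only and are not settings evaluated in this study.

```python
from torchvision import transforms

# Classic geometric and color augmentations; each operation only
# re-samples the existing data distribution.
traditional_aug = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])
```

Such pipelines cannot synthesize genuinely new target appearances, which is what motivates the GAN-based route described next.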
To address these challenges, we introduce the ADA-StyleGAN3 method. ADA-StyleGAN3, proposed by Li Liang's team, is a few-shot sonar image generation network that integrates an adaptive discriminator augmentation (ADA) strategy into StyleGAN3 [19]. Compared to conventional StyleGAN3, this approach resolves the leakage of traditional data augmentation into the generator, ensuring high stability during dataset generation under limited samples while effectively preventing overfitting.
Despite these efforts, generative networks such as GANs, which synthesize complex scenes autonomously by learning data distributions, inevitably exhibit distribution discrepancies between synthetic and real data. These discrepancies manifest in background, lighting, perspective, and other contextual variations. When a model is trained on both real and synthetic data and tested on real data, such differences degrade the detector's generalization and introduce generalization error. Minimizing the gap between the model's error on synthetic data (the source domain) and on real data (the target domain) is therefore crucial. To this end, this study adopts a transfer learning approach based on model parameter knowledge transfer, aiming to reduce generalization error and the model's reliance on extensive real-world data. Specifically, the ADA-StyleGAN3 generative network synthesizes large-scale approximate data for pre-training under the limited sample size of the sonar dataset, and transfer training is then conducted on the pre-trained model using the original real dataset to enhance generalization.
Building on this framework, this study employs ADA-StyleGAN3 to synthesize sonar images. The workstation runs Ubuntu 18.04.6, with Python as the programming language and PyTorch (v1.10.0) used to deploy and train the deep learning network, supported by two NVIDIA A30 GPUs with 24 GB of memory each. Training uses the stylegan3-t configuration with the Adam optimizer; the generator's initial learning rate is 0.003 and the discriminator's is 0.0015. The batch size is 16, the discriminator is trained for 1000 kimg, and the ADA target value is 0.4. Finally, 900 synthetic samples with ground-truth labels were obtained after a rigorous manual screening and labeling process: 450 images each of suspended and bottom-resting targets, with one target per image.
4. Experiment Results and Analysis
4.1. Experimental Results
In this experiment, the sonar image dataset was collected and organized from tests on Mulan Lake in Wuhan, comprising two categories: bottom-resting and suspended targets. It contains 544 training images and 120 images each for validation and testing. The experiments ran on Windows 11 with an NVIDIA RTX 2000 Ada Generation Laptop GPU (24 GB VRAM), using the PyTorch 2.0.0+cu118 deep learning framework and Python 3.10.
First, ADA-StyleGAN3 was used to generate 450 images for each target type, and these were used to pre-train the SFE-YOLO network for 100 epochs with an initial learning rate of 0.01 and an SGD momentum of 0.937. The real dataset was then used to fine-tune the pre-trained weights for 150 epochs with the same parameters. The specific experimental environment and configuration are shown in Table 1.
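As a sketch of this two-stage schedule, the Ultralytics training API could be driven as follows; SFE-YOLO is a custom network, so the model definition file and dataset YAML names here are hypothetical placeholders, not artifacts shipped with this paper.

```python
from ultralytics import YOLO

# Stage 1: pre-train on the ADA-StyleGAN3 synthetic images.
model = YOLO("sfe-yolo.yaml")  # hypothetical SFE-YOLO model definition
model.train(data="synthetic_sonar.yaml", epochs=100,
            optimizer="SGD", lr0=0.01, momentum=0.937)

# Stage 2: fine-tune the pre-trained weights on the real sonar dataset.
model = YOLO("runs/detect/train/weights/best.pt")
model.train(data="real_sonar.yaml", epochs=150,
            optimizer="SGD", lr0=0.01, momentum=0.937)
```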
The model was evaluated using precision, recall, the F1 score, average precision (AP), mean average precision (mAP), and parameter count. Precision quantifies the proportion of true-positive predictions among all positive predictions. After setting an IoU threshold, predicted bounding boxes are matched with ground-truth boxes: a prediction is a true positive (TP) if it meets the IoU threshold and the predicted category is correct; a prediction of an object not present in the ground truth, or with the wrong category, is a false positive (FP); and a ground-truth object the model fails to detect is a false negative (FN). Precision is defined in Equation (14). Recall quantifies the proportion of correctly identified positive instances relative to the total number of actual positive samples; a higher recall indicates that the model detects a larger fraction of positive samples, demonstrating better detection performance. Its calculation is formally defined in Equation (15).
The F1 score is a critical metric in classification tasks, serving as a comprehensive evaluation of a model’s precision and completeness in detecting all targets. It harmonizes precision (accuracy of positive predictions) and recall (ability to identify all relevant instances), with values bounded between 0 (worst) and 1 (optimal). The calculation method is formally defined in Equation (16). Average Precision (AP) evaluates the model’s performance for individual categories, while mean Average Precision (mAP) assesses its overall effectiveness across all categories. The mAP is computed as the arithmetic mean of AP values over all categories. Their calculation methods are formally defined in Equations (17) and (18), respectively. Finally, FPS measures the detection speed of algorithms. The larger the FPS value, the faster the model detection speed and the better the real-time performance. The calculation method is shown in Equation (19).
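Equations (14)–(19) are not reproduced in this section; under the usual conventions, these metrics take the following standard forms (restated here for reference, which the paper's equations are expected to match):

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F1 = \frac{2 \cdot P \cdot R}{P + R},
```
```latex
AP = \int_0^1 P(R)\, dR, \qquad
mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i, \qquad
FPS = \frac{N_{\mathrm{frames}}}{T_{\mathrm{total}}}.
```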
The proposed SFE-YOLO model was evaluated on the self-collected dataset to assess its performance. After 150 training epochs, the model converged. As illustrated in Figure 12, box_loss quantifies the discrepancy between predicted and ground-truth bounding boxes, with lower values indicating higher detection accuracy; cls_loss measures the divergence between predicted and ground-truth class labels, where lower values reflect improved classification precision; and dfl_loss formulates continuous coordinate prediction as discrete probability distribution prediction, enabling more accurate coordinate localization, with lower values signifying enhanced prediction fidelity.
As shown in Figure 13, in the SFE-YOLO model, C denotes bottom-resting targets (objects settled on the bottom) and F denotes suspended targets (objects floating in the water column without settling). In the YOLOv11n network, Class 1 designates bottom-resting targets and Class 2 suspended targets. Comparing the left and right subfigures, SFE-YOLO demonstrates substantial improvements in precision and completeness under full-target detection conditions relative to the baseline model, and its performance curves are smoother, reflecting superior stability and robustness in dynamic environments.
As illustrated in Figures 14–16, the comparison between SFE-YOLO and the baseline YOLOv11n across three critical metrics (precision, recall, and mAP50) demonstrates that the enhanced model achieves substantial performance improvements in all evaluated dimensions. These results validate the effectiveness of the proposed architectural modifications in advancing both detection robustness and generalizability.
The final experimental results are visualized in Figure 17. The first column shows the original images, the second column the baseline detection results, and the third column the SFE-YOLO detection results. The first row shows that the baseline YOLOv11n produces substantially more false alarms for bottom-resting targets than SFE-YOLO. The second and third rows indicate that the baseline underperforms the improved model in detecting feature-ambiguous bottom-resting targets and yields less confident predictions for suspended targets. Notably, the fourth row highlights the enhanced model's superior localization accuracy.
4.2. Ablation Experiment
This paper uses YOLOv11n as the baseline for ablation experiments on the improved algorithm (all experiments are performed on the model pre-trained with the ADA-StyleGAN3-generated dataset), verifying the effectiveness of each improvement for sonar target detection by sequentially adding the individual modifications to the model.
First, as shown in Table 2, replacing the original detection head with the four-head FASFF detection head increases GFLOPs significantly and the parameter count by about 1.4 M, while Pre, R, and mAP50 increase by only 0.2%, 2.8%, and 0.4%, respectively. This is because sonar target detection is not simply small-scale target detection: sonar images suffer from low resolution and strong interference, and relying solely on FASFF's feature fusion and small-target detection head adds computational complexity that can harm performance. FASFFHead is nevertheless crucial. Used alone, its gain is modest relative to the added computation, but combined with our network it significantly enhances feature fusion and detection stability; as experiments 6 and 8 show, FASFFHead brings a qualitative improvement. Moreover, Table 3 indicates that models with more parameters than ours, but without FASFFHead, do not match its performance.
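To make the cross-scale adaptive weighting concrete, a generic ASFF-style fusion can be sketched as below. This is a simplified illustration assuming the three input feature maps have already been resized and projected to a common shape; it is not the exact FASFFHead implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASFFFusion(nn.Module):
    """Generic ASFF-style fusion of three feature maps of shape
    (B, C, H, W); the module learns per-pixel fusion weights."""
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convs produce one weight-logit map per input level.
        self.weight_convs = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in range(3)]
        )

    def forward(self, x0, x1, x2):
        # Stack the per-level weight logits: (B, 3, H, W).
        logits = torch.cat(
            [w(x) for w, x in zip(self.weight_convs, (x0, x1, x2))], dim=1
        )
        a = F.softmax(logits, dim=1)  # weights sum to 1 at every pixel
        return a[:, 0:1] * x0 + a[:, 1:2] * x1 + a[:, 2:3] * x2
```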
Second, when the EEA attention mechanism or the C3K2_Sc module is added separately, the parameter count decreases relative to the YOLOv11n baseline while Pre, R, and mAP50 improve significantly. This indicates that both modules reduce data redundancy and show greater sensitivity to sonar targets and stronger feature extraction for sonar images. When the two improvements are added simultaneously, groups 5, 6, and 7 all perform better, with Pre and mAP50 above 80% in each group; notably, the precision of group 7 reaches 88.4%, 12.2% higher than the baseline.
Finally, integrating the four improvements into the SFE-YOLO model yields a final precision of 92%, a recall of 90.3%, and an mAP50 of 89.7%, 12.7% higher than the baseline. This demonstrates the superior performance of SFE-YOLO in sonar target detection. In conclusion, the improved network achieves significant gains on all evaluation metrics while keeping the parameter count and computational cost within a practical, controllable range.
4.3. Comparative Experiment
For comparative analysis, this study evaluates multiple object detection networks, including Faster R-CNN, YOLOv11s, YOLOv7-tiny, YOLOv11n, YOLOv5s, and YOLOv8n, using precision (P), recall (R), mAP@0.5, parameter count, and FPS as evaluation metrics. The performance of the improved YOLOv11 network and the baseline models on the test set is summarized in Table 3 (all results in Table 3 were obtained by training directly on real data). Faster R-CNN, a two-stage algorithm, has a larger parameter size and computational load than the one-stage algorithms, resulting in slower detection; moreover, on a dataset dominated by small objects, its mAP@0.5 is 5.2% lower than that of the algorithm presented in this paper. Compared with the lightweight detectors YOLOv11n and YOLOv8n, our algorithm shows a slight increase in parameter size but a significant improvement on all detection metrics. Against the heavier YOLOv11s and YOLOv5s, it has roughly one-third of their parameter size yet achieves higher precision, recall, and FPS. SFE-YOLO thus surpasses the other models on multiple metrics while maintaining a relatively small parameter size, and with an FPS of 17 it satisfies general real-time detection requirements. The SFE-YOLO algorithm therefore enhances small-target detection accuracy in sonar images while effectively balancing detection accuracy and speed.
Table 3. Evaluation metric comparison of SFE-YOLO and other models.
Method | Pre/% | R/% | mAP50/% | Para/(M) | FPS
---|---|---|---|---|---
Faster R-CNN | 83.2 | 82.1 | 82.7 | 41.0 | 4
YOLOv11s | 85.3 | 86.3 | 88.1 | 9.4 | 15
YOLOv7-tiny | 81.3 | 80.5 | 80.9 | 6.1 | 22
YOLOv8n | 78.2 | 77.5 | 77.8 | 3.0 | 38
YOLOv5s | 84.5 | 83.7 | 84.1 | 9.1 | 16
YOLOv11n | 75.1 | 74.9 | 77.0 | 2.6 | 36
SFE-YOLO | 90.4 | 89.2 | 87.9 | 3.7 | 17
To further demonstrate the advantages of the SFE-YOLO algorithm and of the pre-training strategy using ADA-StyleGAN3-generated data, we conducted comparative experiments combining the two. The experimental results are shown in Table 4.
As shown in Table 4, given the scarcity of sonar image data, the strategy of first pre-training on generated data and then conducting transfer training on real data is effective, improving the final model's performance for both the baseline YOLOv11n and SFE-YOLO networks.
For YOLOv11n, this strategy boosts accuracy by 1.1%, recall by 0.7%, and mAP50 by 0.2%. For SFE-YOLO, accuracy increases by 1.6%, recall by 1.1%, and mAP50 by 1.8%. These results confirm that pre-training on synthetic data and then transfer training on real data reduces model generalization error. It also decreases the model’s reliance on large amounts of real data and improves the recognition accuracy of models trained on small-sample sonar data.