Research on Improved YOLO11 for Detecting Small Targets in Sonar Images Based on Data Enhancement
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
A small target detection model, SFE-YOLO, was developed for sonar images under conditions of sparse target features and strong noise interference. The architecture extends YOLOv11 with four innovations: a four-head adaptive spatial feature fusion detection head (FASFFHead), an edge-contour enhancement attention (EEA) block, a dual-path feature extractor, and a new bounding box regression loss function. ADA-StyleGAN3 was employed to generate simulated sonar data for pre-training, followed by transfer learning on real data. Experimental verification was conducted on in-house-collected sonar data, with performance evaluated by precision, recall, and mAP50.
Redundancy of content between sections is noted, particularly when elaborating on modules and experiment results.
The abstract is over-descriptive with weak focus and presentation.
Technical novelty is overemphasized; some modules are re-implementations of existing methods without a clear justification of what is innovative.
Model inference speed and energy efficiency, which are critical to supporting claims of embedded deployment, are not compared.
The performance improvement (e.g., ablation study) in certain modules is small relative to the added computational burden.
Not much is mentioned regarding the generation of the synthetic dataset, e.g., the GAN training configuration or the diversity of the generated samples.
The introduction and related work sections are dense and not critically synthesized to make the gap in the paper's contribution evident.
The proposed scheme is not compared with any related work, especially with other improved YOLO11 schemes.
The resolution of the figures is poor.
Many figures are not mentioned in the text with a figure number. The majority of the figures are not accompanied by clear captions or descriptive remarks, rendering them hard for readers to interpret.
English needs extensive revision. Some statements are overly lengthy.
Author Response
Thank you very much for taking the time to review this manuscript. Here is my response to your feedback. Please refer to the highlighted section in the attachment for specific improvements.
Comments 1: Redundancy of content between sections is noted, particularly when elaborating on modules and experiment results.
Response 1: Thank you for pointing this out. We agree with this comment. We have removed the redundant parts from the text. For the specific revisions, please refer to the highlighted parts in Sections 3.2 and 4 of the attachment.
Comments 2: The abstract is over-descriptive with weak focus and presentation.
Response 2: Thank you for pointing this out. We agree with this comment. We have re-edited the abstract: Existing sonar target detection methods suffer from low efficiency and accuracy due to sparse target features and significant noise interference in sonar images. To address this, we introduce SFE-YOLO, an improved model based on YOLOv11. We replace the original detection head with a FASFFHead module that enables adaptive spatial feature fusion. An EEA module is designed to direct the model's attention to the intrinsic contour information of targets. We also enhance SC_Conv convolution and integrate it into C3K2 to improve detection stability and reduce information redundancy. Additionally, Focaler-IOU is introduced to boost the accuracy of multi-category target bounding box regression. Lastly, we employ a hybrid training strategy that combines pre-training on ADA-StyleGAN3-generated data with transfer learning on real data to alleviate the problem of insufficient training samples. Experiments show that, compared to the baseline YOLOv11n, the improved model's precision and recall increase to 92% and 90.3%, respectively, and mAP50 rises by 12.7 percentage points, demonstrating the effectiveness of the SFE-YOLO network and its transfer learning strategy in tackling the challenges of sparse small-target features and strong noise interference in sonar images.
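For reference, the Focaler-IOU mentioned in the revised abstract follows the published Focaler-IoU formulation, which remaps the IoU term over a linear interval so that training concentrates on a chosen difficulty band of samples. Below is a minimal PyTorch sketch of that mapping, not the authors' exact integration into YOLOv11; the `d` and `u` values shown are illustrative hyperparameters, not the authors' tuned settings.

```python
import torch

def focaler_iou_loss(iou: torch.Tensor, d: float = 0.0, u: float = 0.95) -> torch.Tensor:
    """Sketch of the Focaler-IoU reconstruction.

    The IoU of each predicted/ground-truth box pair is remapped linearly
    over the interval [d, u]:

        IoU_focaler = clamp((IoU - d) / (u - d), 0, 1)

    `d` and `u` are hyperparameters; the defaults here are illustrative.
    """
    iou_focaler = ((iou - d) / (u - d)).clamp(min=0.0, max=1.0)
    return 1.0 - iou_focaler  # L_Focaler-IoU = 1 - IoU_focaler
```

In the published formulation this mapping is also composed with existing regression losses (for example, L_Focaler-GIoU = L_GIoU + IoU - IoU_focaler), which is how it would typically be attached to a YOLO-style box head.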
Comments 3: Technical novelty is overemphasized; some modules are re-implementations of existing methods without a clear justification of what is innovative.
Response 3: Thank you for pointing this out. We agree with this comment. We have improved the relevant descriptions and supplemented the corresponding arguments. For details, please refer to the highlighted parts in Section 3.2 of the attachment.
Comments 4: Model inference speed and energy efficiency, which are critical to supporting claims of embedded deployment, are not compared.
Response 4: Thank you for pointing this out. We agree with this comment. We have carried out the relevant experimental supplementation and explanation. For details, please refer to Table 2 in Section 4.2 and Table 3 in Section 4.3 of the attachment, as well as the highlighted explanatory parts in Section 4.3.
Comments 5: The performance improvement (e.g., ablation study) in certain modules is small relative to the added computational burden.
Response 5: Thank you for pointing this out. We have provided an explanation of this phenomenon. For the FASFF four-head detection head, the ablation results show that GFLOPs increase significantly and the parameter count grows by about 1.4M, while precision, recall, and mAP50 increase by only 0.2%, 2.8%, and 0.4%, respectively. This is because sonar target detection is not simply small-scale target detection: sonar images suffer from low resolution and strong interference, so relying solely on FASFF's feature fusion and its small-target detection head can add computational complexity that harms performance. FASFFHead is nevertheless crucial. Used alone, its gain is modest relative to the added computation, but combined with the rest of our network it significantly enhances feature fusion and detection stability; as shown in Experiments 6 and 8, using it brings a qualitative improvement. Table 3 also indicates that models with more parameters than ours, but without FASFFHead, do not match its performance. [See the highlighted part in the second paragraph of Section 4.2, on page 20 of the attachment.]
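Since the FASFFHead details live in the attachment rather than in this exchange, the following minimal PyTorch sketch illustrates the general adaptive spatial feature fusion idea the response relies on: each level contributes a learned per-pixel weight map, the maps are softmax-normalized across levels, and the features are summed. The class name, channel handling, and number of levels are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASFFFusion(nn.Module):
    """Minimal adaptive spatial feature fusion over N feature levels.

    Assumes the inputs have already been resized/projected to a common
    (C, H, W) shape; real ASFF/FASFF heads also handle the rescaling.
    """

    def __init__(self, channels: int, num_levels: int = 4):
        super().__init__()
        # One 1x1 conv per level produces a single-channel weight map.
        self.weight_convs = nn.ModuleList(
            nn.Conv2d(channels, 1, kernel_size=1) for _ in range(num_levels)
        )

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # (B, num_levels, H, W): one spatial weight map per level.
        logits = torch.cat(
            [conv(f) for conv, f in zip(self.weight_convs, feats)], dim=1
        )
        weights = F.softmax(logits, dim=1)   # weights sum to 1 at every pixel
        stacked = torch.stack(feats, dim=1)  # (B, num_levels, C, H, W)
        return (weights.unsqueeze(2) * stacked).sum(dim=1)
```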
Comments 6: Not much is mentioned regarding the generation of the synthetic dataset, e.g., the GAN training configuration or the diversity of the generated samples.
Response 6: Thank you for pointing this out. We agree with this comment. We have added relevant explanations in the text: Building on this framework, this study employs ADA-StyleGAN3 to synthesize sonar images. Ubuntu 18.04.6 is used as the workstation's operating system, with Python as the programming language. PyTorch is used to deploy and train the deep learning network, supported by two Nvidia A30 GPUs with 24 GB of memory each. Training uses the stylegan3-t model with the Adam optimizer; the generator's initial learning rate is 0.003 and the discriminator's is 0.0015, the batch size is set to 16, and the discriminator is trained for 1000 kimg. The is 0.4. Finally, after a rigorous manual screening and labeling process, 900 synthetic data samples with ground-truth labels were obtained: 450 images of suspended targets and 450 of bottom-resting targets, with one target per image. [The highlighted part on page 6 of the manuscript.]
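For reproducibility, the stated configuration maps onto the official NVlabs/stylegan3 `train.py` options roughly as sketched below. This launch script is an assumption: the paths are placeholders, `--gamma` is required by the repository but not given in the response (the placeholder value is not from the authors), and the hyperparameter reported only as "0.4" above is omitted because the response does not name the flag it corresponds to.

```python
import subprocess

# Hypothetical launch of ADA-StyleGAN3 training matching the stated
# configuration: stylegan3-t, two GPUs, batch 16, G lr 0.003, D lr 0.0015,
# 1000 kimg. Adam is the repository's built-in optimizer.
subprocess.run([
    "python", "train.py",
    "--outdir=runs/sonar-ada",    # placeholder output directory
    "--cfg=stylegan3-t",          # model variant stated in the response
    "--data=datasets/sonar.zip",  # placeholder dataset archive
    "--gpus=2",                   # two Nvidia A30s
    "--batch=16",
    "--kimg=1000",
    "--glr=0.003",                # generator learning rate
    "--dlr=0.0015",               # discriminator learning rate
    "--aug=ada",                  # adaptive discriminator augmentation
    "--gamma=8.0",                # R1 weight: placeholder, not from the response
], check=True)
```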
Comments 7: The introduction and related work sections are dense and not critically synthesized to make the gap in the paper's contribution evident. The proposed scheme is not compared with any related work, especially with other improved YOLO11 schemes.
Response 7: Thank you for pointing this out. We agree with this comment. We have added the relevant arguments and explanations. YOLO is being applied across various fields, including sonar image target detection. Zheng Linhan's ScEMA-YOLOv8 model for underwater sonar target detection uses an EMA attention mechanism and an SPPFCSPC pooling module to better extract features from blurred targets, and adds detection layers and residual connections to improve the detection and positioning of small targets. However, it is not optimized for targets with scarce features, and simply adding a detection layer is not enough to handle scenes with drastic changes in target scale [6]. Xie Guohao and Chen Zhe's DA-YOLOv7 model for underwater sonar image target recognition features innovative modules such as an omnidirectional convolutional channel prior convolutional attention efficient layer aggregation network, spatial pyramid pooling channel shuffle, and a ghost shuffle convolutional enhanced layer aggregation network; these reduce computational load and improve the capture of local features and critical information. However, it still lacks optimization for strong noise interference in sonar images and for detecting small targets [7]. Meng Junxia's team [8] used CycleGAN for data augmentation and integrated a global attention mechanism into the feature extraction phase of YOLOv8, achieving some engineering success. Based on the above-mentioned challenges in current sonar target detection and the strategies used by previous researchers, this study carries out the following work. [See the highlighted parts on pages 2 and 3 of the manuscript.]
Comments 8: The resolution of the figures is poor.
Response 8: Thank you for pointing this out. We have adjusted all the figures and tables in the manuscript to make them clearer. For details, please refer to the attachment.
Comments 9: Many figures are not mentioned in the text with a figure number. The majority of the figures are not accompanied by clear captions or descriptive remarks, rendering them hard for readers to interpret.
Response 9: Thank you for pointing this out. We have adjusted all the figures and tables in the manuscript. For details, please refer to the attachment.
Comments 10: English needs extensive revision. Some statements are overly lengthy.
Response 10: Thank you for pointing this out. We have tried our best to delete the redundant parts.
Reviewer 2 Report
Comments and Suggestions for Authors
This paper proposes a method for detecting small objects in sonar images based on an improved YOLO11.
The proposed method uses:
- ADA-StyleGAN3 to generate high-quality acoustic images and address the problem of insufficient data,
- FASFFHead (a Four-head Adaptive Spatial Feature Fusion detection head) to improve the detection accuracy of small targets by enhancing the fusion of multi-scale features,
- Edge-Contour Attention Mechanism (EEA) to enhance edge features in sonar images (an illustrative sketch follows this list),
- Spatial and Channel Reconstruction Convolution in the C3K2 module to reduce the redundant features generated when background regions look similar, and
- Focaler-IOU, a newly introduced loss function for bounding box regression.
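The EEA module's internals are described only in the manuscript itself; purely to illustrate what an edge-contour attention gate can look like, here is one hypothetical PyTorch realization that re-weights features with a fixed Sobel edge map. This is an assumption for illustration, not the authors' design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SobelEdgeAttention(nn.Module):
    """Hypothetical edge-contour attention gate (NOT the paper's EEA).

    Extracts a Sobel edge-magnitude map from the channel-averaged
    features and uses it, after a 1x1 conv and sigmoid, to re-weight
    the input, emphasizing contour regions.
    """

    def __init__(self):
        super().__init__()
        gx = torch.tensor([[-1.0, 0.0, 1.0],
                           [-2.0, 0.0, 2.0],
                           [-1.0, 0.0, 1.0]])
        # Fixed Sobel kernels, registered as buffers so they follow the module's device.
        self.register_buffer("kx", gx.view(1, 1, 3, 3))
        self.register_buffer("ky", gx.t().contiguous().view(1, 1, 3, 3))
        self.gate = nn.Conv2d(1, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = x.mean(dim=1, keepdim=True)              # (B, 1, H, W) channel average
        ex = F.conv2d(g, self.kx, padding=1)         # horizontal gradient
        ey = F.conv2d(g, self.ky, padding=1)         # vertical gradient
        edge = torch.sqrt(ex ** 2 + ey ** 2 + 1e-6)  # edge magnitude
        return x * torch.sigmoid(self.gate(edge))    # contour-weighted features
```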
The reviewer has some concerns about the manuscript, which are described below.
- A more detailed explanation of the sonar images is required. From the present manuscript, it is difficult to understand that sonar images are of much lower quality than ordinary digital images.
- It is very difficult to understand Figure 1 and Figure 2. Why is it a mysterious reddish color?
- What are the original vertical and horizontal sizes of the sonar images? Density resolution? Single channel or multi-channel?
- In the experiment, the input size is 640x640 compared to the unknown original size of the sonar image. Does the target object still exist in the resized image? Does it not disappear due to resizing?
- The authors got a total of 784 raw sonar images, but how many are in each category?
- The font on the figures is too small and difficult to see when printed.
- Figure 8, the position of the figure is strange.
- Section 3.2.2, second line from the top: what is "details26"?
- Section 3.2.2, fifth line from the top: what is "datasets28"?
Author Response
Thank you very much for taking the time to review this manuscript. Here is my response to your feedback. Please refer to the highlighted section in the attachment for specific improvements.
Comments 1: A more detailed explanation of the sonar images is required. From the present manuscript, it is difficult to understand that sonar images are of much lower quality than ordinary digital images.
Response 1: Thank you for pointing this out. We agree with this comment. In the manuscript we had compressed the sonar images, which reduced their quality. We have addressed this in Sections 2.1 and 4.1 and provided further explanation of the sonar images: Figure 1 presents a sonar image. The left and right sections show acoustic-wave detection and imaging, and the central shaded area indicates signal loss where the bottom echo was not received. Three red-marked regions show bottom-resting targets. Figure 2 is a sonar image of a suspended target, with one suspended target marked in red. [Page 4 of the manuscript, third line from the top.]
Comments 2: It is very difficult to understand Figure 1 and Figure 2. Why is it a mysterious reddish color?
Response 2: Thank you for pointing this out. Each of Figure 1 and Figure 2 consists of two images: the sonar image and a magnified view of the target. The reddish appearance comes from the pseudo-color rendering described in Response 3 below. We have adjusted the formatting to enhance readability.
Comments 3: What are the original vertical and horizontal sizes of the sonar images? Density resolution? Single channel or multi-channel?
Response 3: Thank you for pointing this out. In the experiment, the sonar collected single-channel data. The collected sonar images were converted to pseudo-color to enhance visual effects, highlight details, and distinguish different intensity ranges [page 5 of the manuscript, first line from the top]. The images input for model training were the color-adjusted ones, so they can be regarded as multi-channel images.
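As an illustration of the single-channel-to-pseudo-color step described above, a minimal OpenCV sketch follows. The colormap choice and file paths are assumptions; the response does not name the mapping actually used.

```python
import cv2

# Read a single-channel sonar intensity image (path is a placeholder).
gray = cv2.imread("sonar_scan.png", cv2.IMREAD_GRAYSCALE)

# Stretch intensities to the full 8-bit range, then apply a pseudo-color
# map to highlight detail and separate intensity ranges; COLORMAP_JET is
# illustrative, not necessarily the mapping the authors used.
norm = cv2.normalize(gray, None, 0, 255, cv2.NORM_MINMAX)
color = cv2.applyColorMap(norm, cv2.COLORMAP_JET)  # 3-channel BGR output
cv2.imwrite("sonar_scan_color.png", color)
```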
Comments 4: In the experiment, the input size is 640x640 compared to the unknown original size of the sonar image. Does the target object still exist in the resized image? Does it not disappear due to resizing?
Response 4: Thank you for pointing this out. To balance efficiency and accuracy, we downscaled the original images to 640×640. This resolution significantly improved model efficiency while ensuring that targets remain present in the resized images and their features are still distinct.
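To make the resizing step concrete: YOLO-style pipelines typically letterbox (resize with preserved aspect ratio plus padding) rather than stretch. The sketch below is a generic letterbox, offered as an assumption; whether the authors letterboxed or stretched the 3600×1600 frames is not stated in this exchange.

```python
import cv2
import numpy as np

def letterbox(img: np.ndarray, size: int = 640, pad_value: int = 114) -> np.ndarray:
    """Resize with preserved aspect ratio, padding the remainder (YOLO-style).

    Assumes a 3-channel input. At 3600x1600 -> 640 the scale factor is
    roughly 0.18, so a target must span about 6 original pixels to keep
    one pixel after resizing.
    """
    h, w = img.shape[:2]
    scale = size / max(h, w)  # e.g. 640 / 3600 ~= 0.178
    nh, nw = int(round(h * scale)), int(round(w * scale))
    resized = cv2.resize(img, (nw, nh), interpolation=cv2.INTER_AREA)
    canvas = np.full((size, size, 3), pad_value, dtype=img.dtype)
    top, left = (size - nh) // 2, (size - nw) // 2
    canvas[top:top + nh, left:left + nw] = resized
    return canvas
```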
Comments 5: The authors got a total of 784 raw sonar images, but how many are in each category?
Response 5: Thank you for pointing this out. After rigorous data classification and filtering, a dataset of 784 raw sonar images (3600×1600 resolution) was built. It consists of 421 images containing 545 bottom-resting targets, 366 images containing 397 suspended targets, and 30 images without any targets. [Page 5 of the manuscript, third line from the top.]
Comments 6: The font on the figures is too small and difficult to see when printed.
Response 6: Thank you for pointing this out. We have adjusted all the images in the manuscript. Please see the attachment for details.
Comments 7: Figure 8, the position of the figure is strange.
Response 7: Thank you for pointing this out. We have adjusted the position of the figure in the manuscript.
Comments 8: Section 3.2.2, second line from the top: what is "details26"?
Response 8: Thank you for pointing this out. We used the wrong citation format here: "26" denotes the 26th reference, and we have made the corresponding correction.
Comments 9: Section 3.2.2, fifth line from the top: what is "datasets28"?
Response 9: Thank you for pointing this out. We used the wrong citation format here: "28" denotes the 28th reference, and we have made the corresponding correction.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The manuscript may be accepted.