Peer-Review Record

SMFF-YOLO: A Scale-Adaptive YOLO Algorithm with Multi-Level Feature Fusion for Object Detection in UAV Scenes

Remote Sens. 2023, 15(18), 4580; https://doi.org/10.3390/rs15184580
by Yuming Wang 1,2, Hua Zou 1,*, Ming Yin 2 and Xining Zhang 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 29 August 2023 / Revised: 13 September 2023 / Accepted: 14 September 2023 / Published: 18 September 2023

Round 1

Reviewer 1 Report

**Summary:**
The paper introduces a novel object detection framework called SMFF-YOLO, designed for detecting objects in images captured by unmanned aerial vehicles (UAVs). The main challenges addressed in this framework include multi-scale variations, complex backgrounds, and the detection of tiny-sized objects. SMFF-YOLO incorporates several key components, including a new prediction head, a tiny object detection head, the bidirectional feature fusion pyramid (BFFP) module, and the adaptive atrous spatial pyramid pooling (AASPP) module. The framework is evaluated using the VisDrone and UAVDT datasets, demonstrating improved detection accuracy, robustness, and adaptability in challenging scenarios.

**Contributions and Strengths:**
1. **New Prediction Head:** SMFF-YOLO introduces a novel prediction head that combines Swin-Transformer and CNN to improve feature representation by capturing both global and local information. This allows the model to better understand object semantics and spatial relationships, contributing to higher detection accuracy.

2. **Tiny Object Detection:** To address the challenge of detecting tiny objects, the framework includes an additional prediction head dedicated to tiny objects. This enhances sensitivity to small-sized targets, making the model more effective in detecting them even in complex backgrounds.

3. **Multi-level Feature Fusion:** The BFFP module efficiently aggregates multi-scale features, improving the model's ability to detect objects of varying sizes and helping to address the scale variations commonly encountered in UAV-captured images (a generic sketch of this kind of bidirectional fusion follows this list).

4. **Adaptive Feature Fusion:** The AASPP module is introduced to handle complex backgrounds by adaptively fusing features, mitigating the impact of cluttered scenes on object detection.

5. **Evaluation and Robustness:** SMFF-YOLO is evaluated on real-world datasets (VisDrone and UAVDT) and demonstrates higher detection accuracy compared to existing methods. It also exhibits robustness in challenging scenarios involving complex backgrounds, tiny objects, and occlusions.
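
To make the multi-level fusion idea in point 3 concrete, here is a minimal PyTorch sketch of a generic bidirectional (top-down plus bottom-up) fusion step over three pyramid levels. The class name, equal-channel assumption, and fusion-by-addition choice are hypothetical illustration only, not the BFFP module defined in the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalFusion(nn.Module):
    """Hypothetical bidirectional (top-down + bottom-up) fusion of three
    pyramid levels (strides 8/16/32). Illustrative only -- not the BFFP
    module defined in the paper. Assumes all levels share `channels` and
    that each level is exactly twice the resolution of the next."""

    def __init__(self, channels: int = 256):
        super().__init__()
        self.td_conv = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(2)]
        )
        self.bu_conv = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(2)]
        )

    def forward(self, p3, p4, p5):
        # Top-down pass: spread coarse semantic context to finer levels.
        p4_td = self.td_conv[0](p4 + F.interpolate(p5, size=p4.shape[-2:], mode="nearest"))
        p3_out = self.td_conv[1](p3 + F.interpolate(p4_td, size=p3.shape[-2:], mode="nearest"))
        # Bottom-up pass: feed fine spatial detail back to coarser levels.
        p4_out = self.bu_conv[0](p4_td + F.max_pool2d(p3_out, kernel_size=2))
        p5_out = self.bu_conv[1](p5 + F.max_pool2d(p4_out, kernel_size=2))
        return p3_out, p4_out, p5_out
```

The top-down pass propagates semantic context from the coarse stride-32 map toward the fine stride-8 map, while the bottom-up pass returns fine spatial detail, which is the general mechanism credited above with handling scale variation.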

**Corrections and Suggestions:**
- The paper provides a detailed description of the proposed method but lacks a clear summary of the experimental results in the main body of the text. It would be beneficial to include a concise summary of the key results and improvements achieved by SMFF-YOLO in the main body of the paper.
- The paper mentions the introduction of a dynamic non-monotonic focal mechanism in the loss function but does not provide detailed information about this mechanism. It would be helpful to elaborate on how this mechanism contributes to the model's performance.
- The paper could benefit from including a discussion on limitations and potential future research directions to provide a more comprehensive view of the proposed framework.
- Consider adding visualizations or figures to illustrate the performance improvements achieved by SMFF-YOLO in different scenarios. This would enhance the clarity of the results presentation.

Overall, the paper presents a promising framework for object detection in UAV-captured images, addressing significant challenges in the field. The contributions, strengths, and experimental results showcase the effectiveness of SMFF-YOLO in improving detection accuracy and robustness.

Author Response

All revised text in the manuscript is shown in blue.

Reviewer #1: 

(1) The paper provides a detailed description of the proposed method but lacks a clear summary of the experimental results in the main body of the text. It would be beneficial to include a concise summary of the key results and improvements achieved by SMFF-YOLO in the main body of the paper.

Answer: Thank you for your suggestions. We have added an experimental summary in Section 4.4.5 and rearranged the order of the experiments.

 

(2) The paper mentions the introduction of a dynamic non-monotonic focal mechanism in the loss function but does not provide detailed information about this mechanism. It would be helpful to elaborate on how this mechanism contributes to the model's performance.

Answer: Based on your suggestion, we have explained the functionality of the dynamic non-monotonic focal mechanism in Section 3.4.
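
As background for this point: the best-known dynamic non-monotonic focal mechanism is the focusing coefficient of Wise-IoU v3, which weights each box's IoU loss by a non-monotonic function of its "outlier degree" relative to a running mean of that loss. The sketch below illustrates that general idea only; the function name is invented here, the constants are commonly cited Wise-IoU defaults, and the paper's actual Section 3.4 formulation may differ.

```python
import torch

def nonmonotonic_focal_weight(iou_loss: torch.Tensor,
                              running_mean: torch.Tensor,
                              alpha: float = 1.9,
                              delta: float = 3.0) -> torch.Tensor:
    """Hypothetical WIoU-v3-style focusing coefficient.

    beta measures how hard each box is relative to the running mean of the
    IoU loss; the weight r first rises and then falls with beta, so both
    very easy boxes and extreme outliers (often low-quality labels) are
    down-weighted, letting ordinary-quality boxes dominate the gradient.
    """
    beta = iou_loss.detach() / running_mean.clamp(min=1e-6)
    r = beta / (delta * alpha ** (beta - delta))
    return r

# Usage: per-box loss = nonmonotonic_focal_weight(iou_loss, running_mean) * iou_loss,
# with running_mean maintained as an exponential moving average over training batches.
```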

 

(3) The paper could benefit from including a discussion on limitations and potential future research directions to provide a more comprehensive view of the proposed framework.

Answer: Thank you for your suggestion. We have rewritten a paragraph in the Conclusion to clarify the limitations of our method and possible future research directions.

 

(4) Consider adding visualizations or figures to illustrate the performance improvements achieved by SMFF-YOLO in different scenarios. This would enhance the clarity of the results presentation.

Answer: Thank you for your suggestion. We have added visualization results for the detection experiments in occlusion scenes to improve the clarity of the results presentation.

Reviewer 2 Report

This paper proposes a scale-adaptive YOLO framework called SMFF-YOLO to address the precise detection of multi-scale and tiny objects in UAV-captured images. It enhances the detection of tiny objects by designing new prediction head modules and adding an additional tiny-object detection head. The model also utilizes hybrid attention and cascaded atrous convolutions to effectively extract multi-scale feature information, adapt to targets of different scales, and enhance the detection accuracy of multi-scale objects. The experimental results demonstrate that the proposed SMFF-YOLO achieves higher accuracy than other existing methods. The following suggestions might be beneficial to the article:

1. The BFFP module is missing in Fig. 1, which is inconsistent with the description in the paper. In addition, in the Neck section of Fig. 1, the CBS module lacks inputs.

2. Please provide additional explanations of W-MHSA, SW-MHSA and Fig. 5 in subsection 3.1 so as to allow the reader to better understand the AASPP.

3. Eqs. 4 and 5 are difficult to understand because the authors do not explain the meaning of the relevant symbols in them, such as [], (), etc.

4. It is confusing that the letter representations in Eqs. 6-13 do not match those in Fig. 6.

5. Visualization Fig. 9 contradicts the explanation of its caption. Specifically, there is no picture "m" in Fig. 9, whereas the caption contains its explanation. Are you sure picture "a" is ground truth?

6. Although the introduction of the AASPP and ECA modules leads to a richer feature representation, it can be seen from Table 6 that additional computational cost is incurred. I would like to know how the authors balanced the trade-off between performance and complexity.

7. Please cite more papers published in “Remote Sensing” in the last two years.

Moderate editing of English language required.

Author Response

(1) The BFFP module is missing in Fig. 1, which is inconsistent with the description in the paper. In addition, in the Neck section of Fig. 1, the CBS module lacks inputs.

Answer: Thank you for your suggestion. We have added the input to the CBS module and labeled the BFFP module in Fig. 1.

 

(2) Please provide additional explanations of W-MHSA, SW-MHSA and Fig. 5 in subsection 3.1 so as to allow the reader to better understand the AASPP.

Answer: Thank you for your suggestion. We added explanations of the modules in Section 3.1 and provided an explanation of Figure 5 in Section 3.2.
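
As additional background for this exchange, the snippet below gives a generic, hypothetical illustration of cascaded atrous (dilated) convolutions, the multi-scale building block that Reviewer 2's summary attributes to the model. It is illustrative only and is not the AASPP module described in the paper; the class name, dilation rates, and fusion-by-summation are assumptions.

```python
import torch
import torch.nn as nn

class CascadedAtrous(nn.Module):
    """Hypothetical cascade of dilated 3x3 convolutions. Stacking growing
    dilation rates enlarges the receptive field without losing resolution,
    which helps a single feature map cover objects of very different sizes."""

    def __init__(self, channels: int = 256, rates=(1, 2, 4, 8)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(channels),
                nn.SiLU(inplace=True),
            )
            for r in rates
        ])

    def forward(self, x):
        outs = []
        for stage in self.stages:
            x = stage(x)          # each stage sees the previous stage's output
            outs.append(x)
        # Sum the cascaded outputs; a learned 1x1 conv could replace the sum.
        return torch.stack(outs, dim=0).sum(dim=0)
```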

(3) Eqs. 4 and 5 are difficult to understand because the authors do not explain the meaning of the relevant symbols in them, such as [], (), etc.

Answer: Thank you for your suggestion. We have added explanations for these symbols in Section 3.2.

(4) It is confusing that the letter representations in Eqs. 6-13 do not match those in Fig. 6.

Answer: Thank you for your suggestion. We have corrected the symbols in Fig. 6.

 

(5) Visualization Fig. 9 contradicts the explanation of its caption. Specifically, there is no picture "m" in Fig. 9, whereas the caption contains its explanation. Are you sure picture "a" is ground truth?

Answer: Thank you for your suggestion. This was our oversight, and we have rewritten the caption accordingly.

(6) Although the introduction of the AASPP and ECA modules leads to a richer feature representation, it can be seen from Table 6 that additional computational cost is incurred. I would like to know how the authors balanced the trade-off between performance and complexity.

Answer: Thank you for your suggestion. We have added a discussion on the trade-off between model performance and complexity in the Discussion section.

(7) Please cite more papers published in “Remote Sensing” in the last two years.

Answer: Thank you for your suggestion. We have cited 6 papers published in “Remote Sensing” in the last two years.

Round 2

Reviewer 2 Report

Please publish this paper.

None.
