Article
Peer-Review Record

Small Object Detection in UAV Remote Sensing Images Based on Intra-Group Multi-Scale Fusion Attention and Adaptive Weighted Feature Fusion Mechanism

Remote Sens. 2024, 16(22), 4265; https://doi.org/10.3390/rs16224265
by Zhe Yuan, Jianglei Gong, Baolong Guo *, Chao Wang, Nannan Liao, Jiawei Song and Qiming Wu
Reviewer 1:
Reviewer 2: Anonymous
Submission received: 30 September 2024 / Revised: 3 November 2024 / Accepted: 13 November 2024 / Published: 15 November 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The problem description in the paper is accurate and worth promoting.

1. In the classification of deep learning object detection methods, distinguishing YOLOv1-v4 from YOLOv6/v8/v9 is not accurate enough; the methods should be classified from the two-stage versus single-stage perspective. The presence or absence of anchors is a separate perspective, and these two classification schemes are not decoupled.

2. In the "Related Work" section, the references are too old, especially in the "Attention Mechanism" section, which has developed rapidly in the past three years. It is recommended to add updated references.

3. DIOR is not a specialized dataset for testing small object detection algorithms. You can try the AI-TOD dataset from Wuhan University.

4. Among the comparison indicators, metrics such as parameter count or model weight size could be added.

5. The horizontal and vertical axis labels in Figure 8 are not clear; it is recommended to enlarge them.

6. It is suggested to add comparative experiments with algorithms specialized for small target detection in unmanned aerial vehicles, such as YOLOv4-UAV.

 

Author Response

Comments 1: In the classification of deep learning object detection methods, distinguishing YOLOv1-v4 from YOLOv6/v8/v9 is not accurate enough; the methods should be classified from the two-stage versus single-stage perspective. The presence or absence of anchors is a separate perspective, and these two classification schemes are not decoupled.

Response 1: Thank you for your insight into the classification of deep learning object detection methods. We have revised this passage to clarify our classification. The original text read:

“Presently, mainstream deep learning-based object detection algorithms can be categorized into two primary classes: anchor-based and anchor-free algorithms. Anchor-based methods encompass two-stage object detection frameworks reliant on region proposals, including Faster R-CNN [18] and Cascade R-CNN [19], as well as single-stage regression-oriented object detection frameworks like YOLOv1-v4 [20-23], YOLOv7 [25], SSD [26], and RetinaNet [27]. Conversely, anchor-free methodologies comprise models such as YOLOv6 [24], YOLOv8 [28], YOLOv9 [29], and YOLOv10 [30]. Deep learning-based object detection algorithms, whether anchor-based or anchor-free, have exhibited substantial advantages and have significantly propelled the advancement of computer vision technology.”

We have carried out the following revisions to enhance the clarity of the discussion:

“In the current domain of deep learning, object detection algorithms are primarily classified into two major categories based on whether they depend on the Region Proposal Network (RPN): two-stage object detection algorithms that are RPN-dependent and one-stage object detection algorithms that are RPN-independent. Two-stage object detection algorithms, like Faster R-CNN [18] and Cascade R-CNN [19], usually entail a preprocessing step that generates candidate regions via RPN and subsequently performs target classification and localization on these regions. In contrast, one-stage object detection algorithms, for instance, SSD [20] and RetinaNet [21], directly predict the bounding boxes and classes of targets on the image without the need for an extra candidate region generation process. Furthermore, object detection algorithms can be further categorized according to whether they employ anchor boxes. Anchor-based algorithms, such as YOLOv1-v4[22-25] and YOLOv7[27], predict the position and size of targets using predefined anchor boxes. These algorithms enhance the detection accuracy by adjusting the anchor boxes to match the actual bounding boxes. On the other hand, anchor-free algorithms, such as YOLOv6 [26], YOLOv8 [28], YOLOv9 [29], and YOLOv10 [30], do not rely on predefined anchor boxes but directly predict the bounding boxes, which contributes to simplifying the detection process and potentially enhances model flexibility. These classifications not only mirror the diversity in the design of object detection algorithms but also embody the distinct requirements for speed and accuracy in practical applications. As deep learning technology progresses continuously, these algorithms are also being constantly optimized and evolved to accommodate increasingly complex detection tasks.”

We hope that this revised classification addresses the reviewer's concern and offers a clearer understanding of the differences among object detection algorithms.

 

Comments 2: In the "Related Work" section, the references are too old, especially in the "Attention Mechanism" section, which has developed rapidly in the past three years. It is recommended to add updated references.

Response 2: We are grateful for the reviewers' valuable comments. We have replaced references [42-44] in the "Attention Mechanism" section with relevant studies from the past three years and provided explanations of the updated literature.

In the context of drone detection, where objects are submerged in complex backgrounds, Wang et al. [42] enhanced feature extraction by introducing the SENet channel attention mechanism, thereby suppressing background interference. Nevertheless, the SENet channel attention mechanism focuses only on information in the channel dimension and fails to effectively capture spatial features. To overcome this constraint, Li et al. [43] incorporated the subspace attention mechanism (ULSAM) into the YOLOv4 model, generating distinct attention feature maps for each feature-map subspace. However, ULSAM likewise prioritizes only the spatial dimension while disregarding channel information. Although these attention mechanisms have demonstrated a certain degree of effectiveness in practice, they frequently neglect key elements of the spatial or channel dimensions, resulting in the loss of crucial information. In response to these challenges, Wang et al. [44] proposed a global attention mechanism that integrates information from both the channel and spatial dimensions, enhancing the accuracy of target detection. However, compared with methods that focus only on a single dimension, this approach demands more computational resources and exhibits higher complexity.
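For reference, the channel-attention idea discussed above (global pooling over space followed by per-channel gating, in the style of SENet) can be illustrated with a minimal PyTorch-style sketch; this is a generic illustration under standard assumptions, not the implementation used in [42]:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Generic SE-style channel attention: squeeze spatial dims, excite channels."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        squeeze = x.mean(dim=(2, 3))                 # global average pooling per channel
        weights = self.fc(squeeze).view(b, c, 1, 1)  # per-channel gating weights in (0, 1)
        return x * weights                           # rescale channels; spatial layout is untouched
```

As the quoted passage notes, such a block reweights channels only, which is why purely channel-wise attention can miss spatially localized cues.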

Through these updated references, we hope to present a more comprehensive account of the current development of attention mechanisms in unmanned aerial vehicle (UAV) detection and the remaining challenges.

 

Comments 3: DIOR is not a specialized dataset for testing small object detection algorithms. You can try the AI-TOD dataset from Wuhan University.

Response 3: We are grateful for the reviewers' suggestions. We have incorporated the AI-TOD dataset into the paper and conducted comparative experiments on it. Accordingly, the abstract, the experimental section, and the conclusion have been revised.

Abstract: In view of the issues of missed and false detections encountered in small object detection for UAV remote sensing images, and the inadequacy of existing algorithms in terms of complexity and generalization ability, we propose a small object detection model named IA-YOLOv8 in this paper. This model integrates the intra-group multi-scale fusion attention mechanism and the adaptive weighted feature fusion approach. In the feature extraction phase, the model employs a hybrid pooling strategy that combines Avg and Max pooling to replace the single Max pooling operation used in the original SPPF framework. Such modifications enhance the model's ability to capture minute features of small objects. In addition, an adaptive feature fusion module is introduced, which is capable of automatically adjusting the weights based on the significance and contribution of features at different scales to improve the detection sensitivity for small objects. Simultaneously, a lightweight intra-group multi-scale fusion attention module is implemented, which aims to effectively mitigate background interference and enhance the saliency of small objects. Experimental results indicate that the proposed IA-YOLOv8 model has a parameter size of 10.9 MB, attaining a mean average precision (mAP) value of 42.1% on the VisDrone2019 test set, an mAP value of 82.3% on the DIOR test set, and an mAP value of 39.8% on the AI-TOD test set. All these results outperform existing detection algorithms, demonstrating the superior performance of the IA-YOLOv8 model in the task of small object detection for UAV remote sensing.
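To illustrate the hybrid pooling strategy described in the abstract, the following is a minimal PyTorch-style sketch of an SPPF-like block in which each pooling stage mixes Max and Avg pooling; the module and parameter names are illustrative assumptions, not the exact IA-YOLOv8 implementation:

```python
import torch
import torch.nn as nn

class HybridPoolSPPF(nn.Module):
    """SPPF-like block whose pooling stages combine Max- and Avg-pooled responses
    (an illustrative sketch of the hybrid pooling idea, not the authors' exact module)."""
    def __init__(self, channels: int, pool_size: int = 5):
        super().__init__()
        hidden = channels // 2
        self.reduce = nn.Conv2d(channels, hidden, kernel_size=1)
        self.max_pool = nn.MaxPool2d(pool_size, stride=1, padding=pool_size // 2)
        self.avg_pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2)
        # the reduced input plus three pooled stages are concatenated and fused back
        self.fuse = nn.Conv2d(hidden * 4, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.reduce(x)
        feats = [x]
        for _ in range(3):
            # hybrid pooling: mix Max (salient peaks) and Avg (fine context) responses
            pooled = 0.5 * (self.max_pool(feats[-1]) + self.avg_pool(feats[-1]))
            feats.append(pooled)
        return self.fuse(torch.cat(feats, dim=1))
```

The intent of such a mix is that Avg pooling retains low-contrast detail that a single Max pooling pass would discard, which matters for very small objects.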

Experiments and Analysis:

(1) Experimental Environment Setup and Data Set Introduction

The AI-TOD dataset is a large-scale dataset for the detection of micro-targets in aerial images, which was released by Wuhan University in 2021. This dataset was constructed by extracting some images and object instances from datasets such as DOTAv1.5, xView, VisDrone2018, Airbus Ship, and DIOR. It encompasses 8 categories, with a total of 28,036 aerial images and 700,621 instances. The pixel resolution of each image is 800×800 pixels, and the average size of the targets in the images is 12.8 pixels. Additionally, for a better evaluation of the algorithms, 11,214 images were utilized for training, 2,804 for validation, and 14,018 for testing.

(2) Comparative Experiment

Table 5.  Comparative Analysis of Different Algorithms on the AI-TOD Dataset.

| Method | Backbone Network | Input Size | Parameters (MB) | Model Size (MB) | mAP (%) |
|---|---|---|---|---|---|
| Faster RCNN [18] | ResNet50+FPN | 600×600 | 41.5 | 83.3 | 25.6 |
| Cascade-RCNN [19] | ResNet50+FPN | 600×600 | 64.3 | 128.9 | 27.4 |
| SSD [26] | VGG16 | 300×300 | 23.7 | 47.7 | 14.7 |
| RetinaNet [27] | ResNet50+FPN | 800×800 | 36.8 | 73.9 | 26.3 |
| YOLOv5s | CSPDarkNet-53 | 512×512 | 8.7 | 17.6 | 30.7 |
| YOLOv8s [28] | CSPDarkNet-53 | 512×512 | 10.6 | 21.5 | 35.9 |
| YOLOv9s [29] | CSPDarkNet-53 | 512×512 | 6.8 | 13.8 | 30.9 |
| YOLOv10s [30] | CSPDarkNet-53 | 512×512 | 7.7 | 15.6 | 29.5 |
| M-YOLOv8s [65] | CSPDarkNet-53 | 300×300 | 5.9 | 11.8 | 38.6 |
| Improved_YOLOv8 [66] | CSPDarkNet-53 | 300×300 | 5.6 | 11.2 | 37.4 |
| IA-YOLOv8 (ours) | CSPDarkNet-53 | 512×512 | 10.9 | 22.1 | 39.8 |

The experimental results presented in Tables 4 and 5 indicate that the proposed IA-YOLOv8 algorithm attains mean average precision (mAP) values of 82.3% and 39.8% on the DIOR and AI-TOD datasets, respectively, showcasing superior performance compared with the other algorithms and underscoring its robust generalization capability. Notably, and contrary to the trend observed for YOLOv9s on the VisDrone2019 dataset, on the DIOR and AI-TOD datasets its smaller parameter count and model size are not accompanied by improved performance: the mAP of YOLOv9s declines as its parameters and model size decrease. This phenomenon highlights the limited generalization ability of YOLOv9s in optical remote sensing image processing.

(3) Visual Analysis

According to Figure 11, the green dashed lines delineate the comparison regions for the two algorithms. Comparing Figures (a1) and (a2) shows that the YOLOv8s algorithm fails to detect low-resolution "persons" and "vehicles", whereas the IA-YOLOv8 algorithm successfully identifies these objects. Comparing Figures (b1) and (b2), it is evident that in scenarios containing "vehicles", the YOLOv8s algorithm has a considerable miss rate for low-resolution "vehicles", while the IA-YOLOv8 algorithm not only detects them precisely but also reduces the miss rate to a certain extent. Likewise, contrasting Figures (c1) and (c2) shows that for "vehicles" with low resolution and sparse pixel density, the YOLOv8s algorithm is prone to missed detections, whereas the IA-YOLOv8 algorithm delivers robust detection performance. In conclusion, on the AI-TOD test set, the IA-YOLOv8 algorithm demonstrates significant superiority over the YOLOv8s algorithm.

Figure 11. Detection results of YOLOv8s and IA-YOLOv8 on the AI-TOD test dataset. (a), (b), and (c) show the input images; (a1), (b1), and (c1) present the detection results obtained using YOLOv8s; (a2), (b2), and (c2) present the detection results obtained using IA-YOLOv8.

Conclusion:

In this paper, we propose IA-YOLOv8, a UAV small object detection algorithm that addresses the challenges of identifying low-resolution and indistinct small objects in UAV remote sensing images by incorporating an intra-group multi-scale fusion attention mechanism and an adaptive weighted feature fusion mechanism. The algorithm effectively mitigates the detail loss associated with the original SPPF's reliance on Max pooling alone by integrating Avg pooling and Max pooling, thereby significantly enhancing the model's capacity to capture and represent small object features. Furthermore, an adaptive feature fusion module combines deep semantic features with shallow detail features to enable a more comprehensive capture of small object characteristics. Additionally, we introduce a lightweight intra-group multi-scale fusion attention module to enhance small object feature information while reducing background interference. Experimental results on the VisDrone2019, DIOR, and AI-TOD datasets demonstrate that our IA-YOLOv8 algorithm achieves mAP values of 42.1%, 82.3%, and 39.8%, respectively, while requiring only 10.9 MB of parameters, showcasing substantial improvements over existing object detection algorithms.

 

Comments 4: In comparison indicators, indicators such as parameter quantity or parameter weight capacity can be increased.

Response 4: We are grateful for the reviewers' suggestions. To address this point, we have added a model weight size metric to each comparative experiment, as follows:

Table 3.  Comparative Analysis of Different Algorithms on the Visdrone2019 Dataset.

| Method | Backbone Network | Input Size | Parameters (MB) | Model Size (MB) | mAP (%) |
|---|---|---|---|---|---|
| Faster RCNN [18] | ResNet50+FPN | 600×600 | 41.5 | 83.3 | 28.6 |
| Cascade-RCNN [19] | ResNet50+FPN | 600×600 | 64.3 | 128.9 | 32.7 |
| SSD [26] | VGG16 | 300×300 | 23.7 | 47.7 | 15.3 |
| RetinaNet [27] | ResNet50+FPN | 800×800 | 36.8 | 73.9 | 24.9 |
| YOLOv5s | CSPDarkNet-53 | 512×512 | 8.7 | 17.6 | 33.4 |
| YOLOv8s [28] | CSPDarkNet-53 | 512×512 | 10.6 | 21.5 | 34.2 |
| YOLOv9s [29] | CSPDarkNet-53 | 512×512 | 6.8 | 13.8 | 34.7 |
| YOLOv10s [30] | CSPDarkNet-53 | 512×512 | 7.7 | 15.6 | 32.5 |
| M-YOLOv8s [65] | CSPDarkNet-53 | 300×300 | 5.9 | 11.8 | 41.2 |
| Improved_YOLOv8 [66] | CSPDarkNet-53 | 300×300 | 5.6 | 11.2 | 40.5 |
| IA-YOLOv8 (ours) | CSPDarkNet-53 | 512×512 | 10.9 | 22.1 | 42.1 |

Table 4.  Comparative Analysis of Different Algorithms on the DIOR Dataset.

| Method | Backbone Network | Input Size | Parameters (MB) | Model Size (MB) | mAP (%) |
|---|---|---|---|---|---|
| Faster RCNN [18] | ResNet50+FPN | 600×600 | 41.5 | 83.3 | 65.3 |
| Cascade-RCNN [19] | ResNet50+FPN | 600×600 | 64.3 | 128.9 | 72.7 |
| SSD [26] | VGG16 | 300×300 | 23.7 | 47.7 | 55.4 |
| RetinaNet [27] | ResNet50+FPN | 800×800 | 36.8 | 73.9 | 68.8 |
| YOLOv5s | CSPDarkNet-53 | 512×512 | 8.7 | 17.6 | 79.1 |
| YOLOv8s [28] | CSPDarkNet-53 | 512×512 | 10.6 | 21.5 | 79.3 |
| YOLOv9s [29] | CSPDarkNet-53 | 512×512 | 6.8 | 13.8 | 78.9 |
| YOLOv10s [30] | CSPDarkNet-53 | 512×512 | 7.7 | 15.6 | 77.1 |
| M-YOLOv8s [65] | CSPDarkNet-53 | 300×300 | 5.9 | 11.8 | 80.7 |
| Improved_YOLOv8 [66] | CSPDarkNet-53 | 300×300 | 5.6 | 11.2 | 80.3 |
| IA-YOLOv8 (ours) | CSPDarkNet-53 | 512×512 | 10.9 | 22.1 | 82.3 |

Table 5.  Comparative Analysis of Different Algorithms on the AI-TOD Dataset.

| Method | Backbone Network | Input Size | Parameters (MB) | Model Size (MB) | mAP (%) |
|---|---|---|---|---|---|
| Faster RCNN [18] | ResNet50+FPN | 600×600 | 41.5 | 83.3 | 25.6 |
| Cascade-RCNN [19] | ResNet50+FPN | 600×600 | 64.3 | 128.9 | 27.4 |
| SSD [26] | VGG16 | 300×300 | 23.7 | 47.7 | 14.7 |
| RetinaNet [27] | ResNet50+FPN | 800×800 | 36.8 | 73.9 | 26.3 |
| YOLOv5s | CSPDarkNet-53 | 512×512 | 8.7 | 17.6 | 30.7 |
| YOLOv8s [28] | CSPDarkNet-53 | 512×512 | 10.6 | 21.5 | 35.9 |
| YOLOv9s [29] | CSPDarkNet-53 | 512×512 | 6.8 | 13.8 | 30.9 |
| YOLOv10s [30] | CSPDarkNet-53 | 512×512 | 7.7 | 15.6 | 29.5 |
| M-YOLOv8s [65] | CSPDarkNet-53 | 300×300 | 5.9 | 11.8 | 38.6 |
| Improved_YOLOv8 [66] | CSPDarkNet-53 | 300×300 | 5.6 | 11.2 | 37.4 |
| IA-YOLOv8 (ours) | CSPDarkNet-53 | 512×512 | 10.9 | 22.1 | 39.8 |

 

 

Comments 5: The horizontal and vertical axis labels in Figure 8 are not clear; it is recommended to enlarge them.

Response 5: We are grateful for the reviewers' feedback. We have improved the horizontal and vertical axes of Figure 8 to increase its clarity. Specifically, we have enlarged the font size of the axis labels and ensured that they are more legible. We hope that this modification satisfies your suggestion.

Comments 6: Suggest adding comparative experiments with specialized algorithms for small target detection in unmanned aerial vehicles such as YOLOv4-UAV.

Response 6:

We are grateful for the reviewers' suggestions. We have added comparative experiments with algorithms [65-66] specifically targeting small target detection for unmanned aerial vehicles (UAVs), namely M-YOLOv8s and Improved_YOLOv8, as presented in the three tables in Response 4 above. These experiments are designed to further validate the effectiveness of our approach and have been elaborated in the experimental section. We hope these newly added contents fulfill your expectations.

[65] Duan, S.; Wang, T.; Li, T.; Yang, W. M-YOLOv8s: An improved small target detection algorithm for UAV aerial photography. Journal of Visual Communication and Image Representation. 2024, 104, 104289.

[66] Ning, T.; Wu, W.; Zhang, J. Small object detection based on YOLOv8 in UAV perspective. Pattern Analysis and Applications. 2024, 27, 103.

 

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The paper proposes detection procedure of small targets in UAV remote sensing images.

The detection technique involves fusion of several features, dynamically weighted for achieving higher 'average precision'.  

I found this paper relevant and interesting. However, reading it reveals some difficulties in understanding the whole detection scheme from the sensors to the decision-making / classifier / predictor.

It seems that the authors 'jump' directly to the YOLO algorithm, comparing between previous versions of the program, while relying on their comprehensive literature survey, without giving the required background of the whole detection system.

Moreover, the algorithm is claimed to improve detection probability and reduce the probability of false alarm, while it is not clear what the main reasons for failures are (limited resolution, focusing, noise, image distortions, atmospheric clearance...).

1. For completeness, and in order to make this paper 'self-contained', I suggest that the authors somewhat extend the description of the whole detection system, explain the different layers in Fig. 1, and point out the place of their additional task in the detection and classification chain.

2. Also, give some background on the physical issues causing reduction in the detection performance, which the proposed algorithm aims to counter.

 

and

 

3. If possible, explain (or assess) how the additional features contribute to the improvement.

4. Looking at the confusion matrix in Fig. 8 reveals that the background plays a role in the classification results in both compared YOLO versions. Please elaborate on this matter.

5. I also noticed that some of the abbreviations are given without their complete phrasing.

 

Author Response

Comments 1: For completeness, and in order to make this paper 'self-contained', I suggest that the authors somewhat extend the description of the whole detection system, explain the different layers in Fig. 1, and point out the place of their additional task in the detection and classification chain.

Response 1: We have further elaborated on the description of the entire detection system:

This network consists of four parts: Input, Backbone, Neck, and Prediction. (1) Input: this part is responsible for resizing the image to the size required by the network to guarantee effective feature extraction; meanwhile, the Mosaic data augmentation technique is adopted to enhance the generalization ability and anti-interference capacity of the network. (2) Backbone: this part employs the CSPDarkNet-53 network to extract features of the input image at diverse scales to accommodate variations in target size. (3) Neck: this part utilizes the FPN (Feature Pyramid Network) structure to better integrate the spatial detail information of the shallow layers with the semantic information of the deep layers, forming multi-scale features. (4) Prediction: this part constitutes the final stage of the detection network and is responsible for decoding and predicting the features from the Neck, generating the final categories and bounding box regressions. By generating predictions on feature maps of different scales, precise detection of targets of different sizes can be achieved. The following are the enhancements made to the algorithmic framework:
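For orientation, the four-part structure just described can be summarized in a schematic PyTorch-style sketch; the sub-modules are placeholders standing in for CSPDarkNet-53, the FPN neck, and the detection heads, and are not the authors' code:

```python
import torch
import torch.nn as nn

class DetectorSketch(nn.Module):
    """Schematic of the four-part detector described above; every sub-module is a placeholder."""
    def __init__(self, backbone: nn.Module, neck: nn.Module, heads: nn.ModuleList):
        super().__init__()
        self.backbone = backbone  # multi-scale feature extraction (e.g. CSPDarkNet-53)
        self.neck = neck          # FPN-style fusion of shallow detail and deep semantics
        self.heads = heads        # one prediction head per feature-map scale

    def forward(self, images: torch.Tensor):
        # (1) Input: images are assumed to be resized/augmented (e.g. Mosaic) upstream
        c3, c4, c5 = self.backbone(images)     # (2) Backbone: features at three scales
        p3, p4, p5 = self.neck((c3, c4, c5))   # (3) Neck: multi-scale feature fusion
        # (4) Prediction: decode each fused scale into class scores and box regressions
        return [head(p) for head, p in zip(self.heads, (p3, p4, p5))]
```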

 

Comments 2: Also, give some background on the physical issues causing reduction in the detection performance, which the proposed algorithm aims to counter.

Response 2: Thank you for your valuable comments. In the "Introduction" section, we have modified the passage

"In conclusion, while the aforementioned algorithms have demonstrated improvements in small object detection performance within UAV remote sensing images, they suffer from high model complexity and inadequate generalization capabilities. This results in the persistent issue of missed detections and false positives for small objects. To address these challenges, in this paper, we propose a small object detection network based on the YOLOv8 detection algorithm specifically tailored for UAV remote sensing imagery. The proposed approach is designed to mitigate problems related to excessive model parameters, limited generalization ability, and the occurrence of missing or misidentified small objects, thereby enhancing the accuracy of small object detection and facilitating their effective deployment on UAVs. The primary contributions are as follows:"

to

"In conclusion, while the aforementioned algorithms have demonstrated improvements in small object detection performance within UAV remote sensing images, they suffer from high model complexity and inadequate generalization capabilities. This results in the persistent issue of missed detections and false positives for small objects. Furthermore, in the detection of small objects in UAV images, detection performance is typically degraded by the following physical factors: (1) small objects occupy only a limited number of pixels, making it challenging for the model to extract sufficient discriminative features; (2) objects in natural environments are frequently partially occluded by other objects, triggering missed and false detections; (3) changes in weather and lighting conditions reduce the contrast between the object and the background, thereby affecting detection accuracy. To cope with the challenges of the aforementioned algorithmic deficiencies and the decline in detection performance caused by physical factors, this paper proposes a small object detection network for UAV remote sensing images based on YOLOv8. This approach is intended to address these challenges, enhance the accuracy of small object detection, and facilitate its effective deployment on UAVs. The main contributions are as follows:"

 

 

Comments 3: If possible, explain (or assess) how the additional features contribute to the improvement.

Response 3: Thank you for your valuable comments. We have offered a rather detailed explanation in the "Ablation Experiment" section of the paper. To be specific, the experimental results indicate that the introduction of additional detection branches and other modules has significantly enhanced the detection accuracy, recall rate, and mean average precision (mAP), and improved the model's performance in complex scenarios. The detailed effects of these modules are manifested in the experimental data in Table 2. For instance:
• The additional detection branch increased the depth of the model and brought about remarkable performance enhancements: the accuracy rose from 32.9% to 39.0%, the recall rate increased from 49.9% to 52.9%, and the mAP increased from 34.2% to 40.8%.
• The Mix-SPPF, IGMSFA, and AWFF modules achieved further fine-tuning of the detection accuracy in the small target detection tasks through more efficient feature fusion and learning capabilities (e.g., the IGMSFA module increased the recall rate by 1.0% and the mAP by 0.3%); an illustrative sketch of the adaptive weighting idea behind AWFF is given after this list.

Consequently, we hold that these modules possess significant performance advantages in complex detection scenarios, as verified by the experimental data. We hope this explanation further illuminates the role of these modules in enhancing the model's performance.
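As a minimal sketch of the adaptive weighting idea behind AWFF, assuming the per-scale feature maps have already been resized to a common shape (the class and parameter names are hypothetical, not the paper's exact module):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import List

class AdaptiveWeightedFusion(nn.Module):
    """Fuses same-shaped feature maps from several scales with learnable, normalized weights."""
    def __init__(self, num_inputs: int):
        super().__init__()
        # one learnable scalar weight per input scale
        self.weights = nn.Parameter(torch.ones(num_inputs))

    def forward(self, feats: List[torch.Tensor]) -> torch.Tensor:
        w = F.softmax(self.weights, dim=0)  # normalize so the scale weights sum to 1
        return sum(w[i] * f for i, f in enumerate(feats))

# usage sketch: fuse three aligned pyramid levels
# fused = AdaptiveWeightedFusion(3)([p3, p4_up, p5_up])
```

The weights are learned jointly with the rest of the network, so scales that contribute more to small object detection can receive larger fusion weights.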

 

Comments 4: Looking at the confusion matrix in Fig. 8, reveals that the background is playing a role on the classification results in both compared YOLO versions. Please elaborate on this matter.

Response 4: Thank you for your valuable comments. It can be observed from Figure 8 that, when comparing the classification results of YOLOv8 and IA-YOLOv8, the background category exerts an influence on the classification outcome. This is because the image scenes in the VisDrone2019 dataset are complex: the majority of samples belong to the background category, while the samples of the other target categories are relatively scarce by comparison. In cases where the features of the background category and a target category are similar or the boundary between them is ambiguous, the model may erroneously classify an object as background.

 

Comments 5: I also noticed that some of the abbreviations are given without their complete phrasing.

Response 5: Thank you for your valuable comments. We have made changes in the text as follows:

UAV- Unmanned Aerial Vehicle;

CBAM- Convolutional Block Attention Module;

SPP- Spatial Pyramid Pooling;

SPPCSPC- Spatial Pyramid Pooling - Cross Stage Partial Connections;

SPPFCSPC- Spatial Pyramid Pooling Fast - Cross Stage Partial Connections;

 

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

All my comments are addressed in the revised manuscript.
