Unmanned Aerial Vehicle Object Detection Based on Information-Preserving and Fine-Grained Feature Aggregation
Abstract
1. Introduction
- This paper introduces the IPFA module to address the feature information loss incurred by conventional aggregation operations such as strided convolution. By splitting the feature map along its spatial dimensions and reassembling the pieces along the channel dimension, the IPFA module aggregates features into abstract semantic representations while preserving every original activation (a minimal sketch of this idea follows this list).
- A CSFM is introduced to balance low-level spatial detail and high-level semantics during feature fusion. The CSFM combines channel- and spatial-attention mechanisms to filter out redundant and conflicting responses, thereby making the fusion more effective (a sketch likewise follows this list).
- This paper proposes FGAFPN to fully exploit the complementary strengths of deep and shallow feature maps. Through the CSFM and cross-level connections, FGAFPN balances semantic and spatial information in the output feature maps, narrowing the semantic gap between levels and suppressing the generation of conflicting information, which improves detection performance, particularly in scenes with complex backgrounds.
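The paper does not include source code, so the following is a minimal PyTorch sketch of the information-preserving aggregation idea described in the first bullet: the feature map is split into its four pixel-offset sub-grids, the pieces are stacked along the channel dimension (so, unlike a strided convolution, no activation is discarded), and a 1×1 convolution then mixes the reassembled channels. The class name `IPFASketch` and the choice of a 1×1 fusion convolution are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class IPFASketch(nn.Module):
    """Sketch of information-preserving 2x downsampling.

    A stride-2 convolution evaluates only every other spatial position,
    discarding half the rows and columns of responses. Here the input is
    instead split into its four 2x2 pixel-offset sub-grids, concatenated
    on the channel axis, and mixed by a 1x1 convolution, so every input
    activation contributes to the aggregated map.
    """

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Four sub-grids arrive stacked on channels: 4 * in_channels.
        self.fuse = nn.Conv2d(4 * in_channels, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B, 4C, H/2, W/2); assumes even H and W.
        tl = x[..., 0::2, 0::2]  # top-left sub-grid
        tr = x[..., 0::2, 1::2]  # top-right sub-grid
        bl = x[..., 1::2, 0::2]  # bottom-left sub-grid
        br = x[..., 1::2, 1::2]  # bottom-right sub-grid
        return self.fuse(torch.cat([tl, tr, bl, br], dim=1))


if __name__ == "__main__":
    m = IPFASketch(64, 128)
    print(m(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 128, 40, 40])
```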
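In the same spirit, this hedged sketch shows one way the conflict-suppression idea in the second bullet could be realized: a channel gate and a spatial gate (in the style of squeeze-and-excitation and CBAM attention) re-weight the sum of a shallow and a deep feature map so that channels and locations carrying redundant or contradictory responses are attenuated before fusion. The module name `CSFMSketch`, the gate designs, and the reduction ratio are assumptions for illustration.

```python
import torch
import torch.nn as nn


class CSFMSketch(nn.Module):
    """Sketch of attention-gated fusion of a shallow and a deep map."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel gate: global context -> per-channel weights in (0, 1).
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial gate: pooled maps -> per-location weights in (0, 1).
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, shallow: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
        # Both inputs must share shape (B, C, H, W), e.g. after upsampling.
        fused = shallow + deep
        fused = fused * self.channel_gate(fused)  # suppress conflicting channels
        pooled = torch.cat(
            [fused.mean(dim=1, keepdim=True),   # channel-wise average
             fused.amax(dim=1, keepdim=True)],  # channel-wise maximum
            dim=1,
        )
        return fused * self.spatial_gate(pooled)  # suppress conflicting locations
```

A caller would first project both levels to the same channel count and resolution (for example with a 1×1 convolution and nearest-neighbor upsampling) before passing them to `CSFMSketch`.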
2. Related Work
2.1. Object Detection
2.2. UAV Image Object Detection
3. Methodology
3.1. Information-Preserving Feature Aggregation Module
3.2. Conflict Information Suppression Feature Fusion Module
3.3. Fine-Grained Aggregation Feature Pyramid Network
4. Experiment
4.1. Datasets
4.2. Evaluation Metrics
4.3. Implementation Details
4.4. Analysis of Results
4.4.1. Effect of IPFA Module
4.4.2. Effect of CSFM
4.5. Sensitivity Analysis
4.6. Comparison with Mainstream Models
4.7. Ablation Experiments
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Audebert, N.; Le Saux, B.; Lefevre, S. Beyond RGB: Very High Resolution Urban Remote Sensing With Multimodal Deep Networks. ISPRS J. Photogramm. Remote Sens. 2018, 140, 20–32.
2. Gu, J.; Su, T.; Wang, Q.; Du, X.; Guizani, M. Multiple Moving Targets Surveillance Based on a Cooperative Network for Multi-UAV. IEEE Commun. Mag. 2018, 56, 82–89.
3. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556.
4. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
5. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. arXiv 2015, arXiv:1512.02325.
6. Lin, T.Y.; Goyal, P.; Girshick, R.B.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007.
7. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
8. Lin, T.Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. 2014. Available online: https://cocodataset.org/ (accessed on 11 July 2024).
9. Everingham, M.; Zisserman, A.; Williams, C.K.I.; Gool, L.V.; Allan, M.; Bishop, C.M.; Chapelle, O.; Dalal, N.; Deselaers, T.; Dorko, G.; et al. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. 2007. Available online: http://host.robots.ox.ac.uk/pascal/VOC/voc2007/index.html (accessed on 11 July 2024).
10. Everingham, M.; Gool, L.V.; Williams, C.K.I.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. 2012. Available online: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html (accessed on 11 July 2024).
11. Liu, Y.; Sun, P.; Wergeles, N.; Shang, Y. A Survey and Performance Evaluation of Deep Learning Methods for Small Object Detection. Expert Syst. Appl. 2021, 172, 114602.
12. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
13. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
14. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. arXiv 2016, arXiv:1612.08242.
15. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
16. Jocher, G. YOLOv5: A State-of-the-Art Object Detection System. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 11 July 2024).
17. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976.
18. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475.
19. Viola, P.A.; Jones, M.J. Rapid Object Detection Using a Boosted Cascade of Simple Features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Kauai, HI, USA, 8–14 December 2001.
20. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297.
21. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
22. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
23. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019, arXiv:1904.07850.
24. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. Scaled-YOLOv4: Scaling Cross Stage Partial Network. arXiv 2020, arXiv:2011.08036.
25. Yang, Z.; Liu, S.; Hu, H.; Wang, L.; Lin, S. RepPoints: Point Set Representation for Object Detection. arXiv 2019, arXiv:1904.11490.
26. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Yeh, I.H. CSPNet: A New Backbone That Can Enhance Learning Capability of CNN. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020.
27. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. arXiv 2018, arXiv:1803.01534.
28. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLO, Version 8.0.0. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 11 July 2024).
29. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
30. Huang, Y.; Chen, J.; Huang, D. UFPMP-Det: Toward Accurate and Efficient Object Detection on Drone Imagery. arXiv 2021, arXiv:2112.10415.
31. Lu, W.; Lan, C.; Niu, C.; Liu, W.; Lyu, L.; Shi, Q.; Wang, S. A CNN-Transformer Hybrid Model Based on CSWin Transformer for UAV Image Object Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 1211–1231.
32. Wang, G.; Chen, Y.; An, P.; Hong, H.; Hu, J.; Huang, T. UAV-YOLOv8: A Small-Object-Detection Model Based on Improved YOLOv8 for UAV Aerial Photography Scenarios. Sensors 2023, 23, 7190.
33. Zhang, Y.; Wu, C.; Guo, W.; Zhang, T.; Li, W. CFANet: Efficient Detection of UAV Image Based on Cross-Layer Feature Aggregation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5608911.
34. Li, Y.; Fan, Q.; Huang, H.; Han, Z.; Gu, Q. A Modified YOLOv8 Detection Network for UAV Aerial Image Recognition. Drones 2023, 7, 304.
35. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787.
36. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More Features From Cheap Operations. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1577–1586.
37. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-Captured Scenarios. arXiv 2021, arXiv:2108.11539.
38. Zhang, Z. Drone-YOLO: An Efficient Neural Network Method for Target Detection in Drone Images. Drones 2023, 7, 526.
39. Terven, J.; Córdova-Esparza, D.M.; Romero-González, J.A. A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716.
40. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and Tracking Meet Drones Challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7380–7399.
| Method | Val P% | Val R% | Val mAP@0.5% | Test P% | Test R% | Test mAP@0.5% |
|---|---|---|---|---|---|---|
| YOLOv5-s | 50.8 | 37.1 | 38.6 | 43.8 | 33.4 | 31.7 |
| YOLOv5-s+IPFA | 53.0 (↑ 2.2) | 40.6 (↑ 3.5) | 41.9 (↑ 3.3) | 47.5 (↑ 3.7) | 36.5 (↑ 3.1) | 35.1 (↑ 3.4) |
| YOLOv8-s | 50.2 | 39.7 | 40.4 | 46.6 | 34.7 | 33.0 |
| YOLOv8-s+IPFA | 53.0 (↑ 2.8) | 40.8 (↑ 1.1) | 42.1 (↑ 1.8) | 47.9 (↑ 1.3) | 36.1 (↑ 1.4) | 34.7 (↑ 1.7) |
| Method | adl | Val P% | Val R% | Val mAP@0.5% | Test P% | Test R% | Test mAP@0.5% | Parameters (M) | GFLOPs |
|---|---|---|---|---|---|---|---|---|---|
| PANet | ✓ | 54.2 | 43.3 | 44.7 | 48.5 | 37.7 | 36.4 | 11.48 | 37.4 |
| PANet+CSFM | ✓ | 54.1 | 43.8 | 45.2 (↑ 0.5) | 47.7 | 38.4 | 36.7 (↑ 0.3) | 12.36 | 40.7 |
| BiFPN | ✓ | 57.2 | 45.2 | 47.6 | 49.4 | 39.2 | 38.2 | 8.96 | 54.9 |
| BiFPN+CSFM | ✓ | 56.8 | 46.5 | 48.3 (↑ 0.7) | 49.2 | 39.7 | 38.5 (↑ 0.3) | 9.12 | 58.6 |
| FGAFPN-CSFM | - | 54.5 | 43.1 | 44.8 | 47.5 | 37.7 | 36.1 | 11.56 | 37.1 |
| FGAFPN | - | 54.2 | 44.0 | 45.8 (↑ 1.0) | 47.8 | 38.1 | 36.4 (↑ 0.3) | 12.44 | 41.1 |
| IPFA placement (Backbone S1–S4, Neck S1–S3) | Val P% | Val R% | Val mAP@0.5% | Test P% | Test R% | Test mAP@0.5% | Parameters (M) | GFLOPs | Time (ms) |
|---|---|---|---|---|---|---|---|---|---|
| ✓ ✓ | 55.4 | 44.4 | 46.1 | 49.2 | 38.8 | 37.6 | 11.7 | 43.6 | 5.1 |
| ✓ ✓ | 55.5 | 44.3 | 46.2 | 49.0 | 39.1 | 37.7 | 11.75 | 43.6 | 5.1 |
| ✓ ✓ | 55.3 | 43.3 | 45.6 | 48.9 | 37.8 | 37.0 | 11.95 | 43.6 | 5.1 |
| ✓ ✓ | 55.8 | 42.9 | 45.5 | 48.0 | 38.7 | 37.3 | 12.0 | 43.6 | 5.1 |
| ✓ ✓ ✓ | 55.2 | 44.8 | 46.4 | 49.6 | 38.9 | 37.8 | 11.77 | 43.8 | 5.1 |
| ✓ ✓ ✓ | 53.9 | 45.3 | 46.4 | 50.7 | 38.9 | 38.4 | 11.77 | 47.2 | 5.3 |
| ✓ ✓ ✓ | 55.2 | 44.7 | 46.4 | 49.1 | 38.7 | 37.8 | 12.01 | 47.2 | 5.2 |
| ✓ ✓ ✓ ✓ | 56.5 | 45.3 | 47.3 | 50.2 | 39.1 | 38.6 | 11.83 | 47.5 | 5.2 |
| ✓ ✓ ✓ ✓ | 56.3 | 44.0 | 46.0 | 49.4 | 38.9 | 37.6 | 12.08 | 47.5 | 5.3 |
| ✓ ✓ ✓ ✓ | 56.2 | 45.1 | 47.3 | 50.6 | 39.2 | 39.0 | 12.03 | 50.9 | 5.5 |
| Method | Val P% | Val R% | Val mAP@0.5% | Test P% | Test R% | Test mAP@0.5% | Parameters (M) | GFLOPs |
|---|---|---|---|---|---|---|---|---|
| Faster R-CNN | - | - | 19.6 | - | - | - | 41.2 | 118.8 |
| Cascade R-CNN | - | - | 18.9 | - | - | - | 69.0 | 146.6 |
| Li et al. [34] | - | - | 42.2 | - | - | - | 9.66 | - |
| Drone-YOLO-s | - | - | 44.3 | - | - | 35.6 | 10.9 | - |
| TPH-YOLOv5 | - | - | 46.4 | - | - | - | 60.43 | 145.7 |
| UAV-YOLOv8 | 54.4 | 45.6 | 47.0 | - | - | - | 10.3 | - |
| YOLOv3 | 57.8 | 43.3 | 45.7 | 50.1 | 38.8 | 37.3 | 103.67 | 282.5 |
| YOLOv5-s | 50.8 | 37.1 | 38.6 | 43.8 | 33.4 | 31.7 | 9.12 | 24.1 |
| YOLOv5-m | 53.1 | 41.5 | 42.6 | 47.4 | 36.0 | 34.7 | 25.05 | 64.0 |
| YOLOv5-l | 55.7 | 43.2 | 44.7 | 50.0 | 38.0 | 36.9 | 53.14 | 134.9 |
| YOLOv5-x | 56.6 | 44.8 | 46.3 | 50.8 | 38.9 | 38.4 | 97.16 | 246.0 |
| YOLOv8-s | 50.2 | 39.7 | 40.4 | 46.6 | 34.7 | 33.0 | 11.14 | 28.7 |
| YOLOv8-m | 54.0 | 42.7 | 43.6 | 49.9 | 36.6 | 36.0 | 25.86 | 79.1 |
| YOLOv8-l | 57.1 | 43.4 | 45.4 | 49.8 | 38.2 | 36.9 | 43.61 | 165.2 |
| YOLOv8-x | 58.1 | 45.0 | 46.6 | 50.4 | 38.8 | 38.2 | 68.13 | 257.4 |
| IF-YOLO | 56.5 | 45.3 | 47.3 | 50.2 | 39.1 | 38.6 | 11.83 | 47.5 |
| Method | P% | R% | mAP@0.5% | Parameters (M) | GFLOPs |
|---|---|---|---|---|---|
| YOLOv8-s | 50.2 | 39.7 | 40.4 | 11.14 | 28.7 |
| YOLOv8-s+adl | 54.2 (↑ 4.0) | 43.3 (↑ 3.6) | 44.7 (↑ 4.3) | 11.48 | 37.4 |
| YOLOv8-s+FGAFPN | 54.2 (↑ 4.0) | 44.0 (↑ 4.3) | 45.8 (↑ 5.4) | 12.44 | 41.1 |
| YOLOv8-s+adl+IPFA | 56.5 (↑ 6.3) | 44.5 (↑ 4.8) | 46.6 (↑ 6.2) | 10.87 | 43.1 |
| YOLOv8-s+FGAFPN+IPFA | 56.5 (↑ 6.3) | 45.3 (↑ 5.6) | 47.3 (↑ 6.9) | 11.83 | 47.5 |