1. Introduction
The evolution of Uncrewed Aerial Vehicle (UAV) technology has facilitated the acquisition of high-resolution remote sensing images. Object detection in these images holds promise for applications including urban planning, land monitoring, precision agriculture, updates to geographic information systems, and military operations [1]. Nevertheless, complex backgrounds and the variability in object size and orientation pose substantial challenges to detection [2].
The advent of Convolutional Neural Networks (CNNs) has significantly advanced the field of object detection. Current mainstream detection methodologies can be classified into two categories: one-stage and two-stage detection methods. The one-stage YOLO (You Only Look Once)-series algorithms prioritize speed, directly regressing and classifying object boxes from feature maps at various scales to enhance inference speed [3,4]. On the other hand, the two-stage RCNN-series algorithms excel in accuracy, utilizing the Region Proposal Network (RPN) to generate candidate boxes, which are then classified and regressed [5]. Inspired by these approaches, numerous researchers have integrated deep learning techniques into UAV remote sensing image object detection. To address the challenge of small-scale object detection, several novel methods have been proposed. Zeng et al. [6] proposed the SCA-YOLO (Spatial and Coordinate Attention enhancement YOLO) algorithm, which designs a hybrid attention module with associated coordinate attention to enhance the feature extraction of small objects. Furthermore, improvements were made to the SEB (Simple and Efficient Bottleneck) to further distinguish the foreground and background characteristics. Zhou et al. [
7] developed the ADCSPDarknet53 backbone network based on YOLOV4, modifying the regression loss function to enhance the model’s ability to locate small objects. Zhuang et al. [
8] proposed a one-stage detection model with multi-scale feature fusion, integrating different levels of feature maps for improved detection accuracy with small objects. Lan et al. [
9] introduced a novel method for detecting small objects in optical remote sensing images. The proposed method constructs a spatial transformer that incorporates spatial attention and self-attention mechanisms. This facilitates the extraction of key features from relevant areas within the image space and mitigates the problem of missed small objects. Liu et al. [10] proposed the YOLO-extract algorithm, which is based on YOLOv5. The algorithm integrates coordinate attention into the network by adopting the concept of residual networks. Moreover, by combining hybrid dilated convolution with the redesigned residual structure, the algorithm strengthens the model’s feature extraction for objects at various scales. Despite these advancements, CNN-based methods appear to have reached a performance plateau in detection tasks [11].
In recent years, numerous researchers have explored alternative representations of images. Treating an image as a collection of patches has opened new avenues for research in object detection. Dosovitskiy et al. [12] proposed that transformers, with their capability for global-context information modeling and spatially adaptive aggregation, surpass the constraints of CNNs; transformers are now widely employed in computer vision tasks [13,14,15]. The ViTDet [
13] model utilizes the native ViT model as its backbone network for object detection. Notably, the ViTDet model is divided into four stages. The first few blocks of each stage use the window self-attention to improve computational efficiency, and the last block uses global self-attention to facilitate information exchange between different windows. This design renders the ViTDet model suitable for object detection tasks. Following this approach, Zhu et al. [
16] proposed TPH-YOLOv5, a network that replaces the original prediction head with Transformer Prediction Heads (TPH). Simultaneously, a prediction head is added to detect objects at varying scales, thereby augmenting the model’s object recognition capabilities. Zhang et al. [
17] introduced the Transformers for Remote Sensing (TRS) model, incorporating self-attention into the ResNet network to improve the model’s capacity to learn the overall features of the image and attain superior detection accuracy. Jiang et al. [
18] presented the RAST-YOLO (You Only Look Once with Region Attention and Swin Transformer) algorithm. The algorithm uses the Region Attention (RA) mechanism combined with a Swin Transformer as the backbone to extract features to enhance the detection accuracy of objects in complex backgrounds. Subsequently, the C3D module is employed to fuse deep and shallow semantic information, thereby enhancing the detection accuracy of small targets. Wang et al. [
19] introduced the MashFormer model, which combines a CNN with a transformer to enhance its representational ability in complex-background scenarios. Additionally, a multi-level feature aggregation module with cross-level feature alignment is designed to mitigate the semantic discrepancy between features extracted from shallow and deep layers.
The aforementioned methods enhance the model’s capability to extract object features through the use of transformers, thereby improving detection performance. However, advances in UAV technology have made high-resolution UAV remote sensing images easy to obtain. In such images, small objects occupy few pixels and are surrounded by complex background information. The network has difficulty extracting effective features from small objects, which impedes the model’s ability to accurately locate and recognize them [20]. Furthermore, multiscale prediction using an FPN (Feature Pyramid Network) [21] suffers from missing feature information. These challenges leave room for improving detection performance on UAV remote sensing images.
To address the aforementioned issues, this paper employs the transformer structure to enhance the model’s feature-encoding capability and mitigates the loss of semantic information by integrating diverse features. Specifically, this paper presents a Multi-Branch Parallel Network (MBPN) based on the ViTDet model. First, the object features from the different feature maps input to the FPN are enhanced by the Receptive Field Enhancement (RFE) and Convolutional Self-Attention (CSA) modules. The RFE module is well-suited for shallow feature maps, extracting features of varying sizes through convolutions with diverse kernel sizes. The resulting feature maps are concatenated along the channel dimension, while the original feature maps undergo convolutional and Softmax operations to generate attention maps. The CSA module is well-suited for deep feature maps, as it maps the feature maps to three distinct linear spaces (Q, K, V) through different convolution operations. The similarity between Q and K is computed to derive the attention map, which is then used to weight V and yield the final result. Next, the Multi-Branch Upsampling (MBUS) and Multi-Branch Downsampling (MBDS) modules produce diverse feature maps, which are concatenated along the channel dimension and compressed through a 1 × 1 convolution. Finally, the Feature-Concatenating Fusion (FCF) module is employed to merge the feature maps. This process involves sampling small-scale feature maps and concatenating them with large-scale feature maps, which are subsequently compressed through convolutional operations to yield the fused feature maps.
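The Q/K/V computation described for the CSA module can be illustrated with a minimal sketch. It assumes that the three projections are 1 × 1 convolutions (that is, per-pixel linear maps), that similarity is a scaled dot product, and that the attention map is normalized with a row-wise Softmax; the paper's actual kernels, scaling, and normalization are not given in this section. The names `linear`, `conv_self_attention`, and the toy weights are hypothetical and chosen only for illustration.

```python
import math

def linear(x, w):
    """Apply a 1x1 convolution to flattened pixels: x is [N][C_in], w is [C_in][C_out]."""
    return [[sum(px[c] * w[c][o] for c in range(len(w))) for o in range(len(w[0]))]
            for px in x]

def softmax(row):
    """Numerically stable Softmax over one row of similarity scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def conv_self_attention(x, wq, wk, wv):
    """Sketch of self-attention over a flattened feature map (assumed formulation)."""
    q, k, v = linear(x, wq), linear(x, wk), linear(x, wv)
    scale = 1.0 / math.sqrt(len(q[0]))  # assumed scaling, as in standard attention
    # Attention map: similarity of every query pixel with every key pixel.
    attn = [softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]) for qi in q]
    # Weighted sum of the value vectors gives the output feature for each pixel.
    out = [[sum(attn[i][j] * v[j][c] for j in range(len(v))) for c in range(len(v[0]))]
           for i in range(len(x))]
    return attn, out

# Toy input: a 2 x 2 feature map with 2 channels, flattened to 4 pixels.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical projection weights
attn, out = conv_self_attention(x, identity, identity, identity)
```

Each row of `attn` sums to one, so every output pixel is a convex combination of the value vectors; this is also why such a module can raise the weight of background pixels as readily as that of object pixels when the two look alike.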
In summary, this paper contributes in the following ways:
- (1) The introduction of the RFE and CSA modules into FPN enhances shallow and deep features, respectively, aiming to highlight the foreground and suppress noise interference.
- (2) The MBUS and MBDS modules acquire diverse features through multiple paths, reducing the loss of feature information during the sampling process of the FPN.
- (3) The FCF module fuses small-scale and large-scale feature maps, enriches feature information representation, and augments the semantic information of feature maps.
2. Related Work
The FPN employs a top-down structure with lateral connections to construct high-level semantic feature maps across multiple scales, thereby enhancing the flexibility of multi-scale representation and enjoying wide application in various detectors. However, it still presents certain limitations. For instance, there are semantic differences between layers, and direct fusion may diminish the power of multi-scale representation. Furthermore, feature information might be lost during the FPN network’s sampling process. In this section, two key aspects will be explored: enhancing the ability of multi-scale representation and minimizing feature information loss.
Enhancing the Ability of Multi-scale Representation. To address the FPN’s difficulty in adapting to changes in object scale, Tang et al. [
22] introduced the Scale-Aware Feature Pyramid Network (SARFNet). This approach employs 3-D convolution to establish a scale equilibrium pyramid convolution, enhancing the correlation between different feature levels and allowing flexible matching with objects exhibiting varying appearance changes. Additionally, Zhao et al. [
23] proposed a multi-scale feature fusion module, named BFPCAR, which mitigates the imbalance of attention in non-adjacent layers of the FPN network. Dong et al. [
24] innovatively replaced the lateral connection of the FPN with a deformable convolution lateral connection module to facilitate multi-scale object detection. Furthermore, Sun et al. [
25] developed a Multi-Scale Feature Pyramid Network (MS-FPN) that strengthens shallow and deep features through the Atrous Convolution Pyramid (ACP) module, while adaptively learning and selecting crucial feature maps using multi-scale attention modules.
Minimizing Feature Information Loss. To mitigate the loss of information during feature sampling and fusion, Chen et al. [
26] introduced a Parallel Residual Bi-Fusion Feature Pyramid Network (PRB-FPN), designed to gather more comprehensive contextual information through bidirectional fusion. Furthermore, Guo et al. [27] applied the Residual Feature Augmentation module to counteract the loss of semantic information resulting from channel reduction. This issue is addressed in this study through channel concatenation. Content-Aware ReAssembly of FEatures (CARAFE) [
28] generates multiple features in each feature map through various groups of content perception methods. Feature upsampling is then achieved by rearranging the generated features into a spatial block. This paper obtains multiple features through a variety of methods, concatenates the resulting features, and finally derives the final feature map using convolution operations. Zheng et al. [
29] proposed the Gating Path Aggregation (GPA) network, asserting that different feature layers have varying degrees of importance. This network enhances the capability of information filtering during feature fusion.
4. Experiment
This section provides an overview of the datasets used in this study, the applied evaluation metrics, and experimental details. Subsequently, ablation experiments were performed on each module developed in this work to determine the contribution of each module to the performance enhancement. Finally, to validate the detection performance of the proposed model, this paper compares it with multiple methods on WCH and NWPU VHR10 datasets.
4.1. Datasets and Evaluation Metrics
In this paper, experiments are conducted on our own dataset WCH and the publicly available dataset NWPU VHR10 [34,35,36], the details of which are as follows:
WCH. This dataset’s images are derived from aerial drone photography of Caidian District, Wuhan, suitable for UAV remote sensing image object detection. Due to the high resolution of the captured images, this paper cropped them to generate 1344 new 640 × 640 resolution images. After annotating the cropped images, a total of 32,349 instances covering one category are obtained, with each image containing multiple instances of arbitrary size and orientation. This paper randomly divided this dataset in an 8:2 ratio, resulting in 1075 images for training and 269 images for validation.
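The 8:2 division of the WCH images can be reproduced arithmetically with a short sketch. It assumes a plain random shuffle with a fixed seed; the seed, the function name `split_dataset`, and the use of integer image IDs are illustrative assumptions, since the paper states only the ratio and the resulting counts.

```python
import random

def split_dataset(image_ids, train_ratio=0.8, seed=0):
    """Randomly split image IDs into training and validation subsets."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)          # seed is an assumption, for reproducibility
    n_train = int(len(ids) * train_ratio)     # floor(1344 * 0.8) = 1075
    return ids[:n_train], ids[n_train:]

train_ids, val_ids = split_dataset(range(1344))
print(len(train_ids), len(val_ids))  # prints "1075 269"
```

Flooring the training count gives the 1075/269 division reported above. A per-category split, as described for NWPU VHR10, can yield slightly different totals than a single global split.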
NWPU VHR10. This dataset’s images are sourced from Google Earth and the Vaihingen dataset, which consists of aerial drone photography from Vaihingen, Germany. The latter is a subset of the test data used by the German Association of Photogrammetry and Remote Sensing (DGPF) for digital aerial cameras. The NWPU VHR-10 dataset, annotated using Horizontal Bounding Boxes (HBB), is publicly accessible and suitable for object detection in UAV remote sensing images. This paper omits unlabeled images from the NWPU VHR-10 dataset, retaining 650 images and 3896 instances across ten categories. The images range from 400 × 500 to 1100 × 1800 in size. The dataset, characterized by variable object sizes and orientations, presents significant challenges. Given that the NWPU VHR10 dataset does not segregate training and validation sets, this paper calculated the image count for each category, dividing the images in an 8:2 ratio, resulting in 521 training images and 129 validation images.
Moreover, this paper used the AP, AP_50, AP_75, AP_S, AP_M, and AP_L metrics to evaluate the detection performance of the model. AP represents the average precision across 10 intersection over union (IoU) thresholds ranging from 0.5 to 0.95, with intervals of 0.05. AP_50 denotes the average precision at an IoU threshold of 0.5. AP_75 represents the average precision at an IoU threshold of 0.75. AP_S indicates the average precision for small objects. AP_M signifies the average precision for medium objects. AP_L represents the average precision for large objects. Among these metrics, AP corresponds to the area under the precision–recall (P-R) curve, where P stands for precision and R stands for recall, as defined by the following formulas:
\[ P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad AP = \int_{0}^{1} P(R)\, dR \]
where TP, FP, and FN represent the number of true positives, false positives, and false negatives, respectively; P(R) is the precision–recall curve.
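The metric definitions above can be made concrete with a small sketch. It assumes horizontal boxes in `(x1, y1, x2, y2)` form and approximates the area under the P-R curve with an all-point sum over a monotone precision envelope; COCO-style evaluators use a 101-point interpolation instead, so the numbers here are illustrative rather than a reproduction of the paper's evaluation code. All function names are hypothetical.

```python
def iou(box_a, box_b):
    """IoU of two horizontal boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def average_precision(recalls, precisions):
    """Approximate the area under the P-R curve (assumed all-point scheme)."""
    pts = sorted(zip(recalls, precisions))
    rs = [0.0] + [r for r, _ in pts]
    ps = [p for _, p in pts]
    # Precision envelope: make precision non-increasing as recall grows.
    for i in range(len(ps) - 2, -1, -1):
        ps[i] = max(ps[i], ps[i + 1])
    return sum((rs[i + 1] - rs[i]) * ps[i] for i in range(len(ps)))

# The 10 IoU thresholds over which AP is averaged: 0.50, 0.55, ..., 0.95.
iou_thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))              # intersection 1, union 7
p, r = precision_recall(tp=8, fp=2, fn=2)              # both equal 0.8
ap = average_precision([0.5, 1.0], [1.0, 0.5])         # 0.5 * 1.0 + 0.5 * 0.5 = 0.75
```

A detection counts as a true positive at a given threshold only if its IoU with a ground-truth box reaches that threshold, which is why AP_75 is a stricter localization measure than AP_50.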
4.2. Implementation Details
This study’s experiments employ ViTDet, implemented on the MMDetection framework, as the baseline model. The proposed model was initialized with weights pretrained on the ImageNet [37] dataset, and the remaining model parameters were initialized randomly. During training, the input image size was adjusted to 704 × 704 as part of data preprocessing, followed by random image cropping and flipping. The batch size was set to 2, the initial learning rate was 0.0001, and a linear warm-up strategy was used for the first 500 iterations. The model was trained for 30 epochs, with the learning rate decayed by a factor of 10 at the 15th and 25th epochs. The AdamW optimizer was used with beta coefficients of (0.9, 0.999) and a weight decay of 0.1. All experiments were executed on an Ubuntu 20.0 system, with training accelerated by an NVIDIA GeForce RTX 4080 graphics card.
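The learning-rate schedule described above can be written out as a small function. This is a sketch under stated assumptions: the warm-up start factor of 0.001, the 0-based epoch indexing for the milestones, and the 538 iterations per epoch (1075 WCH training images at batch size 2, rounded up) are not given in the paper and are used here only for illustration.

```python
def learning_rate(iteration, iters_per_epoch, base_lr=1e-4, warmup_iters=500,
                  warmup_start_factor=0.001, milestones=(15, 25), gamma=0.1):
    """Step-decay schedule with linear warm-up (assumed parameterization)."""
    epoch = iteration // iters_per_epoch
    # Decay by a factor of 10 once for every milestone epoch already reached.
    lr = base_lr * gamma ** sum(epoch >= m for m in milestones)
    if iteration < warmup_iters:
        # Linearly ramp from warmup_start_factor * lr up to the full lr.
        alpha = iteration / warmup_iters
        lr *= warmup_start_factor + (1.0 - warmup_start_factor) * alpha
    return lr

ITERS_PER_EPOCH = 538  # assumption: ceil(1075 / 2)
lr_start = learning_rate(0, ITERS_PER_EPOCH)                       # about 1e-7
lr_after_warmup = learning_rate(500, ITERS_PER_EPOCH)              # 1e-4
lr_epoch_15 = learning_rate(15 * ITERS_PER_EPOCH, ITERS_PER_EPOCH) # about 1e-5
lr_epoch_25 = learning_rate(25 * ITERS_PER_EPOCH, ITERS_PER_EPOCH) # about 1e-6
```

With these assumptions the warm-up finishes within the first epoch, so almost all of the 30 epochs run at one of the three plateau rates.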
4.3. Ablation Experiments
Ablation experiments were conducted on the WCH dataset to assess the effectiveness of the proposed modules in this paper. To ensure a fair comparison, the hyperparameters for all ablation experiments were set according to the specifications outlined in Section 4.2. Subsequently, the RFE, CSA, MBUS, MBDS, and FCF modules were individually added to the baseline model (ViTDet-B) for experimentation. The results of the ablation experiments are presented in Table 1. In Table 1, “√” indicates that the module was added, while “×” indicates its absence. The first row displays the results of experiments conducted on the baseline model. In this table, the red font signifies a decrease in the indicator, while the green font signifies an increase in the indicator.
Ablation on RFE module. The RFE module is applied to the shallow feature layers L1 and L2. Convolution operations with varying kernel sizes are performed to obtain feature maps with multi-scale information, enabling adaptation to multi-scale objects in UAV remote sensing images. The results in the second row of Table 1 demonstrate that the utilization of the RFE module leads to an increase of 0.1% and 0.4% in AP_50 and AP_L, respectively. This indicates that the RFE module significantly enhances the detection accuracy of large objects. However, the AP_S and AP_M of this module still fall behind those of the baseline model. This could be attributed to the disturbance caused to the features of small and medium objects in the shallow layer when the deep features are fused additively with the shallow features.
Ablation on CSA module. The CSA module is applied to the deep feature layers L3 and L4 to enhance feature expression through self-attention calculation and spatial position weighting of the feature map. The utilization of the CSA module results in a 0.2% increase in AP_50, as observed in the third row of Table 1. This indicates that the model achieves improved accuracy in classifying and locating certain objects. Nevertheless, other indicators remained below the baseline level. This could be attributed to CSA enhancing both object features and noise features, particularly in complex scenes where the background occupies a significant portion of the area in UAV remote sensing images. Subsequently, a top-down fusion path is employed, which extends the distribution range of noise features, resulting in an unsatisfactory detection effect of the model.
Ablation on MBUS module. The purpose of this module is to upsample a smaller-sized feature map into a larger-sized feature map. Specifically, multiple feature-extraction branches are employed to acquire diverse feature information from high-level feature maps, which are subsequently utilized to construct high-level feature maps. The fourth row of Table 1 reveals that the utilization of the MBUS module leads to a 0.3% and 1.3% increase in AP_50 and AP_L, respectively. These improvements can be attributed to the combination of diverse abstract features and positioning information. However, there was a slight decrease in AP_S and AP_M. This is because the MBUS module introduces additional background noise to the shallow feature layer, thereby impeding the model’s localization ability and degrading its performance.
Ablation on MBDS module. The purpose of this module is to replace the original pooling operation and mitigate the loss of semantic information during pooling. The findings from the fifth row of Table 1 indicate that incorporating additional semantic information into the construction of the P5 feature map does not enhance the model’s performance. This could be attributed to the introduction of significant background noise into the P5 feature map by MBDS, thereby resulting in an unsatisfactory detection effect of the model. In contrast, the baseline model employs the pooling operation to generate the P5 feature map. While this approach results in the loss of certain semantic information, it also discards some noise information, mitigating the impact of noise on the model.
Ablation on FCF module. This module concatenates two feature maps along the channel dimension and utilizes convolutional operations to facilitate the interaction between spatial and channel information. The results from the sixth row of Table 1 demonstrate that the inclusion of the FCF module in the baseline model leads to an improvement of 0.2%, 0.1%, and 0.8% in AP_50, AP_75, and AP_L, respectively. The effectiveness of the FCF module is confirmed. However, there was a slight decrease in AP_S and AP_M. This could be attributed to the FCF module covering the features of small and medium objects during the information exchange process, while the features of large objects are retained due to their larger spatial coverage.
4.4. Comparison Experiments
Comparison experiments were conducted on the WCH and NWPU VHR10 datasets to assess the performance of the proposed object detection model. This paper compares the proposed model with various object detection models, including the one-stage mainstream models YOLOv7 and YOLOv8, the two-stage classical models Faster RCNN and Cascade RCNN, and the representative transformer-series models Swin Transformer and ViTDet (the baseline model in this paper). The hyperparameters of the proposed model align with those described in Section 4.2, while the comparison models are implemented based on MMDetection and MMYOLO, respectively.
4.4.1. Comparison on the WCH Dataset
Table 2 presents the experimental outcomes obtained from the proposed model and the comparison models when applied to the WCH dataset. The proposed model demonstrates an increase of 0.1% and 2.4% in AP and AP_L, respectively, compared to the baseline model (ViTDet-B). These results indicate that the model proposed in this study enhances object positioning ability and exhibits improved perception of large objects. The ablation experiment results in Table 1 further confirm these observations, attributing them to the utilization of the RFE, MBUS, and FCF modules. The RFE module enhances object features, the MBUS module enables the acquisition of diverse features, and the FCF module effectively fuses these diverse features. Notably, the proposed model achieves an AP_50 increase of 1.4% and 1.5% when compared to Faster RCNN and Cascade RCNN, respectively. This improvement can be attributed to the powerful encoding capabilities of the transformer. While the proposed model outperforms the one-stage object detection model YOLOv7, a significant gap remains between the proposed model and YOLOv8. Additionally, the proposed model surpasses Swin Transformer in terms of detection performance, specifically by improving the AP_L by 5%. This is potentially due to the fact that Swin Transformer employs local self-attention to reduce computational overhead and uses sliding windows for information propagation between different windows, whereas the proposed model employs global attention to propagate information, thereby overcoming the limitations on spatial information propagation.
Figure 7 presents the visualization of the detection results achieved by the proposed model and the comparison models on the WCH dataset. The first column presents scenes characterized by a sparse background and a dense distribution of objects. The second column showcases scenes with a more prominent background. The third column displays scenes with objects of varying colors, and the fourth column depicts scenes where the background and objects exhibit similarities. Each row corresponds to the detection results of a specific model. A red circle indicates a missed object, whereas yellow circles indicate objects that the proposed model successfully detects but other models fail to identify. Additionally, the prediction results were obtained with a confidence level set to 0.8.
Figure 7 reveals that all models encounter the issue of missed detections. Notably, the YOLOv8 model outperforms other models in quantitative analysis, yet its visual detection results are unsatisfactory. This observation can be attributed to the low confidence level of YOLOv8 predictions, which is understandable considering its emphasis on detection speed. Moreover, the proposed model demonstrates stronger competitiveness compared to other models, particularly in scenarios involving occluded objects. This finding highlights the ability of the proposed model to enhance feature perception.
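The gap between quantitative ranking and visual results can be illustrated with a short sketch of confidence filtering. AP is computed from the ranking of all predictions by score, whereas the visualizations keep only predictions with a score of at least 0.8. The detection records and scores below are invented for illustration and are not taken from the paper's experiments; `filter_by_confidence` is a hypothetical helper.

```python
def filter_by_confidence(detections, threshold=0.8):
    """Keep only detections whose confidence score reaches the threshold."""
    return [d for d in detections if d["score"] >= threshold]

# Hypothetical predictions for the same three objects from two models.
# Both models localize every object, but model B assigns lower scores.
model_a = [{"object": 1, "score": 0.95}, {"object": 2, "score": 0.90}, {"object": 3, "score": 0.85}]
model_b = [{"object": 1, "score": 0.85}, {"object": 2, "score": 0.70}, {"object": 3, "score": 0.60}]

shown_a = filter_by_confidence(model_a)  # all three objects remain visible
shown_b = filter_by_confidence(model_b)  # only one object remains visible
```

In this toy case both models would receive the same ranking-based score, yet model B appears to miss two objects once the 0.8 threshold is applied, which is consistent with the explanation offered above for YOLOv8.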
4.4.2. Comparison on the NWPU VHR10 Dataset
The experimental results of the model proposed in this paper and the comparison models on the NWPU VHR10 dataset are presented in Table 3. The model in this paper demonstrates an improvement over the baseline model ViTDet-B, with an increase of 1.8% and 0.7% in AP_75 and AP_L, respectively, indicating enhanced detection performance for large objects. This finding aligns with the experimental results in Table 1 and Table 2. However, there has been a decline in other indicators, particularly a 1.0% decrease in AP_S. The ablation experiment reveals that this decline can be attributed to the limited perception ability of the improved modules in this paper when detecting small objects in UAV remote sensing images. In comparison to YOLOv7, YOLOv8, Faster RCNN, and Cascade RCNN, the model proposed in this paper exhibits slight advantages in AP_50, AP_75, and AP_L. In contrast to Swin Transformer, the model in this paper achieves a slightly lower AP_50 score, which could be attributed to the advantage of Swin Transformer’s local self-attention mechanism.
Figure 8 presents the visualized detection results of the model proposed in this paper and the comparison models on the NWPU VHR10 dataset. The figure consists of four columns: the first column depicts scenes with a large object distribution, the second column portrays scenes with complex backgrounds and dense objects, the third column illustrates scenarios with small object distribution, and the fourth column represents scenes with large object distribution and redundant background information. Each row corresponds to the detection results of a specific model. A red circle denotes a missed object, while a yellow rectangle marks an object that the model proposed in this paper detects but most other models do not. Furthermore, the prediction results are evaluated with a confidence level set to 0.8.
Figure 8 reveals that the models from the transformer series exhibit a lower rate of missed objects. The third column of Figure 8 demonstrates that, in comparison to the baseline model ViTDet-B, the proposed model performs better in detecting objects in shadowed scenes, but its performance is suboptimal in scenes where the object closely resembles the background. This discrepancy may arise because the RFE and CSA modules are better at distinguishing objects from dissimilar backgrounds than at separating objects from backgrounds that closely resemble them.
5. Discussion
Ablation experiments on the WCH dataset are conducted to examine the influence of the modules proposed in this paper on the detection performance of the model. The results of the ablation experiments indicate that each designed module improves AP_50 but leads to declines in other indicators. This could be attributed to the deliberate design of each module to focus on improving specific indicators rather than multiple indicators simultaneously. The comparative experiments in Section 4.4 demonstrate that the model presented in this paper outperforms the comparison models in detecting large objects. Furthermore, the proposed model exhibits impressive detection performance in scenes featuring occluded objects (second column in Figure 7) and shadowed scenes (third column in Figure 8). This can be attributed to the RFE module’s successful expansion of the object’s receptive field in the shallow feature map and the CSA module’s enhancement of the weight of the object feature.
Nevertheless, the model presented in this paper is suboptimal for detecting small and medium objects. This could be due to the presence of noise information in the L1, L2, L3, and L4 feature layers generated by the backbone network. While the RFE module is capable of filtering out certain noise from the shallow features through convolution, the CSA module inadvertently amplifies the feature values of the noise when assigning weights to the object features in the deep feature map. Consequently, during the top-down fusion process, a portion of the noise from the deep feature maps is reintroduced into the shallow feature maps, thereby impacting the model’s detection performance.
Based on the above observations, in the task of detecting objects in UAV remote sensing images, it is imperative to progressively reduce the noise information within the deep feature map as the network becomes deeper, thereby enhancing the model’s detection performance.
6. Conclusions
The presence of complex background information and densely distributed objects in UAV remote sensing images can adversely affect the model’s detection performance. To address this issue, the present paper introduces the MBPN model, which improves the FPN in three ways. First, the RFE and CSA modules enhance the feature representation of foreground objects. Second, the MBUS and MBDS modules mitigate the loss of semantic information during FPN sampling. Lastly, the FCF module alleviates the problem of semantic information misalignment during the feature fusion process.
Ablation experiments validate the efficacy of the proposed module in this study. Furthermore, comparative experiments conducted on the WCH and NWPU VHR10 datasets demonstrate the high competitiveness of the proposed method. Nevertheless, the model presented in this paper still exhibits certain limitations. For instance, in the ablation experiment, the enhanced module displays improvements in some evaluation indicators while experiencing decreases in others. Additionally, in the comparative experiment, the model demonstrates suboptimal detection performance for small and medium objects.
Subsequent research will involve conducting more comprehensive investigations aimed at enhancing the model’s detection accuracy for small objects.