DroneNet: Rescue Drone-View Object Detection

Wang, Xiandong; Yao, Fengqin; Li, Ankun; Xu, Zhiwei; Ding, Laihui; Yang, Xiaogang; Zhong, Guoqiang; Wang, Shengke

doi:10.3390/drones7070441

by Xiandong Wang ¹, Fengqin Yao ¹, Ankun Li ², Zhiwei Xu ³, Laihui Ding ³, Xiaogang Yang ³, Guoqiang Zhong ¹ and Shengke Wang ¹,*

¹ Faculty of Information Science and Engineering, Ocean University of China, Qingdao 266100, China
² Shandong Artificial Intelligence Institute, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250014, China
³ Shandong Willand Intelligent Technology Co., Ltd., Qingdao 266100, China
* Author to whom correspondence should be addressed.

Drones 2023, 7(7), 441; https://doi.org/10.3390/drones7070441

Submission received: 30 May 2023 / Revised: 24 June 2023 / Accepted: 26 June 2023 / Published: 3 July 2023

(This article belongs to the Special Issue Advances in Imaging and Sensing for Drones)

Abstract

Recent research on drone-view object detection (DOD) has predominantly centered on efficiently identifying objects by cropping high-resolution images. However, it has overlooked the distinctive challenges posed by scale imbalance and the higher prevalence of small objects in drone images. In this paper, to address these challenges of DOD, we introduce a specialized detector called DroneNet. First, we propose a feature information enhancement module (FIEM) that effectively preserves object information and can be integrated into the backbone network as a plug-and-play module. Then, we propose a split-concat feature pyramid network (SCFPN) that not only fuses feature information from different scales but also enables more comprehensive exploration of feature layers with many small objects. Finally, we develop a coarse to refine label assign (CRLA) strategy for small objects, which assigns labels from coarse- to fine-grained levels and ensures that small objects are adequately trained. In addition, to further promote the development of DOD, we introduce a new dataset named OUC-UAV-DET. Extensive experiments on VisDrone2021, UAVDT, and OUC-UAV-DET demonstrate that our proposed detector, DroneNet, exhibits significant improvements in handling challenging targets, outperforming state-of-the-art detectors.

Keywords: UAV; drone-view object detection; small object

1. Introduction

Object detection is a highly researched and rapidly advancing field in computer vision which aims to locate regions of interest and recognize their categories in images. With the recent advancements in deep neural networks [1,2] and the availability of comprehensive and extensive datasets [3], object detection algorithms have made significant progress and have been successfully applied in various fields including urban surveillance [4], traffic monitoring [5], and intelligent inspection [6]. Additionally, they have contributed to the advancement of formal methods for artificial intelligence verification [7,8].

Although object detection algorithms have achieved remarkable success in recent years [9,10,11], detecting objects in drone-view images remains a significant challenge due to the presence of various difficult objects, such as small, occluded, and densely packed objects. When applying existing models without specific modifications, the detection performance on drone-view images (VisDrone-DET [12]) is still considerably lower than that on natural images (MS COCO [3]).

Most previous works have primarily concentrated on devising efficient image cropping methods for detection [13,14,15]. While preprocessing the data can partially address the challenges related to small objects and partial occlusions, the detection performance itself has not witnessed significant enhancements. Figure 1 demonstrates that drone object detection is still approached as if it were identical to natural scene object detection, disregarding the distinctive attributes of drone images, which exhibit considerably higher proportions of smaller objects. In fact, the severity of this issue surpasses that observed in the COCO dataset.

Currently, VisDrone [12] serves as the primary dataset for drone-view object detection. However, the scarcity of authentic drone data has led many researchers to resort to utilizing datasets from other domains for object detection in drone-view scenarios. For example, some studies have employed the UAVDT dataset [16], originally intended for drone-view object tracking. While these datasets can provide valuable insights, they may not fully encompass the unique challenges presented in real-world drone scenes as they were not specifically designed for drone-view object detection. Therefore, we strongly believe that the collection and utilization of real-world drone datasets are paramount for achieving a comprehensive understanding of, and effective solutions for, object detection challenges in drone-view scenarios.

In this paper, we propose DroneNet, a specialized object detector specifically designed for aerial photography scenes. Compared to the approach of using cropping, DroneNet focuses on addressing the challenges of extracting target features that are difficult to capture in aerial scenes, including severe variations in target scales and a high prevalence of small objects. We introduce three modules: the feature information enhancement module, the split-concat feature pyramid network, and a coarse to refine label assign strategy, to improve the performance of DroneNet. In summary, the main contributions of our work are as follows:

(1): We present DroneNet, a specialized object detector tailored for drone-based scenes. Our detector is specifically optimized for drone-view images and outperforms existing detectors on benchmark datasets such as VisDrone.
(2): We propose a feature information enhancement module (FIEM) to enhance the feature extraction capability of the backbone network, a split-concat feature pyramid network (SCFPN) to improve the feature fusion ability, and a coarse to refine label assign (CRLA) strategy to enhance the small object learning ability. By incorporating these three improvements, DroneNet becomes a powerful detector for unmanned aerial vehicle (UAV) object detection.
(3): We have gathered a dataset named OUC-UAV-DET, which serves as a valuable resource for drone-view object detection. Our hope is that this dataset will foster advancements in the field and facilitate progress in the area of UAV-based object detection. Additionally, DroneNet has demonstrated impressive performance on this dataset.

2. Related Work

Generally speaking, current object detectors can be broadly categorized into two types: one-stage detectors and two-stage detectors. One-stage detectors, such as SSD [17], the YOLO series [18,19,20], and RetinaNet [21] directly extract features from the network to generate prediction results without the need for region proposals. On the other hand, two-stage detectors narrow down the search space for object detection by first generating class-agnostic region proposals, and then predicting the final coordinates and object class for each proposal. This approach reduces the search space for detection, leading to higher detection accuracy at the expense of efficiency. Representative two-stage detectors include Fast R-CNN [22], Faster R-CNN [23], Mask R-CNN [23], and Cascade R-CNN [24].

Numerous prior studies [13,14,25] have primarily concentrated on efficient cropping techniques for high-resolution aerial images. YOLOT [26] is a pioneering work that introduced a cropping strategy for aerial image processing. It employs a dense sliding window approach to divide high-resolution images into small chips which are then fed into a network for object detection. In contrast, ClusDet [25] performs chip detection through object clustering rather than individual objects. The detected chips are then fed into a fine-grained detector, effectively addressing the challenges of scale and sparsity in drone-view object detection simultaneously. Additionally, PRDNet [14] focuses more on the challenging objects that are most likely to be corrected to improve the final results, and incorporates global information to ensure a thorough exploration of valuable background in the images.

Although these approaches have alleviated some difficulties in aerial scene object detection, they consider object detection in aerial scenes as a generic problem similar to object detection in natural scenes. They treat it as a special data preprocessing step rather than designing specialized detectors to address the unique challenges present in aerial scenes.

TPH-YOLOv5 [27] acknowledges the limitations of the previous work and builds upon YOLOv5 by introducing an additional prediction head for detecting objects of different scales. The original prediction head is replaced with transformer prediction heads (TPHs) to leverage the potential of self-attention mechanisms in prediction. Additionally, TPH-YOLOv5 integrates the convolutional block attention module (CBAM) to identify attention regions in scenes with dense objects.

However, in practical applications, adding a separate small object detection head is often impractical because it significantly increases the memory and computational requirements. Therefore, like TPH-YOLOv5, we adopt YOLOv5 as our baseline model, but we analyze unmanned aerial vehicle (UAV) scenes to design a detector specifically tailored for UAV scenarios without an extra detection head.

3. Method

We adopt one-stage object detection algorithms as the foundation framework. Among these algorithms, the YOLO series stands out for its concise and clear structure, as well as its wide application. Specifically, the YOLOv5 algorithm demonstrates both high accuracy and fast execution speed, making it highly suitable for practical applications and deployments. However, the YOLO series algorithms were not specifically designed for drone object detection, resulting in suboptimal performance in drone scenarios, particularly when dealing with dense small objects in aerial scenes. Therefore, this paper chooses YOLOv5 as the base algorithm and proposes improvements and optimizations in its backbone network, feature fusion module, and label assignment process. The enhanced model is named DroneNet, as shown in Figure 2, aiming to meet the object detection requirements in drone scenarios.

3.1. Feature Information Enhancement Module

In object detection, feature extraction plays a crucial role in model training. Its purpose is to extract meaningful information from raw data and convert it into feature vectors that are easier for machines to process, which helps the model learn and converge more quickly. Poor feature extraction can lead to subpar training outcomes and even failure to reach the desired performance, so its significance in object detection should not be underestimated. We analyze the challenges faced when utilizing YOLOv5 in unmanned aerial vehicle (UAV) scenarios, primarily focusing on the inadequate effectiveness of the focus layer in extracting target features. To address this issue, we propose a simple and effective method that enhances the extraction of small object features at this stage, thereby improving the model's performance in UAV scenarios.

Specifically, the feature information enhancement module consists of two parts, as shown in Figure 3: the shallow transition network and the deep fusion network. The shallow transition network consists of two consecutive $3 \times 3$ convolutional layers that differ only in their stride values: the first convolutional layer has a stride of 1, while the second has a stride of 2. This design aims to minimize feature loss as much as possible. The deep fusion network draws inspiration from the residual connections in ResNet: the features extracted by the shallow transition network are divided into two branches for more comprehensive fusion. One branch acquires global information through a $1 \times 1$ convolutional layer, while the other branch undergoes internal cross-layer fusion via N BottleNeck modules to obtain local information (N is set to 1 in this case). The outputs of the global and local branches are then fused to obtain the output of the feature information enhancement module. Compared to the focus layer, which directly performs slice-wise downsampling, the feature information enhancement module can better preserve the feature information of complex objects in unmanned aerial vehicle (UAV) object detection scenarios.
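The contrast between slice-wise downsampling and the shallow transition network can be made concrete with a small shape calculation. The sketch below is illustrative only: it assumes a padding of 1 for both 3 × 3 convolutions (the text does not state the padding), and the function names `conv_out`, `focus_slice`, and `shallow_transition_size` are hypothetical helpers, not part of DroneNet or YOLOv5.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def focus_slice(image):
    """Focus-style slicing: keep every second pixel at four phase offsets.

    `image` is a 2D list (H x W, both even). Returns four (H/2 x W/2)
    slices; a focus layer would stack such slices along the channel axis.
    """
    return [
        [row[dx::2] for row in image[dy::2]]
        for dy in (0, 1)
        for dx in (0, 1)
    ]


def shallow_transition_size(size: int) -> int:
    """Two consecutive 3x3 convolutions: stride 1, then stride 2.

    Padding of 1 is an assumption made for this sketch.
    """
    after_first = conv_out(size, kernel=3, stride=1, padding=1)
    return conv_out(after_first, kernel=3, stride=2, padding=1)


if __name__ == "__main__":
    # Both paths halve the resolution of a 640-pixel input side.
    print(shallow_transition_size(640))
    toy = [[r * 4 + c for c in range(4)] for r in range(4)]
    for s in focus_slice(toy):
        print(s)
```

Under these assumptions both paths halve the spatial resolution. The difference is that slicing only rearranges pixels into channels, whereas the stride-1 convolution first aggregates each 3 × 3 neighbourhood at full resolution before the stride-2 layer downsamples.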

3.2. Split-Concat Feature Pyramid Network

Feature fusion integrates features extracted at different levels in a model, enabling the model to maintain high accuracy across multiple scales. In the context of unmanned aerial vehicle (UAV) object detection, feature fusion plays a crucial role in improving detection accuracy due to the large scale variation of the targets. In this paper, we propose SCFPN as the feature fusion component of DroneNet. It not only addresses the feature fusion challenge between different scales but also further fuses feature layers with a higher density of small objects, thus enhancing detection precision. Figure 4 illustrates the positive samples involved in the network training of YOLOv5, where yellow denotes positive samples, red represents annotated boxes, and black represents introduced padding during the training process. From left to right, P3, P4, and P5 correspond to the third, fourth, and fifth layers of the feature pyramid, respectively. As observed from Figure 4, due to the abundance of small objects in aerial scenes, a significant number of small objects are assigned to the P3 layer for learning, resulting in an excessive burden on the P3 layer. To alleviate this burden while ensuring the fusion of features from different layers, we design SCFPN, as depicted in Figure 2. First, the feature maps are upsampled from the higher level. Then, the C3 with different feature layer information is split into two paths for more comprehensive feature learning. Finally, these two paths are merged to obtain features with stronger semantic information.

3.3. Coarse to Refine Label Assign

Existing works, such as ATSS [28], PAA [29], and OTA [30], have demonstrated that label assignment is a crucial factor in determining the performance of object detection. These studies have shown that a well-designed label assignment scheme can significantly improve the accuracy and efficiency of detection systems. However, the existing label assignment strategies are not suitable for unmanned aerial vehicle (UAV) object detection scenarios. Specifically, the threshold-based label assignment strategy [20,21,22], which relies on a fixed threshold, ignores the variations in shape and size among different objects. Square or large objects have more high-quality anchors associated with them, resulting in more positive samples during the training phase. For elongated or small objects, on the other hand, most of the anchors are of low quality, leading to fewer positive samples during training. As a result, the network tends to prioritize the prediction of objects with balanced aspect ratios or larger sizes, suppressing the performance on elongated or small objects. This severely limits the detection accuracy of small objects in aerial scenes. Soft label assignment methods [30,31], which calculate soft labels and positive/negative weights based on the predicted results and ground-truth boxes, address the size limitations mentioned earlier. However, due to the increasing weight of positive samples in the network, the weight assigned to small objects becomes progressively lower, thereby restricting the precision of small object detection. Our approach starts by designing a set of positive samples through a hard label assignment scheme, which ensures a high recall rate for small objects. Subsequently, within this set of positive samples, we refine the assignment using soft labels. This approach, which we call coarse to refine label assign (CRLA), has been shown experimentally to enhance the detection accuracy of small objects in UAV object detection scenarios without increasing inference time or model size. Algorithm 1 illustrates the workflow of CRLA.

Algorithm 1 Coarse to Refine Label Assign (CRLA)

Input:

$G$ is the set of ground-truth boxes on the image
$L$ is the number of feature pyramid levels
$P_{box}$ is the set of predicted boxes on the image
$P_{cen}$ is the set of predicted box centers on the image
$P_{cls}$ is the set of predicted class scores on the image
$A$ is the set of all anchor boxes
$c$ is a robust hyperparameter used for coarse selection, with a default value of 0.2
$k$ is a robust hyperparameter used for refine selection, with a default value of 10

Output:

$P$ is the set of positive samples
$N$ is the set of negative samples

1: for each ground-truth $g \in G$ do
2:     build an empty set for candidate positive samples of the ground-truth $g$: $C_{g} \leftarrow \varnothing$;
3:     for each level $i \in [1, L]$ do
4:         $S_{i} \leftarrow \mathrm{IoU}(C_{g}, g) \geq c$;
5:         $C_{g} = C_{g} \cup S_{i}$;
6:     end for
7:     for each candidate $c \in C_{g}$ do
8:         get the top $k$ samples with the highest intersection over union (IoU): $S_{i} \leftarrow \mathrm{top}(\mathrm{IoU}(C_{g}, g), k)$;
9:         $n \leftarrow \mathrm{sum}(S_{i})$;
10:         $cost \leftarrow \mathrm{loss}(C_{g}, g) + \mathrm{loss}(C_{g}, P_{cen}) + \mathrm{loss}(C_{cls}, P_{cls})$;
11:         $P = P \cup \mathrm{top}(\mathrm{argmin}(cost), n)$;
12:     end for
13: end for
14: $N = A - P$;
15: return $P, N$;
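The two stages of Algorithm 1 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the pyramid levels are flattened into one anchor list, the cost is supplied by the caller through a hypothetical `cost_fn(anchor_index, gt_index)` instead of the three loss terms of step 10, the dynamic count n is floored and clamped to at least 1 (a convention borrowed from SimOTA-style assignment), and anchors claimed by several ground-truth boxes are not disambiguated.

```python
from typing import Callable, List, Set, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def crla_assign(
    gts: List[Box],
    anchors: List[Box],
    cost_fn: Callable[[int, int], float],
    c: float = 0.2,
    k: int = 10,
) -> Tuple[Set[int], Set[int]]:
    """Return (positive, negative) anchor indices for one image."""
    positives: Set[int] = set()
    for g_idx, g in enumerate(gts):
        # Coarse stage: keep every anchor whose IoU with g reaches c.
        candidates = [(i, iou(a, g)) for i, a in enumerate(anchors)]
        candidates = [(i, v) for i, v in candidates if v >= c]
        if not candidates:
            continue
        # Refine stage: n is the (floored) sum of the top-k IoUs, at least 1.
        top_ious = sorted((v for _, v in candidates), reverse=True)[:k]
        n = max(1, int(sum(top_ious)))
        # Keep the n candidates with the lowest cost.
        ranked = sorted(candidates, key=lambda iv: cost_fn(iv[0], g_idx))
        positives.update(i for i, _ in ranked[:n])
    negatives = set(range(len(anchors))) - positives
    return positives, negatives


if __name__ == "__main__":
    gts = [(0.0, 0.0, 10.0, 10.0)]
    anchors = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 11.0, 11.0),
               (5.0, 5.0, 15.0, 15.0), (20.0, 20.0, 30.0, 30.0)]
    # Illustrative cost only: one minus IoU with the ground truth.
    cost = lambda i, j: 1.0 - iou(anchors[i], gts[j])
    print(crla_assign(gts, anchors, cost))
```

The coarse threshold controls recall (how many anchors are even considered for a small object), while the cost ranking in the refine stage decides which of those candidates are finally trained as positives.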

4. Experiments

4.1. Implementation Details

In the training stage of our proposed DroneNet, we utilized YOLOv5 as the baseline network. Our implementation was based on PyTorch and trained on four NVIDIA RTX 2080 Ti GPUs. The initial learning rate was set to 0.01, the batch size was set to 8, and the number of training epochs was 300. The entire network was trained using the SGD optimizer [32] with a weight decay of $5 \times 10^{-4}$. For the datasets VisDrone, UAVDT, and OUC-UAV-DET, the input size of the detector was set to 640 × 640 pixels. During the testing stage, the input size was also 640 × 640 pixels. The non-maximum suppression (NMS) [33] threshold of final fusion and the maximum number of detections were set to 0.6 and 300, respectively. Mosaic data augmentation was applied during training to enhance the robustness of DroneNet.
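For readers unfamiliar with the post-processing step, the following is a minimal sketch of greedy NMS using the two settings quoted above. It assumes that 0.6 is an IoU threshold and that suppression is class-agnostic; neither detail is stated in the text, and the function names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(
    boxes: List[Box],
    scores: List[float],
    iou_threshold: float = 0.6,
    max_detections: int = 300,
) -> List[int]:
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap a kept box too strongly.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
            if len(keep) == max_detections:
                break
    return keep


if __name__ == "__main__":
    boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 11.0, 11.0), (20.0, 20.0, 30.0, 30.0)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))
```

In dense drone-view scenes the threshold matters: a lower value would suppress true neighbouring objects whose boxes overlap, while a higher value lets more duplicates through.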

4.2. Datasets and Metrics

To demonstrate the efficacy of our proposed methodology, we conduct comprehensive experiments on two widely recognized benchmark datasets for drone-view object detection, namely, VisDrone2021-DET [12] and UAVDT [16]. Furthermore, we also evaluate the performance of DroneNet on an additional dataset, namely, OUC-UAV-DET, which was collected by ourselves.

The VisDrone dataset is currently widely used for evaluating unmanned aerial vehicle (UAV) object detection. It was curated and created by the Machine Learning and Data Mining Laboratory team at Tianjin University in 2018. The dataset comprises 10,209 images, with 6471 images allocated for training, 548 images for validation, and 3190 images for testing. It encompasses a total of 10 object categories, namely, pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motorbike. Because the evaluation server is currently closed, we are unable to test our method on the designated test set. Following ClusDet [25], we use the validation set in its place to evaluate the effectiveness of our method.

UAVDT is a dataset for single-object tracking and multi-object tracking tasks, with a resolution of 1024 × 540 pixels and a total of 80,000 frames from 100 videos. Each frame is annotated with three object classes: cars, buses, and trucks. Due to the specific nature of the original annotations, in this study, only frames from the multi-object tracking task were used for object detection. Unlike previous works [25,34], problematic images with incorrect annotations, as shown in Figure 5, were manually removed from the dataset. The UAVDT dataset was divided into training and testing sets for object detection, consisting of 10,000 and 5000 images, respectively.

VisDrone and UAVDT are large-scale publicly available datasets for aerial drone photography, playing a significant role in unmanned aerial vehicle (UAV) vision research. However, these datasets also have some limitations. The VisDrone dataset undergoes compression processing, resulting in lower image resolution, with most images ranging from 1080 × 750 to 1920 × 1080. On the other hand, the UAVDT dataset primarily focuses on object tracking and has a limited variety of scenes, consisting mostly of consecutive frames. Therefore, to address these shortcomings, this paper aims to compile and create the OUC-UAV-DET dataset, by considering the demands of practical application scenarios. Figure 6 demonstrates the utilization of the Labelme tool for annotating OUC-UAV-DET.

The OUC-UAV-DET dataset is composed of nine object categories: people, bicycle, car, van, truck, tricycle, bus, motor, and boat. Unlike the VisDrone dataset, the categories in OUC-UAV-DET are not excessively divided. A smaller and more realistic set of categories is provided, which is suitable for practical applications. The images in this dataset have a minimum resolution of 1920 × 1080, which is higher than that of datasets such as VisDrone and UAVDT, making it more suitable for real-world research. The dataset comprises 1245 training images, 207 validation images, and 619 test images.

To evaluate overall performance, the standard protocol used for the MS COCO dataset [3] is followed, which calculates average precision (AP) by averaging across multiple intersection-over-union (IoU) thresholds. These thresholds range from 0.5 to 0.95 in increments of 0.05.
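The averaging step of this protocol can be written out as follows. This is a small sketch of the threshold grid and the final mean only; computing the per-threshold AP values themselves (precision-recall integration) is outside its scope, and the function names are hypothetical.

```python
from typing import Dict, List


def coco_iou_thresholds() -> List[float]:
    """IoU thresholds 0.50, 0.55, ..., 0.95 (ten values)."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def average_over_thresholds(ap_by_threshold: Dict[float, float]) -> float:
    """Mean of per-threshold AP values over the COCO threshold grid."""
    thresholds = coco_iou_thresholds()
    return sum(ap_by_threshold[t] for t in thresholds) / len(thresholds)


if __name__ == "__main__":
    # Placeholder per-threshold values for illustration, not measured results.
    toy = {t: 1.0 - t for t in coco_iou_thresholds()}
    print(average_over_thresholds(toy))
```

Because the stricter thresholds require tight localization, which is hard for objects only a few pixels wide, this averaged AP penalizes small-object errors more heavily than AP at a single threshold of 0.5.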

4.3. Ablation Study

In this subsection, we perform a series of ablation experiments to analyze the impact of the hyperparameters involved and the contribution of key components in the proposed DroneNet.

The effectiveness of the feature information enhancement module. According to the results in Table 1, the introduction of the FIEM led to a 1.93% improvement in accuracy, demonstrating its significant role in enhancing feature extraction capability. The feature maps visualized in Figure 7 show that the focus layer exhibits weaker learning capability than the FIEM. This is particularly crucial for objects with weak features in drone-view images, as the FIEM assists the backbone network in better feature extraction.

The effectiveness of the coarse to refine label assign strategy. As shown in Table 1, applying the CRLA strategy to the baseline model led to a performance improvement of 6.23%. Additionally, by combining CRLA with FIEM and SCFPN, the final model achieved an accuracy of 29.6%, which is significantly higher than the baseline model's 21.2%. These results demonstrate the significant impact of the CRLA strategy in improving model performance. We compared our CRLA strategy with common label assignment methods, as shown in Table 2. For instance, when we incorporated a hard label assignment approach such as IoU-based assignment into the YOLOv5 model, we observed a decrease in the model accuracy. This indicates that the hard label assignment method is not suitable, especially in aerial scenes with numerous small objects. When we applied a soft label assignment strategy such as SimOTA to the baseline, we observed an improvement in accuracy, suggesting that the soft label assignment strategy has some effectiveness. Upon introducing the CRLA strategy, the model achieved the highest accuracy among the compared methods, providing evidence for its effectiveness. These results indicate that the CRLA strategy has clear potential for application in drone object detection.

We first conducted a detailed analysis of the IoU threshold as it is crucial for determining the number of positive samples. We divided the IoU threshold into different intervals ranging from 0 to 0.5 and drew some conclusions based on our observations, as shown in Table 3. When the IoU threshold was set to 0, we found that the performance was not satisfactory. This is because a lower threshold introduces many low-quality positive samples, thereby reducing the accuracy of the model. On the other hand, when we adjusted the IoU threshold to 0.5, although some positive samples were filtered out, it also resulted in a decrease in precision.
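A short worked example shows why the choice of threshold affects small objects most. The sketch below assumes a square object and a candidate box of the same size shifted horizontally by a few pixels; the sizes (8 and 64 pixels) and the 4-pixel shift are hypothetical values chosen for illustration, and the function names are not from the paper.

```python
def shifted_iou(side: float, shift: float) -> float:
    """IoU between a square box and a copy shifted horizontally by `shift`."""
    overlap = max(0.0, side - shift)
    inter = overlap * side
    union = 2 * side * side - inter
    return inter / union


def survives(side: float, shift: float, threshold: float) -> bool:
    """True if the shifted candidate passes the coarse IoU threshold."""
    return shifted_iou(side, shift) >= threshold


if __name__ == "__main__":
    for side in (8.0, 64.0):
        for threshold in (0.0, 0.2, 0.5):
            print(side, threshold, round(shifted_iou(side, 4.0), 3),
                  survives(side, 4.0, threshold))
```

With a 4-pixel shift, the 8-pixel object has an IoU of 1/3 with its candidate, so the candidate passes a threshold of 0.2 but is discarded at 0.5, whereas the 64-pixel object retains an IoU of about 0.88 and passes both. A threshold of 0, by contrast, admits every candidate regardless of overlap.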

As indicated in Table 4, different values of k ranging from 6 to 15 were utilized for training the detector. It is observed that the proposed method exhibits considerable insensitivity to variations in k within this range. Employing a large value of k, such as 15, leads to an abundance of low-quality candidates, resulting in a slight decrease in performance. Conversely, adopting a small value of k, such as 6, noticeably diminishes accuracy due to the insufficient number of candidate positive samples, which introduces statistical instability.

The effectiveness of the split-concat feature pyramid network. From Table 1, it can be observed that the model’s accuracy improves by 2.98% when the SCFPN is added on top of the baseline YOLOv5. This indicates that the SCFPN effectively integrates features from different scales. Furthermore, when the SCFPN is added to the FIEM module, the network achieves a significant improvement of 6.23%. This suggests that the SCFPN can better integrate with the FIEM module and enhance the network’s learning capability. Table 5 below presents a comparison between the SCFPN and other popular feature fusion modules, demonstrating improvements in accuracy across various scales for the SCFPN. This indicates that the SCFPN has better feature fusion capability in addressing scenarios with multiple scale variations.

4.4. Experiment Results

First, we compare our DroneNet network with the state-of-the-art detectors, as shown in Table 6. Then, we integrate the DroneNet network with the slicing strategy, taking the simplest YOLOT slicing strategy as an example, and compare it with several detectors related to slicing, as shown in Table 7. Finally, the speed performance of DroneNet is compared with that of the baseline model [38] and TPH-YOLOv5 [27], as shown in Table 8.

From Table 6, it is evident that the DroneNet model demonstrates outstanding performance in aerial photography scenes. It achieves remarkable results on all three datasets, indicating that the DroneNet model successfully integrates the FIEM module, SCFPN module, and CRLA strategy, thereby possessing excellent object recognition capabilities. This outcome highlights the superior performance of the DroneNet model in aerial photography tasks.

From Table 7, it can be observed that without the need for an overly intricate design of cropping strategies, significant improvements in accuracy can be achieved by enhancing the detector and employing the simplest sliding window cropping method.

From Table 8, it can be observed that DroneNet achieves a good balance between speed and accuracy. Compared to TPH-YOLOv5, DroneNet significantly improves the processing speed for image manipulation and object detection tasks. It also demonstrates competitiveness in terms of processing speed compared to the baseline model. This makes it more applicable in practical scenarios, especially in drone-related contexts where real-time performance and low latency are critical factors.

4.5. Visualization

In this subsection, we employ visualization to further verify and analyze our method. Firstly, we employ a confusion matrix to highlight the disparities between the detection outcomes of our proposed DroneNet and the baseline method. Furthermore, we employ category activation to comprehensively investigate the comparison between DroneNet and baseline methods across various scenarios. Finally, we present a comparison between DroneNet and baseline methods on three datasets, showcasing the scene detection results.

Confusion matrix visualization. In this part, we conduct a statistical analysis to determine the reasons behind each category error, and present our findings in the form of two confusion matrices. As illustrated in Figure 8, the two confusion matrices depict the statistical data for both the baseline and our proposed method. Specifically, the labels on the horizontal axis represent the ground-truth, while the labels on the vertical axis represent the predicted values. The color gradient, ranging from yellow to blue, indicates the hierarchical degree of influence on the corresponding categories. For instance, considering the first item in the first row, the value of 0.46 indicates that 46% of the predictions for the category “pedestrian” were accurate. It can be observed that DroneNet exhibits remarkable performance across different categories. Specifically, in the “bus” category, DroneNet shows the highest improvement among the ten categories, with an increase of 0.26 compared to the baseline methods. Furthermore, in the challenging category of “awning-tricycle”, DroneNet achieves a performance gain of 0.16. These results demonstrate the exceptional performance of DroneNet across various categories.

Category activation mapping visualization. Category activation mapping (CAM) [54] is a method used to interpret the convolutional layers in convolutional neural networks. It generates class-specific "activation maps" by applying global average pooling and a linear classifier on the last convolutional layer. These maps visualize the regions in an image that the network considers important and thus provide insights into how the network makes predictions. In this study, we employed the Grad-CAM++ [55] method to visualize the results of DroneNet and the baseline on the VisDrone dataset, as shown in Figure 9.

Figure 9A illustrates a common aerial scene consisting of vehicles arranged in a regular pattern and a few pedestrians. Compared with the baseline, DroneNet demonstrates stronger classification capabilities and better recognition of small objects at the image edges.

Figure 9B shows an aerial scene with drastic variations in object scale. DroneNet handles these scale changes better than the baseline.

Figure 9C depicts an aerial scene in which small objects are clustered. Because it learns small objects inadequately, the baseline tends to misclassify the background as the target. In contrast, DroneNet localizes and classifies small objects well.

Figure 9D shows a nighttime aerial scene. Because nighttime data are scarce in the VisDrone training set, the baseline classifies poorly in such scenes and often mistakes the background for foreground targets. DroneNet handles nighttime scenes well but still has limitations: it identifies distant small objects less accurately than in the daytime scene of Figure 9C.

Figure 9E presents a case in which rapid camera rotation during aerial filming blurs the objects. Both DroneNet and the baseline make misidentifications in this scenario, but DroneNet retains strong foreground target recognition.

Figure 9F displays an overexposed aerial scene. The baseline identifies only the clearly visible portions of the targets, whereas DroneNet still recognizes objects under overexposure.

These six scenarios collectively demonstrate that, when confronted with complex aerial scenes and varying capture conditions such as low light, motion blur, and overexposure, DroneNet exhibits superior recognition capabilities compared to the baseline. This indicates that DroneNet has a significant advantage in aerial scenes.

Visual comparisons of detection results. In this section, we compare the detection results of DroneNet and the baseline qualitatively. Figure 10 illustrates these comparisons, with the regions of interest delineated by dashed boxes. Because objects in drone-view images are typically small, we have enlarged some samples so that they are clearly displayed. Across various complex scenarios, our method recognizes challenging objects more effectively and more robustly than the baseline. For instance, as illustrated in the magnified section of the second row, the baseline is significantly affected by occlusions caused by pedestrians and vehicles, whereas our model accurately identifies both vehicles and pedestrians. These visualizations demonstrate that DroneNet can effectively guide the detector to focus on challenging regions. By helping the detector better understand difficult objects, our method recognizes objects accurately even in regions where ground-truth labels are not provided. Because such objects are unlabeled, these additional detections do not translate into an increase in average precision (AP); the capability of our method on difficult objects therefore exceeds what the reported AP values indicate.

5. Conclusions

In conclusion, this paper addresses the challenges of drone-view object detection (DOD) by introducing a specialized detector called DroneNet. We conducted a comprehensive analysis of complex regions in drone-view images and proposed corresponding solutions. The feature information enhancement module (FIEM) effectively preserves object information, while the split-concat feature pyramid network (SCFPN) fuses feature information from different scales and enables exploration of feature layers with many small objects. Additionally, the coarse to refine label assign (CRLA) strategy ensures adequate training of small objects. We also introduce a new dataset, OUC-UAV-DET, to further promote DOD research. Experimental results on VisDrone2021, UAVDT, and OUC-UAV-DET demonstrate that DroneNet achieves significant improvements in drone-view object detection compared to state-of-the-art detectors (for example, 29.6 mAP versus 21.2 for the baseline on the VisDrone2021-DET validation set), showcasing its potential for applications.

Author Contributions

Methodology, X.W.; Writing—review and editing, F.Y., A.L. and G.Z.; Supervision, S.W.; Software, Z.X., L.D. and X.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the National Key Research and Development Program of China under Grant No. 2018AAA0100400, HY Project under Grant No. LZY2022033004, the Natural Science Foundation of Shandong Province under Grants No. ZR2020MF131 and No. ZR2021ZD19, Project of the Marine Science and Technology cooperative Innovation Center under Grant No. 22-05-CXZX-04-03-17, the Science and Technology Program of Qingdao under Grant No. 21-1-4-ny-19-nsh, and Project of Associative Training of Ocean University of China under Grant No. 202265007. We also want to thank “Qingdao AI Computing Center” and “Eco-Innovation Center” for providing inclusive computing power and technical support of MindSpore during the completion of this paper.

Data Availability Statement

Datasets can be found at the following link: https://github.com/VisDrone/VisDrone-Dataset; https://sites.google.com/view/grli-uavdt/; https://github.com/XiandongWang/OUC-UAV-DET.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Wang, J.; Zhang, S.; Liu, Y.; Wu, T.; Yang, Y.; Liu, X.; Chen, K.; Luo, P.; Lin, D. RIFormer: Keep Your Vision Backbone Effective But Removing Token Mixer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023.
2. Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. ConvNeXt V2: Co-Designing and Scaling ConvNets with Masked Autoencoders. arXiv 2023, arXiv:2301.00808.
3. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014. Part V 13. pp. 740–755.
4. Umair, M.; Farooq, M.U.; Raza, R.H.; Chen, Q.; Abdulhai, B. Efficient video-based vehicle queue length estimation using computer vision and deep learning for an urban traffic scenario. Processes 2021, 9, 1786.
5. Singh, C.H.; Mishra, V.; Jain, K.; Shukla, A.K. FRCNN-Based Reinforcement Learning for Real-Time Vehicle Detection, Tracking and Geolocation from UAS. Drones 2022, 6, 406.
6. Maslan, J.; Cicmanec, L. A System for the Automatic Detection and Evaluation of the Runway Surface Cracks Obtained by Unmanned Aerial Vehicle Imagery Using Deep Convolutional Neural Networks. Appl. Sci. 2023, 13, 6000.
7. Krichen, M.; Mihoub, A.; Alzahrani, M.Y.; Adoni, W.Y.H.; Nahhal, T. Are Formal Methods Applicable to Machine Learning and Artificial Intelligence? In Proceedings of the 2022 2nd International Conference of Smart Systems and Emerging Technologies (SMARTTECH), Riyadh, Saudi Arabia, 9–11 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 48–53.
8. Raman, R.; Gupta, N.; Jeppu, Y. Framework for Formal Verification of Machine Learning Based Complex System-of-Systems. Insight 2023, 26, 91–102.
9. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. arXiv 2022, arXiv:2203.03605.
10. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. Tood: Task-aligned one-stage object detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE Computer Society, Montreal, QC, Canada, 10–17 October 2021; pp. 3490–3499.
11. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696.
12. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and Tracking Meet Drones Challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7380–7399.
13. Huang, Y.; Chen, J.; Huang, D. UFPMP-Det: Toward Accurate and Efficient Object Detection on Drone Imagery. AAAI Conf. Artif. Intell. 2022, 36, 1026–1033.
14. Leng, J.; Mo, M.; Zhou, Y.; Gao, C.; Li, W.; Gao, X. Pareto Refocusing for Drone-View Object Detection. IEEE Trans. Circuits Syst. Video Technol. 2022, 33, 1320–1334.
15. Akyon, F.C.; Altinuc, S.O.; Temizel, A. Slicing Aided Hyper Inference and Fine-Tuning for Small Object Detection. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 966–970.
16. Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 370–386.
17. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016. Part I 14. pp. 21–37.
18. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
19. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
20. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
21. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
22. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv 2015, arXiv:1506.01497.
23. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
24. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162.
25. Yang, F.; Fan, H.; Chu, P.; Blasch, E.; Ling, H. Clustered object detection in aerial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8311–8320.
26. Van Etten, A. You only look twice: Rapid multi-scale object detection in satellite imagery. arXiv 2018, arXiv:1805.09512.
27. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-Captured Scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Montreal, QC, Canada, 11–17 October 2021; pp. 2778–2788.
28. Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9759–9768.
29. Kim, K.; Lee, H.S. Probabilistic anchor assignment with iou prediction for object detection. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020. Part XXV 16. pp. 355–371.
30. Ge, Z.; Liu, S.; Li, Z.; Yoshie, O.; Sun, J. Ota: Optimal transport assignment for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 303–312.
31. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430.
32. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
33. Neubeck, A.; Van Gool, L. Efficient non-maximum suppression. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR’06), Hong Kong, China, 20–24 August 2006; IEEE: Piscataway, NJ, USA, 2006; Volume 3, pp. 850–855.
34. Li, C.; Yang, T.; Zhu, S.; Chen, C.; Guan, S. Density Map Guided Object Detection in Aerial Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Seattle, WA, USA, 14–19 June 2020.
35. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
36. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768.
37. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10781–10790.
38. Jocher, G. YOLOv5 by Ultralytics. Zenodo 2020.
39. Pang, J.; Chen, K.; Shi, J.; Feng, H.; Ouyang, W.; Lin, D. Libra r-cnn: Towards balanced learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 821–830.
40. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019, arXiv:1904.07850.
41. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364.
42. Li, Y.; Chen, Y.; Wang, N.; Zhang, Z. Scale-aware trident networks for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6054–6063.
43. Tian, Z.; Shen, C.; Chen, H.; He, T. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636.
44. Zhu, C.; He, Y.; Savvides, M. Feature selective anchor-free module for single-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 840–849.
45. Wang, J.; Zhang, W.; Cao, Y.; Chen, K.; Pang, J.; Gong, T.; Shi, J.; Loy, C.C.; Lin, D. Side-aware boundary localization for more precise object detection. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020. Part IV 16. pp. 403–419.
46. Zhang, H.; Wang, Y.; Dayoub, F.; Sunderhauf, N. Varifocalnet: An iou-aware dense object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 8514–8523.
47. Chen, Z.; Yang, C.; Li, Q.; Zhao, F.; Zha, Z.J.; Wu, F. Disentangle your dense object detector. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 19–23 October 2021; pp. 4939–4948.
48. Zand, M.; Etemad, A.; Greenspan, M. Objectbox: From centers to boxes for anchor-free object detection. In Proceedings of the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022. Part X. pp. 390–406.
49. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
50. Li, C.; Li, L.; Geng, Y.; Jiang, H.; Cheng, M.; Zhang, B.; Ke, Z.; Xu, X.; Chu, X. YOLOv6 v3.0: A Full-Scale Reloading. arXiv 2023, arXiv:2301.05586.
51. Jocher, G.; Chaurasia, A.; Qiu, J. YOLO by Ultralytics. 2023. Available online: https://github.com/ultralytics/ultralytics/blob/main/CITATION.cff (accessed on 1 January 2023).
52. Liao, J.; Piao, Y.; Su, J.; Cai, G.; Huang, X.; Chen, L.; Huang, Z.; Wu, Y. Unsupervised Cluster Guided Object Detection in Aerial Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 11204–11216.
53. Deng, S.; Li, S.; Xie, K.; Song, W.; Liao, X.; Hao, A.; Qin, H. A Global-Local Self-Adaptive Network for Drone-View Object Detection. IEEE Trans. Image Process. 2021, 30, 1556–1569.
54. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 21–26 July 2016; pp. 2921–2929.
55. Chattopadhay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V.N. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 839–847.

Figure 1. Most current paradigms treat drone-view object detection as a conventional object detection task, where different cropping strategies are applied first and then passed through unmodified detectors. This approach neglects the unique characteristics of aerial scenes, such as intense scale variations and a large number of small objects.

Figure 2. DroneNet consists of three components: FIEM, SCFPN, and CRLA. FIEM is utilized to enhance the features of the backbone network, SCFPN is employed to fuse different feature layers, particularly to leverage information from feature layers related to small objects, while CRLA is responsible for selecting high-quality positive samples of small objects to be involved in the network training.

Figure 3. Model structure composition of the feature information enhancement module.

Figure 4. The visualization of positive sample distribution in the neck module during model training. A large number of small objects are allocated to the low-level feature maps, exhibiting significant scale variations.

Figure 5. The visualization of misannotations in the UAVDT dataset. We employed LabelImg for visual inspection and manually removed these data points that could potentially affect the experimental outcomes.

Figure 6. The OUC-UAV-DET dataset annotated using the Labelme tool.

Figure 7. Visualization of the first layer after passing through the FCOS module and the first layer after passing through the FIEM module. The color red represents background learning, while the color blue represents object learning.

Figure 8. Error analysis was conducted on the proposed DroneNet (bottom row) and the baseline method (top row) across all ten categories using the validation set of VisDrone2021-DET. The horizontal axis represents the ground-truth labels, while the vertical axis represents the predictions. The confusion matrix displays the percentage of errors. This plot clearly illustrates the substantial enhancement in object identification ability achieved by our proposed DroneNet.
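
A matrix of the kind shown in Figure 8 can be computed as sketched below. The sketch assumes that each ground-truth object has already been paired with one predicted class (in detection this pairing is usually obtained by IoU matching, which is omitted here) and that percentages are normalized per ground-truth class; both are assumptions, since the caption does not state them, and the function name is hypothetical.

```python
def confusion_percentages(truths, predictions, num_classes):
    """Return matrix[pred][truth] in percent of each ground-truth class.

    Rows are predictions and columns are ground-truth labels, following
    the axis layout described for Figure 8.
    """
    counts = [[0] * num_classes for _ in range(num_classes)]
    for truth, pred in zip(truths, predictions):
        counts[pred][truth] += 1

    # Number of samples per ground-truth class (column totals).
    totals = [sum(counts[p][t] for p in range(num_classes))
              for t in range(num_classes)]

    return [[100.0 * counts[p][t] / totals[t] if totals[t] else 0.0
             for t in range(num_classes)]
            for p in range(num_classes)]


# Toy example with two classes: one of the two class-0 objects is
# predicted as class 1, so the off-diagonal error is 50%.
matrix = confusion_percentages([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

Off-diagonal entries are the per-class error percentages; a stronger detector concentrates mass on the diagonal.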

Figure 9. The comparison of class activation visualizations between DroneNet and the baseline on the VisDrone dataset. Higher response values indicate higher predicted scores.

Figure 10. A comparison of detection results between DroneNet and the baseline on three drone-view datasets, namely, VisDrone-2021, UAVDT, and OUC-UAV-DET. Different colors represent different categories.

Table 1. Assessing the impact of integrating our feature information enhancement module (FIEM), split-concat feature pyramid network (SCFPN), and coarse to refine label assign (CRLA) into the baseline on the validation set of VisDrone2021-DET. The black bolded numbers indicate the current best results and the red numbers indicate how much improvement has been achieved compared to the baseline.

FIEM	SCFPN	CRLA	mAP	Diff
			21.2	-
✓			23.13	+1.93
	✓		24.52	+3.32
✓	✓		27.43	+6.23
		✓	25.13	+3.93
✓	✓	✓	29.6	+8.4

Table 2. Comparison of the CRLA strategy with other label assignment strategies. The black bolded numbers indicate the current best results.

Method	mAP
Baseline	21.2
+IOU	20.35
+SimOTA	23.45
+CRLA	25.13

Table 3. The analysis of different values of hyperparameter IoU threshold on the VisDrone validation set. The black bolded numbers indicate the current best results.

IoU	0.0	0.1	0.2	0.3	0.4	0.5
mAP (%)	28.18	28.84	29.6	29.16	29.02	27.78

Table 4. The analysis of different values of hyperparameter K on the VisDrone validation set. The black bolded numbers indicate the current best results.

K	6	7	8	9	10	11	12	13	14	15
mAP (%)	29.34	29.39	29.4	29.4	29.6	29.5	29.53	29.44	29.4	29.3
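
To make the two hyperparameters in Tables 3 and 4 concrete, the following sketch shows one generic way an IoU threshold and a top-K limit can jointly select positive candidates for a ground-truth box. It is an illustrative assumption, not the CRLA procedure itself; the function names and the box format (x1, y1, x2, y2) are hypothetical, and the defaults are simply the best-performing values from the tables (IoU = 0.2, K = 10).

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_candidates(gt_box, candidates, k=10, iou_threshold=0.2):
    """Indices of the top-k candidates by IoU that also pass the threshold.

    Generic selection rule for illustration only (assumed, not from the paper).
    """
    scored = sorted(((iou(gt_box, c), i) for i, c in enumerate(candidates)),
                    reverse=True)
    return [i for score, i in scored[:k] if score >= iou_threshold]


gt = (0, 0, 10, 10)
candidates = [(0, 0, 10, 10), (5, 5, 15, 15), (20, 20, 30, 30), (0, 0, 10, 5)]
positives = select_candidates(gt, candidates, k=2, iou_threshold=0.2)
```

Under such a rule, a low threshold or a large K admits more, lower-quality positives, while a high threshold or a small K can leave small objects with too few positives, which is consistent with the mAP peaking at intermediate values in both tables.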

Table 5. Experimental comparison of SCFPN modules. The black bolded numbers indicate the current best results.

	mAP	APs	APm	APl
FPN [35]	20.49	9.2	23.4	32.1
PAFPN [36]	21.2	13.2	30.9	39.2
BiFPN [37]	22.31	10.2	26.9	38.7
SCFPN	24.52	12.4	32.4	42.9

Table 6. Comparison of AP (%) on VisDrone, UAVDT, and OUC-UAV-DET between our approach and various detectors. The red numbers represent the highest precision, while the blue ones represent the second-highest precision.

	VisDrone						UAVDT						OUC-UAV-DET
Method	mAP	AP50	AP75	APs	APm	APl	mAP	AP50	AP75	APs	APm	APl	mAP	AP50	AP75	APs	APm	APl
Faster R-CNN [22]	21.9	37.1	22.7	13.1	33.6	27.2	81.4	98.3	95.5	74.5	86.5	89.9	38	63.2	40.3	24	42.6	44.8
SSD [17]	25.2	46.1	24.1	16.4	37	37.6	76.8	97.3	89.2	67.8	83.8	86.1	35.4	61.4	37	21.9	40.4	41.7
RetinaNet [21]	23.5	40.2	23.8	13.7	37.4	41	76.5	97	88.9	67.3	83.5	86.1	29	49.5	30.1	17.1	35.1	31.8
Cascade R-CNN [24]	24.5	39	26.1	15.2	36.7	39.2	84.2	98.6	96	76.7	88.6	92.7	39.2	63.8	42.1	25	43.7	46.5
Libra R-CNN [39]	21.7	36.7	22.4	13.4	32.6	34.6	81.1	96.7	95	71.8	85.6	90.1	37.7	63.1	40.1	23.7	42.3	44.9
CenterNet [40]	18.7	33.6	17.9	9.8	29.3	38.7	73.2	97.8	87.9	60.6	79.4	85.4	34.1	59.6	34.9	18	39.8	43.3
HRNet [41]	24.6	40.3	26.2	15.9	36.8	39.1	82.3	98.1	94.9	72.3	85.9	92.1	40.4	65.8	43.4	27.4	44.9	46
TridentNet [42]	20.7	35.3	20.9	12	30.9	37.5	79.9	98.6	94.3	71.8	85.2	90.8	36.8	62.1	38.4	21.2	42	45.2
FCOS [43]	19	31.9	19.7	10.2	29.1	38	80.8	98.9	94.7	72.8	85.7	89.6	36.3	61.4	37.6	21.2	41.6	43.6
FSAF [44]	20.8	36.4	20.5	13.3	29.3	34.7	82.1	98.8	95.3	74.2	85.9	91.6	34.5	59.2	35.4	20.7	38.7	40.2
Sabl [45]	21.9	36	22.9	12.9	33.1	33.8	84.1	98.3	96.4	76.3	88.6	92.3	38.8	63	41.6	24.4	43.6	46.9
VFNet [46]	23.1	37.3	24.1	14.2	33.9	39.4	85.5	98.9	97.4	79.2	89.1	91.7	39.8	64.2	42.2	24.5	45.6	46.8
TOOD [10]	24.4	39.8	25.3	15.5	35.5	41.4	86.5	99	97.8	81.6	89.3	93	40	64.8	43	25.6	44.9	45.8
DDOD [47]	23.3	38.2	24.2	14.4	34.5	39.6	85.1	98.9	97.5	78.9	88.2	92.8	39.1	64.1	41.5	24.2	44.3	46.2
ObjectBox [48]	22.5	39.9	22.1	13.4	34.2	38.8	86.9	98.3	98.1	81.4	89.1	93.4	46.5	71.3	50.7	34.7	50.4	51
YOLOv3 [20]	24.8	43.9	24.2	16.5	34.4	45.2	87.5	98.4	97.2	83	89.9	93.5	46.4	71.3	50.8	34.5	50.7	50.9
YOLOv4 [49]	23.5	39.2	23.4	13.3	35.4	45.1	87.9	98.1	97.5	82.5	89.7	93.2	43.2	66.4	46.1	28.8	47.2	48.9
YOLOX [31]	22.4	39.1	22.3	13.7	33.1	41.3	87.7	97.9	97.4	83.2	89.2	93.4	47.1	72	51.5	34.8	51.4	52.2
YOLOv6 [50]	27.1	44.5	27.7	17	40.1	47.1	88.1	98.7	97.5	82.9	89.6	92.4	46.8	71.6	51.6	33.9	50.9	51.7
YOLOv7 [11]	27.9	48.3	27.5	18.5	39	49.3	88.6	99.1	98.2	83.1	90.2	93.6	47.4	72.9	52.3	34.3	51.7	53.4
YOLOv8 [51]	25.9	42.9	26.4	16.6	38.2	45.8	87.5	98.2	97.1	82.4	89.3	93.1	46.3	71.1	51.2	34.1	50.6	51.6
Baseline [38]	21.2	37.3	20.8	13.2	30.9	39.2	88.2	98.5	97.5	82.9	90.1	93.1	45.9	70.7	50	32.7	50.7	51.1
DroneNet	29.6	50.4	29.6	19.9	41.9	49.6	89.1	99.4	98.9	83.7	91.2	94.6	48.8	73.3	53.2	35.8	52.9	54.8
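
The APs, APm, and APl columns break AP down by object size. Assuming the COCO convention [3] (small: area below 32² pixels, medium: between 32² and 96², large: above 96²), the bucketing can be sketched as follows; the function name is hypothetical, box area is used in place of mask area, and the handling of the exact boundaries is a choice made for the example.

```python
SMALL_MAX_AREA = 32 ** 2   # 1024 px^2, COCO small/medium boundary
MEDIUM_MAX_AREA = 96 ** 2  # 9216 px^2, COCO medium/large boundary


def size_bucket(width, height):
    """Classify a box as 'small', 'medium', or 'large' by pixel area."""
    area = width * height
    if area < SMALL_MAX_AREA:
        return "small"
    if area < MEDIUM_MAX_AREA:
        return "medium"
    return "large"


# A 20x20 pedestrian seen from a drone falls in the small bucket,
# while a 100x100 vehicle close to the camera counts as large.
buckets = [size_bucket(20, 20), size_bucket(50, 50), size_bucket(100, 100)]
```

Each bucket is then evaluated separately, which is why a detector can rank differently on APs than on overall mAP.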

Table 7. DroneNet, combined with a sliding-window crop method, is compared to various object detection strategies involving image cropping on the VisDrone validation set. The black bolded numbers indicate the current best results.

Method	AP	AP50	AP75
UCGNet [52]	32.8	53.1	33.9
ClusDet [25]	32.4	56.2	32.6
GLSAN [53]	32.5	55.8	33
PRDet [14]	38.6	60.8	40.6
DroneNet (+Crop)	39.11	62.1	42.5
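
A sliding-window crop of the kind referred to in Table 7 can be sketched as below. The tile size (640) and overlap (128 pixels) are placeholder assumptions, not the settings used in the experiments, and the function names are hypothetical.

```python
def window_starts(length, tile, stride):
    """Start offsets along one axis; the last window is flush with the border."""
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)
    return starts


def sliding_windows(image_w, image_h, tile=640, overlap_px=128):
    """Return crop rectangles (x1, y1, x2, y2) that cover the whole image."""
    stride = tile - overlap_px
    if stride <= 0:
        raise ValueError("overlap_px must be smaller than tile")
    return [(x, y, min(x + tile, image_w), min(y + tile, image_h))
            for y in window_starts(image_h, tile, stride)
            for x in window_starts(image_w, tile, stride)]


def to_full_image(box, window):
    """Shift a box predicted inside a crop back to full-image coordinates."""
    x1, y1, x2, y2 = box
    offset_x, offset_y = window[0], window[1]
    return (x1 + offset_x, y1 + offset_y, x2 + offset_x, y2 + offset_y)


windows = sliding_windows(1000, 640)
```

Detections from each crop are mapped back with `to_full_image`, and duplicates in the overlapping strips are typically merged with non-maximum suppression [33].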

Table 8. Speed comparison of DroneNet with the baseline model and TPH-YOLOv5 on the VisDrone dataset, measured on an NVIDIA GeForce RTX 2080 Ti GPU. The black bolded numbers indicate the current best results.

Method	mAP	FPS
Baseline [38]	21.2	84.74
TPH-YOLOv5 [27]	25.73	68.4
DroneNet	29.6	84.03
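
For context, DroneNet runs at 84.03 FPS against 84.74 FPS for the baseline, a slowdown of under 1% for a gain of 8.4 mAP. The sketch below shows one common way to measure frames per second; it is a generic timing harness with hypothetical names, not the measurement protocol used for Table 8, and `infer` stands in for any detector's forward pass.

```python
import time


def measure_fps(infer, inputs, warmup=2):
    """Average frames per second of `infer` over `inputs`.

    A few warm-up calls are made first so that one-time setup costs
    are not counted in the timed loop.
    """
    for item in inputs[:warmup]:
        infer(item)

    start = time.perf_counter()
    for item in inputs:
        infer(item)
    elapsed = time.perf_counter() - start

    return len(inputs) / elapsed if elapsed > 0 else float("inf")


def relative_slowdown(fps_reference, fps_model):
    """Fraction by which `fps_model` is slower than `fps_reference`."""
    return (fps_reference - fps_model) / fps_reference


# Stand-in workload; replace the lambda with a real model call.
fps = measure_fps(lambda frame: sum(range(1000)), list(range(20)))
slowdown = relative_slowdown(84.74, 84.03)
```

On a GPU, asynchronous execution means the device should be synchronized before each timestamp is read; that step is framework-specific and omitted here.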

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wang, X.; Yao, F.; Li, A.; Xu, Z.; Ding, L.; Yang, X.; Zhong, G.; Wang, S. DroneNet: Rescue Drone-View Object Detection. Drones 2023, 7, 441. https://doi.org/10.3390/drones7070441

