Article

Open Set Vehicle Detection for UAV-Based Images Using an Out-of-Distribution Detector

1 School of Mechatronical Engineering, Beijing Institute of Technology, Beijing 100081, China
2 Chongqing Innovation Center, Beijing Institute of Technology, Chongqing 401120, China
* Author to whom correspondence should be addressed.
Drones 2023, 7(7), 434; https://doi.org/10.3390/drones7070434
Submission received: 11 May 2023 / Revised: 29 June 2023 / Accepted: 29 June 2023 / Published: 30 June 2023

Abstract
Vehicle target detection is a key technology for reconnaissance unmanned aerial vehicles (UAVs). However, to obtain a larger reconnaissance field of view, such UAVs generally fly at high altitude, so vehicle targets occupy only a small proportion of the captured images. Moreover, owing to the nature of reconnaissance missions, previously unseen vehicle types are prone to appear in the surveillance area. Additionally, it is challenging for large deep-learning-based detectors to achieve real-time performance on UAV computing equipment. To address these problems, we propose a vehicle object detector specifically designed for UAVs. We modify the backbone of Faster R-CNN according to the target and scene characteristics and improve the localization accuracy of small-scale targets by adjusting the size and ratio of the anchors. Furthermore, we introduce a postprocessing method for out-of-distribution detection, enabling the designed detector to detect and distinguish untrained vehicle types. To tackle the scarcity of reconnaissance images, we construct two datasets using 3D modeling and image rendering techniques and evaluate our method on them. The proposed method achieves a 96% mean average precision at an IoU threshold of 0.5 (mAP50) on trained objects and a 71% mAP50 on untrained objects. Equivalent flight experiments demonstrate that our model, trained on synthetic data, achieves satisfactory detection performance and computational efficiency in practical applications.

1. Introduction

In recent years, unmanned aerial vehicles (UAVs) have increasingly played crucial roles in civilian and military applications, such as outdoor surveying, urban remote sensing, and ground reconnaissance [1,2,3]. For UAV ground reconnaissance missions, airborne optical equipment is commonly used for information collection. Consequently, the rapid and accurate detection and identification of collected image data have become pressing issues to address. Furthermore, due to the high-altitude flight nature of UAVs, ground target imaging occupies only a small portion of the overall image. Detecting small targets amidst vast and complex ground backgrounds is an exceptionally challenging task. As a result, numerous research efforts have been devoted to small target detection [4,5,6,7,8,9]. Although these studies have improved the model’s ability to detect small objects, they have also introduced computational delays due to increased model complexity. Differing from previous research, this paper focuses on the characteristics of target imaging, eliminates model redundancy, and enhances both detection accuracy and inference speed.
By considering the characteristics of UAV aerial images and the computing power of the laboratory’s current airborne equipment, this paper selects a deep-learning-based model as the detection algorithm. To the best of our knowledge, the majority of existing airborne detection algorithms are based on one-stage detectors [10]. However, previous research indicates that one-stage detectors suffer from the trade-off between positioning and classification, resulting in low positioning accuracy for small target detection [11]. On the other hand, two-stage target detectors separate the tasks of positioning and classification, leading to higher positioning accuracy but slower computational speed. To address this issue, our study aims to customize and modify a two-stage target detector to make it suitable for small target detection in the context of UAVs. Furthermore, we simplify the model to increase its computational speed. Among the various two-stage target detection algorithms, Faster R-CNN [12] is a widely used and well-established method with numerous open-source code implementations and relevant research. Hence, we select Faster R-CNN as the base detector for our approach. In reconnaissance applications, there is a challenge of encountering previously unseen instances of certain target types, which makes it difficult for existing target detectors to recognize them. This challenge is particularly relevant to UAV surveillance. Additionally, since Faster R-CNN is a generic object detector, we need to enhance it specifically for ground vehicle object detection. Lastly, the scarcity of UAV reconnaissance image data poses another challenge that needs to be addressed.
To address the aforementioned challenges, we propose an open set vehicle object detection method specifically designed for UAV ground reconnaissance. Our contributions can be summarized as follows:
  • We construct two datasets of vehicle targets captured from the perspective of drone aerial photography, effectively overcoming the issue of data scarcity in this research area.
  • We design a backbone network tailored for detecting small targets within complex backgrounds. Additionally, we adjust the anchors of the region proposal network (RPN) to enhance detection accuracy and speed.
  • We introduce a postprocessing classification method based on out-of-distribution detection, enabling the identification of vehicle classes that were not encountered during the training phase.
The rest of this manuscript is organized as follows: in Section 2, we present a review of the relevant literature on object detection and out-of-distribution detection algorithms. Section 3 outlines the structure of our designed object detector and describes the improvements made. Experimental results are provided in Section 4 to validate the effectiveness of our proposed method. Finally, we conclude this paper in Section 5.

2. Related Work

2.1. Object Detection Algorithm

Current state-of-the-art vision-based target detection algorithms predominantly rely on deep learning methods [13]. These methods can be broadly categorized into three types based on their model architectures: anchor-free detectors, one-stage detectors, and two-stage detectors.
Anchor-free detectors, such as CenterNet [14] and ExtremeNet [15], do not depend on predefined anchors for detection. Two-stage detectors, such as Faster R-CNN and SPPNet [16], first locate the object’s position and then classify it. On the other hand, single-stage detectors, exemplified by the classic SSD [17] and the YOLO series [18] of detection algorithms, simultaneously classify and locate objects.
Furthermore, some researchers have recently integrated ideas from the Natural Language Processing (NLP) field into computer vision-based target detection. For instance, ViT FRCNN [19] utilizes the Vision Transformer (ViT) architecture. However, these algorithms are primarily designed for general object detection and are not specifically tailored for small target detection in the context of UAVs.

2.2. Object Detection Algorithm for UAV

In order to cater to the requirements of UAV applications, researchers have been actively designing algorithms that consider the specific characteristics of these tasks. For example, H. Zhou et al. [11] made improvements to the YOLOV4 algorithm by optimizing the convolution operation and independently designing the localization and classification modules in the detection algorithm. Y. Liu et al. introduced a supervised spatial attention module (SSAM) to enhance target focus [20]. Y. Li et al. proposed the MBSSD mechanism for UAV surveillance, which addressed the issue of limited sample size through transfer learning [21]. These algorithms are all research efforts that take into account the application context; however, they do not possess the ability to recognize new categories when faced with open scenes.

2.3. Out-of-Distribution Detection Algorithm

The study of out-of-distribution (OOD) detection originated from a simple baseline method that utilizes the maximum softmax probability (MSP) as the score for in-distribution (ID)/out-of-distribution discrimination [22]. Among these methods, postprocessing techniques offer significant advantages as they do not require modifying the original training model. This advantage is crucial in real-world production scenarios where the cost of retraining the model may be prohibitive. ODIN [23] is a postprocessing method that enhances the ID/OOD score difference through temperature scaling and input perturbations. This method demonstrates that using sufficiently large temperatures converts softmax scores back into logit space, effectively discriminating between ID and OOD data.
Unlike confidence calibration, which focuses on representing the true correctness of ID data, ODIN scores aim to maximize the gap between ID and OOD data. Building on these insights, this study introduces the concept of postprocessing OOD to enable the recognition of new category objects during open detection.

3. Proposed Method

In this paper, we focus on detecting vehicle targets from the perspective of UAVs. This task is complicated by the small proportion of vehicle target pixels in imagery captured by UAV airborne cameras and by the potential occurrence of vehicle categories that were not encountered during training; moreover, current deep-learning-based detection algorithms require a large amount of training data to perform effectively. To support our study, we therefore designed a method for constructing the datasets within the proposed object detection framework.
Initially, we utilized open-source modeling and rendering software, Blender, to create models of 10 different types of vehicle targets and 11 distinct styles of landform scenes. Next, we combined the vehicle objects with the scenes to generate variations of vehicle objects appearing in different environments. Finally, leveraging the rendering engine of the software, we synthesized images of vehicles captured from the UAV’s aerial perspective, considering different lighting conditions within various scenes. These synthesized images form the basis of our dataset.
Once we have obtained the vehicle image data, the target detection algorithm requires the extraction of image features using a backbone network. In our research, we have selected Faster R-CNN as the base detection framework. The backbone network in general object detection algorithms facilitates multi-scale object detection through its large receptive field. In the original Faster R-CNN, VGG16 [24] is utilized as the backbone, with an input size of 1000 × 600 pixels. However, in UAV aerial images, even vehicles occupy a small proportion of the image, and there may not be significant variations among different vehicles. Considering these imaging characteristics of vehicle targets in UAV aerial images, it is unnecessary for the backbone network to possess an excessively large receptive field to cover the entire imaging domain. Therefore, custom modifications to the backbone network are necessary, taking into account the specific requirements of detecting vehicle targets in UAV aerial images.
The accuracy of object localization achieved by the RPN network is crucial for the overall performance of the Faster R-CNN algorithm. To improve target localization, we enhance the accuracy by using predefined anchors that closely align with the actual target size and aspect ratio. This approach significantly improves the positioning accuracy of the RPN network, especially considering the small proportion of imaging pixels occupied by vehicle targets from the UAV perspective and the minimal scale variation among the vehicles. The current target detection algorithms based on deep learning are primarily designed using convolutional neural networks (CNNs), where the input and output sizes are fixed during the design and training stages. Consequently, these models can only detect predefined categories of objects during the model prediction stage. To enable the model to recognize untrained vehicle classes, we introduce an out-of-distribution (OOD) detection method in the RoI detection head. This method is incorporated in a postprocessing manner and does not affect the original training process of the algorithm. By employing OOD detection, the model gains the capability to identify and distinguish untrained vehicle types that may appear during inference. This enhancement allows the model to handle novel vehicle classes that were not encountered during training without requiring modifications to the training process itself.
The overall structure of the proposed detector is illustrated in Figure 1. The synthetic data generation module employs techniques such as 3D modeling and image rendering to generate vehicle image data from the UAV’s perspective, considering various lighting conditions and scenes. For the backbone network, we reduce the receptive field by decreasing the number of convolutional layers to match the image size of vehicle targets in aerial images. Additionally, we incorporate the SENet attention module [25], which attenuates irrelevant feature channels while enhancing relevant feature channels. This helps address the issue of small targets being overwhelmed by complex backgrounds. To preserve the original image information without increasing the computational overhead, we design a downsampling network layer to replace the image downsampling step in the preprocessing stage. This ensures that the original image details are retained. The out-of-distribution detection module enables the recognition of unfamiliar vehicle categories by postprocessing the classification probabilities. This module operates separately and does not impact the original training process of the algorithm. Finally, the detection results are obtained by combining the localization and classification recognition, providing the final output of the detector.

3.1. Synthetic Data Generation

To address the scarcity of UAV aerial reconnaissance imagery, we employ data synthesis techniques to construct image datasets, aiming to avoid monotony in the backgrounds and target states. We use two methods to increase the diversity of images in the dataset, as illustrated in Figure 2.
First, we modeled the scene in multiple styles to cover various real ground forms and minimize the impact of a single background on the vehicle images, as depicted in Figure 2a. In addition, we introduced random variations in the position of the vehicle on the ground, which enhances the diversity of positive and negative samples and reduces the influence of irrelevant factors on the model.
Apart from modeling diverse scene styles, we also considered multiple types of vehicles in the dataset. This prevents the RPN positioning network from overfitting due to a small number of vehicle target categories, enabling the detection of unfamiliar vehicles in open scenes.
Considering the variations in UAV flying heights, the pixel ratio of the vehicle target in aerial images changes accordingly. Hence, we synthesized imaging data of vehicle targets under UAV aerial photography at two heights: 150 m and 300 m, as shown in Figure 2b. It is evident from the figure that, when the vehicle is 300 m away from the UAV, its imaging pixel ratio is significantly smaller.
During the modeling process, the vehicles were modeled in a 1:1 scale based on real vehicles, with the height accurately represented. When rendering, the intensity of ambient light was randomly adjusted to simulate different meteorological conditions in real scenes. The rendered output images have a size of 1920 × 1080 pixels.
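For illustration, the snippet below sketches how such a rendering loop could look with Blender's Python API (bpy); the object names ("Vehicle", "Sun"), placement ranges, and light-energy range are hypothetical placeholders rather than the exact setup used to build the datasets.

```python
import random
import bpy

def render_synthetic_frame(out_path, height_m):
    """Render one synthetic aerial frame at the given camera height (a sketch)."""
    scene = bpy.context.scene
    scene.render.resolution_x = 1920
    scene.render.resolution_y = 1080

    # Place the camera above the scene at the requested altitude (e.g., 150 m or 300 m).
    cam = scene.camera
    cam.location = (random.uniform(-20, 20), random.uniform(-20, 20), height_m)

    # Randomize the vehicle's ground position and heading ("Vehicle" is a placeholder name).
    vehicle = bpy.data.objects["Vehicle"]
    vehicle.location.x = random.uniform(-50, 50)
    vehicle.location.y = random.uniform(-50, 50)
    vehicle.rotation_euler.z = random.uniform(0.0, 6.28)

    # Randomize ambient light intensity to mimic different meteorological conditions.
    sun = bpy.data.objects["Sun"]
    sun.data.energy = random.uniform(1.0, 6.0)

    scene.render.filepath = out_path
    bpy.ops.render.render(write_still=True)

for i in range(10):
    render_synthetic_frame(f"//renders/vehicle_{i:04d}.png", height_m=300)
```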

3.2. Backbone Design

Faster R-CNN has demonstrated that using VGG16 as its backbone is effective in handling detection tasks with targets of different scales. However, for our specific application of aerial vehicle object detection, we made modifications to make it more suitable and also significantly reduce the network size. This modification aims to accelerate the inference speed of the entire detection network, enabling fast detection from the perspective of UAVs. Figure 3 illustrates the comparison between the modified backbone and the original VGG16 backbone.
Compared to the original VGG16 backbone, the modified network, DSR4-VGG16, reduces the number of convolutional layers, the computational complexity, and the receptive field. This modification is made because the objects under study occupy only a small proportion of image pixels and no large-scale objects need to be detected, so the modified backbone does not compromise the detection ability of the overall detector.
The modified backbone, R4-VGG16, reduces the original 13 convolutional layers to 4 layers, providing a more suitable architecture for small object detection. Additionally, an attention mechanism is introduced in the backbone to suppress background features and enhance the features of the detected objects.
To address the issue of preprocessing methods potentially destroying the pixels of small objects, the first convolutional layer of the backbone is modified to include downsampling functionality. This modification, referred to as DSR4-VGG16, increases the convolutional step size, effectively preventing the destruction of small object pixels and improving the detection accuracy of small objects without significantly increasing computational costs. Figure 4 illustrates the specific structure of our proposed backbone network, DSR4-VGG16.
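The exact layer configuration of DSR4-VGG16 is given only graphically in Figures 3 and 4, so the PyTorch sketch below is an assumed interpretation: a stride-2 downsampling first convolution, four convolutional layers derived from VGG16, and an SENet-style squeeze-and-excitation block. The channel widths and pooling placement are illustrative guesses, not the authors' exact design.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention (Hu et al., 2018)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))   # squeeze: global average pooling over H, W
        return x * w.view(b, c, 1, 1)     # excite: reweight feature channels

class DSR4Backbone(nn.Module):
    """Assumed sketch of DSR4-VGG16: downsampling first conv + 4 conv layers + SE attention."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),  # downsampling layer
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(inplace=True),
            SEBlock(512),
        )

    def forward(self, x):
        return self.features(x)

feat = DSR4Backbone()(torch.randn(1, 3, 540, 960))  # e.g., a 960 x 540 input image
print(feat.shape)
```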

3.3. RPN’s Anchor Adjustment

The Faster R-CNN algorithm utilizes a region proposal network (RPN) for object localization by employing predefined anchors with fixed sizes and ratios on the input image. These anchors assist in regression learning to accurately locate the targets. However, the default anchor size and aspect ratio used in the original Faster R-CNN may not be suitable for detecting vehicle targets from the perspective of UAVs, where adjustments are necessary.
To address this issue, this study examines the positioning principle of the RPN network and conducts statistical analysis on the pixel size and aspect ratio of vehicles in the vehicle image data. The statistical results of the vehicle target’s pixel size are presented in Table 1. The table reveals that, when aerial photography is conducted at a height of 150 m, the average pixel size of the vehicle target is 84 × 84 pixels. Similarly, at a height of 300 m, the average pixel size reduces to 38 × 38 pixels. It is worth noting that the pixel sizes of the vehicle targets exhibit consistency in both length and width dimensions. This observation indirectly indicates that the vehicle targets in the constructed dataset are sufficiently randomized in orientation and possess diverse data representations.
The statistical results of the aspect ratio of the vehicle target’s imaging pixels are depicted in Figure 5. To enhance clarity, reciprocal processing is applied when the ratio is less than 1. Figure 5 reveals that the aspect ratio of the vehicle target is primarily concentrated around two ratios: 1:1 and 2:1. Considering that the proportion of target pixels in this study is small compared to the target pixel size of the public dataset, the distinction between the 1:1 and 2:1 ratios is not significant at this stage.
To improve computational efficiency, the anchor ratio of the original Faster R-CNN algorithm is adjusted from (1:2, 1:1, 2:1) to (1:1) in this study; choosing only the 1:1 ratio reduces the number of anchor-related convolution kernels to one-third of the original. Additionally, for the dataset corresponding to an aerial photography height of 150 m, the anchor size is adjusted from the original (128, 256, 512) to (80). Similarly, for an aerial photography altitude of 300 m, the anchor size is adjusted to (40).
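As a rough illustration (not the authors' implementation), the following snippet generates the single-size, 1:1-ratio anchors described above over an RPN feature grid; the feature stride and grid size are assumed values.

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride=16, size=80, ratio=1.0):
    """Generate anchors of one size and one aspect ratio over a feature grid (sketch)."""
    w = size * np.sqrt(ratio)
    h = size / np.sqrt(ratio)
    # Centers of each feature-map cell mapped back to input-image coordinates.
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    boxes = np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=-1)
    return boxes.reshape(-1, 4)  # (feat_h * feat_w, 4) in (x1, y1, x2, y2)

# One 80 x 80 anchor per cell for the 150 m dataset; use size=40 for the 300 m dataset.
anchors_150m = make_anchors(feat_h=34, feat_w=60, size=80)
print(anchors_150m.shape)
```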

3.4. Open Set Vehicle Recognition

In the field of image classification, a common problem arises when deploying a trained classification model to an actual scene: when presented with an image of a new class that the model has never seen before, the model tends to make incorrect predictions. We also encountered a similar problem when deploying our trained model to real scenes, where the model could detect unfamiliar vehicle targets but often assigned incorrect labels. However, we noticed that the probability assigned to untrained vehicle classes was usually very low, although there were a few cases with high probability scores.
In the Faster R-CNN target detection framework, the target positioning and classification are performed separately. The region proposal network handles target positioning, while the RoI detection head is responsible for classifying the positioning results and refining the positioning accuracy. Therefore, we identified the RoI detection head as a potential source of the problem.
To address these issues and enable the trained model to have certain positioning and classification abilities for emerging vehicles, we introduce an out-of-distribution detection method. After the RoI detection head calculates the logits for each predefined category, a softmax operation is performed on the logits to obtain the probability vector. Based on this probability vector, we determine whether the target at the positioning point belongs to a sample within the known distribution or outside of it. If it belongs to a sample outside of the distribution, we label it as an “Unknown Vehicle”.
In this paper, we adopt ODIN, an out-of-distribution detection method. ODIN applies temperature scaling and input preprocessing to the maximum softmax probability (MSP), which enhances the differentiation between samples within and outside the known distribution, leading to accurate out-of-distribution detection. Because input preprocessing would require the model to run a second forward pass for each predicted sample, we only introduce temperature scaling in this study, achieving efficient out-of-distribution detection while limiting model complexity and computational overhead.
Assume that the neural network f = (f_1, \ldots, f_N) is trained to classify N classes. For each RoI x, the MSP score(x) is calculated from f as:
score(x) = \max\{p_1, \ldots, p_N\}, \quad p_i = \frac{\exp(f_i(x))}{\sum_{j=1}^{N} \exp(f_j(x))} \qquad (1)
where f_i(x) is the logit of RoI x for class i, p_i is the corresponding classification probability after applying the softmax function, the max function selects the maximum softmax probability used to judge the distribution of x, and score(x) is the resulting MSP score for RoI x.
The larger score(x) is, the more likely RoI x is an ID sample; a small score(x) suggests an OOD sample.
In this paper, we calculate the MSP score(x) with temperature scaling as:
score(x) = \max\{p_1, \ldots, p_N\}, \quad p_i = \frac{\exp(f_i(x)/T)}{\sum_{j=1}^{N} \exp(f_j(x)/T)} \qquad (2)
where T \in \mathbb{R}^{+} is the temperature scaling parameter, which is set to 1 during training. Previous work [26] has demonstrated that temperature scaling can separate the softmax scores of in-distribution and out-of-distribution images, making out-of-distribution detection efficient. In addition, x, f_i(x), p_i, and the max function are the same as in Formula (1).
Next, we compare the calibrated MSP score(x) with a threshold δ. If score(x) is greater than the threshold, the sample x is classified as ID and the corresponding category label is attached; otherwise, it is classified as OOD and an “Unknown Vehicle” label is attached. Mathematically, it can be described as:
y(x) = \begin{cases} \arg\max_i p_i, & \text{if } score(x) \ge \delta \\ N+1, & \text{if } score(x) < \delta \end{cases} \qquad (3)
where δ is the distribution discrimination threshold; score(x), N, and p_i are the same as in Formula (1); and y(x) is the label assigned to RoI x. In this study, we empirically set δ to 0.8.
By employing this form of out-of-distribution detection based on classification probabilities during postprocessing, the original model can be empowered to identify potential objects in the scene that were not encountered during training, referred to as untrained categories, without the need for retraining.
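A minimal PyTorch sketch of this postprocessing step is shown below, assuming the RoI head exposes per-RoI class logits; T and δ must be chosen jointly (δ = 0.8 is used in this study), and the extra label index N stands for "Unknown Vehicle".

```python
import torch

def classify_with_ood(logits, temperature=1.0, delta=0.8):
    """Temperature-scaled MSP postprocessing for open set RoI classification (sketch).

    logits: (num_rois, N) class logits from the RoI detection head.
    Returns labels in [0, N-1] for in-distribution RoIs and N for "Unknown Vehicle".
    """
    probs = torch.softmax(logits / temperature, dim=1)  # Formula (2)
    scores, labels = probs.max(dim=1)                   # MSP score and argmax class
    unknown = scores < delta                            # Formula (3): below-threshold RoIs are OOD
    labels = labels.clone()
    labels[unknown] = logits.shape[1]                   # assign the extra "Unknown Vehicle" label
    return labels, scores

labels, scores = classify_with_ood(torch.randn(5, 10))  # 5 RoIs, 10 trained classes
```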
In this research on target detection, we deviate from conventional approaches by addressing the issue of misidentification of out-of-distribution samples that is commonly encountered in model deployment.
In conclusion, we utilized open-source modeling and rendering software to construct two vehicle datasets from the perspective of UAV aerial photography. To tackle the challenge of detecting small targets in complex backgrounds and reduce model complexity, we modified the backbone of the detector from VGG16 to DSR4-VGG16 based on the convolution calculation method and its data flow direction. To enhance target positioning accuracy and speed, we conducted statistical analysis on the size and proportion of target imaging and accordingly adjusted the anchor size and proportion. Furthermore, to address the problem of vehicle target recognition in open scenes, we introduced an out-of-distribution detection method based on classification probabilities as a postprocessing step, enabling the model to identify emerging classes beyond the predetermined types.

4. Experiments

In order to evaluate the proposed vehicle target detection algorithm for open scenes, we initially constructed two specific datasets. One of these datasets was used as a test set for conducting experiments to compare the detection performance and inference speed. The other dataset was utilized as a validation set to assess the recognition capability of the proposed algorithm when encountering new sample classes. Finally, we designed a corresponding flight experiment to verify the detection and inference performance of the proposed algorithm on a small UAV.

4.1. Datasets and Evaluation Criteria

The algorithm we developed focuses on vehicle target detection from the perspective of UAVs. For this study, we constructed 10 different types of vehicle models. Additionally, to maximize the available data, we created 11 ground scenes to position our vehicle models. Since our goal is to simulate the UAV’s perspective in capturing vehicle images, we needed to set specific parameters for the rendering camera. To ensure consistency with the subsequent equivalent flight experiments, we configured the rendering camera parameters to match those of the flight experiment camera, with an output resolution of 1920 × 1080 pixels and a field of view of 85°. During the rendering process, two camera heights were selected, as depicted in Figure 6. One height was set at approximately 300 m to simulate capturing vehicle targets under high-altitude conditions, while the other was set at around 150 m to simulate capturing vehicle targets under low-altitude conditions.
A total of 1356 images were collected and labeled at a height of 300 m, and the resulting dataset was named BIT-VEHICLE10-300. We have made this dataset publicly available for research [27]. It is divided into a training set, a validation set, and a test set, consisting of 814, 271, and 271 images, respectively. Additionally, a total of 1555 images were collected and labeled at a height of 150 m; this dataset, named BIT-VEHICLE10-150, is also publicly available for research [27] and is divided into a training set, validation set, and test set containing 1046, 250, and 259 images, respectively. During the data synthesis process, we varied the intensity of the light sources to simulate the lighting conditions encountered by UAVs in real environments, ensuring diversity in illumination. An example image is shown in Figure 6a, where the shadow cast by the vehicle's barrel on the ground is clearly visible due to the lighting settings.
For algorithm performance evaluation, we utilize common metrics in object detection, including mean average precision (mAP) and frames per second (FPS). Intersection over Union (IoU) is used to measure the overlap between ground truth and predicted bounding boxes. mAP evaluates classification and localization performance based on IoU and class confidence scores during prediction. FPS measures the algorithm’s inference speed on a specific computing platform. In this study, we calculate two types of average precision using IoU thresholds of 0.50 and 0.70 (referred to as mAP50 and mAP70) to assess the algorithm’s positioning accuracy in more detail. The mAP metric follows the VOC2007 criteria [28]. FPS is determined by averaging the inference speed during model testing.
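For reference, the IoU underlying both thresholds can be computed as in the short sketch below, with boxes given in (x1, y1, x2, y2) pixel coordinates.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

# A prediction counts as correct at mAP50 if IoU >= 0.5 (0.7 for mAP70) and the class matches.
print(iou((100, 100, 180, 180), (120, 110, 200, 190)))
```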

4.2. Implementation Details and Settings

We selected Faster R-CNN as our baseline model because our proposed detection network is designed based on the Faster R-CNN algorithm framework. To evaluate the contributions of each component in our method, we conducted ablation experiments on the BIT-VEHICLE10-300 dataset. For comparison, we implemented several object detection models on the BIT-VEHICLE10-300 dataset, including SSD, YOLOv4 [29], PPYOLOv2 [30], Faster R-CNN (based on VGG16 as the backbone), FR-H [31], SCRDet [32], PPYOLO [33], PPYOLOE [34], FCOS [35], and PicoDet_s [36]. These models include both single-stage and two-stage detectors. All detectors were trained using images of the same size (640 × 360 pixels) and pretrained models on the COCO dataset [37] to facilitate faster and better convergence. The training period was set to 15 epochs, the initial learning rate was set to 0.001, and the optimizer used was Adam. The learning rate was decayed by a factor of 0.1 at the 11th epoch, and the batch size was set to 1. To evaluate the recognition ability of our proposed algorithm on new vehicle categories, we tested it on the BIT-VEHICLE10-150 dataset. All training and testing experiments were performed on an NVIDIA RTX 2070 GPU. In the flight experiments, we deployed the models trained on the host machine to drones equipped with Nvidia Jetson Xavier NX.
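The training schedule above corresponds roughly to the following PyTorch optimizer setup; this is only a sketch of the stated hyperparameters, with a placeholder module standing in for the detector.

```python
import torch

model = torch.nn.Linear(10, 10)  # placeholder for the detection network
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# Decay the learning rate by a factor of 0.1 at the 11th of the 15 epochs; batch size is 1.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[11], gamma=0.1)

for epoch in range(15):
    # ... one pass over the training set with batch size 1 ...
    scheduler.step()
```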

4.3. Experiment Results

4.3.1. Ablation Experiment

To analyze the effectiveness of our proposed components, we conducted ablation studies on anchor adjustment, backbone redesign, and method integration.
Anchor Adjustment. We investigated the impact of adjusting the size and scale of the anchors based on Faster R-CNN. Figure 7 illustrates the change in model loss and detection performance (mAP) on the validation set during training. Figure 7a shows that the model parameters converge quickly when the anchors are adjusted. Figure 7b demonstrates that the performance of the anchor-adjusted model improves to some extent, particularly when IoU = 0.5. The final results on the test set are presented in Table 2. The data in the table indicate that adjusting the anchors enhances the detection performance of the detector; comparing performance under the two IoU thresholds shows that the accuracy improves significantly.
Backbone Redesign. The backbone redesign involves three components: adding the attention module, reducing the number of network layers of the original backbone, and modifying the first convolutional layer into a downsampling convolutional layer.
SENet-VGG16. In SENet-VGG16, a SENet module is incorporated into the original VGG16 backbone to enhance target-related features and suppress background-related features on the feature channel. The experimental results are presented in Figure 8 and Table 3. Compared to the detector based on the original VGG16 backbone, the introduction of SENet improves the mAP50 and mAP70 in terms of detection performance. This demonstrates that incorporating the attention mechanism into the feature extraction process can enhance the accuracy of detecting small objects in large scenes.
RX-VGG16. R4-VGG16, R5-VGG16, and R7-VGG16 are variants of the original VGG16 backbone with a reduced number of convolutional layers. Specifically, R4-VGG16 has four convolutional layers, R5-VGG16 has five convolutional layers, and R7-VGG16 has seven convolutional layers. This reduction in the number of layers aims to decrease the receptive field and maintain high spatial resolution, thereby minimizing the impact of deep networks on small object detection.
Table 4 and Figure 9 present the experimental results of detectors based on different variants of VGG16. Compared to the original VGG16 backbone, R4-VGG16, R5-VGG16, and R7-VGG16 detectors show significant improvements in detection accuracy and computational efficiency. Particularly, R4-VGG16 achieves a remarkable increase in detection accuracy, with a 30% improvement in mAP50 and a 37% improvement in mAP70. It also demonstrates a computational efficiency improvement of 5 fps. However, it is worth noting that R4-VGG16 has lower computational efficiency compared to R5-VGG16 due to the removal of an additional pooling layer, resulting in larger feature maps and increased data volume in subsequent layers.
DSN-VGG16. DSN-VGG16 modifies the first convolutional layer of the original VGG16 backbone by converting it into a downsampling convolutional layer. This modification aims to maintain computational efficiency while preserving the original image information, thereby minimizing its impact on small object detection. The experimental results, depicted in Figure 10 and Table 5, showcase the performance of the detector based on DSN-VGG16. In comparison to the detector based on the original VGG16 backbone, the DSN-VGG16-based detector exhibits no impact on computational speed while demonstrating certain improvements in detection accuracy. Specifically, there is a 6% increase in mAP performance when IOU = 0.7, indicating a significant enhancement in the positioning accuracy for small targets. This improvement is achieved by implementing the downsampling network layer to replace the downsampling in the image preprocessing stage.
Method Integration. DSR4-Faster R-CNN-AA is an improved method that integrates the four aforementioned enhancement techniques. The experimental results, depicted in Figure 11 and Table 6, compare the proposed DSR4-Faster R-CNN-AA with the baseline method, Faster R-CNN. The improved method exhibits significant enhancements in both detection accuracy and computational efficiency.
When the input image resolution is set to 960 × 540 pixels, DSR4-Faster R-CNN-AA outperforms the original detector in terms of mAP50, which is increased by 33%, and mAP70, which is increased by 57%. Additionally, the frames per second (FPS) is improved by 8 fps. These improvements highlight the substantial enhancement in the proposed method’s ability to accurately detect small targets, with an mAP70 reaching 67%.
Table 6 reveals that, when the input image size is reduced, the performance of the original detector decreases significantly. However, the proposed DSR4-Faster R-CNN-AA method experiences only a slight decrease in detection performance, but it still outperforms the original detector, even when the input image size is not reduced. The mAP50 remains high at 68%, and the detector achieves a faster processing speed, reaching 37 fps. In practical scenarios, the image size can be adjusted based on the desired trade-off between detection accuracy and speed.

4.3.2. Results of BIT-VEHICLE10-300 Datasets

We conducted a comparison of the detection performance between our designed method and several existing object detectors using the BIT-VEHICLE10-300 dataset, with an input image size of 640 × 360 pixels. The results of the tests are presented in Table 7. From the table, it is evident that our designed method outperforms other methods in terms of detection performance. It achieves an mAP50 of 81% and an mAP70 of 45%, while maintaining a detection speed that is twice as fast as the baseline method (32.4 fps).
Among the compared methods, the single-stage detector YOLOv4 exhibits the fastest detection speed, but its detection accuracy, as measured by mAP50, is only 9%. The superior performance of our designed method can be attributed to the following factors: (1) adjusting the anchor size and scale to align more closely with the objects being detected, thereby improving the detector's positioning accuracy; (2) incorporating SENet, which enhances the feature extraction capability of the modified backbone by suppressing background interference features and emphasizing target-related features; (3) reducing the number of convolutional layers in the backbone, which improves the spatial resolution of feature extraction and enhances the focus on small objects; (4) employing downsampling convolution instead of image downsampling preprocessing, thereby avoiding the loss of original image information; and (5) reducing the number of convolutional layers in the modified backbone to roughly one-third of the original, resulting in faster detection speed.
Figure 12 showcases some detection results from the BIT-VEHICLE10-300 dataset using both the baseline method Faster R-CNN and our algorithm. It is apparent that our algorithm exhibits a stronger ability to detect small targets without missing detections and achieves higher accuracy in terms of target positioning and bounding box drawing.

4.3.3. Results of BIT-VEHICLE10-150 Datasets

The design of our method takes into account the scenario where vehicle classes that were not included in the training data appear during the prediction stage. We address this by introducing an out-of-distribution detection method that identifies out-of-distribution vehicle classes based on the classification probabilities calculated by the RoI detection head. This design offers two advantages. Firstly, it avoids the need to retrain the original network model. Secondly, it does not increase the complexity of the original network model.
To demonstrate the detection capability of the proposed method when encountering such scenarios, we trained and tested the open-set-oriented detection method DSR4-Faster R-CNN-AA-O on the BIT-VEHICLE10-150 dataset. The experimental results are presented in Table 8. As shown in the table, the designed detection method achieves a detection accuracy of 71% for mAP50 and 36% for mAP70. Most vehicle categories exhibit detection accuracy ranging from 80% to 90%. Although the detection accuracy for the newly emerging vehicle class “Unknown Vehicle” is relatively low at 42%, the method still demonstrates the ability to distinguish untrained vehicle classes from the trained ones.
Figure 13 illustrates the visualization results obtained from Faster R-CNN and our designed algorithm when detecting classes that include untrained vehicles. When encountering untrained vehicle objects, Faster R-CNN tends to produce incorrect classification results with low confidence probabilities, so these instances are typically filtered out during visualization. In contrast, by leveraging the classification probabilities, the proposed algorithm recognizes untrained vehicle objects that the model cannot confidently assign to a known class and labels them as unknown (emerging) vehicles.

4.4. Flight Experiments

Because the proposed detection algorithm is trained and evaluated using synthetic data, we conducted an equivalent UAV flight experiment to validate the effectiveness of the algorithm in real-world scenarios. We deployed the entire proposed method during the UAV flight experiment.

4.4.1. Experiment Settings

The UAV used in the experiment is a quadrotor UAV developed by our lab. It is equipped with an onboard optical camera and an action camera. The optical camera is mounted on the UAV’s head using a single-degree-of-freedom turntable, allowing remote control to adjust its attitude. This camera is primarily used for aerial photography, providing an overhead view of the ground. The action camera is fixed on the back of the UAV and is used to observe the horizon in high-altitude environments and assist in maintaining the UAV’s horizontal attitude.
During the experiment, we focused on using the optical camera for ground shooting. The camera captures images with a resolution of 1920 × 1080 pixels. To deploy our algorithm on the UAV, we utilized an Nvidia Jetson Xavier NX processor. In order to simulate ground vehicle targets, we placed a vehicle model outdoors. The size of the model is approximately 200 cm × 10 cm × 5 cm. The overall setup of the outdoor flight experiment can be seen in Figure 14.
The designed algorithm is implemented on the GPU of Nvidia Jetson Xavier NX using PyTorch 1.1.0. The entire UAV program runs on the Robot Operating System (ROS), and the ROSBAG tool is used to record onboard data during the experiment, including detection results and calculation speed during detection.
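For context, the onboard deployment roughly follows the pattern sketched below: a rospy node subscribes to the camera topic, runs the detector on each frame, and publishes results that ROSBAG can record. The topic names, message format, and detector hook are assumptions, not the exact node used in the experiment.

```python
import rospy
from sensor_msgs.msg import Image
from std_msgs.msg import String
from cv_bridge import CvBridge

bridge = CvBridge()
pub = None

def on_image(msg):
    """Run detection on each incoming camera frame and publish a text summary (sketch)."""
    frame = bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
    # detections = detector(frame)  # the trained detector would be called here
    detections = []
    pub.publish(String(data=f"{msg.header.stamp.to_sec():.3f}: {len(detections)} vehicles"))

if __name__ == "__main__":
    rospy.init_node("vehicle_detector")
    pub = rospy.Publisher("/detections", String, queue_size=10)
    rospy.Subscriber("/camera/image_raw", Image, on_image, queue_size=1)
    rospy.spin()  # results on /detections can be recorded with: rosbag record /detections
```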
We conducted on-the-fly experiments using the aforementioned setup to verify the detection performance of our proposed method in real-world environments. To match the size of the vehicle model, the flying height of the UAV was set to 10 m to maintain consistency with the synthetic data. The flight experiment scene and ground shooting images are shown in Figure 15.

4.4.2. Results

It can be observed from Figure 16 that our detection algorithm successfully detects small objects in a wide field of view, demonstrating its ability to perform high-resolution object detection. Additionally, the experiment environment differs significantly from the synthetic data background and is more cluttered; nevertheless, our algorithm is not affected by the background and successfully detects the target. This is attributed to the attention enhancement module, which weakens background features and prevents the model from overfitting to background-related features. Despite camera vibration during UAV shooting, which results in lower image quality than the synthetic data, our algorithm still detects the target, indicating its capability to suppress background interference.
We re-evaluated the real-time performance of our proposed method using the experimental data recorded with the ROSBAG tool. The algorithm's runtime on the UAV for input images of different sizes is presented in Table 9. During our flight tests, we used an input image size of 960 × 540 pixels and, without any acceleration library, the algorithm achieved a processing speed of 11 fps. For images with an input size of 480 × 270 pixels, the average speed reached 19 fps, which is sufficient to meet real-time application requirements on UAVs.
Table 9 illustrates that the majority of the algorithm’s runtime is attributed to the network model inference. In future applications, we can leverage acceleration libraries such as TensorRT to optimize the network model’s runtime by achieving accelerated computations. This optimization can further enhance the real-time performance of our algorithm.
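As one possible route (an assumption, not a step reported in this work), the trained model could be exported to ONNX and then built into a TensorRT engine (e.g., with trtexec) on the Jetson; the sketch below shows only the export step, with a placeholder module standing in for the detector.

```python
import torch
import torch.nn as nn

# Placeholder module standing in for the trained detector (the real model is not shown here).
model = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU())
model.eval()
dummy = torch.randn(1, 3, 540, 960)  # matches the 960 x 540 input used in flight tests
# Export to ONNX; the file can then be converted to a TensorRT engine on the Jetson.
torch.onnx.export(model, dummy, "detector.onnx", opset_version=11,
                  input_names=["image"], output_names=["features"])
```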

5. Conclusions

Aiming at the challenges of small-scale imaging, untrained vehicle categories, and airborne real-time requirements for UAV-to-ground vehicle detection, we designed a detector to solve the above detection problems. In order to conduct vehicle detection research, we use modeling and image rendering to construct two datasets of vehicles from the perspective of UAVs. We redesigned the backbone network, namely DSR4-VGG16, according to the characteristics of target imaging and evaluated its performance and speed through ablation experiments. For target positioning, we adjusted the size and scale of the anchor, which greatly improved the positioning accuracy of the detector. For the detection problem of untrained vehicles, we introduce a postprocessing out-of-distribution detection method, which realizes the positioning and differentiation of untrained vehicle objects without changing the original training model. In order to verify the effectiveness of the algorithm trained on synthetic data, we designed a UAV equivalent flight experiment to prove the effectiveness of this research approach. In the next step, we will make improvements in the acceleration of model deployment to enhance real-time performance.

Author Contributions

F.Z. and W.L. designed the study. Y.S. and Z.Z. conducted the review of relevant literature. F.Z. constructed the network architecture and wrote the manuscript. W.M. and C.L. carried out the labeling work of the dataset. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was supported by Natural Science Foundation of Chongqing (No. 2020ZX 1200048).

Data Availability Statement

The datasets generated during the current study can be found in Ref. [27], and relevant code implementations are available from the authors upon reasonable request.

Acknowledgments

Thanks to the State Key Laboratory of Explosive Science and Technology, Beijing Institute of Technology, for providing the experimental platform.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Christiansen, M.P.; Laursen, M.S.; Jørgensen, R.N.; Skovsen, S.; Gislum, R. Designing and Testing a UAV Mapping System for Agricultural Field Surveying. Sensors 2017, 17, 2703. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  2. Preethi Latha, T.; Naga Sundari, K.; Cherukuri, S.; Prasad, M.V. Remote Sensing UAV/Drone technology as a tool for urban development measures in APCRDA. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2019, 42, 525–529. [Google Scholar] [CrossRef] [Green Version]
  3. Jayaweera, H.M.P.C.; Hanoun, S. UAV Path Planning for Reconnaissance and Look-Ahead Coverage Support for Mobile Ground Vehicles. Sensors 2021, 21, 4595. [Google Scholar] [CrossRef]
  4. Yousefi, D.B.M.; Rafie, A.S.M.; Al-Haddad, S.A.R.; Azrad, S. A Systematic Literature Review on the Use of Deep Learning in Precision Livestock Detection and Localization Using Unmanned Aerial Vehicles. IEEE Access 2022, 10, 80071–80091. [Google Scholar] [CrossRef]
  5. Liu, Y.; Sun, P.; Wergeles, N.; Shang, Y. A survey and performance evaluation of deep learning methods for small object detection. Expert Syst. Appl. 2021, 172, 114602. [Google Scholar] [CrossRef]
  6. Tong, K.; Wu, Y. Deep learning-based detection from the perspective of small or tiny objects: A survey. Image Vis. Comput. 2022, 123, 104471. [Google Scholar] [CrossRef]
  7. Kiyak, E.; Unal, G. Small aircraft detection using deep learning. Aircr. Eng. Aerosp. Technol. 2021, 93, 671–681. [Google Scholar] [CrossRef]
  8. Bosquet, B.; Mucientes, M.; Brea, V.M. STDnet: Exploiting high resolution feature maps for small object detection. Eng. Appl. Artif. Intell. 2020, 91, 103615. [Google Scholar] [CrossRef]
  9. Cao, C.; Wang, B.; Zhang, W.; Zeng, X.; Yan, X.; Feng, Z.; Liu, Y.; Wu, Z. An improved faster R-CNN for small object detection. IEEE Access 2019, 7, 106838–106846. [Google Scholar] [CrossRef]
  10. Wu, X.; Li, W.; Hong, D.; Tao, R.; Du, Q. Deep learning for unmanned aerial vehicle-based object detection and tracking: A survey. IEEE Geosci. Remote Sens. Mag. 2021, 10, 91–124. [Google Scholar] [CrossRef]
  11. Zhou, H.; Ma, A.; Niu, Y.; Ma, Z. Small-Object Detection for UAV-Based Images Using a Distance Metric Method. Drones 2022, 6, 308. [Google Scholar] [CrossRef]
  12. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  13. Tong, K.; Wu, Y.; Zhou, F. Recent advances in small object detection based on deep learning: A review. Image Vis. Comput. 2020, 97, 103910. [Google Scholar] [CrossRef]
  14. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 6568–6577. [Google Scholar]
  15. Zhou, X.; Zhuo, J.; Krahenbuhl, P. Bottom-up object detection by grouping extreme and center points. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 850–859. [Google Scholar]
  16. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [Green Version]
  17. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  18. Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A Review of Yolo algorithm developments. Procedia Comput. Sci. 2022, 199, 1066–1073. [Google Scholar] [CrossRef]
  19. Beal, J.; Kim, E.; Tzeng, E.; Park, D.H.; Zhai, A.; Kislyuk, D. Toward transformer-based object detection. arXiv 2020, arXiv:2012.09958. [Google Scholar]
  20. Liu, Y.; Yang, F.; Hu, P. Small-object detection in UAV-captured images via multi-branch parallel feature pyramid networks. IEEE Access 2020, 8, 145740–145750. [Google Scholar] [CrossRef]
  21. Yundong, L.; Han, D.; Hongguang, L.; Zhang, X.; Zhang, B.; Zhifeng, X. Multi-block SSD based on small object detection for UAV railway scene surveillance. Chin. J. Aeronaut. 2020, 33, 1747–1755. [Google Scholar]
  22. Hendrycks, D.; Gimpel, K. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv 2016, arXiv:1610.02136. [Google Scholar]
  23. Liang, S.; Li, Y.; Srikant, R. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv 2017, arXiv:1706.02690. [Google Scholar]
  24. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  25. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  26. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  27. Fei, Z.; Wenzhong, L.; Yi, S.; Zihao, Z.; Wenlong, M.; Chenglong, L. Open Set Vehicle Detection for UAV-Based Images Using an Out-of-Distribution Detector. Available online: https://github.com/zhaoXF04/BIT-VEHICLE10-150-300 (accessed on 11 May 2023).
  28. Zhou, P.; Ni, B.; Geng, C.; Hu, J.; Xu, Y. Scale-transferrable object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 528–537. [Google Scholar]
  29. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  30. Huang, X.; Wang, X.; Lv, W.; Bai, X.; Long, X.; Deng, K.; Dang, Q.; Han, S.; Liu, Q.; Hu, X.; et al. PP-YOLOv2: A practical object detector. arXiv 2021, arXiv:2104.10419. [Google Scholar]
  31. Ding, J.; Xue, N.; Xia, G.S.; Bai, X.; Yang, W.; Yang, M.Y.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; et al. Object detection in aerial images: A large-scale benchmark and challenges. arXiv 2021, arXiv:2102.12219. [Google Scholar] [CrossRef]
  32. Yang, X.; Yang, J.; Yan, J.; Zhang, Y.; Zhang, T.; Guo, Z.; Sun, X.; Fu, K. SCRDet: Towards More Robust Detection for Small, Cluttered and Rotated Objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 8232–8241. [Google Scholar]
  33. Long, X.; Deng, K.; Wang, G.; Zhang, Y.; Dang, Q.; Gao, Y.; Shen, H.; Ren, J.; Han, S.; Ding, E.; et al. PP-YOLO: An effective and efficient implementation of object detector. arXiv 2020, arXiv:2007.12099. [Google Scholar]
  34. Xu, S.; Wang, X.; Lv, W.; Chang, Q.; Cui, C.; Deng, K.; Wang, G.; Dang, Q.; Wei, S.; Du, Y.; et al. PP-YOLOE: An evolved version of YOLO. arXiv 2022, arXiv:2203.16250. [Google Scholar]
  35. Tian, Z.; Shen, C.; Chen, H.; He, T. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 9627–9636. [Google Scholar]
  36. Yu, G.; Chang, Q.; Lv, W.; Xu, C.; Cui, C.; Ji, W.; Dang, Q.; Deng, K.; Wang, G.; Du, Y.; et al. PP-PicoDet: A Better Real-Time Object Detector on Mobile Devices. arXiv 2021, arXiv:2111.00902. [Google Scholar]
  37. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
Figure 1. The proposed open set vehicle detection method consists of three main components: a synthetic data generation module, a backbone network, and an out-of-distribution detection module. This structure enables object localization and open set recognition.
Figure 2. The scene models and vehicle object models were constructed using 3D modeling and rendering software, and the rendered composite image data were outputted. (a) illustrates the 11 constructed scene models, while (b) represents the synthetic image data that were rendered and outputted for algorithm training.
Figure 3. The comparison of backbone networks includes VGG16, which serves as the backbone for Faster R-CNN, R4-VGG16, a modified backbone designed specifically for small object detection, and DSR4-VGG16, which incorporates SENet and downsampling network layers.
Figure 4. The structure of the backbone network DSR4-VGG16. Conv represents a convolutional layer. Relu represents the activation function. Pooling represents the pooling layer. SENet represents the attention network layer.
Figure 5. The statistical results of the aspect ratio of vehicle imaging pixels in the two datasets. (a) is the statistical result corresponding to the BIT-VEHICLE10-300, and (b) is the statistical result corresponding to the BIT-VEHICLE10-150.
Figure 6. (a) An example image from the vehicle dataset constructed at a height of 300 m. (b) An example image from the vehicle dataset constructed at a height of 150 m.
Figure 7. (a) The change in loss during training. (b) The change in detection accuracy of the model on the validation set during training.
Figure 8. (a) The change in loss during training. (b) The change in detection accuracy of the model on the validation set during training.
Figure 9. (a) The change in loss during training. (b) The change in detection accuracy of the model on the validation set during training.
Figure 10. (a) The change in loss during training. (b) The change in detection accuracy of the model on the validation set during training.
Figure 11. (a) The change in loss during training. (b) The change in detection accuracy of the model on the validation set during training.
Figure 12. Vehicle object detection experiment compared with other algorithms. (a) Detection results of the Faster R-CNN detector; (b) detection results of our designed detector.
Figure 13. Open set vehicle object detection experiment. (a) Detection results of the Faster R-CNN detector; (b) detection results of our designed detector.
Figure 14. Outdoor flight experiment device. (a) A UAV developed in the lab. (b) Nvidia Jetson Xavier NX.
Figure 15. The imaging effect of the outdoor flying site and the target at the equivalent flying height.
Figure 16. Visualization of target detection at different UAV heights and positions in the outdoor flight experiment.
Table 1. Statistical results of vehicle image size in two datasets.
Dataset | Width/Pixel | Height/Pixel | Mean/Pixel
BIT-VEHICLE10-150 | 46–123 | 41–120 | 84
BIT-VEHICLE10-300 | 18–48 | 18–48 | 38
Table 2. Experiment results of anchor adjustment.
Method | Anchor Adjustment | mAP50 | mAP70 | FPS
Faster R-CNN | - | 63.64 | 10.69 | 15.1
Faster R-CNN | ✓ | 72.35 | 36.30 | 15.1
Table 3. Experiment results of adding SENet.
Backbone | mAP50 | mAP70 | FPS
VGG16 | 63.64 | 10.69 | 15.1
SENet-VGG16 | 65.50 | 12.99 | 14.0
Table 4. Experiment results of reducing the network layer.
Backbone | mAP50 | mAP70 | FPS
VGG16 | 63.64 | 10.69 | 15.1
R7-VGG16 | 65.79 | 18.32 | 19.4
R5-VGG16 | 66.90 | 20.58 | 24.3
R4-VGG16 | 93.80 | 47.64 | 20.4
Table 5. Experiment results of modifying downsampling network layer.
Backbone | mAP50 | mAP70 | FPS
VGG16 | 63.64 | 10.69 | 15.1
DSN-VGG16 | 64.53 | 16.43 | 12.4
Table 6. Experiment results of method integration.
Method | Image Input Size | mAP50 | mAP70 | FPS
Faster R-CNN | 960 × 540 | 63.64 | 10.69 | 15.1
DSR4-Faster R-CNN-AA | 960 × 540 | 96.17 | 67.13 | 23.3
Faster R-CNN | 640 × 360 | 7.01 | 0.13 | 15.4
DSR4-Faster R-CNN-AA | 640 × 360 | 81.02 | 44.95 | 32.4
Faster R-CNN | 480 × 270 | - | - | -
DSR4-Faster R-CNN-AA | 480 × 270 | 68.44 | 31.57 | 36.9
Table 7. Experiment results of vehicle object detection.
Method | mAP50 | mAP70 | FPS
SSD | 8.30 | 2.44 | 43.6
YOLOv4 | 9.44 | 5.08 | 60.7
PPYOLO | 15.19 | 6.21 | 46.2
PPYOLOv2 | 14.78 | 5.96 | 30.8
PPYOLOE_s | 7.97 | 2.87 | 47.9
PPYOLOE_l | 12.52 | 5.06 | 29.1
FCOS | 5.06 | 1.99 | 22.2
PicoDet_s | 5.18 | 1.69 | 58.2
Faster R-CNN | 7.01 | 0.13 | 15.4
FR-H | 19.76 | 9.57 | 18.2
SCRDet | 20.94 | 11.26 | 21.4
DSR4-Faster R-CNN-AA (ours) | 81.02 | 44.95 | 32.4
Table 8. Experimental results on open set vehicle object detection.
Method | mAP50 | mAP70 | FPS
DSR4-Faster R-CNN-AA-O | 71.34 | 36.58 | 19.4
Table 9. Experimental results of real-time performance on a UAV device.
Image Input Size | Average Pre-Processing Time (ms) | Average Inference Time (ms) | Average Post-Processing Time (ms) | Average Total Process Time (ms) | FPS
960 × 540 | 3.2 | 82.4 | 2.7 | 88.3 | 11.32
640 × 360 | 4.5 | 53.2 | 3.9 | 61.6 | 16.24
480 × 270 | 5.3 | 44.2 | 4.4 | 53.9 | 18.57

