Article

Vehicle Detection in UAV Images via Background Suppression Pyramid Network and Multi-Scale Task Adaptive Decoupled Head

School of Electronic Information, Hangzhou Dianzi University, Hangzhou 310005, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(24), 5698; https://doi.org/10.3390/rs15245698
Submission received: 5 September 2023 / Revised: 25 November 2023 / Accepted: 7 December 2023 / Published: 12 December 2023

Abstract

Vehicle detection based on unmanned aerial vehicle (UAV) aerial images plays a significant role in areas such as traffic monitoring and management, disaster relief, and more, garnering extensive attention from researchers in recent years. However, datasets acquired from UAV platforms inevitably suffer from issues such as imbalanced class distribution, severe background interference, numerous small objects, and significant target scale variance, presenting substantial challenges to practical vehicle detection applications based on this platform. Addressing these challenges, this paper proposes an object detection model grounded in a background suppression pyramid network and multi-scale task adaptive decoupled head. Firstly, the model implements a long-tail feature resampling algorithm (LFRA) to solve the problem of imbalanced class distribution in the dataset. Next, a background suppression pyramid network (BSPN) is integrated into the Neck segment of the model. This network not only reduces the interference of redundant background information but also skillfully extracts features of small target vehicles, enhancing the ability of the model to detect small objects. Lastly, a multi-scale task adaptive decoupled head (MTAD) with varied receptive fields is introduced, enhancing detection accuracy by leveraging multi-scale features and adaptively generating relevant features for classification and detection. Experimental results indicate that the proposed model achieves state-of-the-art performance among lightweight object detection networks. Compared to the baseline model PP-YOLOE-s, our model improves the $AP_{50:95}$ on the VisDrone-Vehicle dataset by 1.9%.

Graphical Abstract

1. Introduction

In recent years, drones have played a pivotal role in numerous practical engineering applications, such as traffic monitoring [1,2,3], intelligent traffic management [4], orchard pest detection [5], water resource monitoring [6], and disaster relief [7,8]. Compared to traditional remote sensing platforms, drone platforms equipped with cameras not only offer flexible deployment and a broad field of view but also possess commendable real-time capabilities in image transmission and processing. This makes them highly promising for vehicle detection tasks [9]. Consequently, vehicle detection based on drone platforms has garnered extensive attention from researchers.
In the realm of vehicle detection, the use of object detection methods based on deep neural networks has become mainstream. Broadly, these methods can be categorized into two-stage and single-stage object detection techniques. Two-stage object detection methods first generate region proposals in the initial stage, followed by classification and bounding box regression of these proposals in the second stage. The typical representatives of this approach are R-CNN [10] and its derivatives, such as SPPnet [11], Fast R-CNN [12], Faster R-CNN [13], R-FCN [14], and Mask R-CNN [15]. The advantage of this approach lies in its high object detection precision. However, its downsides include higher computational complexity and limited real-time capability, making it less suitable for deployment on resource-constrained detection hardware platforms. In contrast, single-stage detection methods directly predict object categories and positions on the feature map using preset anchor points, accomplishing object detection in a single step and thereby enhancing computational speed. Notable instances of this methodology include SSD [16], RetinaNet [17], and YOLO [18,19,20]. Single-stage methods, when compared to two-stage ones, excel in real-time performance but usually lag slightly in accuracy. Although the aforementioned deep learning-based object detection models have shown commendable performance on generic large-scale datasets such as COCO [21], ImageNet [22], and VOC2007/2012 [23], data collected from drone platforms differ markedly from these datasets. First, drone-captured images are typically large and cover a wide field of view, with complex backgrounds and many heterogeneous objects, so vehicles are easily confused with visually similar items, complicating vehicle detection and classification. Second, owing to variations in drone altitude and differences in vehicle dimensions, the same object may appear at very different scales across images. Third, factors such as the location, time, and environment of image capture lead to a significant imbalance among object categories in the dataset. Lastly, capturing images from an elevated perspective with a drone results in objects occupying a smaller portion of the image, leading to reduced image quality. Consequently, applying existing models trained on standard datasets directly to vehicle detection in drone images tends to yield suboptimal performance.
To address these challenges, many researchers have attempted improvements in the workflow of object detection from both data and model perspectives. Regarding data-side enhancements, data augmentation techniques [24,25,26,27] are commonly used to address issues like multi-scale targets and uneven category distribution. For handling multi-scale targets, Li et al. [28] divided the original image into two separate pictures for feature extraction, solving the problem of vehicle small target distortion caused by image reduction, thereby effectively enhancing the accuracy of subsequent models in multi-scale target detection; Zhou et al. [29] employed an adaptive interpolation method, which first categorizes images using object relative scale (ORS) and then uses bilinear interpolation or SRGAN to enlarge and crop images with numerous small vehicle targets to the appropriate size. To improve the imbalance in category distribution, Li et al. [30] proposed a new definition for positive samples and data augmentation algorithms, ameliorating the imbalance between vehicle targets and the background as well as among vehicle samples. Pandey et al. [31] selected images with less frequent category occurrences and expanded them back into the dataset through rotation.
On the model front, considering the limited edge computing capabilities of drone platforms and the real-time requirement for algorithms deployed on them, models applied in actual projects often evolve from single-stage object detection models. A prevalent technique involves combining shallow and deep feature maps [32,33,34,35], primarily targeting low-resolution (small objects) and background interference issues. For small object detection, Xie et al. [36] extended YOLOv2 to DYOLO, enhancing accuracy through deep-shallow upsampling feature map fusion. Tayara et al. [37] developed a Full Convolutional Regression Network (FCRN) tailored for drone image-based small vehicle detection, building on Faster-RCNN. Liang et al. [38] introduced the FS-SSD network, a fusion of SSD and Feature Pyramid Network, which, through average pooling layers and deconvolution layers, bolstered small object detection precision. For addressing background interference, Xi et al. [39] proposed a fine-grained target focusing network (FiFoNet), which effectively selects multi-scale feature combinations during feature fusion, further enhancing multi-scale feature expressiveness and thus suppressing background noise. Ma et al. [40] presented an aerial visual scene-based object detection method (AVS-YOLO) that filters out redundant background information by introducing a dense connection in the feature pyramid section. Notably, while many studies have improved generic methods from various angles to better fit the unique context of drone aerial imagery, no single approach has yet provided a comprehensive solution framework to holistically address the aforementioned challenges.
To solve the aforementioned issues, this article proposes an object detection model based on the Background Suppression Pyramid Network and Multi-scale Task Adaptive Decoupled Head, built upon the PP-YOLOE framework. This model consists of four parts: data preprocessing, backbone network, feature fusion network, and detection head network. In the data preprocessing part, a long-tail feature resampling algorithm was designed to address the uneven distribution of target categories. The backbone network part employs the CSPResNet architecture from the PP-YOLOE framework. In the feature fusion network part, a background suppression module and a Lightweight Transformer module are added to the original network to reduce background information interference and enhance the capability of detecting small targets. In the detection head network, a newly designed multi-scale task adaptive decoupled head replaces the original detection head to deal with severe changes in object scale. This model design effectively improves detection accuracy in unmanned aerial vehicle target detection. The main contributions of this article include:
  • Long-tail feature resampling algorithm: This algorithm generates target prediction density maps and resamples small-sample targets based on these maps to alleviate the issue of imbalanced target category distribution.
  • Background suppression module: This module integrates spatial and channel attention mechanisms. Spatial attention focuses on key image regions to highlight targets, while channel attention emphasizes important features. The combined use of these attention methods can reduce the interference of the background in object detection.
  • Lightweight Transformer module: This module uses the concept of lightweight to divide input features into blocks, each then processed by a transformer. This design not only improves the ability of the model to detect small targets but also reduces computational requirements.
  • Multi-scale task adaptive decoupled head: This module processes multi-scale features from different receptive fields through dynamic convolution to extract target scale features, selecting the most optimal features. This approach addresses the problem of drastic target scale changes in unmanned aerial vehicle target detection.

2. Related Work

In this section, we first introduce our baseline model, PP-YOLOE [41], and then delve into its detailed design, from the overall network architecture and feature fusion network to the detection head network.
The overall architecture of PP-YOLOE is built upon PP-YOLO v2 [42]. A series of upgrades and optimizations have been applied, as depicted in Figure 1. The design of the PP-YOLOE model showcases its uniqueness in several aspects. Firstly, the model employs a novel and unified backbone and neck design, which flexibly supports configurations of various sizes. Secondly, to effectively address the imbalance issue between classification and regression in object detection tasks, PP-YOLOE adopts the dynamic matching strategy of task alignment learning (TAL). This strategy significantly improves detection accuracy with an efficient label assignment method. Lastly, PP-YOLOE introduces the design of the efficient task-aligned head (ET-Head). Although this induces a minor speed reduction, it markedly enhances the detection precision of the model.
In the feature fusion (neck) network of PP-YOLOE, the model adopts the CSPRepResBlock module, as illustrated in Figure 2. This design utilizes residual connections, effectively addressing the vanishing gradient problem while simultaneously serving as a form of model ensembling to boost performance. Additionally, the implemented dense connection fusion strategy integrates intermediate-layer features with varied receptive fields, offering significant advantages in tasks like object detection, and showcasing commendable results.
In the detection head network of the PP-YOLOE model, the ET-Head module is specifically designed for classification and regression tasks, as depicted in Figure 3. The primary improvements encompass four facets. Firstly, the time-consuming task interaction feature module in the original task-aligned head (T-Head) is discarded. Secondly, while ensuring accuracy, the channel attention module is streamlined into an effective squeeze and extraction block (ESE). Furthermore, to enhance the speed, the classification task alignment module has been simplified to a shortcut. Lastly, addressing the intricate and deployment-unfriendly regression task alignment module in the T-Head, the model draws inspiration from the integration module in GFL [43] to model the detection bounding box.
In summary, PP-YOLOE demonstrates remarkable advantages in terms of performance, hardware compatibility, scalability, and flexible model configuration, outperforming YOLOv5 and YOLOX in the trade-off between speed and accuracy. Although PP-YOLOE, as a universal object detection model, has exhibited superior performance in numerous application scenarios, its direct application to the domain of drones might encounter some limitations. On the one hand, images captured by drones often encompass intricate backgrounds, potentially challenging the detection accuracy of the model. On the other hand, the feature distribution of drone-captured images diverges from typical image datasets, with more pronounced scale variations of objects from a drone’s perspective, necessitating specific model refinements. Thus, it is imperative to suitably adjust and enhance the PP-YOLOE model to cater to the unique demands of the drone realm.
Our model is adapted and optimized with PP-YOLOE serving as the baseline. At the outset of the backbone network, we introduce a data augmentation module that aims to boost the representation of minority samples. Within the neck network, we draw inspiration from the path aggregation network (PAN) structure, incorporating a background suppression block during the top-down phase to mitigate the influence and interference of background noise on the model. In the subsequent bottom-up phase, a lightweight transformer block is integrated to further refine the performance of the model. At the detection head, we design and integrate a multi-scale task adaptive decoupled head.
The primary objective of this module is to address the drastic scale variations of a single object, thereby enhancing the recognition and handling capabilities of the model for targets of varying scales. Through such strategic design, our model not only retains its foundational performance but further elevates its ability to identify targets amidst complex backgrounds and sparse samples, simultaneously optimizing its adaptability to diverse object scale variations.

3. Proposed Approach

3.1. Overall Framework of the Model

In this study, we introduce a novel vehicle detection method based on the background suppression pyramid network and multi-scale task adaptive decoupled head. The overall architecture of the model, as illustrated in Figure 4, comprises four main components: data enhancement, backbone network, neck network, and detector head network.
Within the data enhancement segment, we employ a long-tailed feature resampling strategy. This approach generates new drone images by cropping and re-pasting image regions guided by density maps, ameliorating the imbalance of vehicle target categories present within the dataset. In the backbone network, we continue to leverage the cross stage partial network (CSPDarknet) architecture from PP-YOLOE [41] for image feature extraction. Within the neck network, image features first pass through a background suppression block. This block generates attention feature maps in both the channel and spatial dimensions, mitigating the interference of image backgrounds on target objects. Subsequently, features are processed by the lightweight transformer block, which, through multi-head attention and a fully connected approach, enhances the capability to capture varying local information, bolstering the detection prowess for smaller targets. In the detector head network, we design the multi-scale task adaptive decoupled head. Within this module, three distinct branches are designed, each extracting image features from different receptive fields, addressing the challenges posed by the pronounced scale variations in the to-be-detected vehicle targets.
In the sections that follow, we delve into the refinements made in the domains of data enhancement, the neck network, and the detector head network in this study.

3.2. Data Enhancement

The data enhancement network used in this paper is illustrated in Figure 5 and is centered around the long-tailed feature resampling algorithm. This algorithmic flow consists of three main components: density map generation, cropping based on the density map, and image processing followed by re-stitching. Initially, the input image is passed through a trained network to produce a predicted density map, which is then used in conjunction with a sliding window and a threshold to create a density mask, from which cropped images are obtained. These cropped images are subsequently flipped and adjusted randomly before being stitched onto drone-view background images to generate new images.
In the density map generation segment, the image is channeled into three distinct paths. Each path employs varying convolution kernel sizes to capture features of different scales. Within each path, the image undergoes convolution followed by a 2 × 2 pooling operation, a process that is repeated twice, ultimately reducing the output feature map size to a quarter of the original image. To retain the initial resolution, while circumventing the memory issues related to very high-resolution images, this paper employs a nearest-neighbor interpolation method to upscale the produced density map by a factor of four.
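This three-path design can be sketched as follows (an illustrative PyTorch implementation; the channel counts, the 3/5/7 kernel sizes, and the two-stage depth per path are assumptions consistent with the description above, not the authors' exact network):

import torch
import torch.nn as nn
import torch.nn.functional as F

def branch(kernel_size, channels):
    # Two conv + 2x2 pooling stages, reducing resolution to 1/4 of the input.
    pad = kernel_size // 2
    return nn.Sequential(
        nn.Conv2d(3, channels, kernel_size, padding=pad), nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
        nn.Conv2d(channels, channels * 2, kernel_size, padding=pad), nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class DensityMapNet(nn.Module):
    """Sketch: three paths with different kernel sizes capture multi-scale features."""
    def __init__(self):
        super().__init__()
        self.paths = nn.ModuleList([branch(3, 8), branch(5, 8), branch(7, 8)])
        self.fuse = nn.Conv2d(48, 1, kernel_size=1)  # 3 paths x 16 channels each

    def forward(self, x):
        feats = torch.cat([p(x) for p in self.paths], dim=1)
        density = self.fuse(feats)                   # density map at 1/4 resolution
        # Nearest-neighbour interpolation restores the original resolution (x4).
        return F.interpolate(density, scale_factor=4, mode="nearest")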
For the segment involving cropping based on the density map, the areas with densely packed vehicle targets exhibit a higher pixel intensity. Therefore, we utilize the segmentation algorithm from reference [44] to set thresholds within these regions. This aids in estimating the number of vehicle targets while filtering out pixels void of any target vehicles.
During the image processing and re-stitching phase, to ensure that the cropped images serve as a means of data augmentation for tail-end vehicle targets without altering the size of the vehicle targets relative to the original image, the cropped images are first horizontally flipped and subsequently subjected to contrast and brightness adjustments. To prevent image distortion, the adjustments are based on random values within a specified range. Depending on the dimensions of the image, the cropped patches are then placed at specific intervals against a naturally captured drone background, resulting in the final composited image.
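A simplified sketch of the full resampling step is given below, assuming NumPy arrays in HWC layout; the threshold value, window size, placement gaps, and brightness/contrast ranges are illustrative, and the segmentation algorithm of [44] is reduced here to a fixed threshold:

import numpy as np
import random

def resample_tail_targets(image, density_map, background, thr=0.1, win=128, gap=32):
    """Crop dense tail-class regions guided by the density map and paste them
    onto a drone-view background at spaced intervals (simplified sketch)."""
    mask = density_map > thr                      # density mask from thresholding
    patches = []
    h, w = density_map.shape
    for y in range(0, h - win + 1, win):          # sliding window over the mask
        for x in range(0, w - win + 1, win):
            if mask[y:y + win, x:x + win].mean() > 0.5:
                patches.append(image[y:y + win, x:x + win].copy())

    out, px, py = background.copy(), gap, gap
    for patch in patches:
        patch = patch[:, ::-1]                    # horizontal flip
        alpha = random.uniform(0.8, 1.2)          # contrast within a safe range
        beta = random.uniform(-20, 20)            # brightness within a safe range
        patch = np.clip(alpha * patch.astype(np.float32) + beta, 0, 255).astype(np.uint8)
        if py + win > out.shape[0]:
            break
        out[py:py + win, px:px + win] = patch     # paste at spaced intervals
        px += win + gap
        if px + win > out.shape[1]:
            px, py = gap, py + win + gap
    return out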

3.3. Background Suppression Pyramid Network

The structure of the background suppression pyramid network employed in this study is depicted in Figure 6. It is constituted of two branches, top-down and bottom-up, with lateral connections that fuse the feature maps from both branches. Within the top-down branch, the deeper network layers convey strong semantic information, providing a more abstract and global understanding crucial for tasks like target class identification and rough localization. Conversely, in the bottom-up branch, the shallower network layers communicate robust localization feature information, capturing the fine structure and positioning of the target. This type of information is vital for precise target localization and recognition tasks. The fusion of these feature maps amalgamates parameters from different backbone layers to their respective detection layers, significantly enhancing vehicle target recognition and localization. Building upon this, our study innovatively incorporates a background suppression block within the top-down branch. This ensures that, during the fusion phase, semantic information emphasizes vehicle target features while mitigating the influence of irrelevant background information. Additionally, within the bottom-up branch, we introduce the lightweight transformer block, leveraging multi-head self-attention mechanisms to bolster spatial localization for smaller vehicle targets. Subsequent sections delve into the background suppression block and lightweight transformer block in detail.

3.3.1. Background Suppression Block

In this paper, the background suppression block (BS Block) is introduced, as illustrated in Figure 7. The structure of this module is divided into two main components: channel attention and spatial attention. Both components operate independently on the original feature map, yielding channel and spatial weights, respectively. Each of these weights is then multiplicatively combined with the original input feature map to achieve adaptive feature refinement, producing two distinct feature maps. Ultimately, these maps are combined via pixel-wise addition and undergo normalization, resulting in a feature map enriched with both channel and spatial weights.
Given a primary feature map $F_B$ from the backbone network, it is concurrently processed through both channel attention and spatial attention mechanisms. This results in the channel attention weight $M_c$ and the spatial attention weight $M_s$. These weights are then element-wise multiplied with the original input feature map, yielding the channel-informed feature map $F_c$ and the spatial-informed feature map $F_s$, respectively. Ultimately, a pixel-wise summation of these two maps is conducted, followed by normalization, producing a feature map $F_{sc}$ encapsulating both channel and spatial information.
The channel attention mechanism used in this paper employs channel average pooling and channel max pooling to capture global and significant features within the channel dimension. By integrating these two methods, the model effectively extracts discriminative features closely associated with vehicle targets, thereby enhancing its capability for vehicle detection. The resultant features are then fed into a shared fully connected network. The output features from this network undergo pixel-wise summation and a subsequent sigmoid normalization to derive the channel attention weight $M_c$. The channel-informed feature map $F_c$ is defined as:
$$F_c = F_B \times M_c$$
where $F_B$ is the input feature map from the backbone. The channel attention weight $M_c$ is expressed as:
$$M_c = sigmoid\big(MLP(AvgPool(F_B)) + MLP(MaxPool(F_B))\big)$$
where $sigmoid(\cdot)$ denotes the sigmoid activation function, $MLP(\cdot)$ signifies the shared fully connected operation, and $AvgPool(\cdot)$ and $MaxPool(\cdot)$ stand for average pooling and max pooling, respectively.
The spatial attention mechanism adopted in this paper applies average pooling and max pooling operations to the feature maps, effectively extracting global and prominent features in the spatial dimension. The combination of these two operations not only emphasizes key areas in the image but also highlights features crucial for vehicle target detection, thereby effectively enhancing the capability of the model to detect vehicle objects. After concatenation of this information, convolution and normalization operations are applied to consolidate it into the spatial attention weight $M_s$. The spatial-informed feature map $F_s$ is then defined as:
$$F_s = F_B \times M_s$$
where $F_B$ is the input feature map from the backbone. The spatial attention weight $M_s$ is given by:
$$M_s = sigmoid\big(Conv_{7 \times 7}(concat(AvgPool(F_B), MaxPool(F_B)))\big)$$
where $Conv_{7 \times 7}$ represents the convolution operation with a $7 \times 7$ kernel size, and $concat$ signifies the concatenation operation.
Upon obtaining both the channel-informed feature map $F_c$ and the spatial-informed feature map $F_s$, the final feature map $F_{sc}$ can be represented as:
$$F_{sc} = sigmoid(F_c + F_s)$$
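Following the equations above, the BS Block can be sketched in PyTorch as a CBAM-style module in which channel and spatial attention act on $F_B$ in parallel and their outputs are summed and normalized; the channel reduction ratio of the shared MLP is an assumption, as it is not specified here.

import torch
import torch.nn as nn

class BSBlock(nn.Module):
    """Background suppression block sketch: parallel channel and spatial attention."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared MLP for channel attention (M_c).
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        # 7x7 convolution for spatial attention (M_s).
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, f_b):
        # Channel attention: average- and max-pooled descriptors through the shared MLP.
        avg_c = self.mlp(f_b.mean(dim=(2, 3), keepdim=True))
        max_c = self.mlp(f_b.amax(dim=(2, 3), keepdim=True))
        m_c = torch.sigmoid(avg_c + max_c)
        f_c = f_b * m_c

        # Spatial attention: concatenate channel-wise average and max maps, 7x7 conv.
        avg_s = f_b.mean(dim=1, keepdim=True)
        max_s = f_b.amax(dim=1, keepdim=True)
        m_s = torch.sigmoid(self.spatial(torch.cat([avg_s, max_s], dim=1)))
        f_s = f_b * m_s

        # Pixel-wise summation followed by sigmoid normalization yields F_sc.
        return torch.sigmoid(f_c + f_s)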

3.3.2. Lightweight Transformer Block

The lightweight transformer block designed in this study is depicted in Figure 8. Given an input image feature $F_p$, it is first equally divided into $n \times n$ local image features $\hat{F}_i$, $0 < i \le n^2$. Subsequently, for each local image feature, a token $\hat{F}_i^t$ is generated. Following an attention operation on the token, it is concatenated with its corresponding local image feature to form a new local image feature $\hat{F}_i^w$. Each of these new local features is subjected to a distinct attention operation. The final operation involves reconstructing and concatenating these processed local features to generate $F^{wt}$, which retains the same feature size as $F_p$.
Initially, the transformer operation on global image features is translated into transformer operations on local image features. Employing a transformer on local image features efficiently enhances computational efficiency. The input feature map of size $H \times W$ is partitioned uniformly into $n \times n$ local windows; for simplicity, letting $H = W$, we have $n^2 = H^2 / k^2$ (i.e., $n = H / k$), where $k$ is the size of the local image feature. The local image feature $\hat{F}_i$ is represented as:
$$\hat{F}_i = Split_{k \times k}(F_p)$$
where $Split_{k \times k}$ divides the image feature into $k \times k$ patches. Each local image feature initializes a token to capture its comprehensive information. The token initialization is given by:
$$\hat{F}_i^t = AvgPool_{k^2 \to 1}\big(Conv_{3 \times 3}(F_p)\big)$$
where $Conv_{3 \times 3}$ effectively extracts features from the image feature and $AvgPool_{k^2 \to 1}$ pools each $k^2$-sized local image feature into a token of size 1. This token represents the holistic information of its respective local image feature. These local tokens $\hat{F}_i^t$ are concatenated to produce $\hat{F}^t$, which undergoes an Attention & Conv layer to facilitate local image feature interactions via tokens, enriching the information capture. The Attention & Conv operation is expressed as:
$$\hat{F}^{At} = ConvBN(\hat{F}^t) + \gamma_1 \cdot MHSA\big(ConvBN(\hat{F}^t)\big)$$
where $\gamma_1$ is a hyperparameter and $MHSA$ stands for multi-head self-attention. $\hat{F}^{At}$ is the result after the tokens have undergone Attention & Conv. These tokens are then separated:
$$\hat{F}_i^{At} = Split_{1 \times 1}(\hat{F}^{At})$$
where $Split_{1 \times 1}$ divides $\hat{F}^{At}$ into $1 \times 1$ pieces, yielding $n \times n$ tokens $\hat{F}_i^{At}$ that have undergone information exchange.
Next, interactions between local image features and the exchanged-information tokens are computed to model both short- and long-range spatial information. Local features $\hat{F}_i$ and tokens $\hat{F}_i^{At}$ are concatenated to form $\hat{F}_i^w$. Each of these localized features $\hat{F}_i^w$ enters an Attention & Conv layer to produce $\hat{F}_i^{Aw}$:
$$\hat{F}_i^{Aw} = ConvBN(\hat{F}_i^w) + \gamma_1 \cdot MHSA\big(ConvBN(\hat{F}_i^w)\big)$$
$\hat{F}_i^{Aw}$ is then split back into the post-attention tokens $\hat{F}_i^{Awt}$ and local image features $\hat{F}_i^{A}$. A reconstruction and concatenation operation yields the feature map $F^{wt}$:
$$F^{wt} = Concat\big(reshape_{\hat{F}_i^{A}}(\hat{F}_i^{Awt}) + \hat{F}_i^{A}\big)$$
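A condensed PyTorch sketch of this block is given below. It assumes a square input whose side is divisible by the window size, a channel count divisible by the number of heads, and PyTorch ≥ 1.9 (for batch_first multi-head attention); the Attention & Conv layers are simplified to plain multi-head self-attention with a residual connection, omitting the ConvBN and the $\gamma_1$ scaling, so it illustrates the token/window interaction pattern rather than reproducing the authors' exact design.

import torch
import torch.nn as nn

class LightweightTransformerBlock(nn.Module):
    """Simplified sketch: window tokens exchange global information via MHSA,
    then each window attends jointly over its pixels and its token."""
    def __init__(self, channels, window=8, heads=4):
        super().__init__()
        self.window = window
        self.token_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.token_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, f_p):
        b, c, h, w = f_p.shape                            # assumes h == w, h % window == 0
        k = self.window
        n = h // k                                        # n x n local windows
        # Split into n*n local windows, each flattened to (k*k, c).
        windows = f_p.unfold(2, k, k).unfold(3, k, k)     # (b, c, n, n, k, k)
        windows = windows.reshape(b, c, n * n, k * k).permute(0, 2, 3, 1)  # (b, n*n, k*k, c)

        # One token per window: 3x3 conv followed by average pooling over the window.
        tokens = self.token_conv(f_p)
        tokens = tokens.unfold(2, k, k).unfold(3, k, k).mean(dim=(-1, -2))  # (b, c, n, n)
        tokens = tokens.reshape(b, c, n * n).permute(0, 2, 1)               # (b, n*n, c)

        # Global interaction among tokens (Attention & Conv simplified to MHSA + residual).
        tokens = tokens + self.token_attn(tokens, tokens, tokens)[0]

        # Local interaction: each window attends over its pixels plus its token.
        seq = torch.cat([tokens.unsqueeze(2), windows], dim=2)              # (b, n*n, 1+k*k, c)
        seq = seq.reshape(b * n * n, 1 + k * k, c)
        seq = seq + self.local_attn(seq, seq, seq)[0]
        locals_ = seq[:, 1:, :].reshape(b, n * n, k * k, c)

        # Reconstruct the original spatial layout (same size as the input F_p).
        out = locals_.permute(0, 3, 1, 2).reshape(b, c, n, n, k, k)
        out = out.permute(0, 1, 2, 4, 3, 5).reshape(b, c, h, w)
        return out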

3.4. Detection Head Network

In this work, we inherit the decoupled detection head design from PP-YOLOE and further refine it to enrich image features of multi-scale objects. This addresses the issue of dramatic scale variations of objects from drone perspectives, consequently enhancing the accuracy of object detection. The proposed multi-scale task adaptive decoupled head (MTAD Head) is illustrated in Figure 9. This network channels feature maps from three pyramid levels into distinct classification and regression branches. These branches extract classification and regression features, ultimately pinpointing the location and category of vehicle targets.

3.4.1. Multi-Scale Task Adaptive Decoupled Head

As depicted in Figure 9, the multi-scale task adaptive decoupled head (MTAD Head) is divided into classification and regression branches. Each branch encompasses features from three pyramid levels. These branches have different receptive fields, capturing multi-scale features, which enhances the capability of the model to detect objects with severe scale variations. Concurrently, the classification task requires features that are spatially coarse yet semantically rich to infer the category of the object. In contrast, the regression task necessitates high-resolution features imbued with edge details for a more precise object boundary regression. Our MTAD Head is designed to allow the model to adaptively learn features required for both classification and regression tasks. Each pyramid level is intrinsically linked to its two neighboring levels, ensuring that the features from the three pyramid levels cater to the demands of both classification and regression tasks.
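The decoupled structure can be illustrated with the simplified PyTorch sketch below, in which each pyramid level receives its own classification and regression branch with a different kernel size; the kernel sizes, the shared channel width, and the DFL-style regression output dimension are assumptions, and the adaptive cross-level feature selection is left to the dynamic convolutional attention block described next.

import torch
import torch.nn as nn

class MTADHeadSketch(nn.Module):
    """Simplified decoupled head: per-level cls/reg branches with varied receptive fields."""
    def __init__(self, channels, num_classes, reg_max=16, kernels=(3, 5, 7)):
        super().__init__()
        self.cls_stems = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in kernels)
        self.reg_stems = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in kernels)
        self.cls_preds = nn.ModuleList(
            nn.Conv2d(channels, num_classes, 1) for _ in kernels)
        self.reg_preds = nn.ModuleList(
            nn.Conv2d(channels, 4 * (reg_max + 1), 1) for _ in kernels)

    def forward(self, feats):
        # feats: list of three pyramid-level feature maps (e.g., P3, P4, P5).
        cls_outs, reg_outs = [], []
        for f, cs, rs, cp, rp in zip(feats, self.cls_stems, self.reg_stems,
                                     self.cls_preds, self.reg_preds):
            cls_outs.append(cp(torch.relu(cs(f))))  # semantically rich features for classification
            reg_outs.append(rp(torch.relu(rs(f))))  # edge-aware features for box regression
        return cls_outs, reg_outs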

3.4.2. Dynamic Convolutional Attention Block

The structure of the dynamic convolutional attention block (DCA) designed in this study is illustrated in Figure 10. Given a network input $F_n$, the feature map first undergoes a linear layer to yield $\bar{F}_n$. Both $F_n$ and $\bar{F}_n$ are then fed into the group deformable convolutional network (DCN). The multi-group DCN can learn richer information from various positions and distinct representational subspaces. Each DCN adaptively adjusts the sampling position and shape of its convolution kernel based on the deformation of the object, thereby enhancing the perceptual capability of the model towards the target. For common issues in drone imagery, such as object deformation and scale variation, the DCN can better adapt to feature variations of the target, broadening the coverage of the receptive field. The DCN passes $F_n$ through a convolutional and a linear layer to obtain the mask $m_n$ and the offset $\Delta p_n$. The offset deforms the convolution sampling grid, and the mask weights each sampling point, mitigating interference from background factors. The deformable convolution formula is as follows:
$$DCN(p_0) = \sum_{p_n \in R} w(p_n) \cdot F_n(p_0 + p_n + \Delta p_n) \cdot m_n$$
where $p_0$ corresponds to the central coordinate of the convolution kernel, $p_n$ denotes the relative positional coordinates of the other sampling points with respect to the kernel center, $\Delta p_n$ is the learned offset, $R$ represents the set of relative positions of all sampling points in the kernel, and $m_n$ is the weight of each sampling point. Outputs from multiple DCNs are channeled into SimAM [45] to capture essential features, yielding $F^{SD}$:
$$F^{SD} = SimAM\big(DCN_0(F_n), DCN_1(F_n), \ldots, DCN_n(F_n)\big)$$
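A possible PyTorch realization of the DCA block is sketched below, using torchvision's DeformConv2d (torchvision ≥ 0.9, which accepts a modulation mask) as the DCN and implementing SimAM [45] from its published energy formulation; the number of DCN groups and the fusion of their outputs by summation are assumptions made for illustration.

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

def simam(x, eps=1e-4):
    """SimAM attention [45]: parameter-free, energy-based feature reweighting."""
    b, c, h, w = x.shape
    n = h * w - 1
    d = (x - x.mean(dim=(2, 3), keepdim=True)) ** 2
    v = d.sum(dim=(2, 3), keepdim=True) / n
    e_inv = d / (4 * (v + eps)) + 0.5
    return x * torch.sigmoid(e_inv)

class DCABlock(nn.Module):
    """Sketch: several deformable convolutions whose offsets/masks are predicted
    from the input, with SimAM selecting essential features from their outputs."""
    def __init__(self, channels, groups=3, k=3):
        super().__init__()
        self.k = k
        self.linear = nn.Conv2d(channels, channels, 1)                 # F_n -> F_bar_n
        self.offset_mask = nn.ModuleList(
            nn.Conv2d(channels, 3 * k * k, 3, padding=1) for _ in range(groups))
        self.dcns = nn.ModuleList(
            DeformConv2d(channels, channels, k, padding=k // 2) for _ in range(groups))

    def forward(self, f_n):
        f_bar = self.linear(f_n)
        split = 2 * self.k * self.k
        outs = []
        for om, dcn in zip(self.offset_mask, self.dcns):
            o_m = om(f_bar)                                            # predict offsets and mask
            offset, mask = o_m[:, :split], torch.sigmoid(o_m[:, split:])
            outs.append(dcn(f_n, offset, mask))
        # Outputs are summed here for simplicity before SimAM fusion.
        return simam(torch.stack(outs, dim=0).sum(dim=0))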

3.5. Loss Function

This section introduces two loss functions, which play a vital role in the density map generation network and the object detection network, respectively. Loss functions are crucial in the training process of deep learning models, as they assess the discrepancy between the output of the model and the ground truth; by minimizing these loss functions, the model learns to accurately predict and detect targets.

3.5.1. Density Map

The loss function for the density map generation network is based on the pixel-wise mean squared error and can be represented as:
$$L(\theta) = \frac{1}{2N} \sum_{i=1}^{N} \left\| D(X_i; \theta) - D_i \right\|^2$$
where $N$ denotes the total number of training samples, $\theta$ represents the parameters of the data augmentation network, and $X_i$ and $D_i$ are the input image and the ground-truth density map, respectively. $D(X_i; \theta)$ represents the density map generated by the density map network. The method used to obtain the ground-truth density map $D_i$ is similar to that of reference [46]: a geometry-adaptive Gaussian kernel is used to blur the annotation of each object, generating the actual target density map for the drone imagery.
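In code, this loss amounts to a sum of squared pixel differences divided by twice the number of samples, for example (a minimal sketch operating on batched tensors):

import torch

def density_map_loss(pred, gt):
    """L(theta) = 1/(2N) * sum_i ||D(X_i; theta) - D_i||^2 over a batch of N samples."""
    n = pred.shape[0]                       # number of training samples in the batch
    return ((pred - gt) ** 2).sum() / (2 * n)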

3.5.2. Object Detection

The loss function for the object detection network can comprehensively evaluate target location, scale, category, and confidence. It can be formulated as:
$$Loss = \frac{\alpha \cdot loss_{VFL} + \beta \cdot loss_{GIoU} + \gamma \cdot loss_{DFL}}{\sum_{i}^{N_{pos}} \hat{t}}$$
where $N_{pos}$ is the number of positive samples and $\hat{t}$ represents the target score (equivalent to target confidence). $loss_{VFL}$ denotes the classification loss, $loss_{DFL}$ indicates the regression loss, and $loss_{GIoU}$ represents the target localization loss. The hyperparameters $\alpha$, $\beta$, and $\gamma$ are used to balance the individual losses.
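The combination can be sketched as a single weighting-and-normalization step; the default weight values below are assumptions for illustration, not the settings used in this paper:

def detection_loss(loss_vfl, loss_giou, loss_dfl, t_hat_sum,
                   alpha=1.0, beta=2.5, gamma=0.5):
    """Weighted sum of VFL, GIoU, and DFL losses, normalized by the sum of
    target scores over positive samples (alpha/beta/gamma values are assumed)."""
    return (alpha * loss_vfl + beta * loss_giou + gamma * loss_dfl) / t_hat_sum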

4. Experiments

In the experimental section, we first constructed the VisDrone-Vehicle dataset based on the VisDrone-DET2019 dataset. Subsequently, by comparing the performance of our proposed method with relevant methods on the drone object detection task using the VisDrone-Vehicle dataset, we further confirmed and highlighted the practical value and potential impact of our approach.

4.1. VisDrone-Vehicle Dataset

The VisDrone-Vehicle dataset we utilized is built upon the VisDrone-DET2019 dataset. Based on the detection task target categories, we narrowed the original categories down to five, namely cars, vans, trucks, buses, and motorcycles. The entire dataset comprises 6471 training images and 548 validation images. Specifically, the numbers of objects for cars, vans, trucks, buses, and motorcycles are 172,940, 15,534, 30,727, 8866, and 35,492, respectively. It is evident that the car category is the most prevalent, accounting for 65.6% of the total number of objects, while the bus category represents only 3.3%. There is a substantial inequality in the number of objects across categories, indicating a long-tail distribution in the data. Following the COCO standard [21], this article defines objects with a bounding box area smaller than 32 × 32 pixels as small objects, those with an area between 32 × 32 and 96 × 96 pixels as medium objects, and those with an area larger than 96 × 96 pixels as large objects. The distribution of large, medium, and small objects across each category is shown in Table 1. The dataset predominantly consists of small objects, but there is also a significant presence of medium and large objects. Scenes with objects of varying sizes appearing simultaneously are common, signifying considerable variation in object scales.
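For reference, a bounding box can be assigned to one of these scale categories as follows (a small helper following the COCO convention):

def size_category(w, h):
    """Classify a bounding box by area following the COCO convention [21]."""
    area = w * h
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"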

4.2. Experimental Setup

The experiments in this study were conducted on a tower server platform, SYS-4028GR-TR, provided by Supermicro (San Jose, CA, USA), with an Nvidia RTX 2080Ti GPU. The software environment was established on the Ubuntu 18.04 operating system, utilizing mmdetection and mmyolo as the object detection frameworks. During the training process, to achieve lightweight object detection, images of size 2000 × 1500 were resized to 640 × 640. We used the stochastic gradient descent (SGD) optimizer with an initial learning rate of 0.005, a momentum of 0.9, and a weight decay of 0.0001.
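For reference, these training hyperparameters can be summarized in a small configuration dictionary (a hypothetical summary in the spirit of mmdetection-style configs, not a reproduction of the authors' configuration files):

# Hypothetical training configuration summarizing the settings reported above;
# key names are illustrative and do not reproduce the authors' config file.
train_cfg = dict(
    input_size=(640, 640),        # 2000 x 1500 images resized to 640 x 640
    optimizer=dict(type='SGD', lr=0.005, momentum=0.9, weight_decay=0.0001),
)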
As primary model evaluation metrics, this paper employs average precision ($AP$) at different IoU thresholds. These include $AP$ at an IoU threshold of 0.5 (denoted as $AP_{50}$), $AP$ at an IoU threshold of 0.75 (denoted as $AP_{75}$), and $AP$ averaged over IoU thresholds ranging from 0.5 to 0.95 with an interval of 0.05 (denoted as $AP_{50:95}$). In addition, we considered the impact of object scale on model performance, introducing $AP$ metrics for different object scales. Collectively, these metrics offer a comprehensive and detailed perspective on model performance evaluation. The $AP$ metric can be calculated as:
$$AP = \int_{0}^{1} P(R)\, dR$$
where $P(R)$ denotes the precision-recall curve formed by precision $P$ and recall $R$. Precision measures the accuracy of predictions, while recall assesses the coverage of actual positive samples by the model, indicating how many actual positive samples the model can correctly detect. Precision and recall can be expressed as:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
where FN, FP, and TP represent the numbers of false negatives, false positives, and true positives, respectively.
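The integral can be evaluated numerically from a sampled precision-recall curve, for instance with the standard all-point interpolation used by COCO/VOC-style evaluators (a sketch; `precisions` and `recalls` are assumed to come from detections sorted by descending confidence):

import numpy as np

def average_precision(precisions, recalls):
    """Numerically integrate the precision-recall curve: AP = integral of P(R) dR."""
    # Pad the curve so that it spans recall 0..1.
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([1.0], precisions, [0.0]))
    # Make precision monotonically non-increasing (standard smoothing step).
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the areas of the rectangles between consecutive recall points.
    return np.sum((r[1:] - r[:-1]) * p[1:])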

4.3. Ablation Study

To validate the efficacy of the proposed modules in addressing the challenges of class imbalance, a plethora of small vehicle objects, significant object scale variations, and complex background information in drone-captured images, we conducted an ablation study on the key components of our method, namely the long-tail feature resampling algorithm (LFRA), the background suppression pyramid network (BSPN), and the multi-scale task adaptive decoupled head (MTAD). The results are presented in Table 2. In this context, the baseline model is the PP-YOLOE-s model without the inclusion of the aforementioned modules. Models M1, M2, and M3 incorporate, respectively, the LFRA, the BSPN, and the MTAD. Models M4, M5, and M6 each combine two of these modules (LFRA with MTAD, LFRA with BSPN, and BSPN with MTAD, respectively), and Model M7 represents the complete method proposed in this study.
As can be observed from Table 2, across all performance metrics, M1 outperforms the baseline, indicating that the long-tail feature resampling algorithm enhances the capability of the model to recognize small-sample targets, effectively improving the performance of the object detection model. Comparing the results of the baseline with M2, it is evident that adopting the background suppression pyramid network significantly boosts the performance of object detection; in particular, there is an improvement of 0.9% and 1.1% in detecting medium and large targets, respectively. Notably, since we resized the input images, the area of medium-sized targets shrunk by an average of 7.3 times, placing them in the category of small objects in general object detection. Hence, the performance enhancement of our model in $AP_{medium}$, which after resizing corresponds essentially to small objects, mirrors its prowess in handling smaller targets. This is attributed to our model effectively leveraging the pyramid network structure with background suppression capabilities, leading to a notable accuracy boost in small object detection. Comparing the results of the baseline and M3, there is an improvement across all performance metrics, with $AP_{small}$, $AP_{medium}$, and $AP_{large}$ improving by 0.3%, 1.1%, and 1.2%, respectively. This indicates that our modules enhance detection performance when facing drastic target scale variations, validating the effectiveness of our model. When comparing the results between the baseline and M4, it is evident that the combined use of the modules can further boost the performance of the model. From the results of M4, it is evident that combining LFRA with MTAD significantly improves the overall detection capability of the model, with an increase of 0.8% in $AP_{50:95}$ compared with M1. This indicates that M4 enhances the detection ability for small samples and shows strong adaptability to the frequent and drastic target scale changes commonly observed from a drone's perspective. Comparing M5 with M1, it is found that adding a BSPN on top of M1 improves the detection of medium and large targets, suggesting that the model effectively suppresses background noise in object detection and better extracts features of medium and large targets. The comparison of M6 with M2 shows that integrating an MTAD on top of M2 leads to a 0.7% increase in $AP_{50:95}$. The combination of these two modules effectively enhances the overall detection ability of the model, demonstrating the targeted effectiveness of this combination for UAV-based vehicle object detection tasks. Comparing the outcomes of the baseline and M7 (the model introduced in this paper), the $AP_{50:95}$ increases by 1.9% over the baseline. This further attests to the efficacy of each module in enhancing the performance of the model. Additionally, it suggests that when these modules are employed collectively, they synergize, leading to even better model capabilities.
The detection results from the ablation study of our proposed method on the VisDrone-Vehicle dataset are shown in Figure 11. White boxes in the image denote areas that the model failed to predict, while boxes of other colors indicate correctly detected targets. A comparison between Figure 11c,d reveals that upon the inclusion of the long-tail feature resampling algorithm, the recognition ability of the model for low-sample targets, namely buses and motorcycles, was enhanced. Four misdetected targets are seen in Figure 11c, specifically located in boxes 1, 3, 4, and 7. After incorporating the long-tail feature resampling algorithm, the four misdetected cases in Figure 11c were improved, indicating that the long-tail feature resampling algorithm is effective in addressing sample distribution imbalance. By contrasting Figure 11c,e, it can be observed that in Figure 11c, the unrelated background was misidentified as the detection target in boxes 1, 4, 7, and 9. However, in Figure 11e, this background noise is suppressed and no longer misinterpreted as a target. This suggests that the feature pyramid module based on transformer can suppress intricate background information, thereby reducing the false detection rate of the model. Comparing Figure 11c,f, after incorporating MTAD, the detection rate of the model for distant small targets was heightened, effectively recognizing distant car targets. Surprisingly, even some unannotated distant car targets were detected accurately. This demonstrates that the module effectively enhances the ability of the model to detect targets of the same category with drastic shape variations. When contrasting Figure 11c,g, it is evident that the enhanced model could proficiently recognize near-range low-sample motorcycle targets and distant car targets. This indicates a significant improvement in recognition precision for low-sample targets by the model and its capability to recognize targets with drastic shape variations. Lastly, a comparison between Figure 11c,h showcases that our proposed model, when compared with the baseline model, not only detects more distant small targets but also exhibits commendable recognition ability for low-sample motorcycle and bus targets amidst complex backgrounds. Overall, the detection capability of the model has been significantly uplifted.

4.4. Overall Model Performance Analysis

To further validate the detection performance of our proposed model on UAV vehicle object detection tasks, we carried out a comparative analysis between our model and other classic models on the VisDrone-Vehicle dataset. The detection results of various object detection methods on the VisDrone-Vehicle dataset are presented in Table 3.
From Table 3, it can be observed that the model proposed in this study, as a single-stage detection model, not only leads in overall performance but also boasts a parameter size of just 9.48 M, making it highly suitable for deployment on UAV platforms. Compared to the current top-performing single-stage model, YOLOv8-s, our proposed model shows an improvement of 2.6% in $AP_{50:95}$. Relative to the best-performing two-stage model, Cascade RCNN, our model sees an elevation of 4.9% in $AP_{50:95}$. Compared to the best model in this field, Faster RCNN NWD, our proposed model has achieved a 1.5% increase in $AP_{50:95}$, indicating that the overall performance of the model is superior in UAV vehicle detection tasks, with significant advancements across other metrics as well. Specifically, the $AP_{small}$ of our model is 1.9% and 3.4% higher than YOLOv8-s and Cascade RCNN, respectively, highlighting that our model exhibits superior performance in UAV detection tasks, which typically involve smaller targets in the images, making it especially suited for UAV vehicle detection tasks. Furthermore, for $AP_{medium}$ and $AP_{large}$, our model surpasses the top-performing single-stage model, YOLOv8-s, by 3% and 2.9%, respectively. When contrasted with the best-performing two-stage model, Cascade RCNN, our model records increments of 4.9% and 4.3% in these two metrics. When compared to the best model in this field, Faster RCNN NWD, these two metrics are higher by 3.9% and 8.2%, respectively. This underscores that our model is likely to exhibit superior performance in scenarios intrinsic to UAV vehicle detection tasks where there is significant variance in target sizes.
In the evaluation of unmanned aerial vehicle (UAV) target detection systems, besides detection performance, the real-time nature, size, and computational efficiency of the model are also key indicators. In the UAV vehicle target detection field, the real-time nature of a model is typically measured by frames per second (FPS). Considering the flying altitude of UAVs, the field of view range, the average speed of ground targets, and actual engineering requirements, a model is generally considered to have high real-time performance when its FPS reaches 15. Such processing speed not only sufficiently covers the monitoring area but also ensures effective capture and analysis of the dynamics of targets within the field of view. As seen in Table 3, the model proposed in this article meets the real-time requirements on the FPS indicator and outperforms the latest UAV detection research like RT-DETR-s, CD Det, and Faster RCNN NWD. Additionally, when evaluating UAV vehicle target detection models, the size of the model and computational efficiency are commonly measured in GFlops, particularly important in hardware resource-limited application scenarios, such as on mobile devices or embedded systems. Usually, a higher GFlops value implies a larger, more complex model requiring more computational resources. The GFlops value of the model proposed in this paper is only 13.1, showing lighter weight and more computation-friendly characteristics compared to other Yolo series models currently used in actual engineering. These indicators collectively reflect the comprehensive performance advantages of our model in the field of UAV vehicle target detection.
The detection outcomes of the optimal single-stage and two-stage models in typical UAV vehicle detection scenarios are depicted in Figure 12. As can be discerned from Figure 12d, YOLOv8-s demonstrates commendable detection capabilities for smaller targets commonly found in UAV detection tasks. However, being a general detection model, it does not adequately account for the challenges posed by the complex backgrounds inherent in UAV detection, nor does it consider the class imbalances present within the dataset. As a result, the image exhibits numerous false detections, and it struggles with detecting under-represented objects such as buses and motorcycles. Observing Figure 12e, it becomes evident that the Cascade RCNN model, much like YOLOv8-s, faces difficulties in detecting smaller objects and exhibits a limited proficiency in recognizing under-sampled objects like buses and motorcycles. In contrast, our proposed model showcases superior detection performance in these typical UAV vehicle detection scenarios.
The relationship between $AP_{50:95}$ and model parameters across various models is depicted in Figure 13. The x-axis represents the parameters of the model, while the y-axis illustrates the detection performance metric $AP_{50:95}$ on the VisDrone-Vehicle dataset. In the context of the UAV-based detection tasks covered in this paper, models situated closer to the top-left corner are relatively superior. Two-stage models, such as Faster RCNN and Cascade RCNN, exhibit better detection precision compared to most single-stage models. However, their parameter count is an order of magnitude larger than that of many single-stage models, thus limiting their practical deployment on UAV platforms. Our approach not only boasts fewer parameters and superior detection precision compared to classic single-stage methods such as FCOS, GFL, Fovea, and Fsaf but also exhibits relatively higher detection precision when compared to the latest single-stage detection techniques with comparable parameters. Taking both detection performance and model parameter count into consideration, our model is optimal for UAV vehicle detection tasks compared to all benchmark models.

5. Conclusions

To address issues in UAV vehicle detection, such as background interference, drastic target scale changes, and high mis-detection rates for small-sample targets, we introduced an object detection model based on the Background Suppression Pyramid Network and the Multi-scale Task Adaptive Decoupled Head. This method integrates long-tail feature resampling, a background suppression pyramid network, and a multi-scale task adaptive decoupled head. The long-tail feature resampling addresses the class imbalance in the dataset; the background suppression pyramid network minimizes the interference of background information on targets while enhancing the feature extraction capabilities of the model for vehicles; and the multi-scale task adaptive decoupled head reduces the sensitivity of the model to target scale. Experimental results demonstrate that our approach achieves the current best performance among small-parameter object detection networks, meeting both the performance and accuracy demands of UAV platforms. Compared to the baseline model PP-YOLOE-s, the $AP_{50:95}$ of our model reached 32.8%, an improvement of 1.9%. The average precision for three different sizes of targets, $AP_{small}$, $AP_{medium}$, and $AP_{large}$, increased by 1%, 3%, and 3.2%, respectively, with a model parameter count of 9.48 M. In the future, our work will continue to focus on dense small object detection and on real-world deployment and testing on UAV platforms.

Author Contributions

Conceptualization, M.P. and W.C.; data curation, W.X. and X.H.; formal analysis, H.Y. and J.S.; investigation, H.Y., W.C., J.S. and M.P.; resources, H.Y., J.S. and W.C.; writing—original draft preparation, W.X., X.H. and M.P.; writing—review and editing, W.X., X.H. and M.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Key Research and Development Project of China (grant No. 2021YFC2801400, 2022YFC2803600); the Key Research and Development Program of Zhejiang Province (grant No. 2022C03027, 2022C01144); the Public Welfare Technology Research Project of Zhejiang Province (grant No. LGF21E090004, LGF22E090006); and Zhejiang Provincial Key Lab of Equipment Electronics; National Natural Science Foundation of China (No. 62271179), Natural Science Foundation of Zhejiang Province (No. LZ22F010004). The authors acknowledge the Supercomputing Center of HangzhouDianzi University for providing computing resources.

Data Availability Statement

The data are available from the corresponding author on reasonable request.

Acknowledgments

The authors would like to thank each member of the team for their efforts.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Mueller, M.; Smith, N.; Ghanem, B. A Benchmark and Simulator for UAV Tracking. In Computer Vision—ECCV 2016: Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2016; Volume 9905, pp. 445–461. [Google Scholar]
  2. Robicquet, A.; Sadeghian, A.; Alahi, A.; Savarese, S. Learning Social Etiquette: Human Trajectory Understanding in Crowded Scenes. In Computer Vision—ECCV 2016: Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2016; Volume 9912, pp. 549–565. [Google Scholar]
  3. Zhu, P.; Wen, L.; Bian, X.; Ling, H.; Hu, Q. Vision Meets Drones: A Challenge. arXiv 2018, arXiv:1804.07437. [Google Scholar]
  4. Zhu, J.; Sun, K.; Jia, S.; Li, Q.; Hou, X.; Lin, W.; Liu, B.; Qiu, G. Urban Traffic Density Estimation Based on Ultrahigh-Resolution UAV Video and Deep Neural Network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 4968–4981. [Google Scholar] [CrossRef]
  5. Betti Sorbelli, F.; Palazzetti, L.; Pinotti, C.M. YOLO-based detection of Halyomorpha halys in orchards using RGB cameras and drones. Comput. Electron. Agric. 2023, 213, 108228. [Google Scholar] [CrossRef]
  6. Mishra, V.; Avtar, R.; Prathiba, A.P.; Mishra, P.K.; Tiwari, A.; Sharma, S.K.; Singh, C.H.; Chandra Yadav, B.; Jain, K. Uncrewed Aerial Systems in Water Resource Management and Monitoring: A Review of Sensors, Applications, Software, and Issues. Adv. Civ. Eng. 2023, 2023, e3544724. [Google Scholar] [CrossRef]
  7. Wang, X.; Yao, F.; Li, A.; Xu, Z.; Ding, L.; Yang, X.; Zhong, G.; Wang, S. DroneNet: Rescue Drone-View Object Detection. Drones 2023, 7, 441. [Google Scholar] [CrossRef]
  8. Półka, M.; Ptak, S.; Kuziora, Ł. The Use of UAV’s for Search and Rescue Operations. Procedia Eng. 2017, 192, 748–752. [Google Scholar] [CrossRef]
  9. Singh, C.H.; Mishra, V.; Jain, K.; Shukla, A.K. FRCNN-Based Reinforcement Learning for Real-Time Vehicle Detection, Tracking and Geolocation from UAS. Drones 2022, 6, 406. [Google Scholar] [CrossRef]
  10. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  11. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
  12. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  13. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems: Proceedings of the Annual Conference on Neural Information Processing Systems 2015, Montreal, QC, Canada, 7–12 December 2015; Curran Associates, Inc.: Red Hook, NY, USA, 2015; Volume 28. [Google Scholar]
  14. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object Detection via Region-based Fully Convolutional Networks. In Advances in Neural Information Processing Systems: Proceedings of the Annual Conference on Neural Information Processing Systems 2016, Barcelona, Spain, 5–10 December 2016; Curran Associates, Inc.: Red Hook, NY, USA, 2016; Volume 29. [Google Scholar]
  15. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  16. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision—ECCV 2016: Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  17. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007. [Google Scholar]
  18. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  19. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  20. Chen, C.; Zheng, Z.; Xu, T.; Guo, S.; Feng, S.; Yao, W.; Lan, Y. YOLO-Based UAV Technology: A Review of the Research and Its Applications. Drones 2023, 7, 190. [Google Scholar] [CrossRef]
  21. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Computer Vision—ECCV 2014: Proceedings of the 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  22. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Li, F.-F. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 248–255. [Google Scholar]
23. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  24. Zhang, X.; Izquierdo, E.; Chandramouli, K. Dense and Small Object Detection in UAV Vision Based on Cascade Network. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 118–126. [Google Scholar]
  25. Li, X.; Li, X. Robust Vehicle Detection in Aerial Images Based on Image Spatial Pyramid Detection Model. In Proceedings of the 2019 IEEE 4th International Conference on Advanced Robotics and Mechatronics (ICARM), Toyonaka, Japan, 3–5 July 2019; pp. 850–855. [Google Scholar]
  26. Wang, L.; Liao, J.; Xu, C. Vehicle Detection Based on Drone Images with the Improved Faster R-CNN. In Proceedings of the 2019 11th International Conference on Machine Learning and Computing, Zhuhai, China, 22–24 February 2019; ACM: New York, NY, USA, 2019; pp. 466–471. [Google Scholar]
  27. Brkić, I.; Miler, M.; Ševrović, M.; Medak, D. An Analytical Framework for Accurate Traffic Flow Parameter Calculation from UAV Aerial Videos. Remote Sens. 2020, 12, 3844. [Google Scholar] [CrossRef]
  28. Li, X.; Li, X.; Pan, H. Multi-Scale Vehicle Detection in High-Resolution Aerial Images with Context Information. IEEE Access 2020, 8, 208643–208657. [Google Scholar] [CrossRef]
  29. Zhou, J.; Vong, C.-M.; Liu, Q.; Wang, Z. Scale adaptive image cropping for UAV object detection. Neurocomputing 2019, 366, 305–313. [Google Scholar] [CrossRef]
  30. Li, X.; Li, X.; Li, Z.; Xiong, X.; Khyam, M.O.; Sun, C. Robust Vehicle Detection in High-Resolution Aerial Images with Imbalanced Data. IEEE Trans. Artif. Intell. 2021, 2, 238–250. [Google Scholar] [CrossRef]
  31. Pandey, V.; Anand, K.; Kalra, A.; Gupta, A.; Roy, P.P.; Kim, B.-G. Enhancing object detection in aerial images. Math. Biosci. Eng. 2022, 19, 7920–7932. [Google Scholar] [CrossRef] [PubMed]
  32. Tang, T.; Zhou, S.; Deng, Z.; Lei, L.; Zou, H. Fast multidirectional vehicle detection on aerial images using region based convolutional neural networks. In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA, 23–28 July 2017; pp. 1844–1847. [Google Scholar]
  33. Sommer, L.; Schumann, A.; Schuchert, T.; Beyerer, J. Multi Feature Deconvolutional Faster R-CNN for Precise Vehicle Detection in Aerial Imagery. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 635–642. [Google Scholar]
  34. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  35. Deng, Z.; Sun, H.; Zhou, S.; Zhao, J.; Zou, H. Toward Fast and Accurate Vehicle Detection in Aerial Images Using Coupled Region-Based Convolutional Neural Networks. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 3652–3664. [Google Scholar] [CrossRef]
  36. Xie, X.; Yang, W.; Cao, G.; Yang, J.; Zhao, Z.; Chen, S.; Liao, Q.; Shi, G. Real-Time Vehicle Detection from UAV Imagery. In Proceedings of the 2018 IEEE Fourth International Conference on Multimedia Big Data (BigMM), Xi’an, China, 13–16 September 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–5. [Google Scholar]
37. Tayara, H.; Kim, G.S.; Chong, K.T. Vehicle Detection and Counting in High-Resolution Aerial Images Using Convolutional Regression Neural Network. IEEE Access 2018, 6, 2220–2230. [Google Scholar] [CrossRef]
  38. Liang, X.; Zhang, J.; Zhuo, L.; Li, Y.; Tian, Q. Small Object Detection in Unmanned Aerial Vehicle Images Using Feature Fusion and Scaling-Based Single Shot Detector with Spatial Context Analysis. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 1758–1770. [Google Scholar] [CrossRef]
  39. Xi, Y.; Jia, W.; Miao, Q.; Liu, X.; Fan, X.; Li, H. FiFoNet: Fine-Grained Target Focusing Network for Object Detection in UAV Images. Remote Sens. 2022, 14, 3919. [Google Scholar] [CrossRef]
  40. Ma, Y.; Chai, L.; Jin, L.; Yu, Y.; Yan, J. AVS-YOLO: Object Detection in Aerial Visual Scene. Int. J. Patt. Recogn. Artif. Intell. 2022, 36, 2250004. [Google Scholar] [CrossRef]
  41. Xu, S.; Wang, X.; Lv, W.; Chang, Q.; Cui, C.; Deng, K.; Wang, G.; Dang, Q.; Wei, S.; Du, Y.; et al. PP-YOLOE: An evolved version of YOLO. arXiv 2022, arXiv:2203.16250. [Google Scholar]
  42. Huang, X.; Wang, X.; Lv, W.; Bai, X.; Long, X.; Deng, K.; Dang, Q.; Han, S.; Liu, Q.; Hu, X.; et al. PP-YOLOv2: A Practical Object Detector. arXiv 2021, arXiv:2104.10419. [Google Scholar]
  43. Li, X.; Wang, W.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss v2: Learning reliable localization quality estimation for dense object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–24 June 2021; pp. 11632–11641. [Google Scholar]
  44. Li, C.; Yang, T.; Zhu, S.; Chen, C.; Guan, S. Density Map Guided Object Detection in Aerial Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 13–19 June 2020; pp. 190–191. [Google Scholar]
  45. Li, X.; Sun, W.; Wu, T. Attentive Normalization. In Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XVII; Springer: Cham, Switzerland, 2020; pp. 70–87. [Google Scholar]
  46. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 8–10 June 2015. [Google Scholar]
  47. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
  48. Zhu, C.; He, Y.; Savvides, M. Feature selective anchor-free module for single-shot object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 840–849. [Google Scholar]
  49. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9626–9635. [Google Scholar]
  50. Kong, T.; Sun, F.; Liu, H.; Jiang, Y.; Li, L.; Shi, J. FoveaBox: Beyound Anchor-Based Object Detection. IEEE Trans. Image Process. 2020, 29, 7389–7398. [Google Scholar] [CrossRef]
  51. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  52. Li, C.; Li, L.; Geng, Y.; Jiang, H.; Cheng, M.; Zhang, B.; Ke, Z.; Xu, X.; Chu, X. YOLOv6 v3.0: A Full-Scale Reloading. arXiv 2023, arXiv:2301.05586. [Google Scholar]
  53. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar]
  54. Wang, J.; Xu, C.; Yang, W.; Yu, L. A Normalized Gaussian Wasserstein Distance for Tiny Object Detection. arXiv 2022, arXiv:2110.13389. [Google Scholar]
  55. Meethal, A.; Granger, E.; Pedersoli, M. Cascaded Zoom-In Detector for High Resolution Aerial Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 2046–2055. [Google Scholar]
  56. Lv, W.; Zhao, Y.; Xu, S.; Wei, J.; Wang, G.; Cui, C.; Du, Y.; Dang, Q.; Liu, Y. DETRs Beat YOLOs on Real-time Object Detection. arXiv 2023, arXiv:2304.08069. [Google Scholar]
Figure 1. Schematic diagram of the PP-YOLOE model.
Figure 2. Schematic representation of the CSPRepResBlock module in the feature fusion (Neck) network of PP-YOLOE.
Figure 3. Diagrammatic illustration of the ET-Head module structure within the detection head network of PP-YOLOE.
Figure 4. Schematic diagram of the proposed vehicle detection model based on the background suppression pyramid network and multi-scale task adaptive decoupled head.
Figure 5. Schematic representation of the data enhancement network structure based on the long-tail feature resampling algorithm.
Figure 6. Schematic diagram of the background suppression pyramid network.
Figure 7. Schematic diagram of the background suppression block.
Figure 8. Schematic diagram of the lightweight transformer block.
Figure 9. Schematic diagram of the multi-scale task adaptive decoupled head.
Figure 10. Schematic diagram of the dynamic convolutional attention block.
Figure 11. Illustration of detection results from the ablation study on the VisDrone-Vehicle dataset using the proposed method.
Figure 12. Illustration of detection results from different models.
Figure 13. Illustration of the relationship between AP50:95 and model parameter count for different models.
Table 1. Distribution of object counts across categories in the VisDrone-Vehicle dataset.
Object Size   Total     Category   Number
Small         132,484   Car        84,913
                        Van        13,643
                        Truck      4761
                        Bus        4984
                        Motor      26,808
Medium        110,752   Car        74,976
                        Van        14,212
                        Truck      8169
                        Bus        4984
                        Motor      8411
Large         20,323    Car        13,051
                        Van        2872
                        Truck      2604
                        Bus        1523
                        Motor      273
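For readers who want to reproduce a category/size breakdown of this kind, the short Python sketch below tallies vehicle instances from VisDrone-style annotation files. It assumes the standard VisDrone-DET text-annotation format (comma-separated fields with the object category in the sixth position, and car/van/truck/bus/motor mapped to category IDs 4, 5, 6, 9, and 10) and uses the COCO area thresholds of 32² and 96² pixels for the small/medium/large split; the exact thresholds and annotation directory used for Table 1 are not stated here, so this is an illustrative assumption rather than the procedure used in the paper.

```python
# Illustrative sketch (not the authors' exact script): count VisDrone-Vehicle
# instances per category and per size bucket.
# Assumed annotation line format (VisDrone-DET):
#   bbox_left,bbox_top,bbox_w,bbox_h,score,category,truncation,occlusion
# The directory path below is a placeholder.
from collections import Counter
from pathlib import Path

VEHICLE_IDS = {4: "car", 5: "van", 6: "truck", 9: "bus", 10: "motor"}

def size_bucket(area: float) -> str:
    # COCO-style area thresholds, assumed here for illustration.
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"

def count_instances(ann_dir: str) -> Counter:
    counts = Counter()
    for txt in Path(ann_dir).glob("*.txt"):
        for line in txt.read_text().splitlines():
            fields = line.strip().split(",")
            if len(fields) < 6:
                continue
            w, h, cat = float(fields[2]), float(fields[3]), int(fields[5])
            if cat in VEHICLE_IDS:
                counts[(size_bucket(w * h), VEHICLE_IDS[cat])] += 1
    return counts

if __name__ == "__main__":
    # "VisDrone2019-DET-train/annotations" is a hypothetical path.
    for (bucket, name), n in sorted(count_instances("VisDrone2019-DET-train/annotations").items()):
        print(f"{bucket:>6}  {name:<6}  {n}")
```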
Table 2. Ablation study results of the proposed method on the VisDrone-Vehicle dataset.
Method      LFRA   BSPN   MTAD   AP50:95   AP50   AP75   APsmall   APmedium   APlarge
Baseline                          30.9      49.7   33.3   13.8      40.7       52.4
M1                                31.6      50.2   34.0   14.3      41.3       53.3
M2                                31.5      50.0   33.9   14.0      41.6       53.5
M3                                31.7      50.4   34.5   14.1      41.8       53.6
M4                                32.4      51.0   35.2   14.5      42.5       54.2
M5                                32.0      50.6   34.8   14.2      41.9       53.9
M6                                32.1      50.8   34.9   14.3      42.3       53.8
M7                                32.8      51.7   36.0   14.6      43.4       54.7
Table 3. Detection results (%) of different methods on the VisDrone-Vehicle Dataset.
Method                  AP50:95   AP50   AP75   APsmall   APmedium   APlarge   Param   GFlops   FPS
Faster RCNN [13]        26.2      41.4   29.5   10.3      36.5       45.9      41.1    202      11.6
Cascade RCNN [47]       27.9      42.7   31.6   11.2      38.5       50.4      68.9    230      16.3
FSAF [48]               19.4      35.4   19.1   6.8       26.3       39.5      36.0    201      20.7
GFL [43]                20.4      35.1   21.5   6.9       27.8       42.0      32.0    203      20.6
FCOS [49]               18.4      32.2   18.8   4.9       25.6       40.1      31.9    196      21.4
Fovea [50]              17.5      30.5   18.2   3.3       25.1       41.2      36.0    201      21.1
YOLOX [51]              23.7      40.1   25.2   9.3       32.2       33.9      8.94    13.4     46.2
YOLOv6-n [52]           24.9      40.7   26.7   9.3       33.7       47.5      4.30    5.49     33.4
YOLOv6-t [52]           28.4      45.6   31.0   11.1      38.1       49.2      9.67    12.3     34.8
YOLOv7-t [53]           25.4      43.3   26.7   9.9       33.7       46.9      6.02    6.89     65.5
YOLOv8-n                25.3      41.4   27.1   9.5       34.5       45.6      3.01    4.40     64.7
YOLOv8-s                30.2      48.2   32.7   12.7      40.4       51.8      11.1    14.3     63.2
Faster RCNN NWD [54]    30.3      32.8   49.8   15.9      39.5       46.5      41.14   202      11.2
CZ Det [55]             24.8      26.3   41.3   12.2      30.9       35.2      45.9    210      7.0
RT-DETR-s [56]          28.9      31.8   46.0   11.2      39.1       51.3      8.87    14.8     16.7
PP-YOLOE-s [41]         30.9      49.7   33.3   13.6      40.4       51.5      7.46    7.95     45.1
Our method              32.8      51.7   36.0   14.6      43.4       54.7      10.3    13.1     20.1
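As a point of reference, metrics of the kind reported in Table 3 (AP50:95, AP50, AP75, and the size-specific APs) are conventionally obtained with the COCO evaluation toolkit. The sketch below shows the standard pycocotools workflow; the file names are placeholders, and this reflects only the generic COCO protocol, not necessarily the authors' exact evaluation script.

```python
# Generic COCO-style evaluation sketch (not the authors' exact script).
# Requires pycocotools; the JSON file paths below are placeholders.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("visdrone_vehicle_val.json")   # ground truth in COCO JSON format
coco_dt = coco_gt.loadRes("detections.json")  # detector outputs in COCO result format

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()

# stats[0..5] correspond to AP50:95, AP50, AP75, APsmall, APmedium, APlarge.
ap, ap50, ap75, ap_s, ap_m, ap_l = evaluator.stats[:6]
print(f"AP50:95={ap:.3f}  AP50={ap50:.3f}  AP75={ap75:.3f}  "
      f"APsmall={ap_s:.3f}  APmedium={ap_m:.3f}  APlarge={ap_l:.3f}")
```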