Object Detection in UAV Images via Global Density Fused Convolutional Network

Abstract: Object detection in Unmanned Aerial Vehicle (UAV) images plays a fundamental role in a wide variety of applications. As UAVs are maneuverable with high speed, multiple viewpoints, and varying altitudes, objects in UAV images are distributed with great heterogeneity, varying in size and often appearing at high density, which makes object detection difficult for existing algorithms. To address these issues, we propose a novel global density fused convolutional network (GDF-Net) optimized for object detection in UAV images. We test the effectiveness and robustness of the proposed GDF-Net on the VisDrone dataset and the UAVDT dataset. The designed GDF-Net consists of a Backbone Network, a Global Density Model (GDM), and an Object Detection Network. Specifically, GDM refines density features via the application of dilated convolutional networks, aiming to deliver larger receptive fields and to generate global density fused features. Compared with base networks, the addition of GDM improves the model performance in both recall and precision. We also find that the designed GDM facilitates the detection of objects in congested scenes with high distribution density. The presented GDF-Net framework can be instantiated not only with the base networks selected in this study but also with other popular object detection models.


Introduction
An Unmanned Aerial Vehicle (UAV) is a new and prominent remote sensing platform operated by radio remote control equipment or programming, which benefits a wide range of practical applications, including environmental monitoring [1][2][3][4][5], abnormal target tracking [6,7], and animal protection [8][9][10]. The rapid development of UAV techniques and applications has attracted wide attention in the object detection domain. In this paper, we focus on object detection in UAV images [11,12], which aims to identify and localize objects of interest in UAV images and serves as a basic and significant algorithm in numerous UAV applications.
To detect objects in UAV images, early algorithms adopt background extraction and selected feature extraction approaches [6,13][14][15]. Despite their effectiveness, these methods depend heavily on the descriptive method of features and the perspective of images, which not only consumes considerable manpower and computation but also reduces the capability of the model to transfer across different datasets [8,16]. Over recent years, deep learning has become one of the most cutting-edge technologies in both the computer vision and remote sensing communities [17][18][19][20][21]. Deep convolutional neural networks (DCNNs), an important family of models in deep learning, have brought significant progress and achieved state-of-the-art performance in fields related to image analysis. In object detection tasks, owing to the unprecedented success of deep learning based algorithms on natural scene images (such as the images in MS COCO [22], PASCAL [23], and ImageNet [24]), many studies adopt algorithms developed for natural scene images to detect objects in UAV images [7,9,16,25]. However, the major difference between natural scene images and UAV images is that UAV images often exhibit varying scales, perspectives, and appearances, because UAVs are maneuverable with high speed, multiple viewpoints, and varying altitudes [26][27][28]. In addition, unlike generic natural scenes with large individual objects, UAV images often contain a large number of small objects, which poses great challenges for object detection in UAV images using existing approaches [28,29].
To address the aforementioned challenges, we propose a global density fused convolutional network (termed GDF-Net) that cascades global features to facilitate object distribution learning for detecting objects in UAV images. The proposed method introduces congested scene analysis for dense object distribution learning, inspired by methods in crowd counting tasks [30,31]. Compared with existing networks, the proposed method improves object detection in congested scenes, benefiting from object density features obtained by dilated convolutional networks [30,32] and from feature refinement. As shown in Figure 1, the architecture of the proposed GDF-Net consists of a Backbone Network, a Global Density Model (GDM), and an Object Detection Network. The GDM takes multi-level features from a Feature Pyramid Network (FPN) as inputs and fuses them into global density fused features. Additionally, the GDM consists of a series of dilated convolutional networks in which dilated kernels are applied to deliver larger receptive fields over the features from the backbone network. The generated global density fused features are integrated with the original features and promote feature alignment among objects distributed in congested scenes, which further improves the performance of object detection in UAV images. We evaluate the proposed GDF-Net framework on two public UAV benchmark datasets: VisDrone [33] and UAVDT [34]. As the proposed GDF-Net can be instantiated with existing detection algorithms, we perform several experiments with GDF-Net instantiated on Faster R-CNN [35], Cascade R-CNN [36], Free Anchor [37], and Grid R-CNN [38]. To highlight the advantages of GDF-Net, we add GDM to these widely recognized algorithms (the original algorithms are called base networks) to detect objects in UAV images. The experimental results demonstrate that the proposed component improves the performance of object detection in UAV images.
The contributions of this work are summarized as follows:
• We propose a novel global density fused convolutional network (GDF-Net) for object detection in UAV images, which cascades a novel Global Density Model to base networks. Via the application of GDM, the proposed GDF-Net achieves distribution learning that integrates global patterns from the input image with features extracted by existing object detection networks.
• We introduce a novel Global Density Model into the base networks to improve the performance of object detection in UAV images. GDM applies dilated convolutional networks to deliver large receptive fields, facilitating the learning of global patterns in targets.
• The proposed GDF-Net can be instantiated with existing detection algorithms, and we demonstrate the effectiveness and robustness of GDF-Net on two popular UAV object detection datasets: VisDrone [33] and UAVDT [34].
The remainder of the paper is organized as follows. Section 2 includes a brief review of the literature for object detection in UAV images. Section 3 presents the GDF-Net architecture and details the individual components within the framework. Section 4 describes the experimental settings, results, as well as the sensitivity analysis. Section 5 presents a discussion on the limitations and future directions, followed by conclusions in Section 6.

Related Work
Object detection, which aims to identify and localize objects of interest, has long been a hot topic in the computer vision domain. To detect objects in UAV images, early algorithms leverage the extraction of selected features and background information. Researchers in [6,13,14] detect objects in UAV images by generating a saliency map computed from the image background. Kalantar et al. [39] conduct object detection based on region adjacency graphs of visual appearance and geometric properties to facilitate the separation of background from objects. Portmann et al. [15] detect pedestrians in UAV images using background subtraction and HoG feature extraction. Although these methods have proved effective in terms of detection accuracy, they rely largely on the descriptive method of features and the perspectives of images [8,16]. In addition, it is difficult for these methods to extract overlapping objects, which commonly appear in congested regions, and this largely reduces the generalization capability of these models when transferred among different tasks.
The success of deep learning in identifying objects in natural images allows researchers to adopt similar approaches for UAV images. For example, Wang et al. [25] experiment on numerous popular convolutional neural networks, such as SSD, Faster R-CNN, and RetinaNet, in natural images. Scholars in [53][54][55] develop various types of object detectors using enhanced deep convolutional neural networks based on SSD. Methodologies in [56,57] are designed based on improved YOLOv2 or YOLOv3 object detection algorithms, respectively. Different from early algorithms based on selected features and background information, the deep learning based approaches automate object detection in UAV images and are transferable to different datasets. However, these object detection algorithms are designed for standard natural imagery and have not been optimized for detecting objects in UAV images. To improve detection specifically in UAV images, scholars in [26,27] integrate object detection and depth prediction for images obtained from micro UAVs. However, the reliance on stereo image inputs greatly limits the potential application of these methods, as stereo images are often difficult to acquire.

Methodology
In this study, we propose a global density fused convolutional network (GDF-Net) for object detection in UAV images, which cascades a novel Global Density Model (GDM) to a base network, aiming to promote object distribution learning. The GDM uses pyramid features from FPN and fuses multiple features into global density fused features. The proposed GDF-Net promotes feature alignment among objects distributed in congested scenes, which further improves the performance of object detection in UAV images. In the remainder of this section, we formulate the proposed GDF-Net and detail the structures of the Backbone Network, the GDM, and the Object Detection Network.

Approach Overview
Let us consider a typical object detection problem over an input space $X$ and a detection ground truth space $Y$. The goal of an object detection algorithm is to learn the mapping $M$ from $X$ to $Y$, i.e., $M: X \rightarrow Y$. In general, existing approaches based on deep learning first learn a deep feature $F$ that represents $X$ and then obtain $Y$ from $F$ via an object regression and classification network. These algorithms can be abstracted using Equation (1), where $M_1$ denotes the mapping $M$ from $X$ to $Y$. Given that UAVs operate with varying perspectives and flying altitudes, objects can be distributed in congested scenes, which increases the difficulty of object regression from $F$ alone. To address this issue, our proposed GDF-Net applies distribution learning that integrates the global features of an image with the features extracted by existing object detection networks. The GDF-Net introduces a Global Density Model (GDM), which learns a global feature $G$ that describes object distribution in UAV images from the input $X$ and the object density domain $H$. Therefore, the mapping problem in our method can be defined as $M_2$ in Equation (2).
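The two mappings referred to as Equations (1) and (2) can be written compactly. The block below is one reading that is consistent with the definitions in the preceding paragraph, not necessarily the authors' exact notation. In particular, the way the global feature $G$ is placed next to $F$ in $M_2$ is an assumption, and the density domain $H$ in which $G$ is learned is not shown explicitly.

```latex
\begin{align}
M_1 &: X \rightarrow F \rightarrow Y, \tag{1} \\
M_2 &: X \rightarrow \{F, G\} \rightarrow Y. \tag{2}
\end{align}
```

Under this reading, the only structural change from $M_1$ to $M_2$ is that the detector no longer regresses from $F$ alone but from $F$ together with the global density feature $G$.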
An overview of our proposed GDF-Net approach is presented in Figure 1, where the GDM generates global density fused features for distribution learning using pyramid features.The Object Detection Network is leveraged to perform bounding box regression and target categorization using global density fused features as input.

Backbone Network
The goal of the Backbone Network is to generate high-dimensional features from input images by employing deep convolutional neural networks. In GDF-Net, the Backbone Network is adopted from widely used feature extraction networks, such as ResNet-50 [58]. For an input image $I$, the Backbone Network obtains a feature collection $L_b$ that contains deep features from the input image. As shown in Figure 1, we leverage a Feature Pyramid Network (FPN) [44] after the Backbone Network to detect multi-scale objects. If we define the Backbone Network and FPN respectively as $F_b^1$ and $F_b^2$, the FPN features $L_f$ can be represented as $L_f = F_b^2(L_b) = F_b^2(F_b^1(I))$, where $L_f = \{L_f^1, L_f^2, ..., L_f^5\}$ are pyramid features from FPN (shown in the orange rectangle in Figure 1), enabling multi-scale object detection in UAV images. FPN takes a light top-down and bottom-up pathway with lateral connections to transform multi-level features into integrated pyramid features. Note that all features in $L_f$ have 256 channels, with different scales of feature height and width.
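To make the multi-scale structure of the pyramid features concrete, the sketch below lists the shape of each of five pyramid levels for a given input size. The strides (4, 8, 16, 32, 64) and the round-up rule are assumptions based on a common FPN configuration and are not stated in the text; only the 256-channel width and the five levels come from the description above. The helper name `pyramid_shapes` is hypothetical.

```python
import math


def pyramid_shapes(height, width, strides=(4, 8, 16, 32, 64), channels=256):
    """Return (channels, h, w) for each pyramid level.

    Assumption: level i downsamples the input by strides[i], and fractional
    sizes are rounded up. Real frameworks may pad the input first, which
    changes the exact numbers.
    """
    return [
        (channels, math.ceil(height / s), math.ceil(width / s))
        for s in strides
    ]


if __name__ == "__main__":
    # 1200 x 675 (width x height) is the VisDrone training size reported
    # in the implementation details.
    for level, shape in enumerate(pyramid_shapes(675, 1200), start=1):
        print(f"pyramid level {level}: {shape}")
```

The point of the listing is only that every level keeps the same channel count while height and width shrink, which is why a resizing step is needed before the levels can be averaged.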

Global Density Model (GDM)
We design the Global Density Model with the aim of learning the global distribution of the targets. Specifically, we employ dilated convolutional networks, a technique to enlarge receptive fields and extract deeper features without losing resolution [30], to obtain global density features. The architecture of GDM is inspired by methods in crowd counting tasks, where objects distributed in congested scenes usually create challenges for the counting algorithms. To solve this problem, scholars in [30,31] design dilated convolutional networks on deep features extracted from input images and produce density maps of images for crowd distribution analysis. Similarly, we believe that object detection from UAV images can also benefit from the learned object distribution, especially in congested scenes.
Compared with the approaches in crowd counting tasks that regress the number of objects in each image [30,31], object detection networks [35,36,38] focus on both object locations and categories, determined by the multi-scale features extracted from deep convolutional neural networks. As stated in Section 3.2, pyramid features from FPN serve as input to GDM. In order to apply distribution learning to integrated information from all pyramid features ($L_f$ in Figure 1), our designed GDM generates an integrated feature $L_c$ from the input features $L_f$ (similar to the feature integrating operation in [31]). Taking the integrated feature $L_c$ as input, GDM then produces a refined density feature $L_d$ via dilated convolutional networks, for the purpose of enlarging receptive fields for distribution learning. In order to integrate the density feature $L_d$ and the pyramid features $L_f$, the GDM architecture further scatters the refined density feature $L_d$ to multi-level features $L_g$.
In detail, the GDM takes the FPN features $L_f$ (shown in the orange rectangle in Figure 1) and produces cascaded multi-level features $L_g$ (shown in the green rectangle in Figure 1) with global receptive fields. The output features of GDM are the concatenation of the input pyramid features and contain information regarding the global density of targets. Assuming $F_g$ is the function of GDM, $L_g$ can be obtained as $L_g = F_g(L_f)$. Figure 2 illustrates the detailed structure of the proposed GDM. The process of the designed GDM can be divided into the following three steps. We first gather the multiple FPN features $L_f$ into a refined feature $L_c$, as the object distribution information requires features with an integrated perspective of the input image. After that, we generate a global feature $L_d$ by transferring the refined feature $L_c$ to a density domain using dilated convolutional networks. Finally, through a residual path, the refined density feature $L_d$ is scattered to multi-level features $L_g$, which are adopted as input for the Object Detection Network in GDF-Net. Specifically, the integrated multi-level feature $L_c$ can be presented as $L_c = \mathrm{Avg}\left(R_1\left(L_f^i \xrightarrow{s} L_f^l\right)\right)$, taken over all pyramid levels $i$, where the parameter $l$ selects the $l$-th level feature $L_f^l$, whose size specifies the size of the refined feature $L_c$, $\mathrm{Avg}(\cdot)$ denotes the average operation, and $R_1$ represents the resizing operation. Here, $R_1$ differs across the levels $L_f^i$. If $i < l$, $R_1$ denotes a max pooling function $P$. Otherwise, $R_1$ denotes a resizing function $U$ via nearest interpolation. $A \xrightarrow{s} B$ means that the operation is applied to feature $A$ according to the size of feature $B$. $R_1$ can be defined as:

$$R_1\left(L_f^i \xrightarrow{s} L_f^l\right) = \begin{cases} P\left(L_f^i \xrightarrow{s} L_f^l\right), & i < l \\ U\left(L_f^i \xrightarrow{s} L_f^l\right), & \text{otherwise} \end{cases}$$

Dilated convolutional layers, which enlarge receptive fields and consequently improve detection accuracy, have proved efficient in multiple tasks [30,32,59]. Dilated convolutions, also called atrous convolutions, introduce a new parameter, the dilation rate, which sets the number of interval pixels between the elements of a convolution kernel [59]. The dilation rate defines the size of the receptive field in GDM, which is crucial to the capacity of models to learn global features. The designed GDM adopts dilated convolutional layers to extract global distribution information from the refined feature $L_c$, further transferring it to $L_d$. Since $L_d$ is refined from integrated features of the whole image, it represents the density domain with global information in our GDF-Net. The generation of $L_d$ can be represented as $L_d = L_c \ast D_1 \ast D_2 \ast \cdots \ast D_k$, where $\ast$ denotes the operation of convolution and $\{D_1, D_2, ..., D_k\}$ denote the dilated convolutional operations. Here, we set the parameter $r$ as the dilation rate for these dilated convolutional functions. If $r = 1$, a dilated convolution is the same as a normal convolution.
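The effect of the dilation rate on the receptive field can be checked with a little arithmetic. The following is a minimal sketch, assuming 3 × 3 kernels with stride 1 (the kernel size is not stated in this section): a kernel of size $m$ with dilation rate $r$ spans $m + (m - 1)(r - 1)$ pixels along each axis, and each stacked stride-1 layer enlarges the receptive field by that span minus one. The function names are hypothetical.

```python
def effective_kernel(kernel_size, dilation):
    """Span in pixels covered by one dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def stacked_receptive_field(num_layers, kernel_size=3, dilation=1):
    """Receptive field of num_layers stride-1 convolutions sharing one dilation rate."""
    receptive_field = 1
    for _ in range(num_layers):
        receptive_field += effective_kernel(kernel_size, dilation) - 1
    return receptive_field


if __name__ == "__main__":
    # Six layers is an assumption matching the number of dilated layers
    # reported in the implementation details.
    for rate in (1, 2, 3):
        print(rate, effective_kernel(3, rate), stacked_receptive_field(6, 3, rate))
```

Under these assumptions, six stacked layers cover 13, 25, and 37 pixels per axis for $r$ = 1, 2, and 3, respectively, and $r = 1$ reduces to an ordinary convolution, in line with the statement above.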
In order to apply the global density feature $L_d$ at multiple scales, we employ a residual path that scatters $L_d$ to the multi-level features $L_g$: $L_d$ is combined with the pyramid features $L_f$ through $\oplus$, the element-wise sum operator, which introduces a residual path between $L_d$ and $L_f$.
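The gather and scatter steps can be illustrated on toy single-channel feature maps stored as nested lists. This is a sketch under stated assumptions, not the authors' implementation: the refine step (the dilated convolutions) is left out and the gathered feature is passed straight to the scatter step, the scatter step resizes the density feature to each level with nearest interpolation (the text does not say how this resizing is done), and all function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def resize_nearest(feat, out_h, out_w):
    """Nearest-neighbour resize of a 2D map (used as U, and for scattering)."""
    in_h, in_w = len(feat), len(feat[0])
    return [
        [feat[(y * in_h) // out_h][(x * in_w) // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def max_pool_to(feat, out_h, out_w):
    """Max pooling of a 2D map down to a target size (used as P)."""
    in_h, in_w = len(feat), len(feat[0])
    out = []
    for y in range(out_h):
        y0, y1 = (y * in_h) // out_h, ceil_div((y + 1) * in_h, out_h)
        row = []
        for x in range(out_w):
            x0, x1 = (x * in_w) // out_w, ceil_div((x + 1) * in_w, out_w)
            row.append(max(feat[j][i] for j in range(y0, y1) for i in range(x0, x1)))
        out.append(row)
    return out


def gather(pyramid, l):
    """Resize every level to the size of level l (1-based) and average them."""
    target = pyramid[l - 1]
    h, w = len(target), len(target[0])
    resized = []
    for i, feat in enumerate(pyramid, start=1):
        # Levels finer than l are pooled down; level l and coarser levels are
        # resized with nearest interpolation (identity for level l itself).
        op = max_pool_to if i < l else resize_nearest
        resized.append(op(feat, h, w))
    n = len(resized)
    return [[sum(r[y][x] for r in resized) / n for x in range(w)] for y in range(h)]


def scatter(density, pyramid):
    """Add the density map to every level through an element-wise sum."""
    out = []
    for feat in pyramid:
        h, w = len(feat), len(feat[0])
        d = resize_nearest(density, h, w)  # assumption: nearest resize per level
        out.append([[feat[y][x] + d[y][x] for x in range(w)] for y in range(h)])
    return out


if __name__ == "__main__":
    pyramid = [
        [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]],
        [[1, 1], [1, 1]],
        [[3]],
    ]
    gathered = gather(pyramid, l=2)
    fused = scatter(gathered, pyramid)  # refine step omitted in this sketch
    print(gathered)
    print([(len(f), len(f[0])) for f in fused])
```

Each output level keeps the size of the corresponding input level, which is what allows the fused features to be passed to an unchanged detection head.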

Object Detection Network
To learn the explicit mapping from the extracted global density fused features $G$ to the ground truth space $Y$, we apply the Object Detection Network, which regresses the position of each object with a regressor and classifies its category with a classifier. The Object Detection Network takes $L_g$ as input and generates the location $p_{loc}$ and category $p_{cls}$ of each object prediction $n$. Compared against the ground-truth values $\{g_{loc}, g_{cls}\}$, the loss function of GDF-Net during the training process combines, over the predictions, a location term $\Delta_1(p_{loc}, g_{loc})$ and a category term $\Delta_2(p_{cls}, g_{cls})$, where $\Delta_1(\cdot)$ and $\Delta_2(\cdot)$ define the rules for location and category offset calculation, respectively, and $\lambda$ denotes the weight between the loss of the regressor and that of the classifier. In our proposed GDF-Net, the architecture of the Object Detection Network can be instantiated with different existing algorithms, such as Faster R-CNN [35], YOLO [45], and Cascade R-CNN [36]. Here, we apply it to several two-stage detection networks, as illustrated in Figure 1, which adopt Region Proposal Network (RPN) and ROI Align structures [35]. The error calculation methods $\Delta_1(\cdot)$ and $\Delta_2(\cdot)$ vary depending on the choice of algorithm (e.g., $\Delta_1(\cdot)$ denotes the SmoothL1 loss and $\Delta_2(\cdot)$ denotes the Cross Entropy loss in Faster R-CNN [35]).
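A small numeric sketch may help to show how the two terms combine in the Faster R-CNN case. Several details here are assumptions made for illustration only: that $\lambda$ multiplies the classification term, that the loss is averaged over predictions, and the `beta` threshold of the SmoothL1 loss. The function names are hypothetical.

```python
import math


def smooth_l1(pred, target, beta=1.0):
    """SmoothL1 loss summed over box coordinates (Delta_1 in this sketch)."""
    total = 0.0
    for p, g in zip(pred, target):
        diff = abs(p - g)
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total


def cross_entropy(logits, label):
    """Cross entropy of softmax(logits) against an integer label (Delta_2)."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]


def detection_loss(predictions, targets, lam=1.0):
    """Average of Delta_1 + lam * Delta_2 over all predictions.

    Assumptions: lam scales the classification term, and the terms are
    averaged (not summed) over predictions.
    """
    total = 0.0
    for (p_loc, p_cls), (g_loc, g_cls) in zip(predictions, targets):
        total += smooth_l1(p_loc, g_loc) + lam * cross_entropy(p_cls, g_cls)
    return total / len(predictions)


if __name__ == "__main__":
    preds = [([0.0, 2.0], [0.0, 0.0])]   # (box offsets, class logits)
    gts = [([0.5, 0.0], 0)]              # (box offsets, class index)
    print(detection_loss(preds, gts, lam=1.0))
```

With $\lambda = 1$, as used in the experiments, both terms contribute with equal weight.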

Experimental Results and Analysis
We conduct a series of experiments to evaluate the performance of the proposed GDF-Net. In this section, we describe the benchmark datasets, evaluation metrics, and implementation details used in our training and testing experiments. We compare our method with several popular object detection approaches quantitatively and qualitatively to shed light on the advantage of the proposed GDF-Net framework. In addition, we further analyze the effects of the setting of the dilation rate $r$ on the performance of GDF-Net.

Datasets
The proposed GDF-Net was evaluated on two challenging UAV benchmark datasets, i.e., the VisDrone dataset [33] and the UAVDT dataset [34]. The two selected datasets simulate various scenarios of UAV object detection well, as they contain objects obtained from different sensors, with varying weather conditions, perspectives, flying altitudes, camera views, and occlusions. In our experiments, we focused on presenting the improvement in performance when the proposed GDF-Net framework is added to the selected base networks. Specifically, the original VisDrone dataset [33] consists of a total of 400 video clips. Our training set, validation set, and testing set contain 6471, 548, and 1610 images, respectively. The VisDrone dataset labels humans and vehicles in daily life in ten categories, i.e., pedestrian, person, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motor. The other benchmark dataset is the UAVDT dataset [34], which consists of UAV imagery with vehicles in three categories (car, truck, and bus) selected from videos (about 10 h long). In the UAVDT dataset, we derived 11,915 images for training and 16,580 images for testing.

Evaluation Metrics
To evaluate the performance of GDF-Net, we employed the precision and recall metrics defined in [22]. The precision metrics, including $AP$, $AP_{50}$, $AP_{75}$, $AP_S$, $AP_M$, and $AP_L$, were calculated as the ratio of correctly predicted positive observations to the total number of predicted positive observations. The recall metrics, including $AR_1$, $AR_{10}$, $AR_{100}$, $AR_S$, $AR_M$, and $AR_L$, were calculated as the ratio of correctly predicted positive observations to all ground-truth positive observations. Here, $AP$, $AP_S$, $AP_M$, $AP_L$, and all recall metrics use ten intersection over union (IoU) thresholds ([0.50 : 0.05 : 0.95]) to calculate average precision and recall results. $AP_{50}$ and $AP_{75}$ evaluate results at IoU thresholds of 0.5 and 0.75, respectively. Moreover, the subscripts {S, M, L} in these indexes denote results for objects at small, medium, and large scales. The recall index $AR_{num}$ indicates the maximum recall given num detections per image, for num = {1, 10, 100}. A detailed definition of these metrics can be found in [22]. These metrics are widely used in the existing object detection literature, and they evaluate the performance of the proposed GDF-Net in a comprehensive manner.
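The role of the IoU thresholds can be illustrated with a short sketch. It computes the IoU of two boxes and lists the ten thresholds from 0.50 to 0.95 at which a prediction would still count as a match. The box format `(x1, y1, x2, y2)` and the function names are assumptions for illustration; the full matching and averaging procedure of [22] is not reproduced here.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten thresholds [0.50 : 0.05 : 0.95].
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def matched_thresholds(pred_box, gt_box):
    """Thresholds at which pred_box counts as a correct detection of gt_box."""
    value = iou(pred_box, gt_box)
    return [t for t in IOU_THRESHOLDS if value >= t]


if __name__ == "__main__":
    prediction = (0, 0, 100, 78)
    ground_truth = (0, 0, 100, 100)
    print(iou(prediction, ground_truth))
    print(matched_thresholds(prediction, ground_truth))
```

A prediction with an IoU of 0.78 counts as correct for $AP_{50}$ and $AP_{75}$ but is rejected at the stricter thresholds, which is why $AP$ averaged over all ten thresholds is lower than $AP_{50}$ for the same detector.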

Implementation Details
The proposed GDF-Net framework was implemented with the PyTorch framework and was run on a desktop equipped with an Intel(R) Core(TM) i7-9800X CPU @ 3.80 GHz and two NVIDIA GeForce RTX 2080 Ti GPUs with 11 GB of memory each. All experiments were conducted on an Ubuntu 16.04 system with the two GPUs running in parallel. The whole program is implemented on top of the publicly available OpenMMLab Detection [60] framework on the PyTorch platform. We initialize the Backbone Network parameters from the ResNet-50 [58] model (pre-trained on ImageNet). Other parameters in GDM and the Object Detection Network are randomly initialized.
We focused on evaluating the performance of the proposed GDF-Net framework when it is attached to existing popular base networks. All experiments were conducted following the default parameter settings of the base networks and without data augmentation. In light of the different scales of images in the VisDrone dataset, we resize them to 1200 × 675 pixels during the training process. Images from the UAVDT dataset were fed to GDF-Net at their original size, i.e., 1024 × 540. Moreover, we chose Stochastic Gradient Descent (SGD) as the optimizer [61] and set the momentum to 0.9, the weight decay to $10^{-4}$, and the initial learning rate to 0.02. The experiments using Grid R-CNN were trained for 25 epochs [38], and all other experiments were trained for 12 epochs. In our experiments, we empirically set $l = 2$ and $\lambda = 1$ in Section 3.4. The dilation rate $r$ in Section 3.3 was set to 2 based on the experiments in Section 4.2.3. In addition, we set $k$ to 6 and the channels of $\{D_1, D_2, ..., D_k\}$ to [512, 512, 512, 256, 128, 64], following the experiments in [30].
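The training settings above can be collected in one place. The fragment below is a sketch in the dictionary style used by MMDetection configuration files; the exact keys depend on the framework version, so every name here should be read as an assumption. The `gdm` entry in particular is a hypothetical grouping of the GDM hyperparameters and does not correspond to a published module.

```python
# Optimizer settings reported in the text.
optimizer = dict(type="SGD", lr=0.02, momentum=0.9, weight_decay=1e-4)

# Training length: 12 epochs by default, 25 for Grid R-CNN.
total_epochs = dict(default=12, grid_rcnn=25)

# Hypothetical grouping of the GDM hyperparameters.
gdm = dict(
    gather_level=2,                                   # l
    dilation=2,                                       # r
    dilated_channels=[512, 512, 512, 256, 128, 64],   # D_1 ... D_k, k = 6
)

# Weight between the regression and classification losses (lambda).
loss_weight = 1.0

# Input sizes as (width, height) in pixels.
input_sizes = dict(visdrone=(1200, 675), uavdt=(1024, 540))
```

Keeping these values together makes it easier to confirm that a base network and its GDF counterpart differ only in the presence of the GDM.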

Quantitative Evaluation
We evaluated the performance of the proposed GDF-Net framework against state-of-the-art object detection methods by adding the designed GDM to these existing networks (the base networks). These base networks include Faster R-CNN [35], Cascade R-CNN [36], Free Anchor [37], and Grid R-CNN [38]. We term them Faster GDF, Cascade GDF, Free Anchor GDF, and Grid GDF when GDM is added to Faster R-CNN, Cascade R-CNN, Free Anchor, and Grid R-CNN, respectively. All parameters remain unchanged within each set of comparisons. The quantitative evaluations on the VisDrone dataset and the UAVDT dataset can be found in Tables 1 and 2, respectively, where better performances are highlighted in bold.
In general, our method achieves the best performance on both datasets in almost all precision and recall metrics (up to 1.2% gains in $AP$ and 0.9% gains in $AR_{100}$). The experiments on multiple base algorithms demonstrate that our proposed GDF-Net improves both precision and recall compared with the base algorithms alone, suggesting its robustness and compatibility. It is worth noting, however, that a few base networks underperform when the designed GDM is added to them, especially under the evaluation using $AP_L$. At the same time, compared with $AP_S$ and $AP_M$, $AP_L$ has seen the most improvement in various experiments. The results reveal that GDF-Net performs well in learning global distribution patterns while retaining the balance among objects of diverse scales. However, improving the detection performance for objects of varying scales simultaneously remains a challenge, and the gains for large objects are considerably higher than those for small objects (up to 4.7% gains in $AP_L$ and 0.3% gains in $AP_S$ in Table 1). Therefore, given the trade-off between detecting large objects and small objects in UAV images with a single model, improving the precision for small targets has greater significance for the overall detection evaluation $AP$.
To highlight the utility of the proposed method, we compare the accuracy, complexity, and speed of the baseline models and the baseline models coupled with GDF-Net on the UAVDT dataset (Table 3). Params and FLOPs respectively denote the number of parameters and the amount of multiply-add computation [62], while the speed measures the processing speed in frames per second (fps) [57]. For a fair comparison, results are measured on the same GPU with the same settings described in Section 4.1.3. We observe that, with the attachment of GDF-Net, the number of parameters increases slightly, at a slight cost in FLOPs and speed. However, considering the improvement in general accuracy and in detection, especially in congested scenes (illustrated in Section 4.2.2), we believe the advantages of attaching GDF-Net outweigh the disadvantages.

We present the experimental results generated by the proposed GDF-Net in Figure 3 (VisDrone dataset) and Figure 4 (UAVDT dataset), where all objects are shown in green rectangles marked by their categories. These detection results are based on the Faster GDF-Net method trained on the two benchmark datasets independently. From the results, we observe that the objects are successfully detected with great accuracy, despite the heterogeneity in backgrounds, perspectives, flying altitudes, scales, and appearances, demonstrating the strong performance of the proposed GDF-Net in object detection from UAV images. However, we also find that GDF-Net fails to detect obscured objects or objects with a hazy background (see the last row in Figures 3 and 4), suggesting that further improvement is still needed. Furthermore, to visualize the comparison between our proposed networks and the baselines, we randomly selected examples of detection results from Faster R-CNN and Faster GDF-Net on the VisDrone dataset (Figure 5). From the zoom-in views (red rectangles), we observe that the proposed GDF-Net is able to detect more objects in highly congested regions (see the comparison between Figure 5k,l). We also notice that GDF-Net produces fewer wrong detections in highly congested regions (see the comparison between Figure 5g,h), presumably due to the additional Global Density Model (GDM).

We further visualize $L_f$ and $L_g$ in base networks alone and in base networks combined with the designed GDM (Figure 6). As described in Section 3.3, $L_f = \{L_f^1, L_f^2, ..., L_f^5\}$ are pyramid features from FPN, serving as inputs to GDM, while $L_g = \{L_g^1, L_g^2, ..., L_g^5\}$ are generated by GDM through operations that include gathering, refining, and scattering. Thus, the visualization of $L_f$ and $L_g$ explicitly reflects the functionality of GDM, which helps to explain the difference between base models and GDF-Nets. Images in rows 1-5 and columns 1, 3, 5 in Figure 6 are the visualizations of $\{L_f^1, L_f^2, ..., L_f^5\}$, while images in rows 1-5 and columns 2, 4, 6 are visualizations of the corresponding $\{L_g^1, L_g^2, ..., L_g^5\}$. All images are presented in HSV color space using channel maximum squeeze. The comparison between $L_f$ and $L_g$ illustrates that $L_g$ is generally more accurate than the corresponding $L_f$, as evidenced by the fact that $L_g$ features show more consistency with the input image than $L_f$ (see regions marked by the black and white rectangles in Figure 6). From the detected objects (last row in Figure 6), the proposed GDF-Net achieves better performance, as a number of objects are missed by Faster R-CNN but successfully detected by Faster GDF (e.g., the white vehicle in the lower right corner).

Sensitivity Analysis
The dilation rate, i.e., the parameter $r$, defines the size of the receptive field in GDM, which is crucial to the capacity of models to learn global features. To optimize $r$, we test different values of $r$ (equal intervals from 1 to 3) on Faster GDF, Cascade GDF, and Grid GDF, which show high performance in Table 1. The experiments are conducted on the VisDrone dataset. As shown in Table 4 (better performances are highlighted in bold), regardless of the setting of $r$, GDF-Nets generally outperform the corresponding base networks, as evidenced by the higher values in both precision and recall. As $r$ increases from 1 to 2, almost all evaluation indexes increase accordingly, suggesting that the performance of our network improves with a slightly larger receptive field that facilitates the extraction of global distribution features. However, from $r = 2$ to $r = 3$, model performance generally declines, indicating that an excessive enlargement of $r$ limits the improvement in performance. We can conclude that $r = 2$ is the optimal value in this study, and that it is important for the proposed GDF-Net to balance the enlarged receptive field obtained with a larger $r$ against the detailed structure preserved with a smaller $r$.

Limitations and Future Directions
Although the proposed GDF-Net brings promising results for object detection in UAV images, some notable issues remain and call for further research. Firstly, four popular algorithms are selected as base algorithms in our experiments. However, we acknowledge that algorithms such as HTC [63] and YOLOv4 [48] have become more popular recently. Thus, the potential of those methods in the proposed framework deserves further investigation.
Secondly, to fairly evaluate the performance of the designed GDM on existing base networks and to mitigate the impact of differences in parameter settings, we set parameters (e.g., $\Delta_1(\cdot)$, $\Delta_2(\cdot)$, and $l$) according to the empirical values from the base algorithms and include no data augmentation in any experiment. However, studies have shown that parameter settings, loss settings, and data augmentation have a great impact on the performance of deep learning models [64][65][66][67]. Further research is needed to optimize the relevant parameters, experiment with loss settings, and employ data augmentation methods on these UAV image datasets.
Thirdly, although our proposed GDF-Net improves performance compared with the base networks, it usually fails to detect occluded objects, especially in congested scenes and with a hazy background. Fortunately, numerous studies have been conducted to address this issue, most notably the work in [68], which applied a novel aggregation loss function and a pooling method for occlusion detection, providing a great opportunity to identify partially obscured objects. In future studies, we plan to incorporate the aforementioned methods into our GDF-Net.
Lastly, dilated convolutional networks are applied to deliver larger receptive fields and generate global density fused features. Although the utility of dilated convolutional networks has been demonstrated in this study, other emerging techniques, for example, the attention mechanism [69] and the Generative Adversarial Network (GAN) [70], have received growing attention. The potential of those methods in rendering large receptive fields and how they can be incorporated into the proposed GDF-Net framework deserve further exploration.

Conclusions
Object detection in UAV imagery remains a challenging task, as UAVs are maneuverable with high speed, multiple viewpoints, and varying altitudes, which gives UAV imagery unique characteristics such as varying perspectives, scales, and occlusion. In addition, objects in UAV images are often heterogeneously distributed, vary in size, and appear at high density, causing great difficulty for object detection using existing algorithms that are not optimized for UAV images. In this paper, we propose a novel global density fused convolutional network (GDF-Net) designed specifically for object detection in UAV images. The proposed GDF-Net consists of a Backbone Network, a Global Density Model (GDM), and an Object Detection Network. We test the effectiveness and robustness of the proposed GDF-Net on the VisDrone dataset and the UAVDT dataset. The novelty of GDM is that it refines density features via dilated convolutional networks, aiming to deliver larger receptive fields and to facilitate the generation of global density fused features. Compared with the base networks used on their own, the addition of GDM improves model performance in both recall and precision. We also find that the designed GDM facilitates the detection of objects in congested scenes with high distribution density. The presented GDF-Net framework can be instantiated not only with the base networks selected in this study but also with other popular object detection models.

Figure 1. An overview of the proposed global density fused (GDF)-Net architecture. The GDF-Net consists of the Backbone Network, Global Density Model, and Object Detection Network, where the Backbone Network extracts pyramid features with typical networks, the Global Density Model integrates object distribution information into pyramid features, and the Object Detection Network locates and categorizes objects from UAV images.

Figure 2. A detailed structure of the Global Density Model. In GDM, the FPN features $L_f$ are fused into a refined feature $L_c$ via a resizing operation $R_1$ (shown in blue) and an average function Avg (in the purple rectangle). The global feature $L_d$ is obtained from dilated convolutional networks $D$ and eventually scattered to the multi-level features $L_g$ by a residual path.
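The data flow described in this caption can be sketched in a few lines of plain Python. This is only a minimal illustration under strong assumptions that are not taken from the paper: feature maps are single-channel 2D lists, resizing is nearest-neighbor, the dilated network $D$ is reduced to one 3×3 dilated convolution with a fixed kernel, and the residual path is modeled as resizing $L_d$ back to each level and adding it to $L_f$. All function names are hypothetical.

```python
def resize_nearest(fmap, out_h, out_w):
    """Nearest-neighbor resize of a 2D list (stand-in for the resizing step)."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [[fmap[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
            for i in range(out_h)]


def average(maps):
    """Element-wise mean of equally sized 2D maps (the Avg step)."""
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum(m[i][j] for m in maps) / len(maps) for j in range(w)]
            for i in range(h)]


def dilated_conv3x3(fmap, kernel, dilation):
    """Zero-padded 3x3 convolution whose taps are `dilation` pixels apart."""
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(3):
                for b in range(3):
                    y, x = i + (a - 1) * dilation, j + (b - 1) * dilation
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[a][b] * fmap[y][x]
            out[i][j] = acc
    return out


def gdm(l_f, target_hw, kernel, dilation):
    """L_f -> L_c (resize + average) -> L_d (dilated conv) -> L_g (residual add)."""
    th, tw = target_hw
    l_c = average([resize_nearest(f, th, tw) for f in l_f])
    l_d = dilated_conv3x3(l_c, kernel, dilation)
    l_g = []
    for f in l_f:
        h, w = len(f), len(f[0])
        d = resize_nearest(l_d, h, w)  # scatter the global feature to this level
        l_g.append([[f[i][j] + d[i][j] for j in range(w)] for i in range(h)])
    return l_g


# Toy pyramid with three levels of constant features and an identity kernel.
levels = [[[1.0] * s for _ in range(s)] for s in (8, 4, 2)]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
fused = gdm(levels, (4, 4), identity, dilation=2)
print([len(f) for f in fused], fused[0][0][0])
```

Each output level keeps the spatial size of its input level, so a structure of this kind can sit between a feature pyramid and a detection head without changing the head's interface.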

Figure 3. Detection results of the proposed GDF-Net approach on the VisDrone dataset. Detections are labeled with green rectangles and marked with associated categories and confidence scores.

Figure 4. Detection results of the proposed GDF-Net approach on the UAVDT dataset. Detections are labeled with green rectangles and marked with associated categories and confidence scores.

Figure 5. Selected examples of detection results based on Faster R-CNN and the proposed Faster GDF approach. Compared with the results from Faster R-CNN, the proposed Faster GDF detects more densely distributed objects with fewer false positives.

Figure 6. Visualization of the $L_f$ and $L_g$ features in the proposed GDF-Net. The images in rows 1 to 5 show the visualization results of the five levels of $L_f$ or $L_g$ based on Faster GDF. The last row presents object detection results for three images from the VisDrone dataset (each pair of pictures shares the same input image), where (a,c,e) are detected by Faster R-CNN and (b,d,f) are detected by Faster GDF.

Table 1. Comparisons of detection performance on the VisDrone dataset. Columns: $AP_{75}$, $AP$, $AP_S$, $AP_M$, $AP_L$, $AR_1$, $AR_{10}$, $AR_{100}$, $AR_S$, $AR_M$, $AR_L$.
Note: better performances are highlighted in bold.

Table 2. Comparisons of detection performance on the UAVDT dataset.

Table 3. Accuracy, complexity, and speed comparison on the UAVDT dataset.

Table 4. Sensitivity analysis on the VisDrone dataset. We analyze the effect of the $r$ setting described in Section 3.3 with multiple algorithms; all experiments are performed with GDF-Net.