A Fast and Accurate Few-Shot Detector for Objects with Fewer Pixels in Drone Image

Abstract: Unmanned aerial vehicles (UAVs) are important in modern warfare, and object detection performance influences the development of related intelligent drone applications. At present, the target categories of UAV detection tasks are diversified. However, the lack of training samples for novel categories degrades detection performance. At the same time, many state-of-the-art detectors are not suitable for drone images due to the particular perspective and the large number of small targets. In this paper, we design a fast few-shot detector for drone targets. It adopts the anchor-free idea of fully convolutional one-stage object detection (FCOS), which leads to a more reasonable definition of positive and negative samples and faster speed, and introduces a Siamese framework with a more discriminative target model and an attention mechanism to integrate similarity measures, which enables our model to match objects of the same category and distinguish objects of different classes from the background. We propose a matching score map to exploit the similarity information of the attention feature map. Finally, through Soft-NMS, the predicted detection bounding boxes for the support category objects are generated. We construct the DAN dataset as a combination of DOTA and NWPU VHR-10. Experiments on the DAN dataset show that our model outperforms many state-of-the-art methods on few-shot detection tasks for drone images.


Introduction
Object detection has played an increasingly important role in drone-based applications. Traditional object detection methods, such as the histogram of oriented gradients descriptor (HOG) [1] and the deformable part model (DPM) [2], perform well when detecting a certain specific object, but perform poorly when detecting multiple classes of objects. Since the advent of neural network frameworks, object detection has progressed very fast. Several state-of-the-art object detectors (such as Faster R-CNN [3], YOLO [4], SSD [5], etc.) have been proposed with high accuracy on conventional detection datasets (PASCAL VOC 2007, PASCAL VOC 2012, COCO, etc.). Nevertheless, object detection for drone images remains difficult owing to the particularities of such images. Unmanned aerial vehicle (UAV) images differ from images in generic datasets in the following respects.
Diversity of scale: Drone images are taken from different altitudes, so the sizes of targets vary, even for targets of the same kind.
Different direction of perspective: Aerial images are generally viewed from high altitude, whereas most conventional datasets use a ground-level perspective, so the appearance of the same object differs.
Densely arranged small targets: Many targets in aerial images are small (covering only dozens or even just a few pixels), which leaves little target information.

Related Works
Since the main contribution of this paper is an anchor-free one-stage model for few-shot drone image detection, in this section we briefly introduce two aspects related to our work: general object detection and few-shot detection.
General object detection. Object detection is a key technology of computer vision. In the early years, object detection was usually formulated as a sliding-window classification problem using handcrafted features such as HOG and DPM. With the development of deep neural networks (DNNs), many classical backbone networks have been proposed for image classification, such as AlexNet [20], VGG16 [21], GoogLeNet [22], Darknet-19 [4], etc., which has made CNN-based methods more and more popular. Most object detection models can be divided into two categories: two-stage models and one-stage models. Compared with one-stage models, two-stage models have an additional step that generates proposals with a region proposal network (RPN). The RPN filters out many negative locations to ease the class imbalance problem and provides refined anchors for the subsequent classification and regression, which is why two-stage models usually achieve higher detection accuracy, but slower inference speed, than one-stage models. Girshick et al. [23] used selective search to generate region proposals in R-CNN. Afterwards, following the structure of R-CNN, Fast R-CNN [24] introduced a RoI pooling layer to extract the region proposals generated by selective search from a shared feature map. Considering the enormous time cost of selective search, in Faster R-CNN [3] an RPN replaced selective search to generate refined anchors for detection. Reference [25] proposed an approach based on multi-scale balanced sampling (MB-RPN) to address the difficulty of matching small objects and detecting multi-scale objects, and it achieved high accuracy on the DOTA dataset. Considering its high efficiency, the one-stage approach tends to be the first choice for developing intelligent detection applications. Redmon et al. [4] proposed an extremely fast one-stage model, YOLO, which used a single feed-forward neural network to directly predict object classes and locations at the same time. Afterwards, YOLOv2 [26] improved YOLO in several aspects, i.e., batch normalization, a high-resolution classifier, dimension clusters, etc. Another classical one-stage model, SSD [5], made a breakthrough in multi-scale object detection by introducing multi-scale feature maps and detecting objects of different sizes on the feature layer of the corresponding scale. DSSD [17] introduced additional context into SSD by combining a deconvolutional high-level feature map with a high-resolution, low-level feature map to improve accuracy. DSOD [27] used DenseNet as the backbone to let the training objective supervise the optimization of the parameters of earlier layers, hence realizing a model that can be trained from scratch. RefineDet [28] introduced an anchor refinement module (ARM) into the one-stage model to filter out negative anchors, reducing the search space for the classifier, and to coarsely adjust the locations of anchors for the subsequent regressor. Reference [29] proposed SSD7-FFAM, a seven-layer convolutional lightweight real-time detector for embedded devices, which applied a novel feature fusion and attention mechanism to alleviate the impact of reducing the number of convolutional layers, and it performed well on NWPU VHR-10. Nevertheless, since the accuracy of these one-stage models trails that of two-stage methods, improving the detection accuracy of one-stage models remains an enormous challenge in object detection.
Few-shot detection. Few-shot detection refers to learning a target object from only a few training samples. References [30][31][32] attempted to achieve few-shot learning by obtaining a general prior that is shared across different categories. References [33][34][35] proposed the use of distance measures for few-shot learning. An increasingly popular solution for few-shot learning is meta-learning, which refers to designing a strategy to guide the supervised learning in each task, so that the model acquires the ability of learning to learn. In this field, a Siamese network was proposed in [36]; it is composed of two weight-shared networks that extract the features of the support image and the query image, respectively, and the model judges whether an object of the support category is present in the query image by comparing the two feature maps. Vinyals et al. [33] proposed the Matching Network to learn the task of finding the most similar class for the target among a small set of labeled images. The Prototypical Network [34] and the Relation Network [35] use distance measures to realize classification. Ravi and Larochelle [37] proposed an LSTM meta-learner dedicated to learning a general agent to guide parameter optimization. Similar to [37] in optimizing for fast adaptation, Model-Agnostic Meta-Learning (MAML) [38] performed well in detecting novel categories by optimizing a task-agnostic network. In recent years, several works on few-shot object detection [7][8][9][10] have been proposed. However, they learn category-specific feature embeddings and require fine-tuning to detect novel categories. It was difficult to directly apply this progress to detection tasks on novel categories until [11] proposed a general-purpose few-shot object detector. Through a well-designed Attention-RPN, a Multi-Relation Detector, and a contrastive training strategy, that network learns the matching relationship between targets by training on the high-diversity dataset FSOD, and can reliably detect novel categories without fine-tuning. It inspires us to train the model to learn a general matching relationship to distinguish objects of the same category from those of different categories, instead of learning the details of each category separately.

Network Architecture and Detection Pipeline
The overall architecture of our network is shown in Figure 1. Specifically, our model consists of multiple branches, where one branch is for the query set and the others are for the support set. To simplify the diagram, we draw only one support branch, which contains a novel category. We build a weight-shared backbone to extract feature maps for both the support set and the query set. We find that the objects in drone images are generally small and do not need a large receptive field to be detected. Therefore, in order to extract better features for small objects, three feature maps are extracted from the shallow layers of the backbone. However, the operation of our model differs from that of most Siamese-based models, which apply the same processing to both feature maps and then directly compare them.

Figure 1. Network overall architecture. The feature maps of the support image are combined by deconvolution and element-wise product, while those of the query image are reserved to form the FPN. The combined feature map of the support is used to extract the initial target model and to iteratively optimize it. The final target model is used to compute the similarity with the feature maps from the query set, producing the H × W × C attention feature map, which helps our model match the support categories among the regression results. For simplicity, only one support branch is shown. C represents the number of novel categories to be detected, that is, the number of support branches. Performing depth-wise cross correlation between the final target model of each support branch and the feature map from the query branch yields an attention feature map of H × W × 1; channel-wise concatenation of these maps yields the final attention feature map of H × W × C. The attention feature map is then used to obtain the H × W × C matching score map, which indicates the probability of each pixel in the regression bounding boxes belonging to the support category, computed by averaging the per-category probabilities of the pixels within each regression box. The first classification result of H × W × C is obtained by repeating, C times along the channel dimension, an H × W × 1 map that indicates the probability of each pixel belonging to the foreground. The final classification result, which indicates the probability of each pixel belonging to each category, is obtained by the element-wise product of the first classification result and the matching score map. Afterwards, this final classification map is used in turn to compute the score of each regression bounding box for each category during post-processing.

For the support branch, our goal is to obtain an optimal target model that represents the support category regardless of its size. Reference [17] introduced a deconvolutional layer to adjust a high-level feature map to the size of a low-level feature map and achieved good accuracy by combining them with an element-wise product. Thus, we combine our feature maps through a step-by-step deconvolution and element-wise product operation. As a drone image often contains many target objects, we take the precise RoI pooling (PrRoI Pooling) [39] features of these target objects from the combined feature map and concatenate them along the channel dimension. Next, by averaging over channels, we obtain the initial target model for the support category.
A problem is that the initial target model contains only the information of the support category and ignores background information, which leaves the model unable to discriminate when the background resembles the support category. Therefore, in order to utilize background information, we introduce the feature map obtained by processing the combined feature map with a 1 × 1 convolution for iterative optimization. A correlation value is obtained by calculating the depth-wise cross correlation between the initial target model and this feature map. Furthermore, we introduce an annotation map according to the spatial distance between each pixel position and the center of its annotation box. Specifically, we use a method similar to a multi-dimensional Gaussian distribution to determine the annotation map. Thus, we iteratively optimize the target model by reducing the gap between the cross-correlation map and the annotation map.
For the query branch, we use the three feature maps P1, P2, P3, which are produced by top-down connection of the layers extracted from the backbone to form a feature pyramid network (FPN) [40], so that our model can regress different objects at three different scales. Two tensors, H × W × C (C = 1 for a single class in the support set) and H × W × 4, are obtained from these feature maps through two heads for classification and bounding box regression, respectively. C represents the number of novel categories to be detected, that is, the number of support branches. Since our model is trained on a large number of different categories and then detects objects of new categories, the features we extract can be understood as features extracted according to the general rules of objects. Therefore, the H × W × C tensor is obtained by repeating, C times along the channel dimension, an H × W × 1 map that indicates the probability of each pixel belonging to the foreground, and the H × W × 4 tensor represents the predicted bounding box (offsets to the left, top, right, and bottom) of each pixel on the feature map. Then, we utilize the optimized target model obtained from the support set and an attention mechanism to help determine whether each regression bounding box belongs to the category in the support image. To be specific, we produce an attention feature map by computing the similarity between the final target model of the support and the feature map of the query via depth-wise cross correlation. The attention feature map is then used to compute the matching score map, which indicates the probability of each pixel in the regression bounding boxes belonging to the support category by combining the information of the regression bounding boxes. This probability is then combined with the probability of each pixel in the feature map belonging to the foreground in the query image, and the result finally guides the filtering of the regression bounding boxes in the post-processing stage.
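As a minimal illustrative sketch (not the authors' implementation; the function name and array shapes are assumptions), the following NumPy snippet shows how the class-agnostic foreground map can be fused with the matching score map to yield the final per-category classification:

```python
import numpy as np

def final_classification(foreground_prob, matching_score):
    """Combine the class-agnostic foreground map with the matching score map.

    foreground_prob: (H, W, 1) probability that each pixel belongs to any
                     foreground object, predicted by the query branch.
    matching_score:  (H, W, C) per-category matching scores derived from the
                     attention feature maps of the C support branches.
    Returns an (H, W, C) map giving the probability that each pixel belongs
    to each support category.
    """
    # Repeat the single foreground channel C times along the channel axis ...
    repeated = np.repeat(foreground_prob, matching_score.shape[-1], axis=-1)
    # ... and fuse it with the per-category matching scores.
    return repeated * matching_score
```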

Backbone Network
We design a Swish-DenseNet as the backbone to extract feature maps. Each dense block consists of repeated cycles of a 1 × 1 convolution layer and a 3 × 3 convolution layer. The 1 × 1 convolution layer, also called a bottleneck, is used to reduce dimensions and combine features across channels. DenseNet inserts a transition layer between each two adjacent dense blocks, which includes a 1 × 1 convolution layer to compress dimensions and a 2 × 2, stride-2 average pooling layer to change the resolution of the feature map, so that a feature map at a different scale can be obtained from each dense block. DenseNet is a compact and effective backbone network, which is preferable for our detector to learn from scratch.
In our experiments on the DAN dataset, we utilize a modified DenseNet (growth rate k = 24) as the feature extractor. It consists of the initial convolution layers and four dense blocks, with three transition layers between adjacent dense blocks. As shown by the heatmaps in [41], the first few convolution layers of a DNN contain more information about small objects, while the deep layers contain strong semantic features but less information about small objects. We argue that the first feature layer should be located as close to the front of the network as possible. Therefore, in order to capture more information about small objects, we use only two convolution layers without a max pooling layer as the initial convolution layers and keep the number of layers in the dense blocks at the front of the network small. The specific configuration is shown in Table 1. In addition, most neural networks use ReLUs as activation functions. However, there is a hidden problem of dying ReLUs [42]. Since the gradient of ReLU in the negative range is 0, unreasonable parameter initialization or a large parameter update in back propagation can make the activation values of some neurons negative, so these neurons may never be activated again, and the corresponding parameters can no longer be updated. In this case, ReLUs collapse to a constant function and "die", effectively removing their contribution from the model; this is the dying ReLUs issue. In our work, we use the Swish unit [43] proposed by Google Brain as the activation function. The Swish activation function is defined as

Swish(x) = x · Sigmoid(x), (1)

where x denotes the input of the activation and Sigmoid(x) equals 1/(1 + e^(−x)). Reference [43] showed experimentally that Swish outperforms ReLU on deeper models. Swish is unbounded above and bounded below like ReLU, whereas it is smooth, non-monotonic, and unsaturated, which alleviates the dying neuron problem and gradient vanishing. Meanwhile, the simplicity of Swish and its similarity to ReLU make it easy to replace ReLUs with Swish units [43].
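A one-line Swish can replace ReLU directly; the sketch below uses TF 1.x-style ops consistent with the implementation environment described later (illustrative, not the authors' code):

```python
import tensorflow as tf

def swish(x):
    """Swish activation: x * sigmoid(x). Smooth, non-monotonic, and
    bounded below but unbounded above, which mitigates dying neurons."""
    return x * tf.sigmoid(x)

# Drop-in replacement for ReLU, e.g. inside a dense-block convolution:
# h = swish(tf.layers.conv2d(h, filters=growth_rate, kernel_size=3,
#                            padding="same"))
```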

Feature Fusion and FPN
For the support branch, in order to obtain a feature map that can express the image at multiple scales, we utilize deconvolution and element-wise product operations to fuse high-resolution, low-semantic features with low-resolution, high-semantic features. To be specific, we use deconvolutional layers to adjust a high-level feature map to the size of a low-level feature map and then combine them by element-wise product. This feature map is used to extract RoI features from the image and to optimize the target model iteratively, which enables our final target model to better represent the support category at multiple scales. For the query branch, we reserve the three feature maps P1, P2, P3, which are produced by top-down connection of the layers extracted from the backbone, to form a feature pyramid network (FPN). Specifically, P1 is the shallowest feature, P2 is obtained by combining deconvolutional P1 and the second extracted feature layer with an element-wise product operation, and, similar to P2, P3 is the combined feature map of P2 and the last extracted feature layer. This FPN enables our model to regress different objects at three different scales.
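The fusion step can be sketched as follows; the layer parameters (kernel size 2, stride 2) are assumptions chosen so that the upsampled map matches the resolution of the shallower map:

```python
import tensorflow as tf

def fuse(high_level, low_level, channels):
    """Fuse a low-resolution, high-semantic map (high_level) with a
    high-resolution, low-semantic map (low_level): upsample by a
    transposed convolution, then combine by element-wise product."""
    up = tf.layers.conv2d_transpose(high_level, filters=channels,
                                    kernel_size=2, strides=2,
                                    padding="same")
    return up * low_level  # element-wise product, shapes assumed to match

# e.g., combining two adjacent extracted layers whose resolutions differ
# by a factor of 2 (channel counts assumed equal after the deconvolution):
# fused = fuse(deeper_map, shallower_map, channels=64)
```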

Initialization and Iterative Optimization of Target Model
After obtaining the combined feature map of the support image, we use PrRoI pooling to extract the feature maps of the labeled boxes for objects of the same support category. These feature maps are combined by channel-wise concatenation, and we take the average value over channels as the initial target model. To be specific, we denote the feature map of object i as A_i ∈ ℝ^(S×S×C) and the number of objects labeled in the support image as N, so the initial target model B can be formulated as

B(h, w, c) = (1/N) Σ_{i=1}^{N} A_i(h, w, c), (2)

where h, w, and c represent the abscissa, ordinate, and channel of the pixel position, respectively.
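A minimal sketch of Equation (2), assuming the N PrRoI-pooled features have already been extracted to a common S × S × C shape:

```python
import numpy as np

def initial_target_model(rois):
    """Average the PrRoI-pooled features of the N labeled support objects.

    rois: list of N arrays, each (S, S, C) -- the A_i in the text.
    Returns the initial target model B of shape (S, S, C).
    """
    return np.mean(np.stack(rois, axis=0), axis=0)
```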
As the initial target model contains only the information of the support category, ignoring background information, it is not discriminative when the background is similar to the support category. In order to introduce background information and generate a more discriminative target model, we use the feature map of the whole image to iteratively optimize the target model. If we denote the feature map of the whole image for iterative optimization as W, then the cross-correlation map M can be formulated as

M(x, y) = (1/C) Σ_{c=1}^{C} (B ★ W)(x, y, c), (3)

where the target model B serves as a kernel sliding over the whole feature map W in a depth-wise cross-correlation manner [44] (denoted ★), and we take the average over channels as the cross-correlation map, which indicates the similarity of each pixel region to the support category. Furthermore, we introduce an annotation map according to the spatial distance between each pixel position and the center of its annotation box. Specifically, we use a method similar to a multi-dimensional Gaussian distribution to determine the annotation map: the center of each labeled box takes the highest value 1, the value at the border corresponds to the value at one standard deviation, background pixel regions take 0, and pixels closer to the center take values closer to 1. In detail, for a labeled bounding box (x, y, a, b), where (x, y) is the center of the box and a and b are its length and width, respectively, the value of a pixel region (p, q) within this bounding box is calculated as

G(p, q) = (v / (2π σ_a σ_b)) exp(−((p − x)² / (2σ_a²) + (q − y)² / (2σ_b²))), with σ_a = a/2, σ_b = b/2, (4)

where v is used to compensate the value at the center to 1. Thus, we iteratively optimize the target model by reducing the gap between the cross-correlation map M and the annotation map G.
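The annotation map of Equation (4) can be sketched as below; the hard cutoff to 0 outside each box and the handling of overlapping boxes by taking the maximum are our assumptions (the compensation factor v cancels, since the unnormalized Gaussian already peaks at 1):

```python
import numpy as np

def gaussian_annotation_map(h, w, boxes):
    """Build the annotation map G: 1 at each box center, decaying like a
    2-D Gaussian whose standard deviations are half the box extents, and
    0 on background pixels. Boxes are (x, y, a, b) with center (x, y),
    length a and width b, in feature-map coordinates."""
    G = np.zeros((h, w), dtype=np.float32)
    ys, xs = np.mgrid[0:h, 0:w]
    for (x, y, a, b) in boxes:
        sx, sy = a / 2.0, b / 2.0                     # border ~ one std dev
        inside = (np.abs(xs - x) <= sx) & (np.abs(ys - y) <= sy)
        g = np.exp(-((xs - x) ** 2 / (2 * sx ** 2) +
                     (ys - y) ** 2 / (2 * sy ** 2)))  # peak value 1 at center
        G = np.where(inside, np.maximum(G, g), G)    # 0 outside all boxes
    return G
```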

Attention Feature Map
The attention feature map indicates the similarity between each pixel region in the feature map of the query image and the support category. Here, we obtain the attention feature map by calculating the depth-wise cross correlation between the final target model and the feature map of the query image. Similar to formula (3), if we denote the final target model as X ∈ ℝ^(S×S×C) and the feature map of the query as Y ∈ ℝ^(H×W×C), then for one category in the support set, the attention feature map Z can be represented as

Z(x, y) = (1/C) Σ_{c=1}^{C} (X ★ Y)(x, y, c). (5)

The attention feature maps of all support branches are concatenated along the channel dimension, so that each channel represents a category to be detected, that is, one input support branch.
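A sketch of Equation (5) using SciPy's 2-D correlation for the per-channel (depth-wise) step; the padding behavior ("same" with zero fill) is an assumption:

```python
import numpy as np
from scipy.signal import correlate2d

def attention_feature_map(target_model, query_feat):
    """Depth-wise cross-correlation between the final target model
    X (S, S, C) and the query feature map Y (H, W, C), averaged over
    channels to give one (H, W) attention map per support category."""
    _, _, C = target_model.shape
    H, W, _ = query_feat.shape
    out = np.zeros((H, W), dtype=np.float32)
    for c in range(C):  # correlate each channel independently (depth-wise)
        out += correlate2d(query_feat[:, :, c], target_model[:, :, c],
                           mode="same", boundary="fill")
    return out / C      # average over channels
```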

Matching Score Map
In our work, we propose a matching score map to transform the classification of foreground and background pixels from the query branch into the classification of target and non-target pixels. We first extract the regression bounding box t* = (l*, t*, r*, b*) for each pixel region on the feature map of the query, and then take the mean of the similarity values of the pixel regions within each bounding box on the attention feature map as the matching score of that pixel region for the support category. Here, l*, t*, r*, and b* are the distances from the location of each pixel of the feature map to the four sides of the regression bounding box [14]. To be specific, for a pixel region (x, y) on an FPN feature map with stride s and a regression bounding box t* = (l*, t*, r*, b*), the coordinates of the left-top and right-bottom corners of the corresponding region of the regression bounding box on the attention feature map, (lt_x, lt_y) and (rb_x, rb_y), can be formulated as

lt_x = x − l*/s, lt_y = y − t*/s, rb_x = x + r*/s, rb_y = y + b*/s. (6)

Thus, we define our matching score map as

S(x, y) = (1/N_b) Σ_{i=lt_x}^{rb_x} Σ_{j=lt_y}^{rb_y} Z(i, j), (7)

where N_b equals (rb_x − lt_x) × (rb_y − lt_y), and the different channels of the matching score map indicate the matching scores for the different support categories.
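Equations (6) and (7) can be sketched directly; the clamping of box corners to the feature-map bounds is our assumption:

```python
import numpy as np

def matching_score_map(attention, boxes, stride):
    """Mean attention value inside each pixel's regressed box.

    attention: (H, W) attention map for one support category.
    boxes:     (H, W, 4) per-pixel regression offsets (l*, t*, r*, b*)
               measured in image pixels.
    Returns an (H, W) matching score map.
    """
    H, W = attention.shape
    score = np.zeros((H, W), dtype=np.float32)
    for y in range(H):
        for x in range(W):
            l, t, r, b = boxes[y, x]
            # Map the regressed box back onto the feature-map grid (Eq. 6),
            # clamped to valid indices.
            lt_x = max(int(x - l / stride), 0)
            lt_y = max(int(y - t / stride), 0)
            rb_x = min(int(x + r / stride) + 1, W)
            rb_y = min(int(y + b / stride) + 1, H)
            region = attention[lt_y:rb_y, lt_x:rb_x]
            score[y, x] = region.mean() if region.size else 0.0
    return score
```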

Dataset and Image Processing
In order to learn from scratch and ensure the diversity of training categories, we need a large amount of training data to make the parameters of our model work properly, but the available data falls short of this requirement. NWPU VHR-10 contains only 650 images with 10 classes of labeled objects, an order of magnitude fewer images than generic detection datasets. To make up for this, we construct the DAN dataset as a combination of DOTA [45] and NWPU VHR-10, which consists of 15 representative categories, i.e., soccer ball field, helicopter, swimming pool, roundabout, large vehicle, small vehicle, bridge, harbor, ground track field, basketball court, tennis court, baseball diamond, storage tank, ship, and plane.
We find that a large number of small objects in the DAN dataset are not labeled, resulting in a small number of available samples, which causes detectors to miss small targets. Moreover, the lack of diversity in their contextual backgrounds makes it difficult to detect small targets against other backgrounds. We therefore use data augmentation focused on small objects. We copy small objects and paste them at positions that do not overlap with the existing objects in the image, which increases the diversity of small-target locations while ensuring that the objects appear in an appropriate context. Before pasting a target to its new location, we apply a random transformation: we scale the target to 80–120% of its original size and rotate it by up to ±15 degrees. We only consider objects that are not occluded, because discontinuous samples with occluded areas would be distorted.
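A sketch of this copy-paste augmentation, assuming OpenCV for the random scale and rotation (corner clipping from warpAffine is ignored for simplicity, and all names here are illustrative):

```python
import random
import cv2  # assumed available for the affine transforms

def augment_small_object(image, obj_patch, existing_boxes, tries=20):
    """Copy-paste augmentation for small objects: randomly rescale the
    patch to 80-120% and rotate it by up to +/-15 degrees, then paste it
    at a location that does not overlap any existing box."""
    scale = random.uniform(0.8, 1.2)
    angle = random.uniform(-15, 15)
    h, w = obj_patch.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    patch = cv2.warpAffine(obj_patch, M, (w, h))
    ph, pw = patch.shape[:2]
    ih, iw = image.shape[:2]
    for _ in range(tries):
        x = random.randint(0, iw - pw)
        y = random.randint(0, ih - ph)
        new_box = (x, y, x + pw, y + ph)
        if not any(_iou(new_box, b) > 0 for b in existing_boxes):
            image[y:y + ph, x:x + pw] = patch  # paste onto free region
            return image, new_box
    return image, None  # no non-overlapping location found

def _iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union else 0.0
```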

Loss Function
We leverage the two-way contrastive training strategy of [11] to enable our model to distinguish objects of the same category from objects of different categories. Specifically, for each query image q_c, we randomly choose one support image s_c with objects of the same category and one support image s_n with objects of other categories to construct a training triplet (q_c, s_c, s_n). In the query image of the triplet, only the objects of category c are labeled as positive, while other objects and the background are labeled as negative. For this triplet, our model should not only match objects of the same category between (q_c, s_c), but also distinguish objects of different classes between (q_c, s_n). Therefore, we design the training loss function as follows:

L(q_c, s_c, s_n) = L_match(q_c, s_c) + α L_match(q_c, s_n) + λ L_reg(q_c, s_c), (8)

where L_match(q_c, s_c) is the focal loss, which offsets the impact of class imbalance and makes the model pay more attention to hard examples by adjusting the weights; L_match(q_c, s_n) is the binary cross-entropy loss; and L_reg(q_c, s_c) is the IoU loss as in [46]. In addition, we add the weighting factors α and λ, where the former down-weights the matching loss of (q_c, s_n), and the latter adjusts the weight of L_reg. In our work, α is set to 0.5 and λ is set to 1 by cross validation.
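A sketch of Equation (8); the focal loss below omits the class-balancing factor for brevity, and the three loss terms are assumed to be precomputed scalars:

```python
import tensorflow as tf

def triplet_loss(match_pos, match_neg, reg_pos, alpha=0.5, lam=1.0):
    """Two-way contrastive loss (Eq. 8): focal matching loss on (q_c, s_c),
    down-weighted binary cross-entropy on (q_c, s_n), and IoU-based
    regression loss on (q_c, s_c)."""
    return match_pos + alpha * match_neg + lam * reg_pos

def focal_loss(probs, labels, gamma=2.0, eps=1e-7):
    """Focal loss, down-weighting easy examples to counter class imbalance
    (class-balancing factor omitted for brevity)."""
    pt = tf.where(tf.equal(labels, 1.0), probs, 1.0 - probs)
    return -tf.reduce_mean(tf.pow(1.0 - pt, gamma) * tf.log(pt + eps))
```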

Post-Processing
Since most objects in images with densely arranged objects overlap considerably with adjacent objects, many correct detection results are filtered out by conventional NMS due to large IoUs. When an IoU above the threshold appears, conventional NMS sets the confidence score of the bounding box with the lower confidence to zero to remove redundant bounding boxes. Therefore, we use Soft-NMS rather than conventional NMS to process the detection results. Soft-NMS retains correct results by reducing the lower confidence score instead of zeroing it when an IoU is above the threshold. Specifically, for the bounding box b_i, if the IoU between b_i and another bounding box b_j with a higher confidence score is greater than the defined threshold T, the confidence score c_i is recalculated according to the following equation:

c_i = c_i (1 − IoU(b_j, b_i)), if IoU(b_j, b_i) ≥ T; c_i is kept unchanged otherwise, (9)

where T is set to 0.5 as in most other works.
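A sketch of linear Soft-NMS as in Equation (9); the score threshold used to drop near-zero boxes is our assumption:

```python
import numpy as np

def soft_nms_linear(boxes, scores, iou_fn, T=0.5, score_thresh=0.001):
    """Linear Soft-NMS: instead of zeroing the score of a box that overlaps
    a higher-scoring box by more than T, decay it by (1 - IoU)."""
    order = list(np.argsort(scores)[::-1])  # indices by descending score
    scores = scores.copy()
    keep = []
    while order:
        i = order.pop(0)
        if scores[i] < score_thresh:
            continue
        keep.append(i)
        for j in order:
            iou = iou_fn(boxes[i], boxes[j])
            if iou > T:
                scores[j] *= (1.0 - iou)   # decay instead of suppress
        order = [j for j in order if scores[j] >= score_thresh]
        order.sort(key=lambda j: scores[j], reverse=True)
    return keep, scores
```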

Implementation Details and Evaluation Metrics
We implement our model in the TensorFlow framework. Our detector is trained from scratch on a computer running Ubuntu 18.04 LTS. Stochastic gradient descent (SGD) is performed on an Nvidia GeForce GTX 1060 with 8 GB of GPU memory. The experiments use CUDA v10.0, cuDNN v7.5.0, and Tensorflow-gpu-1.13 to accelerate computation. Considering that too many training iterations may damage performance by making the model over-fit, we train the model for 80 epochs. We optimize SGD with the momentum method, using an initial learning rate of 0.0002, a momentum of 0.9, and a weight decay of 0.0005. In addition, we use a sub-batch method to avoid the memory overflow caused by a large batch size. Our model takes a batch size of 32 and input images of 640 × 640 pixels.
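The training setup can be summarized as a TF 1.x sketch; applying weight decay as an explicit L2 penalty is our assumption about how it is realized:

```python
import tensorflow as tf

# Hyper-parameters from the text; the optimizer wiring is illustrative.
LEARNING_RATE = 2e-4
MOMENTUM = 0.9
WEIGHT_DECAY = 5e-4
BATCH_SIZE = 32          # accumulated as sub-batches to fit GPU memory
EPOCHS = 80

optimizer = tf.train.MomentumOptimizer(LEARNING_RATE, MOMENTUM)
# Weight decay realized as an L2 penalty over trainable variables:
l2 = WEIGHT_DECAY * tf.add_n([tf.nn.l2_loss(v)
                              for v in tf.trainable_variables()])
# train_op = optimizer.minimize(task_loss + l2)  # task_loss: Eq. (8)
```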
We divide our experiments into three modules to validate the effectiveness of our model and explore the factors that may affect detection performance. We choose vehicle, storage tank, and plane as the novel classes for the experiments. Since our purpose is to create a model that can learn from scratch on an unconventional dataset, not using a pre-trained model is the premise of all our experiments. The first module runs our proposed model and other state-of-the-art few-shot detectors on the DAN dataset and compares their results. Object detection performance is measured by two evaluation metrics on the DAN base classes and DAN novel classes, respectively: mean Average Precision (mAP) and speed (FPS). The second module investigates the validity of the different components of our model; several controlled experiments on the DAN dataset are conducted for the ablation study. In the last module, we test our model on images with a single influencing factor, image resolution or object density, to find out how each factor affects our model. To be specific, we first down-sample the images to a series of lower resolutions (0.8×, 0.6×, and 0.4×), then train our model at each resolution and evaluate its predictions with the mAP at IoU 0.5 metric. In addition, we collect images with densely arranged small objects from UCAS-AOD to form a test set for evaluating the performance of our model on densely arranged small objects.

Performance Comparison with Other Few-Shot Detectors on DAN Dataset
The results of the performance comparison between our model and other state-of-the-art few-shot detectors are shown in Table 2. All models are trained and tested on the DAN dataset with the same pre-processing and post-processing settings. Since our model is designed for unconventional datasets with a large number of small targets and applies a one-stage model as the template of the query branch, the experiments show that our proposed model achieves better performance not only in detection accuracy, but also in inference speed, compared with other few-shot detectors. Some examples of detection results predicted by our proposed model are shown in Figure 2.


Ablation
The results of the ablation study are shown in Table 3. Specifically, we use a consistent evaluation setting for fair comparison. All models are trained and tested on the DAN dataset. Our complete proposed model achieves mAP 67.4 and 39.9 when tested on the DAN base classes and DAN novel classes, respectively. Then, we remove components of the model and observe the detection performance in order to analyze the effect of each component; "√" indicates which components are included. Clearly, our backbone designed for extracting features of small objects, the Swish activation function, and the feature map for iterative optimization all improve detection performance. Our designed backbone improves the mAP of our model on the DAN base classes and DAN novel classes by 5.7% and 9.1%, respectively, because it uses skip connections to simplify the learning objective and enables a deeper network structure, thus improving the effectiveness of feature extraction. Owing to its fewer parameters, it is preferable for tasks where the model needs to be trained from scratch. Furthermore, since the feature maps are extracted from shallow layers, the information of small objects is relatively complete, which helps to represent small objects in the DAN dataset. The Swish activation function improves the mAP of our model on the DAN base classes and DAN novel classes by 0.2% and 0.1%, respectively, which shows that it is better than ReLU in our work, as it alleviates the dying neuron problem and gradient vanishing. The feature map for iterative optimization also improves detection performance: it increases the mAP of our model on the DAN base classes and DAN novel classes by 3.3% and 4.1%, respectively, because it introduces background information when generating the final target model, which makes the model more discriminative in distinguishing foreground from background.

Impact of Image Resolution and Object Density
In this module, we test the impact of image resolution and object density on object detection, and compare their effects on our model and on well-trained models.
Image resolution impact: In this experiment, we down-sample the images to a series of lower resolutions (see Figure 3). Afterwards, we train our model and Meta R-CNN at each resolution and evaluate their predictions with the mAP at IoU 0.5 metric. The results are presented in Table 4. Image resolution is clearly important to detection accuracy. As Figure 3 shows, an image at low resolution looks more blurred and lacks many detailed features, which makes small objects more difficult to identify. As shown in Table 4, the detection accuracy for the small objects (vehicle, storage tank, and plane) decreases sharply as the resolution decreases. When the resolution is reduced to 0.4×, the detection accuracy of Meta R-CNN for vehicle, storage tank, and plane drops by 14.1%, 7.7%, and 19.9%, respectively, while that of our model drops by only 10%, 5.5%, and 17.4%. This shows that our model is more robust than Meta R-CNN to changes of image resolution in the detection of small objects.
Object density impact: In order to eliminate the influence of missed detections by the detector itself, we compare the model trained by adding the vehicle class into the base classes with a practical method, YOLOv3, which directly learns the vehicle class. We select some images with densely arranged vehicles from the UCAS-AOD dataset and use YOLOv3 and our model trained on the DAN dataset to detect the targets. Figure 4 shows some detection results of YOLOv3 and our model, respectively. The lower rate of missed detection indicates that our model performs better in the detection of densely arranged objects, especially when detecting objects with different sizes and oblique orientations. In such cases, the predicted bounding boxes of two adjacent objects have a high IoU, which leads traditional NMS to filter out the prediction with the lower confidence score, resulting in missed detections. Soft-NMS, utilized in our model, preserves the correct prediction by reducing the lower confidence score rather than setting it to zero when an IoU is above the threshold.


Conclusions
The research on few-shot detection is of great significance, because it can not only reduce the cost of manual annotation, but also help to diversify detection targets. Although many detectors have been proposed, there are few real-time few-shot detection methods for UAV image targets. In this paper, we propose a novel few-shot object detector especially for special datasets with fewer labeled images and small objects. We design a special Swish-DenseNet as our backbone for feature extraction, which enables our model to be trained from scratch and produces more effective feature maps. We introduce the feature map for iterative optimization to make use of background information and generate a more discriminative target model. Unlike most state-of-the-art few-shot detectors, we utilize the one-stage model FCOS as the template of the query branch rather than a two-stage model, giving our model higher inference speed. In addition, we leverage a matching score map to transform the classification of foreground and background from the query branch into the classification of target and non-target pixels, which integrates the information from the support branch and the query branch. We also use Soft-NMS to alleviate the missed detection problem when dealing with densely arranged targets. The experimental results on the DAN dataset show that our proposed model performs better than other state-of-the-art few-shot models while maintaining the high efficiency of a one-stage model, which enables it to be used in applications with real-time requirements. In the future, we will try to apply it to target detection in UAV aerial photography through a cloud server.