1. Introduction
As a fundamental component, as well as a bottleneck, of the Unmanned Aerial Vehicle (UAV) system, object detection technology has been widely deployed in real-world remote sensing applications ranging from nature protection and geological disaster monitoring to surveillance [1,2,3]. With the rapid development of convolutional neural networks, generic object detectors (GOD) [4,5,6] have made tremendous progress on natural scene images such as the COCO [7] and PASCAL VOC [8] datasets. However, efficiently detecting objects at diverse scales in UAV-captured images (e.g., VisDrone [9], MOHR [10], and UAVDT [11]) remains a challenging task for which performance is still far from satisfactory.
As illustrated in Figure 1, we summarize six major and specific challenges in drone object detection (DOD): (1) small scale, (2) dense clusters, (3) overlap and occlusion, (4) scale diversity, (5) category imbalance, and (6) indistinguishable categories. Specifically, the insufficient and weak appearance representation of tiny objects significantly degrades a detector's performance, which is the primary challenge in both the tiny object detection (TOD) and DOD tasks. Furthermore, tiny objects in drone images often cluster together, forming dense clusters in which objects are unevenly distributed and become overlapped or occluded by each other. Moreover, as UAVs typically fly at low or medium altitudes and capture images at an angle of depression, there exists remarkable scale diversity between close and distant objects. Additionally, certain categories naturally possess significant scale diversity, e.g., the pedestrian class versus the truck/bus class. Therefore, a DOD detector should handle objects at all scales simultaneously. Concerning classification, for one thing, as drone images are usually captured in specific scenarios such as urban areas, some classes appear more frequently and occupy the majority of annotations, leading to significantly imbalanced class distributions, i.e., the long-tail problem. For another, certain classes are hard to distinguish, such as motor vs. bicycle, especially for tiny objects.
Facing the challenge of detecting tiny objects, the most straightforward approach is to increase the input resolution. However, in general images, both the relative and absolute scales of tiny objects are small, which limits the benefit of enlargement. Fortunately, with the development of drone payloads, recent drone-captured images usually have much higher resolutions. For instance, nearly half of the images in VisDrone [9] are in 1080P resolution, and the highest resolution in the more recent drone dataset MOHR [10] even surpasses 8K. Thus, although the relative scales of tiny objects in recent drone images are still small, the number of pixels they occupy may be sufficient. To the best of our knowledge, DOD methods dedicated to enlarging tiny objects can be mainly divided into the following four types: global scaling, which enlarges the whole input image; super-resolution (SR) methods, which reconstruct high-resolution details with dedicated modules; tiling, which partitions the image into patches and detects on each patch; and cluster-region (CR)-based methods, which first extract object-cluster regions and then detect within them.
Global scaling is straightforward and effective but leads to a quadratic increase in computational expense, especially during the training phase. SR methods not only involve huge input resolutions but also introduce additional computationally expensive modules. Tiling can be viewed as an indirect and local scaling manner, whose additional pre-processing and post-processing operations mainly include image partitioning, coordinate conversion, and prediction fusion. Although tiling can effectively enlarge tiny objects, the object truncation caused by partitioning results in inaccurate, partial, and redundant predictions (an object may be split into several parts, each of which is predicted as an individual object). Furthermore, large objects, whose sizes are close to or even exceed the patch size, suffer from anchor mismatch and heavy truncation in local detection, even though they are easy for global detection. Consequently, more effort has recently been devoted to CR-based methods because they can actively acquire more effective patches.
However, compared to tiling methods, the two-stage architecture and extra operations of CR methods are less efficient and more complex. More importantly, the overall performance of the detection framework is limited by the CR extractor, since inaccurate CR estimation also causes truncation and missed detections. In contrast, tiling methods exhibit a neat, end-to-end architecture, making them friendlier to deployment and practical applications. Besides, we believe that, with appropriate patch settings, a prediction fusion strategy, and a matched training pipeline, tiling methods can overcome the scale diversity and truncation problems. Consequently, in this work, we present an improved tiling detection framework with both high efficiency and outstanding performance.
First, we review and formulate the tiling inference pipeline. A mixed data strategy is adopted to deal with scale diversity. Specifically, apart from the local detection on patches, we also conduct global detection on the corresponding original image to maintain the performance on large objects. The patches and the complete image are assembled into a mini-batch tensor for parallel inference. Regarding patch settings, for one thing, patches are set to overlap with each other to address the truncation of tiny objects. For another, the side lengths of patches keep a fixed relative ratio to the image size rather than fixed absolute lengths, so that local and global detection can employ the same model instead of two independent models. As the majority of additional operations are performed in parallel, the tiling pipeline maintains high efficiency.
Correspondingly, to keep the model consistent between the inference and training phases, we also adopt a mixed training strategy in which the training data consist of both patches and original images. However, if we merely used pre-cropped patches with the same patch settings as inference, some patches might contain very few objects or even only background, hindering the learning of the model. Thus, we produce the training patches by random online anchor-cropping, which ensures that each patch includes at least one valid annotation and meanwhile enriches the scenarios.
Due to the inherent discrepancy in relative scale distributions between patches and original images, the anchors for local and global detection inevitably suffer a misalignment. To keep scale invariance, SNIP [23] proposes a multi-scale training framework in which each level of the image pyramid corresponds to a specified range of object scales. Besides, a naive fusion manner in inference would introduce numerous redundant predictions. Inspired by this, we design a scale filtering mechanism for both training assignment and prediction fusion to properly assign objects at diverse scales to the local and global detection tasks.
In addition, although anchor-cropping is adopted to produce valid training patches, the patches still contain fewer examples than the original images. Furthermore, truncation checking and scale filtering further reduce the number of annotations. Thus, we devise two augmentations customized for tiling detection, aiming to increase the number of valid objects and generate more challenging drone scenarios:
Mosaic Plus: in addition to the default combination manner, we introduce more diverse stitching manners in the Mosaic augmentation to more fully disrupt the semantic features of input images and accelerate the training process.
Crystallization Copy-paste: in addition to the normal random copy-paste, we propose a crystallization copy-paste to simulate realistic dense clusters with overlapping and to raise the appearance probability of rare categories, relieving category imbalance.
In summary, our contributions are listed as follows:
We propose an improved tiling detection framework with a mixed data strategy and a scale filtering mechanism to avoid truncation and handle objects at all scales, and we generate effective training patches by online anchor-cropping.
We devise two augmentations customized for tiling detection to produce more challenging drone scenarios, simulate dense clusters, and alleviate the long-tail problem.
We conduct comprehensive experiments on both public DOD benchmarks and real-world drone images to validate the outstanding performance and efficiency of the proposed tiling framework. On VisDrone and UAVDT, it surpasses the best cluster-region-based method ZR-DET [24] by 1.3 and 7.8 points in average precision, respectively, while achieving over 4 times faster inference on GPU. Furthermore, when deployed on our edge computing equipment, the proposed tiling framework still performs well in practical drone scenarios at a real-time speed of 27 fps.
3. Tiling Detection Framework
Generally speaking, tiling inference refers to the process of partitioning an input image into sub-patches with a uniform sliding window and then performing detection on each patch separately. However, direct partitioning faces the following main issues:
How to properly set the scale and number of patches.
When an object lies on the boundaries of two or more adjacent chips, it gets truncated into several parts, leading to partial, inaccurate, and redundant predictions.
In addition to the increased risk of truncation, medium and large objects that are easily detected globally may not match anchors well, leading to a drop in performance, especially when their sizes are close to or even larger than the patch size.
Extra operations are introduced into the pre-processing and post-processing, aggravating the computational burden.
To address the above issues, we develop an improved tiling detection framework that utilizes a mixed data strategy in both inference and training to avoid truncation and handle objects at all scales. Besides, we apply random online anchor-cropping to generate valid training patches. Furthermore, we propose a scale filtering mechanism to assign objects at diverse scales to the global and local tasks, which can obtain optimal fused predictions and keep the scale invariance.
3.1. Efficient Tiling Inference Pipeline
As shown in Figure 2a, to avoid the truncation of tiny objects, when setting the tiling patch size and number, we make adjacent sub-patches overlap with each other. Meanwhile, to keep the performance on larger objects, we introduce global detection on the original input image. Furthermore, as the spatial sizes of the images in datasets are not fixed and we hope to perform local and global detection with a single model instead of two independent models, the aspect ratio of the patches should be the same as that of the original image. Therefore, we make patches keep a fixed relative scale ratio with the original image (equal for both width and height), instead of setting fixed, absolute side lengths.
Specifically, assuming the spatial size of the image is $(W, H)$ and the patch scale ratio is $s \in (0, 1)$, the side lengths of its patches are $sW$ and $sH$. Under the same model input size, the relative scale of the objects on the patches will increase by $1/s$, essentially equivalent to enlarging the input image by $1/s$. Let $M$ denote the sampling number along an axis; then the total patch number is $M^2$ and the input batch size equals $M^2 + 1$. Consequently, the overlap between two consecutive chips is

$$O = s - \frac{1 - s}{M - 1} = \frac{Ms - 1}{M - 1}.$$

That is to say, for any object whose maximum relative scale is below $O$, wherever it lies on the image, there must exist at least one sub-patch in which this object can be fully contained without any truncation. Accordingly, the interval between chips is

$$I = \frac{1 - s}{M - 1},$$

which hints again that $s$ must take values from $\left(\frac{1}{M}, 1\right)$ and the sampling number $M$ should be greater than 1.
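To make the geometry concrete, the following minimal Python sketch (not from the paper; all names are ours) enumerates the patch windows for a given image size, scale ratio $s$, and sampling number $M$, and reports the resulting overlap $O$:

```python
def tiling_grid(W, H, s, M):
    """Enumerate the tiling patch windows for an image of size (W, H).

    s : relative patch scale ratio, must lie in (1/M, 1)
    M : sampling number along each axis (M > 1)
    Returns pixel boxes (x0, y0, x1, y1) and the relative overlap O.
    """
    assert M > 1 and 1.0 / M < s < 1.0, "requires s in (1/M, 1) and M > 1"
    interval = (1.0 - s) / (M - 1)        # relative interval I between chips
    overlap = s - interval                # O = (M*s - 1) / (M - 1)
    boxes = []
    for i in range(M):
        for j in range(M):
            x0, y0 = i * interval * W, j * interval * H
            boxes.append((round(x0), round(y0),
                          round(x0 + s * W), round(y0 + s * H)))
    return boxes, overlap

# e.g., a 1920x1080 frame with s = 0.6 and M = 2 yields 4 patches
# (batch size 5 including the global image) and an overlap O of 0.2
patches, O = tiling_grid(1920, 1080, s=0.6, M=2)
```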
For the sake of efficiency, after resizing and padding, all the individual chips and the original image are assembled into a mini-batch tensor for parallel inference. After scaling and adding the position bias, the boxes predicted on patches are converted into absolute coordinates, so that all the predictions can be directly concatenated and the NMS operation conducted jointly. As the patches have a definite relative position relationship with the original image, the above operations can also be performed in parallel. Incidentally, in real-world UAV applications, when input images are sampled from a video stream and have a fixed size, the time for image partitioning can be ignored since the slicing indices remain constant. In fact, the majority of the time consumption is attributed to the parallel inference of multiple patches.
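As an illustration of the fusion step, here is a hedged sketch using torchvision's NMS; it assumes the patch boxes have already been rescaled from the model input size back to patch pixels, and it runs class-agnostic NMS for brevity:

```python
import torch
from torchvision.ops import nms

def fuse_predictions(patch_offsets, patch_dets, global_dets, iou_thr=0.5):
    """Fuse patch-local and global detections with a single joint NMS.

    patch_offsets : list of (x0, y0) pixel positions of each patch
    patch_dets    : list of (boxes[N, 4], scores[N]) in patch pixel coords
    global_dets   : (boxes[K, 4], scores[K]) in original-image coords
    """
    all_boxes, all_scores = [global_dets[0]], [global_dets[1]]
    for (x0, y0), (boxes, scores) in zip(patch_offsets, patch_dets):
        bias = torch.tensor([x0, y0, x0, y0], dtype=boxes.dtype)
        all_boxes.append(boxes + bias)     # add the patch position bias
        all_scores.append(scores)
    boxes, scores = torch.cat(all_boxes), torch.cat(all_scores)
    keep = nms(boxes, scores, iou_thr)     # one NMS over all fused predictions
    return boxes[keep], scores[keep]
```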
In summary, the tiling framework adopts a mixed inference strategy. While the local detection is performed on patches that have overlaps with each other to avoid the truncation of tiny objects, the global detection on original images maintains the performance of larger objects. Moreover, the tiling inference is also highly efficient as most of the additional operations are performed in parallel.
3.2. Mixed Data Training with Anchor-Cropping
Previous typical tiling methods partition the training images in advance, adopting the same tiling settings as the inference phase, i.e., the training patches are fixed during training. Inevitably, some patches may contain very few valid objects or even merely background areas, slowing the training progress. As shown in Figure 2b, to keep the model consistent between the inference and training phases, we also take the mixed training strategy in the training pipeline; namely, the training images include both patches and original images. Moreover, we propose randomly cropping the valid training patches online as an augmentation measure.
Specifically, since a valid training image must contain at least one complete object, a 'cropping anchor' is first chosen randomly from all the objects in the original image. Then, the position of the patch is selected randomly around the anchor object, while ensuring that the anchor is completely included in the patch without any truncation. The basic relative width and height of the cropped patches are the same as in the inference setting, and scale jittering is applied for augmentation. Eventually, the truncation situation of the remaining annotations is checked. If the area an object occupies on the patch is less than 60% of its overall area, it is regarded as an invalid object and removed. Otherwise, the valid boxes are converted into relative coordinates on the patch, and their side lengths are accordingly enlarged by $1/s$. For instance, in Figure 2b, around the anchor object (the white car in the black box), two training patches with scale jittering (yellow and green) are randomly generated. In one of these patches, since the truck (top) and car (right bottom) objects are heavily truncated, they are removed from its annotation list. During an epoch, 75% of the original images are randomly selected and cropped, while the remaining 25% stay complete for global training.
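The following is a minimal sketch of the random online anchor-cropping described above (names and the jitter range are ours; the 60% truncation threshold follows the text):

```python
import random

def anchor_crop(img_w, img_h, boxes, s, jitter=0.2, keep_thr=0.6):
    """Random online anchor-cropping for one training image.

    boxes : list of (x0, y0, x1, y1) annotations in pixels
    s     : base relative patch scale (same as the inference setting)
    Returns the patch window and the surviving, patch-local annotations.
    """
    # 1. pick a random 'cropping anchor' object
    ax0, ay0, ax1, ay1 = random.choice(boxes)
    # 2. jitter the patch size around the inference setting
    r = min(1.0, s * random.uniform(1.0 - jitter, 1.0 + jitter))
    pw, ph = r * img_w, r * img_h
    assert ax1 - ax0 <= pw and ay1 - ay0 <= ph, "anchor must fit in the patch"
    # 3. place the patch so the anchor is fully contained, without truncation
    px0 = random.uniform(max(0.0, ax1 - pw), min(ax0, img_w - pw))
    py0 = random.uniform(max(0.0, ay1 - ph), min(ay0, img_h - ph))
    # 4. truncation check: keep objects with >= 60% of their area on the patch
    kept = []
    for x0, y0, x1, y1 in boxes:
        ix = max(0.0, min(x1, px0 + pw) - max(x0, px0))
        iy = max(0.0, min(y1, py0 + ph) - max(y0, py0))
        if ix * iy >= keep_thr * (x1 - x0) * (y1 - y0):
            kept.append((x0 - px0, y0 - py0, x1 - px0, y1 - py0))
    return (px0, py0, px0 + pw, py0 + ph), kept
```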
3.3. Object Assignment by Scale Filtering
Referring to [23], it is crucial for proposals to match the range of the input resolution. After tiling, as the relative scale of objects to patches magnifies by $1/s$, the minimum relative object scale to the patch accordingly becomes $1/s$ times larger than the minimum relative object scale to the original image. Besides, the maximum relative object scale to the patch may exceed 1 for objects larger than the patch. Thus, the discrepancy in relative scale distributions between the patch and the original image causes a misalignment between the best-matched anchors for local and global detection. Obviously, on the patch, the performance on tiny objects is improved owing to the higher input resolution, while on the original image, medium and large objects already perform well without the risk of truncation and anchor mismatch. To address the anchor misalignment, we take a "divide and rule" strategy in both training and prediction fusion.
First, we set two relative scale thresholds, a tiny threshold $T_t$ and a large threshold $T_l$, whose values lie in $(0, 1)$, to classify objects as tiny, medium, or large. The value of $T_l$ depends on the maximum relative object scale in the whole dataset, while $T_t$ is set below the patch overlap $O$, since objects larger than the patch are bound to suffer truncation, whereas any object smaller than $O$ can always be fully contained in some patch. As shown in Figure 2c, there are three vehicle objects at different scales near the overlap region between two patches (yellow and green): a truck (in the orange box), a van (blue), and a tricycle (red), whose relative heights to the original image decrease in that order. For the truck, as its maximum relative scale to the patch exceeds the large-scale threshold $T_l$, it is viewed as a large object which cannot match proper anchors. For the tricycle, as its relative scale is far smaller than $O$, it is regarded as a tiny object, and there must exist at least one patch that can completely include it. For the van, as its relative scale satisfies neither of the above two conditions, it is viewed as a medium object.
For patch inference and training, during inference, the confidence scores of over-scaled predictions are set to 0; during training, over-scaled objects are removed from the annotation list. Similarly, for global detection, if the maximum relative scale of an object/prediction to the original image lies below the tiny scale threshold $T_t$, it is regarded as a tiny object and only assigned to patches in both training and inference. Medium objects, whose maximum relative scale ranges between $T_t$ and $T_l$, participate in both the global and local tasks.
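The assignment rule can be summarized in a few lines; the following sketch uses our own threshold names $T_t$ (t_tiny) and $T_l$ (t_large) and compares scales exactly as described above:

```python
def assign_by_scale(rel_scale_img, s, t_tiny, t_large):
    """Decide which detection tasks an object participates in.

    rel_scale_img : max relative scale of the object w.r.t. the original image
    s             : patch scale ratio, so its scale w.r.t. a patch is rel/s
    t_tiny        : tiny threshold T_t (compared on the original image)
    t_large       : large threshold T_l (compared on the patch)
    """
    rel_scale_patch = rel_scale_img / s
    tasks = []
    if rel_scale_patch <= t_large:   # not over-scaled: valid for the local task
        tasks.append("local")
    if rel_scale_img >= t_tiny:      # not tiny: valid for the global task
        tasks.append("global")
    return tasks                     # medium objects participate in both
```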
4. Augmentations Customized for Tiling
As shown in Figure 2b, although anchor-cropping is adopted to generate valid training patches, the patches naturally contain fewer valid objects than the original images. Furthermore, the truncation checking and scale filtering operations further reduce the number of examples, which slows the learning process. To address this, we devise two augmentations customized for tiling detection and drone scenarios, which generate more challenging training scenarios, simulate practical dense clusters with overlaps, and effectively increase the number of valid objects, especially for rare categories.
4.1. Mosaic Plus Augmentation
The traditional Mosaic operation [37] combines four images into a new one in a $2 \times 2$ manner. Its essence is to disrupt the semantic features of input images, which introduces richer spatial context, averts overfitting by exposing the detector to a wider range of scenes and object configurations, and enhances generalization across diverse real-world scenarios. However, as recent aerial images often have huge resolutions, direct stitching further reduces the relative object scale. Fortunately, in patch training, by controlling the side lengths of sub-patches, it is possible to stitch images more freely and flexibly while maintaining the scale ratio. Specifically, as illustrated in Figure 3, we add several stitching manners to the Mosaic augmentation to disrupt the semantic features more fully, increase the number of valid objects, and strengthen the background complexity.
First, we define the fundamental combination unit in the proposed Mosaic augmentation: stitching two patches along the vertical or horizontal direction. For instance, denoting a patch by its relative (width, height), two patches of $(s, \frac{s}{2})$ or $(\frac{s}{2}, s)$ can generate a standard patch of $(s, s)$. Then, we can further obtain combinations of three patches: for instance, two patches of $(\frac{s}{2}, \frac{s}{2})$ are first fused vertically, and this unit is then stitched horizontally with a patch of $(\frac{s}{2}, s)$. Likewise, the normal Mosaic can be decomposed into a combination of four patches of $(\frac{s}{2}, \frac{s}{2})$ after three fundamental stitches.
In addition to the above grid combination manners, we add two irregular manners:
Patch Embedding: First, a normal patch $P$ of relative size $(s, s)$ and a smaller patch $P'$ of $(\alpha s, \alpha s)$ are cropped, where $\alpha$ is a scale factor sampled from a uniform distribution. The smaller patch $P'$ is then randomly placed on $P$. Finally, similar to the anchor-cropping operation, the truncation situations of the objects on $P$ are checked to remove invalid annotations.
Diagonal Stitching: First, two standard patches are cropped and padded to square matrices of the same size. Then, along the principal diagonal or counter diagonal, an upper triangular matrix and a lower triangular matrix are generated as patch masks. By applying these masks and combining the masked patches, we obtain a combination of the two triangular patches.
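As a concrete illustration of Diagonal Stitching, triangular masks can be built from pixel indices; this is a sketch under our own naming, and the annotation filtering that would follow (as in anchor-cropping) is omitted:

```python
import numpy as np

def diagonal_stitch(patch_a, patch_b, anti=False):
    """Combine two equal-size square patches along a diagonal.

    patch_a, patch_b : HxWx3 uint8 arrays, already padded to equal squares
    anti             : use the counter diagonal instead of the principal one
    """
    h, w = patch_a.shape[:2]
    assert (h, w) == patch_b.shape[:2] and h == w, "expects equal squares"
    rows, cols = np.indices((h, w))
    # boolean triangular mask: one triangle from patch_a, the rest from patch_b
    mask = (cols <= (w - 1 - rows)) if anti else (cols <= rows)
    out = patch_b.copy()
    out[mask] = patch_a[mask]
    return out
```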
In summary, our proposed Mosaic Plus augmentation offers a more diverse set of image combination manners beyond the default one, which disrupts the semantic information more fully, introduces richer spatial context, and generates more challenging scenarios. Furthermore, as each patch generated by random anchor-cropping contains at least one valid object, the minimum number of valid objects in the combined image equals the number of sub-patches. In other words, the proposed Mosaic Plus augmentation can effectively multiply the number of valid objects, thereby accelerating the training process. Additionally, as reading high-resolution images is the main bottleneck in the training pipeline, the additional irregular combination manners, which need fewer images, can also increase training efficiency.
4.2. Crystallization Copy-Paste Augmentation
Copy-paste is an object-aware augmentation that can produce novel and challenging scenarios. Ref. [53] finds that simply choosing objects from other images and pasting them at arbitrary locations can significantly improve performance. However, drone images are typically captured in a few specific scenarios, especially urban areas. For instance, in VisDrone [9] and UAVDT [11], the majority of objects (such as vehicles and people) lie on road areas rather than buildings or plants. Thus, the context between an object and its background is also crucial for a detector. To simulate the realistic dense clusters with overlapping in drone images, in addition to the normal random copy-paste, we propose a crystallization copy-paste augmentation, where the pasted objects have the same category and a similar background as the original objects.
Specifically, an object is first randomly selected from the patch/original image as the crystallization nucleus, which can be regarded as the kernel or seed. Then, congeneric objects are randomly selected from other images or the original image and placed around this nucleus; the pasted objects are allowed to overlap slightly with each other. Besides, a few objects of other categories are also pasted to fully simulate real-world clusters. At this point, one crystallizing process is finished. By repeating the crystallizing process several times, synthetic dense clusters are formed.
As shown in Figure 4, we define two crystal growth manners: (1) several original objects are chosen as kernels simultaneously, and multiple independent clusters grow; (2) one original object is first picked as the kernel, and after the initial crystallization, a new kernel is chosen from the resulting cluster; this crystallizing process is repeated until a large cluster is obtained.
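Below is a minimal sketch of one crystallizing step under the first growth manner (all names and the scatter radius are ours; the overlap capping and pixel blending that a full implementation needs are omitted):

```python
import random

def crystallize_once(objects, pool, n_grow=5):
    """One crystallizing step: grow a small cluster around a nucleus.

    objects : list of ((x0, y0, x1, y1), category) already on the image
    pool    : dict mapping category -> list of {"w": ..., "h": ...} crops
              harvested from other images
    Returns proposed paste boxes around a randomly chosen nucleus.
    """
    (nx0, ny0, nx1, ny1), cat = random.choice(objects)     # the nucleus
    cx, cy = (nx0 + nx1) / 2, (ny0 + ny1) / 2
    nw, nh = nx1 - nx0, ny1 - ny0
    placed = []
    for crop in random.sample(pool[cat], k=min(n_grow, len(pool[cat]))):
        # scatter congeneric objects near the nucleus; slight overlaps allowed
        dx, dy = random.uniform(-1.5, 1.5) * nw, random.uniform(-1.5, 1.5) * nh
        w, h = crop["w"], crop["h"]
        placed.append(((cx + dx - w / 2, cy + dy - h / 2,
                        cx + dx + w / 2, cy + dy + h / 2), cat))
    return placed
```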
Besides, when selecting pasted objects, we raise the appearance probability of rare categories to balance the category distribution. This can be deemed a re-sampling operation that partially relieves the long-tail problem in the DOD task.