Article

A Realistic Instance-Level Data Augmentation Method for Small-Object Detection Based on Scene Understanding

National Key Laboratory of Automatic Target Recognition, College of Electronic Science and Technology, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(4), 647; https://doi.org/10.3390/rs18040647
Submission received: 8 January 2026 / Revised: 11 February 2026 / Accepted: 16 February 2026 / Published: 20 February 2026

Highlights

What are the main findings?
  • This study establishes a unified, scene-understanding-driven framework that systematically addresses the four key dimensions of visual realism in instance-level data augmentation: background, scale, illumination, and viewpoint.
  • Compared with existing methods, ours demonstrates superior visual realism and achieves the largest detection performance gains across multiple object detection models (e.g., YOLOv5 and RT-DETR).
What are the implications of the main findings?
  • Our experiment demonstrates a strong positive correlation between the degree of visual realism achieved and the final gain in detection performance, providing empirical evidence for a previously underexplored relationship.
  • Our research provides a practical and resource-efficient solution that reduces dependence on large-scale mask-level annotations, making it well suited to challenging domains such as UAV applications and remote sensing.

Abstract

Instance-level data augmentation methods, exemplified by “copy-paste”, are a conventional strategy for improving the performance of small-object detectors. The core idea is to exploit background redundancy by compositing object instances with suitable backgrounds—drawn either from the same image or from different images—to increase both the quantity and diversity of training samples. However, existing methods often struggle with mismatches in background, scale, illumination, and viewpoint between instances and backgrounds. More critically, their predominant reliance on background information, without a joint understanding of instance-background characteristics, results in augmented images lacking visual realism. Empirical studies have shown that such unrealistic images not only fail to improve detection performance but can even be detrimental. To tackle this problem, we propose a scene-understanding-driven approach that systematically addresses these mismatches via joint instance-background understanding. Our unified framework integrates image inpainting, image tagging, open-set object detection, the Segment Anything Model (SAM), and pose estimation to jointly model instance attributes, background semantics, and their interrelationships; it thereby abandons the random-operation paradigm of existing methods and synthesizes highly realistic augmented images while preserving data diversity. On the VisDrone dataset, our method improves the mAP@0.5:0.95 and mAP@0.5 of the baseline detector by 1.6% and 2.2%, respectively. Both quantitative gains and qualitative visualizations confirm that systematically resolving these mismatches translates directly into higher visual realism and larger detection performance improvements.

1. Introduction

As a fundamental computer vision task that enables intelligent perception, object detection is widely applied across various imaging modalities, including visible light, infrared [1], synthetic aperture radar (SAR) [2], and hyperspectral images [3,4]. This technology supports several critical real-world applications: in autonomous driving, real-time obstacle detection is essential to avoiding traffic accidents [5]; in unmanned aerial vehicle (UAV) scenarios, it provides vital data for traffic management and smart city initiatives [6]; and in maritime monitoring, the long-range detection and identification of ship targets are fundamental to maritime security and situational awareness [7,8].
In these application scenarios, detecting small objects remains a significant challenge; here, “small” refers to objects that occupy only a small area of the image, regardless of their physical size in the real world. Specifically, the MS COCO [9] dataset defines small objects as those measuring less than 32 × 32 pixels. In contrast, the SODA [10] dataset categorizes small objects into three levels: extremely small (less than 12 × 12 pixels), relatively small (less than 20 × 20 pixels), and generally small (less than 32 × 32 pixels).
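As a concrete illustration, the COCO and SODA thresholds quoted above can be encoded in a small helper (a sketch; the function names and return labels are our own, not part of either benchmark):

```python
def soda_size_class(width_px: int, height_px: int) -> str:
    """Classify an object by the SODA small-object tiers: 12x12, 20x20, 32x32."""
    area = width_px * height_px
    if area < 12 * 12:
        return "extremely small"
    if area < 20 * 20:
        return "relatively small"
    if area < 32 * 32:
        return "generally small"
    return "not small"


def is_coco_small(width_px: int, height_px: int) -> bool:
    """MS COCO convention: small objects measure less than 32 x 32 pixels."""
    return width_px * height_px < 32 * 32
```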
Over the past decade, despite the remarkable success of general object detection methods on benchmark datasets, the performance bottleneck for small-object detection remains prominent. Taking the PASCAL VOC 2007 [11] dataset as an example, the mAP of detectors has increased from 29.1% (DPM [12]) to 66.0% (R-CNN [13]) and then to 83.6% (R-FCN [14]), showing signs of plateauing. However, on the MS COCO [9] dataset, while RT-DETR [15] achieves an mAP of 72.1% for large objects, its performance on small objects drops to only 36.0%, revealing a substantial gap and highlighting the inadequacy of current methods in detecting small objects.
The performance gap is fundamentally rooted in the limited visual information inherent to small objects, compounded by factors such as suboptimal network architecture designs and the imbalanced distribution of large and small objects in datasets [16]. To address these challenges, the research community has developed two complementary lines of work. The first is feature-level optimization, which aims to enhance the representation of small objects in complex scenes using techniques such as multi-scale feature extraction [17] and fusion [18,19,20,21,22], explicit contextual information integration [23,24,25,26], and feature alignment [27]. The second is data-level optimization, which seeks to improve the distribution and diversity of training data using methods like balanced sampling [28] and data augmentation [29].
Among these approaches, instance-level data augmentation methods, represented by “copy-paste” [16,30,31,32,33], have drawn significant attention due to their effectiveness in improving the detection of small objects without requiring modifications to the existing network architecture. This approach operates as a form of image composition. The core idea is to composite instances with backgrounds (from the same or different images) to enhance the quantity and diversity of training samples. Typically, instance-level data augmentation involves three steps: (a) Instance Acquisition: acquiring object instances from source images. (b) Instance Processing: applying geometric or photometric transformations to the acquired instances. (c) Instance Composition: compositing the processed instances onto background images.
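The three steps above can be sketched in a few lines of NumPy (a minimal, single-channel illustration; real pipelines add the geometric, photometric, and placement logic discussed below):

```python
import numpy as np


def acquire_instance(image, mask):
    """(a) Instance Acquisition: crop the instance and its binary mask."""
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    return image[y0:y1, x0:x1].copy(), mask[y0:y1, x0:x1].copy()


def scale_instance(patch, patch_mask, factor):
    """(b) Instance Processing: nearest-neighbour rescale (illustrative only)."""
    h, w = patch_mask.shape
    rows = np.clip((np.arange(int(h * factor)) / factor).astype(int), 0, h - 1)
    cols = np.clip((np.arange(int(w * factor)) / factor).astype(int), 0, w - 1)
    return patch[np.ix_(rows, cols)], patch_mask[np.ix_(rows, cols)]


def composite(background, patch, patch_mask, top, left):
    """(c) Instance Composition: paste masked pixels onto the background."""
    out = background.copy()
    h, w = patch_mask.shape
    region = out[top:top + h, left:left + w]
    region[patch_mask] = patch[patch_mask]
    return out
```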
During the instance processing stage, existing methods typically utilize operations such as random scaling [16,34,35], random rotation [16,35], or perspective transformation [36] to enhance the diversity of object scales and viewpoints. In the subsequent instance composition stage, operations like random pasting [16,34,36,37,38] are frequently employed to increase the diversity of placement locations. However, due to inconsistencies between instance and background—such as background mismatch, scale mismatch, illumination mismatch, and viewpoint mismatch—these random operations often result in augmented images that lack visual realism, as illustrated in Figure 1a. Prior studies have demonstrated that including such images in the training set may not only fail to improve detection performance but could also degrade it [35,39,40].
In response to the lack of visual realism in augmented images, several solutions have been proposed. For example, InstaBoost [35] computes an appearance coherence heatmap to identify candidate regions that share similar background patterns with the original instance location. Context-DA [39] explicitly trains a classification model that classifies the local context of the instance. Meanwhile, AdaResampling [30] leverages a semantic segmentation model to generate a prior road map, which helps to prevent illogical placements, such as placing vehicles in the sky.
While the aforementioned methods have enhanced the visual realism of augmented images by leveraging background information within the dataset, their insufficient use of instance information still leaves room for improvement. For instance, InstaBoost [35] cannot be applied to cross-image scenarios. Additionally, Context-DA [39] does not account for scale matching, which undermines the visual realism of the augmented images. Lastly, AdaResampling [30] fails to properly match the illumination and viewpoint between the instance and background.
In contrast to current methods that depend primarily on background information, we argue that effective instance-level data augmentation is built upon a joint understanding of instances and backgrounds. To address this, we present a comprehensive framework that confronts key challenges, including mismatches in background, scale, illumination, and viewpoint.
To thoroughly validate the effectiveness of our proposed framework, we have chosen the UAV aerial photography scenario as our primary validation platform. This scenario not only captures the most common background variations found in natural scenes but is also characterized by significant changes in scale, viewpoint, and illumination [41]. Such compound variations impose strict requirements on the visual realism of data augmentation: any implausible result caused by mismatching is easily noticeable, making this scenario an ideal testbed for our method. In other words, successfully addressing the multi-dimensional matching challenges in this demanding scenario provides empirical evidence that our approach generalizes to other scenarios, such as autonomous driving, where viewpoint changes are relatively minor.
Our major contributions can be summarized as follows:
  • We propose a unified scene-understanding-driven instance-level data augmentation framework dedicated to small-object detection, which abandons the random operation paradigm of traditional “copy-paste” methods and systematically addresses the four key mismatches in background, scale, illumination, and viewpoint. This is achieved through a unified pipeline that integrates image inpainting, tagging, open-set object detection, SAM, and pose estimation for joint instance-background modeling.
  • We conduct extensive comparative experiments on the VisDrone [42] dataset across multiple mainstream object detection models, and the quantitative results demonstrate that our method stably improves the baseline detector’s mAP@0.5:0.95 and mAP@0.5 by 1.6% and 2.2%, respectively. More importantly, our experiments uncover a strong positive correlation between the visual realism of augmented images and the final detection performance gain, providing valuable empirical evidence for this previously underexplored intrinsic relationship in the field of data augmentation.
  • We offer a practical solution that reduces dependency on large-scale manual annotation. By leveraging pre-trained models for scene understanding, our method generates high-quality training data with high visual realism. This resource-efficient approach is particularly suited for data-scarce domains such as UAV perception and remote sensing, where obtaining dense annotations is challenging.

2. Related Work

2.1. Small-Object Detection

Small-object detection methods can be grouped into four main categories based on their methodology.
(1) Multi-Scale Feature Extraction and Fusion. This area of research aims to enhance feature representations of small objects by combining fine-grained spatial information from shallow-layer features with the more robust semantics of deep-layer features. The foundational work, Feature Pyramid Networks (FPNs) [18], introduced top-down lateral connections to facilitate multi-scale feature fusion. This approach ensures that features at all levels contain strong semantic information. Following this, PANet [19] improved the feature hierarchy by implementing bidirectional path aggregation. Additionally, Deng et al. [43] built on FPN by proposing a feature texture transfer module, which replaces traditional upsampling methods to generate more detailed feature maps.
(2) Explicit Incorporation of Contextual Information. The incorporation of contextual information enhances both the confidence and interpretability of object detection [44], which is particularly crucial for small objects [45]. In an early study, Chen et al. [23] utilized both original region proposals and their scaled-up versions during the training and testing phases. Tailored to the unique characteristics of traffic signs, Cheng et al. [46] expanded the original region proposal both horizontally and vertically, then combined features from all three regions to improve the detection of small objects. Furthermore, Cui et al. [26] developed a context-aware module that employed pyramid dilated convolutions to integrate multi-layer contextual information, providing high-resolution features with richer semantic content.
(3) Super-Resolution. Super-resolution emerges as an intuitive strategy for small-object detection by reconstructing enhanced texture details. Perceptual GAN [47] presented a method that uses a generative adversarial network to transform the poor feature representations of small objects into more discriminative, super-resolved representations. Building on this, SOD-MTGAN [48] introduced a multi-task GAN that integrates classification and regression losses into the generator loss. This allows it to generate detailed super-resolved images from low-resolution patches. In the domain of remote sensing, Bashir et al. [49] incorporated residual feature aggregation into the generator of a cyclic GAN, which further improves feature representation and produces higher-quality super-resolved images.
(4) Data Augmentation. Mainstream object detection datasets are dominated by medium- to large-sized objects, with small objects making up only a minority. This imbalance in the training data distribution negatively impacts the detection performance for small objects [10,50]. To address this issue, one key approach is to increase the number of small objects, thereby improving the data distribution and enhancing detection performance [16,34].

2.2. Data Augmentation for Object Detection

Data augmentation methods can be categorized into three main types based on their operational paradigms: image transformation-based, image generation-based, and image composition-based.
Image-transformation-based methods typically involve applying geometric (e.g., horizontal flipping [51], perspective transformation [52]) or photometric (e.g., distortion [17]) transformations to a single image. Since the transformation process maintains a pixel-level deterministic mapping, there exists a direct correspondence between the augmented images and the original images. It is important to note that when performing geometric transformations, the corresponding annotations must be adjusted accordingly [53].
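For example, the annotation update that must accompany a horizontal flip can be written as follows (a sketch, not any particular library's API; boxes are (x_min, y_min, x_max, y_max) in pixels):

```python
def hflip_box(box, image_width):
    """Mirror an axis-aligned box when the image is flipped horizontally.

    x_min and x_max swap roles: the old right edge becomes the new left edge.
    """
    x_min, y_min, x_max, y_max = box
    return (image_width - x_max, y_min, image_width - x_min, y_max)
```

Note that flipping twice is the identity, a convenient sanity check for any geometric annotation transform.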
Image-generation-based methods, exemplified by generative adversarial networks [54] and diffusion models [55,56], aim to learn the underlying distribution of training data to generate entirely new images from random noise. Although there is no direct pixel-level correspondence between the generated images and the original training images, they can capture the overall statistical characteristics of the dataset [57].
Image-composition-based methods synthesize new training images by compositing image-image pairs or image-instance pairs. In contrast to image transformation-based methods, which typically work with a single image, image composition often involves multiple images or instances. Based on the granularity of operations, image composition-based methods can be further categorized into image-level methods (e.g., Mosaic [58] and its variants [59]) and instance-level methods (e.g., “copy-paste” [16,30,31,32,33]).

2.3. Instance-Level Data Augmentation

Instance-level data augmentation is essentially an image-composition technique that focuses on compositing instances with backgrounds from the same or different images. This methodology can be analyzed along three main dimensions, as outlined in Table 1:
  • Instance Acquisition Strategy: Instances can be acquired through box-level annotations or mask-level annotations provided by the dataset, or by using an external segmentation model.
  • Background Processing Strategy: The background image can either be used in its original form or be inpainted first to remove existing instances.
  • Instance-Background Composition Strategy: The composition can occur either within the same image (intra-image) or across different images (cross-image).
Table 1. Comparison of different instance-level data augmentation methods.
| Year | Method | Instance Acquisition | Background Processing | Instance-Background Composition | Back. | Scal. | Illu. | View. |
|------|--------|----------------------|-----------------------|---------------------------------|-------|-------|-------|-------|
| 2017 | SP-BL-SS [60] | mask | Original | Cross-Image | ✓ | ✓ | - | - |
| 2017 | Cut–Paste–Learn [37] | seg | Original | Cross-Image | - | - | - | - |
| 2018 | Context-DA [39] | mask | Original | Cross-Image | ✓ | - | - | - |
| 2019 | Kisantal et al. [16] | mask | Original | Cross-Image | - | - | - | - |
| 2019 | InstaBoost [35] | mask | Inpainted | Intra-Image | ✓ | - | - | - |
| 2019 | Hong et al. [61] | box | Original | Cross-Image | - | - | - | - |
| 2019 | AdaResampling [30] | box | Original | Intra-Image | ✓ | ✓ | - | - |
| 2020 | Liu et al. [62] | seg | Original | Cross-Image | ✓ | - | ✓ | - |
| 2020 | Yang et al. [31] | mask | Original | Cross-Image | ✓ | - | - | - |
| 2021 | Ghiasi et al. [34] | mask | Original | Cross-Image | - | - | - | - |
| 2022 | Nie et al. [36] | box | Original | Cross-Image | - | - | - | - |
| 2022 | Li et al. [5] | mask | Original | Cross-Image | ✓ | ✓ | ✓ | - |
| 2023 | DS-GAN [32] | seg | Inpainted | Cross-Image | ✓ | ✓ | - | - |
| 2023 | X-Paste [38] | seg | Original | Cross-Image | - | - | - | - |
| 2026 | Our Method | seg | Original/Inpainted | Intra-Image/Cross-Image | ✓ | ✓ | ✓ | ✓ |
The terms “box”, “mask”, and “seg” refer to the use of box-level annotations, mask-level annotations, or an external segmentation model, respectively. The column headers “Back.”, “Scal.”, “Illu.”, and “View.” are abbreviations for Background, Scale, Illumination, and Viewpoint Matching, respectively. The symbols “✓” and “-” indicate whether a data augmentation method has the capability to achieve a certain level of visual realism.

2.4. Visual Realism in Data Augmentation

As shown in Figure 2, operations such as random scaling and pasting in existing instance-level data augmentation methods often fail to consider visual realism, resulting in unrealistic images that may adversely affect detection performance [35,39,40]. To address this issue, studies [30,32] have begun to focus on visual realism constraints. Building on their work, this section systematically examines visual realism in data augmentation from four perspectives, reviewing the strengths and limitations of existing approaches.

2.4.1. Background Matching

Background matching requires placing instances in semantically consistent contexts; for example, vehicles should not be placed in the sky.
To tackle background matching in video datasets, Bosquet et al. [32] proposed two strategies. The first strategy involves using image inpainting techniques to remove existing objects from the current frame, after which generated instances are pasted into the inpainted locations. The second strategy selects non-overlapping locations in previous or subsequent frames to paste the generated instances. However, the first strategy limits the diversity of potential placement locations, while the second strategy is not directly applicable to image datasets.
In the context of image datasets, InstaBoost [35] introduced the concept of an appearance coherence heatmap to pinpoint candidate regions that exhibit background patterns similar to the original instance location. However, their method is not applicable to cross-image scenarios. In contrast, Context-DA [39] treated the background image patch (with the instance masked out) as a training sample for local context and the instance category as the corresponding label, thereby enabling the explicit training of a classification model for cross-image background matching. This approach, however, suffers from two limitations: the training process is relatively complex, and such pre-trained models lack generalizability to other datasets. In another approach, AdaResampling [30] employed a semantic segmentation model to obtain prior road maps for guiding instance placement, thus avoiding the complex model training process.
Similar to AdaResampling [30], our method uses semantic segmentation to determine suitable placement regions for instances. The core distinction is that we automatically generate all necessary segmentation prompts using an image-tagging technique rather than being confined to a fixed set of semantic categories (e.g., “road”).

2.4.2. Scale Matching

Scale matching means that instances should be depicted at a realistic scale based on their physical sizes and their distance from the camera. In essence, objects farther from the camera should appear smaller, while those closer should appear larger, in accordance with the principles of perspective geometry.
For scenarios where depth information is accessible, such as indoor scenes, Georgakis et al. [60] utilized depth maps as references to selectively scale instances before pasting them into the background image. However, in many natural environments (e.g., UAV-captured scenes), obtaining accurate depth information can be difficult, which limits the applicability of this approach.
In the absence of reliable depth information, scale matching can be approached by leveraging co-occurring instances within the image as references. Bosquet et al. [32], for example, proposed to select placement regions by maximizing the IoU between generated instances and existing instances. This method, nevertheless, suffers from several limitations: First, it requires removing existing instances from the image, which increases operational complexity and may potentially damage the background. Second, the method is only applicable to instances of the same category. Third, its reliance on IoU maximization restricts the diversity of placement regions, preventing reasonable scale matching at other locations. In contrast, AdaResampling [30] determines scaling factors by referencing pedestrian instances elsewhere in the image, allowing for scale matching at arbitrary locations.
Another line of work incorporates explicit physical measurements and geometric constraints. Specifically, the content-aware scaling method introduced by Li et al. [5] calculates the apparent size of an instance based directly on its 3D physical size and its distance relative to the camera.
Our method shares similarities with AdaResampling [30] and Li et al. [5], as we also leverage instance information as a reference and incorporate the 3D physical size of an instance to determine its apparent size.
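Under a pinhole camera model the apparent pixel height of an object is h_px = f·H/Z (focal length f, physical height H, depth Z). A reference instance therefore fixes the unknown f/Z, and the target pixel size follows from ratios of physical sizes and depths. A minimal sketch of this ratio (variable names are ours, not the cited methods'):

```python
def matched_pixel_height(ref_height_px, ref_height_m, obj_height_m,
                         ref_depth_m=1.0, obj_depth_m=1.0):
    """Pixel height for a pasted object, scaled from a reference instance.

    From h_px = f * H / Z:  h_obj / h_ref = (H_obj / H_ref) * (Z_ref / Z_obj).
    With both objects at the same depth, the depth ratio drops out.
    """
    return ref_height_px * (obj_height_m / ref_height_m) * (ref_depth_m / obj_depth_m)
```

For instance, if a 1.7 m pedestrian spans 34 px at some location, a 1.5 m object pasted at the same depth should span about 30 px.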

2.4.3. Illumination Matching

Illumination matching requires that instances and background images maintain visual consistency in illumination attributes such as brightness, saturation, and contrast. A typical example is that instances acquired from low-illumination scenes (e.g., nighttime) should not be directly composited into high-illumination scenes (e.g., daytime).
Liu et al. [62] integrated masked reconstruction loss and total variation loss into CycleGAN, which allowed for style transfer of embedded instances. This approach facilitated implicit illumination matching while maintaining the attributes of pedestrian instances. Meanwhile, Li et al. [5] proposed a local-adaptive color transformation based on HSV color space statistics. Their method effectively minimizes visual inconsistencies caused by changes in illumination.
In contrast to the methods of Liu et al. [62] and Li et al. [5], our approach does not directly alter the pixel values of the instances. Instead, we establish illumination matching models at both global and local levels, ensuring that illumination between instances and the background image remains consistent.
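The gist of such a matching model—accepting only instances whose illumination label agrees with the background's, rather than recoloring pixels—can be sketched as follows (the labels and coarse groupings here are illustrative, not the paper's exact taxonomy):

```python
# Illustrative coarse illumination groups; a real system would use the labels
# produced by its global and local illumination estimators.
ILLUMINATION_GROUPS = {
    "noon": "day", "daytime": "day", "overcast": "day",
    "dawn": "twilight", "dusk": "twilight",
    "nighttime": "night", "night with street lighting": "night",
}


def illumination_compatible(background_label, instance_label):
    """Gate a pasting: accept only when both labels map to the same group."""
    bg = ILLUMINATION_GROUPS.get(background_label)
    inst = ILLUMINATION_GROUPS.get(instance_label)
    return bg is not None and bg == inst
```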

2.4.4. Viewpoint Matching

Viewpoint matching requires that an instance’s depicted pose align with the geometric relationship between the camera’s line of sight and its spatial position. This constraint is particularly critical for UAV-scene datasets, given the diversity of UAV viewpoints. A prime example is the VisDrone dataset [42], which encompasses viewpoints ranging from top-down (resembling remote sensing imagery) to horizontal (similar to ground-based photography). Clearly, pasting an instance acquired from a top-down viewpoint into a horizontal background (or vice versa) would result in severe visual inconsistencies.
However, as shown in Table 1, existing instance-level data augmentation methods generally do not explicitly address this constraint. This oversight can be attributed to two main reasons. Firstly, in scenarios such as autonomous driving (e.g., KITTI [63]), the data is typically captured from a relatively fixed viewpoint. Consequently, even without matching viewpoints, the augmented images generally do not display significant visual inconsistencies. Secondly, in natural scene datasets (e.g., MS COCO [9]), despite the inherent diversity of viewpoints, “viewpoint” is commonly treated as an implicit, secondary background attribute rather than an explicit, modelable geometric variable. These factors contribute to the fact that viewpoint matching remains a relatively underexplored issue.
In this study, we propose a hierarchical framework for viewpoint matching in UAV scene datasets. The framework starts with an estimation of the local pose of vehicle instances using a template-matching approach. These estimates subsequently serve as cues to infer the global viewpoint of the background image and the local poses of other instance categories. Within this framework, we then achieve coarse viewpoint matching between the instances and the backgrounds.

3. Materials and Methods

To address the lack of visual realism in instance-level data augmentation, this section presents a novel method tailored for small-object detection. Our training-free method synthesizes highly realistic images by performing a comprehensive, joint understanding of instances and their background context within the dataset. As illustrated in Figure 3, the proposed method consists of five main steps: (a) acquiring instance and background images, (b) analyzing the background image, (c) enriching instance information, (d) modeling the co-occurrence probability of the instance and background, and (e) compositing the instance and the background image.

3.1. Acquisition of Instance and Background Image

Due to the high cost of obtaining mask-level annotations (e.g., annotating 1000 instance masks on the MS COCO dataset takes approximately 22 h [34]), most object detection datasets provide only box-level annotations. Consequently, some prior studies have relied on box-level annotations to acquire object instances [5,30,61], a practice that inevitably introduces background interference and degrades the visual realism of augmented images, as shown in Figure 2c. To address this, we employ the SAM model [64] to directly generate high-quality instance masks from existing box-level annotations, eliminating the requirement for fine-grained mask-level annotations. Furthermore, to counter the degradation of small-object segmentation under JPEG compression, we integrate a JPEG restoration module [65] for preprocessing. This step enhances the accuracy of the instance masks, as illustrated in Figure 4d.
Using the same set of box-level annotations, we apply Inpaint Anything [66] to remove all existing instances from the background images, as shown in Figure 3a. This step serves two purposes: first, it eliminates potential distractions caused by the existing instances, thereby improving understanding of the background context. Second, the resulting inpainted background image can be used for compositing object instances, thereby increasing the diversity of data augmentation.

3.2. Analysis of Background Image

In the context of instance-level data augmentation, particularly for “cross-image” scenarios, background analysis is critical to ensuring visual realism. This is governed by two key constraints: (a) spatial layout must obey physical laws (e.g., a vehicle should not appear in the sky); and (b) illumination must be consistent (e.g., a nighttime instance is incompatible with a daytime scene). Accordingly, we develop a framework that analyzes the background from two perspectives: scene semantics and global illumination.

3.2.1. Scene Semantic Segmentation

As previously mentioned, AdaResampling [30] utilized a semantic segmentation model guided by a predefined semantic prompt (“road” in their case) to identify suitable placement regions. However, as shown in Figure 5, this approach has two major limitations. First, if the background scene does not contain the predefined semantic category, the method fails completely, producing false positives. Second, the fixed set of prompts cannot adapt to diverse, open-world scenarios, excluding many potential background categories (such as “football field”), which severely restricts the diversity of placement regions.
To address these limitations, this section proposes a semantic segmentation pipeline that eliminates the reliance on predefined semantic prompts for background analysis. As shown in Figure 6, the pipeline begins by preparing multi-scale inputs through tiling and downscaling the inpainted background image. Next, it generates and refines semantic tags using an image tagging model (RAM++ [67]) and a large language model (DeepSeek [68]). Finally, it detects background regions and segments them by first prompting an open-set object detection model (Grounding DINO [69]) with the refined tags to obtain bounding boxes, which are then processed by a segmentation model (SAM [64]) to produce precise masks.
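Structurally, the pipeline is a chain of four pre-trained components. Its control flow can be summarized as follows, with each model passed in as a callable (a skeleton only; in practice RAM++, DeepSeek, Grounding DINO, and SAM would be wrapped behind these callables):

```python
def placement_masks(background, tagger, tag_refiner, detector, segmenter):
    """Tag -> refine -> detect -> segment, as described above.

    tagger(image)         -> list of candidate semantic tags (e.g., RAM++)
    tag_refiner(tags)     -> filtered/cleaned tags (e.g., an LLM)
    detector(image, tag)  -> list of bounding boxes (e.g., Grounding DINO)
    segmenter(image, box) -> binary mask (e.g., SAM)
    """
    tags = tag_refiner(tagger(background))
    masks = []
    for tag in tags:
        for box in detector(background, tag):
            masks.append((tag, segmenter(background, box)))
    return masks
```

Because each stage is injected, the skeleton can be exercised with stubs, which is also how one would unit-test the orchestration without loading any model weights.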

3.2.2. Global Illumination Estimation

Accurate estimation of global illumination is essential for ensuring visual consistency between composited instances and the background. A straightforward approach is to analyze low-level brightness statistics, such as the mean luminance (L channel in Lab color space). However, as shown in Figure 7, this method suffers from a fundamental limitation: images with statistically similar brightness can correspond to vastly different times of day (e.g., night vs. noon), leading to inaccurate illumination estimates.
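The naive statistic in question is simply a mean over a luminance channel, e.g. (using the Rec. 709 relative-luminance approximation in place of a full Lab conversion, which is an assumption of this sketch):

```python
import numpy as np


def mean_luminance(rgb):
    """Mean relative luminance of an RGB image with values in [0, 255].

    Exactly the kind of low-level statistic that cannot tell a streetlit
    night scene apart from a dim noon scene.
    """
    rgb = np.asarray(rgb, dtype=np.float64)
    return float((0.2126 * rgb[..., 0] + 0.7152 * rgb[..., 1]
                  + 0.0722 * rgb[..., 2]).mean())
```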
We therefore propose a method for estimating global illumination through visual question answering. Specifically, we engage the pre-trained vision-language model (BLIP [70]) with illumination-aware visual questions, such as “What time of day was this photo taken?” The responses generated by the model act as a reliable semantic indicator of global illumination, ensuring consistency between instances and their backgrounds.

3.3. Enrichment of Instance Information

Standard object detection datasets conventionally provide three types of annotations for instances: bounding box coordinates, category labels, and occlusion status. While these annotations are adequate for training object detection models, they are insufficient for instance-level data augmentation, which demands a high degree of visual realism. The main limitation of current annotations is their lack of detail regarding the geometric and appearance attributes of individual instances. Firstly, the absence of local illumination makes it difficult to align instances with the global illumination of the background image. Secondly, the absence of pose parameters affects viewpoint consistency between instances and the background. Thirdly, the absence of spatial resolution leads to incorrect relative scaling between instances and the background, thereby compromising visual realism. To overcome these gaps, this study enriches the instance information using the following approaches.

3.3.1. Local Illumination Estimation

To ensure visual realism, the illumination of composited instances must align with their background context. Having addressed global background illumination in Section 3.2, this subsection focuses on estimating the local illumination for each instance.
As depicted in Figure 8, our method begins by constructing a set of illumination-aware text prompts (e.g., “a nighttime photo without street lighting”). These prompts are then fed into the pre-trained vision-language model (CLIP [71]), which, without any fine-tuning, estimates the local illumination for the instance and its surrounding context.
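The zero-shot scoring can be illustrated as below. The prompts follow the example in the text, but the scoring function operates on precomputed embeddings; in a real pipeline these would come from CLIP’s image and text encoders, which are not reproduced here.

```python
import numpy as np

# Illumination-aware prompts in the style of Section 3.3.1 (illustrative set).
PROMPTS = [
    "a daytime photo in bright sunlight",
    "a photo taken at dusk",
    "a nighttime photo without street lighting",
]

def classify_illumination(image_emb: np.ndarray, text_embs: np.ndarray) -> int:
    """Zero-shot classification a la CLIP: return the index of the prompt
    whose embedding has the highest cosine similarity with the image."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))
```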

3.3.2. Instance Pose Estimation

In instance-level data augmentation, the plausibility of an instance’s pose is essential for achieving visual realism. However, current pose estimation methods [72,73,74], which are primarily trained on datasets such as PASCAL3D+ [75] and ApolloCar3D [76], are not well-suited to UAV scenarios. For instance, ApolloCar3D only encompasses pitch angles of up to 12.5°, whereas UAV imagery frequently involves near-top-down viewpoints with pitch angles approaching 90°. This significant disparity in data distribution makes it difficult to directly apply existing methods to UAV datasets.
Inspired by Kundu et al. [77], we propose a method for coarse pose estimation of rigid objects (e.g., vehicles) in UAV imagery, as shown in Figure 9. The procedure involves four main steps. First, a multi-view template library is constructed by rendering 3D models with azimuth and pitch sampled at 15° intervals. Next, each template is aligned by centering it on the instance mask and then scaled to maximize their IoU. Then, contour points are sampled from both the instance and the template masks at fixed angular intervals clockwise. Finally, the distance between the two sets of contour points is computed, and the azimuth and pitch corresponding to the minimum distance are selected as the final pose estimate.
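A simplified numpy sketch of the contour-matching idea in steps 3–4 follows. Radial sampling around the mask centroid stands in for the clockwise contour sampling described above, and profiles are max-normalized as a stand-in for the IoU-based scale alignment; function names are ours.

```python
import numpy as np

def radial_profile(mask: np.ndarray, n_angles: int = 36) -> np.ndarray:
    """Sample the mask silhouette at fixed angular intervals around its
    centroid: for each angular bin, record the largest centroid-to-pixel
    distance, then normalize for scale."""
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()
    ang = np.arctan2(ys - cy, xs - cx)       # angle of each foreground pixel
    rad = np.hypot(ys - cy, xs - cx)         # radius of each foreground pixel
    bins = ((ang + np.pi) / (2 * np.pi) * n_angles).astype(int) % n_angles
    prof = np.zeros(n_angles)
    np.maximum.at(prof, bins, rad)           # max radius per angular bin
    return prof / (prof.max() + 1e-9)

def match_pose(inst_mask, templates):
    """templates: {(azimuth, pitch): template_mask}. Return the pose whose
    contour profile is closest (L2 distance) to the instance's."""
    p = radial_profile(inst_mask)
    return min(templates,
               key=lambda k: np.linalg.norm(p - radial_profile(templates[k])))
```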
In addition, by utilizing the estimated pitches of vehicle instances, we establish a linear regression model to predict the pitch angle $\beta$ as a function of the vertical image coordinate $y$. This approach allows us to estimate the pitch for object instances of any category, regardless of their location within the image.

$$k^*, b^* = \arg\min_{k,b} \sum_{i=1}^{n} \left( \beta_i - k \cdot y_i - b \right)^2$$

$$\beta = k^* \cdot y + b^*$$

where $y_i$ and $\beta_i$ denote the vertical coordinate and estimated pitch of the $i$-th vehicle instance in the image, respectively.
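The regression above is an ordinary 1-D least-squares fit and can be sketched directly with `numpy.polyfit` (function names are ours):

```python
import numpy as np

def fit_pitch_model(y, beta):
    """Least-squares fit of beta ~ k*y + b from reference vehicle instances:
    y are vertical image coordinates, beta their estimated pitch angles."""
    k, b = np.polyfit(np.asarray(y, float), np.asarray(beta, float), deg=1)
    return k, b

def predict_pitch(y, k, b):
    """Predict the pitch at vertical coordinate y for any object category."""
    return k * y + b
```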

3.3.3. Spatial Resolution Estimation

Given a constant pixel size $\sigma$ and focal length $f$, the angular resolution $\varphi$ of an imaging system is independent of the object distance $L$, while the spatial resolution $\varepsilon$ is proportional to $L$.

$$\varphi = 2 \arctan \frac{\sigma}{2f}$$

$$\varepsilon = 2L \cdot \tan \frac{\varphi}{2}$$

Let $H$ denote the flight altitude of the UAV and let $\beta$ denote the pitch at the center of the camera’s line of sight. The spatial resolution $\varepsilon$ at this sight center can be expressed as follows:

$$\varepsilon = \frac{2H}{\sin \beta} \cdot \tan \frac{\varphi}{2}$$

For a single image, both $H$ and $\varphi$ are constants. Therefore, by defining $\eta = 2H \cdot \tan \frac{\varphi}{2}$, the above equation can be rewritten as follows:

$$\varepsilon = \eta \cdot \frac{1}{\sin \beta}$$
Based on the above derivation, we propose a method for estimating the spatial resolution of instances in UAV imagery. The procedure is as follows: For instances (e.g., vehicles and pedestrians) with known 3D physical dimensions, the corresponding pitches $\beta_1, \dots, \beta_n$ are first computed from their vertical coordinates using Equation (4), and the spatial resolutions $\varepsilon_1, \dots, \varepsilon_n$ are subsequently inferred from their apparent sizes in the image. Next, a linear regression model is fitted with $1/\sin \beta$ as the independent variable and $\varepsilon$ as the dependent variable to estimate the parameter $\eta$. Once $\eta$ is obtained, the spatial resolution of an instance, regardless of its category or spatial location, can be calculated directly from its vertical coordinate.
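Since $\varepsilon = \eta / \sin \beta$ has no intercept, the fit reduces to a one-parameter least-squares problem through the origin. A minimal sketch (our simplification; the paper does not specify whether an intercept is fitted):

```python
import numpy as np

def fit_eta(betas_deg, epsilons):
    """Fit epsilon = eta * (1 / sin(beta)) by least squares through the
    origin, using reference instances with known physical dimensions."""
    x = 1.0 / np.sin(np.radians(np.asarray(betas_deg, float)))
    eps = np.asarray(epsilons, float)
    return float((x @ eps) / (x @ x))   # closed-form 1-D least squares

def spatial_resolution(beta_deg, eta):
    """Predict the spatial resolution at pitch beta (degrees)."""
    return eta / np.sin(np.radians(beta_deg))
```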

3.4. Co-Occurrence Probability Modeling of Instance and Background

Based on the semantic segmentation results from Section 3.2, we can use prior knowledge to eliminate certain unreasonable background categories. For example, when considering vehicle instances, backgrounds such as the sky and water bodies are physically implausible. However, this alone cannot guarantee the visual realism of augmented images. Beyond the physical constraint, a second key factor is the semantic distribution: in reality, different instance categories appear with different likelihoods across various backgrounds. Excessively placing pedestrian instances on highways or vehicle instances on soccer fields, for instance, would cause the augmented dataset to diverge from real-world statistical distributions.
To quantify the co-occurrence relationship between object instances and background regions, this section performs a statistical analysis based on instance category labels, instance segmentation masks, and background semantic segmentation masks; this analysis computes the conditional probability $P(F_i \mid B_m)$ of an instance category $F_i$ given a background category $B_m$.

$$P(F_i \mid B_m) = \frac{\sum_j \sum_n \mathbb{1}\!\left( \frac{|F_i^j \cap B_m^n|}{|F_i^j|} \geq \tau \right)}{\sum_k \sum_j \sum_n \mathbb{1}\!\left( \frac{|F_k^j \cap B_m^n|}{|F_k^j|} \geq \tau \right)}$$

where $F_i$ denotes the $i$-th instance category, and $F_i^j$ represents the $j$-th mask region of that category; $B_m$ denotes the $m$-th background category, and $B_m^n$ represents the $n$-th mask region of that category. The threshold $\tau$ indicates the minimum overlap ratio between an instance and a background region (empirically set to 0.8 in this work).
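The counting behind this probability can be sketched as follows; the signature and data layout (instance masks plus a per-pixel background label map) are our assumptions, not the paper’s code.

```python
import numpy as np

def cooccurrence(instances, bg_mask, n_fg, n_bg, tau=0.8):
    """Estimate P(F_i | B_m): count instances whose mask overlaps background
    category m by at least tau (relative to the instance area), then
    normalize each background-category column.

    instances: list of (category_id, binary mask)
    bg_mask:   per-pixel background category ids (same shape as the masks)"""
    counts = np.zeros((n_fg, n_bg))
    for cat, m in instances:
        area = m.sum()
        if area == 0:
            continue
        for b in range(n_bg):
            overlap = (m.astype(bool) & (bg_mask == b)).sum() / area
            if overlap >= tau:
                counts[cat, b] += 1
    col = counts.sum(axis=0, keepdims=True)
    # normalize per background category; leave empty columns at zero
    return np.divide(counts, col, out=np.zeros_like(counts), where=col > 0)
```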

3.5. Composition of Instance and Background

Based on the research presented in Section 3.1, Section 3.2 and Section 3.3, we have acquired multi-dimensional information for each instance (including its segmentation mask, local illumination, pose, and spatial resolution) as well as the semantic segmentation mask and global illumination of the background image. Leveraging the instance-background co-occurrence probability model established in Section 3.4, this section performs image composition; the complete procedure is presented in Algorithm 1.

3.5.1. Extraction of Candidate Placement Regions

For a given image, we first compute the average size $S_{mean}$ and maximum size $S_{max}$ of existing instances based on their bounding boxes. Subsequently, a three-level sliding window strategy (with window sizes of $S_{mean}/2$, $S_{mean}$, and $S_{max}$) is applied to scan the image and generate candidate placement regions that do not overlap with any existing instance. Finally, leveraging the image’s semantic segmentation mask, these candidate placement regions are filtered according to their semantic categories: regions corresponding to plausible background categories (e.g., road, football field) are retained, whereas those associated with implausible ones (e.g., sky, water bodies) are discarded. Here, “plausible” and “implausible” are defined by prior knowledge and the statistical analysis from Section 3.4.
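The three-level search can be sketched as below. We approximate instance size by the longer box side and use a half-window stride; both choices, like the function name, are illustrative assumptions.

```python
import numpy as np

def candidate_regions(boxes, sem_mask, plausible):
    """Three-level sliding-window search for placement regions that
    (a) do not overlap any existing instance box and
    (b) lie on a plausible semantic category.

    boxes:     list of (x1, y1, x2, y2) for existing instances
    sem_mask:  per-pixel semantic category ids
    plausible: set of allowed category ids"""
    h, w = sem_mask.shape
    sides = [max(x2 - x1, y2 - y1) for x1, y1, x2, y2 in boxes]
    s_mean, s_max = int(np.mean(sides)), int(np.max(sides))
    regions = []
    for s in {max(1, s_mean // 2), s_mean, s_max}:
        step = max(1, s // 2)
        for y in range(0, h - s + 1, step):
            for x in range(0, w - s + 1, step):
                # reject windows overlapping any existing instance box
                if any(x < bx2 and x + s > bx1 and y < by2 and y + s > by1
                       for bx1, by1, bx2, by2 in boxes):
                    continue
                # keep windows whose dominant semantic class is plausible
                patch = sem_mask[y:y + s, x:x + s]
                if np.bincount(patch.ravel()).argmax() in plausible:
                    regions.append((x, y, s))
    return regions
```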

3.5.2. Matching of Instances to Placement Regions

With the set of placement regions established, we select a subset for augmentation. For each chosen region, a suitable instance is matched using a two-step process. First, we identify the most compatible instance category based on the co-occurrence probability. Then, a specific instance from that category is selected subject to the following constraints: (a) global and local illumination must be consistent; (b) the pitch difference must be less than a specified threshold (empirically set to 20°); and (c) the instance’s spatial resolution must be lower than that of the placement region.

3.5.3. Geometric Transformation of Instances

Following matching, each instance is geometrically transformed to fit its corresponding placement region. This involves (a) applying a perspective transformation to align the instance’s pitch with that of the placement region and (b) scaling the instance proportionally based on the ratio of its spatial resolution to that of the placement region.
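The scaling in step (b) follows directly from the resolution ratio: an instance sampled at $\varepsilon_{inst}$ m/pixel, placed into a region at $\varepsilon_{region}$ m/pixel, must be resized by $\varepsilon_{inst} / \varepsilon_{region}$ to preserve its physical size. A minimal nearest-neighbor sketch (the perspective warp of step (a) is omitted; names are ours):

```python
import numpy as np

def scale_for_region(inst_img, inst_res, region_res):
    """Resize an instance so its spatial resolution matches the placement
    region's. With resolution in meters/pixel, the physical width
    w_px * inst_res must map to (w_px * inst_res / region_res) pixels."""
    factor = inst_res / region_res
    h, w = inst_img.shape[:2]
    nh, nw = max(1, round(h * factor)), max(1, round(w * factor))
    # nearest-neighbor index maps for rows and columns
    ys = (np.arange(nh) / factor).astype(int).clip(0, h - 1)
    xs = (np.arange(nw) / factor).astype(int).clip(0, w - 1)
    return inst_img[np.ix_(ys, xs)]
```

Note that the matching constraint in Section 3.5.2 (instance resolution lower than the region’s) makes this a downscaling in practice, so no detail is hallucinated by upsampling.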
Algorithm 1 Instance-Background Composition for Cross-Image Augmentation
Input:
 1: Background: image $I_B$, semantic mask $S_B$, instance list $O_B = \{o_B^1, \dots, o_B^k\}$
 2: Foreground: instance set $F = \{f_1, \dots, f_n\}$; each $f_i$ has: image $I_i$, mask $M_i$, category $C_i$, illumination $L_i$, pitch $P_i$, resolution $R_i$
 3: Model & Parameters: co-occurrence model $\mathcal{C}$, pitch threshold $\theta = 20$, max instances $N_{max} = 12$
Output: synthetic image $I_C$
 4: $I_C \leftarrow I_B$ {initialize with background}
Step 1: Generate sliding windows
 5: $S_{mean} \leftarrow \frac{1}{k} \sum_{i=1}^{k} \mathrm{area}(o_B^i)$
 6: $S_{max} \leftarrow \max_{i=1}^{k} \mathrm{area}(o_B^i)$
 7: $W \leftarrow \mathrm{SlidingWindows}(I_B, [S_{mean}/2, S_{mean}, S_{max}])$
Step 2: Filter valid placement regions
 8: $R \leftarrow \emptyset$
 9: for $w \in W$ do
10:   if $w \cap \bigcup_{i=1}^{k} o_B^i = \emptyset$ and $\mathrm{Class}(S_B, w) \in \mathrm{PlausibleCategories}$ then
11:     $R \leftarrow R \cup \{w\}$
12:   end if
13: end for
Step 3: Randomize and limit regions
14: $R \leftarrow \mathrm{RandomPermutation}(R)$
15: $R \leftarrow R[1 : \min(|R|, N_{max})]$ {keep at most $N_{max}$ regions}
Step 4: Instance placement
16: for $r \in R$ do
17:   $c^* \leftarrow \arg\max_c \mathcal{C}(c \mid \mathrm{Class}(S_B, r))$ {find compatible instance category}
18:   for $f_i \in F$ do {search for matching instance}
19:     if $C_i = c^*$ and $L_i = \mathrm{Illumination}(r)$ and $|P_i - \mathrm{Pitch}(r)| < \theta$ and $R_i < \mathrm{Resolution}(r)$ then
20:       $I_F \leftarrow \mathrm{ScaleTransformation}(\mathrm{PerspectiveTransformation}(I_i))$ {apply geometric transformations}
21:       $M \leftarrow \mathrm{ScaleTransformation}(\mathrm{PerspectiveTransformation}(M_i))$
22:       $I_C \leftarrow I_C \cdot (1 - M) + I_F \cdot M$ {composite}
23:       break {place at most one instance per region}
24:     end if
25:   end for
26: end for
27: return $I_C$

3.5.4. Generation of the Composite Image

The final step synthesizes the augmented image by compositing the geometrically transformed instance with the background image. Given the background image I B , the transformed instance image I F , and its binary mask M, the composite image I C is computed as follows:
I C = I B · 1 M + I F · M
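This compositing equation can be written in a few lines of numpy; the function name and the broadcasting of a 2-D mask over color channels are our conventions.

```python
import numpy as np

def composite(bg, fg, mask):
    """Alpha-composite: I_C = I_B * (1 - M) + I_F * M, with M in [0, 1].
    A 2-D mask is broadcast across the color channels of 3-D images."""
    m = mask[..., None] if bg.ndim == 3 and mask.ndim == 2 else mask
    return bg * (1.0 - m) + fg * m
```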

4. Results

This section conducts a quantitative evaluation on a small-object detection dataset, comparing the proposed method with existing approaches to validate its effectiveness. To further dissect the sources of performance gain and the robustness of our method, we perform an in-depth analysis focusing on three critical factors: (a) the impact of visual realism, (b) the effect of training set size, and (c) the influence of class imbalance.

4.1. Experimental Setup

4.1.1. Dataset and Evaluation Metrics

Our experiments are mainly conducted on VisDrone [42], a widely used benchmark dataset for small-object detection. The VisDrone dataset contains 10 typical urban instance categories (e.g., pedestrians, cars) and is split into four subsets: train (6471 images), val (548 images), test-dev (1610 images), and test-challenge (1580 images). Note that the test-challenge subset is reserved solely for online evaluation, and no annotations are provided. Therefore, we use the train subset for training, perform model selection based on the val subset, and report detection performance on the test-dev subset.
To further validate the generalization capability of our method in other scenarios, we also carried out a preliminary validation on the iSAID dataset [78], a widely adopted benchmark for object detection and instance segmentation in remote sensing images. The scene characteristics and object scales of iSAID differ significantly from those of VisDrone, making it well-suited for evaluating cross-scenario generalization. Following the practice of Bosquet et al. [32], we focused our experiments on five categories of instances (ships, large vehicles, small vehicles, helicopters, and planes).
In line with standard practices in the object detection field, we use mean Average Precision (mAP) as our evaluation metric. Specifically, we compute mAP@0.5:0.95 (abbreviated as mAP in subsequent tables for conciseness) and mAP@0.5 using the official VisDrone toolkit. In accordance with the MS COCO [9] standard, we also calculate the mAP for objects of various sizes: mAPS@0.5 for areas smaller than 32² pixels, mAPM@0.5 for areas between 32² and 96² pixels, and mAPL@0.5 for areas larger than 96² pixels.

4.1.2. Baseline Detectors and Training Configuration

For a comprehensive evaluation of the data augmentation methods, we selected four representative object detectors: YOLOv5 [79] (serving as the baseline model and for comparative studies), TPH-YOLOv5 [80] (proven effective in the VisDrone-DET 2021 challenge), GFL V1-CEASC [81] (optimized for fast small-object detection), and RT-DETR [15] (a real-time Transformer-based detector).
All object detectors are trained on the augmented datasets with their architectures fixed. This ensures a controlled comparison of how different data augmentation methods affect detection performance. Furthermore, following InstaBoost [35], we doubled the default training epochs on the augmented datasets to address the increased complexity.

4.1.3. Baseline Data Augmentation Methods

According to the classification of existing methods presented in Table 1, three representative instance-level data augmentation methods were chosen for the experiments: Cut–Paste–Learn [37], AdaResampling [30], and InstaBoost [35]. In this selection, Cut–Paste–Learn [37] falls under cross-image augmentation, with its placement regions randomly chosen. The latter two methods are intra-image augmentation approaches whose placement regions are determined via road semantic segmentation and an appearance coherence heatmap, respectively.

4.1.4. Implementation Details

Our data augmentation implementation strictly adheres to the framework outlined in Section 3, taking into account two specific considerations. Firstly, in line with the approach adopted by Kisantal et al. [16], we only use instances that are fully visible, since instances that are occluded or truncated can reduce the accuracy of pose and spatial resolution estimation and compromise visual realism. Secondly, because person instances often overlap with bicycle or tricycle instances [42], processing them independently can result in segmentation errors. To address this, we first merge these overlapping instances prior to instance segmentation and then separate them into individual entities after image composition is complete.
To facilitate a fair comparison with existing methods, we implement four variants by differentiating the background processing and instance-background composition strategies: OG-CC (corresponding to Cut–Paste–Learn [37]), OG-IC (corresponding to AdaResampling [30]), IP-IC (corresponding to InstaBoost [35]), and IP-CC. Here, OG and IP denote the original and inpainted background image, while CC and IC denote cross-image and intra-image composition.
It should be noted that, since image inpainting removes existing instances from the background, the augmented dataset derived from the inpainted background images must be trained jointly with the original dataset to ensure comprehensive learning. Consequently, similar to InstaBoost [35], the effective number of training samples for the IP-IC and IP-CC variants is doubled, as detailed in Table 2.

4.2. Results and Analysis

4.2.1. Quantitative Comparison with Existing Methods

Table 2 presents the detection performance obtained on the VisDrone dataset using different data augmentation methods. Based on these results, we summarize and analyze the key findings as follows:
  • The random placement strategy of Cut–Paste–Learn [37] leads to performance degradation in most cases. This phenomenon aligns with findings in existing studies [35,39,40] that unrealistic augmented images not only fail to improve the performance of the detector but can even be detrimental.
  • Although AdaResampling [30] uses road semantics for background matching and references pedestrian instances to determine scale, it fails to improve detection performance. This shortcoming likely stems from its reliance on box-level annotations, which introduce background interference and can lead to overfitting.
  • While InstaBoost [35] does not address the scale matching issue, its placement strategy based on an appearance coherence heatmap still yields certain performance gains. This indirectly underscores the importance of background matching.
  • Compared to other methods, the proposed method achieves the best overall performance improvement across multiple detectors.
  • The experimental results confirm that, given proper background matching, cross-image augmentation yields significantly greater gains in detection performance than intra-image augmentation. This aligns with the design rationale, as cross-image augmentation can introduce richer background diversity.
  • In our method, variants using inpainted images (similar to InstaBoost [35]) achieve more substantial gains in detection performance than those using original images. This improvement is attributed to the novel contextual information introduced via inpainting, which facilitates model training.

4.2.2. Qualitative Analysis of Augmented Results

  • Qualitative Comparison with Existing Methods
Figure 10 and Figure 11 present a comparative illustration of the augmented images from two distinct UAV viewpoints, revealing the following observations:
  • Under the top-down view, where variations in scale and viewpoint are minimal, intra-image augmentation methods (e.g., AdaResampling [30] and InstaBoost [35]) can produce visually realistic images. In contrast, cross-image augmentation methods (e.g., Cut–Paste–Learn [37]) exhibit noticeable mismatches in both scale and viewpoint.
  • In the front view, intra-image augmentation methods also struggle to maintain visual realism, while cross-image augmentation methods induce significant background mismatch.
  • By comparison, the proposed method consistently produces realistic augmented images.
Figure 10. Qualitative comparison of augmented images from a top-down view. The augmented instances are highlighted in yellow solid boxes. (a) Cut–Paste–Learn [37]. (b) InstaBoost [35]. (c) AdaResampling [30]. (d) Our method.
Figure 11. Qualitative comparison of augmented images from a front view. The augmented instances are highlighted in yellow solid boxes. (a) Cut–Paste–Learn [37]. (b) InstaBoost [35]. (c) AdaResampling [30]. (d) Our method.
  • Failure Case Analysis
Although the proposed method can generate realistic augmented images in most cases, it still exhibits limitations or even complete failures in certain extreme scenarios. As shown in Figure 12, we summarize four primary failure modes and their underlying causes:
(a) Over-exposure: In overexposed background scenarios, even though the illumination matching module ensures near-identical illumination intensity between the augmented instance and the background, the augmented instance may still exhibit noticeable visual inconsistency with the surrounding background due to compressed texture details.
(b) Low illumination: Extremely low light results in a poor signal-to-noise ratio, which degrades the performance of semantic segmentation and scene understanding models, hindering the localization of plausible placement regions.
(c) Abnormal viewpoint: When the viewpoint of the background image or instance deviates from the normal range, our viewpoint and scale matching framework fails due to insufficient physical reference cues, ultimately reducing geometric plausibility.
(d) Insufficient reference instances: Scenes with too few or no valid reference instances prevent our matching strategy from acquiring sufficient prior information, making scale and viewpoint estimation unreliable.
Figure 12. Failure Cases of Our Method. The augmented instances are highlighted in yellow solid boxes. (a) Over-exposed background. (b) Low-illumination background. (c) Abnormal viewpoint. (d) Insufficient visual reference for matching.

4.2.3. Ablation and Analysis

This section presents a series of controlled experiments designed to clarify when and why our data augmentation method is effective. Specifically, we analyze its dependence on visual realism, training set size, and class distribution.
  • Impact of Visual Realism
This experiment investigates the impact of different levels of visual realism on detection performance gains for both intra-image and cross-image augmentation. Using the IP-IC and IP-CC variants of our method as the primary testbeds, we also include the Cut–Paste–Learn [37] and InstaBoost [35] methods for comparison. Note that for the IP-IC variant, illumination matching was not applied.
The results from Table 3 and Table 4 lead to the following conclusions:
(1) Background matching plays a crucial role in cross-image augmentation, improving mAP@0.5 by 1.5%. Its effect is more modest in intra-image augmentation (with mAP@0.5 increasing by 0.4%), where background consistency is inherently higher.
(2) Scale matching consistently improves the detection performance for small objects. In contrast, random scaling strategies (e.g., InstaBoost [35]) can degrade it.
(3) The results show that detection performance is strongly correlated with the visual realism of data augmentation. A progressive introduction of realism components (background matching, illumination matching, scale matching, and viewpoint matching) leads to a corresponding increase in mAP.
Table 3. Impact of visual realism on detection performance in intra-image augmentation.

| Method | Back. | Illu. | Scal. | View. | mAP | mAP@0.5 | mAPS@0.5 | mAPM@0.5 | mAPL@0.5 |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | - | - | - | - | 34.6 | 56.7 | 45.9 | 70.9 | 83.8 |
| InstaBoost [35] | ✓ | - | - | - | 35.0 (+0.4) | 57.0 (+0.3) | 45.4 (−0.5) | 71.8 (+0.9) | 85.8 (+2.0) |
| (OURS) IP-IC | ✓ | - | - | - | 34.8 (+0.2) | 57.1 (+0.4) | 45.9 (+0.0) | 71.5 (+0.6) | 85.0 (+1.2) |
| (OURS) IP-IC | ✓ | - | ✓ | - | 35.0 (+0.4) | 57.4 (+0.7) | 46.2 (+0.3) | 72.0 (+1.1) | 84.7 (+0.9) |
| (OURS) IP-IC | ✓ | - | ✓ | ✓ | 35.2 (+0.6) | 57.6 (+0.9) | 46.3 (+0.4) | 72.2 (+1.3) | 84.6 (+0.8) |

The column headers “Back.”, “Illu.”, “Scal.”, and “View.” are abbreviations for Background, Illumination, Scale, and Viewpoint Matching, respectively. The symbols “✓” and “-” indicate whether a data augmentation method has the capability to achieve a certain level of visual realism.
Table 4. Impact of visual realism on detection performance in cross-image augmentation.

| Method | Back. | Illu. | Scal. | View. | mAP | mAP@0.5 | mAPS@0.5 | mAPM@0.5 | mAPL@0.5 |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | - | - | - | - | 34.6 | 56.7 | 45.9 | 70.9 | 83.8 |
| Cut–Paste–Learn [37] | - | - | - | - | 34.3 (−0.3) | 56.2 (−0.5) | 45.0 (−0.9) | 70.6 (−0.3) | 84.4 (+0.6) |
| (OURS) IP-CC | ✓ | - | - | - | 35.6 (+1.0) | 58.2 (+1.5) | 46.4 (+0.5) | 72.3 (+1.4) | 86.2 (+2.4) |
| (OURS) IP-CC | ✓ | ✓ | - | - | 35.8 (+1.2) | 58.3 (+1.6) | 46.6 (+0.7) | 72.8 (+1.9) | 86.3 (+2.5) |
| (OURS) IP-CC | ✓ | ✓ | ✓ | - | 35.7 (+1.1) | 58.4 (+1.7) | 47.2 (+1.3) | 72.5 (+1.6) | 86.1 (+2.3) |
| (OURS) IP-CC | ✓ | ✓ | ✓ | ✓ | 36.2 (+1.6) | 58.9 (+2.2) | 47.4 (+1.5) | 73.1 (+2.2) | 87.5 (+3.7) |

The column headers “Back.”, “Illu.”, “Scal.”, and “View.” are abbreviations for Background, Illumination, Scale, and Viewpoint Matching, respectively. The symbols “✓” and “-” indicate whether a data augmentation method has the capability to achieve a certain level of visual realism.
  • Robustness Across Varying Training Set Sizes
This experiment assesses the robustness of our method to training set size, using results from the IP-CC variant. Figure 13 shows the mAP curves for different training set sizes, which reveal several key findings:
(1) Our method consistently improves detection performance regardless of training set size.
(2) With very small training sets, models trained on the augmented dataset perform well on the val subset but achieve limited improvement on the test-dev subset, which aligns with the theoretical expectation that insufficient data leads to overfitting.
(3) As the training set size increases, the cross-image composition strategy in our method introduces greater background diversity, leading to more significant performance gains.
Figure 13. Detection performance under different training set sizes.
  • Generalization under Class Imbalance
The severe class imbalance in the VisDrone dataset [42] also indirectly constrains the effectiveness of data augmentation. Results in Table 5 indicate that:
(1) For classes with scarce samples (e.g., “truck”, “bus”), data augmentation significantly boosts detection performance, whether the placement strategy is random (Cut–Paste–Learn [37]) or background-aware (AdaResampling [30], InstaBoost [35]).
(2) For classes with abundant samples (e.g., “pedestrian”, “car”), existing augmentation methods exhibit varying degrees of performance degradation.
(3) In contrast, our method achieves stable and consistent performance gains across all classes.
Table 5. Per-class detection performance comparison on the VisDrone dataset.

| Method | Ped. | Person | Bicycle | Car | Van | Truck | Tricycle | Awn. | Bus | Motor |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 23.4 | 13.6 | 12.8 | 56.4 | 37.5 | 40.5 | 21.9 | 19.3 | 49.9 | 23.6 |
| Cut–Paste–Learn [37] | 22.5 (−0.9) | 13.2 (−0.4) | 12.5 (−0.3) | 55.8 (−0.6) | 36.3 (−1.2) | 41.9 (+1.4) | 21.4 (−0.5) | 20.0 (+0.7) | 51.0 (+1.1) | 23.1 (−0.5) |
| AdaResampling [30] | 22.9 (−0.5) | 13.0 (−0.6) | 12.4 (−0.4) | 56.3 (−0.1) | 37.8 (+0.3) | 41.3 (+0.8) | 20.3 (−1.6) | 19.8 (+0.5) | 50.7 (+0.8) | 23.1 (−0.5) |
| InstaBoost [35] | 23.1 (−0.3) | 13.2 (−0.4) | 13.3 (+0.5) | 56.2 (−0.2) | 37.7 (+0.2) | 43.0 (+2.5) | 22.6 (+0.7) | 20.1 (+0.8) | 52.3 (+2.4) | 22.8 (−0.8) |
| (OURS) IP-CC | 23.9 (+0.5) | 14.1 (+0.5) | 14.1 (+1.3) | 56.8 (+0.4) | 39.1 (+1.6) | 44.8 (+4.3) | 24.2 (+2.3) | 20.4 (+1.1) | 53.8 (+3.9) | 24.9 (+1.3) |

The column headers “Ped.” and “Awn.” are abbreviations for pedestrian and awning tricycle, respectively.

4.2.4. Preliminary Cross-Scenario Validation

As shown in Table 6, our data augmentation method significantly enhances the detection performance of YOLOv5 on the iSAID dataset. Specifically:
(1) Performance gains are observed across all five object categories, indicating its effectiveness for remote sensing scenarios;
(2) Notably, the substantial improvement for the sample-scarce category (“helicopter”) aligns with the pattern observed in the VisDrone dataset.
Table 6. Results of YOLOv5 on the iSAID dataset.

| Method | mAP | Ship | LV | SV | HC | Plane |
|---|---|---|---|---|---|---|
| Baseline | 53.3 | 66.2 | 53.4 | 31.6 | 34.1 | 81.0 |
| (OURS) IP-CC | 54.2 (+0.9) | 66.9 (+0.7) | 53.7 (+0.3) | 32.7 (+1.1) | 36.0 (+1.9) | 81.5 (+0.5) |

Abbreviations: LV = Large Vehicle, SV = Small Vehicle, HC = Helicopter.

4.2.5. Computational Cost and Efficiency Analysis

We assess the computational cost of the proposed data augmentation pipeline using a workstation equipped with an Intel i7-8700 CPU, 64 GB of RAM, and an NVIDIA Titan Xp GPU with 12 GB of memory. The timings are averaged over 100 images sampled from the VisDrone training set and encompass the entire process, from loading raw images and annotations to saving the final augmented data.
The following metrics are measured to assess computational cost: (1) the per-instance preprocessing time (e.g., for instance segmentation and pose estimation); (2) the per-image background preprocessing time (e.g., for inpainting and semantic segmentation); (3) the per-image composition time; and (4) the peak GPU memory usage. Additionally, since data augmentation is performed offline, we quantify its impact on the overall training pipeline by reporting the relative increase in total training time, normalized against the baseline (denoted as 1×).
The detailed results are summarized in Table 7. The analysis reveals several key observations: (1) Our full pipeline (IP-CC) incurs the highest per-image processing time, primarily due to the sequential execution of multiple deep models (e.g., SAM for segmentation, an inpainting network) to ensure multi-dimensional matching. (2) Simpler methods like Cut–Paste–Learn [37] and AdaResampling [30] are significantly faster as they bypass these complex models, but at the cost of visual realism, as evidenced in our earlier qualitative and quantitative comparisons. (3) The IP-CC variant requires a 4× longer training duration, matching InstaBoost [35], because both methods generate an expanded dataset (original + augmented samples) that necessitates extended training for convergence.
We argue that the increased computational cost of our method is justified and practical within its target application context. The pipeline is designed as a one-time, offline data preparation step. In domains such as UAV perception and remote sensing—where acquiring densely annotated real data is often prohibitively expensive or labor-intensive—investing several hours to generate a large, high-quality, and physically plausible training set can be far more efficient than manual annotation. While the current implementation is resource-intensive, its value is amortized over the entire lifecycle of the detection models it enhances.

5. Discussion

Although the role of visual realism in data augmentation remains theoretically underexplored, our quantitative experiments show that, when used to train object detection models, some existing methods do not enhance detection performance; in fact, they can even reduce it. This is closely related to the low level of visual realism and the shift in distribution of the augmented dataset. To address this issue, this paper conducts research from four key aspects: background, scale, illumination, and viewpoint matching. Built on this investigation, we propose a novel scene-understanding-based instance-level data augmentation method. By integrating state-of-the-art techniques from the field of computer vision, our method achieves a joint understanding of instances and backgrounds in the dataset, thereby significantly enhancing the visual realism of the augmented images while preserving data diversity. This core design, based on multi-dimensional matching, also establishes a robust foundation for the generalization of the method across diverse scenarios.
To validate the effectiveness of our approach, we selected the UAV aerial photography scenario as our primary validation platform. This scenario presents significant variations in background, scale, viewpoint, and illumination. Our extensive experiments on the VisDrone dataset demonstrate that our method delivers superior visual realism and achieves optimal overall performance improvements across multiple mainstream object detection models, such as YOLOv5 and RT-DETR. Additionally, preliminary validation on the iSAID dataset further confirms the effectiveness of our method in various small-object detection scenarios, strongly supporting the generalizability of the proposed framework.
In addition, the experiments reveal several advantages of the proposed method. First, by modeling the instance-background co-occurrence probability, our method effectively constrains the distribution of the augmented dataset, thereby preventing deviation from the underlying real-world distribution. Consequently, our method consistently improves detection performance for both instance categories with scarce samples and those with abundant samples in the dataset. Second, the performance gain brought by the cross-image augmentation strategy is significantly higher than that from intra-image augmentation. This finding aligns with our theoretical expectation: by integrating instances and backgrounds from different source images, cross-image augmentation introduces richer scene context and enhanced background diversity, thereby providing object detection models with more challenging and representative training samples.
It is important to note that this study has certain limitations and trade-offs. First, to achieve a higher level of visual realism, our framework integrates multiple advanced models. This integration results in increased preprocessing computational costs and memory requirements, which can pose challenges for its use in real-time or resource-constrained environments. Additionally, failures in some extreme cases highlight the limitations of upstream models in comprehending complex scenes.
Future work will focus on several key areas. First, to enhance practicality, we will investigate the use of lightweight modules and more efficient pipeline designs to reduce computational overhead. Second, we aim to develop quantitative metrics for visual realism specifically tailored for object detection tasks, moving beyond subjective assessments. Third, we plan to extend the application of our method to other scenarios, such as autonomous driving, thereby verifying its generalization ability. Furthermore, a systematic sensitivity analysis of key parameters (e.g., co-occurrence probability thresholds and geometric matching criteria) across different datasets will be conducted to optimize robustness and guide practical deployment. Finally, we will explore the use of generative models, such as diffusion models, for instance-background composition to further enhance visual realism.

6. Conclusions

This paper addresses the performance degradation caused by insufficient visual realism in existing instance-level data augmentation for object detection. We propose a scene-understanding-driven framework that jointly models instances and their surrounding context to mitigate mismatches in background, scale, illumination, and viewpoint, thereby generating more realistic composite images while maintaining data diversity. Experiments on VisDrone and iSAID show that our method improves detection performance across multiple detectors and reveal a positive correlation between visual realism and detection accuracy. This model-agnostic paradigm offers a viable option for data-scarce detection tasks and points to several directions for further exploration, such as improving efficiency, developing quantitative realism metrics, and extending to broader scenarios.

Author Contributions

Conceptualization, C.L. and Z.Z.; methodology, C.L.; software, C.L.; validation, C.L. and Z.Z.; formal analysis, C.L. and Z.Z.; investigation, C.L. and Z.Z.; resources, Z.Z.; data curation, C.L. and Z.Z.; writing—original draft preparation, C.L.; writing—review and editing, C.L., Z.Z., P.Z. and J.H.; visualization, C.L.; supervision, Z.Z. and P.Z.; project administration, Z.Z., P.Z. and J.H.; funding acquisition, Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Basic Research Foundation of the National Key Laboratory of Automatic Target Recognition (Grant Number: WDZC2035250202).

Data Availability Statement

The data presented in this study are openly available on GitHub at https://github.com/VisDrone/VisDrone-Dataset (accessed on 8 January 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
mAP: Mean Average Precision
UAV: Unmanned Aerial Vehicle

References

  1. Zhang, Y.; Zhang, Y.; Fu, R.; Shi, Z.; Zhang, J.; Liu, D.; Du, J. Learning Nonlocal Quadrature Contrast for Detection and Recognition of Infrared Rotary-Wing UAV Targets in Complex Background. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5629919. [Google Scholar] [CrossRef]
  2. Zhang, T.; Zhang, X. A polarization fusion network with geometric feature embedding for SAR ship classification. Pattern Recognit. 2022, 123, 108365. [Google Scholar] [CrossRef]
  3. Gao, F.; Liu, S.; Gong, C.; Zhou, X.; Wang, J.; Dong, J.; Du, Q. Prototype-Based Information Compensation Network for Multisource Remote Sensing Data Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5513615. [Google Scholar] [CrossRef]
  4. Lin, J.; Gao, F.; Shi, X.; Dong, J.; Du, Q. SS-MAE: Spatial–Spectral Masked Autoencoder for Multisource Remote Sensing Image Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5531614. [Google Scholar] [CrossRef]
  5. Li, N.; Song, F.; Zhang, Y.; Liang, P.; Cheng, E. Traffic Context Aware Data Augmentation for Rare Object Detection in Autonomous Driving. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 4548–4554. [Google Scholar] [CrossRef]
  6. Wu, X.; Li, W.; Hong, D.; Tao, R.; Du, Q. Deep Learning for Unmanned Aerial Vehicle-Based Object Detection and Tracking: A survey. IEEE Geosci. Remote Sens. Mag. 2022, 10, 91–124. [Google Scholar] [CrossRef]
  7. Zhang, T.; Zhang, X.; Shi, J.; Wei, S. HyperLi-Net: A hyper-light deep learning network for high-accurate and high-speed ship detection from synthetic aperture radar imagery. ISPRS J. Photogramm. Remote Sens. 2020, 167, 123–153. [Google Scholar] [CrossRef]
  8. Zhang, T.; Zhang, X.; Gao, G. Divergence to Concentration and Population to Individual: A Progressive Approaching Ship Detection Paradigm for Synthetic Aperture Radar Remote Sensing Imagery. IEEE Trans. Aerosp. Electron. Syst. 2026, 62, 1325–1338. [Google Scholar] [CrossRef]
  9. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Springer International Publishing: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar] [CrossRef]
  10. Cheng, G.; Yuan, X.; Yao, X.; Yan, K.; Zeng, Q.; Xie, X.; Han, J. Towards Large-Scale Small Object Detection: Survey and Benchmarks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13467–13488. [Google Scholar] [CrossRef]
  11. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  12. Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object Detection with Discriminatively Trained Part-Based Models. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1627–1645. [Google Scholar] [CrossRef]
  13. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar] [CrossRef]
  14. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object detection via region-based fully convolutional networks. In Proceedings of the Advances in Neural Information Processing Systems 29, Barcelona, Spain, 5–10 December 2016; pp. 379–387. [Google Scholar]
  15. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar] [CrossRef]
  16. Kisantal, M.; Wojna, Z.; Murawski, J.; Naruniec, J.; Cho, K. Augmentation for small-object detection. arXiv 2019, arXiv:1902.07296. [Google Scholar]
  17. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar] [CrossRef]
  18. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar] [CrossRef]
  19. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar] [CrossRef]
  20. Liu, Z.; Gao, G.; Sun, L.; Fang, Z. HRDNet: High-Resolution Detection Network for Small Objects. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6. [Google Scholar] [CrossRef]
  21. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787. [Google Scholar] [CrossRef]
  22. Gao, F.; Jin, X.; Zhou, X.; Dong, J.; Du, Q. MSFMamba: Multiscale Feature Fusion State Space Model for Multisource Remote Sensing Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5504116. [Google Scholar] [CrossRef]
  23. Chen, C.; Liu, M.Y.; Tuzel, O.; Xiao, J. R-CNN for small-object detection. In Proceedings of the Computer Vision—ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016; Springer International Publishing: Cham, Switzerland, 2017; pp. 214–230. [Google Scholar] [CrossRef]
  24. Bell, S.; Zitnick, C.L.; Bala, K.; Girshick, R. Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2874–2883. [Google Scholar] [CrossRef]
  25. Zhang, G.; Lu, S.; Zhang, W. CAD-Net: A Context-Aware Detection Network for Objects in Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2019, 57, 10015–10024. [Google Scholar] [CrossRef]
  26. Cui, L.; Lv, P.; Jiang, X.; Gao, Z.; Zhou, B.; Zhang, L.; Shao, L.; Xu, M. Context-Aware Block Net for small-object detection. IEEE Trans. Cybern. 2022, 52, 2300–2313. [Google Scholar] [CrossRef]
  27. Zhang, Y.; Zhang, Y.; Shi, Z.; Fu, R.; Liu, D.; Zhang, Y.; Du, J. Enhanced Cross-Domain Dim and Small Infrared Target Detection via Content-Decoupled Feature Alignment. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5618416. [Google Scholar] [CrossRef]
  28. Zhang, T.; Zhang, X.; Liu, C.; Shi, J.; Wei, S.; Ahmad, I.; Zhan, X.; Zhou, Y.; Pan, D.; Li, J.; et al. Balance learning for ship detection from synthetic aperture radar remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2021, 182, 190–207. [Google Scholar] [CrossRef]
  29. Hao, X.; Liu, L.; Yang, R.; Yin, L.; Zhang, L.; Li, X. A Review of Data Augmentation Methods of Remote Sensing Image Target Recognition. Remote Sens. 2023, 15, 827. [Google Scholar] [CrossRef]
  30. Chen, C.; Zhang, Y.; Lv, Q.; Wei, S.; Wang, X.; Sun, X.; Dong, J. RRNet: A Hybrid Detector for Object Detection in Drone-Captured Images. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 100–108. [Google Scholar] [CrossRef]
  31. Yang, Z.; Yu, H.; Feng, M.; Sun, W.; Lin, X.; Sun, M.; Mao, Z.H.; Mian, A. Small Object Augmentation of Urban Scenes for Real-Time Semantic Segmentation. IEEE Trans. Image Process. 2020, 29, 5175–5190. [Google Scholar] [CrossRef]
  32. Bosquet, B.; Cores, D.; Seidenari, L.; Brea, V.M.; Mucientes, M.; Bimbo, A.D. A full data augmentation pipeline for small-object detection based on generative adversarial networks. Pattern Recognit. 2023, 133, 108998. [Google Scholar] [CrossRef]
  33. Hu, Z.; Wu, W.; Yang, Z.; Zhao, Y.; Xu, L.; Kong, L.; Chen, Y.; Chen, L.; Liu, G. A Cost-Sensitive Small Vessel Detection Method for Maritime Remote Sensing Imagery. Remote Sens. 2025, 17, 2471. [Google Scholar] [CrossRef]
  34. Ghiasi, G.; Cui, Y.; Srinivas, A.; Qian, R.; Lin, T.Y.; Cubuk, E.D.; Le, Q.V.; Zoph, B. Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 2917–2927. [Google Scholar] [CrossRef]
  35. Fang, H.S.; Sun, J.; Wang, R.; Gou, M.; Li, Y.L.; Lu, C. InstaBoost: Boosting Instance Segmentation via Probability Map Guided Copy-Pasting. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 682–691. [Google Scholar] [CrossRef]
  36. Nie, Z.; Cao, J.; Weng, N.; Yu, X.; Wang, M. Object-Based Perspective Transformation Data Augmentation for Object Detection. In Proceedings of the 2022 International Conference on Frontiers of Artificial Intelligence and Machine Learning (FAIML), Hangzhou, China, 19–21 June 2022; pp. 186–190. [Google Scholar] [CrossRef]
  37. Dwibedi, D.; Misra, I.; Hebert, M. Cut, Paste and Learn: Surprisingly Easy Synthesis for Instance Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1310–1319. [Google Scholar] [CrossRef]
  38. Zhao, H.; Sheng, D.; Bao, J.; Chen, D.; Chen, D.; Wen, F.; Yuan, L.; Liu, C.; Zhou, W.; Chu, Q.; et al. X-Paste: Revisiting Scalable Copy-Paste for Instance Segmentation using CLIP and StableDiffusion. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J., Eds.; Proceedings of Machine Learning Research (PMLR): Brookline, MA, USA, 2023; Volume 202, pp. 42098–42109. [Google Scholar]
  39. Dvornik, N.; Mairal, J.; Schmid, C. Modeling Visual Context Is Key to Augmenting Object Detection Datasets. In Proceedings of the Computer Vision—ECCV 2018: 15th European Conference, Munich, Germany, 8–14 September 2018; Springer International Publishing: Cham, Switzerland, 2018; pp. 375–391. [Google Scholar] [CrossRef]
  40. Zhang, L.; Wen, T.; Min, J.; Wang, J.; Han, D.; Shi, J. Learning Object Placement by Inpainting for Compositional Data Augmentation. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; Volume 12358, pp. 566–581. [Google Scholar] [CrossRef]
  41. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and Tracking Meet Drones Challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7380–7399. [Google Scholar] [CrossRef]
  42. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The Vision Meets Drone Object Detection in Image Challenge Results. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 213–226. [Google Scholar] [CrossRef]
  43. Deng, C.; Wang, M.; Liu, L.; Liu, Y.; Jiang, Y. Extended Feature Pyramid Network for small-object detection. IEEE Trans. Multimed. 2022, 24, 1968–1979. [Google Scholar] [CrossRef]
  44. Divvala, S.K.; Hoiem, D.; Hays, J.H.; Efros, A.A.; Hebert, M. An empirical study of context in object detection. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 1271–1278. [Google Scholar] [CrossRef]
  45. Hoiem, D.; Chodpathumwan, Y.; Dai, Q. Diagnosing Error in Object Detectors. In Proceedings of the Computer Vision—ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 340–353. [Google Scholar] [CrossRef]
  46. Cheng, P.; Liu, W.; Zhang, Y.; Ma, H. LOCO: Local Context Based Faster R-CNN for Small Traffic Sign Detection. In Proceedings of the MultiMedia Modeling: 24th International Conference, MMM 2018, Bangkok, Thailand, 5–7 February 2018; Schoeffmann, K., Chalidabhongse, T.H., Ngo, C.W., Aramvith, S., O’Connor, N.E., Ho, Y.S., Gabbouj, M., Elgammal, A., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 329–341. [Google Scholar] [CrossRef]
  47. Li, J.; Liang, X.; Wei, Y.; Xu, T.; Feng, J.; Yan, S. Perceptual Generative Adversarial Networks for Small Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1951–1959. [Google Scholar] [CrossRef]
  48. Zhang, Y.; Bai, Y.; Ding, M.; Ghanem, B. Multi-task Generative Adversarial Network for Detecting Small Objects in the Wild. Int. J. Comput. Vis. 2020, 128, 1810–1828. [Google Scholar] [CrossRef]
  49. Bashir, S.M.A.; Wang, Y. Small Object Detection in Remote Sensing Images with Residual Feature Aggregation-Based Super-Resolution and Object Detector Network. Remote Sens. 2021, 13, 1854. [Google Scholar] [CrossRef]
  50. Xiuling, Z.; Huijuan, W.; Yu, S.; Gang, C.; Suhua, Z.; Quanbo, Y. Starting from the structure: A review of small-object detection based on deep learning. Image Vis. Comput. 2024, 146, 105054. [Google Scholar] [CrossRef]
  51. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar] [CrossRef]
  52. Wang, K.; Fang, B.; Qian, J.; Yang, S.; Zhou, X.; Zhou, J. Perspective Transformation Data Augmentation for Object Detection. IEEE Access 2020, 8, 4935–4943. [Google Scholar] [CrossRef]
  53. Zoph, B.; Cubuk, E.D.; Ghiasi, G.; Lin, T.Y.; Shlens, J.; Le, Q.V. Learning Data Augmentation Strategies for Object Detection. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; Volume 12372, pp. 566–583. [Google Scholar] [CrossRef]
  54. Kim, J.H.; Hwang, Y. GAN-Based Synthetic Data Augmentation for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5002512. [Google Scholar] [CrossRef]
  55. Fang, H.; Han, B.; Zhang, S.; Zhou, S.; Hu, C.; Ye, W.M. Data Augmentation for Object Detection via Controllable Diffusion Models. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 1246–1255. [Google Scholar] [CrossRef]
  56. Li, Y.; Dong, X.; Chen, C.; Zhuang, W.; Lyu, L. A Simple Background Augmentation Method for Object Detection with Diffusion Model. In Proceedings of the Computer Vision—ECCV 2024: 18th European Conference, Milan, Italy, 29 September–4 October 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer Nature: Cham, Switzerland, 2025; pp. 462–479. [Google Scholar] [CrossRef]
  57. Alimisis, P.; Mademlis, I.; Radoglou-Grammatikis, P.; Sarigiannidis, P.; Papadopoulos, G.T. Advances in diffusion models for image data augmentation: A review of methods, models, evaluation metrics and future research directions. Artif. Intell. Rev. 2025, 58, 112. [Google Scholar] [CrossRef]
  58. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  59. Duan, C.; Wei, Z.; Zhang, C.; Qu, S.; Wang, H. Coarse-grained Density Map Guided Object Detection in Aerial Images. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, QC, Canada, 11–17 October 2021; pp. 2789–2798. [Google Scholar] [CrossRef]
  60. Georgakis, G.; Mousavian, A.; Berg, A.C.; Košecká, J. Synthesizing training data for object detection in indoor scenes. In Proceedings of the Robotics: Science and Systems, Massachusetts Institute of Technology, Cambridge, MA, USA, 12–16 July 2017; Volume 13. [Google Scholar] [CrossRef]
  61. Hong, S.; Kang, S.; Cho, D. Patch-Level Augmentation for Object Detection in Aerial Images. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 127–134. [Google Scholar] [CrossRef]
  62. Liu, S.; Guo, H.; Hu, J.G.; Zhao, X.; Zhao, C.; Wang, T.; Zhu, Y.; Wang, J.; Tang, M. A novel data augmentation scheme for pedestrian detection with attribute preserving GAN. Neurocomputing 2020, 401, 123–132. [Google Scholar] [CrossRef]
  63. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar] [CrossRef]
  64. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 3992–4003. [Google Scholar] [CrossRef]
  65. Jiang, J.; Zhang, K.; Timofte, R. Towards Flexible Blind JPEG Artifacts Removal. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 4977–4986. [Google Scholar] [CrossRef]
  66. Yu, T.; Feng, R.; Feng, R.; Liu, J.; Jin, X.; Zeng, W.; Chen, Z. Inpaint Anything: Segment Anything Meets Image Inpainting. arXiv 2023, arXiv:2304.06790. [Google Scholar] [CrossRef]
  67. Huang, X.; Huang, Y.J.; Zhang, Y.; Tian, W.; Feng, R.; Zhang, Y.; Xie, Y.; Li, Y.; Zhang, L. Open-Set Image Tagging with Multi-Grained Text Supervision. In Proceedings of the 33rd ACM International Conference on Multimedia, MM ’25, Dublin, Ireland, 27–31 October 2025; pp. 4117–4126. [Google Scholar] [CrossRef]
  68. DeepSeek-AI; Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025, arXiv:2501.12948. [Google Scholar]
  69. Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H.; et al. Grounding DINO: Marrying DINO with Grounded Pre-training for Open-Set Object Detection. In Proceedings of the Computer Vision—ECCV 2024: 18th European Conference, Milan, Italy, 29 September–4 October 2024; Springer Nature: Cham, Switzerland, 2025; pp. 38–55. [Google Scholar] [CrossRef]
  70. Li, J.; Li, D.; Xiong, C.; Hoi, S. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S., Eds.; Proceedings of Machine Learning Research (PMLR): Brookline, MA, USA, 2022; Volume 162, pp. 12888–12900. [Google Scholar]
  71. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Meila, M., Zhang, T., Eds.; Proceedings of Machine Learning Research (PMLR): Brookline, MA, USA, 2021; Volume 139, pp. 8748–8763. [Google Scholar]
  72. Ke, L.; Li, S.; Sun, Y.; Tai, Y.W.; Tang, C.K. GSNet: Joint Vehicle Pose and Shape Reconstruction with Geometrical and Scene-Aware Supervision. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 515–532. [Google Scholar] [CrossRef]
  73. Kouros, G.; Shrivastava, S.; Picron, C.; Nagesh, S.; Chakravarty, P.; Tuytelaars, T. Category-Level Pose Retrieval with Contrastive Features Learnt with Occlusion Augmentation. In Proceedings of the 33rd British Machine Vision Conference 2022, BMVC 2022, London, UK, 21–24 November 2022; BMVA Press: Malvern, UK, 2022. [Google Scholar]
  74. Klee, D.M.; Biza, O.; Platt, R.; Walters, R. Image to Sphere: Learning Equivariant Features for Efficient Pose Prediction. arXiv 2023, arXiv:2302.13926. [Google Scholar] [CrossRef]
  75. Xiang, Y.; Mottaghi, R.; Savarese, S. Beyond PASCAL: A benchmark for 3D object detection in the wild. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Steamboat Springs, CO, USA, 24–26 March 2014; pp. 75–82. [Google Scholar] [CrossRef]
  76. Song, X.; Wang, P.; Zhou, D.; Zhu, R.; Guan, C.; Dai, Y.; Su, H.; Li, H.; Yang, R. ApolloCar3D: A Large 3D Car Instance Understanding Benchmark for Autonomous Driving. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5447–5457. [Google Scholar] [CrossRef]
  77. Kundu, A.; Li, Y.; Rehg, J.M. 3D-RCNN: Instance-Level 3D Object Reconstruction via Render-and-Compare. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3559–3568. [Google Scholar] [CrossRef]
  78. Waqas Zamir, S.; Arora, A.; Gupta, A.; Khan, S.; Sun, G.; Shahbaz Khan, F.; Zhu, F.; Shao, L.; Xia, G.S.; Bai, X. iSAID: A Large-Scale Dataset for Instance Segmentation in Aerial Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019; pp. 28–37. [Google Scholar]
  79. Jocher, G. Ultralytics/Yolov5: v7.0—YOLOv5 SOTA Realtime Instance Segmentation. License: AGPL-3.0. Available online: https://github.com/ultralytics/yolov5 (accessed on 20 August 2025).
  80. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, QC, Canada, 11–17 October 2021; pp. 2778–2788. [Google Scholar] [CrossRef]
  81. Du, B.; Huang, Y.; Chen, J.; Huang, D. Adaptive Sparse Convolutional Networks with Global Context Enhancement for Faster Object Detection on Drone Images. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 13435–13444. [Google Scholar] [CrossRef]
Figure 1. Qualitative comparison between the proposed method and existing methods. The original instance is highlighted in blue dashed boxes, while the augmented instances are highlighted in yellow solid boxes. (a) Cut–Paste–Learn [37]: The location, scale, and rotation parameters are all randomly assigned. (b) InstaBoost [35]: Locations are predicted using an appearance coherence heatmap, whereas scale parameters are randomly generated. (c) AdaResampling [30]: Locations are determined via road semantic segmentation, and scale parameters are adjusted with reference to pedestrian sizes. (d) Our method: Locations are guided by background semantic segmentation, with scale and perspective transformation parameters parameterized by vehicle size and viewpoint.
Figure 2. Examples of augmented images lacking visual realism. (a) Background mismatch and scale mismatch [30]. (b) Illumination mismatch and viewpoint mismatch [32]. (c) Scale mismatch and viewpoint mismatch [61]. All bounding boxes indicate the augmented instances; their thicknesses, colors, and the attached numbers follow the styles of the original source papers and have not been altered.
Figure 3. Pipeline of the proposed instance-level data augmentation method. Red boxes in (e) highlight the candidate locations for instance placement, while black arrows in each subfigure indicate the flow direction of the pipeline. (a) Acquire object instances and “clean” background images from original images using segmentation and inpainting models, guided by box-level annotations. (b) Perform semantic segmentation on the background image and estimate its global illumination. (c) Estimate the local illumination, pose, and spatial resolution of the instance. (d) Perform a statistical analysis of the co-occurrence probability between the instance and the background. (e) Match each instance to its background image. Then, scale and apply a perspective transformation to the successfully matched instance before placing it into the candidate locations highlighted by the red boxes.
Figure 4. Comparative results of instance segmentation with and without JPEG restoration preprocessing. (a) Original image. (b) Restored image. (c) Segmentation using the original image. (d) Segmentation using the restored image. Yellow contours highlight the instance segmentation boundaries.
Figure 5. Comparison of semantic segmentation results for background images using different prompts. (a) Using predefined semantic prompt (e.g., “road”). (b) Using semantic prompts generated by image tagging.
Figure 6. Semantic segmentation pipeline for background analysis.
Figure 7. Global illumination estimation based on visual question answering.
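Since a visual question answering model returns free-form text, its answer has to be mapped to a discrete illumination label before matching, as in Figure 7. The keyword table and label set below are illustrative assumptions, not the paper's taxonomy:

```python
# hypothetical keyword-to-label table; first match wins
ILLUM_KEYWORDS = {
    "sunny": "direct sunlight",
    "clear": "direct sunlight",
    "overcast": "diffuse",
    "cloudy": "diffuse",
    "dusk": "low light",
    "night": "low light",
}

def illumination_label(vqa_answer, default="diffuse"):
    """Map a free-form VQA answer (e.g. 'It looks sunny.') to a discrete label."""
    text = vqa_answer.lower()
    for keyword, label in ILLUM_KEYWORDS.items():
        if keyword in text:
            return label
    return default
```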
Figure 8. Local illumination estimation with a pre-trained vision-language model.
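For a small instance crop, local illumination can be scored zero-shot in the CLIP style: embed the crop and one text description per candidate label, then pick the label with the highest cosine similarity. The function below is a generic sketch of that pattern; the embeddings in practice would come from a pre-trained vision-language model, and the labels here are placeholders:

```python
import numpy as np

def pick_illumination(image_emb, text_embs, labels):
    """Zero-shot classification: cosine similarity between one image embedding
    and one text embedding per candidate illumination label."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return labels[int(np.argmax(txt @ img))]
```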
Figure 9. Pose estimation through contour point matching.
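The idea behind Figure 9, recovering an instance's in-plane orientation so the pasted object can be aligned with the scene, can be approximated by a principal-axis estimate over the contour points. This PCA stand-in is a simplification written for illustration, not the paper's contour-point matching algorithm:

```python
import numpy as np

def principal_angle(contour):
    """Dominant in-plane orientation (radians, in [0, pi)) of a set of 2-D
    contour points, taken from the principal axis of their covariance."""
    pts = np.asarray(contour, dtype=np.float64)
    pts = pts - pts.mean(axis=0)                # center the contour
    cov = pts.T @ pts / len(pts)
    eigvals, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalues
    major = eigvecs[:, np.argmax(eigvals)]      # axis of largest variance
    return float(np.arctan2(major[1], major[0]) % np.pi)
```

An elongated contour lying along the x-axis yields an angle near 0, and the same contour rotated by 45 degrees yields an angle near π/4.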
Table 2. Comparison of detection performance on the VisDrone dataset.
| Method | Num. Images | YOLOv5 mAP | YOLOv5 mAP@0.5 | TPH-YOLOv5 mAP | TPH-YOLOv5 mAP@0.5 | GFL V1-CEASC mAP | GFL V1-CEASC mAP@0.5 | RT-DETR mAP | RT-DETR mAP@0.5 |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | | 34.6 | 56.7 | 35.1 | 58.2 | 26.4 | 45.4 | 35.2 | 58.9 |
| Cut–Paste–Learn [37] | | 34.3 (−0.3) | 56.2 (−0.5) | 34.9 (−0.2) | 58.0 (−0.2) | 26.3 (−0.1) | 45.5 (+0.1) | 35.0 (−0.2) | 58.7 (−0.2) |
| AdaResampling [30] | | 34.6 (+0.0) | 56.8 (+0.1) | 34.5 (−0.6) | 57.2 (−1.0) | 26.4 (+0.0) | 45.4 (+0.0) | 35.2 (+0.0) | 58.8 (−0.1) |
| (OURS) OG-IC | | 34.7 (+0.1) | 56.9 (+0.2) | 35.2 (+0.1) | 58.1 (−0.1) | 26.6 (+0.2) | 45.5 (+0.1) | 35.5 (+0.3) | 59.4 (+0.5) |
| (OURS) OG-CC | | 34.9 (+0.3) | 57.3 (+0.6) | 35.4 (+0.3) | 58.3 (+0.1) | 26.7 (+0.3) | 46.0 (+0.6) | 35.7 (+0.5) | 59.6 (+0.7) |
| InstaBoost [35] | | 35.0 (+0.4) | 57.0 (+0.3) | 35.7 (+0.6) | 59.0 (+0.8) | 26.8 (+0.4) | 45.9 (+0.5) | 35.7 (+0.5) | 60.1 (+1.2) |
| (OURS) IP-IC | | 35.2 (+0.6) | 57.6 (+0.9) | 36.0 (+0.9) | 59.2 (+1.0) | 26.9 (+0.5) | 46.4 (+1.0) | 36.2 (+1.0) | 60.3 (+1.4) |
| (OURS) IP-CC | | 36.2 (+1.4) | 58.9 (+2.1) | 36.3 (+1.1) | 59.9 (+1.6) | 27.1 (+0.6) | 46.5 (+1.0) | 36.2 (+1.0) | 60.5 (+1.5) |
Table 7. Computational cost comparison of different augmentation methods on the VisDrone dataset.
| Method | Instance Prep. (s/Instance) | Background Prep. (s/Image) | Composition (s/Image) | GPU Mem. (GB) | Training Duration |
|---|---|---|---|---|---|
| Baseline | – | – | – | – | |
| Cut–Paste–Learn [37] | 0.09 | – | 0.59 | – | |
| AdaResampling [30] | – | 1.12 | 0.04 | – | |
| InstaBoost [35] | 0.09 | – | 9.52 | – | |
| (OURS) OG-IC | 0.13 | 7.05 | 1.46 | 7.2 | |
| (OURS) IP-CC | 0.13 | 7.05 | 8.64 | 7.2 | |
The column headers “Prep.” and “Mem.” are abbreviations for preprocessing and memory, respectively.
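Reading Table 7 as sequential stage costs gives a back-of-envelope per-image augmentation time: instance preparation scaled by the number of pasted instances, plus background preparation and composition. The helper below and the instance count in the example are illustrative assumptions; the stages may overlap or be amortized differently in practice:

```python
def per_image_seconds(n_instances, instance_s, background_s, composition_s):
    """Rough per-augmented-image time, assuming the three offline stages in
    Table 7 run sequentially and scale linearly with the instance count."""
    return n_instances * instance_s + background_s + composition_s
```

For example, the IP-CC configuration with five pasted instances works out to 5 × 0.13 + 7.05 + 8.64 ≈ 16.3 s per image.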
Li, C.; Zhang, Z.; Zhong, P.; He, J. A Realistic Instance-Level Data Augmentation Method for Small-Object Detection Based on Scene Understanding. Remote Sens. 2026, 18, 647. https://doi.org/10.3390/rs18040647