A Realistic Instance-Level Data Augmentation Method for Small-Object Detection Based on Scene Understanding
Highlights
- This study establishes a unified, scene-understanding-driven framework that systematically addresses the four key dimensions of visual realism in instance-level data augmentation: background, scale, illumination, and viewpoint.
- Compared to existing methods, ours demonstrates superior visual realism and achieves the largest detection performance gains across multiple object detection models (e.g., YOLOv5 and RT-DETR).
- Our experiments demonstrate a strong positive correlation between the degree of visual realism achieved and the final gain in detection performance, providing empirical evidence for a previously underexplored relationship.
- Our research provides a practical and resource-efficient solution that reduces dependence on large-scale mask-level annotations, making it well suited to challenging domains such as UAV applications and remote sensing.
Abstract
1. Introduction
- We propose a unified scene-understanding-driven instance-level data augmentation framework dedicated to small-object detection. It moves beyond the random-placement paradigm of traditional “copy-paste” methods and systematically addresses the four key mismatches in background, scale, illumination, and viewpoint through a unified pipeline that integrates image inpainting, tagging, open-set object detection, SAM, and pose estimation for joint instance-background modeling.
- We conduct extensive comparative experiments on the VisDrone [42] dataset across multiple mainstream object detection models; the quantitative results show that our method consistently improves the baseline detector’s mAP@0.5:0.95 and mAP@0.5 by 1.6% and 2.2%, respectively. More importantly, our experiments uncover a strong positive correlation between the visual realism of augmented images and the final detection performance gain, providing valuable empirical evidence for this previously underexplored intrinsic relationship in data augmentation.
- We offer a practical solution that reduces dependency on large-scale manual annotation. By leveraging pre-trained models for scene understanding, our method generates high-quality training data with high visual realism. This resource-efficient approach is particularly suited for data-scarce domains such as UAV perception and remote sensing, where obtaining dense annotations is challenging.
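The joint instance-background modeling described above can be illustrated with a minimal sketch. The names `SceneContext`, `estimate_context`, and `accept_placement`, and all data structures below, are hypothetical stand-ins; the paper's actual pipeline combines tagging, open-set detection, SAM segmentation, and pose estimation rather than the toy statistics used here.

```python
from dataclasses import dataclass

@dataclass
class SceneContext:
    """Per-image context produced by a (stubbed) scene-understanding stage."""
    semantic_regions: dict   # region label -> fraction of image pixels
    illumination: float      # global brightness estimate in [0, 1]

def estimate_context(image_stats: dict) -> SceneContext:
    # Stand-in for scene analysis: normalize per-region pixel counts and
    # reduce brightness to a single global estimate.
    total = sum(image_stats["region_pixels"].values())
    regions = {k: v / total for k, v in image_stats["region_pixels"].items()}
    return SceneContext(regions, image_stats["mean_brightness"] / 255.0)

def accept_placement(ctx: SceneContext, instance_label: str,
                     cooccurrence: dict, min_prob: float = 0.3) -> bool:
    # Background matching: accept a paste only if the instance plausibly
    # co-occurs with the dominant background semantics.
    dominant = max(ctx.semantic_regions, key=ctx.semantic_regions.get)
    return cooccurrence.get((instance_label, dominant), 0.0) >= min_prob

stats = {"region_pixels": {"road": 600, "building": 400}, "mean_brightness": 128}
ctx = estimate_context(stats)
print(accept_placement(ctx, "car", {("car", "road"): 0.9}))   # True
print(accept_placement(ctx, "boat", {("car", "road"): 0.9}))  # False
```

A "car" is accepted on the road-dominated background, while a "boat" with no recorded co-occurrence is rejected, which is the gating behavior background matching is meant to provide.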
2. Related Work
2.1. Small-Object Detection
2.2. Data Augmentation for Object Detection
2.3. Instance-Level Data Augmentation
- Instance Acquisition Strategy: Instances can be acquired through box-level annotations or mask-level annotations provided by the dataset, or by using an external segmentation model.
- Background Processing Strategy: The background image can either be used in its original form or be inpainted first to remove existing instances.
- Instance-Background Composition Strategy: The composition can occur either within the same image (intra-image) or across different images (cross-image).
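The three design axes above can be expressed as a small taxonomy. This is an illustrative encoding only; the enum names are not from the paper, and the example entry for InstaBoost [35] mirrors its row in the table below.

```python
from enum import Enum

class InstanceAcquisition(Enum):
    BOX = "box"    # bounding-box crop (includes background pixels)
    MASK = "mask"  # dataset-provided segmentation mask
    SEG = "seg"    # mask predicted by an external segmentation model

class BackgroundProcessing(Enum):
    ORIGINAL = "original"
    INPAINTED = "inpainted"  # existing instances removed by inpainting

class Composition(Enum):
    INTRA_IMAGE = "intra-image"  # paste back into the source image
    CROSS_IMAGE = "cross-image"  # paste into a different image

# InstaBoost [35] expressed in this taxonomy:
instaboost = (InstanceAcquisition.MASK,
              BackgroundProcessing.INPAINTED,
              Composition.INTRA_IMAGE)
print([v.value for v in instaboost])  # ['mask', 'inpainted', 'intra-image']
```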
| Year | Method | Instance Acquisition | Background Processing | Instance-Background Composition | Back. | Scal. | Illu. | View. |
|---|---|---|---|---|---|---|---|---|
| 2017 | SP-BL-SS [60] | mask | Original | Cross-Image | ✓ | ✓ | - | - |
| 2017 | Cut–Paste–Learn [37] | seg | Original | Cross-Image | - | - | - | - |
| 2018 | Context-DA [39] | mask | Original | Cross-Image | ✓ | - | - | - |
| 2019 | Kisantal et al. [16] | mask | Original | Cross-Image | - | - | - | - |
| 2019 | InstaBoost [35] | mask | Inpainted | Intra-Image | ✓ | - | - | - |
| 2019 | Hong et al. [61] | box | Original | Cross-Image | - | - | - | - |
| 2019 | AdaResampling [30] | box | Original | Intra-Image | ✓ | ✓ | - | - |
| 2020 | Liu et al. [62] | seg | Original | Cross-Image | ✓ | - | ✓ | - |
| 2020 | Yang et al. [31] | mask | Original | Cross-Image | ✓ | - | - | - |
| 2021 | Ghiasi et al. [34] | mask | Original | Cross-Image | - | - | - | - |
| 2022 | Nie et al. [36] | box | Original | Cross-Image | - | - | - | - |
| 2022 | Li et al. [5] | mask | Original | Cross-Image | ✓ | ✓ | ✓ | - |
| 2023 | DS-GAN [32] | seg | Inpainted | Cross-Image | ✓ | ✓ | - | - |
| 2023 | X-Paste [38] | seg | Original | Cross-Image | - | - | - | - |
| 2026 | Our Method | seg | Original / Inpainted | Intra-Image / Cross-Image | ✓ | ✓ | ✓ | ✓ |
2.4. Visual Realism in Data Augmentation
2.4.1. Background Matching
2.4.2. Scale Matching
2.4.3. Illumination Matching
2.4.4. Viewpoint Matching
3. Materials and Methods
3.1. Acquisition of Instance and Background Image
3.2. Analysis of Background Image
3.2.1. Scene Semantic Segmentation
3.2.2. Global Illumination Estimation
3.3. Enrichment of Instance Information
3.3.1. Local Illumination Estimation
3.3.2. Instance Pose Estimation
3.3.3. Spatial Resolution Estimation
3.4. Co-Occurrence Probability Modeling of Instance and Background
3.5. Composition of Instance and Background
3.5.1. Extraction of Candidate Placement Regions
3.5.2. Matching of Instances to Placement Regions
3.5.3. Geometric Transformation of Instances
Algorithm 1: Instance-Background Composition for Cross-Image Augmentation
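The four composition steps (Sections 3.5.1–3.5.4) can be sketched as follows. This is a hedged illustration, not the paper's algorithm: the function `compose_cross_image` and all dictionary fields (`placeable`, `reference_size`, `cooccurrence`) are hypothetical, and the 0.3 co-occurrence threshold is arbitrary.

```python
import random

def compose_cross_image(background, instances, rng=None):
    """Sketch of the four composition steps:
    (1) extract candidate placement regions, (2) match instances to regions,
    (3) geometrically transform instances, (4) generate the composite."""
    rng = rng or random.Random(0)
    # Step 1: candidate placement regions from background semantics.
    regions = [r for r in background["regions"] if r["placeable"]]
    placed = []
    for inst in instances:
        # Step 2: keep regions whose co-occurrence with this class is plausible.
        candidates = [r for r in regions
                      if background["cooccurrence"].get(
                          (inst["label"], r["label"]), 0.0) > 0.3]
        if not candidates:
            continue  # no plausible region for this instance: skip it
        region = rng.choice(candidates)
        # Step 3: scale the instance to the region's reference object size.
        scale = region["reference_size"] / inst["size"]
        placed.append({"label": inst["label"],
                       "region": region["label"], "scale": scale})
    # Step 4: the real pipeline blends pixels; here we only record placements.
    return {"placed": placed}

bg = {"regions": [{"label": "road", "placeable": True, "reference_size": 40},
                  {"label": "sky", "placeable": False, "reference_size": 0}],
      "cooccurrence": {("car", "road"): 0.9}}
out = compose_cross_image(bg, [{"label": "car", "size": 80},
                               {"label": "boat", "size": 50}])
print(out)  # {'placed': [{'label': 'car', 'region': 'road', 'scale': 0.5}]}
```

Note the early `continue`: an instance with no plausible region is simply dropped, which matches the conservative spirit of co-occurrence-gated placement.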
3.5.4. Generation of the Composite Image
4. Results
4.1. Experimental Setup
4.1.1. Dataset and Evaluation Metrics
4.1.2. Baseline Detectors and Training Configuration
4.1.3. Baseline Data Augmentation Methods
4.1.4. Implementation Details
4.2. Results and Analysis
4.2.1. Quantitative Comparison with Existing Methods
- Although AdaResampling [30] uses road semantics for background matching and references pedestrian instances to determine scale, it fails to improve detection performance. This shortcoming likely stems from its reliance on box-level annotations, which introduce background interference and can lead to overfitting.
- While InstaBoost [35] does not address the scale matching issue, its placement strategy based on an appearance coherence heatmap still yields certain performance gains. This indirectly underscores the importance of background matching.
- Compared to other methods, the proposed method achieves the best overall performance improvement across multiple detectors.
- The experimental results confirm that, given proper background matching, cross-image augmentation yields significantly greater gains in detection performance than intra-image augmentation. This aligns with the design rationale, as cross-image augmentation can introduce richer background diversity.
- In our method, variants using inpainted images (similar to InstaBoost [35]) achieve more substantial gains in detection performance than those using original images. This improvement is attributed to the novel contextual information introduced via inpainting, which facilitates model training.
4.2.2. Qualitative Analysis of Augmented Results
- Qualitative Comparison with Existing Methods
- Under the top-down view, where variations in scale and viewpoint are minimal, intra-image augmentation methods (e.g., AdaResampling [30] and InstaBoost [35]) can produce visually realistic images. In contrast, cross-image augmentation methods (e.g., Cut–Paste–Learn [37]) exhibit noticeable mismatches in both scale and viewpoint.
- In the front view, intra-image augmentation methods also struggle to maintain visual realism, while cross-image augmentation methods induce significant background mismatch.
- By comparison, the proposed method consistently produces realistic augmented images.


- Failure Case Analysis
- (a) Over-exposure: In overexposed background scenarios, even though the illumination matching module ensures near-identical illumination intensity between the augmented instance and the background, the augmented instance may still exhibit noticeable visual inconsistency with the surrounding background due to compressed texture details.
- (b) Low illumination: Extremely low light results in a poor signal-to-noise ratio, which degrades the performance of semantic segmentation and scene understanding models, hindering the localization of plausible placement regions.
- (c) Abnormal viewpoint: When the viewpoint of the background image or instance deviates from the normal range, our viewpoint and scale matching framework fails due to insufficient physical reference cues, ultimately reducing geometric plausibility.
- (d) Insufficient reference instances: Scenes with too few or no valid reference instances prevent our matching strategy from acquiring sufficient prior information, making scale and viewpoint estimation unreliable.

4.2.3. Ablation and Analysis
- Impact of Visual Realism
- (1) Background matching plays a crucial role in cross-image augmentation, improving mAP@0.5 by 1.5%. Its effect is more modest in intra-image augmentation (mAP@0.5 increases by 0.4%), where background consistency is inherently higher.
- (2) Scale matching consistently improves detection performance for small objects, whereas random scaling strategies (e.g., InstaBoost [35]) can degrade it.
- (3) The results show that detection performance is strongly correlated with the visual realism of the augmented data: progressively introducing the realism components (background, illumination, scale, and viewpoint matching) leads to a corresponding increase in mAP.
| Method | Back. | Illu. | Scal. | View. | mAP | mAP@0.5 | mAP-S@0.5 | mAP-M@0.5 | mAP-L@0.5 |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | - | - | - | - | 34.6 | 56.7 | 45.9 | 70.9 | 83.8 |
| InstaBoost [35] | ✓ | - | - | - | 35.0 (+0.4) | 57.0 (+0.3) | 45.4 (−0.5) | 71.8 (+0.9) | 85.8 (+2.0) |
| (OURS) IP-IC | ✓ | - | - | - | 34.8 (+0.2) | 57.1 (+0.4) | 45.9 (+0.0) | 71.5 (+0.6) | 85.0 (+1.2) |
| (OURS) IP-IC | ✓ | - | ✓ | - | 35.0 (+0.4) | 57.4 (+0.7) | 46.2 (+0.3) | 72.0 (+1.1) | 84.7 (+0.9) |
| (OURS) IP-IC | ✓ | - | ✓ | ✓ | 35.2 (+0.6) | 57.6 (+0.9) | 46.3 (+0.4) | 72.2 (+1.3) | 84.6 (+0.8) |
| Method | Back. | Illu. | Scal. | View. | mAP | mAP@0.5 | mAP-S@0.5 | mAP-M@0.5 | mAP-L@0.5 |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | - | - | - | - | 34.6 | 56.7 | 45.9 | 70.9 | 83.8 |
| Cut–Paste–Learn [37] | - | - | - | - | 34.3 (−0.3) | 56.2 (−0.5) | 45.0 (−0.9) | 70.6 (−0.3) | 84.4 (+0.6) |
| (OURS) IP-CC | ✓ | - | - | - | 35.6 (+1.0) | 58.2 (+1.5) | 46.4 (+0.5) | 72.3 (+1.4) | 86.2 (+2.4) |
| (OURS) IP-CC | ✓ | ✓ | - | - | 35.8 (+1.2) | 58.3 (+1.6) | 46.6 (+0.7) | 72.8 (+1.9) | 86.3 (+2.5) |
| (OURS) IP-CC | ✓ | ✓ | ✓ | - | 35.7 (+1.1) | 58.4 (+1.7) | 47.2 (+1.3) | 72.5 (+1.6) | 86.1 (+2.3) |
| (OURS) IP-CC | ✓ | ✓ | ✓ | ✓ | 36.2 (+1.6) | 58.9 (+2.2) | 47.4 (+1.5) | 73.1 (+2.2) | 87.5 (+3.7) |
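The claimed realism-performance correlation can be checked directly on the cross-image ablation numbers above. The snippet below only restates the table's mAP@0.5 gains as realism components are added cumulatively (Back., +Illu., +Scal., +View.); the check itself is trivial arithmetic, not part of the paper.

```python
# Cumulative mAP@0.5 gains from the cross-image ablation table, in the order
# the realism components are introduced: Back., +Illu., +Scal., +View.
gains = [1.5, 1.6, 1.7, 2.2]

# Each added component preserves or increases the gain: the sequence is
# non-decreasing, consistent with the stated realism-performance correlation.
non_decreasing = all(a <= b for a, b in zip(gains, gains[1:]))
print(non_decreasing)  # True
```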
- Robustness Across Varying Training Set Sizes
- (1) Our method consistently improves detection performance regardless of training set size.
- (2) With very small training sets, models trained on the augmented dataset perform well on the val subset but achieve limited improvement on the test-dev subset, consistent with the expectation that insufficient data leads to overfitting.
- (3) As the training set grows, the cross-image composition strategy introduces greater background diversity, leading to larger performance gains.

- Generalization under Class Imbalance
- (1)
- (2) For classes with abundant samples (e.g., “pedestrian”, “car”), existing augmentation methods exhibit varying degrees of performance degradation.
- (3) In contrast, our method achieves stable and consistent performance gains across all classes.
| Method | Ped. | Person | Bicycle | Car | Van | Truck | Tricycle | Awn. | Bus | Motor |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 23.4 | 13.6 | 12.8 | 56.4 | 37.5 | 40.5 | 21.9 | 19.3 | 49.9 | 23.6 |
| Cut–Paste–Learn [37] | 22.5 (−0.9) | 13.2 (−0.4) | 12.5 (−0.3) | 55.8 (−0.6) | 36.3 (−1.2) | 41.9 (+1.4) | 21.4 (−0.5) | 20.0 (+0.7) | 51.0 (+1.1) | 23.1 (−0.5) |
| AdaResampling [30] | 22.9 (−0.5) | 13.0 (−0.6) | 12.4 (−0.4) | 56.3 (−0.1) | 37.8 (+0.3) | 41.3 (+0.8) | 20.3 (−1.6) | 19.8 (+0.5) | 50.7 (+0.8) | 23.1 (−0.5) |
| InstaBoost [35] | 23.1 (−0.3) | 13.2 (−0.4) | 13.3 (+0.5) | 56.2 (−0.2) | 37.7 (+0.2) | 43.0 (+2.5) | 22.6 (+0.7) | 20.1 (+0.8) | 52.3 (+2.4) | 22.8 (−0.8) |
| (OURS) IP-CC | 23.9 (+0.5) | 14.1 (+0.5) | 14.1 (+1.3) | 56.8 (+0.4) | 39.1 (+1.6) | 44.8 (+4.3) | 24.2 (+2.3) | 20.4 (+1.1) | 53.8 (+3.9) | 24.9 (+1.3) |
4.2.4. Preliminary Cross-Scenario Validation
- (1) Performance gains are observed across all five object categories, indicating the method's effectiveness in remote sensing scenarios.
- (2) Notably, the substantial improvement for the sample-scarce “helicopter” category matches the pattern observed on the VisDrone dataset.
| Method | mAP | Ship | LV | SV | HC | Plane |
|---|---|---|---|---|---|---|
| Baseline | 53.3 | 66.2 | 53.4 | 31.6 | 34.1 | 81.0 |
| (OURS) IP-CC | 54.2 (+0.9) | 66.9 (+0.7) | 53.7 (+0.3) | 32.7 (+1.1) | 36.0 (+1.9) | 81.5 (+0.5) |
4.2.5. Computational Cost and Efficiency Analysis
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| mAP | Mean Average Precision |
| UAV | Unmanned Aerial Vehicle |
References
- Zhang, Y.; Zhang, Y.; Fu, R.; Shi, Z.; Zhang, J.; Liu, D.; Du, J. Learning Nonlocal Quadrature Contrast for Detection and Recognition of Infrared Rotary-Wing UAV Targets in Complex Background. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5629919. [Google Scholar] [CrossRef]
- Zhang, T.; Zhang, X. A polarization fusion network with geometric feature embedding for SAR ship classification. Pattern Recognit. 2022, 123, 108365. [Google Scholar] [CrossRef]
- Gao, F.; Liu, S.; Gong, C.; Zhou, X.; Wang, J.; Dong, J.; Du, Q. Prototype-Based Information Compensation Network for Multisource Remote Sensing Data Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5513615. [Google Scholar] [CrossRef]
- Lin, J.; Gao, F.; Shi, X.; Dong, J.; Du, Q. SS-MAE: Spatial–Spectral Masked Autoencoder for Multisource Remote Sensing Image Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5531614. [Google Scholar] [CrossRef]
- Li, N.; Song, F.; Zhang, Y.; Liang, P.; Cheng, E. Traffic Context Aware Data Augmentation for Rare Object Detection in Autonomous Driving. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 4548–4554. [Google Scholar] [CrossRef]
- Wu, X.; Li, W.; Hong, D.; Tao, R.; Du, Q. Deep Learning for Unmanned Aerial Vehicle-Based Object Detection and Tracking: A survey. IEEE Geosci. Remote Sens. Mag. 2022, 10, 91–124. [Google Scholar] [CrossRef]
- Zhang, T.; Zhang, X.; Shi, J.; Wei, S. HyperLi-Net: A hyper-light deep learning network for high-accurate and high-speed ship detection from synthetic aperture radar imagery. ISPRS J. Photogramm. Remote Sens. 2020, 167, 123–153. [Google Scholar] [CrossRef]
- Zhang, T.; Zhang, X.; Gao, G. Divergence to Concentration and Population to Individual: A Progressive Approaching Ship Detection Paradigm for Synthetic Aperture Radar Remote Sensing Imagery. IEEE Trans. Aerosp. Electron. Syst. 2026, 62, 1325–1338. [Google Scholar] [CrossRef]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Springer International Publishing: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar] [CrossRef]
- Cheng, G.; Yuan, X.; Yao, X.; Yan, K.; Zeng, Q.; Xie, X.; Han, J. Towards Large-Scale small-object detection: Survey and Benchmarks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13467–13488. [Google Scholar] [CrossRef]
- Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
- Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object Detection with Discriminatively Trained Part-Based Models. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1627–1645. [Google Scholar] [CrossRef]
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar] [CrossRef]
- Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object detection via region-based fully convolutional networks. In Proceedings of the Advances in Neural Information Processing Systems 29, Barcelona, Spain, 5–10 December 2016; pp. 379–387. [Google Scholar]
- Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar] [CrossRef]
- Kisantal, M.; Wojna, Z.; Murawski, J.; Naruniec, J.; Cho, K. Augmentation for small-object detection. arXiv 2019, arXiv:1902.07296. [Google Scholar]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar] [CrossRef]
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar] [CrossRef]
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar] [CrossRef]
- Liu, Z.; Gao, G.; Sun, L.; Fang, Z. HRDNet: High-Resolution Detection Network for Small Objects. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6. [Google Scholar] [CrossRef]
- Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787. [Google Scholar] [CrossRef]
- Gao, F.; Jin, X.; Zhou, X.; Dong, J.; Du, Q. MSFMamba: Multiscale Feature Fusion State Space Model for Multisource Remote Sensing Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5504116. [Google Scholar] [CrossRef]
- Chen, C.; Liu, M.Y.; Tuzel, O.; Xiao, J. R-CNN for small-object detection. In Proceedings of the Computer Vision—ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016; Springer International Publishing: Cham, Switzerland, 2017; pp. 214–230. [Google Scholar] [CrossRef]
- Bell, S.; Zitnick, C.L.; Bala, K.; Girshick, R. Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2874–2883. [Google Scholar] [CrossRef]
- Zhang, G.; Lu, S.; Zhang, W. CAD-Net: A Context-Aware Detection Network for Objects in Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2019, 57, 10015–10024. [Google Scholar] [CrossRef]
- Cui, L.; Lv, P.; Jiang, X.; Gao, Z.; Zhou, B.; Zhang, L.; Shao, L.; Xu, M. Context-Aware Block Net for small-object detection. IEEE Trans. Cybern. 2022, 52, 2300–2313. [Google Scholar] [CrossRef]
- Zhang, Y.; Zhang, Y.; Shi, Z.; Fu, R.; Liu, D.; Zhang, Y.; Du, J. Enhanced Cross-Domain Dim and Small Infrared Target Detection via Content-Decoupled Feature Alignment. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5618416. [Google Scholar] [CrossRef]
- Zhang, T.; Zhang, X.; Liu, C.; Shi, J.; Wei, S.; Ahmad, I.; Zhan, X.; Zhou, Y.; Pan, D.; Li, J.; et al. Balance learning for ship detection from synthetic aperture radar remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2021, 182, 190–207. [Google Scholar] [CrossRef]
- Hao, X.; Liu, L.; Yang, R.; Yin, L.; Zhang, L.; Li, X. A Review of Data Augmentation Methods of Remote Sensing Image Target Recognition. Remote Sens. 2023, 15, 827. [Google Scholar] [CrossRef]
- Chen, C.; Zhang, Y.; Lv, Q.; Wei, S.; Wang, X.; Sun, X.; Dong, J. RRNet: A Hybrid Detector for Object Detection in Drone-Captured Images. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 100–108. [Google Scholar] [CrossRef]
- Yang, Z.; Yu, H.; Feng, M.; Sun, W.; Lin, X.; Sun, M.; Mao, Z.H.; Mian, A. Small Object Augmentation of Urban Scenes for Real-Time Semantic Segmentation. IEEE Trans. Image Process. 2020, 29, 5175–5190. [Google Scholar] [CrossRef]
- Bosquet, B.; Cores, D.; Seidenari, L.; Brea, V.M.; Mucientes, M.; Bimbo, A.D. A full data augmentation pipeline for small-object detection based on generative adversarial networks. Pattern Recognit. 2023, 133, 108998. [Google Scholar] [CrossRef]
- Hu, Z.; Wu, W.; Yang, Z.; Zhao, Y.; Xu, L.; Kong, L.; Chen, Y.; Chen, L.; Liu, G. A Cost-Sensitive Small Vessel Detection Method for Maritime Remote Sensing Imagery. Remote Sens. 2025, 17, 2471. [Google Scholar] [CrossRef]
- Ghiasi, G.; Cui, Y.; Srinivas, A.; Qian, R.; Lin, T.Y.; Cubuk, E.D.; Le, Q.V.; Zoph, B. Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 2917–2927. [Google Scholar] [CrossRef]
- Fang, H.S.; Sun, J.; Wang, R.; Gou, M.; Li, Y.L.; Lu, C. InstaBoost: Boosting Instance Segmentation via Probability Map Guided Copy-Pasting. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 682–691. [Google Scholar] [CrossRef]
- Nie, Z.; Cao, J.; Weng, N.; Yu, X.; Wang, M. Object-Based Perspective Transformation Data Augmentation for Object Detection. In Proceedings of the 2022 International Conference on Frontiers of Artificial Intelligence and Machine Learning (FAIML), Hangzhou, China, 19–21 June 2022; pp. 186–190. [Google Scholar] [CrossRef]
- Dwibedi, D.; Misra, I.; Hebert, M. Cut, Paste and Learn: Surprisingly Easy Synthesis for Instance Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1310–1319. [Google Scholar] [CrossRef]
- Zhao, H.; Sheng, D.; Bao, J.; Chen, D.; Chen, D.; Wen, F.; Yuan, L.; Liu, C.; Zhou, W.; Chu, Q.; et al. X-Paste: Revisiting Scalable Copy-Paste for Instance Segmentation using CLIP and StableDiffusion. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J., Eds.; Proceedings of Machine Learning Research (PMLR): Brookline, MA, USA, 2023; Volume 202, pp. 42098–42109. [Google Scholar]
- Dvornik, N.; Mairal, J.; Schmid, C. Modeling Visual Context Is Key to Augmenting Object Detection Datasets. In Proceedings of the Computer Vision—ECCV 2018: 15th European Conference, Munich, Germany, 8–14 September 2018; Springer International Publishing: Cham, Switzerland, 2018; pp. 375–391. [Google Scholar] [CrossRef]
- Zhang, L.; Wen, T.; Min, J.; Wang, J.; Han, D.; Shi, J. Learning Object Placement by Inpainting for Compositional Data Augmentation. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; Volume 12358, pp. 566–581. [Google Scholar] [CrossRef]
- Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and Tracking Meet Drones Challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7380–7399. [Google Scholar] [CrossRef]
- Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The Vision Meets Drone Object Detection in Image Challenge Results. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 213–226. [Google Scholar] [CrossRef]
- Deng, C.; Wang, M.; Liu, L.; Liu, Y.; Jiang, Y. Extended Feature Pyramid Network for small-object detection. IEEE Trans. Multimed. 2022, 24, 1968–1979. [Google Scholar] [CrossRef]
- Divvala, S.K.; Hoiem, D.; Hays, J.H.; Efros, A.A.; Hebert, M. An empirical study of context in object detection. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 1271–1278. [Google Scholar] [CrossRef]
- Hoiem, D.; Chodpathumwan, Y.; Dai, Q. Diagnosing Error in Object Detectors. In Proceedings of the Computer Vision—ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 340–353. [Google Scholar] [CrossRef]
- Cheng, P.; Liu, W.; Zhang, Y.; Ma, H. LOCO: Local Context Based Faster R-CNN for Small Traffic Sign Detection. In Proceedings of the MultiMedia Modeling: 24th International Conference, MMM 2018, Bangkok, Thailand, 5–7 February 2018; Schoeffmann, K., Chalidabhongse, T.H., Ngo, C.W., Aramvith, S., O’Connor, N.E., Ho, Y.S., Gabbouj, M., Elgammal, A., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 329–341. [Google Scholar] [CrossRef]
- Li, J.; Liang, X.; Wei, Y.; Xu, T.; Feng, J.; Yan, S. Perceptual Generative Adversarial Networks for Small Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1951–1959. [Google Scholar] [CrossRef]
- Zhang, Y.; Bai, Y.; Ding, M.; Ghanem, B. Multi-task Generative Adversarial Network for Detecting Small Objects in the Wild. Int. J. Comput. Vis. 2020, 128, 1810–1828. [Google Scholar] [CrossRef]
- Bashir, S.M.A.; Wang, Y. small-object detection in Remote Sensing Images with Residual Feature Aggregation-Based Super-Resolution and Object Detector Network. Remote Sens. 2021, 13, 1854. [Google Scholar] [CrossRef]
- Xiuling, Z.; Huijuan, W.; Yu, S.; Gang, C.; Suhua, Z.; Quanbo, Y. Starting from the structure: A review of small-object detection based on deep learning. Image Vis. Comput. 2024, 146, 105054. [Google Scholar] [CrossRef]
- Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar] [CrossRef]
- Wang, K.; Fang, B.; Qian, J.; Yang, S.; Zhou, X.; Zhou, J. Perspective Transformation Data Augmentation for Object Detection. IEEE Access 2020, 8, 4935–4943. [Google Scholar] [CrossRef]
- Zoph, B.; Cubuk, E.D.; Ghiasi, G.; Lin, T.Y.; Shlens, J.; Le, Q.V. Learning Data Augmentation Strategies for Object Detection. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; Volume 12372, pp. 566–583. [Google Scholar] [CrossRef]
- Kim, J.H.; Hwang, Y. GAN-Based Synthetic Data Augmentation for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5002512. [Google Scholar] [CrossRef]
- Fang, H.; Han, B.; Zhang, S.; Zhou, S.; Hu, C.; Ye, W.M. Data Augmentation for Object Detection via Controllable Diffusion Models. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 1246–1255. [Google Scholar] [CrossRef]
- Li, Y.; Dong, X.; Chen, C.; Zhuang, W.; Lyu, L. A Simple Background Augmentation Method for Object Detection with Diffusion Model. In Proceedings of the Computer Vision—ECCV 2024: 18th European Conference, Milan, Italy, 29 September–4 October 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer Nature: Cham, Switzerland, 2025; pp. 462–479. [Google Scholar] [CrossRef]
- Alimisis, P.; Mademlis, I.; Radoglou-Grammatikis, P.; Sarigiannidis, P.; Papadopoulos, G.T. Advances in diffusion models for image data augmentation: A review of methods, models, evaluation metrics and future research directions. Artif. Intell. Rev. 2025, 58, 112. [Google Scholar] [CrossRef]
- Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
- Duan, C.; Wei, Z.; Zhang, C.; Qu, S.; Wang, H. Coarse-grained Density Map Guided Object Detection in Aerial Images. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, QC, Canada, 11–17 October 2021; pp. 2789–2798. [Google Scholar] [CrossRef]
- Georgakis, G.; Mousavian, A.; Berg, A.C.; Košecká, J. Synthesizing training data for object detection in indoor scenes. In Proceedings of the Robotics: Science and Systems, Massachusetts Institute of Technology, Cambridge, MA, USA, 12–16 July 2017; Volume 13. [Google Scholar] [CrossRef]
- Hong, S.; Kang, S.; Cho, D. Patch-Level Augmentation for Object Detection in Aerial Images. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 127–134. [Google Scholar] [CrossRef]
- Liu, S.; Guo, H.; Hu, J.G.; Zhao, X.; Zhao, C.; Wang, T.; Zhu, Y.; Wang, J.; Tang, M. A novel data augmentation scheme for pedestrian detection with attribute preserving GAN. Neurocomputing 2020, 401, 123–132. [Google Scholar] [CrossRef]
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar] [CrossRef]
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 3992–4003. [Google Scholar] [CrossRef]
- Jiang, J.; Zhang, K.; Timofte, R. Towards Flexible Blind JPEG Artifacts Removal. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 4977–4986. [Google Scholar] [CrossRef]
- Yu, T.; Feng, R.; Feng, R.; Liu, J.; Jin, X.; Zeng, W.; Chen, Z. Inpaint Anything: Segment Anything Meets Image Inpainting. arXiv 2023, arXiv:2304.06790. [Google Scholar] [CrossRef]
- Huang, X.; Huang, Y.J.; Zhang, Y.; Tian, W.; Feng, R.; Zhang, Y.; Xie, Y.; Li, Y.; Zhang, L. Open-Set Image Tagging with Multi-Grained Text Supervision. In Proceedings of the 33rd ACM International Conference on Multimedia, MM ’25, Dublin, Ireland, 27–31 October 2025; pp. 4117–4126. [Google Scholar] [CrossRef]
- DeepSeek-AI; Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025, arXiv:2501.12948. [Google Scholar]
- Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H.; et al. Grounding DINO: Marrying DINO with Grounded Pre-training for Open-Set Object Detection. In Proceedings of the Computer Vision—ECCV 2024: 18th European Conference, Milan, Italy, 29 September–4 October 2024; Springer Nature: Cham, Switzerland, 2025; pp. 38–55. [Google Scholar] [CrossRef]
- Li, J.; Li, D.; Xiong, C.; Hoi, S. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S., Eds.; Proceedings of Machine Learning Research (PMLR): Brookline, MA, USA, 2022; Volume 162, pp. 12888–12900. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Meila, M., Zhang, T., Eds.; Proceedings of Machine Learning Research (PMLR): Brookline, MA, USA, 2021; Volume 139, pp. 8748–8763. [Google Scholar]
- Ke, L.; Li, S.; Sun, Y.; Tai, Y.W.; Tang, C.K. GSNet: Joint Vehicle Pose and Shape Reconstruction with Geometrical and Scene-Aware Supervision. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 515–532. [Google Scholar] [CrossRef]
- Kouros, G.; Shrivastava, S.; Picron, C.; Nagesh, S.; Chakravarty, P.; Tuytelaars, T. Category-Level Pose Retrieval with Contrastive Features Learnt with Occlusion Augmentation. In Proceedings of the 33rd British Machine Vision Conference 2022, BMVC 2022, London, UK, 21–24 November 2022; BMVA Press: Malvern, UK, 2022. [Google Scholar]
- Klee, D.M.; Biza, O.; Platt, R.; Walters, R. Image to Sphere: Learning Equivariant Features for Efficient Pose Prediction. arXiv 2023, arXiv:2302.13926. [Google Scholar] [CrossRef]
- Xiang, Y.; Mottaghi, R.; Savarese, S. Beyond PASCAL: A benchmark for 3D object detection in the wild. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Steamboat Springs, CO, USA, 24–26 March 2014; pp. 75–82. [Google Scholar] [CrossRef]
- Song, X.; Wang, P.; Zhou, D.; Zhu, R.; Guan, C.; Dai, Y.; Su, H.; Li, H.; Yang, R. ApolloCar3D: A Large 3D Car Instance Understanding Benchmark for Autonomous Driving. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5447–5457. [Google Scholar] [CrossRef]
- Kundu, A.; Li, Y.; Rehg, J.M. 3D-RCNN: Instance-Level 3D Object Reconstruction via Render-and-Compare. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3559–3568. [Google Scholar] [CrossRef]
- Waqas Zamir, S.; Arora, A.; Gupta, A.; Khan, S.; Sun, G.; Shahbaz Khan, F.; Zhu, F.; Shao, L.; Xia, G.S.; Bai, X. iSAID: A Large-Scale Dataset for Instance Segmentation in Aerial Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019; pp. 28–37. [Google Scholar]
- Jocher, G. Ultralytics/Yolov5: v7.0—YOLOv5 SOTA Realtime Instance Segmentation. License: AGPL-3.0. Available online: https://github.com/ultralytics/yolov5 (accessed on 20 August 2025).
- Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, QC, Canada, 11–17 October 2021; pp. 2778–2788. [Google Scholar] [CrossRef]
- Du, B.; Huang, Y.; Chen, J.; Huang, D. Adaptive Sparse Convolutional Networks with Global Context Enhancement for Faster Object Detection on Drone Images. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 13435–13444. [Google Scholar] [CrossRef]
| Method | Num. Images | YOLOv5 mAP | YOLOv5 mAP@0.5 | TPH-YOLOv5 mAP | TPH-YOLOv5 mAP@0.5 | GFL V1-CEASC mAP | GFL V1-CEASC mAP@0.5 | RT-DETR mAP | RT-DETR mAP@0.5 |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | 1× | 34.6 | 56.7 | 35.1 | 58.2 | 26.4 | 45.4 | 35.2 | 58.9 |
| Cut–Paste–Learn [37] | 1× | 34.3 (−0.3) | 56.2 (−0.5) | 34.9 (−0.2) | 58.0 (−0.2) | 26.3 (−0.1) | 45.5 (+0.1) | 35.0 (−0.2) | 58.7 (−0.2) |
| AdaResampling [30] | 1× | 34.6 (+0.0) | 56.8 (+0.1) | 34.5 (−0.6) | 57.2 (−1.0) | 26.4 (+0.0) | 45.4 (+0.0) | 35.2 (+0.0) | 58.8 (−0.1) |
| (OURS) OG-IC | 1× | 34.7 (+0.1) | 56.9 (+0.2) | 35.2 (+0.1) | 58.1 (−0.1) | 26.6 (+0.2) | 45.5 (+0.1) | 35.5 (+0.3) | 59.4 (+0.5) |
| (OURS) OG-CC | 1× | 34.9 (+0.3) | 57.3 (+0.6) | 35.4 (+0.3) | 58.3 (+0.1) | 26.7 (+0.3) | 46.0 (+0.6) | 35.7 (+0.5) | 59.6 (+0.7) |
| InstaBoost [35] | 2× | 35.0 (+0.4) | 57.0 (+0.3) | 35.7 (+0.6) | 59.0 (+0.8) | 26.8 (+0.4) | 45.9 (+0.5) | 35.7 (+0.5) | 60.1 (+1.2) |
| (OURS) IP-IC | 2× | 35.2 (+0.6) | 57.6 (+0.9) | 36.0 (+0.9) | 59.2 (+1.0) | 26.9 (+0.5) | 46.4 (+1.0) | 36.2 (+1.0) | 60.3 (+1.4) |
| (OURS) IP-CC | 2× | 36.2 (+1.4) | 58.9 (+2.1) | 36.3 (+1.1) | 59.9 (+1.6) | 27.1 (+0.6) | 46.5 (+1.0) | 36.2 (+1.0) | 60.5 (+1.5) |
| Method | Instance Prep. (s/Instance) | Background Prep. (s/Image) | Composition (s/Image) | GPU Mem. (GB) | Training Duration |
|---|---|---|---|---|---|
| Baseline | - | - | - | - | 1× |
| Cut–Paste–Learn [37] | 0.09 | - | 0.59 | - | 1× |
| AdaResampling [30] | - | 1.12 | 0.04 | - | 1× |
| InstaBoost [35] | 0.09 | - | 9.52 | - | 4× |
| (OURS) OG-IC | 0.13 | 7.05 | 1.46 | 7.2 | 1× |
| (OURS) IP-CC | 0.13 | 7.05 | 8.64 | 7.2 | 4× |
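The per-stage timings in the table above can be combined into an overall offline cost per augmented image. As a minimal sketch (the additive cost model and the instance count of 10 are our assumptions, not figures from the paper), the total is the per-instance preparation time multiplied by the number of pasted instances, plus the per-image background preparation and composition times:

```python
def augmentation_time_per_image(num_instances: int,
                                instance_prep_s: float,
                                background_prep_s: float,
                                composition_s: float) -> float:
    """Total offline augmentation time for one image, in seconds.

    Assumes the three stages run sequentially and that instance
    preparation is paid once per pasted instance.
    """
    return num_instances * instance_prep_s + background_prep_s + composition_s

# Example: OG-IC timings from the table (0.13 s/instance, 7.05 s/image,
# 1.46 s/image) with a hypothetical 10 pasted instances per image.
print(round(augmentation_time_per_image(10, 0.13, 7.05, 1.46), 2))  # 9.81
```

Under these assumptions, background preparation dominates the OG-IC budget, while for IP-CC the composition stage (8.64 s/image) becomes comparable to it.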
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Li, C.; Zhang, Z.; Zhong, P.; He, J. A Realistic Instance-Level Data Augmentation Method for Small-Object Detection Based on Scene Understanding. Remote Sens. 2026, 18, 647. https://doi.org/10.3390/rs18040647

