1. Introduction
UAVs have become an important platform for remote sensing, supporting applications such as urban monitoring, intelligent transportation, infrastructure inspection, and environmental surveillance [1,2]. Benefiting from high spatial resolution, flexible acquisition modes, and rapid response capabilities, UAV imagery effectively captures small-scale targets, making it well-suited for vehicle detection tasks in complex urban environments.
Deep learning-based detectors rely heavily on large-scale and high-quality annotated datasets [3,4]. However, collecting and labeling real-world UAV data is prohibitively costly and time-consuming. Furthermore, constrained by flight regulations, weather conditions, and operational safety, real data collection often fails to capture rare corner cases, such as truncated objects caused by variations in altitude and viewing angle, or hazardous scenarios [5,6]. These limitations hinder the acquisition of sufficiently diverse and balanced datasets for training robust vehicle detectors, especially when precise annotations are required.
Synthetic data generation provides a promising solution to address these challenges. This technology enables full control over illumination, sensor parameters, and scene composition, while providing complete and automatically generated annotations. Moreover, synthetic datasets constructed through this approach can reproduce rare, hazardous, or operationally constrained scenarios that are difficult to capture in real-world UAV missions. Consequently, synthetic data serves as an effective supplement to real data, compensating for its limitations in coverage and scene diversity.
The practical value of synthetic data depends on its ability to be effectively deployed in real-world environments. However, many existing synthetic UAV datasets, constructed using 3D engines such as Carla [7] and AirSim [8], typically rely on simplified 3D assets or generic urban templates [9]. These environments do not accurately reflect real-world geospatial structures, terrain geometry, or UAV-specific imaging characteristics. As a result, the Sim2Real domain gap remains substantial, causing models trained on such data to degrade sharply in real-world deployment [10,11,12]. Notably, most existing approaches still require mixing synthetic data with real data or fine-tuning on a real dataset to reach a satisfactory performance baseline [13,14].
To address these challenges, this paper proposes a synthetic data generation framework that combines scene fidelity with diversified rendering (Figure 1). First, oblique photogrammetry is employed to reconstruct real-world 3D models from UAV imagery, or existing 3D tiles (e.g., the .b3dm format) are directly integrated, ensuring the authenticity of geometric structures and texture details without extensive manual modeling. Second, diversified rendering strategies are adopted to simulate variations in illumination, weather conditions, and UAV viewpoints, enhancing model adaptability to complex environments. Finally, an automated annotation algorithm based on semantic masks is introduced to generate pixel-level accurate labels, significantly reducing annotation costs. Based on this framework, we construct a synthetic dataset named UAV-SynthScene (Figure 2). The framework further supports the controllable synthesis of long-tail and underrepresented object instances to mitigate sampling imbalance, providing an efficient and scalable supplement to labor-intensive real data acquisition.
Comprehensive experiments conducted on real-world UAV datasets using six mainstream detectors demonstrate that models trained solely on UAV-SynthScene achieve performance comparable to those trained on real data (Figure 3). In particular, the proposed synthetic data significantly improves model robustness under long-tail data distributions and exhibits a degree of cross-domain generalization across multiple real-world UAV datasets.
The main contributions are as follows:
1. A Data Generation Framework: We propose a high-fidelity synthetic data generation framework designed to minimize the Sim2Real gap at the source. This framework enhances the practical utility of synthetic data, establishing it as a powerful supplement to real-world UAV imagery.
2. A High-Quality Synthetic Dataset: Based on the proposed framework, we construct UAV-SynthScene, a high-quality and realistic synthetic dataset specifically designed for UAV-based vehicle detection.
3. Comprehensive Experimental Validation: Extensive experiments are conducted using six mainstream detectors on multiple real-world UAV datasets to validate the effectiveness of the generated synthetic data.
3. Materials and Methods
The proposed framework provides a complete workflow for generating high-fidelity, diverse synthetic data for UAV-based object detection, designed to minimize the domain gap with real-world aerial imagery. As illustrated in Figure 4, the overall pipeline consists of three key stages: scene reconstruction, rendering diversification, and automatic annotation. Each stage is designed to address a specific limitation of conventional synthetic data, including insufficient realism, restricted variation, high annotation cost, and dataset imbalance, while maintaining full controllability throughout the rendering process.
Our methodology is implemented within a simulation pipeline built upon Colosseum [53] (the successor to Microsoft AirSim) and Unreal Engine 5.1. First, realistic static environments are reconstructed from real UAV imagery through oblique photogrammetry, which ensures geometric and textural consistency with real-world urban scenes. Second, dynamic environmental conditions such as lighting, weather, and camera viewpoint are systematically varied using physically based rendering to improve generalization. Third, each rendered image is automatically annotated with precise bounding boxes derived from semantic instance masks generated within the rendering engine. Finally, targeted synthesis of long-tail samples—including truncated, occluded, or boundary vehicles—is performed to mitigate the long-tail distribution commonly observed in UAV datasets.
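To make the capture step concrete, the sketch below shows how paired RGB and segmentation frames can be requested in a single call through the Colosseum/AirSim Python client; the camera name, vehicle mesh patterns, and object IDs are illustrative placeholders rather than the assets actually used in our scenes.

```python
# Minimal sketch (assumed setup): capture paired RGB and segmentation frames
# via the Colosseum/AirSim Python API so both renders share one simulation tick.
import numpy as np
import airsim

client = airsim.MultirotorClient()
client.confirmConnection()

# Assign a distinct segmentation ID to each vehicle actor so every instance
# receives a unique color in the segmentation pass (mesh patterns are placeholders).
for obj_id, mesh_pattern in enumerate(["sedan.*", "suv.*", "truck.*", "bus.*"], start=1):
    client.simSetSegmentationObjectID(mesh_pattern, obj_id, True)

responses = client.simGetImages([
    airsim.ImageRequest("0", airsim.ImageType.Scene, False, False),         # RGB frame
    airsim.ImageRequest("0", airsim.ImageType.Segmentation, False, False),  # instance colors
])

def to_array(resp):
    buf = np.frombuffer(resp.image_data_uint8, dtype=np.uint8)
    return buf.reshape(resp.height, resp.width, 3)

rgb, seg = to_array(responses[0]), to_array(responses[1])
```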
3.1. High-Fidelity Scene Construction
Background modeling. We employ automated photogrammetry to reconstruct textured 3D meshes from UAV imagery, generating high-fidelity digital twin environments. The reconstructed background preserves the geometry of terrain, roads, and buildings, while retaining intricate texture details and spatial layouts. Compared with generic 3D asset libraries, this approach substantially reduces the stylistic gap between synthetic and real scenes, which helps reduce the domain gap from the source.
Object modeling. A comprehensive vehicle asset library is developed, covering categories such as sedans, SUVs, trucks, and buses. By leveraging Unreal Engine’s material system, realistic surface properties are assigned to each vehicle, including color, metalness, roughness, and albedo. This ensures both accurate geometric fidelity and broad coverage of real-world appearance variations, as illustrated in Figure 5.
Environment modeling. We integrate the UltraDynamicSky plugin of Unreal Engine to simulate physically based temporal and weather dynamics. This includes solar elevation, illumination color, shadow intensity, as well as complex weather conditions such as overcast skies, rainfall, and fog. Such modeling provides UAV target detection with spatiotemporal variations that closely resemble those observed in real environments.
3.2. Diversified Rendering
While static scene reconstruction achieves high fidelity, a single fixed rendering configuration cannot capture the variability of real-world phenomena such as changing illumination, atmospheric scattering, and dynamic camera perspectives. To overcome this limitation, we implement targeted diversification strategies to ensure that the synthetic data covers a wide spectrum of complex conditions encountered in real-world operations.
To achieve illumination diversity, we simulate the full diurnal cycle and a variety of weather scenarios (Figure 6) by dynamically adjusting solar elevation, light intensity, and atmospheric conditions. This ensures that the dataset spans a wide range of lighting and meteorological conditions.
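As an illustration of how such variation can be scripted, the snippet below uses the Colosseum/AirSim weather and time-of-day API; in our pipeline these effects are driven by the UltraDynamicSky plugin inside Unreal Engine, so this is an analogous simulator-level alternative rather than the exact mechanism we use.

```python
# Sketch (analogous API, not the UltraDynamicSky path used in the paper):
# script time-of-day and weather variation through the Colosseum/AirSim client.
import airsim

client = airsim.MultirotorClient()
client.confirmConnection()

# Advance the simulated clock so the sun sweeps from early morning onwards.
client.simSetTimeOfDay(True, start_datetime="2024-06-01 06:30:00",
                       celestial_clock_speed=1, move_sun=True)

# Enable weather effects and blend in rain and fog (intensities in [0, 1]).
client.simEnableWeather(True)
client.simSetWeatherParameter(airsim.WeatherParameter.Rain, 0.3)
client.simSetWeatherParameter(airsim.WeatherParameter.Fog, 0.2)
```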
To ensure UAV viewpoint diversity, we produce multi-perspective observations by randomizing flight altitude, pitch, yaw, and roll. This enhances the robustness of detectors when deployed in real UAV missions. The camera in Unreal Engine is configured with a focal length of 10 mm, a sensor size of 6.4 mm × 5.12 mm, and a resolution of 1920 × 1536.
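For reference, a short sketch of the camera geometry implied by these intrinsics, together with pose sampling over the ranges reported for UAV-SynthScene in Section 4.1, is given below; the helper names and the use of uniform sampling are assumptions for illustration.

```python
# Sketch: field of view and nadir ground sampling distance (GSD) implied by a
# 10 mm focal length, 6.4 mm x 5.12 mm sensor, and 1920 x 1536 pixels, plus a
# pose sampler over the ranges reported in Section 4.1 (uniform sampling assumed).
import math
import random

FOCAL_MM, SENSOR_W_MM, SENSOR_H_MM = 10.0, 6.4, 5.12
IMG_W, IMG_H = 1920, 1536

def fov_deg(sensor_mm: float) -> float:
    return math.degrees(2.0 * math.atan(sensor_mm / (2.0 * FOCAL_MM)))

def nadir_gsd_m(altitude_m: float) -> float:
    # Ground distance covered by one pixel for a nadir-looking camera.
    return altitude_m * SENSOR_W_MM / (FOCAL_MM * IMG_W)

def sample_pose() -> dict:
    # Altitude 50-300 m, yaw -10..10 deg, pitch -90..-45 deg, roll fixed at 0 deg.
    return {
        "altitude_m": random.uniform(50.0, 300.0),
        "yaw_deg": random.uniform(-10.0, 10.0),
        "pitch_deg": random.uniform(-90.0, -45.0),
        "roll_deg": 0.0,
    }

print(f"HFOV = {fov_deg(SENSOR_W_MM):.1f} deg, VFOV = {fov_deg(SENSOR_H_MM):.1f} deg")
print(f"Nadir GSD at 150 m = {nadir_gsd_m(150.0) * 100:.1f} cm/px")
```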
For dynamic traffic flow modeling, we employ spline-driven trajectories to simulate traffic flow, producing realistic spatial distributions and relative vehicle motions. In contrast to purely random placement, this method preserves naturalistic traffic patterns while ensuring sufficient diversity.
For long-tail sample generation, we systematically generate truncated samples by controlling camera poses and sampling strategies to position objects at image boundaries. This mechanism significantly mitigates the scarcity of long-tail samples commonly found in real datasets.
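One simple way to realize this selection is to project each vehicle’s box into the image and keep it only if a sufficient fraction remains visible after clipping to the frame; the sketch below shows such a check, with the visibility floor being an illustrative threshold rather than a value from our pipeline.

```python
# Sketch: decide whether a projected box yields a useful truncated sample.
# The 0.25 minimum visible fraction is an assumed, illustrative threshold.
def truncation_ratio(box, img_w=1920, img_h=1536):
    """box = (x_min, y_min, x_max, y_max) in pixels, possibly extending outside the image."""
    x1, y1, x2, y2 = box
    full_area = max(x2 - x1, 0) * max(y2 - y1, 0)
    if full_area == 0:
        return 1.0
    # Clip the box to the image and measure the fraction that was cut away.
    vis_w = max(min(x2, img_w) - max(x1, 0), 0)
    vis_h = max(min(y2, img_h) - max(y1, 0), 0)
    return 1.0 - (vis_w * vis_h) / full_area

def is_truncated_sample(box, min_visible=0.25):
    ratio = truncation_ratio(box)
    # Truncated (ratio > 0) but still visible enough to be a valid annotation.
    return 0.0 < ratio <= 1.0 - min_visible
```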
3.3. Automated Ground Truth Generation
To circumvent the high cost of manual labeling, we generate instance-level semantic segmentation maps in parallel with RGB images during rendering. Each vehicle is encoded with a unique RGB value, enabling automated extraction of object masks and bounding boxes. The entire procedure is detailed in Algorithm 1. Compared with manual annotation, this approach offers three distinct advantages:
1. Consistency and completeness: Eliminates the subjective biases and annotation omissions inherent in manual labeling.
2. Low-cost scalability: Enables the generation of large-scale, richly annotated datasets at near-zero marginal cost, removing the primary barrier to data scaling.
3. Annotation precision: In contrast to the inherent imprecision of manual labeling, our approach guarantees tightly fitted boxes derived from instance segmentations, thus providing superior annotation quality.
Algorithm 1 Automated Annotation from Semantic Masks
1: Input: root directory D of image–mask pairs (I_rgb, I_sem); pixel threshold τ; color-to-class map M.
2: Output: a dataset of images and corresponding YOLO-format label files.
3: for each semantic mask I_sem in D do
4:   I_sem ← RemoveNoise(I_sem)  {Pre-process the mask using a median filter.}
5:   L ← ∅  {Initialize the label list for the current image.}
6:   (W, H) ← GetDimensions(I_sem)
7:   for each (color C, class_name V) in M do
8:     B ← CreateBinaryMask(I_sem, C)
9:     if CountPixels(B) ≥ τ then
10:      (x_min, y_min, x_max, y_max) ← FindBoundingBox(B)
11:      (x_c, y_c, w, h) ← ConvertToYOLO(x_min, y_min, x_max, y_max, W, H)  {Normalize coordinates.}
12:      Append (V, x_c, y_c, w, h) to L.
13:    end if
14:  end for
15:  if L is not empty then
16:    Find the corresponding RGB image I_rgb.
17:    SaveImageToDataset(I_rgb)
18:    SaveLabelsToDataset(L)
19:  end if
20: end for
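For concreteness, a minimal Python counterpart of Algorithm 1 is sketched below, assuming each vehicle instance is encoded by a unique RGB value in a PNG mask and that labels are written in YOLO format; the color map, pixel threshold, and I/O layout are illustrative.

```python
# Sketch of Algorithm 1 (assumed color map, threshold, and file layout).
import cv2
import numpy as np

COLOR_TO_CLASS = {(0, 0, 255): 0}   # (B, G, R) instance color -> class id ("vehicle")
PIXEL_THRESHOLD = 50                # drop instances with fewer visible pixels

def mask_to_yolo_labels(mask_path: str) -> list[str]:
    mask = cv2.imread(mask_path)          # BGR instance mask
    mask = cv2.medianBlur(mask, 3)        # RemoveNoise: suppress isolated pixels
    h, w = mask.shape[:2]
    labels = []
    for color, cls_id in COLOR_TO_CLASS.items():
        binary = cv2.inRange(mask, color, color)      # CreateBinaryMask
        ys, xs = np.nonzero(binary)
        if xs.size < PIXEL_THRESHOLD:                 # CountPixels >= threshold
            continue
        x_min, x_max = int(xs.min()), int(xs.max())   # FindBoundingBox
        y_min, y_max = int(ys.min()), int(ys.max())
        xc = (x_min + x_max + 1) / 2.0 / w            # ConvertToYOLO (normalized)
        yc = (y_min + y_max + 1) / 2.0 / h
        bw = (x_max - x_min + 1) / w
        bh = (y_max - y_min + 1) / h
        labels.append(f"{cls_id} {xc:.6f} {yc:.6f} {bw:.6f} {bh:.6f}")
    return labels
```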
In summary, high-fidelity scene modeling ensures consistency with real-world environments, diversified rendering enhances the model’s robustness to complex scenarios, and fully automated ground-truth generation resolves the cost bottleneck of large-scale labeling.
4. Results
This section systematically validates the effectiveness and superiority of our proposed high-fidelity synthetic data generation framework through a series of comprehensive experiments. Our evaluation revolves around several core dimensions. First, we conduct a direct performance comparison of our generated synthetic dataset, UAV-SynthScene, against real data and two other mainstream public synthetic datasets to confirm its viability as an effective supplement to real data. Next, we delve into the framework’s robustness in handling typical long-tail distribution scenarios, such as “truncated objects.” Furthermore, we assess the cross-domain generalization capability of models trained exclusively on our synthetic data in unseen environments through zero-shot transfer experiments. Finally, through meticulous ablation studies, we precisely quantify the individual key contributions of the two core components of our framework: high-fidelity scene reconstruction and diversified rendering.
4.1. Datasets
Our experiments are conducted on a collection of real and synthetic UAV datasets:
UAV-RealScene: A real dataset captured by DJI Mavic 3T UAVs (Shenzhen, China) across an area of approximately 1000 m × 2000 m, covering both urban and suburban environments. It contains 8736 images at a resolution of 640 × 480, with 8000 images used for training and 736 for testing. All vehicle instances are manually annotated. This dataset serves as the real-world benchmark for performance comparison.
UAV-SynthScene: Constructed using the proposed synthetic data generation framework. Ten flight paths were designed, with altitudes randomly sampled between 50–300 m, heading and pitch angles varied within −10° to 10° and −45° to −90°, respectively, while roll angles were fixed at 0°. In total, 4659 images of resolution 1920 × 1536 were generated. The rendering process incorporated diverse weather conditions (sunny, cloudy, foggy) and lighting conditions (sunrise, noon, sunset), with dynamic simulation of solar and cloud variations to enhance generalization.
SkyScenes [13]: A large-scale public synthetic dataset for aerial scene understanding. This dataset is generated using a high-fidelity simulator and provides rich annotations for various computer vision tasks. In our experiments, SkyScenes serves as an advanced synthetic data benchmark for performance comparison with the proposed data generation method.
UEMM-Air [54]: A large-scale multimodal synthetic dataset constructed using Unreal Engine, covering aerial scenes, environmental variations, and multi-task annotations. It is used in ablation studies to compare the performance of our high-fidelity reconstruction approach against generic 3D asset–based generation.
RGBT-Tiny [55]: A public UAV-based visible-thermal benchmark dataset containing multiple image sequences. In our cross-domain generalization tests, to maintain evaluation consistency, we selected one representative sequence for testing.
VisDrone [56]: A large-scale, challenging benchmark dataset captured by various drone platforms across 14 different cities in China. It encompasses a wide spectrum of complex scenarios, including dense traffic, cluttered backgrounds, and significant variations in object scale and viewpoint. Due to its diversity and difficulty, VisDrone serves as an ideal unseen real-world benchmark in our experiments to rigorously evaluate the cross-domain generalization capability of the models.
4.2. Implementation Settings
The majority of our experiments were implemented using the MMDetection toolbox [57] on a single NVIDIA GeForce RTX 4090 GPU. Models available in this library (DINO, Deformable-DETR [58], DDQ [59]) were trained using their default configurations with a ResNet-50 backbone and a Feature Pyramid Network (FPN) neck. For detectors not integrated into MMDetection, we utilized their official open-source implementations: DEIM [60] was trained using its author-provided code, while YOLOv11 and RT-DETR [61] were trained via Ultralytics. All vehicle types are treated as a single category for both training and evaluation.
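For the Ultralytics-based detectors, training on the synthetic data reduces to pointing the standard API at a YOLO-format dataset description; the sketch below shows the shape of such a run, with the dataset YAML path and hyperparameters being illustrative rather than the exact settings used in our experiments.

```python
# Sketch: training the Ultralytics detectors on YOLO-format synthetic labels.
# "uav_synthscene.yaml" and the hyperparameters are placeholders.
from ultralytics import YOLO, RTDETR

yolo = YOLO("yolo11n.pt")
yolo.train(data="uav_synthscene.yaml", epochs=100, imgsz=640)

rtdetr = RTDETR("rtdetr-l.pt")
rtdetr.train(data="uav_synthscene.yaml", epochs=100, imgsz=640)
```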
For performance evaluation, we adhere to the official COCO protocol. It is crucial to note the discrepancy between the ground-truth annotations in our synthetic data and those in real datasets. The bounding boxes in our synthetic datasets are programmatically derived from pixel-perfect instance masks, yielding exceptionally precise boundaries. Conversely, manual annotations may exhibit unavoidable looseness due to annotation subjectivity (Figure 7). This inherent difference can lead to the penalization of accurate predictions at high Intersection over Union (IoU) thresholds (e.g., 0.75 and above). To mitigate this bias and focus more on detection recall than on hyper-precise localization, we select AP50 (AP at an IoU threshold of 0.5) as our primary evaluation metric. Accordingly, AP50 serves as the default metric for all experiments presented hereafter, unless explicitly stated otherwise.
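In practice, AP50 can be read directly from the standard COCO evaluation summary; the snippet below illustrates this with pycocotools, assuming predictions have been exported to COCO-format JSON (file names are placeholders).

```python
# Sketch: extracting AP at IoU = 0.5 from the COCO evaluation summary.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("uav_realscene_test.json")        # ground-truth annotations
coco_dt = coco_gt.loadRes("detections.json")     # detector outputs

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()

ap50 = evaluator.stats[1]   # index 1 holds AP at IoU = 0.50
print(f"AP50 = {ap50:.3f}")
```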
4.3. Performance Evaluation
4.3.1. Effectiveness of Synthetic Data as a Scalable Supplement to Real Data
To comprehensively evaluate the effectiveness of our UAV-SynthScene dataset, we conducted rigorous benchmark tests against the real-world UAV-RealScene dataset and two other public synthetic datasets, SkyScenes and UEMM-Air. Importantly, to ensure practical relevance, all experiments use real data as the evaluation benchmark. Specifically, six representative detectors, including YOLOv11, Deformable-DETR, DINO, DDQ, DEIM, and RT-DETR, are selected for experimental validation. These models cover both Convolutional Neural Network (CNN)-based and Transformer-based architectures, encompassing high-accuracy and real-time detection paradigms. This comprehensive evaluation is designed to assess the general applicability of the proposed synthetic dataset across different architectures, thereby ensuring the robustness and generality of the research conclusions.
The experimental results are presented in Table 1. The analysis reveals several key points. First, the performance of models trained exclusively on our UAV-SynthScene is remarkably close to that of models trained on the large-scale real-world UAV-RealScene data. Using the AP50 metric as an example, the model trained on our synthetic data achieved 87.9% with the DINO detector, identical to the performance of the model trained on real data. Second, compared to the other two synthetic datasets, our UAV-SynthScene demonstrates a clear advantage. For instance, when using the DEIM detector, our model achieved an AP50 of 88.7%, significantly outperforming models trained on SkyScenes (86.5%) and UEMM-Air (72.0%). This performance advantage is consistent across all tested detector architectures, strongly indicating that our strategy of combining high-fidelity reconstruction with diversified rendering is superior to existing synthetic data generation methods in bridging the Sim2Real gap.
In addition to the quantitative metrics, we further conducted a qualitative analysis by visualizing the detection results (Figure 8) to gain deeper insight into the impact of different training datasets on model robustness. In Figure 8, green bounding boxes denote correctly detected vehicles, while red boxes highlight missed detections (false negatives). Visual inspection reveals that models trained on SkyScenes and UEMM-Air exhibit severe missed detections, as evidenced by the abundance of red boxes. In sharp contrast, models trained on our UAV-SynthScene dataset (the “Pred (ours)” row) show a remarkable reduction in missed detections: the number of red boxes is minimal, indicating a level of detection completeness that even surpasses models trained solely on real data.
This demonstrates that the proposed synthetic data can provide detection performance comparable to manually collected real data, effectively bridging the Sim2Real gap. Thus, synthetic data can serve as a high-quality supplement to costly, manually annotated real imagery.
4.3.2. Robustness in Long-Tail Scenarios
One of the core strengths of synthetic data lies in its ability to actively control data distribution, thereby addressing inherent sampling biases in real datasets. To validate this, we focused on truncated objects, which represent a typical long-tail scenario.
As shown in Table 2 and visually evidenced in Figure 8, the proposed synthetic dataset achieves markedly superior detection performance on truncated vehicles compared to both real-world and existing synthetic datasets. Models trained on UAV-SynthScene exhibit notably higher recall for partially visible targets, effectively mitigating the severe missed detections observed with SkyScenes and UEMM-Air.
Across all detectors, models trained on UAV-SynthScene consistently and substantially outperform those trained on all other datasets (including real data) in detecting truncated objects. For example, with the DINO detector, our model achieved an AP50 of 73.0% on the truncated-object test set, which not only far surpasses the model trained on real data (56.3%) but also significantly exceeds those trained on SkyScenes (56.9%) and UEMM-Air (50.8%). Furthermore, the Recall metric supports this advantage, showing that our framework maintains robust detection completeness for boundary targets. These results suggest that the active generation strategy effectively reduces missed detections for such hard-to-capture instances.
This strong quantitative and qualitative consistency demonstrates that the active long-tail sample generation strategy embedded in our framework is highly effective. It enables precise augmentation of sparse yet crucial sample categories in real data, thereby greatly enhancing the model’s robustness in complex and non-ideal long-tail detection scenarios.
4.3.3. Cross-Domain Generalization
To assess the cross-domain generalization of our synthetic data, we conducted a series of zero-shot transfer experiments. Models trained on UAV-SynthScene and three baseline datasets (one real, two synthetic) were evaluated directly on two unseen real-world benchmarks, RGBT-Tiny and VisDrone, without any fine-tuning.
The results, presented in Table 3, indicate a clear advantage in generalization for models trained on our UAV-SynthScene. Across most detector architectures, our synthetic-data-trained models consistently outperform those trained on the other synthetic baselines (SkyScenes and UEMM-Air). Notably, in several instances, their performance is also highly competitive with, or even exceeds, that of models trained on our real data (UAV-RealScene).
For example, when evaluated on the RGBT-Tiny benchmark, the DINO model trained on UAV-SynthScene achieves 96.5% AP50. This result is substantially higher than the performance of models trained on SkyScenes (76.5%) and UEMM-Air (65.0%), and is closely comparable to the in-domain real-data baseline (97.1%). Similar performance trends are observed on the VisDrone dataset.
These findings suggest that our high-fidelity synthetic data generation strategy yields synthetic data with strong transferability. The combination of a photorealistic scene foundation and diversified rendering likely enables the models to learn more fundamental and robust visual representations, contributing to improved generalization performance in novel real-world environments.
4.4. Ablation Study
To quantify the independent contributions of the two core components in our framework—high-fidelity scene reconstruction (Fidelity) and diversified rendering (Diversity)—we conducted a series of ablation studies across all five detectors.
Contribution of diversified rendering. To quantify the contribution of our diversified rendering strategies, we created a baseline dataset, UAV-SynthScene-Static, by rendering all images under a fixed set of conditions: clear noon lighting from a static viewpoint (150 m altitude, −80° pitch, 0° roll/yaw). We then trained all five detector architectures on both this static dataset and our full, diversified dataset. The performance of these models was evaluated on two versions of the real-world test set: the complete set (“Overall”) and a challenging subset containing only truncated objects (“Trunc.”).
The detailed results, presented in Table 4, reveal a stark performance degradation when diversified rendering is removed. On average, AP50 across all detectors drops by nearly 10 points on the “Overall” set and by roughly 20 points on the “Trunc.” subset. As a specific example, the DINO detector’s performance on truncated objects plummets from 73.0% with our full method to a mere 44.2% when trained on the static data. This substantial gap underscores that exposing the model to a wide spectrum of environmental conditions is not merely beneficial but essential for learning robust features that can generalize to the unpredictable nature of real-world scenarios.
Contribution of High-Fidelity Reconstruction. To evaluate the importance of high-fidelity scene reconstruction, we compared our approach against Generic-Synth3D, a dataset generated using generic 3D assets but also incorporating diversified rendering. This allows us to isolate the impact of the scene’s foundational realism. Models were trained on these respective datasets and evaluated on our real-world test set, with performance measured on both the complete set (“Overall”) and its truncated subset (“Trunc.”).
The results, presented in Table 5, reveal a dramatic performance gap and unequivocally demonstrate the critical role of scene fidelity. Models trained on the Generic-Synth3D dataset consistently and substantially underperform those trained on our photogrammetry-based UAV-SynthScene. As highlighted by the “Average” row, switching from generic assets to our high-fidelity scenes yields an average improvement of +22.5 AP50 points. The performance leap is particularly pronounced for detectors such as RT-DETR, which gains +27.5 points. This disparity validates our core hypothesis: high-fidelity reconstruction is the cornerstone of effective Sim2Real transfer. By mirroring the unique geometry and textures of the target environment, our method enables models to learn priors that are directly and effectively transferable, paving the way for performance that rivals real-world training data.
5. Discussion
Experimental results demonstrate that high-fidelity synthetic data can effectively bridge the Sim2Real gap for UAV-based vehicle detection. This section discusses the broader implications of these findings, positions our work in the context of existing literature, and outlines its limitations and future research directions.
5.1. Interpretation of Key Findings
The near-identical performance of models trained on UAV-SynthScene and UAV-RealScene (as shown in Table 1) is a significant finding. It suggests that, for UAV-based object detection, the “reality gap” can be substantially closed if the synthetic environment accurately captures the geospatial structure and textural characteristics of the target domain. The results of our ablation study (Table 5) strongly support this, revealing that high-fidelity reconstruction contributed a substantial +23.3 AP50 gain over generic 3D assets. This confirms our central hypothesis: for top-down remote sensing views, structural and textural fidelity is a more critical factor than for ground-level vision, where generic urban layouts can often suffice.
Furthermore, the superior performance in long-tail scenarios (Table 2) highlights a fundamental advantage of simulation. Real data collection is inherently a process of passive, uncontrolled sampling, leading to inevitable data imbalances. Our framework transforms this into a process of active, goal-oriented sampling. By programmatically generating truncated and occluded objects, we can directly address known failure modes of detectors, a capability that traditional data augmentation or post-hoc resampling methods cannot fully replicate.
5.2. Comparison with Existing Works and Methodological Implications
Our high-fidelity approach presents a conceptual counterpoint to the philosophy of pure domain randomization (DR), which often sacrifices realism for diversity. While DR is effective for learning domain-agnostic features, our results indicate that for geo-specific tasks, starting from a high-fidelity baseline provides a much stronger foundation. The strong cross-domain generalization performance (Table 3) suggests that our method, by combining a realistic foundation with controlled diversity, allows models to learn features that are not only robust but also more transferable to novel geographic environments.
Compared to other high-fidelity simulation datasets like SkyScenes [13], our framework’s key differentiator is the use of photogrammetry-based digital twins instead of artist-created 3D worlds. While the latter can be visually stunning, they may lack the subtle, imperfect, and unique signatures of a real location. The consistent performance advantage of UAV-SynthScene suggests that these authentic geo-specific details are crucial for training models that can operate reliably in the real world. This work therefore advocates for a tighter integration of survey-grade 3D mapping techniques (such as photogrammetry and LiDAR) into the pipeline of synthetic data generation for geospatial artificial intelligence.
5.3. Implications for Remote Sensing Applications
The findings of this study have significant practical implications for the remote sensing community. The proposed framework offers a scalable and cost-effective pathway to overcome the data annotation bottleneck, which is a major impediment to the deployment of artificial intelligence in areas such as:
- Urban Planning and Traffic Management: Large-scale, diverse datasets of vehicle behavior can be generated for different cities without extensive manual annotation, enabling the development of more robust traffic monitoring systems.
- Disaster Response and Damage Assessment: By creating digital twins of pre-disaster areas, it becomes possible to simulate post-disaster scenarios (e.g., placing debris or damaged vehicles), thereby generating crucial training data for automated damage detection systems before a real event occurs.
- Sensor Simulation and Mission Planning: The framework can be used as a high-fidelity sensor simulator. This allows for testing and validating new perception algorithms or planning UAV survey missions under a wide range of simulated environmental conditions (e.g., different times of day, weather) to predict performance and optimize flight parameters.
5.4. Limitations and Future Directions
Although the proposed framework has proven effective, certain limitations remain. First, the current reliance on static digital twins does not account for environmental dynamics, such as seasonal vegetation shifts and urban development. Future work should explore updating 3D environments using multi-temporal satellite imagery or periodic UAV surveys to dynamically refresh scene topology and textures, thereby maintaining fidelity for long-term monitoring. Second, while the framework effectively addresses geometric long-tail scenarios (e.g., truncated objects), a fidelity gap persists under complex illumination. Although photogrammetry ensures geometric consistency, the interaction between simulated lighting and reconstructed textures inevitably approximates real-world optics. This discrepancy may lead to missed detections under extreme lighting or complex material reflectance conditions. Finally, while this study focuses on the independent transferability of synthetic data, future research should explore joint training strategies that combine synthetic and real-world datasets. Such mixed-training approaches represent a promising avenue for further boosting detector performance in practical applications.
While the proposed framework offers clear advantages in data scalability, it is important to acknowledge practical trade-offs relative to real-world data collection. Real UAV data acquisition typically incurs high operational costs, including flight permissions, specialized equipment, and personnel, and annotation effort increases linearly with dataset size. In contrast, the proposed framework relies on a limited amount of real UAV imagery for photogrammetric reconstruction rather than large-scale data collection. Although this introduces upfront modeling and computational costs, the imagery used for reconstruction does not require manual bounding box annotation. Once this foundation is established, the marginal cost of generating and annotating large volumes of synthetic data is relatively low. As a result, the proposed framework is well suited to difficult, rare, or unsafe scenarios and serves as a supplement to real-world UAV data rather than a replacement.