1. Introduction
Rapid urban development presents increasing challenges, particularly regarding mobility and traffic congestion across diverse demographic groups. Traffic analysis and monitoring have become valuable tools for addressing the challenges of urbanization, such as traffic congestion, urban distribution, and vehicle accident prevention. These tools allow information on vehicle behavior to be extracted and contribute to decision making, enabling sustainable urban development [1]. Such systems involve the operation of various components that extract different kinds of information, including object detection, vehicle tracking, and vehicle speed measurement [2].
Currently, Closed-Circuit Television (CCTV) systems are among the primary tools used for traffic monitoring, relying on fixed cameras that operate continuously to provide permanent surveillance of specific areas. As illustrated in Figure 1, these systems are typically deployed at strategic locations to enable high-resolution observation of road traffic conditions. Despite their effectiveness, fixed CCTV installations suffer from inherent limitations, including high installation and maintenance costs, a restricted field of view, and a strong dependence on human operators for system operation and supervision [3].
In recent years, the use of Unmanned Aerial Vehicles (UAVs) has increased significantly for tackling complex tasks in various fields, such as agriculture, traffic analysis, and search and rescue [4,5,6]. By providing a flexible aerial perspective, UAVs enable data acquisition at multiple altitudes and in complex traffic scenarios, overcoming key limitations of conventional monitoring systems. Their mobility, rapid redeployment, and cost-effectiveness further position UAVs as a practical solution for traffic analysis, particularly in environments where traditional infrastructure is limited. Rather than replacing fixed monitoring systems, UAVs should be regarded as a complementary technology, as they address the constraints of static cameras while still facing inherent operational limitations, including limited flight autonomy, short battery life, and restricted onboard processing capabilities [7].
UAVs are pilotless devices capable of remote or autonomous operation, enabling data acquisition in hard-to-reach areas. These platforms are typically equipped with cameras, sensors, and communication systems, allowing multiple UAVs to be integrated into coordinated systems for the development of more complex and scalable applications [8]. Aerial images taken by UAVs present considerable challenges for analysis. The altitude at which images are captured makes objects appear smaller due to the greater distance between the camera and the objects, which makes it difficult to extract detailed features. Furthermore, the wide field of view includes a large amount of complex background information, which interferes with the accurate detection of relevant features [9].
Beyond visual challenges, training object detection models on UAV-based traffic datasets, which are typically derived from continuous video streams, introduces a critical methodological issue: high spatiotemporal correlation between consecutive frames. When such datasets are randomly split into training, validation, and test sets, this correlation can lead to data leakage, where highly similar (or nearly identical) traffic scenes appear across different splits. Consequently, the resulting performance metrics may be artificially inflated, providing an overly optimistic assessment that does not accurately reflect the model's true generalization capability in real-world traffic scenarios [10].
To address these challenges and support research on UAV-based vehicle detection, this work introduces UTUAV (Urban Traffic Unmanned Aerial Vehicle), a novel dataset collected in Medellín, Colombia. The dataset is designed to reflect realistic traffic compositions in the Global South, where motorcycles constitute the dominant mode of transportation (approximately 62% of the vehicle fleet in Colombia), posing a significant challenge for small-object detection. In contrast to established benchmarks such as VisDrone and UAVDT, UTUAV focuses on urban environments and traffic dynamics specific to Latin American cities.
Motivated by the aforementioned considerations, the main contributions of this work are summarized as follows:
We introduce UTUAV, a UAV-based dataset for vehicle detection in urban traffic scenes, collected at multiple locations in Medellín, Colombia. The dataset consists of two subsets (UTUAV-B and UTUAV-C), each with more than 6000 sequential frames, captured under different urban conditions.
We experimentally demonstrate the impact of spatiotemporal data leakage in UAV video–based model evaluation, emphasizing the need for principled dataset partitioning to prevent inflated performance metrics.
We provide a benchmark evaluation of state-of-the-art object detectors on the proposed dataset, highlighting the challenges inherent to UAV-based vehicle detection.
3. UTUAV Urban Traffic Dataset
The motivation for constructing UTUAV is grounded in the most recent statistics on Colombia's vehicle fleet. According to the RUNT Press Bulletin 001 of 2025, by the end of 2024, the country had registered approximately 20 million active vehicles, of which about 12.4 million (around 62%) correspond to motorcycles, making them the dominant mode of transport nationwide [35]. This prevalence reflects the central role of motorcycles in Colombia's mobility landscape while coexisting with a significant proportion of cars, buses, and freight vehicles that together shape complex urban traffic dynamics. In metropolitan areas such as Medellín and the Aburrá Valley, motorcycles contribute notably to pollutant emissions, traffic congestion, and intricate flow interactions, underscoring the need for detailed modeling of all vehicle categories. Consequently, the UTUAV dataset supports the broader goal of developing accurate and emission-aware traffic analysis tools, consistent with the objectives of the Vehicular Emission Reduction through Optimal Traffic Modeling and Management in Metropolitan Areas–Medellín Case project (Reducción de Emisiones Vehiculares Mediante el Modelado y Gestión Óptima de Tráfico en Áreas Metropolitanas–Caso Medellín, Área Metropolitana del Valle de Aburrá) [36].
In this context, the UTUAV dataset was conceived to reflect real traffic conditions in urban Colombia, emphasizing the most frequent and representative vehicle types: motorcycles, cars, and buses. These three categories were selected because they cover the largest proportion of road users and display distinct mobility behaviors and visual characteristics from an aerial perspective. Unlike existing UAV-based traffic datasets such as VisDrone [37] or UAVDT [38], which include a broader set of object classes and are recorded under heterogeneous environmental conditions, UTUAV provides a context-specific and geographically focused dataset designed around Latin American urban morphology and traffic density. It captures scenes from busy intersections and roundabouts typical of Medellín, where the predominance of motorcycles, vehicle occlusion, and lighting variations pose unique challenges for computer vision algorithms. This specialization allows researchers to evaluate detection models in conditions that closely resemble real urban traffic in the Global South, where mobility structures differ significantly from those of datasets recorded in Asian or European cities.
The annotations were manually performed on high-resolution frames (3840 × 2160) extracted from aerial videos recorded by UAVs flying at altitudes of 100 m and 120 m, using a semi-static position for better geometric stability and a consistent scale. The annotation process was conducted using the VIPER-GT platform [39], which was selected for its ability to handle video sequences and facilitate the generation of high-precision labels. Three independent annotators were responsible for identifying and delineating cars, buses, and motorcycles in the drone-captured videos. To ensure consistency and quality, all annotations were subsequently reviewed and validated by an expert supervisor, who resolved discrepancies and ensured the coherence of the final dataset.
Each object was carefully labeled to preserve contour precision under partial occlusion, shadows, and varying lighting conditions. Recordings were made exclusively during daylight hours and in authorized airspace zones, following local aviation regulations for UAV operation in Colombia [40]. These operational restrictions limited the number of recording sites, recording times, and viewpoints, but guaranteed both compliance and reproducibility. Combined with its emphasis on realistic emission-related traffic composition and the methodological framework of the MOYCOT project, the UTUAV dataset represents a unique contribution for advancing UAV-based object detection and urban traffic analysis in Latin American environments. The dataset is publicly available at http://videodatasets.org/UTUAV (accessed 20 December 2025).
3.1. UTUAV-B
The set of images referred to as UTUAV-B consists of 6500 images with a resolution of 3840 × 2160 pixels, taken from an altitude of approximately 100 m. All images belong to the same location (the intersection of two busy roads). The main challenge of this set is the small size of the objects that appear in it, especially motorcycles. As shown in Figure 2, the dataset was captured from a high altitude using a fully top-down viewing angle, commonly referred to as a bird's-eye view. Under this configuration, only the upper surfaces of the vehicles are visible, which limits the amount of geometric information available for recognition. Moreover, the combination of high capture altitude and a top-down perspective results in scenes with abundant complex and cluttered backgrounds (such as buildings, vegetation, and heterogeneous terrain) that can partially obscure the target objects and introduce additional noise into the detection process.
Table 1 summarizes the main characteristics of the UTUAV-B dataset and reveals clear differences among the three vehicle categories. Light vehicles dominate the dataset, both in terms of the total number of annotations and distinct annotated objects, followed by motorcycles and heavy vehicles. As expected from their physical dimensions, heavy vehicles exhibit the largest mean object size, whereas motorcycles present substantially smaller footprints. Occlusion in motorcycles and light vehicles shows a comparable number of fully occluded instances, while heavy vehicles, despite being scarce, exhibit the longest average occlusion duration and the largest occlusion displacement. These disparities highlight class-specific challenges for detection, particularly regarding the accurate tracking of small vehicles and the management of long occlusion periods associated with heavy vehicles.
3.2. UTUAV-C
The set of images referred to as UTUAV-C shares similar characteristics with UTUAV-B. It consists of 6600 images with a resolution of 3840 × 2160 pixels, captured from an approximate altitude of 120 m. All images were acquired from the same location, corresponding to a major roundabout, which differs from the scenario used in UTUAV-B.
Figure 3 presents a sample from the UTUAV-C dataset, where it becomes evident that the higher capture altitude introduces several additional challenges. First, it reduces the apparent size of objects, decreasing the amount of visual information available and making detection more difficult. Second, as altitude increases, the field of view expands, leading to a larger number of objects within the scene and a higher prevalence of complex and cluttered backgrounds. Finally, it is important to note that, unlike UTUAV-B, the UTUAV-C dataset includes areas with high traffic density, which increases object overlap and makes it more challenging for models to distinguish individual instances.
Table 2 summarizes the main characteristics of this subset. The UTUAV-C subset exhibits notable differences in the number of annotated vehicles. Light Vehicles (cars) represent the majority of annotations, with nearly 1.5 million bounding boxes and close to 1000 distinct instances, followed by Motorcycles and Heavy Vehicles (buses), highlighting a clear object-detection imbalance across classes. The size of the objects varies substantially: Heavy Vehicles have the largest mean pixel area, as expected given their physical dimensions, whereas Motorcycles display significantly smaller bounding boxes. Motorcycles and Light Vehicles show a high number of fully occluded instances, while Heavy Vehicles, despite having fewer total annotations, exhibit the longest average occlusion duration and the largest displacement during occluded segments.
4. Methodology
This study analyzes the performance of several deep learning models for aerial traffic image understanding. To this end, we compare different computer vision architectures, including YOLOv8, YOLOv11, YOLOv12, and the transformer-based RT-DETR. In addition, we examine key challenges inherent to this type of data, such as object overlap, small object sizes, and limited scene variability. We also assess the models’ ability to generalize across related datasets and explore techniques aimed at improving performance, such as image partitioning. Finally, we conduct experiments to evaluate the impact of transitioning from HBB to OBB annotations. The methodology employed to train the models and evaluate their performance is detailed below.
4.1. Data Splitting Strategies
UTUAV comprises frames extracted from UAV video sequences captured from a semi-static viewpoint, which induces strong spatiotemporal correlation and visual redundancy. When datasets are randomly split, this characteristic can cause data leakage, leading to inflated performance metrics that do not reflect true model generalization in real-world scenarios.
To address this challenge, we evaluate different dataset partitioning strategies. Three partitioning approaches are considered, in which the final 15% of the video frames are reserved for evaluation (test), while the remaining 85% are divided into training and validation sets according to the following configurations:
Random: This strategy follows the conventional machine learning approach, where frames are randomly split into 70% for training and 15% for validation, as can be seen in Figure 4a. It is used as a baseline to illustrate the inflationary effect of data leakage.
Temporal: This strategy enforces a time-ordered split. The first 70% of the sequence is assigned to training, followed by 15% for validation. This approach ensures that models are evaluated on temporally subsequent frames, which are therefore less immediately correlated than those in the random partition.
Temporal with gap: Building upon the temporal split, this strategy introduces an exclusion gap ($g$) between the training/validation and validation/evaluation subsets, as illustrated in Figure 4b. The purpose of this gap is to further reduce temporal similarity at subset boundaries. By discarding a fixed number of adjacent frames, the risk of inadvertently including highly correlated samples across partitions is minimized.
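A minimal sketch of the three strategies over frame indices follows (the function name, argument names, and the 70/15/15 arithmetic are our illustration, not code from a released dataset toolkit):

```python
import random

def split_frames(n_frames, strategy="temporal", gap=0, seed=0):
    """Partition the frame indices of one video into train/val/test.

    The final 15% of frames is always reserved for testing; the remaining
    85% is divided 70/15 into training and validation. For the
    'temporal_gap' strategy, `gap` frames are discarded before the
    validation block and before the test block.
    """
    test_start = int(n_frames * 0.85)
    test = list(range(test_start, n_frames))
    head = list(range(test_start))          # first 85% of the video
    n_train = int(n_frames * 0.70)

    if strategy == "random":
        rng = random.Random(seed)
        rng.shuffle(head)
        train, val = sorted(head[:n_train]), sorted(head[n_train:])
    elif strategy == "temporal":
        train, val = head[:n_train], head[n_train:]
    elif strategy == "temporal_gap":
        train = head[:n_train - gap]              # gap before validation
        val = head[n_train:test_start - gap]      # gap before test
    else:
        raise ValueError(strategy)
    return train, val, test
```

Note that under the random strategy the test block is still the temporally final segment; only train and validation are shuffled, which is precisely what allows leaked near-duplicates between them.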
4.2. Horizontal and Oriented Bounding Boxes
The Horizontal Bounding Box (HBB) annotations in the UTUAV datasets include background information within the boxes, which can negatively affect vehicle detection accuracy. Prior studies indicate that converting HBB annotations to Oriented Bounding Boxes (OBB) improves detection performance, as demonstrated on the DOTA dataset [41]. Since the UTUAV datasets rely on HBB annotations, we propose a conversion from HBB to OBB based on per-object segmentation, implemented as a two-stage pipeline.
In the first stage, the Segment Anything Model (SAM) [22] is used to generate detailed binary masks from the original HBB annotations. The bounding boxes and their center points are provided as prompts, allowing the model to accurately infer the contours of each vehicle and extract high-fidelity object polygons.
In the second stage, the extracted contour polygons are processed and converted into the final OBB representation, defined by the object class label and the four oriented corner coordinates. This representation is directly compatible with specialized detection frameworks, such as YOLO-OBB, enabling the use of architectures specifically designed for rotated object detection.
By reducing background inclusion and providing a more compact object representation, this methodology allows detection models to focus on vehicle-specific features, establishing a robust basis for evaluating the impact of oriented annotations on the detection performance of small objects.
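As an illustration of the second stage, the sketch below derives oriented corners from a binary object mask. The paper does not specify the exact geometry routine; a production pipeline would more likely apply `cv2.minAreaRect` to the mask contour, whereas this dependency-free version approximates the minimum-area rectangle using the principal axes of the pixel cloud:

```python
import numpy as np

def mask_to_obb(mask):
    """Approximate an oriented bounding box from a binary object mask.

    Returns the four (x, y) corner coordinates of a rectangle aligned
    with the object's principal axis: a PCA-based stand-in for
    cv2.minAreaRect, adequate for elongated objects such as vehicles.
    """
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    center = pts.mean(axis=0)
    # principal axes of the pixel cloud give the box orientation
    cov = np.cov((pts - center).T)
    _, eigvecs = np.linalg.eigh(cov)
    local = (pts - center) @ eigvecs      # rotate points into axis frame
    lo, hi = local.min(axis=0), local.max(axis=0)
    corners_local = np.array([[lo[0], lo[1]], [hi[0], lo[1]],
                              [hi[0], hi[1]], [lo[0], hi[1]]])
    return corners_local @ eigvecs.T + center  # back to image coordinates
```

The returned corners can then be serialized together with the class label into the YOLO-OBB annotation format.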
4.3. Experimental Settings
The experiments establish a performance baseline using state-of-the-art deep learning models on the dataset introduced in Section 3. All experiments were conducted on a workstation equipped with an 8 GB NVIDIA GeForce RTX 3060 Ti GPU running Ubuntu 22.04.5 LTS. The training environment consisted of CUDA 12.5, Python 3.8.19, and PyTorch 2.4.1. The most recent versions of the YOLO architecture (8, 11, and 12) are trained due to their high computational efficiency. In addition, an RT-DETR model is incorporated, which contains a transformer block within its architecture. All models used in the benchmark were pre-trained on the well-known COCO dataset [42] and fine-tuned on the UTUAV datasets. Using pre-trained models facilitates training without the need to learn the weights from scratch. Implementations of YOLO and RT-DETR were obtained from the Ultralytics repository, version 8.3.
For the benchmark, we use the default configurations of the models, with the hyperparameters suggested by the original authors. For the YOLO versions (8, 11, and 12), the hyperparameters are those listed in Table 3. Note that for YOLO-based models, input images are resized in a way that preserves their aspect ratio. The RT-DETR architecture requires square inputs; to maintain the original aspect ratio, images are therefore first resized and then padded to form a square input.
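The resize-and-pad step for square-input models can be sketched as follows (a generic letterbox routine; the default target size of 640 and padding value of 114 are common conventions in the Ultralytics ecosystem, not values confirmed by the text, and the nearest-neighbour resize stands in for the library's interpolation):

```python
import numpy as np

def letterbox_square(img, size=640, pad_value=114):
    """Resize an image to fit inside a size x size square, preserving
    its aspect ratio, then pad the remainder: the usual pre-processing
    for square-input detectors such as RT-DETR."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    # nearest-neighbour resize via index sampling (no external deps)
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = img[rows][:, cols]
    out = np.full((size, size) + img.shape[2:], pad_value, dtype=img.dtype)
    top, left = (size - new_h) // 2, (size - new_w) // 2
    out[top:top + new_h, left:left + new_w] = resized
    return out
```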
4.4. Evaluation Metrics
To analyze the performance of the above models, and to make appropriate comparisons between them, the following performance evaluation metrics, which are common in object detection and classification tasks [43], are employed.
Intersection over Union ($IoU$) is a metric that quantifies the proximity of a predicted bounding box $A$ to a corresponding ground-truth bounding box $B$. It is calculated as the ratio of the intersection area to the union area of $A$ and $B$:

$IoU = \frac{|A \cap B|}{|A \cup B|}$

Precision quantifies how well a model avoids false predictions. It is calculated as the proportion of correct predictions (true positives, $TP$) out of the total number of predictions made, i.e., the sum of true positives and false positives ($FP$):

$Precision = \frac{TP}{TP + FP}$

Recall measures the model's ability to detect the objects present in the image and is calculated as the proportion of correct predictions ($TP$) over the total ground truth, i.e., the sum of true positives and false negatives ($FN$):

$Recall = \frac{TP}{TP + FN}$

Average precision ($AP_i$) summarizes precision and recall jointly, independently of the threshold used to determine $TP$, $FP$, and $FN$. It is calculated as the area under the precision-recall curve for the $i$-th class.

Mean average precision ($mAP$) represents the average of the average precision over the $n$ classes:

$mAP = \frac{1}{n} \sum_{i=1}^{n} AP_i$
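The threshold-level quantities above can be computed directly; a minimal sketch (the corner-format box convention and function names are ours):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)
```

A detection is counted as a true positive when its $IoU$ with a ground-truth box exceeds a chosen threshold (e.g., 0.5 for mAP50); sweeping the confidence threshold then traces the precision-recall curve from which $AP_i$ is obtained.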
5. Results
This section presents the first experiments conducted on the UTUAV-B and UTUAV-C datasets. The goal is to analyze the impact of the temporal correlation present among video frames and how it influences the performance of a real-time object detection model, specifically YOLOv11n, which is known for its strong performance in the state of the art.
Two experiments are conducted: (i) evaluating different partitioning strategies for defining the training, validation, and test subsets in order to reduce data leakage in the evaluation set; and (ii) examining the effect of including samples with high temporal redundancy, that is, temporally adjacent frames that exhibit high similarity or are, in some cases, nearly identical.
5.1. Splitting Method
First, we evaluate the behavior of the previously defined random and temporal partitions without a gap (i.e., $g = 0$). Figure 5 shows that performance on the evaluation set of the UTUAV-C dataset decreases significantly when the partitioning method is changed: switching from the random to the temporal partition results in a 7.3% drop.
Since adjacent frames exhibit high similarity, the random split is not suitable for data separation, as it introduces similar samples into both the training and validation sets. This leads to inflated performance metrics and fails to provide an accurate estimate of the model’s real-world generalization capability.
This phenomenon is not observed in the UTUAV-B dataset, where the performance metric remains relatively stable across the different partitioning strategies, indicating that the risk of data leakage is less significant in this configuration. This behavior can be explained by analyzing the temporal proximity of the frames assigned to the training, validation, and evaluation sets.
Table 4 presents the average Euclidean distance $d(T, V)$ between two sets of image embeddings. The distance between the $n$ samples of embedding set $T$ and the $m$ samples of embedding set $V$ is defined in Equation (6):

$d(T, V) = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} \lVert t_i - v_j \rVert_2 \quad (6)$

where $t_i$ and $v_j$ represent individual embedding vectors from each set, for $i = 1, \dots, n$ and $j = 1, \dots, m$, respectively.
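Equation (6) translates directly into a few lines of NumPy (an illustrative sketch; variable and function names are ours):

```python
import numpy as np

def mean_pairwise_distance(T, V):
    """Average Euclidean distance between every embedding t_i in T and
    every embedding v_j in V, i.e. d(T, V) =
    (1 / (n * m)) * sum_i sum_j ||t_i - v_j||_2 (Equation (6))."""
    # broadcast to an (n, m, d) difference tensor, then norm over d
    diffs = T[:, None, :] - V[None, :, :]
    return float(np.linalg.norm(diffs, axis=-1).mean())
```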
In the case of the UTUAV-B dataset (Table 4a), the Euclidean distances between the training and evaluation sets, as well as between the validation and evaluation sets, are relatively similar. This suggests that the evaluation set lies at a comparable distance from both the training and validation sets, indicating that it may contain more challenging samples for model inference, regardless of the data partitioning strategy used.
For the UTUAV-C dataset (Table 4b), in contrast, the validation and evaluation sets lie closer to each other under the temporal split. As a result, including samples from the validation set in the training process leads to a substantial overestimation of performance metrics, undermining the reliability of the model's performance evaluation.
For a more robust analysis, a second measure of the distance between image sets is considered. The Fréchet Inception Distance (FID), introduced in [44], is used to quantify the difference between two distributions, assumed to be Gaussian. The FID between two distributions $T$ and $V$ is defined by Equation (7):

$\mathrm{FID}(T, V) = \lVert \mu_T - \mu_V \rVert_2^2 + \mathrm{Tr}\left(\Sigma_T + \Sigma_V - 2\,(\Sigma_T \Sigma_V)^{1/2}\right) \quad (7)$

Here, $\mu_T$ and $\mu_V$ represent the mean vectors, while $\Sigma_T$ and $\Sigma_V$ correspond to the covariance matrices of distributions $T$ and $V$, respectively. The term $\mathrm{Tr}(\cdot)$ denotes the trace operator, which computes the sum of the diagonal elements of a matrix.
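Equation (7) can be sketched in NumPy as follows (our illustrative implementation, not the code used in the paper; the trace of the matrix square root is obtained from the eigenvalues of $\Sigma_T \Sigma_V$, which are real and non-negative when both covariances are positive semi-definite):

```python
import numpy as np

def fid(T, V):
    """Fréchet distance between Gaussians fitted to two embedding sets
    (Equation (7)). Tr((S_T S_V)^(1/2)) is computed as the sum of the
    square roots of the eigenvalues of S_T S_V."""
    mu_t, mu_v = T.mean(axis=0), V.mean(axis=0)
    cov_t = np.cov(T, rowvar=False)
    cov_v = np.cov(V, rowvar=False)
    eig = np.linalg.eigvals(cov_t @ cov_v)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0, None)).sum()
    return float(np.sum((mu_t - mu_v) ** 2)
                 + np.trace(cov_t) + np.trace(cov_v) - 2 * tr_sqrt)
```

Identical sets yield an FID of (numerically) zero, while a pure mean shift of the embeddings contributes its squared norm to the score.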
Table 5 shows the distances calculated using FID. In this case, a pre-trained ResNet model is used to extract the image embeddings instead of the original Inception network.
Table 5b shows that for UTUAV-C, the validation and evaluation sets are very close under the temporal split. This contrasts with the UTUAV-B set (Table 5a), where the evaluation set represents a more challenging scenario and the performance metrics are more stable regardless of the splitting method used.
The following sections present the results obtained using the temporal split method on both the UTUAV-B and UTUAV-C datasets. This partitioning strategy provides a more objective assessment of model performance by reducing the overlap of highly similar samples between the training and validation sets, thus avoiding artificially inflated metrics. These results constitute an initial benchmark of state-of-the-art fast object detection methods on this dataset and are intended to serve as a reference for future research aimed at improving performance on this benchmark and in the field more broadly.
5.2. Effect of Gap Size
While temporal separation prevents adjacent frames from being split across different subsets, samples located near the boundaries between subsets may still exhibit high similarity. To mitigate this effect, a temporal gap is introduced, excluding a certain number of frames between the training, validation, and test blocks. This approach reduces the likelihood that temporally correlated samples belong to different subsets. In particular, it aims to minimize the similarity between the last samples of the training set and the first samples of the validation set, where temporal proximity often leads to visual redundancy.
Figure 6 illustrates the behavior of both datasets as the gap size increases. For this experiment, the gap size $g$ was set to 0, 90, 300, 900, and 1200, where $g = 0$ represents a temporal split without removing samples between subsets, and $g = 1200$ indicates that 1200 samples are removed between each pair of subsets. The results show that, for the UTUAV-C dataset, performance gradually decreases as the gap size increases. This degradation can be explained by the high temporal and visual similarity of samples located near the subset boundaries when smaller gaps are used, which tends to artificially inflate performance metrics.
The UTUAV-B dataset does not exhibit a decline in the performance metrics reported. This behavior can be explained by the lower similarity between the test samples and those used for training and validation. Specifically, the video from which the UTUAV-B samples were extracted undergoes a noticeable scene change toward the end of the recording, introducing a substantial difference compared to the initial frames used for training and validation.
5.3. Frame Sampling
This section presents experiments aimed at assessing the impact of information redundancy within the dataset. To this end, a sampling method is implemented to select frames exhibiting greater vehicle displacement. Specifically, starting from the first frame, the next frame is selected when the average displacement of all vehicles exceeds a predefined threshold.
Using a temporal partition with a gap between subsets for both datasets, a list of threshold values was defined to analyze how the number of selected frames decreases as the threshold increases. The threshold range used for UTUAV-B was wider than that used for UTUAV-C; both ranges were empirically determined by identifying the values that produce the greatest variation in the number of samples retained after frame sampling.
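The displacement-based sampling rule described above can be sketched as follows (a greedy illustration; the array layout and function name are our assumptions, and real tracks with vehicles entering and leaving the scene would additionally require identity matching between frames):

```python
import numpy as np

def sample_frames(tracks, threshold):
    """Greedy displacement-based frame sampling.

    `tracks` is an array of shape (n_frames, n_vehicles, 2) holding the
    (x, y) centre of each vehicle per frame. Starting from frame 0, the
    next frame is kept once the mean displacement of all vehicles since
    the last kept frame exceeds `threshold` (in pixels).
    """
    kept = [0]
    for f in range(1, len(tracks)):
        disp = np.linalg.norm(tracks[f] - tracks[kept[-1]], axis=-1).mean()
        if disp > threshold:
            kept.append(f)
    return kept
```

Raising the threshold discards more near-redundant frames, which is exactly the trade-off explored in Figures 7 and 8.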
Figure 7 presents the performance metrics obtained by training the YOLOv11n model with different threshold values on UTUAV-B, which progressively reduce the number of samples used for training (red line). The results show that all metrics fluctuate as the threshold varies. Nevertheless, a general decreasing trend can be observed across all reported metrics (mAP50-90, mAP50, precision, recall, and F1) as the number of training samples decreases.
Figure 8 illustrates the performance of the YOLOv11n model as the number of training samples decreases in UTUAV-C. In this case, the threshold range is narrower because the objects (vehicles) exhibit smaller displacements between frames. This is reflected in a considerable reduction in the number of training samples, even for low threshold values. The performance metrics show a general decreasing trend across all evaluated measures.
Although performance decreases for both datasets as the number of training samples is reduced, it is observed that even with stricter sampling, which considerably limits the training data, the model still maintains relatively high performance. This finding is particularly valuable when training time needs to be reduced, as it achieves performance levels comparable to those obtained using the full dataset.
5.4. OBB vs. HBB
In this section, we present experiments designed to compare the performance of OBB annotations against HBB annotations. To this end, the YOLOv8-Nano model was trained on the UTUAV-B and UTUAV-C datasets. To isolate and quantify the effect of the annotation type alone, we employed YOLOv8-Nano models without any pretraining. Additionally, temporal subsets separated by a gap were used. For each dataset, the labels were converted into oriented bounding boxes following the procedure described in Section 4.2.
The results presented in Table 6 demonstrate a consistent improvement when using oriented bounding boxes (OBB) compared to axis-aligned bounding boxes (HBB). In both dataset subsets, UTUAV-B and UTUAV-C, the Average Precision increases across all three classes when using OBB (Table 6 reports the per-class absolute gains). In relative terms, the most notable improvement corresponds to the motorcycle class in both subsets, indicating that oriented annotations are particularly beneficial for smaller objects. The class-wise F1 metrics exhibit a similar trend: in UTUAV-B, F1 increases for car and motorcycle, whereas bus shows a slight decrease. In UTUAV-C, all three classes achieve improvements in F1.
5.5. Performance in Set B
This section analyzes the performance of the models trained on the UTUAV-B training set, with the aim of establishing a real-time baseline for this dataset. We evaluate YOLO models in versions 8, 11, and 12, in both nano and small sizes, as well as the RT-DETR model in its large configuration. In addition, we report results using oriented bounding boxes (OBB); for this annotation format, YOLO versions 8 and 11 in nano and small sizes are considered. YOLOv12 is excluded from the OBB experiments due to instability observed during training, while the Ultralytics implementation of RT-DETR does not support oriented annotations.
The results in Table 7 reveal consistent differences across models in both performance (class-wise AP and F1) and computational efficiency (inference and training time). Among HBB-based detectors, YOLOv8 models provide a strong baseline: YOLOv8n achieves the highest AP for cars and buses, while its performance on motorcycles remains moderate. Increasing model capacity proves particularly beneficial for the motorcycle class; YOLOv8s improves motorcycle detection, indicating that additional parameters help capture the fine-grained features required for small and challenging objects.
The more recent models, YOLOv11 and YOLOv12, show stable performance but no substantial improvements over earlier versions. YOLOv11s achieves the highest AP for the motorcycle class among the HBB-based models, although it does not surpass YOLOv8 in the larger vehicle categories (car and bus). YOLOv12 exhibits more inconsistent behavior and does not reach the highest values in any class, suggesting that despite having more parameters and incorporating attention mechanisms, these enhancements do not translate into performance gains in this particular scenario.
The conversion of annotations from HBB to OBB yields clear and consistent improvements. All models experience significant performance gains, particularly in the car and bus classes. Moreover, the results obtained with YOLOv11s-obb show that, although it does not achieve the highest score for the car class, it delivers a more balanced performance across all categories, outperforming all HBB-based models.
5.6. Performance in Set C
Similarly to the analysis conducted for UTUAV-B, this section evaluates the performance of the models trained on the UTUAV-C training set, considering the same models for both HBB and OBB. In addition, a temporal partitioning strategy is applied to the samples. The results are presented in
Table 8.
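Because consecutive frames from a semi-static UAV are highly correlated, a random shuffle would place near-duplicate frames in both training and test sets. The sketch below illustrates the contiguous-block idea behind a temporal split; the 70/10/20 ratios are an assumption for illustration, not the exact proportions used for UTUAV-C:

```python
def temporal_split(frame_ids, train_pct=70, val_pct=10):
    """Split an ordered frame sequence into contiguous train/val/test blocks.

    Unlike a random shuffle, contiguous blocks keep near-duplicate
    neighboring frames inside the same split, avoiding data leakage.
    Integer arithmetic avoids float rounding at the block boundaries.
    """
    n = len(frame_ids)
    i = n * train_pct // 100
    j = n * (train_pct + val_pct) // 100
    return frame_ids[:i], frame_ids[i:j], frame_ids[j:]


frames = list(range(1000))        # ordered video frames
tr, va, te = temporal_split(frames)
print(len(tr), len(va), len(te))  # 700 100 200
```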
Models trained with HBB annotations exhibit strong performance in the car and bus classes, where all YOLO models achieve high AP values. These results suggest that both classes exhibit lower variability and are easier to detect in the aerial scenario of the UTUAV-C dataset. The motorcycle class remains significantly more challenging, with YOLO-based models reaching considerably lower AP values. The RT-DETR model achieves a higher AP for this class, indicating a greater capacity to detect small objects. However, this performance comes at a significantly higher computational cost, with longer inference and training times compared to the YOLO models.
Converting to OBB yields performance gains, though they are less pronounced than those observed for the UTUAV-B dataset. The car class benefits the most, with an increase of 0.026 in both AP and F1 when comparing the best HBB and OBB results. For the bus and motorcycle classes, incorporating orientation provides limited improvements. No meaningful gains are observed for buses, while motorcycles show only a modest F1 increase of 0.022.
5.7. Cross-Dataset Evaluation
This section examines the robustness of the models previously trained on the UTUAV-B and UTUAV-C datasets by evaluating them in a domain that partially differs from the one used during training. Although both datasets consist of aerial traffic imagery, they exhibit notable differences in road layout, capture altitude, and traffic density. To quantify cross-domain generalization, a cross-evaluation is conducted: the model trained on UTUAV-B is evaluated on the test set of UTUAV-C, and conversely, the model trained on UTUAV-C is evaluated on the test set of UTUAV-B.
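The cross-evaluation protocol amounts to iterating over (training domain, test domain) pairs and keeping only the mismatched ones. The sketch below is a schematic of that harness; the `evaluate` callable and the string stand-ins for models and test sets are placeholders, not the actual detection pipeline:

```python
def cross_evaluate(models, test_sets, evaluate):
    """Run every trained model against every *other* domain's test set."""
    results = {}
    for trained_on, model in models.items():
        for test_on, test_set in test_sets.items():
            if test_on != trained_on:  # cross-domain pairs only
                results[(trained_on, test_on)] = evaluate(model, test_set)
    return results


# Toy stand-ins: real usage would pass trained detectors and an
# evaluation function returning AP/F1 per class.
models = {"UTUAV-B": "model_B", "UTUAV-C": "model_C"}
tests = {"UTUAV-B": "test_B", "UTUAV-C": "test_C"}
out = cross_evaluate(models, tests, lambda m, t: (m, t))
print(sorted(out))  # [('UTUAV-B', 'UTUAV-C'), ('UTUAV-C', 'UTUAV-B')]
```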
Table 9 presents the performance of the models trained on UTUAV-B and evaluated on the UTUAV-C test set. Overall, the results reveal a substantial drop in performance compared to the within-dataset evaluations, confirming the presence of a domain shift between both datasets.
For HBB models, the YOLOv12-s variant achieves the best overall performance, reaching the highest AP across all three classes. In terms of F1 score, the YOLOv8-s model obtains the highest value for the car class (0.656), although YOLOv12-s provides the strongest results for buses and motorcycles. These observations indicate that larger models tend to generalize slightly better under cross-domain conditions, particularly for classes with more consistent visual patterns (cars and buses). The motorcycle class remains the most challenging across all architectures, exhibiting a marked decrease in AP and F1.
Regarding OBB models, the performance consistently improves across all classes compared to their HBB counterparts. The best-performing model is YOLOv11-s-OBB, which achieves AP values of 0.651, 0.629, and 0.326 for cars, buses, and motorcycles, respectively, and also obtains the highest F1 scores. These results highlight that incorporating orientation information yields a clear advantage when transferring between domains with differences in road geometry or camera angle. Notably, the AP and F1 improvements are especially significant for the motorcycle class, which suggests that OBB annotations mitigate part of the localization ambiguity present in small, elongated, or highly oriented objects.
Table 10 reports the performance of the models trained on UTUAV-C and evaluated on the UTUAV-B test set. As in the previous experiment, the results confirm a marked degradation in performance under cross-domain conditions, although the severity of the drop varies notably across classes and architectures.
For HBB models, overall performance remains limited, particularly for the bus and motorcycle classes. Among the YOLO variants, YOLOv11-s achieves the highest AP for cars (0.596) and the highest F1 score for the same class (0.582), whereas YOLOv8-s obtains the strongest AP for buses (0.304) and the highest F1 score as well as AP for motorcycles. RT-DETR shows a comparatively strong AP for motorcycles (0.138), but lower performance in the remaining classes. These patterns indicate that the models trained on UTUAV-C struggle to generalize to UTUAV-B, likely due to differences in scene structure and traffic composition that disproportionately affect classes that are less frequent or composed of smaller objects.
The OBB models exhibit consistent and substantial improvements over their HBB counterparts across all classes. The best results are obtained with YOLOv8-s-OBB, which achieves the highest AP for car (0.721), while YOLOv11-s-OBB attains the best performance for bus (0.469). For the motorcycle class, all models show considerably lower performance, highlighting the inherent difficulty of detecting small objects with high geometric variability in the aerial context of the UTUAV-C dataset.
6. Discussion
6.1. Dataset
Data collection for the proposed dataset utilized a drone positioned at a fixed coordinate, capturing aerial images at two controlled altitudes: 100 m (UTUAV-B) and 120 m (UTUAV-C). In contrast to established, large-scale UAV vision benchmarks such as VisDrone and UAVDT, which encompass large collections of images across diverse viewing geometries, geographic locations, and environmental conditions, this dataset emphasizes controlled variation in object scale resulting from altitude changes and the inherent temporal dependence derived from continuous video recording from a semi-static viewpoint.
Although the dataset is captured from a semi-static aerial viewpoint, which may resemble the fixed perspective of CCTV systems, UAVs operate at higher altitudes and provide a substantially wider field of view, enabling the monitoring of larger traffic regions—such as entire intersections or multiple traffic flows—with a single sensor. This perspective fundamentally differs from fixed CCTV setups, which offer limited coverage and often require multiple cameras to observe complex scenes. The aerial viewpoint also reduces occlusions caused by large vehicles or infrastructure and provides a geometry better suited for analyzing traffic density, spatial distribution, and motion patterns. While the proposed dataset focuses on a semi-static configuration for controlled analysis, UAV platforms inherently retain operational flexibility, including rapid redeployment, altitude adjustment, and coverage of areas without fixed monitoring infrastructure.
A key strength of the proposed dataset lies in its high relevance to real-world UAV-based traffic monitoring scenarios. The semi-static acquisition setup induces strong temporal correlation and visual redundancy, closely reflecting practical deployments where UAVs observe traffic from fixed or near-fixed viewpoints. This makes the dataset particularly suitable for analyzing temporal dependence, frame redundancy, and data leakage risks in deep learning pipelines, supporting the development of more robust sampling and training strategies. Moreover, these characteristics address regional deployment constraints, as existing public UAV datasets often fail to capture the environmental and infrastructural conditions typical of contexts such as Colombia.
6.2. Benchmark
The initial benchmarking results indicate that, while state-of-the-art detectors achieve strong performance on large vehicle classes within the same domain, motorcycle detection remains the primary challenge across both subsets. The consistently low AP observed for motorcycles can be attributed to factors inherent to aerial data acquisition. Capture altitude renders motorcycles extremely small, limiting the extraction of discriminative features, while the bird’s-eye view introduces cluttered backgrounds that increase confusion and false negatives.
Cross-dataset evaluation between UTUAV-B and UTUAV-C reveals a substantial performance drop when models are exposed to variations in altitude, traffic density, and road geometry, highlighting the limited generalization capability of current detection models under modest domain shifts. Models trained on UTUAV-B consistently perform better, likely due to the lower capture altitude, which results in larger object scales and richer visual features, whereas the higher traffic density in UTUAV-C further complicates detection. Notably, the degradation is most pronounced for small objects (motorcycles), indicating that deep learning features for small-scale targets are highly sensitive to changes in scale and viewpoint, and underscoring the need for more robust feature representations under altitude variations.
Despite the inherent challenges of aerial imagery and domain shift, converting Horizontal Bounding Boxes (HBB) to Oriented Bounding Boxes (OBB) proved to be an effective mitigation strategy. OBB annotations consistently improved detection performance across all classes, with the most pronounced gains observed for motorcycles. By providing a more compact representation that better aligns with object orientation in bird’s-eye views, OBBs reduce background interference and localization ambiguity, enabling models to focus on discriminative features. This effect is particularly beneficial for small and rotated objects, partially alleviating performance degradation under domain shifts.
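The compactness argument can be made quantitative with a back-of-envelope calculation (illustrative, not from the benchmark itself). For a vehicle footprint of width $w$ and height $h$ rotated by $\theta$ in the image plane, the tightest horizontal box enclosing the oriented box has area

$$A_{\mathrm{HBB}}(\theta) = \big(w\lvert\cos\theta\rvert + h\lvert\sin\theta\rvert\big)\big(w\lvert\sin\theta\rvert + h\lvert\cos\theta\rvert\big),$$

while the OBB area remains $wh$. At $\theta = 45^{\circ}$ the background fraction inside the HBB is $1 - 2wh/(w+h)^{2}$; for a typical 2:1 footprint (e.g., $w = 4$, $h = 2$ in ground units) roughly 56% of the HBB is background, which illustrates why OBB annotations reduce background interference most for small, elongated, or rotated objects.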
7. Conclusions and Future Work
This paper has introduced a novel dataset for research on traffic vehicle detection from UAV images, posing new research challenges such as small and clustered object detection and background complexity. Using the proposed dataset, it is shown that, in scenarios with strong spatiotemporal correlation, inadequate data separation can lead to data leakage and inflated performance metrics. Models based on YOLO and RT-DETR perform well in vehicle detection from an aerial perspective. However, when faced with a more challenging scenario, their performance degrades significantly. The need for image diversity in training to achieve model generalization under various conditions is emphasized.
In terms of future research, there are several areas of interest that could deepen the understanding and improvement of object detection models. First, it would be valuable to explore how these models perform when synthetic data augmentation techniques are used during training, with the aim of optimizing their generalization capability. Second, it would be interesting to examine various resizing methods and assess their impact on the detection of small-sized objects. Furthermore, evaluating the performance of these models on previously unseen datasets, characterized by a wide variety of perspectives and object sizes, could provide valuable information about their adaptability in real-world scenarios. An intriguing line of research well suited for the UTUAV sets is the application of transformers to object tracking across multiple frames, which could open up new possibilities in the field of moving object tracking.