1. Introduction
Rapid urban development presents increasing challenges, particularly regarding mobility and traffic congestion across diverse demographic groups. Traffic analysis and monitoring have become valuable tools for addressing the challenges of urbanization, such as traffic congestion, urban distribution, and vehicle accident prevention. These tools allow information on vehicle behavior to be extracted and contribute to decision making, enabling sustainable urban development [1]. Such systems involve the operation of various components that extract different kinds of information, including object detection, vehicle tracking, and vehicle speed measurement [2].
Currently, Closed-Circuit Television (CCTV) systems are among the primary tools used for traffic monitoring, relying on fixed cameras that operate continuously to provide permanent surveillance of specific areas. As illustrated in Figure 1, these systems are typically deployed at strategic locations to enable high-resolution observation of road traffic conditions. Despite their effectiveness, fixed CCTV installations suffer from inherent limitations, including high installation and maintenance costs, a restricted field of view, and a strong dependence on human operators for system operation and supervision [3].
In recent years, the use of Unmanned Aerial Vehicles (UAVs) has increased significantly for tackling complex tasks in various fields, such as agriculture, traffic analysis, and search and rescue [4,5,6]. By providing a flexible aerial perspective, UAVs enable data acquisition at multiple altitudes and in complex traffic scenarios, overcoming key limitations of conventional monitoring systems. Their mobility, rapid redeployment, and cost-effectiveness further position UAVs as a practical solution for traffic analysis, particularly in environments where traditional infrastructure is limited. Rather than replacing fixed monitoring systems, UAVs should be regarded as a complementary technology, as they address the constraints of static cameras while still facing inherent operational limitations, including limited flight autonomy, short battery life, and restricted onboard processing capabilities [7].
UAVs are pilotless devices capable of remote or autonomous operation, enabling data acquisition in hard-to-reach areas. These platforms are typically equipped with cameras, sensors, and communication systems, allowing multiple UAVs to be integrated into coordinated systems for the development of more complex and scalable applications [8]. Aerial images taken by UAVs present considerable challenges for analysis. The altitude at which images are captured makes objects appear smaller due to the greater distance between the camera and the objects, which makes it difficult to extract detailed features. Furthermore, the wide field of view includes a large amount of complex background information, which interferes with the accurate detection of relevant features [9].
Beyond visual challenges, training object detection models on UAV-based traffic datasets, which are typically derived from continuous video streams, introduces a critical methodological issue: high spatiotemporal correlation between consecutive frames. When such datasets are randomly split into training, validation, and test sets, this correlation can lead to data leakage, where highly similar (or nearly identical) traffic scenes appear across different splits. Consequently, the resulting performance metrics may be artificially inflated, providing an overly optimistic assessment that does not accurately reflect the model's true generalization capability in real-world traffic scenarios [10].
To address these challenges and support research on UAV-based vehicle detection, this work introduces UTUAV (Urban Traffic Unmanned Aerial Vehicle), a novel dataset collected in Medellín, Colombia. The dataset is designed to reflect realistic traffic compositions in the Global South, where motorcycles constitute the dominant mode of transportation (approximately 62% of the vehicle fleet in Colombia), posing a significant challenge for small-object detection. In contrast to established benchmarks such as VisDrone and UAVDT, UTUAV focuses on urban environments and traffic dynamics specific to Latin American cities.
Motivated by the aforementioned considerations, the main contributions of this work are summarized as follows:
We introduce UTUAV, a UAV-based dataset for vehicle detection in urban traffic scenes, collected at multiple locations in Medellín, Colombia. The dataset consists of two subsets (UTUAV-B and UTUAV-C), each with more than 6000 sequential frames, captured under different urban conditions.
We experimentally demonstrate the impact of spatiotemporal data leakage in UAV video–based model evaluation, emphasizing the need for principled dataset partitioning to prevent inflated performance metrics.
We provide a benchmark evaluation of state-of-the-art object detectors on the proposed dataset, highlighting the challenges inherent to UAV-based vehicle detection.
3. UTUAV Urban Traffic Dataset
The motivation for constructing UTUAV is grounded in the most recent statistics on Colombia's vehicle fleet. According to the RUNT Press Bulletin 001 of 2025, by the end of 2024, the country had registered approximately 20 million active vehicles, of which about 12.4 million (around 62%) correspond to motorcycles, making them the dominant mode of transport nationwide [35]. This prevalence reflects the central role of motorcycles in Colombia's mobility landscape while coexisting with a significant proportion of cars, buses, and freight vehicles that together shape complex urban traffic dynamics. In metropolitan areas such as Medellín and the Aburrá Valley, motorcycles contribute notably to pollutant emissions, traffic congestion, and intricate flow interactions, underscoring the need for detailed modeling of all vehicle categories. Consequently, the UTUAV dataset supports the broader goal of developing accurate and emission-aware traffic analysis tools, consistent with the objectives of the Vehicular Emission Reduction through Optimal Traffic Modeling and Management in Metropolitan Areas–Medellín Case project (Reducción de Emisiones Vehiculares Mediante el Modelado y Gestión Óptima de Tráfico en Áreas Metropolitanas–Caso Medellín, Área Metropolitana del Valle de Aburrá) [36].
In this context, the UTUAV dataset was conceived to reflect real traffic conditions in urban Colombia, emphasizing the most frequent and representative vehicle types: motorcycles, cars, and buses. These three categories were selected because they cover the largest proportion of road users and display distinct mobility behaviors and visual characteristics from an aerial perspective. Unlike existing UAV-based traffic datasets such as VisDrone [37] or UAVDT [38], which include a broader set of object classes and are recorded under heterogeneous environmental conditions, UTUAV provides a context-specific and geographically focused dataset designed around Latin American urban morphology and traffic density. It captures scenes from busy intersections and roundabouts typical of Medellín, where the predominance of motorcycles, vehicle occlusion, and lighting variations pose unique challenges for computer vision algorithms. This specialization allows researchers to evaluate detection models in conditions that closely resemble real urban traffic in the Global South, where mobility structures differ significantly from those of datasets recorded in Asian or European cities.
The annotations were manually performed on high-resolution frames (3840 × 2160) extracted from aerial videos recorded by UAVs flying at altitudes of 100 m and 120 m, using a semi-static position for better geometric stability and a consistent scale. The annotation process was conducted using the VIPER-GT platform [39], which was selected for its ability to handle video sequences and facilitate the generation of high-precision labels. Three independent annotators were responsible for identifying and delineating cars, buses, and motorcycles in the drone-captured videos. To ensure consistency and quality, all annotations were subsequently reviewed and validated by an expert supervisor, who resolved discrepancies and ensured the coherence of the final dataset.
Each object was carefully labeled to preserve contour precision under partial occlusion, shadows, and varying lighting conditions. Recordings were made exclusively during daylight hours and in authorized airspace zones, following local aviation regulations for UAV operation in Colombia [40]. These operational restrictions limited the number of recording sites, recording times, and viewpoints, but guaranteed both compliance and reproducibility. Combined with its emphasis on realistic emission-related traffic composition and the methodological framework of the MOYCOT project, the UTUAV dataset represents a unique contribution for advancing UAV-based object detection and urban traffic analysis in Latin American environments. The dataset is publicly available at http://videodatasets.org/UTUAV (accessed 20 December 2025).
3.1. UTUAV-B
The set of images referred to as UTUAV-B consists of 6500 images with a resolution of 3840 × 2160 pixels, taken from an altitude of approximately 100 m. All images belong to the same location (the intersection of two busy roads). The main challenge of this set is the small size of the objects that appear in it, especially motorcycles. As shown in Figure 2, the dataset was captured from a high altitude using a fully top-down viewing angle, commonly referred to as a bird's-eye view. Under this configuration, only the upper surfaces of the vehicles are visible, which limits the amount of geometric information available for recognition. Moreover, the combination of high capture altitude and a top-down perspective results in scenes with abundant complex and cluttered backgrounds (such as buildings, vegetation, and heterogeneous terrain) that can partially obscure the target objects and introduce additional noise into the detection process.
Table 1 summarizes the main characteristics of the UTUAV-B dataset and reveals clear differences among the three vehicle categories. Light vehicles dominate the dataset, both in terms of the total number of annotations and distinct annotated objects, followed by motorcycles and heavy vehicles. As expected from their physical dimensions, heavy vehicles exhibit the largest mean object size, whereas motorcycles present substantially smaller footprints. Occlusion in motorcycles and light vehicles shows a comparable number of fully occluded instances, while heavy vehicles, despite being scarce, exhibit the longest average occlusion duration and the largest occlusion displacement. These disparities highlight class-specific challenges for detection, particularly regarding the accurate tracking of small vehicles and the management of long occlusion periods associated with heavy vehicles.
3.2. UTUAV-C
The set of images referred to as UTUAV-C shares similar characteristics with UTUAV-B. It consists of 6600 images with a resolution of 3840 × 2160 pixels, captured from an approximate altitude of 120 m. All images were acquired from the same location, corresponding to a major roundabout, which differs from the scenario used in UTUAV-B.
Figure 3 presents a sample from the UTUAV-C dataset, where it becomes evident that the higher capture altitude introduces several additional challenges. First, it reduces the apparent size of objects, decreasing the amount of visual information available and making detection more difficult. Second, as altitude increases, the field of view expands, leading to a larger number of objects within the scene and a higher prevalence of complex and cluttered backgrounds. Finally, it is important to note that, unlike UTUAV-B, the UTUAV-C dataset includes areas with high traffic density, which increases object overlap and makes it more challenging for models to distinguish individual instances.
Table 2 summarizes the main characteristics of this subset. The UTUAV-C subset exhibits notable differences in the number of annotated vehicles. Light Vehicles (cars) represent the majority of annotations, with nearly 1.5 million bounding boxes and close to 1000 distinct instances, followed by Motorcycles and Heavy Vehicles (buses), highlighting a clear object-detection imbalance across classes. The size of the objects varies substantially: Heavy Vehicles have the largest mean pixel area, as expected given their physical dimensions, whereas Motorcycles display significantly smaller bounding boxes. Motorcycles and Light Vehicles show a high number of fully occluded instances, while Heavy Vehicles, despite having fewer total annotations, exhibit the longest average occlusion duration and the largest displacement during occluded segments.
4. Methodology
This study analyzes the performance of several deep learning models for aerial traffic image understanding. To this end, we compare different computer vision architectures, including YOLOv8, YOLOv11, YOLOv12, and the transformer-based RT-DETR. In addition, we examine key challenges inherent to this type of data, such as object overlap, small object sizes, and limited scene variability. We also assess the models’ ability to generalize across related datasets and explore techniques aimed at improving performance, such as image partitioning. Finally, we conduct experiments to evaluate the impact of transitioning from HBB to OBB annotations. The methodology employed to train the models and evaluate their performance is detailed below.
4.1. Data Splitting Strategies
UTUAV comprises frames extracted from UAV video sequences captured from a semi-static viewpoint, which induces strong spatiotemporal correlation and visual redundancy. When datasets are randomly split, this characteristic can cause data leakage, leading to inflated performance metrics that do not reflect true model generalization in real-world scenarios.
To address this challenge, we evaluate different dataset partitioning strategies. Three partitioning approaches are considered, in which the final 15% of the video frames are reserved for evaluation (test), while the remaining 85% are divided into training and validation sets according to the following configurations:
Random: This strategy follows the conventional machine learning approach, where frames are randomly split into 70% for training and 15% for validation, as can be seen in Figure 4a. It is used as a baseline to illustrate the inflationary effect of data leakage.
Temporal: This strategy enforces a time-ordered split. The first 70% of the sequence is assigned to training, followed by 15% for validation. This approach ensures that models are evaluated on temporally subsequent frames, which are therefore less immediately correlated than those in the random partition.
Temporal with gap: Building upon the temporal split, this strategy introduces an exclusion gap ($g$) between the training/validation and validation/evaluation subsets, as illustrated in Figure 4b. The purpose of this gap is to further reduce temporal similarity at subset boundaries. By discarding a fixed number of adjacent frames, the risk of inadvertently including highly correlated samples across partitions is minimized.
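A minimal sketch of the three strategies over frame indices follows (the function name, argument names, and the 70/15/15 arithmetic are our illustration, not code from a released dataset toolkit):

```python
import random

def split_frames(n_frames, strategy="temporal", gap=0, seed=0):
    """Partition the frame indices of one video into train/val/test.

    The final 15% of frames is always reserved for testing; the remaining
    85% is divided 70/15 into training and validation. For the
    'temporal_gap' strategy, `gap` frames are discarded before the
    validation block and before the test block.
    """
    test_start = int(n_frames * 0.85)
    test = list(range(test_start, n_frames))
    head = list(range(test_start))          # first 85% of the video
    n_train = int(n_frames * 0.70)

    if strategy == "random":
        rng = random.Random(seed)
        rng.shuffle(head)
        train, val = sorted(head[:n_train]), sorted(head[n_train:])
    elif strategy == "temporal":
        train, val = head[:n_train], head[n_train:]
    elif strategy == "temporal_gap":
        train = head[:n_train - gap]              # gap before validation
        val = head[n_train:test_start - gap]      # gap before test
    else:
        raise ValueError(strategy)
    return train, val, test
```

Note that under the random strategy the test block is still the temporally final segment; only train and validation are shuffled, which is precisely what allows leaked near-duplicates between them.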
4.2. Horizontal and Oriented Bounding Boxes
The Horizontal Bounding Box (HBB) annotations in the UTUAV datasets include background information within the boxes, which can negatively affect vehicle detection accuracy. Prior studies indicate that converting HBB annotations to Oriented Bounding Boxes (OBB) improves detection performance, as demonstrated on the DOTA dataset [41]. Since the UTUAV datasets rely on HBB annotations, we propose a conversion from HBB to OBB based on per-object segmentation, implemented as a two-stage pipeline.
In the first stage, the Segment Anything Model (SAM) [22] is used to generate detailed binary masks from the original HBB annotations. The bounding boxes and their center points are provided as prompts, allowing the model to accurately infer the contours of each vehicle and extract high-fidelity object polygons.
In the second stage, the extracted contour polygons are processed and converted into the final OBB representation, defined by the object class label and the four oriented corner coordinates. This representation is directly compatible with specialized detection frameworks, such as YOLO-OBB, enabling the use of architectures specifically designed for rotated object detection.
By reducing background inclusion and providing a more compact object representation, this methodology allows detection models to focus on vehicle-specific features, establishing a robust basis for evaluating the impact of oriented annotations on the detection performance of small objects.
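As an illustration of the second stage, the sketch below derives oriented corners from a binary object mask. The paper does not specify the exact geometry routine; a production pipeline would more likely apply `cv2.minAreaRect` to the mask contour, whereas this dependency-free version approximates the minimum-area rectangle using the principal axes of the pixel cloud:

```python
import numpy as np

def mask_to_obb(mask):
    """Approximate an oriented bounding box from a binary object mask.

    Returns the four (x, y) corner coordinates of a rectangle aligned
    with the object's principal axis: a PCA-based stand-in for
    cv2.minAreaRect, adequate for elongated objects such as vehicles.
    """
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    center = pts.mean(axis=0)
    # principal axes of the pixel cloud give the box orientation
    cov = np.cov((pts - center).T)
    _, eigvecs = np.linalg.eigh(cov)
    local = (pts - center) @ eigvecs      # rotate points into axis frame
    lo, hi = local.min(axis=0), local.max(axis=0)
    corners_local = np.array([[lo[0], lo[1]], [hi[0], lo[1]],
                              [hi[0], hi[1]], [lo[0], hi[1]]])
    return corners_local @ eigvecs.T + center  # back to image coordinates
```

The returned corners can then be serialized together with the class label into the YOLO-OBB annotation format.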
4.3. Experimental Settings
The experiments establish a performance baseline using state-of-the-art deep learning models on the dataset introduced in Section 3. All experiments were conducted on a workstation equipped with an 8 GB NVIDIA GeForce RTX 3060 Ti GPU running Ubuntu 22.04.5 LTS. The training environment consisted of CUDA 12.5, Python 3.8.19, and PyTorch 2.4.1. The most recent versions of the YOLO architecture (8, 11, and 12) are trained due to their high computational efficiency. In addition, an RT-DETR model is incorporated, which contains a transformer block within its architecture. All models used in the benchmark were pre-trained on the well-known COCO dataset [42] and fine-tuned on the UTUAV datasets. Using pre-trained models facilitates training without the need to learn the weights from scratch. Implementations of YOLO and RT-DETR were obtained from the Ultralytics repository, version 8.3.
For the benchmark, we use the default configurations of the models, with the hyperparameters suggested by the original authors. For the YOLO versions (8, 11, and 12), the hyperparameters are those listed in Table 3. Note that for YOLO-based models, input images are resized in a way that preserves their aspect ratio. The RT-DETR architecture requires square inputs; to maintain the original aspect ratio, images are therefore first resized and then padded to form a square input.
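The resize-and-pad step for square-input models can be sketched as follows (a generic letterbox routine; the default target size of 640 and padding value of 114 are common conventions in the Ultralytics ecosystem, not values confirmed by the text, and the nearest-neighbour resize stands in for the library's interpolation):

```python
import numpy as np

def letterbox_square(img, size=640, pad_value=114):
    """Resize an image to fit inside a size x size square, preserving
    its aspect ratio, then pad the remainder: the usual pre-processing
    for square-input detectors such as RT-DETR."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    # nearest-neighbour resize via index sampling (no external deps)
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = img[rows][:, cols]
    out = np.full((size, size) + img.shape[2:], pad_value, dtype=img.dtype)
    top, left = (size - new_h) // 2, (size - new_w) // 2
    out[top:top + new_h, left:left + new_w] = resized
    return out
```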
4.4. Evaluation Metrics
To analyze the performance of the above models, and to make appropriate comparisons between them, the following performance evaluation metrics, which are common in object detection and classification tasks [43], are employed.
Intersection over Union ($IoU$) is a metric that quantifies the proximity of a predicted bounding box $A$ to a corresponding ground-truth bounding box $B$. It is calculated as the ratio of the intersection area to the union area of $A$ and $B$:

$IoU = \frac{|A \cap B|}{|A \cup B|}$

Precision quantifies how well a model avoids false predictions. It is calculated as the proportion of correct predictions (true positives, $TP$) out of the total number of predictions made, i.e., the sum of true positives and false positives ($FP$):

$Precision = \frac{TP}{TP + FP}$

Recall measures the model's ability to detect the objects present in the image and is calculated as the proportion of correct predictions ($TP$) over the total ground truth, i.e., the sum of true positives and false negatives ($FN$):

$Recall = \frac{TP}{TP + FN}$

Average precision ($AP_i$) summarizes precision and recall jointly, independently of the threshold used to determine $TP$, $FP$, and $FN$. It is calculated as the area under the precision-recall curve for the $i$-th class.

Mean average precision ($mAP$) represents the average of the average precision over the $n$ classes:

$mAP = \frac{1}{n} \sum_{i=1}^{n} AP_i$
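The threshold-level quantities above can be computed directly; a minimal sketch (the corner-format box convention and function names are ours):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)
```

A detection is counted as a true positive when its $IoU$ with a ground-truth box exceeds a chosen threshold (e.g., 0.5 for mAP50); sweeping the confidence threshold then traces the precision-recall curve from which $AP_i$ is obtained.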
5. Results
This section presents the first experiments conducted on the UTUAV-B and UTUAV-C datasets. The goal is to analyze the impact of the temporal correlation present among video frames and how it influences the performance of a real-time object detection model, specifically YOLOv11n, which is known for its strong performance in the state of the art.
Two experiments are conducted: (i) evaluating different partitioning strategies for defining the training, validation, and test subsets in order to reduce data leakage in the evaluation set; and (ii) examining the effect of including samples with high temporal redundancy, that is, temporally adjacent frames that exhibit high similarity or are, in some cases, nearly identical.
5.1. Splitting Method
First, we evaluate the behavior of the previously defined random and temporal partitions without a gap (i.e., $g = 0$). Figure 5 shows that performance on the evaluation set of the UTUAV-C dataset decreases significantly when the partitioning method is changed: switching from the random to the temporal partition results in a 7.3% drop.
Since adjacent frames exhibit high similarity, the random split is not suitable for data separation, as it introduces similar samples into both the training and validation sets. This leads to inflated performance metrics and fails to provide an accurate estimate of the model’s real-world generalization capability.
This phenomenon is not observed in the UTUAV-B dataset, where the performance metric remains relatively stable across the different partitioning strategies, indicating that the risk of data leakage is less significant in this configuration. This behavior can be explained by analyzing the temporal proximity of the frames assigned to the training, validation, and evaluation sets.
Table 4 presents the average Euclidean distance $d(T, V)$ between two sets of image embeddings. The distance between the $n$ samples of embedding set $T$ and the $m$ samples of embedding set $V$ is defined in Equation (6):

$d(T, V) = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} \lVert t_i - v_j \rVert_2 \quad (6)$

where $t_i$ and $v_j$ represent individual embedding vectors from each set, for $i = 1, \dots, n$ and $j = 1, \dots, m$, respectively.
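Equation (6) translates directly into a few lines of NumPy (an illustrative sketch; variable and function names are ours):

```python
import numpy as np

def mean_pairwise_distance(T, V):
    """Average Euclidean distance between every embedding t_i in T and
    every embedding v_j in V, i.e. d(T, V) =
    (1 / (n * m)) * sum_i sum_j ||t_i - v_j||_2 (Equation (6))."""
    # broadcast to an (n, m, d) difference tensor, then norm over d
    diffs = T[:, None, :] - V[None, :, :]
    return float(np.linalg.norm(diffs, axis=-1).mean())
```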
In the case of the UTUAV-B dataset (Table 4a), the Euclidean distances between the training and evaluation sets, as well as between the validation and evaluation sets, are relatively similar. This suggests that the evaluation set lies at a comparable distance from both the training and validation sets, indicating that it may contain more challenging samples for model inference, regardless of the data partitioning strategy used.
For the UTUAV-C dataset (Table 4b), in contrast, the validation and evaluation sets lie closer to each other under the temporal split. As a result, including samples from the validation set in the training process leads to a substantial overestimation of performance metrics, undermining the reliability of the model's performance evaluation.
For a more robust analysis, a second measure of the distance between image sets is considered. The Fréchet Inception Distance (FID), introduced in [44], is used to quantify the difference between two distributions, assumed to be Gaussian. The FID between two distributions $T$ and $V$ is defined by Equation (7):

$\mathrm{FID}(T, V) = \lVert \mu_T - \mu_V \rVert_2^2 + \mathrm{Tr}\left(\Sigma_T + \Sigma_V - 2\,(\Sigma_T \Sigma_V)^{1/2}\right) \quad (7)$

Here, $\mu_T$ and $\mu_V$ represent the mean vectors, while $\Sigma_T$ and $\Sigma_V$ correspond to the covariance matrices of distributions $T$ and $V$, respectively. The term $\mathrm{Tr}(\cdot)$ denotes the trace operator, which computes the sum of the diagonal elements of a matrix.
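Equation (7) can be sketched in NumPy as follows (our illustrative implementation, not the code used in the paper; the trace of the matrix square root is obtained from the eigenvalues of $\Sigma_T \Sigma_V$, which are real and non-negative when both covariances are positive semi-definite):

```python
import numpy as np

def fid(T, V):
    """Fréchet distance between Gaussians fitted to two embedding sets
    (Equation (7)). Tr((S_T S_V)^(1/2)) is computed as the sum of the
    square roots of the eigenvalues of S_T S_V."""
    mu_t, mu_v = T.mean(axis=0), V.mean(axis=0)
    cov_t = np.cov(T, rowvar=False)
    cov_v = np.cov(V, rowvar=False)
    eig = np.linalg.eigvals(cov_t @ cov_v)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0, None)).sum()
    return float(np.sum((mu_t - mu_v) ** 2)
                 + np.trace(cov_t) + np.trace(cov_v) - 2 * tr_sqrt)
```

Identical sets yield an FID of (numerically) zero, while a pure mean shift of the embeddings contributes its squared norm to the score.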
Table 5 shows the distances calculated using FID. In this case, a pre-trained ResNet model is used to extract the image embeddings instead of the original Inception network.
Table 5b shows that for UTUAV-C, the validation and evaluation sets are very close under the temporal split. This contrasts with the UTUAV-B set (Table 5a), where the evaluation set represents a more challenging scenario and the performance metrics are more stable regardless of the splitting method used.
The following sections present the results obtained using the temporal split method on both the UTUAV-B and UTUAV-C datasets. This partitioning strategy provides a more objective assessment of model performance by reducing the overlap of highly similar samples between the training and validation sets, thus avoiding artificially inflated metrics. These results constitute an initial benchmark of state-of-the-art fast object detection methods on this dataset and are intended to serve as a reference for future research aimed at improving performance on this benchmark and in the field more broadly.
5.2. Effect of Gap Size
While temporal separation prevents adjacent frames from being split across different subsets, samples located near the boundaries between subsets may still exhibit high similarity. To mitigate this effect, a temporal gap is introduced, excluding a certain number of frames between the training, validation, and test blocks. This approach reduces the likelihood that temporally correlated samples belong to different subsets. In particular, it aims to minimize the similarity between the last samples of the training set and the first samples of the validation set, where temporal proximity often leads to visual redundancy.
Figure 6 illustrates the behavior of both datasets as the gap size increases. For this experiment, the gap size $g$ was set to 0, 90, 300, 900, and 1200, where $g = 0$ represents a temporal split without removing samples between subsets, and $g = 1200$ indicates that 1200 samples are removed between each pair of subsets. The results show that, for the UTUAV-C dataset, performance gradually decreases as the gap size increases. This degradation can be explained by the high temporal and visual similarity of samples located near the subset boundaries when smaller gaps are used, which tends to artificially inflate performance metrics.
The UTUAV-B dataset does not exhibit a decline in the performance metrics reported. This behavior can be explained by the lower similarity between the test samples and those used for training and validation. Specifically, the video from which the UTUAV-B samples were extracted undergoes a noticeable scene change toward the end of the recording, introducing a substantial difference compared to the initial frames used for training and validation.
5.3. Frame Sampling
This section presents experiments aimed at assessing the impact of information redundancy within the dataset. To this end, a sampling method is implemented to select frames exhibiting greater vehicle displacement. Specifically, starting from the first frame, the next frame is selected when the average displacement of all vehicles exceeds a predefined threshold.
Using a temporal partition with a gap between subsets for both datasets, a list of threshold values was defined to analyze how the number of selected frames decreases as the threshold increases. The threshold range used for UTUAV-B was wider than that used for UTUAV-C; both ranges were empirically determined by identifying the values that produce the greatest variation in the number of samples retained after frame sampling.
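The displacement-based sampling rule described above can be sketched as follows (a greedy illustration; the array layout and function name are our assumptions, and real tracks with vehicles entering and leaving the scene would additionally require identity matching between frames):

```python
import numpy as np

def sample_frames(tracks, threshold):
    """Greedy displacement-based frame sampling.

    `tracks` is an array of shape (n_frames, n_vehicles, 2) holding the
    (x, y) centre of each vehicle per frame. Starting from frame 0, the
    next frame is kept once the mean displacement of all vehicles since
    the last kept frame exceeds `threshold` (in pixels).
    """
    kept = [0]
    for f in range(1, len(tracks)):
        disp = np.linalg.norm(tracks[f] - tracks[kept[-1]], axis=-1).mean()
        if disp > threshold:
            kept.append(f)
    return kept
```

Raising the threshold discards more near-redundant frames, which is exactly the trade-off explored in Figures 7 and 8.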
Figure 7 presents the performance metrics obtained by training the YOLOv11n model with different threshold values on UTUAV-B, which progressively reduce the number of samples used for training (red line). The results show that all metrics fluctuate as the threshold varies. Nevertheless, a general decreasing trend can be observed across all reported metrics (mAP50-90, mAP50, precision, recall, and F1) as the number of training samples decreases.
Figure 8 illustrates the performance of the YOLOv11n model as the number of training samples decreases in UTUAV-C. In this case, the threshold range is narrower because the objects (vehicles) exhibit smaller displacements between frames. This is reflected in a considerable reduction in the number of training samples, even for low threshold values. The performance metrics show a general decreasing trend across all evaluated measures.
Although performance decreases for both datasets as the number of training samples is reduced, it is observed that even with stricter sampling, which considerably limits the training data, the model still maintains relatively high performance. This finding is particularly valuable when training time needs to be reduced, as it achieves performance levels comparable to those obtained using the full dataset.
5.4. OBB vs. HBB
In this section, we present experiments designed to compare the performance of OBB annotations against HBB annotations. To this end, the YOLOv8-Nano model was trained on the UTUAV-B and UTUAV-C datasets. To isolate and quantify the effect of the annotation type alone, we employed YOLOv8-Nano models without any pretraining. Additionally, temporal subsets separated by a gap were used. For each dataset, the labels were converted into oriented bounding boxes following the procedure described in Section 4.2.
The results presented in Table 6 demonstrate a consistent improvement when using oriented bounding boxes (OBB) compared to axis-aligned bounding boxes (HBB). In both dataset subsets, UTUAV-B and UTUAV-C, the Average Precision increases across all three classes when using OBB (Table 6 reports the per-class absolute gains). In relative terms, the most notable improvement corresponds to the motorcycle class in both subsets, indicating that oriented annotations are particularly beneficial for smaller objects. The class-wise F1 metrics exhibit a similar trend: in UTUAV-B, F1 increases for car and motorcycle, whereas bus shows a slight decrease. In UTUAV-C, all three classes achieve improvements in F1.
5.5. Performance in Set B
This section analyzes the performance of the models trained on the UTUAV-B training set, with the aim of establishing a real-time baseline for this dataset. We evaluate YOLO models in versions 8, 11, and 12, in both nano and small sizes, as well as the RT-DETR model in its large configuration. In addition, we report results using oriented bounding boxes (OBB); for this annotation format, YOLO versions 8 and 11 in nano and small sizes are considered. YOLOv12 is excluded from the OBB experiments due to instability observed during training, while the Ultralytics implementation of RT-DETR does not support oriented annotations.
The results in Table 7 reveal consistent differences across models in both performance (class-wise AP and F1) and computational efficiency (inference and training time). Among HBB-based detectors, YOLOv8 models provide a strong baseline: YOLOv8n achieves the highest AP for cars and buses, while its performance on motorcycles remains moderate. Increasing model capacity proves particularly beneficial for the motorcycle class; YOLOv8s improves motorcycle detection, indicating that additional parameters help capture the fine-grained features required for small and challenging objects.
The more recent models, YOLOv11 and YOLOv12, show stable performance but no substantial improvements over earlier versions. YOLOv11s achieves the highest AP for the motorcycle class among the HBB-based models, although it does not surpass YOLOv8 in the larger vehicle categories (car and bus). YOLOv12 exhibits more inconsistent behavior and does not reach the highest values in any class, suggesting that despite having more parameters and incorporating attention mechanisms, these enhancements do not translate into performance gains in this particular scenario.
The conversion of annotations from HBB to OBB yields clear and consistent improvements. All models experience significant performance gains, particularly in the car and bus classes. Moreover, the results obtained with YOLOv11s-obb show that, although it does not achieve the highest score for the car class, it delivers a more balanced performance across all categories, outperforming all HBB-based models.
5.6. Performance in Set C
Similarly to the analysis conducted for UTUAV-B, this section evaluates the performance of the models trained on the UTUAV-C training set, considering the same models for both HBB and OBB. In addition, a temporal partitioning strategy is applied to the samples. The results are presented in
Table 8.
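Because consecutive frames from a semi-static UAV are highly correlated, a random shuffle would place near-duplicate frames in both training and test sets. The sketch below illustrates the contiguous-block idea behind a temporal split; the 70/10/20 ratios are an assumption for illustration, not the exact proportions used for UTUAV-C:

```python
def temporal_split(frame_ids, train_pct=70, val_pct=10):
    """Split an ordered frame sequence into contiguous train/val/test blocks.

    Unlike a random shuffle, contiguous blocks keep near-duplicate
    neighboring frames inside the same split, avoiding data leakage.
    Integer arithmetic avoids float rounding at the block boundaries.
    """
    n = len(frame_ids)
    i = n * train_pct // 100
    j = n * (train_pct + val_pct) // 100
    return frame_ids[:i], frame_ids[i:j], frame_ids[j:]


frames = list(range(1000))        # ordered video frames
tr, va, te = temporal_split(frames)
print(len(tr), len(va), len(te))  # 700 100 200
```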
Models trained with HBB annotations exhibit strong performance in the car and bus classes, where all YOLO models achieve high AP values. These results suggest that both classes exhibit lower variability and are easier to detect in the aerial scenario of the UTUAV-C dataset. The motorcycle class remains significantly more challenging, with YOLO-based models reaching considerably lower AP values. The RT-DETR model achieves a higher AP for this class, indicating a greater capacity to detect small objects. However, this performance comes at a significantly higher computational cost, with longer inference and training times compared to the YOLO models.
Converting to OBB yields performance gains, though they are less pronounced than those observed for the UTUAV-B dataset. The car class benefits the most, with an increase of 0.026 in both AP and F1 when comparing the best HBB and OBB results. For the bus and motorcycle classes, incorporating orientation provides limited improvements. No meaningful gains are observed for buses, while motorcycles show only a modest F1 increase of 0.022.
5.7. Cross-Dataset Evaluation
This section examines the robustness of the models previously trained on the UTUAV-B and UTUAV-C datasets by evaluating them in a domain that partially differs from the one used during training. Although both datasets consist of aerial traffic imagery, they exhibit notable differences in road layout, capture altitude, and traffic density. To quantify cross-domain generalization, a cross-evaluation is conducted: the model trained on UTUAV-B is evaluated on the test set of UTUAV-C, and conversely, the model trained on UTUAV-C is evaluated on the test set of UTUAV-B.
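The cross-evaluation protocol amounts to iterating over (training domain, test domain) pairs and keeping only the mismatched ones. The sketch below is a schematic of that harness; the `evaluate` callable and the string stand-ins for models and test sets are placeholders, not the actual detection pipeline:

```python
def cross_evaluate(models, test_sets, evaluate):
    """Run every trained model against every *other* domain's test set."""
    results = {}
    for trained_on, model in models.items():
        for test_on, test_set in test_sets.items():
            if test_on != trained_on:  # cross-domain pairs only
                results[(trained_on, test_on)] = evaluate(model, test_set)
    return results


# Toy stand-ins: real usage would pass trained detectors and an
# evaluation function returning AP/F1 per class.
models = {"UTUAV-B": "model_B", "UTUAV-C": "model_C"}
tests = {"UTUAV-B": "test_B", "UTUAV-C": "test_C"}
out = cross_evaluate(models, tests, lambda m, t: (m, t))
print(sorted(out))  # [('UTUAV-B', 'UTUAV-C'), ('UTUAV-C', 'UTUAV-B')]
```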
Table 9 presents the performance of the models trained on UTUAV-B and evaluated on the UTUAV-C test set. Overall, the results reveal a substantial drop in performance compared to the within-dataset evaluations, confirming the presence of a domain shift between both datasets.
For HBB models, the YOLOv12-s variant achieves the best overall performance, reaching the highest AP across all three classes. In terms of F1 score, the YOLOv8-s model obtains the highest value for the car class (0.656), although YOLOv12-s provides the strongest results for buses and motorcycles. These observations indicate that larger models tend to generalize slightly better under cross-domain conditions, particularly for classes with more consistent visual patterns (cars and buses). The motorcycle class remains the most challenging across all architectures, exhibiting a marked decrease in AP and F1.
Regarding OBB models, the performance consistently improves across all classes compared to their HBB counterparts. The best-performing model is YOLOv11-s-OBB, which achieves AP values of 0.651, 0.629, and 0.326 for cars, buses, and motorcycles, respectively, and also obtains the highest F1 scores. These results highlight that incorporating orientation information yields a clear advantage when transferring between domains with differences in road geometry or camera angle. Notably, the AP and F1 improvements are especially significant for the motorcycle class, which suggests that OBB annotations mitigate part of the localization ambiguity present in small, elongated, or highly oriented objects.
Table 10 reports the performance of the models trained on UTUAV-C and evaluated on the UTUAV-B test set. As in the previous experiment, the results confirm a marked degradation in performance under cross-domain conditions, although the severity of the drop varies notably across classes and architectures.
For HBB models, overall performance remains limited, particularly for the bus and motorcycle classes. Among the YOLO variants, YOLOv11-s achieves the highest AP for cars (0.596) and the highest F1 score for the same class (0.582), whereas YOLOv8-s obtains the strongest AP for buses (0.304) and the highest F1 score as well as AP for motorcycles. RT-DETR shows a comparatively strong AP for motorcycles (0.138), but lower performance in the remaining classes. These patterns indicate that the models trained on UTUAV-C struggle to generalize to UTUAV-B, likely due to differences in scene structure and traffic composition that disproportionately affect classes that are less frequent or composed of smaller objects.
The OBB models exhibit consistent and substantial improvements over their HBB counterparts across all classes. The best results are obtained with YOLOv8-s-OBB, which achieves the highest AP for car (0.721), while YOLOv11-s-OBB attains the best performance for bus (0.469). For the motorcycle class, all models show considerably lower performance, highlighting the inherent difficulty of detecting small objects with high geometric variability in the aerial context of the UTUAV-C dataset.
6. Discussion
6.1. Dataset
Data collection for the proposed dataset utilized a drone positioned at a fixed coordinate, capturing aerial images at two controlled altitudes: 100 m (UTUAV-B) and 120 m (UTUAV-C). In contrast to established, large-scale UAV vision benchmarks such as VisDrone and UAVDT, which encompass large collections of images across diverse viewing geometries, geographic locations, and environmental conditions, this dataset emphasizes controlled variation in object scale resulting from altitude changes and the inherent temporal dependence derived from continuous video recording from a semi-static viewpoint.
Although the dataset is captured from a semi-static aerial viewpoint, which may resemble the fixed perspective of CCTV systems, UAVs operate at higher altitudes and provide a substantially wider field of view, enabling the monitoring of larger traffic regions—such as entire intersections or multiple traffic flows—with a single sensor. This perspective fundamentally differs from fixed CCTV setups, which offer limited coverage and often require multiple cameras to observe complex scenes. The aerial viewpoint also reduces occlusions caused by large vehicles or infrastructure and provides a geometry better suited for analyzing traffic density, spatial distribution, and motion patterns. While the proposed dataset focuses on a semi-static configuration for controlled analysis, UAV platforms inherently retain operational flexibility, including rapid redeployment, altitude adjustment, and coverage of areas without fixed monitoring infrastructure.
A key strength of the proposed dataset lies in its high relevance to real-world UAV-based traffic monitoring scenarios. The semi-static acquisition setup induces strong temporal correlation and visual redundancy, closely reflecting practical deployments where UAVs observe traffic from fixed or near-fixed viewpoints. This makes the dataset particularly suitable for analyzing temporal dependence, frame redundancy, and data leakage risks in deep learning pipelines, supporting the development of more robust sampling and training strategies. Moreover, these characteristics address regional deployment constraints, as existing public UAV datasets often fail to capture the environmental and infrastructural conditions typical of contexts such as Colombia.
6.2. Benchmark
The initial benchmarking results indicate that, while state-of-the-art detectors achieve strong performance on large vehicle classes within the same domain, motorcycle detection remains the primary challenge across both subsets. The consistently low AP observed for motorcycles can be attributed to factors inherent to aerial data acquisition. Capture altitude renders motorcycles extremely small, limiting the extraction of discriminative features, while the bird’s-eye view introduces cluttered backgrounds that increase confusion and false negatives.
Cross-dataset evaluation between UTUAV-B and UTUAV-C reveals a substantial performance drop when models are exposed to variations in altitude, traffic density, and road geometry, highlighting the limited generalization capability of current detection models under modest domain shifts. Models trained on UTUAV-B consistently perform better, likely due to the lower capture altitude, which results in larger object scales and richer visual features, whereas the higher traffic density in UTUAV-C further complicates detection. Notably, the degradation is most pronounced for small objects (motorcycles), indicating that deep learning features for small-scale targets are highly sensitive to changes in scale and viewpoint, and underscoring the need for more robust feature representations under altitude variations.
Despite the inherent challenges of aerial imagery and domain shift, converting Horizontal Bounding Boxes (HBB) to Oriented Bounding Boxes (OBB) proved to be an effective mitigation strategy. OBB annotations consistently improved detection performance across all classes, with the most pronounced gains observed for motorcycles. By providing a more compact representation that better aligns with object orientation in bird’s-eye views, OBBs reduce background interference and localization ambiguity, enabling models to focus on discriminative features. This effect is particularly beneficial for small and rotated objects, partially alleviating performance degradation under domain shifts.
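The compactness argument can be made quantitative with a back-of-envelope calculation (illustrative, not from the benchmark itself). For a vehicle footprint of width $w$ and height $h$ rotated by $\theta$ in the image plane, the tightest horizontal box enclosing the oriented box has area

$$A_{\mathrm{HBB}}(\theta) = \big(w\lvert\cos\theta\rvert + h\lvert\sin\theta\rvert\big)\big(w\lvert\sin\theta\rvert + h\lvert\cos\theta\rvert\big),$$

while the OBB area remains $wh$. At $\theta = 45^{\circ}$ the background fraction inside the HBB is $1 - 2wh/(w+h)^{2}$; for a typical 2:1 footprint (e.g., $w = 4$, $h = 2$ in ground units) roughly 56% of the HBB is background, which illustrates why OBB annotations reduce background interference most for small, elongated, or rotated objects.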
7. Conclusions and Future Work
This paper has introduced a novel dataset for research on traffic vehicle detection from UAV images, posing new research challenges such as small and clustered object detection and background complexity. Using the proposed dataset, it is shown that, in scenarios with strong spatiotemporal correlation, inadequate data separation can lead to data leakage and inflated performance metrics. Models based on YOLO and RT-DETR perform well in vehicle detection from an aerial perspective. However, when faced with a more challenging scenario, their performance degrades significantly. The need for image diversity in training to achieve model generalization under various conditions is emphasized.
In terms of future research, there are several areas of interest that could deepen the understanding and improvement of object detection models. First, it would be valuable to explore how these models perform when synthetic data augmentation techniques are used during training, with the aim of optimizing their generalization capability. Second, it would be interesting to examine various resizing methods and assess their impact on the detection of small-sized objects. Furthermore, evaluating the performance of these models on previously unseen datasets, characterized by a wide variety of perspectives and object sizes, could provide valuable information about their adaptability in real-world scenarios. An intriguing line of research well suited for the UTUAV sets is the application of transformers to object tracking across multiple frames, which could open up new possibilities in the field of moving object tracking.