4.2. Dataset
In this study, a vehicle detection dataset tailored for complex urban scenarios, named DJCAR, was constructed. The dataset was collected using a DJI drone (model: DJI Air 3, DJI, Shenzhen, China) to capture vehicle data within urban environments. A multi-dimensional control strategy was employed during data acquisition to enhance sample diversity. Specifically, the drone captured images from varying altitudes and viewing angles and under different lighting conditions. A total of 944 images were collected, each with a resolution of 4032 × 2268 pixels. The dataset includes 26,075 vehicle instances, among which 25,052 (96.1%) are small objects. Data collection was conducted from April to September 2024. The dataset was split into training, validation, and test sets in a ratio of 8:1:1, containing 755, 94, and 95 images, respectively. The data acquisition settings are detailed in
Table 2.
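For illustration, the following is a minimal sketch of how such an 8:1:1 split can be produced; the directory layout, file extension, and random seed are assumptions made for the example rather than details of the released dataset.

```python
import random
from pathlib import Path

# Gather all images (assumed layout: DJCAR/images/*.jpg) and shuffle reproducibly.
images = sorted(Path("DJCAR/images").glob("*.jpg"))   # 944 files in total
random.seed(0)
random.shuffle(images)

# 8:1:1 split of 944 images -> 755 / 94 / 95.
n_train, n_val = 755, 94
splits = {
    "train": images[:n_train],
    "val": images[n_train:n_train + n_val],
    "test": images[n_train + n_val:],
}
for name, files in splits.items():
    print(name, len(files))
```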
Table 3 presents a representative sample from the DJCAR dataset.
To further evaluate the generalization capability of the proposed model, additional experiments are conducted on the public VisDrone dataset [
26]—a UAV vision benchmark released by Tianjin University and the Information Technology Laboratory for Data Mining. This dataset is widely adopted for tasks such as object detection and tracking in UAV-based scenarios. VisDrone encompasses a wide range of complex urban environments, including city roads, transportation hubs, campuses, and public squares, and features diverse conditions such as varying illumination and weather. The dataset consists of 10 object categories, including vehicles, pedestrians, and bicycles, with small targets (i.e., objects occupying fewer than 32 × 32 pixels) accounting for approximately 59% of the labeled instances. Each annotation includes detailed attributes such as bounding boxes, occlusion levels, and truncation status, effectively reflecting real-world challenges such as scale variation, dense target distribution, and background clutter in UAV imagery. Compared with traditional object detection benchmarks (e.g., COCO, PASCAL VOC), VisDrone places greater emphasis on the realism and complexity of low-altitude UAV imaging, making it especially suitable for testing algorithm robustness in dynamic and unconstrained environments. The VisDrone dataset contains 8629 images (6471 for training, 548 for validation, and 1610 for testing). The same detection and evaluation strategies were applied to both the DJCAR and VisDrone datasets. For a clearer context of the performance comparison shown in
Figure 6, the number of instances per class in VisDrone is as follows: pedestrian (109,187), people (38,560), bicycle (13,069), car (187,005), van (32,702), truck (16,284), tricycle (6387), awning-tricycle (4377), bus (9117), and motor (40,378).
Table 4 presents the dataset structure diagrams of both the DJCAR and VisDrone datasets, generated during the Ultralytics training process.
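A minimal sketch of the corresponding Ultralytics workflow is given below; the dataset structure diagrams mentioned above are written to the run directory automatically during training. The dataset YAML name and the hyperparameters are illustrative assumptions, not the exact settings used in this study.

```python
from ultralytics import YOLO

# Baseline model used in this study; training on a hypothetical djcar.yaml
# dataset description also generates the dataset structure (label) plots.
model = YOLO("yolov8n.pt")
model.train(data="djcar.yaml", epochs=100, imgsz=640, device=0)

# Validation reports Precision, Recall, mAP@0.5, and mAP@0.5:0.95.
metrics = model.val(data="djcar.yaml")
```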
Compared with the VisDrone and UAVDT datasets, the proposed DJCAR dataset presents several notable advantages:
(1) Higher image resolution: DJCAR contains 944 high-resolution UAV images with a resolution of 4032 × 2268 pixels, which is significantly higher than that of VisDrone (2000 × 1500) and UAVDT (1080 × 540). This allows for better preservation of object details, especially for small or distant vehicles.
(2) Diverse flight altitudes: The data were collected from four different altitudes (80 m, 90 m, 100 m, and 110 m), ensuring coverage across low-, mid-, and high-altitude perspectives. This helps to balance object scale (in pixel dimensions) and scene coverage, which is not explicitly controlled in VisDrone.
(3) Vehicle-focused annotation: DJCAR specifically targets vehicle detection, with 26,075 annotated vehicle instances. The dataset focuses on urban and suburban road scenes, making it highly relevant for intelligent transportation systems and traffic violation detection tasks.
(4) Scene diversity: DJCAR includes scenarios with varying traffic densities, road types, and environmental conditions. Compared to the relatively cluttered and object-diverse nature of VisDrone, DJCAR provides cleaner, vehicle-centric scenes that are more suitable for fine-grained behavior analysis.
The novelty of the DJCAR dataset lies in its high-resolution images, controlled flight altitudes, and vehicle-focused annotations, all of which make it a specialized dataset for vehicle detection. However, we acknowledge that the dataset primarily focuses on urban and suburban road scenes and does not fully represent rural or non-road environments. We recognize that future versions of the dataset could include additional environments to broaden its applicability.
4.3. Comparative Experiments
To assess the detection capability of the proposed enhanced model, several widely used evaluation metrics are utilized, such as Precision, Recall, F1-score, mAP@0.5, and mAP@0.5:0.95. The definitions of the relevant variables used in the calculation of these metrics are as follows: True Positives (TPs) refer to positive instances that are correctly identified by the model, whereas False Positives (FPs) correspond to incorrect predictions where negative samples are classified as positive. False Negatives (FNs) are actual positive cases that the model fails to detect, misclassifying them as negative. Additionally, the Intersection over Union (IoU) metric evaluates the overlap between predicted and ground-truth bounding boxes, calculated as the ratio of the intersection area to the total area covered by both boxes.
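As an illustration of the IoU definition above, the following is a minimal sketch for axis-aligned boxes given in (x1, y1, x2, y2) pixel coordinates:

```python
def box_iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    # Union = sum of both box areas minus the intersection.
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```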
Precision measures the proportion of true positive predictions among all instances identified as positive by the model. It is calculated using the formula shown in Equation (
13):
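$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{13}$$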
Recall represents the proportion of true positive predictions relative to the total number of actual positive samples. It is calculated using the formula shown in Equation (
14):
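$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{14}$$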
The F1-Score balances Precision and Recall, with higher values indicating better performance. It is computed using the formula presented in Equation (
15):
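$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{15}$$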
Average Precision (AP) is the area under the precision–recall curve. It is calculated using the formula shown in Equation (
16):
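$$AP = \int_{0}^{1} P(R)\,\mathrm{d}R \tag{16}$$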
Mean Average Precision (mAP) is the mean of the Average Precision (AP) values across all sample categories and is used to evaluate the detection performance of the model across all categories. It is calculated using the formula shown in Equation (
17):
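$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i \tag{17}$$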
In Equation (17), $AP_i$ represents the Average Precision (AP) for the class indexed by $i$, while $N$ refers to the total number of classes in the training dataset. mAP@0.5 corresponds to the mean Average Precision with the Intersection over Union (IoU) threshold set to 0.5. On the other hand, mAP@0.5:0.95 represents the mean Average Precision calculated over IoU thresholds ranging from 0.5 to 0.95, with a step size of 0.05, as formalized after this paragraph. To demonstrate the superiority of the proposed algorithm over other mainstream vehicle detection methods, a comparative evaluation was conducted in which several versions of the YOLO series, including YOLOv3-tiny [
27], YOLOv5n [
28], YOLOv7-tiny [
29], YOLOv9t [
30], YOLOv10n [
31], and YOLOv11n [
32], were selected for model comparison experiments. These models represent a range of YOLO architectures, from earlier versions to the latest iterations, and are widely used in target detection tasks.
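For clarity, the averaging over IoU thresholds described above can be written as

$$\mathrm{mAP}@0.5{:}0.95 = \frac{1}{10} \sum_{t \in \{0.50,\,0.55,\,\ldots,\,0.95\}} \mathrm{mAP}@t .$$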
Table 5 and
Table 6 present a summary of the experimental outcomes, showcasing the performance of each model on the DJCAR and VisDrone datasets. The primary evaluation indicators include Precision, Recall, mAP@0.5, mAP@0.5:0.95, and the F1-Score.
On the DJCAR dataset, the proposed RSW-YOLO model exhibits clear superiority in UAV vehicle detection tasks. Specifically, it achieves mAP@0.5 and mAP@0.5:0.95 scores of 92.6% and 59.6%, respectively, representing improvements of 5.4 and 6.2 percentage points over the baseline YOLOv8n model (87.2% and 53.4%) and outperforming other contemporary detection frameworks. Moreover, the enhanced model records a Precision of 91.2% and a Recall of 85.5%, both surpassing the original YOLOv8n's performance (88.5% and 79.5%), indicating superior target coverage with fewer false positives. Compared to other mainstream algorithms, the proposed approach maintains a better balance between detection accuracy and Recall. In addition, the F1-Score reaches 88.3%, an increase of 4.5 percentage points over YOLOv8n, signifying a more refined trade-off between Precision and Recall while effectively suppressing both false alarms and missed detections.
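As a quick consistency check (derived here from the reported Precision and Recall values rather than stated in the tables), the baseline F1-Score of YOLOv8n is

$$F1_{\mathrm{YOLOv8n}} = \frac{2 \times 0.885 \times 0.795}{0.885 + 0.795} \approx 0.838,$$

so the proposed model's 88.3% corresponds to the 4.5-percentage-point gain noted above.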
To further verify its generalization capability, the model was evaluated on the VisDrone dataset, which involves multi-category object detection. On this benchmark, RSW-YOLO also achieved leading results, with mAP@0.5 and mAP@0.5:0.95 reaching 30.9% and 17.6%, respectively. These values exceed those of YOLOv8n (26.6% and 15.0%) by 4.3 and 2.6 percentage points and also surpass other leading models. The Precision and Recall achieved by the proposed model are 41.4% and 33.4%, significantly better than YOLOv8n's 37.8% and 29.3%. This indicates that the improved model more effectively captures relevant targets while reducing incorrect predictions. Furthermore, the F1-Score of 36.9% represents an improvement of roughly 4 percentage points over YOLOv8n, underscoring a stronger balance between detection precision and recall. A comparison across object categories is illustrated in
Figure 7.
An evaluation of the RSW-YOLO algorithm’s detection capability across ten object categories from the VisDrone dataset was conducted by computing the mean Average Precision (mAP) for each class. The corresponding results are presented in
Figure 7. As illustrated, RSW-YOLO demonstrates excellent performance across all 10 categories, with four classes (car, pedestrian, people, and motor) achieving significantly higher mAP@0.5 than the average level of the VisDrone dataset. Notably, the detection of vehicle targets is particularly outstanding, which aligns with the core objective of vehicle detection in urban UAV remote sensing imagery. These findings further affirm the model’s robustness and generalization ability in complex environments.
To evaluate the real-time performance of the proposed detection method, the number of parameters, GFLOPs, and inference speed (FPS) of both the proposed model and the baseline YOLOv8n were measured on an NVIDIA RTX 3090 GPU. The results are summarized in
Table 7.
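Since the exact benchmarking protocol is not listed here, the following is a minimal sketch of how the inference speed could be measured with Ultralytics on a CUDA GPU; the weights file name, input size, and iteration counts are illustrative assumptions.

```python
import time

import numpy as np
import torch
from ultralytics import YOLO

model = YOLO("rsw_yolo.pt")                        # hypothetical weights of the improved model
model.info()                                       # reports layer, parameter, and GFLOPs counts
frame = np.zeros((640, 640, 3), dtype=np.uint8)    # dummy BGR frame

for _ in range(20):                                # GPU warm-up iterations
    model.predict(frame, device=0, verbose=False)

torch.cuda.synchronize()
start = time.perf_counter()
n_iters = 200
for _ in range(n_iters):
    model.predict(frame, device=0, verbose=False)
torch.cuda.synchronize()
print(f"~{n_iters / (time.perf_counter() - start):.1f} FPS")
```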
As shown in
Table 7, on the DJCAR dataset composed of 4K-resolution UAV images, the improved model achieved 60.13 FPS with 11.86 million parameters and 38.3 GFLOPs, while YOLOv8n reached 110.47 FPS with 3.01 million parameters and 8.1 GFLOPs. On the VisDrone dataset, which contains lower-resolution images, the improved model achieved 174.51 FPS, again demonstrating efficient inference performance.
Although the improved model incurs higher computational costs compared to YOLOv8n, it maintains real-time performance even with high-resolution inputs and significantly enhances detection accuracy. These results indicate that the proposed model achieves a favorable trade-off between accuracy and efficiency, making it well-suited for real-time UAV-based object detection applications.
To visually evaluate the impact of the proposed optimizations, a per-sample analysis was carried out on the DJCAR test dataset, focusing on representative samples with high background complexity for visual comparison experiments. The selected samples exhibit challenging characteristics such as significant scale variations within the scene, dense target distributions, and high similarity between targets and background features. These samples effectively reflect the algorithm’s practical detection performance in complex environments. An example of the comparative results is presented in
Table 8.
As shown in
Table 8, the first column presents the detection results of the baseline YOLOv8n model, while the second and third columns display the outputs of RT-DETR and the proposed RSW-YOLO model, respectively. The selected remote sensing images reflect varying flight altitudes, illumination conditions, and object densities. A quantitative comparison of the detection results is provided in
Table 9.
As shown in
Table 9, both the YOLOv8n and RT-DETR algorithms exhibit noticeable missed and false detections in complex scenarios involving small vehicle targets, particularly under densely populated conditions. In contrast, the proposed RSW-YOLO model demonstrates significantly enhanced robustness in such challenging environments. It is more effective in detecting small-scale vehicles and substantially reduces both missed and false detection rates. These findings indicate that the proposed method maintains strong detection performance even under complex conditions, effectively mitigating the accuracy degradation typically caused by small object sizes in remote sensing imagery. The improvements clearly highlight the proposed model’s superiority in addressing small-object detection challenges.