2.1. Dataset
To carry out the proposed experiments, two distinct datasets were employed. The first, used during the weight pre-training stage, was selected with the aim of adapting the investigated approaches to the general context of vehicle detection, without specifically addressing the urban scenario of the city of Debrecen. For this purpose, the annotated image dataset UA-DETRAC [17] was chosen.
The selection of this dataset is justified by its broad diversity and the high quality of its annotated objects, with a particular focus on urban environments, making it a well-established reference for the evaluation of models in the vehicle detection domain. Furthermore, as highlighted by Liang et al. [18] in their analysis of the most relevant datasets in the field, the use of a comprehensive and representative dataset such as UA-DETRAC is of great significance for experiments of this nature, as it is widely recognized in the literature as a reliable benchmark. In this study, a random sample of 10,000 images was selected from this dataset to serve as the training base for the models in the initial experimental phase.
Building on the previously trained and validated weights, and to ensure that the employed architectures had the necessary foundation for analyzing information within the proposed context, a second set of images was used. This set corresponds to a novel dataset, introduced for the first time in this article, named DebStreet (available at: https://universe.roboflow.com/joao-vitor-de-andrade-porto/debstreet, accessed on 18 June 2025), which captures information from the target location under controlled climatic and lighting conditions, inspired by the way the Caltech 1999 [19] and 2001 [20] datasets were constructed.
This set comprises 682 images, totaling 3256 annotated objects of interest, all labeled with the class “Vehicle”, as illustrated in Figure 1; no data augmentation techniques were applied. The images were captured in the urban environment of Debrecen, Hungary, using a camera positioned at the height of a traffic light, enabling the visual recording of vehicles both stationary and in motion at the intersection of Bólyai and Thomas Mann Streets. This location is widely recognized for its high volume of private vehicle traffic and for its vehicular diversity, owing to the presence of two intersecting roads and the circulation of trams.
2.3. Comparative Metrics
In order to ensure the comparability of the presented results, and based on the considerations of [25] regarding the importance of evaluation metrics, a set of nine comparative metrics was selected. The first six are directly associated with the classification and detection processes widely used in the literature, encompassing the general Mean Average Precision (mAP), the mAP at Intersection over Union thresholds of 50% (mAP50) and 75% (mAP75), and the traditional classification metrics: Precision, Recall, and Fscore.
To obtain the values of the adopted metrics, the results of each test procedure were analyzed in terms of correct and incorrect detections and classifications, and the corresponding metrics were then calculated according to the equations presented below.
For the three most commonly used metrics, Precision, Recall, and Fscore, the reported and analyzed values were obtained from Equations (1)–(3), respectively.
In these equations, $N$ represents the number of classes; $TP_i$ denotes the number of true positives, i.e., correctly predicted instances of class $i$; $FP_i$ represents false positives, i.e., instances incorrectly predicted as class $i$; and $FN_i$ signifies false negatives, i.e., instances of class $i$ that were incorrectly predicted as another class.
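For reference, Equations (1)–(3) can be reconstructed in the standard macro-averaged form implied by these definitions:

$$\mathrm{Precision} = \frac{1}{N}\sum_{i=1}^{N}\frac{TP_i}{TP_i + FP_i} \qquad (1)$$

$$\mathrm{Recall} = \frac{1}{N}\sum_{i=1}^{N}\frac{TP_i}{TP_i + FN_i} \qquad (2)$$

$$\mathrm{Fscore} = \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (3)$$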
Regarding the calculation of Mean Average Precision at different levels of Intersection over Union (IoU), Equations (4)–(6) were used to obtain the values of mAP, mAP50, and mAP75, respectively.
In these equations, $N$ denotes the total number of object classes, and $AP_i^{\tau}$ represents the Average Precision for class $i$, computed at a specific Intersection over Union (IoU) threshold $\tau$. The general Mean Average Precision (mAP) is obtained by averaging the AP values across $K$ different IoU thresholds, which typically range from 0.50 to 0.95 in increments of 0.05, following the COCO evaluation protocol. The metrics mAP50 and mAP75 represent the mean Average Precision calculated using only predictions with an IoU greater than or equal to 0.50 and 0.75, respectively, and thus reflect the model’s performance under increasingly stringent localization criteria.
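As a reference, Equations (4)–(6) can be reconstructed in the standard COCO-style form consistent with this description:

$$\mathrm{mAP} = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{N}\sum_{i=1}^{N} AP_i^{\tau_k}, \qquad \tau_k \in \{0.50, 0.55, \ldots, 0.95\} \qquad (4)$$

$$\mathrm{mAP50} = \frac{1}{N}\sum_{i=1}^{N} AP_i^{0.50} \qquad (5)$$

$$\mathrm{mAP75} = \frac{1}{N}\sum_{i=1}^{N} AP_i^{0.75} \qquad (6)$$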
Finally, three additional metrics specific to this experiment were selected. Although traditionally used in regression studies, these metrics, as discussed by [26] in their research on green apples, can contribute to assessing the robustness and the model’s ability to fit the problem, allowing for a more comprehensive performance analysis. Thus, the metrics Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Pearson’s correlation coefficient ($r$) were adopted.
For the calculation of these specific experimental metrics, Equations (7)–(9) were employed, where $n$ represents the total number of samples, $y_i$ corresponds to the actual value of the $i$-th sample, $\hat{y}_i$ denotes the predicted value for the $i$-th sample, $\bar{y}$ indicates the mean of the actual values across all samples, and $\bar{\hat{y}}$ represents the mean of the predicted values within the same set.
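These correspond to the standard definitions, reconstructed here from the symbol descriptions above:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert \qquad (7)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \qquad (8)$$

$$r = \frac{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)\left(\hat{y}_i - \bar{\hat{y}}\right)}{\sqrt{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(\hat{y}_i - \bar{\hat{y}}\right)^2}} \qquad (9)$$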
2.4. Experimental Setup
To determine the most suitable approach for vehicle detection in an urban context, it was essential to establish quantitative criteria for comparing the analyzed methods. To this end, the study adopted two types of analysis for each experiment. The first evaluates each network’s ability to classify and identify objects of interest, using Precision, Recall, Fscore, Mean Average Precision (mAP), mAP at 0.5 IoU (mAP50), and mAP at 0.75 IoU (mAP75). These metrics were selected to standardize comparisons with previous studies in the literature, thereby facilitating consistent and meaningful future analyses.
The second analysis focused more specifically on the problem of quantifying the number of vehicles passing through the roadway. For this purpose, a distinct set of metrics was adopted, aimed at vehicle counting. The metrics selected for this analysis were the Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and the Pearson correlation coefficient (r), which allowed for a more precise evaluation of the architectures within the specific context of the problem under study.
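As an illustration only, the following minimal Python sketch shows how MAE, RMSE, and $r$ can be computed from predicted versus annotated per-image vehicle counts; the count values below are hypothetical placeholders, not the authors’ data:

```python
import numpy as np

# Hypothetical per-image vehicle counts: the annotated ground truth and the
# number of detections each model returned per frame (illustrative values).
y_true = np.array([12, 7, 9, 15, 4], dtype=float)
y_pred = np.array([11, 8, 9, 13, 5], dtype=float)

mae = np.mean(np.abs(y_true - y_pred))            # Mean Absolute Error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))   # Root Mean Squared Error
r = np.corrcoef(y_true, y_pred)[0, 1]             # Pearson correlation coefficient

print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  r={r:.3f}")
```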
As previously presented, the main experiment of this study, which used both datasets, was carried out in three stages, with metrics being collected and evaluated at each stage. The first stage consisted of training the architectures with the goal of ensuring generalization in data prediction, focusing on the urban vehicle traffic scenario. For this stage, the UA-DETRAC dataset underwent a random sampling process without replacement, resulting in the formation of 10 distinct subsets, each containing exactly 10% of the total images. These subsets were used in a k-fold cross-validation procedure, conducted with 10 folds. In each iteration, one of the subsets was designated as the test set, while the remaining subsets were allocated for training and validation in an 80:20 ratio. The metrics previously discussed were collected and stored at the end of each fold.
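A minimal sketch of this sampling and cross-validation protocol is shown below; the helper functions (load_image_paths, build_model, train, evaluate, store_metrics) are hypothetical placeholders, not the published pipeline:

```python
import random

random.seed(42)  # fixed seed for reproducibility (illustrative choice)

# Hypothetical loader for the 10,000 sampled UA-DETRAC images.
images = load_image_paths()

# Random sampling without replacement: shuffle once, then cut into
# 10 disjoint subsets of 10% each.
random.shuffle(images)
fold_size = len(images) // 10
folds = [images[i * fold_size:(i + 1) * fold_size] for i in range(10)]

for k in range(10):
    test_set = folds[k]
    remaining = [img for j, fold in enumerate(folds) if j != k for img in fold]

    # 80:20 split of the nine remaining folds into training and validation.
    cut = int(0.8 * len(remaining))
    train_set, val_set = remaining[:cut], remaining[cut:]

    # Weights are re-initialized at every fold so each run starts from scratch.
    model = build_model()
    train(model, train_set, val_set)
    store_metrics(evaluate(model, test_set), fold=k)
```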
The adopted cross-validation strategy ensured that each of the 10 subsets was used exactly once as a test set, providing a comprehensive assessment of the model’s performance. Furthermore, the strict separation between the training, validation, and test sets prevented any data contamination. In each iteration, the model’s weights were reset, ensuring that the process started from scratch. This approach reinforces the reliability of the obtained results, enabling a robust statistical analysis of the architectures’ performance. To ensure the standardization of the experiments, the hyperparameters learning rate, number of epochs, batch size, and patience were kept constant across all executions, assuming the values N, M, O, and P, respectively.
In order to determine these values, the image set was shuffled and five executions were performed, varying one hyperparameter at a time while keeping the others fixed. This procedure was repeated for each hyperparameter to identify the optimal combination for the conducted experiment. At the end of the executions, the training and validation loss curves were analyzed, and the best configuration was identified from the region of the graph where the curve begins to stabilize while still showing a slight downward trend. Thus, the values of N, M, O, and P were defined in this study as 0.001, 200, 16, and 10 (5% of the number of epochs), respectively, for all experimental runs performed.
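For clarity, the resulting fixed configuration can be summarized as follows (illustrative notation; the letters map to the placeholders used above):

```python
# Hyperparameters kept constant across all executions (values from the
# one-at-a-time search described above).
HYPERPARAMS = {
    "learning_rate": 0.001,  # N
    "epochs": 200,           # M
    "batch_size": 16,        # O
    "patience": 10,          # P: early-stopping patience, 10 epochs (5% of 200)
}
```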
To perform the statistical comparison of the stored quality metrics, the obtained values were subjected to three distinct analyses: analysis of the mean values and standard deviations, construction and analysis of boxplots, and an analysis of variance (ANOVA) followed by a post hoc Tukey test at a 5% significance level. This statistical analysis made it possible not only to identify the differences between the results presented by the architectures, but also to assess the relevance of these differences and the likelihood that they would hold for samples beyond the experimental data, that is, in real-world application scenarios.
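A minimal sketch of this statistical pipeline, assuming the per-fold scores have already been collected (the architecture names and placeholder values below are illustrative), might look like:

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)

# Placeholder per-fold metric values (e.g., mAP) for each architecture; in the
# experiment these would be the 10 values collected across the folds.
scores = {
    "faster_rcnn": rng.uniform(0.70, 0.80, 10),
    "detr":        rng.uniform(0.68, 0.78, 10),
    "yolov8":      rng.uniform(0.72, 0.82, 10),
}

# One-way ANOVA across architectures at the 5% significance level.
f_stat, p_value = f_oneway(*scores.values())
print(f"ANOVA: F = {f_stat:.3f}, p = {p_value:.4f}")

# Post hoc Tukey HSD identifies which pairs of architectures differ.
if p_value < 0.05:
    values = np.concatenate(list(scores.values()))
    groups = np.repeat(list(scores.keys()), 10)
    print(pairwise_tukeyhsd(values, groups, alpha=0.05))
```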
In the second stage, the weights from each fold of each architecture were reused for testing and metric collection; however, this time, the DebStreet dataset was used in place of the test set from each fold. The same metrics and statistical analyses were applied with the aim of evaluating the performance of the weights learned on a general dataset when applied to a new problem, using a cross-domain approach, without fine-tuning.
Finally, the third stage consisted of reapplying the 10-fold cross-validation process to all architectures, using the pre-trained weights obtained in Stage 1 of the experiment, now combined with the specific DebStreet dataset for classification, detection, and counting tasks within the proposed context. Thus, fine-tuning of the previously trained weights was performed to adapt each architecture to the new application domain. The collection and analysis of the same metrics from the previous stages were maintained, characterizing the cross-domain configuration with fine-tuning. This experiment aimed not only to identify the most suitable network for solving the problem but also to demonstrate the generalization and adaptability capacity of each architecture when adjusted to a new sample set.
All experiments were conducted over a five-day period using an NVIDIA A2000 graphics card with 12 GB of dedicated GDDR6 memory. The architectures were implemented using the PyTorch library, version 2.5.0, while the Side-Aware Boundary Localization model was developed based on the MMDetection framework [27], accessed on 15 March 2025. The implementations used in this experiment are publicly available in the following repositories: compara_detectores_torch (https://github.com/Inovisao/compara_detectores_torch), the authors’ PyTorch implementation of the Faster R-CNN, DETR, and YOLOv8 networks; and detectores_json_k_dobras (https://github.com/Inovisao/detectores_json_k_dobras), an MMDetection wrapper for statistical data generation and experiment configuration.