4.3.1. mAP Analysis
This section presents a detailed analysis of the experimental results obtained from evaluating various YOLO models, specifically versions 8, 11, and 12, each trained with nano, small, medium, and large architectures, on our dataset. The performance of each model was measured using the metrics presented in
Section 4.2. The following analysis is based on the mAP50 scores, and, more specifically,
Table 3 presents the results for all models evaluated with an input size of 576 × 640.
Overall, the results indicate a clear positive correlation between model size and detection performance. In particular, the YOLOv8l model consistently demonstrated superior performance, achieving the highest overall mAP50 score of 60.7%. This trend suggests that for the given dataset, the increased parameter count and complexity of larger models, such as the medium and large variants, are crucial for improving overall detection accuracy. This is particularly evident when comparing the nano YOLO models, which exhibit the lowest overall scores, with their deeper counterparts, where performance gains of over 11% are observed for all the YOLO versions.
It is also noted that a class-by-class analysis reveals significant variations in detection performance. More specifically, classes such as guardrail and road consistently achieved exceptionally high mAP50 scores across all models, often exceeding 97%. This is likely due to the size, distinct shape, and consistent appearance and colors of these objects, which makes them relatively easy to identify and localize. Similarly, truck detection was highly accurate, with the YOLOv8l model reaching a remarkable mAP50 of 93.6%. These findings highlight the models’ proficiency in detecting large, well-defined objects with predictable visual characteristics.
Conversely, the performance for smaller, less common, or visually challenging classes varied considerably. Danger sign proved to be the most difficult class to detect, with all models yielding very low mAP50 scores, ranging from 4.49% for YOLOv11n to a peak of 31.0% for YOLOv8l. This difficulty could be attributed to the often-small size of these signs in the images, their varied visual appearance, and potential occlusions. Similarly, classes like bollard, delineator, and other sign showed lower scores in the nano models but saw substantial performance improvements with the small, medium, and large architectures, underscoring the benefit of greater model capacity for these challenging identifications.
To this end, a key factor contributing to the low mAP50 scores for certain classes, particularly those of signs, is the small size of these objects within the input resolution. At this resolution, small objects like signs and bollards occupy a very limited number of pixels, making it difficult for the models to extract the necessary features for accurate classification and localization. The limited pixel information results in a significant reduction in detection accuracy, which is reflected in the low scores for danger sign and the relatively modest scores for bollard and other sign across all model architectures.
To investigate the impact of input resolution on model performance, a second set of experiments was conducted using an input size of 1120 × 1280. The results regarding the mAP50 metric are presented in
Table 4.
As can be seen in the aforementioned table, the results from the higher-resolution detectors continue to support the trend that larger architectures achieve better performance. The YOLOv8l model again emerged as the top performer, achieving an overall mAP50 of 86.2%. The consistent outperformance of larger networks, particularly for the more challenging classes, highlights the benefit of increased parameter capacity when dealing with more detailed visual information. This is especially apparent in the medium and large architectures, which leverage the increased pixel density to better recognize and classify the various points of interest.
In parallel, it is emphasized that the most significant performance gains at the larger 1120 × 1280 resolution were observed in the classes that struggled at the lower resolution. Notably, the danger sign class, which previously had the lowest scores (see
Table 3), saw its mAP50 for YOLOv8l jump from 31.0% to a remarkable 92.7%. Similarly, bollard and delineator showed substantial improvements across all model sizes. Thus, the enhanced resolution provides the models with a greater number of pixels to analyze for these small objects, allowing for more accurate feature extraction and, consequently, better detection. This validates the hypothesis that small object size was a primary limitation at the 576 × 640 input resolution.
The increase in resolution had a considerable impact on the overall performance of all YOLO detectors. For instance, the YOLOv8l model showed a relative overall mAP50 improvement of approximately 42%, increasing from 60.7% to 86.2%. The YOLOv11m model also exhibited a significant relative gain of about 45.3%, jumping from an overall score of 58.3% to 84.7%. It is noted that, while the benefits were less pronounced for already well-performing classes like guardrail and road, the higher resolution provided a significant boost to the overall utility and accuracy of the algorithms on DORIE, particularly for the challenging classes.
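The gains quoted above are relative percentage improvements; a minimal sketch reproducing the figures from the text (the function name is illustrative):

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement in percent: 100 * (after - before) / before."""
    return 100.0 * (after - before) / before

# Overall mAP50 scores quoted in the text (576 x 640 -> 1120 x 1280)
yolov8l_gain = relative_gain(60.7, 86.2)   # ~42.0%
yolov11m_gain = relative_gain(58.3, 84.7)  # ~45.3%
print(f"YOLOv8l:  {yolov8l_gain:.1f}%")
print(f"YOLOv11m: {yolov11m_gain:.1f}%")
```

Note that these are gains relative to the lower-resolution baseline, not differences in percentage points.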
Building on the analysis of the mAP50 scores, we now turn our attention to the more rigorous mAP50:95 metric, which, as defined in
Section 4.2, provides a more comprehensive evaluation of a model’s performance by averaging the Average Precision across a range of stricter Intersection over Union (IoU) thresholds. This metric particularly penalizes imprecise bounding box localization. The results for the YOLO models with a 576 × 640 input size are presented in
Table 5.
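As a reminder of the quantity behind these thresholds, IoU is the ratio of the overlap area to the combined area of a predicted and a ground-truth box. A minimal sketch with hypothetical box coordinates (not drawn from DORIE):

```python
def iou(box_a, box_b) -> float:
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A prediction shifted by 10 px on a 100 px object still clears IoU >= 0.5,
# but falls far short of the 0.95 threshold at the strict end of mAP50:95.
gt, pred = (0, 0, 100, 100), (10, 10, 110, 110)
print(f"IoU = {iou(gt, pred):.3f}")
```

This illustrates why small localization errors that are harmless at IoU 0.5 become failures at the stricter thresholds.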
The aforementioned results show a significant decrease in scores across all models and classes compared to the mAP50 metric. This is expected, as mAP50:95 averages performance over a much stricter range of IoU thresholds, from 0.5 to 0.95. Achieving a high mAP50:95 score requires not only accurate recognition and detection but also highly precise bounding box localization, which is a much more challenging task. The overall mAP50:95 scores for all models with a 576 × 640 input size range from 32.1% to 40.0%, indicating that the majority of models struggle to produce a highly accurate detection and a precise bounding box simultaneously.
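The averaging described above can be sketched as follows; `ap_at_threshold` is a hypothetical callable standing in for a full evaluation pipeline, and the toy AP curve merely mimics how performance degrades as the IoU requirement tightens:

```python
def map50_95(ap_at_threshold) -> float:
    """COCO-style mAP50:95: mean AP over IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = [0.50 + 0.05 * k for k in range(10)]  # ten thresholds
    return sum(ap_at_threshold(t) for t in thresholds) / len(thresholds)

# Toy model: AP falls off linearly as the IoU threshold tightens, so the
# averaged score ends up well below the AP measured at IoU 0.5 alone.
toy_ap = lambda t: 0.9 - 0.8 * (t - 0.5)
print(f"mAP50    = {toy_ap(0.5):.2f}")
print(f"mAP50:95 = {map50_95(toy_ap):.2f}")
```

The steeper a model's AP decays with stricter thresholds (i.e., the less precise its boxes), the larger the gap between its mAP50 and mAP50:95 scores.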
Despite the overall performance drop, the trend of larger models outperforming smaller ones remains consistent. The YOLOv8m model achieves the best overall mAP50:95 of 40.0%, closely followed by the YOLOv8l model at 39.6%. The performance gains from increasing model size are evident across all YOLO versions, reinforcing the notion that greater model complexity is beneficial for more demanding tasks, such as precise localization. The performance gap between the nano and the medium/large models is more pronounced under this metric, highlighting the struggle of lightweight models with the strict IoU requirements.
A class-by-class analysis further illustrates the challenges of the mAP50:95 metric. Classes like guardrail and road still maintain the highest scores, with performance that remains nearly perfect, similar to their results under the more lenient mAP50 metric. This indicates that for these large, distinct objects, the models are highly proficient at both accurate detection and precise bounding box localization. On the other hand, the truck and car classes experienced a significant drop, though they still remain among the better-performing categories, with the top scores for truck reaching just over 60%.
The most notable impact of the mAP50:95 metric is seen in the classes that were already difficult to detect at a lower resolution. Indicatively, the danger sign class, which had a peak mAP50 score of 31.0% (see
Table 3), now registers a peak mAP50:95 score of just 14.3% for the YOLOv8l model. This dramatic reduction underscores the extreme difficulty of precisely localizing these small, inconsistent objects. In parallel, classes like bollard, delineator, and other sign also show a similar pattern, with their best scores falling to around 20% and below, revealing that models that could successfully detect these objects at mAP50 struggle to achieve the pixel-perfect bounding boxes required by mAP50:95. This strongly suggests that for small objects, the models’ ability to localize precisely is a major performance bottleneck.
In summary, the mAP50:95 results reinforce the findings from the mAP50 evaluation but with a greater emphasis on the challenges of precise localization. The consistent hierarchy of performance from larger to smaller models is maintained, with the YOLOv8m model demonstrating the best overall performance, closely followed by YOLOv8l. However, the significantly lower scores across all classes highlight the difficulty of the task, particularly for small objects, where both detection and localization are hindered by the limited pixel information.
The analysis of the 576 × 640 results highlighted the challenge of precisely localizing small objects, which was especially apparent under the stricter mAP50:95 metric. To investigate whether a higher input resolution could mitigate this issue, a second evaluation was performed, using models with an input size of 1120 × 1280.
Table 6 presents the mAP50:95 scores for this higher-resolution experiment.
The results from the higher resolution models continue to support the trend that larger models achieve better performance. The YOLOv8l and YOLOv11m networks emerged as the top performers, both achieving an overall mAP50:95 of 50.3%. This highlights that for this dataset, an increased input resolution significantly enhances the performance of more complex models. The consistent outperformance of medium and large variants over their smaller counterparts emphasizes the benefit of greater parameter capacity when precise localization is required.
A class-by-class analysis for the 1120 × 1280 resolution reveals that the most substantial performance gains were observed in the classes that struggled with the lower resolution. Notably, the danger sign class, which had the lowest scores previously, saw its peak mAP50:95 for YOLOv8l jump from 14.3% to a remarkable 55.2%, a gain of over 280%. Similarly, classes like bollard, delineator, and mandatory sign, which faced difficulties with precise localization, showed substantial improvements across all model sizes. This validates the hypothesis that small object size was a primary limitation at the 576 × 640 input resolution and that a higher resolution provides the models with the necessary pixel information for accurate feature extraction and localization.
The increase in resolution had a considerable impact on the overall performance of all models. For instance, the YOLOv8l model showed a relative overall mAP50:95 improvement of approximately 27%, increasing from 39.6% to 50.3%. The YOLOv11m model also exhibited a significant relative gain of about 30.6%, jumping from an overall score of 38.5% to 50.3%. While the benefits were less pronounced for already well-performing classes like guardrail and road, the higher resolution provided a significant boost to the overall utility and accuracy of the models on this dataset, particularly for the challenging classes.
In conclusion, the analysis of both mAP50 and mAP50:95 metrics confirms that model size and input resolution are critical factors in achieving high performance on object detection tasks. The consistent performance hierarchy from larger to smaller models is maintained across both metrics and resolutions. More importantly, the results show that while mAP50 is a useful indicator of general detection, the more stringent mAP50:95 metric highlights the importance of precise localization, a challenge that is significantly mitigated by increasing the input resolution. The remarkable improvement in scores for previously challenging objects like the various traffic signs underscores the necessity of providing sufficient visual information to the models for accurate and reliable performance.
4.3.2. Precision and Recall Analysis
The analysis of aggregated mAP scores in the previous section provided a high-level overview of the models’ performance, but a deeper understanding of their strengths and weaknesses requires a class-by-class examination of Precision and Recall. These metrics reveal the trade-offs models make between correctly identifying positive instances (Recall) and minimizing false alarms (Precision). The following tables (i.e.,
Table 7 and
Table 8) present the F1-score, which is the harmonic mean of Precision and Recall, for both the lower 576 × 640 and higher 1120 × 1280 input resolutions, allowing for a thorough comparative analysis of each class.
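Since the tables report the F1-score, it may help to recall how it is computed; the sketch below reproduces one value quoted later in the text (guardrail, YOLOv12s, at the 576 × 640 resolution):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of Precision and Recall (here both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Guardrail scores for YOLOv12s at 576 x 640, as quoted in the text:
# Precision 93.5%, Recall 96.1% -> F1 of 94.8%.
print(f"F1 = {f1_score(93.5, 96.1):.1f}%")
```

Because the harmonic mean is dominated by the smaller operand, a model with very high Precision but poor Recall (or vice versa) still receives a low F1-score, which is why it is a useful single-number summary of the trade-off.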
A general overview of the results from the aforementioned tables indicates that models with an input resolution of 1120 × 1280 consistently achieve significantly higher Precision and Recall, and consequently higher F1-scores, across all classes compared to the 576 × 640 resolution. This is particularly noticeable in the average scores for all classes, where the higher-resolution models demonstrate a more balanced and effective performance. Additionally, the Precision and Recall scores, while often moving in tandem, sometimes show trade-offs for individual classes and models, a phenomenon that is explored in more detail in the class-specific analysis below.
The data for both resolutions highlight a clear link between model size and overall performance. At the lower 576 × 640 resolution, the large and medium models tend to show a better balance between Precision and Recall than their nano and small counterparts. For instance, the YOLOv11l model achieves the highest overall Precision (87.0%) at this resolution, while the YOLOv8m model has the highest overall Recall (57.1%). The nano models, in contrast, consistently have lower scores, indicating a general struggle to achieve both high precision and high recall simultaneously. This pattern is even more pronounced at the higher 1120 × 1280 resolution (see
Figure 6), where the top-performing models, such as YOLOv12s, YOLOv11m, and YOLOv8l, achieve a much better balance, with Precision and Recall scores both well above 75%.
Starting the class-by-class comparison with bollard, a clear trade-off between Precision and Recall is observed at the 576 × 640 resolution. The nano and small models, such as YOLOv8n (95.6% Precision) and YOLOv8s (88.2% Precision), achieve very high Precision scores, indicating that they are highly accurate when they do make a detection, with very few false positives. However, their Recall scores are very low (18.3% and 24.1%, respectively), meaning that they fail to detect the large majority of the actual bollards. In contrast, the YOLOv11m model, which has the highest Recall at this resolution (37.6%), does so with a lower but still respectable Precision of 77.7%. This pattern underscores the difficulty of detecting this object: despite its visual consistency, its small size means that many instances are simply missed.
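The high-Precision/low-Recall pattern of the nano models can be made concrete with the standard definitions; the counts below are hypothetical, chosen only to reproduce a score profile similar to YOLOv8n’s on bollard:

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN), in percent."""
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical counts: a conservative detector that fires rarely produces
# almost no false positives (high Precision) but misses most instances
# (low Recall) -- roughly the 95.6% / 18.3% profile quoted in the text.
p, r = precision_recall(tp=44, fp=2, fn=196)
print(f"Precision {p:.1f}%, Recall {r:.1f}%")
```

Raising the detector’s confidence threshold moves it toward the first regime (fewer FPs, more FNs); lowering it does the opposite, which is the trade-off visible across the model families.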
As can be seen in
Figure 7, increasing the input resolution to 1120 × 1280 dramatically improves performance for the bollard class across all models. Recall scores more than doubled for most architectures, with the highest Recall now reaching 74.2% (YOLOv11l). Precision scores also saw a general increase, and the volatility seen at the lower resolution was significantly reduced. This indicates that the models’ ability to both correctly identify bollards and find a higher proportion of them is strongly tied to the amount of pixel information available. The most substantial improvements were seen in the medium and large models, which capitalized on the higher resolution to achieve a better balance of Precision and Recall, with top scores in the high 80s for Precision and high 60s for Recall.
Subsequently, the delineator class at the 576 × 640 resolution shows a similar trade-off to bollards. Some models, such as YOLOv11l (95.2% Precision), are highly precise but less effective at finding all instances (26.0% Recall). The nano and small models consistently show low Recall scores, hovering around 20%, indicating that a significant number of delineators are missed. The best Recall score is only 30.7% (YOLOv8m), underscoring the general difficulty of detecting this class. This is likely due to their small size and vertical, often thin, structure, which can be easily missed or mistaken for noise.
Again, as one can observe in
Figure 8, when the resolution is increased to 1120 × 1280, the performance for delineator detection improves significantly. The highest Precision score climbs to a remarkable 93.9% (YOLOv12m), and the top Recall score nearly doubles to 56.9% (YOLOv8l). This indicates that the additional pixel information allows the models to not only locate delineators more effectively but also to classify them with a high degree of certainty, thereby reducing false positives. While the performance is not at the level of larger, more distinct objects, the gains clearly validate the importance of input resolution for accurately detecting small and visually challenging objects.
In parallel, at the 576 × 640 resolution, the prohibitory sign class presents a challenge for the models. Precision scores are generally in the range of 60–85%, with YOLOv11s achieving the highest precision at 84.9%. The Recall scores, however, are considerably lower, with the best score at 51.2% (YOLOv11m), indicating that a large portion of these signs are not being detected. This is a common pattern for small objects, where models opt for higher confidence to avoid false positives at the cost of missing true instances.
With the resolution increase to 1120 × 1280, both Precision and Recall for prohibitory signs show marked improvements (see
Figure 9). The best Precision score soars to 93.4% (YOLOv8m), and the peak Recall reaches 77.0% (YOLOv8l). This indicates that the higher resolution provides the models with the necessary detail to not only identify these signs more accurately but also to find a greater percentage of them. The improvement for this class is particularly significant, as the models are now much more capable of both classifying and locating these signs, reducing the trade-off that was prevalent at the lower resolution.
The danger sign class is the most difficult to detect at the 576 × 640 resolution, as evidenced by its extremely low mAP scores analyzed in the previous paragraphs. This is directly reflected in its Precision and Recall scores. The models exhibit highly volatile Precision, with some scores dropping as low as 6.3% (YOLOv12m), meaning that the vast majority of that model’s positive detections are incorrect. Recall is also very poor, with YOLOv8m achieving a peak of just 30.4%, while other models miss almost every instance. This highlights the severe limitations of the models when faced with very small, inconsistent, and often occluded objects.
The resolution increase to 1120 × 1280 provides the most dramatic improvement for the danger sign class (see
Figure 10). The Precision scores become much more stable and accurate, with several models reaching over 80%. Similarly, the Recall scores rise sharply across the board, with YOLOv8l achieving a remarkable 87.5%. The jump in performance is striking, as a class that was nearly undetectable at the lower resolution is now identified with a high degree of confidence and effectiveness. This underscores that for extremely small and challenging objects, the input resolution is arguably the single most important factor for achieving robust performance.
It is noted, however, that at the 576 × 640 resolution, the mandatory sign class shows moderate performance. Recall scores are relatively consistent, with most models achieving scores in the mid-40s, and YOLOv11m reaching a high of 51.9%. Precision scores are more varied, with YOLOv8l achieving the highest score at 81.7%. The overall trend suggests that while models are somewhat successful at detecting these signs, there is still room for improvement in both finding all instances and avoiding false positives.
As presented in
Figure 11, with the higher 1120 × 1280 resolution, the performance for mandatory signs improves notably. Both Precision and Recall scores increase significantly, with the YOLOv12s model achieving a peak Precision of 95.5%, and YOLOv11m achieving a peak Recall of 88.9%. This indicates that the additional pixel information allows the models to resolve the features of these signs more clearly, leading to a much more accurate and robust detection system. The increased resolution effectively mitigates the performance limitations seen at the lower resolution, making the models highly effective for this class.
The other sign class at a lower resolution is a challenging category due to its varied appearance and often small size. The Precision scores show high volatility, with a high of 97.0% for YOLOv8s and a low of 25.8% for YOLOv12m, indicating that some models are highly precise but miss many signs, while others make many false detections. Recall scores are uniformly low, with the best score at only 33.1% (YOLOv8m). This illustrates the models’ difficulty in both identifying and locating this diverse class of objects.
The 1120 × 1280 resolution provides a substantial boost in performance for the other sign class too (see
Figure 12). Both Precision and Recall scores improve significantly, with the highest Precision score reaching 91.6% (YOLOv11m) and the highest Recall score reaching 58.4% (YOLOv8l). This indicates that the higher resolution provides the models with the detail needed to better handle the visual variations of this class, leading to more consistent and accurate detection. The improvement is a testament to the fact that providing more visual context and detail helps the models generalize better to a wider range of object appearances.
In contrast to the previous classes, the car class, being a relatively large and common object, shows solid performance at the 576 × 640 resolution. Precision scores are generally high, ranging from 60.0% to 85.1%, while Recall scores are in the 50–60% range. This suggests that the models are accurate when they make a detection, but a significant portion of cars is still being missed. The YOLOv8s and YOLOv11l models show the best trade-off at this resolution, with F1-scores of 71.6% and 71.5%, respectively.
As demonstrated in
Figure 13, at the higher resolution, the performance for the car class improves across the board. The Precision scores increase to the high 80s, with YOLOv12s reaching a peak of 89.9%, and the Recall scores also see a boost, with YOLOv8l achieving the best Recall at 83.4%. This confirms that even for large, common objects, higher resolution provides additional features that lead to more robust and reliable detection. The improved performance means that models can now both accurately identify a greater number of cars and do so with fewer false positives.
The truck class, similarly to cars, performs well at the lower resolution. Precision scores are consistently high, ranging from 73.3% to 91.3%, while Recall scores are also strong, ranging from 76.3% to 90.0%. This indicates that trucks are generally easy for the models to detect and identify, likely due to their large size and distinct shape. The YOLOv11l model achieves the highest Precision (91.3%), while the YOLOv8l model achieves the highest Recall (90.0%), showcasing a good balance between the two metrics.
With the resolution increase to 1120 × 1280, the models maintain their strong performance (see
Figure 14). Precision and Recall scores remain high, with some models even exceeding 90% in both metrics. Similarly to the lower resolution, the YOLOv11l model achieves the highest Precision (92.4%), and the YOLOv8l model achieves the highest Recall (95.5%). This confirms that for large and well-defined objects, the models’ performance is already at a high level at a lower resolution. However, the higher resolution provides a small but notable boost, making the identifications even more reliable and accurate.
Similarly, the guardrail class demonstrates exceptional performance at the 576 × 640 resolution, with both Precision and Recall scores consistently in the high 90s across all models. It is underlined that the lowest Precision score is 93.5% and the lowest Recall score is 96.1%, both recorded by the YOLOv12s model. These near-perfect scores indicate that the models are highly effective at detecting and localizing guardrails, which are large, continuous, and visually distinct objects. Since the F1-scores range from 94.8% (YOLOv12s) to 96.5% (YOLOv12m), there is virtually no trade-off between Precision and Recall for this class, as the models can both find nearly all instances and make very few false positive detections.
As demonstrated in
Figure 15, at the higher resolution, the performance for guardrails remains outstanding. The scores stay in the high 90s, with the best Precision reaching 97.9% (YOLOv11s) and the best Recall reaching 96.7% (YOLOv8s). F1-scores remain high, with all models exceeding 95% and YOLOv12m reaching a peak of 97.3%. The marginal gains confirm that for objects that are already easily identifiable, the benefit of increased resolution is minimal. The models are already operating at a near-optimal level, demonstrating that a higher resolution is not always necessary for classes with such distinct visual characteristics.
Lastly, similarly to guardrails, the road class exhibits near-perfect performance at the lower 576 × 640 resolution, with all models achieving both Precision and Recall scores of over 98%. This highlights that the models are extremely proficient at identifying the road surface, which is a large, consistent, and easily distinguishable feature in the images. The consistently high scores across all models, regardless of size, indicate that even the most lightweight architectures are sufficient for detecting this class.
The performance for the road class at the 1120 × 1280 resolution remains similarly high as presented in
Figure 16, with both Precision and Recall scores staying in the high 90s. However, while the overall performance remains exceptionally high, the gains from increasing the resolution are negligible. For instance, the best F1-score for the road class at the lower resolution was 99.2% (YOLOv8l), which increased by only 0.1 percentage points to 99.3% (YOLOv12s) at the higher resolution. These results reinforce the finding from the lower-resolution analysis: for easily identifiable, large, and consistent classes, the models’ performance is already at its peak, and increasing the resolution provides negligible additional benefit.