Technologies
  • Article
  • Open Access

18 November 2025

Benchmarking YOLOv8 to YOLOv11 Architectures for Real-Time Traffic Sign Recognition in Embedded 1:10 Scale Autonomous Vehicles

1 Unidad Profesional Interdisciplinaria de Ingeniería Campus Zacatecas, Instituto Politécnico Nacional, Zacatecas 98160, Mexico
2 Unidad Académica de Ingeniería Eléctrica, Universidad Autónoma de Zacatecas, Zacatecas 98160, Mexico
3 Department of Industrial Engineering, Universidad Politécnica de Zacatecas, Plan de Pardillo Sn, Parque Industrial, Fresnillo 99059, Mexico
* Authors to whom correspondence should be addressed.
Technologies 2025, 13(11), 531; https://doi.org/10.3390/technologies13110531
This article belongs to the Special Issue Emerging Paradigms in AI, Autonomous Systems, and Intelligent Technologies

Abstract

Traffic sign recognition is still one of the challenging aspects of intelligent vehicle systems, mainly when processor or memory resources are limited. In this work, real-time traffic sign detection was evaluated using five YOLO model variants—Nano, Small, Medium, Large, and XLarge—across versions 8 to 11. All models were trained and validated with a custom dataset collected in a simulated urban environment designed to replicate FIRA competition tracks. The models were then deployed and tested on a 1:10 scale autonomous vehicle equipped with a mini PC running the detector in real time. Performance was compared using mAP@50–95, F1-score, inference latency, and preprocessing and postprocessing times. The authors also analyzed training behavior, focusing on convergence speed and stopping criteria. The experiments showed that YOLOv10 B achieved the highest performance across varying conditions, while YOLOv8 M provided a better balance between speed and accuracy. These results can help practitioners select appropriate YOLO architectures for embedded traffic sign recognition systems that must operate in real time on resource-constrained autonomous vehicles.

1. Introduction

Traffic sign recognition remains one of the most critical aspects of perception for autonomous driving. Accurate and timely sign recognition is vital for safe driving, particularly when the vehicle travels in varying or confined environments, such as city roads [] or competition tracks based on 1:10-scale autonomous platforms [,], similar to those used in the FIRA Autonomous Cars league [].
Traffic signs are non-verbal cues that help a vehicle's control system make decisions safely and predictably. With the growing popularity of embedded robotic platforms, there is an increasing demand for detection models that are accurate yet lightweight enough to run in real time on low-power devices. In recent years, vision-based perception has shifted toward deep learning detectors. The YOLO (You Only Look Once) family of architectures is among the fastest and most accurate. Recent iterations such as YOLOv8, YOLOv9, YOLOv10, and, most recently, YOLOv11 illustrate how rapidly this architecture is evolving, with changes to the backbone, postprocessing, and computational efficiency. Despite these developments, few studies compare such models under a single consistent protocol. Most evaluations consider only one version and general datasets such as COCO [] or Pascal VOC [], which are less representative of small robotic cars or embedded systems, so their findings may not transfer to practice. Moreover, most previous works report only mAP (mean Average Precision) as the main performance measure []. Although mAP is a useful global metric, it does not capture training overhead, inference delay, or per-class performance, all of which matter when working with real vehicles or time-critical applications. To fill this gap, this work provides a complete benchmark of YOLOv8 to YOLOv11 architectures for Traffic Sign Recognition (TSR), using an on-road dataset processed on low-cost embedded car-borne hardware. The performance of all models was assessed in terms of the following:
  • Detection mAP, which includes both mAP@50–95 and per-class accuracy.
  • Efficiency, evaluated in terms of inference performance.
  • Training cost, measured with respect to convergence and the total running time of training and inference.
  • Error reliability, using the F1-score and the false-positive rate.
The results presented in this work are of interest to researchers and engineers who seek to deploy models reliably on embedded systems such as the TRIGKEY S5. This work also clarifies how algorithmic performance translates into a real-world deployable system, providing insight into the process of selecting the right YOLO model for robots that are tightly constrained by computational limits.

3. Materials and Methods

This section presents the methodology followed to evaluate the performance of recent YOLO architectures for TSR under realistic conditions and constrained hardware. The workflow was structured into three main stages:
  • Custom dataset acquisition and annotation;
  • Model training and offline evaluation;
  • Real-time deployment and performance measurement on embedded hardware.
The experiments were conducted using a 1:10 scale autonomous vehicle on a controlled indoor track, designed to emulate urban driving scenarios. This setup aligns with the specifications of the FIRA RoboWorld Cup–Autonomous Challenge PRO [].

3.1. Dataset Description

We gathered a 9000-image dataset with an Intel RealSense D435i camera mounted at the front of the vehicle (shown in Section 3.4). Data collection took place on a modular test track with intersections, curves, dead ends, and multi-lane segments. Frames were captured through the OpenCV 4.11.0 (Open Source Computer Vision Library) API in Python 3.10 [], in OpenCV's native BGR (Blue–Green–Red) channel order at 8 bits per channel (the reverse of the typical RGB ordering), at a resolution of 640 × 480 pixels and a frame rate of 30 FPS. The vehicle remained stationary during acquisition to keep conditions uniform and controlled across all samples. Six real European road signs were replicated: “Left Turn”, “Straight Ahead”, “Right Turn”, “Dead End”, “No Entry”, and “Stop”. These signs were printed at a 1:10 scale to match the geometries and colors defined in international road signage standards. A graphical sketch of the sign classes can be seen in Figure 1. To model variations in perception, each sign was recorded from three viewing angles (−45°, 0°, and +45°) and two distances (20 cm and 40 cm). The signs were placed at different points of the track layout, shown in Figure 2, simulating important decision points in city navigation (Table 1). A minimal capture sketch is provided after Table 1.
Figure 1. Visual examples of the traffic sign classes used in the dataset: (a) Right Turn, (b) Straight Ahead, (c) Left Turn, (d) No Entry, (e) Dead End, (f) Stop.
Figure 2. Track layout used for data collection. Traffic signs were positioned at key intersections and corners. The environment simulates a reduced-scale urban driving scenario with realistic perception conditions.
Table 1. Technical specifications of the dataset acquisition.
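A minimal sketch of this capture loop is shown below. It assumes the RealSense color stream is exposed as a standard video device through OpenCV; the device index, output folder, and frame count are illustrative, and the actual acquisition scripts may instead rely on the pyrealsense2 SDK.

    # Minimal capture sketch (device index, folder layout, and frame count are illustrative).
    import os
    import cv2  # OpenCV 4.x

    SAVE_DIR = "dataset/raw/stop"                 # one folder per traffic-sign class
    os.makedirs(SAVE_DIR, exist_ok=True)

    cap = cv2.VideoCapture(0)                     # RealSense D435i RGB stream as a UVC device
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
    cap.set(cv2.CAP_PROP_FPS, 30)

    frame_id = 0
    while frame_id < 300:                         # e.g., 10 s of frames at 30 FPS
        ok, frame = cap.read()                    # frames arrive in BGR order, 8 bits per channel
        if not ok:
            break
        cv2.imwrite(os.path.join(SAVE_DIR, f"frame_{frame_id:05d}.png"), frame)
        frame_id += 1

    cap.release()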
The dataset was designed to ensure class balance and to include common visual disturbances such as partial occlusion, reflections, motion blur, and illumination changes; these challenges are often encountered in real-world deployments, but were reproduced here under controlled, repeatable conditions.

3.2. Data Collection and Annotation Protocol

Image capture was conducted using in-house Python 3.10 scripts that recorded frames and saved them into class-wise folders. The images were manually annotated using the CVAT tool [], with bounding boxes drawn around the traffic signs. The six traffic sign classes considered during annotation are given in Table 2 (a minimal split sketch follows the table). Once annotated, the dataset was exported in YOLOv8-compatible format [] and uploaded to the Roboflow platform [] for version control, augmentation (e.g., brightness, contrast, rotation), and structured export. A stratified split was used to generate the training (70%), validation (10%), and test (20%) sets, ensuring balanced representation across all classes.
Table 2. Traffic sign classes used in the dataset.
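The sketch below illustrates a per-class (stratified) split under the assumption of one image folder per class, using the 70/10/20 ratios reported above; the actual split and augmentation were generated through Roboflow, so the paths and file handling here are purely illustrative.

    # Hypothetical stratified split into train/valid/test folders (one subfolder per class).
    import random
    import shutil
    from pathlib import Path

    random.seed(42)
    SRC = Path("dataset/raw")                 # class-wise folders produced during capture
    DST = Path("dataset/split")
    RATIOS = {"train": 0.70, "valid": 0.10, "test": 0.20}

    for class_dir in SRC.iterdir():
        if not class_dir.is_dir():
            continue
        images = sorted(class_dir.glob("*.png"))
        random.shuffle(images)
        n_train = int(RATIOS["train"] * len(images))
        n_valid = int(RATIOS["valid"] * len(images))
        buckets = {
            "train": images[:n_train],
            "valid": images[n_train:n_train + n_valid],
            "test":  images[n_train + n_valid:],
        }
        for split, files in buckets.items():
            out = DST / split / class_dir.name
            out.mkdir(parents=True, exist_ok=True)
            for f in files:
                shutil.copy(f, out / f.name)   # label files would be copied alongside in practice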
For readers who may want to reproduce or verify this study, the data used for training and testing will be shared. All images, labels, and configuration files will be uploaded to a GitHub repository prepared for this work, organized into training, validation, and test folders, together with a short note explaining how the images were captured and annotated. The repository is available at https://github.com/rrevelesm/Traffic-Signs-Dataset-YOLOv8-YOLOv11 (accessed on 20 October 2025).

3.2.1. Training Algorithm Description

All YOLO models (v8 to v11) were trained using the Ultralytics Python framework, with datasets accessed and versioned via the Roboflow API. The training process relied on transfer learning and incorporated early stopping, following the protocol summarized below:
YOLO Training Protocol. The following procedure was applied consistently to all YOLO models in this study (a minimal code sketch follows the list):
  • Access the dataset through the Roboflow API.
  • Select a base model (e.g., yolov8m.pt) for transfer learning.
  • Configure training:
    • Epochs: 200;
    • Learning rate: 0.01;
    • Optimizer: SGD;
    • Image size: 640 × 640;
    • Early stopping: 10 epochs.
  • Perform training with transfer learning.
  • Validate results and export the best checkpoint.
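A minimal sketch of this protocol using the Roboflow and Ultralytics Python APIs is shown below; the API key, workspace, project, and version identifiers are placeholders rather than the study's actual values.

    # Training sketch following the protocol above (Roboflow identifiers are placeholders).
    from roboflow import Roboflow
    from ultralytics import YOLO

    rf = Roboflow(api_key="YOUR_API_KEY")
    dataset = rf.workspace("your-workspace").project("traffic-signs").version(1).download("yolov8")

    model = YOLO("yolov8m.pt")                    # pretrained base model for transfer learning
    model.train(
        data=f"{dataset.location}/data.yaml",
        epochs=200,                               # maximum schedule
        patience=10,                              # early stopping after 10 stagnant epochs
        imgsz=640,
        lr0=0.01,
        optimizer="SGD",
    )
    metrics = model.val()                         # validate the best checkpoint
    print(metrics.box.map)                        # mAP@50-95 on the validation split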
The selected training constants were chosen based on empirical evidence from prior YOLO implementations and benchmark studies []. A training length of 200 epochs was sufficient for convergence without overfitting in all model scales. The learning rate of 0.01 combined with the SGD optimizer provides stable updates on moderately sized datasets and has shown robustness across YOLO variants. An image resolution of 640 × 640 pixels was adopted as a standard compromise between feature richness and computational load, which is particularly relevant for embedded deployments. Finally, early stopping after 10 epochs of non-improvement prevents unnecessary computation during training while ensuring adequate model generalization.
In practice, all YOLO variants were configured for a maximum of 200 epochs, but early stopping typically triggered convergence between 130 and 140 epochs. This clarification ensures consistency between the configured schedule and the actual number of effective training iterations observed in each model.

3.2.2. Hyperparameter Justification

The training strategy used for all YOLO models was determined empirically, based on state-of-the-art detection pipelines and best practices reported in the literature. Most parameters were chosen to balance the number of iterations required for convergence, training stability, and the computational needs of an embedded environment.
The learning rate was 0.01, which is a default value for Stochastic Gradient Descent (SGD). This value is a good balance between fast convergence and stable gradients during training, especially for object detection applications [].
SGD was chosen over adaptive optimizers such as Adam. SGD remains popular for detection tasks due to its lower memory consumption and better generalization when models are deployed in the real world [].
All images were scaled to a fixed resolution of 640 × 640, which is widely used across YOLO variants. This input resolution provides a good trade-off between detection accuracy for small objects (e.g., traffic signs) and low computational load [].
The models were trained for up to 200 epochs, permitting convergence while monitoring for premature overfitting. Early stopping was applied with a patience of 10 epochs: if the validation loss showed no improvement over that window, training was halted. This is a common technique to improve generalization and speed up learning [].
Finally, we used transfer learning from pretrained YOLO checkpoints, enabling the models to transfer knowledge from large-scale datasets such as COCO, which greatly expedited convergence and improved robustness on our smaller domain-specific dataset [].
These values ensured a stable and efficient setup for training all model versions studied in this work.

3.2.3. Preprocessing Protocol

Prior to training, all images were standardized through a common preprocessing pipeline (a minimal sketch follows the list):
  • Normalization: Rescaling pixel values to the [0, 1] range;
  • Augmentation: Random rotation, blur, brightness shift, and translation;
  • Split Strategy: 70% training, 15% validation, 15% test.
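The sketch below mirrors these operations with NumPy and OpenCV under assumed random parameter ranges; in the study, augmentation was applied through Roboflow, which also transforms the bounding boxes, whereas this illustration handles images only.

    # Illustrative normalization and augmentation helpers (parameter ranges are assumptions).
    import cv2
    import numpy as np

    def normalize(img_bgr: np.ndarray) -> np.ndarray:
        """Rescale 8-bit pixel values to the [0, 1] range."""
        return img_bgr.astype(np.float32) / 255.0

    def augment(img_bgr: np.ndarray, rng: np.random.Generator) -> np.ndarray:
        """Random rotation, blur, brightness shift, and translation."""
        h, w = img_bgr.shape[:2]
        angle = rng.uniform(-10, 10)                          # rotation in degrees
        tx, ty = rng.uniform(-0.05, 0.05, size=2) * (w, h)    # translation in pixels
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        M[:, 2] += (tx, ty)
        out = cv2.warpAffine(img_bgr, M, (w, h), borderMode=cv2.BORDER_REFLECT)
        if rng.random() < 0.5:
            out = cv2.GaussianBlur(out, (3, 3), 0)            # mild blur
        beta = rng.uniform(-30, 30)                           # brightness shift
        return cv2.convertScaleAbs(out, alpha=1.0, beta=beta)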

3.3. Evaluated Architectures and Performance Metrics

In total, 20 YOLO models were tested, covering YOLOv8 to YOLOv11 in their five scales: Nano, Small, Medium, Large, and XLarge. This provided a fair comparison, since all architectures were trained under the same conditions (same dataset, same preprocessing, and same measurement protocol). Training environment: All training and validation experiments were performed on a cloud-based high-performance computing platform, RunPod.io, whose specifications are summarized in Table 3. This configuration enabled accelerated convergence, stable runtime, and reliable metric logging across all model variants. A sketch of the benchmark loop is provided after Table 3.
Table 3. Cloud-based training environment specifications.
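The benchmark over all 20 variants can be sketched as a simple loop over pretrained weights, as below. The weight names follow the public Ultralytics releases; note that YOLOv9 and YOLOv10 use slightly different scale letters, so the Nano-to-XLarge naming used in the text maps onto them only approximately.

    # Benchmark-loop sketch (weight names follow public Ultralytics releases).
    from ultralytics import YOLO

    WEIGHTS = [
        "yolov8n.pt", "yolov8s.pt", "yolov8m.pt", "yolov8l.pt", "yolov8x.pt",
        "yolov9t.pt", "yolov9s.pt", "yolov9m.pt", "yolov9c.pt", "yolov9e.pt",
        "yolov10n.pt", "yolov10s.pt", "yolov10m.pt", "yolov10b.pt", "yolov10l.pt",
        "yolo11n.pt", "yolo11s.pt", "yolo11m.pt", "yolo11l.pt", "yolo11x.pt",
    ]

    scores = {}
    for w in WEIGHTS:
        model = YOLO(w)
        model.train(data="dataset/data.yaml", epochs=200, patience=10,
                    imgsz=640, lr0=0.01, optimizer="SGD")
        m = model.val()
        scores[w] = {"mAP50-95": m.box.map, "mAP50": m.box.map50}

    for name, s in sorted(scores.items(), key=lambda kv: -kv[1]["mAP50-95"]):
        print(f"{name}: mAP@50-95={s['mAP50-95']:.4f}  mAP@50={s['mAP50']:.4f}")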
Performance Metrics: Model performance was evaluated with the standard object detection measures. The first is the mean Average Precision (mAP), which summarizes detection accuracy at a given overlap (IoU) between the predicted box and the ground-truth box. Two versions were used in this study. The first, mAP@50, counts a prediction as positive if it overlaps the ground truth by at least 50% IoU. The second, mAP@50–95, is the stricter counterpart, averaging precision over IoU thresholds from 50% to 95% in small steps. The former captures basic detection ability, while the latter rewards detections that remain accurate when boxes must fit very tightly. Both were used to provide a representative picture of detection quality on low-capacity embedded systems. A toy sketch of the IoU criterion and the averaging rule is given below.
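The toy sketch below illustrates the IoU matching criterion and the mAP@50–95 averaging rule; the AP values are placeholders rather than results from this study, and the actual evaluation uses the Ultralytics/COCO implementation.

    # Toy illustration of IoU and of averaging AP over the 0.50-0.95 thresholds.
    import numpy as np

    def iou(box_a, box_b):
        """Intersection over union of two [x1, y1, x2, y2] boxes."""
        x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    # A prediction counts as a true positive at threshold t only if IoU >= t.
    print(iou([10, 10, 50, 50], [20, 20, 60, 60]))     # ~0.39: TP at IoU 0.30, FP at IoU 0.50

    # mAP@50-95 averages AP over IoU thresholds 0.50, 0.55, ..., 0.95 (10 values).
    thresholds = np.arange(0.50, 0.96, 0.05)
    ap_per_threshold = np.array([0.99, 0.99, 0.98, 0.98, 0.97,
                                 0.97, 0.96, 0.95, 0.93, 0.90])   # placeholder AP values
    print(f"{len(thresholds)} thresholds, mAP@50-95 = {ap_per_threshold.mean():.3f}")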
1. Offline Evaluation (Training and Validation Phase)
All YOLO architectures were trained for a maximum of 200 epochs with early stopping, which interrupted training if validation performance did not improve over 10 consecutive epochs; in practice, training typically stopped between 130 and 140 epochs (see Section 3.2.1). The evaluation was performed on the validation and test sets using the following standard metrics (a small numerical sketch follows the list):
  • Precision: The ratio of correctly predicted positive instances (true positives) to all predicted positives (true positives + false positives). It measures the model’s ability to avoid false alarms.
  • Recall: The proportion of true positives detected out of all actual positives (true positives + false negatives). High recall indicates the model detects the most relevant objects.
  • F1-Score: The harmonic mean of precision and recall, offering a single value that balances both metrics. It is useful when classes are imbalanced.
  • mAP@50 (mean Average Precision at 0.50 IoU): Measures the Average Precision across all classes when predicted bounding boxes have at least 50% intersection over union (IoU) with ground truth boxes.
  • mAP@50–95: A stricter version of the previous metric, computed as the average of mAP at IoU thresholds ranging from 0.50 to 0.95 (in steps of 0.05). This offers a more comprehensive evaluation of localization accuracy.
  • False-Positive Rate (FPR): The proportion of negative instances incorrectly labeled as positive. High FPR indicates over-detection or poor precision.
  • False Negative Rate (FNR): The proportion of positive instances not detected by the model. High FNR indicates missed detections or low recall.
  • Inference Time per Image: The average time (in milliseconds) that the model takes to process a single image on a GPU during inference. Lower values are desirable for real-time applications.
  • Frames Per Second (FPS): The number of full images the model can process per second. FPS is inversely proportional to inference time and reflects real-time capability.
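The following worked example uses illustrative counts, not values from this study, to show how these quantities relate; note that true negatives are not naturally defined for detection, so the FPR line assumes a fixed pool of negative samples.

    # Worked example with illustrative counts (tp, fp, fn, tn are assumptions).
    tp, fp, fn, tn = 950, 12, 18, 4020

    precision = tp / (tp + fp)              # correct detections among all detections
    recall    = tp / (tp + fn)              # detected objects among all ground-truth objects
    f1        = 2 * precision * recall / (precision + recall)
    fpr       = fp / (fp + tn)              # false-positive rate
    fnr       = fn / (fn + tp)              # false-negative rate (miss rate)

    inference_ms = 18.5                     # illustrative per-image GPU latency
    fps = 1000.0 / inference_ms             # FPS is the reciprocal of inference time

    print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f} "
          f"FPR={fpr:.4f} FNR={fnr:.4f} FPS={fps:.1f}")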
2. Embedded Inference Evaluation (Real-Time Deployment)
To evaluate onboard feasibility, the best-performing models were deployed on the vehicle’s embedded inference system. Hardware specifications are listed in Table 4.
Table 4. Onboard embedded system specifications.
Real-time inference was conducted using an Intel RealSense D435i depth camera (Intel Corporation, Santa Clara, CA, USA). The following metrics were measured in a ROS-based pipeline (Table 5):
Table 5. Real-time embedded inference evaluation metrics.

3.4. 1:10 Scale Autonomous Vehicle Platform

The onboard system was installed on a 1:10 scale autonomous vehicle designed for research and competition. The architecture is shown in Figure 3 and summarized in Table 6 and Table 7.
Figure 3. Hardware architecture of the 1:10 autonomous vehicle.
Table 6. Core hardware components of the vehicle.
Table 7. Technical specifications of the TRIGKEY S5.

4. Results and Discussion

Results from the evaluation of the different YOLO architectures, ranging from YOLOv8 to YOLOv11, are presented in this section. They are organized into four primary groups, namely overall performance, class-wise precision, computational efficiency, and training cost. Rather than a simple snapshot of which models performed best, this multi-faceted analysis offers better intuition about the trade-offs between accuracy and speed for deployment on embedded systems.

4.1. Overall Performance

We computed mAP@50–95 to evaluate the global performance of each model. Because it averages over both lenient and strict IoU thresholds, it provides a demanding test of each model's detection ability. The pattern in Figure 4 is consistent with this. YOLOv10 B and YOLOv9 S share the best mAP (0.9797 and 0.9796, respectively), indicating that the most recent YOLO variants achieve efficient small-object detection while maintaining speed. Notably, the mid-sized YOLOv8 M model also performs well: it is faster and maintains a good trade-off between accuracy and resource use.
Figure 4. Global mean Average Precision (mAP@50–95) of YOLOv8–YOLOv11 models on the custom traffic sign dataset. Each bar shows the mean detection accuracy over all traffic-sign categories on the validation set. YOLOv10 B obtains the best overall performance, offering the best trade-off between accuracy and computational cost under identical training settings.

4.2. Class-Wise Performance

While the global mAP@50–95 is good to start, per-class performance shows important nuances on how each model processes some particular traffic sign categories. In Figure 5, we show a bar chart of overall mAP@50–95 for each YOLO model variant.
Figure 5. Average mAP@50–95 across all traffic sign classes for each YOLO variant (YOLOv8–YOLOv11, all scales).
The bar plot (Figure 5) provides a statistical summary of the overall detection performance for all YOLO models. It allows a quick visual comparison of the relative accuracy of the models, and supplements fine-grained numerical tables in the Results.
Most models reach very high performance, with detection mAP values above 0.97. Nevertheless, there is greater variance for Classes 0 and 1, which points to specific challenges that may require additional attention.
To investigate this, Figure 6, Figure 7, Figure 8, Figure 9, Figure 10 and Figure 11 summarize the performance per class.
Figure 6. Per-model detection performance for Class 0 (Left Turn). Bars indicate mean Average Precision (mAP@50–95) for each model. Most architectures reach values above 0.94, with YOLOv10B standing out for its strong generalization in visually consistent symbols.
Figure 7. Per-model detection performance for Class 1 (Forward). Lower mAP@50–95 values show that this category remains the most challenging. Visual similarity with Classes 0 and 2 leads to frequent confusion among models, suggesting that additional data augmentation may help.
Figure 8. Per-model performance for Class 2 (Right Turn). The results show larger fluctuations across mid-sized models due to the rotational symmetry of this sign. Specialized augmentation strategies could further improve robustness for these cases.
Figure 9. Per-model performance for Class 3 (Dead End). Almost all YOLO versions achieve very high precision, confirming the robustness of detection for signs with distinctive geometry and strong color contrast.
Figure 10. Per-model performance for Class 4 (No Entry). All architectures exhibit high mAP@50–95, typically above 0.97. This strong performance arises from the clear color separation and stable shape of the sign.
Figure 11. Per-model detection accuracy for Class 5 (Stop). The majority of models reach saturation near 1.0 mAP@50–95, showing that this symbol—with its distinctive red octagon and white text—is easily recognized under various conditions.
Figure 6 shows that Left Turn (Class 0) is relatively easy to identify. While most models surpass 0.94, only a few reach substantially higher values. YOLOv10B again shows strong overall performance; its better generalization is supported by these results, consistent with the relatively homogeneous visual patterns of this class.
According to Figure 7, Class 1 (Forward) emerges as one of the most challenging classes. The lower mAP values obtained by all models indicate that this sign is more susceptible to misclassification, since it visually resembles Classes 0 and 2. This suggests a potential direction for further improving the dataset or the neural architectures.
However, middle-tier models exhibit performance fluctuations for Class 2 (Right Turn), as depicted in Figure 8. This exposes a downside of the rotational symmetry of the sign, which could be mitigated through targeted data augmentation.
In Figure 9, we can see how the models are robust on Class 3 (Dead End). It is notable that almost all architectures reach high accuracy, and this may be related to the unique visual properties of the sign (thick red border and specific layout), which results in a target that can be reliably detected even in non-optimal conditions.
Class 4 (No Entry), presented in Figure 10, is another class with high precision. Detection rates exceed 0.97 for almost all architectures. With its high color contrast and clearly discernible geometry, it provides a stable, easily learned target during training.
Finally, Figure 11 confirms Class 5 (Stop) as one of the top-performing classes. The vast majority of models operate near saturation, suggesting that CNN-based detectors readily learn this prototypical shape–text combination.
Table 8 complements the previous data by providing a more direct quantitative comparison of the three best-performing models, showing the relation between detection precision and inference time under equal experimental conditions.
Table 8. Summary of main detection metrics (mAP@50–95, F1-score, and Inference FPS) for the best-performing YOLO models.

Insights on Per-Class Performance

Per-class analysis reveals a two-mode distribution in which the easy classes (3, 4, and 5) are separated from the difficult ones (0, 1, and 2). The former group has distinctive geometries and colors, while the latter classes are visually ambiguous or symmetric. This suggests that further training gains could come from focusing on harder examples, targeted augmentation, and adaptive loss functions or attention mechanisms.

4.3. Computational Efficiency

Accuracy alone is not sufficient; embedded deployment of detection models requires low latency and fast inference to enable real-time responsiveness and stability. These aspects determine how easily a network can be deployed on power-constrained hardware. This section examines three components of the inference procedure, namely preprocessing, prediction (inference), and postprocessing; a short measurement sketch follows.
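As a sketch of how these three stages can be measured, recent Ultralytics releases expose per-stage timings on each result object; the checkpoint and image paths below are illustrative, and the attribute layout should be verified against the installed version.

    # Per-stage latency sketch (paths are illustrative; verify the .speed attribute
    # for the installed Ultralytics release).
    import cv2
    from ultralytics import YOLO

    model = YOLO("runs/detect/train/weights/best.pt")
    frame = cv2.imread("sample_frame.png")             # a 640x480 BGR frame from the camera

    results = model(frame, imgsz=640, verbose=False)
    speed = results[0].speed                           # {'preprocess', 'inference', 'postprocess'} in ms
    total_ms = sum(speed.values())

    print(f"pre={speed['preprocess']:.1f} ms  "
          f"infer={speed['inference']:.1f} ms  "
          f"post={speed['postprocess']:.1f} ms  "
          f"total={total_ms:.1f} ms (~{1000.0 / total_ms:.1f} FPS)")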
Figure 12 shows the preprocessing time required before input to the model. Preprocessing time is broadly similar across architectures and remains within practical limits for real-time applications. YOLOv8M is the fastest model at both FP32 and INT8 precision, reconfirming its suitability for low-latency systems where front-end processing must be minimized (e.g., edge devices with limited computational power).
Figure 12. Average preprocessing time across YOLOv8–YOLOv11 architectures (ms). This stage includes image normalization, resizing, and data loading operations before inference. Preprocessing time remains below 0.6 s for the compact models YOLOv11N and YOLOv8M, while it goes above 1.3 s for heavier versions such as YOLOv10B and YOLOv8X due to the overhead introduced by tensor reshaping.
Results on inference throughput and latency are depicted in Figure 13, showing notable variances across architectures. YOLOv8N and YOLOv10N come out as the fastest models, with sub-20 ms inference times that ideally fit real-time control loops. At the other extreme, larger models such as YOLOv10L and YOLOv8X take more than twice as long per image, which may make them unsuitable for latency-sensitive applications without GPU acceleration.
Figure 13. Inference time comparison of YOLOv8–YOLOv11 models (ms). YOLOv10N and YOLOv8N achieve sub-second inference, suitable for real-time scenarios. YOLOv8X and YOLOv10L exhibit slower performance, exceeding 7 s per frame. Inference remains the main latency source in embedded deployment.
The postprocessing results are presented in Figure 14. This phase involves filtering detections, thresholding their confidence, and generating output formatted for downstream modules (e.g., the control system). The observation mirrors that of inference latency: smaller models, such as the compact YOLOv11 and YOLOv8 variants, deliver low-latency outputs, while heavy, large models add a long postprocessing tail that may bottleneck control commands requiring agility in changing environments.
Figure 14. Postprocessing time distribution per model (milliseconds). This phase includes confidence filtering and non-maximum suppression (NMS) to refine bounding boxes. Smaller models such as YOLOv10N and YOLOv8S remain under 0.25 s, while larger ones like YOLOv11X and YOLOv9S approach 0.52 s. Overall, postprocessing contributes minimally to total delay compared with inference time.
Together, these three phases describe the computational footprint of each model. In settings where milliseconds count—such as obstacle avoidance and lane keeping—lighter architectures (e.g., YOLOv8M and YOLOv10N) yielded the best speed–accuracy trade-off.

4.4. Error Metrics and Reliability

In safety-critical autonomous driving applications, accuracy alone is not sufficient; systems must also minimize both missed detections and false alarms. To address this, two key metrics are considered: the F1-score and the False-Positive Rate (FPR). The latter quantifies over-detection cases that may cause unnecessary responses of the vehicle.
These two indicators are graphed on a common dual-axis plot in Figure 15, relating correct detections to over-detection. YOLOv10B and YOLOv9S have F1-scores very close to one together with a small number of false alarms, showing a good compromise between accuracy and reliability. YOLOv10M and YOLOv9CS give similar precision but at an increased false-positive rate, which in practice may cause unintended system responses such as sudden deceleration or deviation from the route. These findings highlight the relevance of post-processing procedures, such as threshold optimization, temporal filtering, and ensembling, to reduce false positives. Here, reliability means accuracy, stability, and robustness in an uncertain environment. Larger models achieve high accuracy in controlled scenarios, while shallower architectures tend to need more computation or adaptive filtering before deployment in real-time systems operating in natural environments.
Figure 15. Complementary metrics describing detection reliability. Left: the F1-score for each model, which balances precision and recall. Right: false-positive rate per model. All tested YOLO architectures achieve F1-scores over 0.99 and false-positive rates below 0.015. This demonstrates uniform detection performance across versions, suggesting that even compact models maintain robustness under the same evaluation settings.

4.5. Training Cost

Inference-time accuracy, however, is only half the story. Training cost matters when working at scale or with systems that evolve: models that take excessively long to train, or that fail to converge, are difficult to keep in production when iteration time and compute capacity are limited. We quantify this in two complementary ways: through the final training loss and the time required to reach the stopping criterion.
The final training loss of all model configurations is shown in Figure 16. YOLOv8L obtained an even lower loss, i.e., optimization stabilized thanks to balanced learning dynamics; the architecture appears well suited to this training scheme, as it achieves a loss near zero without overfitting. By contrast, YOLOv10L shows the worst generalization performance, and its high loss suggests that it is considerably underfitted or poorly initialized, likely due to its deep structure. As a newer family, the slow convergence of YOLOv10L suggests that it may require more training epochs or careful hyper-parameter tuning to reach its representational capacity.
Figure 16. Final training loss achieved after convergence. Lower values indicate smoother optimization and stronger generalization. YOLOv10B and YOLOv9S yield the lowest residual losses, while YOLOv8M and YOLOv10L exhibit higher remaining error, suggesting sensitivity to overfitting in larger parameter spaces.
The second component, the time cost of training, is shown in Figure 17, which reveals different temporal profiles across architectures. Smaller variants such as YOLOv8M can be trained end to end in less than 30 min, which is very attractive for rapid prototyping or inclusion in agile pipelines. This enables more iterative design and tuning, which is essential when working with experimental setups or robotic platforms.
Figure 17. Total training time required for each YOLO version under identical experimental conditions. Lightweight models (YOLOv8N, YOLOv10N) complete training in less than one hour, whereas high-capacity models (YOLOv8X, YOLOv10B) exceed 1.2 h. This contrast reflects the computational trade-off between scalability and training efficiency.
By contrast, larger models such as YOLOv10B and YOLOv11N, which achieve better detection mAP, require considerably longer training, accounting for at least 75% of the total training time. This presents an operational bottleneck for groups with limited GPU resources and short development windows. Longer training can additionally delay time-to-deployment or reduce retraining frequency when new data become available.

4.5.1. Summary of Trade-Offs

Computational considerations extend beyond training cost; the choice of model should also be consistent with the actual deployment needs. As summarized in Table 8, YOLOv10B performs best in terms of both accuracy and recall across all categories, but it is also the most demanding model to train. YOLOv8M, by contrast, represents a practical trade-off: it retains competitive recognition performance while requiring significantly fewer FLOPs and less training time. For embedded, headless real-time tasks, this is often what matters most; when stability rather than peak accuracy is the primary concern, a model selected for stability becomes preferable. Balancing accuracy, stability, and training time is therefore essential, reconciling model parsimony with the convenience of scalable AI.
Apart from the offline analysis, a real-time experiment was performed on an embedded device to confirm that the YOLOv10B model performs effectively under deployment. A controlled, semi-realistic environment, designed to replicate practical detection scenarios, was used for the test, which was conducted with a TRIGKEY S5 mini-PC and a static 1:10 scale autonomous vehicle.
Six classes of traffic signs were considered: Left Turn, Forward, Right Turn, Dead End, No Entry, and Stop. Each sign was tested at two distances (20 and 40 cm) and three viewing angles (−45°, 0°, and +45°), accounting for natural variations in perception. The vehicle was kept fixed during the tests in order to isolate the detection phase and its performance indicators. A custom logging module captured per-frame inference data such as FPS, inference time, CPU and RAM usage, and the detected class. Recording was event-locked to timing markers at the start of each measurement interval, allowing tight operator control and repeatable testing conditions (Figure 18).
Figure 18. Experimental setup used for traffic sign recognition. The figure shows a 1:10 scale autonomous vehicle equipped with an Intel RealSense D435i camera mounted on a vertical support. The test track is made of a black surface with white lane markings, and includes printed traffic signs such as “Proceed Forward” and “Stop”, positioned at different locations. The system operates in real time, with detections displayed on the monitor.
During data logging, the log files for each traffic sign class were analyzed. Each row in the CSV file corresponded to a detection instance, and detection metrics were summarized across time. To analyze the real-time performance of the model in different perception contexts, a dedicated 60-s test was performed for each class. In every run, one traffic sign was placed in front of the camera and the detection pipeline ran continuously on the embedded platform. This made it possible to evaluate the response for each sign separately, showing differences in inference time, FPS, CPU and RAM usage, and processor frequency. The following subsections report the class-wise performance obtained for the six tested traffic signs: “Left Turn”, “Forward”, “Right Turn”, “Dead End”, “No Entry”, and “Stop”. Each subsection is accompanied by a set of combined figures that provide an overview of the main real-time timing measurements. A minimal sketch of such a logger is shown below.
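The per-frame logger can be sketched as follows; the checkpoint path, camera index, and column names are assumptions for illustration, and the study's own logging module is not reproduced here.

    # Per-frame logging sketch (checkpoint, camera index, and CSV columns are illustrative).
    import csv
    import time
    import cv2
    import psutil
    from ultralytics import YOLO

    model = YOLO("best.pt")
    cap = cv2.VideoCapture(0)
    t_end = time.time() + 60.0                         # one 60 s run per traffic-sign class

    with open("left_turn_log.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "fps", "inference_ms", "cpu_pct",
                         "ram_mb", "cpu_freq_mhz", "detected_class"])
        while time.time() < t_end:
            ok, frame = cap.read()
            if not ok:
                break
            t0 = time.time()
            result = model(frame, verbose=False)[0]
            infer_ms = (time.time() - t0) * 1000.0
            cls = result.names[int(result.boxes.cls[0])] if len(result.boxes) else "none"
            writer.writerow([time.time(), 1000.0 / infer_ms, infer_ms,
                             psutil.cpu_percent(), psutil.virtual_memory().used / 2**20,
                             psutil.cpu_freq().current, cls])

    cap.release()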

4.5.2. Performance Under the “Left Turn” Signal Exposure

Figure 19 details the system performance during 60 s of exposure to the “Left Turn” traffic sign. The mean inference time stayed around 320 ms (Figure 19d), yielding an average frame rate of about 3.1 FPS (Figure 19a), enough to handle this static detection scenario. CPU load remained around 55%, with occasional peaks of up to 75% (Figure 19b), likely caused by background tasks or memory access polling. RAM usage fluctuated between 6340 and 6390 MB, with a few minor peaks (Figure 19c), indicating consistent memory behavior throughout the test. The CPU frequency initially reached 3600 MHz, then fell to around 3100 MHz (Figure 19e), which is normal for power-constrained embedded platforms. Crucially, the YOLOv10B model detected the “Left Turn” sign in all frames without any misclassifications, verifying strong robustness under controlled conditions with slight angular changes and background light variations.
Figure 19. Performance metrics on detection “Left Turn” traffic sign: (a) FPS vs. Time, (b) CPU Usage vs. Time, (c) RAM Usage vs. Time, (d) Inference Time vs. Time, (e) CPU Frequency vs. Time. For all of the performance measures, the system operates satisfactorily, and its detection robustness is good.

4.5.3. Performance Under “Forward” Signal Exposure

The real-time performance of the system, shown in Figure 20, was measured over 60 s of exposure to the “Forward” traffic sign. The model performed similarly in inference to the previous test, with slight computational variations due to the embedded platform.
Figure 20. Performance parameters for detecting the “Forward” traffic symbol on which it is trained: (a) FPS vs. Time, (b) CPU Usage vs. Time, (c) RAM Usage vs. Time, (d) Inference Time vs. Time, (e) CPU Frequency vs. Time. During real-time embedded testing, the system performed well in terms of detection accuracy and stability.
The mean frame rate was about 3.05 FPS, as shown in Figure 20a, approximately the same as in the previous experiment and acceptable for static detection. CPU usage showed a slight increase over time, ranging from about 52% to 77% (Figure 20b), reflecting momentary computation bursts likely due to inference execution and memory management.
RAM utilization remained stable, with small oscillations between 6340 and 6395 MB (Figure 20c), and inference times were consistent around 327 ms (Figure 20d), demonstrating effective pipeline activation. The CPU frequency gradually decreased from 3600 MHz to about 3100 MHz (Figure 20e), consistent with thermal balancing on embedded platforms.
The system identified the “Forward” sign in every frame during the test without any misclassification. This finding verifies that the YOLOv10B model is able to recognize frontal signs even when they turn slightly from the camera angle or when the distance changes.

4.5.4. Performance Under “Right Turn” Signal Exposure

Figure 21 displays the system performance during the 60-s exposure to the “Right Turn” traffic sign. The model maintained reliable detection, with stable inference behavior throughout the test.
Figure 21. Performance metrics during detection of the “Right Turn” traffic sign: (a) FPS vs. Time, (b) CPU Usage vs. Time, (c) RAM Usage vs. Time, (d) Inference Time vs. Time, (e) CPU Frequency vs. Time. The system maintained real-time detection with high reliability and hardware stability.
The average frame rate held around 3.07 FPS (Figure 21a), while CPU utilization started near 53% and exhibited a steady rise to a peak of approximately 78% (Figure 21b). These values are consistent with previous evaluations and reflect typical usage patterns of the TRIGKEY S5 mini-PC under neural inference load.
RAM consumption remained confined to the 6342–6394 MB range (Figure 21c), confirming efficient memory usage. Inference times hovered around 326 ms (Figure 21d), with minimal jitter, contributing to the consistent FPS output. The CPU frequency graph (Figure 21e) reveals the expected thermal throttling behavior, decreasing from approximately 3600 MHz to 3100 MHz during the test.
Throughout the experiment, the YOLOv10B model successfully detected the “Right Turn” sign in all frames, with no confusion or false positives reported. This further reinforces the model’s robustness when identifying direction-based symbols, which are critical for autonomous navigation tasks.

4.5.5. Performance Under “Dead End” Signal Exposure

Figure 22 illustrates the system behavior during the 60-s exposure to the “Dead End” traffic sign. The YOLOv10B model performed reliably, maintaining consistent inference and resource usage throughout the trial.
Figure 22. Performance metrics during detection of the “Dead End” traffic sign: (a) FPS vs. Time, (b) CPU Usage vs. Time, (c) RAM Usage vs. Time, (d) Inference Time vs. Time, (e) CPU Frequency vs. Time. Stable hardware utilization and consistent detection were observed.
The frame rate remained steady, averaging 3.07 FPS across the test period (Figure 22a). CPU usage showed minor oscillations but generally remained within 52–76% (Figure 22b), demonstrating stable processing demand. RAM usage was also consistent, fluctuating slightly between 6342 and 6392 MB (Figure 22c), indicating efficient memory allocation during inference.
Inference time hovered around 326 ms (Figure 22d), which aligned with other sign evaluations, while CPU frequency followed the expected decay pattern from 3600 MHz to approximately 3100 MHz (Figure 22e), influenced by power and thermal regulation.
Detection performance was flawless, with every frame correctly classified as “Dead End” and no evidence of label confusion. These results emphasize the high visual saliency of this sign and the model’s ability to generalize reliably under variations in angle and distance, making it a dependable cue in real-world navigation tasks.

4.5.6. Performance Under “No Entry” Signal Exposure

Figure 23 presents the system behavior while detecting the “No Entry” sign. Despite being visually similar to the “Dead End” sign in color and shape, the model maintained high detection accuracy with no misclassifications reported.
Figure 23. Performance metrics during detection of the “No Entry” traffic sign: (a) FPS vs. Time, (b) CPU Usage vs. Time, (c) RAM Usage vs. Time, (d) Inference Time vs. Time, (e) CPU Frequency vs. Time. Results confirm stable inference and precise classification despite visual similarity to other classes.
The frame rate averaged 3.05 FPS (Figure 23a), consistent with previous tests. CPU usage fluctuated slightly more during this exposure, ranging from 53% to 78% (Figure 23b), likely reflecting brief background tasks or variance in frame complexity. RAM usage stayed within a narrow band from 6344 to 6394 MB (Figure 23c), confirming stable memory handling.
Inference time held steady at an average of 327 ms (Figure 23d), matching the expected computational latency for this architecture. CPU frequency displayed the typical thermal regulation pattern, slowly dropping from 3600 MHz to 3080 MHz (Figure 23e).
Overall, this class showed consistent, high-precision results even under angle and distance variation, reinforcing the model’s robustness against intra-class confusion and confirming its reliability for detecting regulatory signs in embedded deployments.

4.5.7. Performance Under “Stop” Signal Exposure

The system’s behavior while detecting the “Stop” sign is shown in Figure 24. As one of the most visually distinctive and universally recognized signs, “Stop” proved to be easily detectable by the model, with 100% classification accuracy and no mislabeling events during the entire exposure period.
Figure 24. Performance metrics during detection of the “Stop” traffic sign: (a) FPS vs. Time, (b) CPU Usage vs. Time, (c) RAM Usage vs. Time, (d) Inference Time vs. Time, (e) CPU Frequency vs. Time. Results show consistent behavior and perfect classification accuracy, affirming the model’s capability to detect highly distinctive signs.
The average frame rate remained stable at 3.1 FPS (Figure 24a), maintaining real-time responsiveness. CPU utilization ranged between 54% and 77% (Figure 24b), a typical load level for this pipeline. RAM consumption showed a tight envelope, varying only between 6341 and 6395 MB (Figure 24c), which supports the system’s memory stability and efficient memory management.
Inference time fluctuated minimally around an average of 319 ms (Figure 24d), confirming that the detection process remains within acceptable limits for static deployment. As in other tests, CPU frequency began near 3600 MHz and declined gradually toward 3075 MHz (Figure 24e), reflecting thermally regulated throttling typical of embedded hardware under sustained load.
This experiment confirms that the “Stop” sign, due to its strong visual contrast and distinct geometry, is one of the most robustly classified categories. Its detection was unaffected by angle or distance shifts, reinforcing the model’s capacity for accurate recognition under varying perceptual conditions.

4.5.8. Summary and Implications

The offline analysis translated only partially to the physical deployment of the YOLOv10B model on the embedded platform, and the deployment also uncovered finer aspects that are useful in practice. System stability was maintained across all traffic sign types, with no severe frame rate drops or inference degradation. A baseline frame rate of 3–4 FPS was consistently preserved, constrained in part by the modest capabilities of the TRIGKEY S5 mini-PC.
Robustness was very similar across the two distances and the tested rotations for some classes, notably “Left Turn” and “Stop”. These results are consistent with the idea that visually salient signs generalize well to new conditions. By contrast, classes such as “Forward” and “Right Turn” showed slightly larger confidence drops (brief detection drop-outs), particularly when the sign was presented at oblique rather than frontal angles; this is likely attributable to intra-class ambiguity and the lower variability of such views in the training data.
A critical goal of the onboard testing was to verify that the model could run the network in real time without crashing or being slowed excessively by thermal throttling. The frequency throttling observed under sustained high CPU utilization suggests that some form of active thermal management may be needed for long-duration missions or outdoor operation.
Overall, this on-vehicle performance evaluation closes the gap between laboratory benchmarking and real deployment, and it outlines both the successes and the limitations of our implementation. The experimental results indicate that YOLOv10 B is fit for low-speed, semi-structured driving tasks, while class balance and environment generalization remain aspects to optimize in future iterations.

4.5.9. Power Consumption of Real-Time Signal Alternation and Processing

A continuous experiment on the dynamic performance of the system was carried out for 180 s, wherein every 36 s one of the six traffic signs was shown: Left Turn, Forward, Right Turn, Dead End, No Entry, and Stop. The goal was to observe system behavior under different visual stimuli and to see whether the energy profile changed. The CPU power consumption (in watts) over the course of the experiment is shown in Figure 25. The TRIGKEY S5 device maintained a baseline power of 4.52 W, which can be taken as the steady-state consumption under YOLOv10 B inference load. The most interesting features were the sudden peaks at seconds 54, 108, and 162, coinciding with sign changes. These peaks correspond to short periods of high CPU usage, likely triggered by the transition to a newly detected class or by internal recalibration of the detection pipeline and its logging. Even in the worst case, with low-confidence or distant samples, power remained below 4.64 W, suggesting that the computational footprint of the model is largely insensitive to class changes. Most importantly, no power anomalies or sustained power plateaus were observed, indicating that the framework remained stable under varying recognition loads. These results support the effective deployment of YOLOv10 B models in embedded systems where multiple traffic signs may be presented in sequence.
Figure 25. Power consumption of the embedded system during a continuous 180 s experiment (watts).

5. Discussion

The comparison results obtained in this study demonstrate that YOLOv10 B is the best choice for the traffic sign task on small devices, giving one of the best trade-offs between accuracy and speed when system resources are limited. Many recent works pursue the same goal of designing lightweight models that keep good accuracy with less computation [,].
However, the majority of these methods are only evaluated on public datasets and not applied to real hardware. This study was designed because it is not possible to extrapolate the results of YOLO architectures trained and tested on large datasets to small embedded platforms. Differences in sensor placement, limited computation, and environmental conditions require a controlled and reproducible environment to assess real-time performance in small autonomous vehicles. The objective of this benchmark is therefore not to generalize to full-scale systems, but to measure and understand how modern YOLO-based models perform under the resource and perception constraints of 1:10-scale environments.
In our setup, all models were deployed on a 1:10 scale autonomous car driven by a TRIGKEY S5 mini-PC. This framework let us observe what actually happens when the detector operates in real time. For YOLOv10 B, accuracy remained high (over 0.98 in mAP@50–95) during training and around 0.97 for the F1-score when running directly on the device. These numbers matter because few studies report how temperature or energy use evolves once the model is implemented on a real board [].
Compared with methods that introduce additional layers or attention blocks to enhance small-object detection, YOLOv10 B retains nearly the same accuracy without adding new complexity. This suggests that the base architecture is already strong enough to be viable on edge devices. Similar conclusions have been drawn in broader benchmarks comparing multiple YOLO generations from v5 to v11 [].
In the on-vehicle tests, YOLOv10 B maintained solid detection for all six traffic sign types. The “Forward” and “Right Turn” signs showed small drops in confidence (and in FPS), probably because they look similar in some low-light frames. Nonetheless, the overall performance remained steady.
We also monitored power and heat on the TRIGKEY S5. The device remained cool, and the processor only throttled after very long continuous tests. Although not critical, long-duration runs would likely be safer with a small passive cooler.

6. Conclusions

We provide an extensive analysis of state-of-the-art YOLO-based methods, with a particular emphasis on their use for embedded traffic sign detection on a downsized autonomous vehicle platform. Among the twenty tested models, YOLOv10 B proved the most balanced, achieving the highest mAP@50–95 and F1-score together with real-time inference (>20 FPS), while keeping acceptable thermal and energy footprints on a TRIGKEY S5 mini-PC. In comparison with existing methods that require modifications to obtain good small-object predictions, YOLOv10 B shows that its original design can reach the high accuracy and low latency reported in other works without complicating the detector more than necessary, making it a better choice for embedded applications.
In addition, by combining offline training, real-time deployment, and system-level analyses, our work closes the gap between what theoretical models can accomplish and the actual needs of robot designs.
In contrast to previous works that use only public datasets or simulated benchmarks [,,,,,], our system offers a reproducible pipeline from dataset collection and model training to evaluation under realistic deployment constraints.
Prospects for future development include augmenting the training dataset with difficult conditions (e.g., glare, dirt, and oblique views), passive thermal optimization to reduce heat build-up, and model pruning strategies for long-term stability. Finally, this work provides an experimental framework within which other state-of-the-art object detectors, such as the Single Shot MultiBox Detector (SSD) and the Faster Region-based Convolutional Neural Network (Faster R-CNN), can also be adapted and compared for embedded-scale evaluation.
These can be competitive on embedded systems for real-time traffic sign recognition when a fair benchmarking framework is used and the detector is properly integrated into an end-to-end pipeline.

7. Limitations and Future Work

The testing experiments show that the YOLOv10 B model can recognize traffic signs with high accuracy on an embedded device in real time. However, there are still practical and methodological limitations, each of which suggests a direction for improvement.
The main limitation lies in the experimental setup. All experiments were conducted with the vehicle static, which yields a stable frame rate, temperature, and detection accuracy. Although this makes the analysis straightforward, it cannot reproduce actual driving conditions. When driving, the camera is subject to vibration, motion blur, and fast illumination changes, which can lower confidence or lead to misclassification.
A second limitation concerns the training data. Most of the gathered images were captured indoors or under soft illumination, with clean and clearly visible traffic signs. In the real world, road signs are often dirty, bent, or partially obscured and must be read from various angles and distances. Widening the dataset to contain outdoor photographs under various lighting and weather conditions would likely increase robustness and generality in unseen environments.
Another constraint comes from system integration. The detector was tested on the TRIGKEY S5 as a stand-alone application and was not integrated into a complete control framework such as ROS 2. For this reason, perception-to-action (P2A) latency was not quantified. In practical autonomous systems, this lag is important because the vehicle must move while detection, decision, and motion control take place.
Thermal performance was also an issue. Over time, the CPU frequency decreased slowly as heat built up in the TRIGKEY S5. While the impact was negligible in short experiments, long-duration tests would require additional cooling or better thermal design to keep the inference speed constant.
To normalize for the given hardware, we reported efficiency as FPS/W. This parameter provides an architecture-neutral measure of computing capability that allows comparison across embedded platforms. For a 15 W power envelope, the effective throughput of YOLOv10 B amounted to around 1.17 FPS/W. Similar ratios can be computed for other models, enabling cross-platform comparison and design optimization.
While such limitations are apparent, they also serve as a guide to what should naturally be pursued in the next step:
  • Dynamic testing: Experiment with the vehicle in motion to investigate how real-world conditions (vibration, motion blur, steering) influence recognition performance.
  • ROS 2 integration: Integrate the YOLOv10 B pipeline into a ROS 2 ecosystem to measure whole-system latency and the coupling between perception and control timing.
  • Dataset augmentation: Collect more training data under different environmental conditions, such as rain, glare, and partial occlusion, to improve the generalization ability of the model.
  • Thermal and model profiling: Characterize the power-efficiency versus inference-speed trade-off by adding lightweight cooling or applying model optimization (e.g., pruning, quantization).
Combined, these enhancements serve as a bridge between laboratory validation and long-term deployment in autonomous vehicles on public roads.
Environmental limitations:
Although the tested models performed very well when trained and evaluated under ideal conditions, their accuracy can decrease under poor or adverse weather.
Reflections and partial occlusions caused by rain, splashes, or hail increase the rate of false positives, as documented by Wiseman [45].
Our future work will therefore explore adaptive filtering and multimodal sensor fusion (e.g., LiDAR depth masking and temporal smoothing) to mitigate these issues and keep detection reliable under adverse outdoor conditions.
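To make the temporal-smoothing idea concrete, the sketch below applies an exponential moving average to per-class confidences so that a sign detected in only one frame (e.g., a reflection) is suppressed, while a sign detected consistently across frames is reported; the class names, smoothing factor, and threshold are illustrative assumptions.

# Sketch of temporal smoothing over per-frame detections (illustrative values).
from collections import defaultdict

class TemporalSmoother:
    def __init__(self, alpha: float = 0.3, threshold: float = 0.5):
        self.alpha = alpha          # weight given to the newest frame
        self.threshold = threshold  # smoothed confidence required to report a sign
        self.scores = defaultdict(float)

    def update(self, detections: dict) -> list:
        """detections maps class name -> best confidence in the current frame."""
        for cls in set(self.scores) | set(detections):
            new = detections.get(cls, 0.0)
            self.scores[cls] = self.alpha * new + (1.0 - self.alpha) * self.scores[cls]
        return [c for c, s in self.scores.items() if s >= self.threshold]

# A one-frame "stop" (e.g., a reflection) never crosses the threshold,
# while a persistent "speed_30" is reported after a few frames.
smoother = TemporalSmoother()
for dets in [{"stop": 0.9}, {}, {"speed_30": 0.9}, {"speed_30": 0.9}, {"speed_30": 0.9}]:
    print(smoother.update(dets))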
Integration with intelligent infrastructure:
Recent work has investigated combining vision-based detection with GPS and mobile sensing for continuous assessment of traffic-sign condition (Karasneh et al. [46]).
This direction is consistent with our aim to extend the current detection framework to automatic condition assessment and maintenance scheduling.
Combined with geo-referenced information, the framework could automatically identify damaged or occluded signs, supporting intelligent infrastructure and automated road-safety management.

Author Contributions

Conceptualization, R.R.-M. and J.M.C.-P.; methodology, R.R.-M.; software, R.R.-M.; validation, R.R.-M., H.G.-R., E.S.-F., J.S.-P., T.I.-P., L.C.R.-G., O.A.G.-B., J.I.G.-T., C.E.G.-T. and H.L.-G.; formal analysis, R.R.-M.; investigation, R.R.-M.; resources, R.R.-M. and J.M.C.-P.; data curation, R.R.-M., H.G.-R., E.S.-F., J.S.-P., T.I.-P., L.C.R.-G., O.A.G.-B., J.I.G.-T., C.E.G.-T. and H.L.-G.; writing–original draft preparation, R.R.-M.; writing–review and editing, R.R.-M. and J.M.C.-P.; visualization, R.R.-M.; supervision, J.M.C.-P.; project administration, J.M.C.-P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The authors would like to thank the Secretaría de Investigación y Posgrado (SIP-IPN) and the Instituto Politécnico Nacional (IPN) for their institutional support. We also acknowledge the Universidad Autónoma de Zacatecas (UAZ) for providing the research facilities and resources that made this work possible.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, X.; Guo, J.; Yi, J.; Song, Y.; Xu, J.; Yan, W.; Fu, X.; Wang, X.; Guo, J.; Yi, J.; et al. Real-Time and Efficient Multi-Scale Traffic Sign Detection Method for Driverless Cars. Sensors 2022, 22, 6930. [Google Scholar] [CrossRef]
  2. Babu, V.S.; Behl, M. ROS F1/10 Autonomous Racecar Simulator. In Proceedings of the ROSCon 2019, Macau, China, 31 October–1 November 2019. [Google Scholar] [CrossRef]
  3. O’Kelly, M.; Saha, A.; Babu, V.S.; Mangharam, R. F1TENTH: An Open-source Evaluation Environment for Continuous Control Reinforcement Learning. In Proceedings of the Machine Learning Research, PMLR, Virtual, 13–18 July 2020; Volume 123. [Google Scholar]
  4. FIRA RoboWorld Cup. Autonomous Cars League—Official Rules and Track Layouts. 2025. Available online: https://firaworldcup.org/leagues/fira-challenges/autonomous-cars/ (accessed on 18 October 2025).
  5. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
  6. Everingham, M.; Gool, L.V.; Williams, C.K.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  7. Terven, J.; Córdova-Esparza, D.M.; Romero-González, J.A. A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716. [Google Scholar] [CrossRef]
  8. Batool, A.; Nisar, M.W.; Shah, J.H.; Rehman, A.; Sadad, T. IELMNet: An Application for Traffic Sign Recognition using CNN and ELM. In Proceedings of the 2021 International Conference on Artificial Intelligence and Data Analytics (CAIDA), Riyadh, Saudi Arabia, 6–7 April 2021. [Google Scholar] [CrossRef]
  9. Sun, W.; Du, H.; Zhang, X.; He, X. Traffic Sign Recognition Method Integrating Multi-Layer Features and Kernel Extreme Learning Machine Classifier. Comput. Mater. Contin. 2019, 60, 147–161. [Google Scholar] [CrossRef]
  10. Fleyeh, H.; Roch, J. Benchmark Evaluation of Hog Descriptors as Features for Classification of Traffic Signs. Int. J. Traffic Transp. Eng. 2013, 3, 448–464. [Google Scholar] [CrossRef] [PubMed]
  11. Liu, C.; Chang, F.; Chen, Z. High Performance Traffic Sign Recognition Based on Sparse Representation and SVM Classification. In Proceedings of the 2014 10th International Conference on Natural Computation (ICNC), Xiamen, China, 19–21 August 2014. [Google Scholar] [CrossRef]
  12. Wei, Z.; Xia, J. Class-Imbalanced Traffic Sign Recognition Based on Improved YOLOv7. In Proceedings of the International Conference on Image, Signal Processing, and Pattern Recognition (ISPP 2023), Changsha, China, 24–26 February 2023. [Google Scholar] [CrossRef]
  13. Gong, F.M.; Li, H.J. Traffic Sign Detection and Pattern Recognition Based on Binary Tree Support Vector Machines. Adv. Mater. Res. 2011, 204–210, 1394–1398. [Google Scholar] [CrossRef]
  14. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar] [CrossRef]
  15. Chaman, M.; Khan, I.; Noor, M. A Real-Time Vehicle Detection System for ADAS in Autonomous Vehicles Using YOLOv11 Deep Neural Network on Embedded Edge Platforms. Eng. Technol. Appl. Sci. Res. 2025, 15, 12138. [Google Scholar] [CrossRef]
  16. Ali, M.L.; Zhang, Z. The YOLO Framework: A Comprehensive Review of Evolution, Applications, and Benchmarks in Object Detection. Computers 2024, 13, 336. [Google Scholar] [CrossRef]
  17. Zang, D.; Wei, Z.; Bao, M.; Cheng, J.; Zhang, D.; Tang, K.; Li, X. Deep Learning–based Traffic Sign Recognition for Unmanned Autonomous Vehicles. Proc. Inst. Mech. Eng. Part I J. Syst. Control Eng. 2018, 232, 095965181875886. [Google Scholar] [CrossRef]
  18. Gong, C.; Li, A.; Song, Y.; Xu, N.; He, W. Traffic Sign Recognition Based on the YOLOv3 Algorithm. Sensors 2022, 22, 9345. [Google Scholar] [CrossRef] [PubMed]
  19. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  20. Jocher, G. YOLOv5 by Ultralytics. 2023. Available online: https://github.com/ultralytics/yolov5 (accessed on 17 September 2025).
  21. Lai, H.; Chen, L.; Liu, W.; Yan, Z.; Ye, S. STC-YOLO: Small Object Detection Network for Traffic Signs in Complex Environments. Sensors 2023, 23, 5307. [Google Scholar] [CrossRef]
  22. Zhang, H.; Liang, M.; Wang, Y. YOLO-BS: A traffic sign detection algorithm based on YOLOv8. Sci. Rep. 2025, 15, 1–11. [Google Scholar] [CrossRef]
  23. Jiang, L.; Zhan, P.; Bai, T.; Yu, H.; Member, S. YOLO-CCA: A Context-Based Approach for Traffic Sign Detection. arXiv 2024, arXiv:2412.04289. [Google Scholar]
  24. Zou, Y.; Liu, S. Small object detection algorithm based on improved YOLOv10 for traffic sign. Transp. Res. Interdiscip. Perspect. 2025, 32, 101501. [Google Scholar] [CrossRef]
  25. Li, R.; Chen, Y.; Wang, Y.; Sun, C. YOLO-TSF: A Small Traffic Sign Detection Algorithm for Foggy Road Scenes. Electronics 2024, 13, 3744. [Google Scholar] [CrossRef]
  26. Zhu, Z.; Liang, D.; Zhang, S.; Huang, X.; Li, B.; Hu, S.; Dai, D.; Van Gool, L. Traffic-Sign Detection and Classification in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2110–2118. [Google Scholar] [CrossRef]
  27. Zhang, J.; Zou, X.; Kuang, L.D.; Wang, J.; Sherratt, R.S.; Yu, X. CCTSDB 2021: A More Comprehensive Traffic Sign Detection Benchmark. Hum. Centric Comput. Inf. Sci. 2022, 12, 23. [Google Scholar] [CrossRef]
  28. Jiang, T.; Zhong, Y. ODVerse33: Is the New YOLO Version Always Better? A Multi-Domain Benchmark from YOLO v5 to v11. arXiv 2025, arXiv:2502.14314. [Google Scholar] [CrossRef]
  29. Zhu, Y.; Yan, W.Q. Traffic Sign Recognition Based on Deep Learning. Multimed. Tools Appl. 2022, 81, 17779–17791. [Google Scholar] [CrossRef]
  30. Abu Mangshor, N.N.; Aida Paudzi, N.P.; Ibrahim, S.; Sabri, N. A Real-Time Malaysian Traffic Sign Recognition Using YOLO Algorithm. In Proceedings of the 12th National Technical Seminar on Unmanned System Technology 2020; Springer: Berlin/Heidelberg, Germany, 2021. [Google Scholar] [CrossRef]
  31. Cao, J.; Song, C.; Peng, S.; Xiao, F.; Song, S. Improved Traffic Sign Detection and Recognition Algorithm for Intelligent Vehicles. Sensors 2019, 19, 4021. [Google Scholar] [CrossRef]
  32. Venkatesh, K.; Sri, K.; Krishna, K.; Kanta, M.M. Traffic Sign Detection and Recognition Using Deep Learning. International J. Sci. Res. Eng. Manag. 2023. [Google Scholar] [CrossRef]
  33. Fang, H.; Cao, J.; Li, Z.Y. A Small Network MicronNet-BF of Traffic Sign Classification. Comput. Intell. Neurosci. 2022, 2022, 3995209. [Google Scholar] [CrossRef]
  34. Ezzahra, K.F. Comparative Analysis of Transfer Learning-Based CNN Approaches for Recognition of Traffic Signs in Autonomous Vehicles. E3S Web Conf. 2023, 412, 01096. [Google Scholar] [CrossRef]
  35. Kozhamkulova, Z. Development of Deep Learning Models for Traffic Sign Recognition in Autonomous Vehicles. Int. J. Adv. Comput. Sci. Appl. 2024, 15, 0150593. [Google Scholar] [CrossRef]
  36. Zaibi, A.; Ladgham, A.; Sakly, A. A Lightweight Model for Traffic Sign Classification Based on Enhanced LeNet-5 Network. J. Sens. 2021, 18, 8870529. [Google Scholar] [CrossRef]
  37. Committee, F.R.C. FIRA RoboWorld Cup 2024—Autonomous Car Challenge. 2024. Available online: https://firaworldcup.org/leagues/fira-challenges/autonomous-cars/ (accessed on 16 September 2025).
  38. Bradski, G. OpenCV Library. 2000. Available online: https://opencv.org/ (accessed on 18 September 2025).
  39. Intel. CVAT: Computer Vision Annotation Tool. 2020. Available online: https://www.cvat.ai (accessed on 16 September 2025).
  40. Jocher, G. YOLOv8: Cutting-Edge Object Detection Models by Ultralytics. 2023. Available online: https://docs.ultralytics.com (accessed on 16 September 2025).
  41. Roboflow, I. Roboflow: Organize, Label, and Prepare Computer Vision Datasets. 2022. Available online: https://roboflow.com (accessed on 16 September 2025).
  42. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the ECCV, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  43. Prechelt, L. Early stopping—But when? In Neural Networks: Tricks of the Trade; Springer: Berlin/Heidelberg, Germany, 1998; pp. 55–69. [Google Scholar]
  44. Pan, S.J.; Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 1345–1359. [Google Scholar] [CrossRef]
  45. Wiseman, Y. Real-Time Monitoring of Traffic Congestions. In Proceedings of the IEEE International Conference on Electro Information Technology (EIT), Lincoln, NE, USA, 14–17 May 2017; pp. 501–505. [Google Scholar]
  46. Karasneh, M.A.; Manasreh, D.; Matouq, Y.; Berner, W.C.C.; Nazzal, M.D. An Artificial Intelligence Driven Approach for Real-Time Detection of Traffic-Sign Deficiencies. Transp. Res. Rec. 2025, 2679, 917–927. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
