5.1. Dataset Processing
The experiments use the TT100K dataset, which comprises over 100,000 images covering diverse weather conditions, illumination variations, and road scenarios. Consistent with the data preprocessing strategies widely adopted in the literature [22,23,24,25], this study focuses on the 45 categories that each contain more than 100 instances. This filtering ensures that each evaluated category has sufficient samples for robust feature learning and statistically valid evaluation. The filtered experimental dataset comprises 9738 images and 24,212 annotated targets. As illustrated in Figure 5, several major categories, such as pn, pne, i5, p11, pl40, and pl50, each contain more than 1000 instances.
Figure 6 illustrates the density distribution of small targets within the filtered dataset. The vast majority of bounding boxes occupy less than 0.0025 of the total image area, indicating a strong prevalence of small targets. Finally, the dataset is partitioned into training, validation, and test sets at a ratio of 7:2:1, using stratified random sampling across the 45 categories to keep the category distribution consistent among the three sets.
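The preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the experiments: the function names are hypothetical, and each sample is assumed to carry a single category label, whereas real TT100K images may contain several signs and would require a rule for choosing the stratification label.

```python
import random
from collections import defaultdict


def keep_frequent(instance_counts, min_instances=100):
    """Return the categories with MORE than `min_instances` annotated instances."""
    return {cat for cat, n in instance_counts.items() if n > min_instances}


def is_small_target(box_w, box_h, img_w, img_h, max_ratio=0.0025):
    """True if the box covers less than `max_ratio` of the image area."""
    return (box_w * box_h) / (img_w * img_h) < max_ratio


def stratified_split(samples, ratios=(7, 2, 1), seed=0):
    """Split (sample_id, category) pairs into train/val/test per category.

    Assumption: one category label per sample (hypothetical simplification).
    """
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for sample_id, cat in samples:
        by_cat[cat].append(sample_id)

    total = sum(ratios)
    train, val, test = [], [], []
    for cat in sorted(by_cat):  # sorted for reproducibility
        ids = by_cat[cat]
        rng.shuffle(ids)
        n_train = round(len(ids) * ratios[0] / total)
        n_val = round(len(ids) * ratios[1] / total)
        train += ids[:n_train]
        val += ids[n_train:n_train + n_val]
        test += ids[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```

Splitting inside each category, rather than over the pooled samples, is what keeps the 7:2:1 ratio approximately intact for rare categories as well as frequent ones.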
5.3. Comparison of C3k2 Improvement Modules
To evaluate the effectiveness of the improvements made to the C3k2 module, comparative experiments involving different improvement strategies were conducted using YOLO11n as the baseline model. The experimental results are presented in Table 2.
As shown in Table 2, upon introducing the EMA mechanism, the number of model parameters increases by 1.49 × 10⁶. Although the precision (P) decreases slightly by 0.5 points, the recall (R) and mAP@0.5 increase by 0.6 and 0.2 points, respectively. This indicates that while EMA assists in the recall of small targets to some extent, it introduces a substantial increase in the parameter count. When incorporating PConv, the parameter count increases by 0.96 × 10⁶, while P, R, and mAP@0.5 improve by 0.8, 1.7, and 0.7 points, respectively. This demonstrates that PConv enhances detection performance more effectively than EMA, and at a smaller parameter cost, consistent with its aim of reducing computational redundancy.
To further examine the impact of straightforward module stacking, we introduced a simple fusion configuration (+PConv + EMA) without any cross-guidance mechanism. In this setup, features pass sequentially through PConv and a full-channel EMA module. While this independent combination yields an mAP@0.5 of 78.8%, the parameter count rebounds to 3.59 × 10⁶ because the EMA module processes all feature channels.
Finally, by systematically integrating EMA and PConv through our proposed Cross-Branch Guidance mechanism, the C3k2_CGPEMA module is designed. Instead of independent processing, the EMA module in CGPEMA acts only on the 1/4 processing branch and uses zero-parameter element-wise masking to modulate the remaining 3/4 identity branch. Compared to the simple fusion (+PConv + EMA), the proposed CGPEMA achieves higher accuracy metrics (P, R, and mAP@0.5 of 80.7%, 71.2%, and 79.2%, respectively) while reducing the parameter count to 3.56 × 10⁶. These metrics represent improvements of 1.3, 2.5, and 1.4 points over the baseline model, respectively, demonstrating that its overall performance surpasses that of the schemes employing either EMA or PConv individually.
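To make the parameter argument concrete, the following sketch counts the weights of a square convolution applied to all channels versus only the 1/4 processing branch. It is an illustrative calculation only: the channel width of 64 and the 3 × 3 kernel are assumed values, biases are ignored, and the helper `conv_weights` is hypothetical rather than part of the proposed module.

```python
def conv_weights(c_in, c_out, k):
    """Number of weights in a k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


channels = 64   # assumed layer width, for illustration only
kernel = 3      # assumed kernel size, for illustration only

full = conv_weights(channels, channels, kernel)
# Only a quarter of the channels enter the processing branch;
# the remaining three quarters pass through unchanged (identity).
partial = conv_weights(channels // 4, channels // 4, kernel)

print(f"full-channel conv weights : {full}")
print(f"1/4-branch conv weights   : {partial}")
print(f"reduction factor          : {full // partial}x")
```

Because both the input and output widths shrink by a factor of four, the weight count of the processed branch falls by a factor of sixteen, which is why restricting a module to the 1/4 branch costs far fewer parameters than applying it to every channel.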
5.4. Comparison of Loss Functions
To verify the enhancement in detection performance brought by the proposed GCD loss function, training was conducted on the TT100K dataset using YOLO11n as the baseline model. Six bounding box regression loss configurations (CIoU, NWD + CIoU, GCD + CIoU, SIoU, MPDIoU, and Wise-IoU) were evaluated under identical conditions, with the balancing factor set to 0.6 for the configurations that use one. The experimental results are presented in Table 3.
Table 3 demonstrates that, compared with using CIoU alone, the introduction of NWD improves the precision (P), recall (R), and mAP@0.5 of the model by 0.9, 0.1, and 0.2 points, respectively. Furthermore, other recent advanced scale-oriented regression losses, including SIoU, MPDIoU, and Wise-IoU, also exhibit performance gains over the baseline CIoU, with Wise-IoU achieving a notable mAP@0.5 of 78.6%.
In contrast, when employing the proposed GCD + CIoU, the model’s P, R, mAP@0.5, and mAP@0.5:0.95 reach the highest values at 81.1%, 69.6%, 78.8%, and 60.2%, respectively. This represents significant improvements of 1.7, 0.9, 1.0, and 0.3 points over the CIoU baseline. Crucially, GCD + CIoU consistently outperforms all other recent scale-oriented losses (SIoU, MPDIoU, and Wise-IoU) across all metrics. By introducing symmetric normalization terms specifically tailored to the scales of the predicted and ground truth boxes, GCD eliminates the influence of absolute target size on the distance metric. This ensures that small and large targets incur consistent loss weights when experiencing proportional deviations, thereby demonstrating a superior theoretical and experimental advantage in handling scale variations for small targets compared to other state-of-the-art regression losses.
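The scale-invariance property claimed above can be illustrated with a toy distance. The normalization below, which divides each squared center offset by the product of the predicted and ground-truth extents on that axis, is a hypothetical stand-in chosen only because it is symmetric in the two boxes. It is not the GCD definition, and the box values are invented for the example.

```python
def raw_center_distance(pred, gt):
    """Squared centre offset in pixels; boxes are (cx, cy, w, h)."""
    dx, dy = pred[0] - gt[0], pred[1] - gt[1]
    return dx * dx + dy * dy


def normalized_center_distance(pred, gt):
    """Hypothetical symmetric normalisation (NOT the paper's GCD):
    each squared offset is divided by the product of both boxes' extents."""
    dx, dy = pred[0] - gt[0], pred[1] - gt[1]
    return dx * dx / (pred[2] * gt[2]) + dy * dy / (pred[3] * gt[3])


def scale_box(box, s):
    """Scale a box uniformly, as if the same sign appeared s times larger."""
    return tuple(v * s for v in box)


# Illustrative boxes: a small target and the same geometry 10x larger.
gt_small, pred_small = (100, 100, 10, 10), (102, 101, 10, 10)
gt_large, pred_large = scale_box(gt_small, 10), scale_box(pred_small, 10)

for name, fn in [("raw", raw_center_distance),
                 ("normalized", normalized_center_distance)]:
    print(name, fn(pred_small, gt_small), fn(pred_large, gt_large))
```

The raw distance grows with the square of the scale, so an identical proportional error is penalized far more on the large box, whereas the normalized distance is unchanged by the rescaling and by swapping the roles of the two boxes.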
As can be observed from the curves in Figure 7, under the same number of training epochs, the loss function employing GCD + CIoU exhibits the fastest convergence rate and achieves the lowest final loss value among all six evaluated configurations. During the initial training phase (0–100 epochs), the loss reduction for GCD + CIoU is significantly greater than that of CIoU, NWD + CIoU, and the other recent scale-oriented losses (SIoU, MPDIoU, and Wise-IoU). This indicates that it can more efficiently optimize bounding box regression, reduce localization errors, and enhance gradient update efficiency, thereby accelerating model convergence. By the end of the 300 training epochs, the final loss value of GCD + CIoU remains significantly lower than those of the other five loss functions. This further verifies that the symmetric normalization term introduced by GCD effectively mitigates the impact of absolute target size. Consequently, it makes the loss more robust to the scale variations in both small and large targets, ultimately yielding superior optimization performance and more precise bounding box fitting.
To evaluate the impact of the weighting factor in the proposed GCD loss function, a hyperparameter sensitivity analysis was conducted. The value of the weighting factor was varied from 0.4 to 0.8 in steps of 0.1. The corresponding experimental results on the TT100K dataset are summarized in Table 4.
As indicated in Table 4, the overall detection performance first increases and then declines slightly as the weighting factor grows, peaking at a value of 0.6. Under this optimal setting, the precision (P), recall (R), mAP@0.5, and mAP@0.5:0.95 reach 81.1%, 69.6%, 78.8%, and 60.2%, respectively. The metrics are also highly stable across the tested range. Specifically, the core metric mAP@0.5 fluctuates within a margin of only 0.7 points (from 78.1% to 78.8%). Importantly, even at the boundary values of the tested range (0.4 and 0.8), the proposed GCD loss consistently outperforms CIoU. For instance, the lowest mAP@0.5 across all tested weighting-factor configurations is 78.1%, which is still 0.3 points above the baseline. This minimal fluctuation and consistent superiority over the baseline demonstrate the strong robustness of the GCD method to parameter variations, indicating that the effectiveness of the proposed loss formulation in small-target localization derives primarily from its structural scale-invariance rather than from specific parameter configurations.
5.5. Ablation Experiment
To verify the effectiveness of each proposed improvement module and to rigorously isolate their individual contributions from potential complex interactions, an ablation study (Groups ①–⑧) was conducted on the TT100K dataset using YOLO11n as the baseline model. The improvement methods (reconstructing the detection scales, introducing the C3k2_CGPEMA module, and optimizing the loss function with GCD) are denoted as A, B, and C, respectively. Group ① represents the YOLO11n baseline model, Groups ②–④ represent the application of single modules, Groups ⑤–⑦ represent pairwise combinations, and Group ⑧ represents the model incorporating all improvements. The experimental results are presented in Table 5.
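The eight-group design described above corresponds to enumerating every subset of the three improvements, ordered by subset size. The sketch below reproduces that enumeration; the helper name `ablation_groups` is hypothetical, and the ordering within the pairwise groups (A + B, A + C, B + C) is an assumption, since only Group ⑤ (A + B) is identified explicitly in the text.

```python
from itertools import combinations

MODULES = {
    "A": "detection-scale reconstruction",
    "B": "C3k2_CGPEMA module",
    "C": "GCD loss",
}


def ablation_groups(keys):
    """All subsets of `keys`, ordered by size: baseline, singles, pairs, all."""
    groups = []
    for size in range(len(keys) + 1):
        groups.extend(combinations(keys, size))
    return groups


for index, group in enumerate(ablation_groups(list(MODULES)), start=1):
    label = " + ".join(group) if group else "baseline (YOLO11n)"
    print(f"Group {index}: {label}")
```

With three components this yields 2³ = 8 configurations, which is the minimum needed to observe every module alone and in every combination.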
As shown in Table 5, introducing the individual modules independently (Models ②, ③, and ④) yields consistent performance improvements over the baseline, proving their standalone effectiveness. After introducing the scale reconstruction strategy into the detection layer (Model ②), the model is better equipped to extract and utilize the shallow feature information of small targets. Compared with the YOLO11n baseline model, the precision (P), recall (R), and mAP@0.5 are improved by 1.9, 3.3, and 2.3 points, respectively, highlighting the advantages of this network structural optimization in capturing extremely small targets. When solely introducing the novel feature extraction module C3k2_CGPEMA (Model ③), the improved model exhibits significantly enhanced capabilities in feature focusing and extraction for tiny traffic signs. This is attributed to the synergistic effect between the compulsory high-frequency spatial detail preservation mechanism of PConv and the cross-dimensional feature aggregation capability of the EMA mechanism. Compared with the baseline model, P, R, and mAP@0.5 increase by 1.3, 2.5, and 1.4 points, respectively, effectively reducing the miss rate of traffic signs in complex backgrounds. By introducing the combined CIoU and GCD loss (Model ④), the model utilizes the scale invariance of GCD alongside the joint optimization of bounding box position and morphology, while preserving the stable convergence characteristics of CIoU. Without introducing any additional model parameters or computational overhead, P, R, and mAP@0.5 are improved by 1.7, 0.9, and 1.0 points, respectively. This indicates that the combined loss can effectively optimize the bounding box regression accuracy for small targets.
To explicitly address the complex interactions among the three components and rule out coincidental gains, Models ⑤–⑦ evaluate their pairwise combinations. The results demonstrate that any combination of two modules achieves further continuous performance gains over their single-module counterparts. For example, Model ⑤ (A + B) achieves an mAP@0.5 of 82.0%, which is higher than both Model ② (80.1%) and Model ③ (79.2%). This consistent upward trend across Models ⑤, ⑥, and ⑦ proves that the proposed modules possess low functional redundancy and excellent positive synergistic effects, rather than conflicting with one another.
Ultimately, the algorithm integrating the C3k2_CGPEMA module, the combined CIoU and GCD loss, and the detection layer scale reconstruction (Model ⑧) achieves optimal performance. Compared with the YOLO11n baseline model, P, R, mAP@0.5, and mAP@0.5:0.95 are significantly improved by 5.4, 7.4, 6.6, and 4.6 points, respectively. Additionally, while the computational cost increases to 13.5 GFLOPs, the parameter count is slightly reduced to 2.46 M due to the structural optimization of the scale reconstruction. This systematic step-by-step experiment validates the effectiveness, positive interaction, and superiority of the proposed improvements for the small-target traffic sign detection task on this dataset.
5.6. Comparative Experiments
To verify the comprehensive performance and advantages of the improved algorithm in the small-target traffic sign detection task, comparative experiments were conducted against mainstream object detection algorithms on the TT100K dataset. The compared models encompass the classical two-stage algorithm Faster R-CNN, as well as one-stage algorithms including YOLOv5n, YOLOv7-tiny, YOLOv8n, YOLOv10n, YOLO26n, YOLO11n, and YOLO11s. To comprehensively evaluate the feasibility of practical deployment, the frames per second (FPS) metric was measured to assess inference speed. In the FPS testing phase, to ensure accurate measurements, the batch size was set to 1. After a 200-iteration warm-up period for the model, 1000 inference latency tests were executed. The detailed comparative results are presented in Table 6.
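The FPS protocol described above (batch size 1, 200 warm-up iterations, 1000 timed runs) can be sketched as a generic timing loop. This is an assumed reconstruction rather than the benchmarking script used in the experiments: `measure_fps` and `infer` are hypothetical names, with `infer` standing for a zero-argument callable that runs one forward pass on a single image.

```python
import time


def measure_fps(infer, warmup=200, runs=1000):
    """Average FPS of `infer` over `runs` timed calls after `warmup` untimed calls.

    `infer` is a hypothetical zero-argument callable performing one
    batch-size-1 forward pass.
    """
    # Warm-up: lets caches, allocators and lazy initialisation settle.
    for _ in range(warmup):
        infer()

    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        infer()
        latencies.append(time.perf_counter() - start)

    mean_latency = sum(latencies) / len(latencies)
    return 1.0 / mean_latency


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a detector installed.
    print(f"{measure_fps(lambda: sum(range(10_000))):.1f} FPS")
```

When the model runs on a GPU, kernels are typically launched asynchronously, so the device would need to be synchronized before each clock reading for the latencies to be meaningful.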
As shown in Table 6, compared with Faster R-CNN, the YOLO series algorithms achieve superior detection accuracy while maintaining lightweight models. Specifically, the mAP@0.5 scores of YOLOv5n and YOLOv7-tiny are 69.5% and 73.7%, respectively. With continuous version iterations, YOLOv8n and YOLOv10n further elevate the mAP@0.5 to 76.4% and 77.6%, while YOLO26n pushes the performance boundaries further, reaching 80.8%.
The baseline model YOLO11n performs acceptably on this dataset, achieving an mAP@0.5 of 77.8% and an inference speed of up to 173.2 FPS. The scaled-up YOLO11s achieves an mAP@0.5 of 82.0%, but at a much higher model complexity of 9.40 M parameters and 21.5 GFLOPs. The improved algorithm (Ours) further increases the mAP@0.5 to 84.4%, outperforming YOLO11s by 2.4 points while using significantly fewer parameters (2.46 M) and GFLOPs (13.5). However, this performance enhancement is accompanied by a shift in computational cost: although the total number of parameters decreases slightly by 0.13 M compared to YOLO11n, the GFLOPs increase from 6.5 to 13.5, and the inference speed consequently drops to 114.5 FPS. The increase in computational complexity primarily stems from the introduction of the high-resolution P2 detection head, a structure that is crucial for capturing extremely small traffic signs. Even so, the inference speed of 114.5 FPS still far exceeds the real-time processing requirements of autonomous driving scenarios (typically > 30 FPS), thereby preserving real-time performance for practical deployment while significantly improving accuracy.
As indicated by the comprehensive performance comparison chart in Figure 8, the improved algorithm enhances the recognition accuracy of the model. It not only mitigates the issue of missed detections prevalent in small-target traffic sign detection but also effectively suppresses interference from similar targets. It exhibits superior object recognition capabilities, making it highly suitable for small-target traffic sign detection tasks. For intuitive comparison, all coordinate axes in the figure are unified such that larger values indicate better performance. Specifically, P, R, mAP@0.5, and mAP@0.5:0.95 are plotted directly using their original percentages. Since FPS inherently follows the “larger is better” principle, it is linearly mapped to the range of 60–90 through normalization to maintain visual consistency. Regarding GFLOPs and parameters, where smaller original values represent higher efficiency, the figure displays the results of their reciprocals after linear scaling (also normalized to the 60–90 range). Therefore, a larger radius reflects lower computational complexity and a smaller parameter count.
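The axis normalization described above can be sketched as follows. The text specifies a linear mapping to the 60–90 range but not its exact form, so min-max scaling over the plotted models is an assumption here, and the function names are hypothetical. The sample values are those quoted in the text for YOLO11n, YOLO11s, and the proposed model; the actual figure would use every model in Table 6.

```python
def linear_map(values, lo=60.0, hi=90.0):
    """Min-max scale `values` into [lo, hi] (assumed form of the linear mapping)."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # Degenerate axis: place every model at the midpoint.
        return [(lo + hi) / 2.0 for _ in values]
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]


def radar_axis(values, smaller_is_better=False):
    """Map one metric onto a 'larger is better' radar axis."""
    if smaller_is_better:
        values = [1.0 / v for v in values]  # reciprocal flips the ordering
    return linear_map(values)


# Values quoted in the text: YOLO11n, YOLO11s, Ours.
gflops = [6.5, 21.5, 13.5]
for name, radius in zip(["YOLO11n", "YOLO11s", "Ours"],
                        radar_axis(gflops, smaller_is_better=True)):
    print(f"{name}: {radius:.1f}")
```

Taking the reciprocal before scaling means the mapping is no longer linear in the original GFLOPs or parameter values, so equal radial gaps on those axes do not correspond to equal differences in cost.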
5.7. Generalization Experiments
To verify the generalization capability of the improved algorithm across different datasets, generalization experiments were conducted on the CCTSDB dataset. The CCTSDB (CSUST Chinese Traffic Sign Detection Benchmark) dataset contains a diverse collection of Chinese traffic sign images captured under various real-world driving conditions, encompassing different weather, lighting, and complex background scenarios. It primarily includes three categories of traffic signs: warning, prohibitory, and mandatory. The experimental results are presented in Table 7 and Table 8.
As shown in Table 7, compared with YOLO11n, the improved algorithm (Ours) increases the precision (P), recall (R), mAP@0.5, and mAP@0.5:0.95 by 1.0, 2.4, 1.1, and 2.2 points, respectively. Its stable performance across different datasets validates the strong generalization capability and practical value of the algorithm.
Table 8 illustrates the comparison of detection performance on mandatory, prohibitory, and warning signs before and after the model improvement. Compared with YOLO11n, the improved algorithm can detect these three categories of signs more accurately, with the mAP@0.5 increasing by 0.6, 1.9, and 0.9 points, respectively.
To intuitively compare the training results of the generalization experiments before and after the model improvements, comparative curves of the generalization experiments are plotted in Figure 9. Throughout the 300 training epochs, the P curve of the improved model remains stably higher than that of the baseline model, and its convergence value in the later stage of training is superior. This indicates that the improved model possesses higher target recognition precision. Furthermore, the R curve of the improved model not only ascends faster but also converges to a higher value than the baseline model, reflecting a stronger sample recall capability and a lower miss rate. Regarding the mAP@0.5 curve, which serves as a comprehensive evaluation metric, the advantage of the improved model is evident. Its curve consistently remains above that of the baseline model, and the margin between the two remains stable throughout the training process, verifying the enhancement in the overall detection performance. In addition, judging from the trends of the three curves, the metrics of both models increase rapidly during the initial training phase (0–100 epochs) and gradually converge after 100 epochs. This demonstrates that while maintaining training stability, the improved model successfully breaks through the performance ceiling of the baseline model. Ultimately, the generalization experimental results fully prove the validity of the proposed model improvement strategies.
5.8. Analysis of Visualization Results
To visually compare detection performance on the TT100K dataset before and after improvements, a visualization analysis was conducted for YOLO11n and the proposed algorithm under three typical scenarios: strong illumination, low light, and dense target distributions. The results are shown in Figure 10 ((a) YOLO11n, (b) the proposed algorithm), where green, blue, and red bounding boxes denote correct, false, and missed detections, respectively.
In strong illumination scenarios, overexposure and reflections cause severe loss of image details, posing a stringent challenge to feature extraction. While YOLO11n exhibits severe missed detections, the proposed algorithm correctly detects all targets in the test samples without any false or missed detections, and its confidence scores are significantly improved, demonstrating excellent capability in preserving features under intense lighting. In low-light scenarios, insufficient illumination degrades traffic sign clarity. YOLO11n yields lower detection accuracy and falsely detects a pl40 sign as a pl50 sign, whereas the proposed algorithm overcomes the feature degradation induced by poor illumination, correctly identifying all targets with no missed detections and maintaining high detection accuracy, thereby showing strong robustness under extreme lighting conditions. In scenarios with dense target distributions, the comparative results in Figure 10 reveal that the original model experiences false and missed detections for pm30 signs in densely populated areas and misclassifies a p3 sign as a p26 sign. Conversely, the proposed algorithm suppresses mutual interference among densely packed targets and achieves precise recognition of all targets, exhibiting excellent discriminative capability. Overall, the proposed algorithm delivers superior detection performance across all evaluated scenarios.
To demonstrate the architectural superiority for traffic sign detection, Figure 11 visualizes the effective receptive fields (ERFs) extracted from the fully trained weights of both the baseline and proposed models. The top row displays the proposed model (P2, P3, and P4 layers), while the bottom row illustrates the baseline YOLO11n (P3, P4, and P5 layers). By replacing the P5 layer with the high-resolution P2 layer, our model achieves a tightly bounded ERF that precisely envelops small-to-medium signs. This effectively eliminates the severe background clutter (e.g., surrounding trees) caused by the over-diffused P5 activation in the baseline. Furthermore, at equivalent scales (P3 and P4), the proposed ERFs exhibit a significantly higher energy concentration toward the target center compared to the baseline’s scattered patterns. This centralized focus optimally calibrates the receptive field, enhancing the signal-to-noise ratio and directly contributing to the improved detection accuracy.