A Real-Time Traffic Sign Detection Algorithm Based on Improved YOLO11n

Luo, Yutao; Ning, Hang; Nan, Chunli; Dong, Zeyang; Gan, Jiayi

doi:10.3390/electronics15132916


by Yutao Luo, Hang Ning *, Chunli Nan, Zeyang Dong and Jiayi Gan

School of Information Engineering, Chang’an University, Xi’an 710064, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(13), 2916; https://doi.org/10.3390/electronics15132916

Submission received: 7 June 2026 / Revised: 30 June 2026 / Accepted: 1 July 2026 / Published: 3 July 2026

(This article belongs to the Section Electrical and Autonomous Vehicles)


Abstract

To address the issues of low detection accuracy and high miss rates in long-range small traffic sign detection, which are caused by insufficient feature information and susceptibility to background interference, this paper proposes an improved real-time traffic sign detection algorithm based on YOLO11n. First, a cross-guided feature extraction module, C3k2_CGPEMA, is designed within the neck network. By embedding the Efficient Multi-Scale Attention (EMA) mechanism into the feature extraction branch of Partial Convolution (PConv), this module utilizes the spatial attention mask generated by the convolutional branch to provide cross-branch guidance and filter out complex background noise from the identity branch. This achieves precise fine-grained feature focusing while preserving high-frequency spatial details. Furthermore, a joint bounding box regression loss function combining Complete Intersection over Union (CIoU) and Gaussian Combined Distance (GCD) is adopted. This preserves the stable convergence properties of CIoU while leveraging the scale invariance of GCD to enhance the regression accuracy for small targets. Finally, the detection layers are reconstructed by removing the P5 layer and introducing a high-resolution P2 layer (160 × 160), significantly strengthening the localization capability for distant, tiny targets. Experimental results demonstrate that the proposed algorithm achieves improvements of 5.4, 7.4, and 6.6 points in precision, recall, and mAP@0.5, respectively, on the TT100K dataset compared to the baseline YOLO11n. While boosting detection accuracy, the model maintains an inference speed of 114.5 frames per second (FPS), fully satisfying the requirements for real-time detection in in-vehicle environments. Generalization experiments conducted on the CCTSDB dataset further validate the robustness of the proposed algorithm in complex environments.

Keywords: traffic sign detection; YOLO11n; C3k2_CGPEMA; GCD; detection layer reconstruction

1. Introduction

Traffic sign detection is a core technology in intelligent transportation systems (ITSs) and autonomous driving. It provides traveling vehicles with critical information, such as road condition guidance and traffic warnings, which is of great significance for ensuring driving safety. In real-world driving scenarios, distant traffic signs usually occupy a small proportion of the image, making them typical small targets. Their feature information is scarce and highly susceptible to illumination variations and complex backgrounds. Consequently, traditional detection algorithms suffer from high miss rates and insufficient localization accuracy, whereas deep learning-based object detection algorithms are gradually becoming the mainstream solution. Common deep learning-based object detection methods include two-stage algorithms (e.g., Fast R-CNN [1], Faster R-CNN [2], and Mask R-CNN [3]) and one-stage algorithms (e.g., RT-DETR [4], SSD [5], and the YOLO series [6]). The former exhibit poor real-time performance, large computational overhead, and high hardware requirements. In contrast, the latter significantly reduce computational costs and accelerate inference speed while maintaining high detection accuracy, making them more suitable for deployment on in-vehicle terminals.

In recent years, the YOLO series of algorithms has made significant progress in the field of object detection; therefore, many current studies on traffic sign detection are based on improving this architecture. For instance, to address the problems of poor localization, low accuracy, and high missed detection rates for small traffic signs, Han et al. [7] improved the YOLOv5 model. They replaced the original convolutional layer with an SPD-Conv module to extract low-resolution features and introduced a Decoupled Head along with a Contextual Attention Module (CAM). This approach increased the Average Precision (AP) by 3.7 points. Similarly, Song et al. [8] tackled the challenges posed by small targets and complex urban backgrounds in traffic light detection. Based on YOLOv5, they utilized the Mosaic-9 method for dataset augmentation, introduced the SE attention mechanism to enhance network performance, and adopted the EIoU loss function to mitigate missed detections and false alarms. On their custom dataset, the model achieved a 6.3-point improvement in mAP. Focusing on the issues of tiny traffic sign sizes and complex background environments that limit detection performance, Liu et al. [9] proposed an improved YOLOv10 model. They designed a Three-Branch Downsampling (TBD) module to enhance feature extraction efficiency, integrated a customized small object detection layer, and employed a combined NWD and Wise-MPDIoU loss function to optimize bounding box matching, improving the mAP@0.5 by 4.0 points compared to the baseline model. Furthermore, addressing the limitations of multi-scale feature representation and the high missed detection rate of small traffic signs under edge-computing constraints, Jia et al. [10] improved the YOLOv8n model. They designed a Multi-scale Contextual Attention (MCA) mechanism, introduced the VoVGSCSP module to replace the original C2f module, and adopted Learnable Weight Concatenation (LWConcat) to optimize the feature fusion path. On the CTSDB dataset, this method increased the mAP@0.5 by 3.2 points.

In real-world traffic scenarios, small target detection suffers from inherent limitations, such as severe spatial information loss in deep network layers, high susceptibility to environmental perturbations (e.g., illumination changes and motion blur), and heavy background interference. The novelty of this study lies in the synergistic integration of a cross-guided feature extraction module, a scale-reconstructed detection head, and a joint bounding box regression loss function, which explicitly preserves the high-frequency spatial details of small targets and precisely optimizes bounding box localization under complex backgrounds. This paper proposes an improved real-time traffic sign detection algorithm based on YOLO11n, as shown in Figure 1. The specific improvement measures are as follows:

(1) A cross-guided feature extraction module, C3k2_CGPEMA, is proposed within the neck network. This module achieves deep coupling by embedding EMA [11] into the local feature extraction branch of PConv [12]. By utilizing the spatial attention mask output by the convolutional branch to provide cross-branch constraints on the identity branch, it not only leverages the advantage of partial convolution in explicitly preserving the high-frequency spatial details of small targets, but also utilizes multi-scale feature aggregation to cooperatively filter out background interference in the identity channel. This consequently enhances the network’s capability to capture the fine-grained features of traffic signs.
(2) A joint bounding box regression loss function combining GCD [13] and CIoU is adopted. By leveraging the scale invariance and optimization properties of GCD, this joint loss effectively mitigates the gradient vanishing and localization deviation issues encountered in bounding box regression for small targets, while simultaneously preserving the stable convergence characteristics of CIoU.
(3) The detection head is reconstructed [14] by removing the P5 layer and introducing a high-resolution P2 layer. This significantly enhances the network’s capability to extract and represent the shallow features of tiny targets, while avoiding redundant computations in the large-target layer.

2. Related Works

2.1. Object Detection Evolution

With the continuous advancement of Convolutional Neural Networks (CNNs) in computer vision, object detection—a fundamental task in this domain—has achieved groundbreaking progress. Within the paradigm of object detection algorithms, two-stage detection models, typically represented by Faster R-CNN, rely on Region Proposal Networks (RPNs) to generate and filter region proposals. This mechanism effectively ensures both localization precision and recognition accuracy. However, constrained by their two-step detection architecture, these models exhibit high computational complexity, making them ill-suited for the rigorous demands of real-time detection applications.

Conversely, YOLO series algorithms—serving as quintessential one-stage detection algorithms—transform the object detection task into an end-to-end regression problem. By eliminating the region proposal generation phase, these algorithms substantially simplify the detection pipeline, thereby significantly enhancing the models’ inference efficiency and real-time responsiveness.

Although both mainstream detection frameworks demonstrate excellent performance in medium-to-large object detection tasks by efficiently accomplishing object localization and recognition, they still present notable limitations in small-object traffic sign detection scenarios. The fundamental cause is that, following multiple downsampling operations, the spatial resolution of deep feature maps in CNNs is drastically reduced. Consequently, it becomes exceedingly difficult to adequately capture and represent the crucial features of small traffic signs, which may occupy merely a few dozen pixels within an image. This ultimately causes the precision and recall rates of small object detection to fall short of practical application standards.

2.2. Small Object Detection in Traffic Scenes

To address the issue of feature degradation in small objects, researchers have explored various structural optimizations within the YOLO framework for traffic environments. For instance, Li et al. [15] improved the YOLOv7 algorithm by embedding a small-object detection layer within the neck network and incorporating the Normalized Gaussian Wasserstein Distance (NWD) metric. This approach enhanced the feature representation of scale-constrained traffic signs on the TT100K dataset. Similarly, considering the impact of adverse weather, Chen and Fan [16] proposed MSGC-YOLO, an optimized model designed to extract traffic sign features under snow conditions. While these customized network designs have achieved improvements in detection accuracy, balancing feature extraction capabilities with model complexity remains an ongoing focus in real-time traffic detection research. This motivates the continuous exploration of lightweight and efficient architectures.

2.3. Bounding Box Regression Loss

The accuracy of object detection depends heavily on the bounding box regression loss function. In the detection of tiny traffic signs, traditional Intersection over Union (IoU) metrics are highly sensitive to minor positional deviations. A minute pixel shift often leads to a precipitous drop in IoU values, rendering them incapable of providing effective gradient supervision. To address this, a series of optimized loss functions has been developed. Gevorgyan [17] introduced SIoU, which redefines penalty metrics by considering the angle of the vector between the predicted and ground-truth boxes to accelerate training convergence. Furthermore, Tong et al. [18] designed Wise-IoU (WIoU), incorporating a dynamic non-monotonic focusing mechanism that utilizes the outlier degree to assign gradient gains wisely. More recently, Ma et al. [19] proposed MPDIoU, which minimizes the distance between the top-left and bottom-right corners to overcome the limitations of traditional metrics when handling varying aspect ratios. Despite these significant theoretical advancements, applying them directly to extremely small traffic signs in complex backgrounds often fails to ensure stable geometric convergence.
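The sensitivity of IoU to small positional deviations can be illustrated with a short worked example. The sketch below is illustrative only: the box sizes and the 3-pixel shift are arbitrary values chosen for demonstration, not figures taken from the experiments, and `iou` and `shifted_iou` are hypothetical helper names.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def shifted_iou(size, shift):
    """IoU between a size x size box and a copy shifted horizontally by `shift` pixels."""
    gt = (0.0, 0.0, float(size), float(size))
    pred = (float(shift), 0.0, float(size + shift), float(size))
    return iou(gt, pred)


# The same 3-pixel error costs a tiny box far more IoU than a large one.
for size in (10, 20, 100):
    print(f"{size:>3} px box, 3 px shift: IoU = {shifted_iou(size, 3):.3f}")
# 10 px box  -> 0.538
# 20 px box  -> 0.739
# 100 px box -> 0.942
```

With an identical 3-pixel offset, the 10-pixel box is already close to the 0.5 matching threshold, whereas the 100-pixel box is barely affected.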

3. YOLO11 Algorithm

YOLO11n [20] is adopted as the baseline model. As a lightweight network, it features a simple structure, rapid inference speed, and a low parameter count, making it highly suitable for deployment on embedded devices and in-vehicle edge computing platforms. The network architecture of YOLO11n consists of four main components: the Input, Backbone, Neck, and Head. Specifically, the Input module is responsible for image preprocessing; the Backbone extracts rich multi-scale semantic features through multi-layer convolutional networks; the Neck is utilized to fuse the multi-level features output by the backbone; and the Head outputs the bounding box coordinates, class labels, and confidence scores of the targets.

4. Improvement of the YOLO11n Algorithm

4.1. C3k2_CGPEMA Module

In YOLO11n, feature extraction and fusion primarily rely on the C3k2 module. However, when processing the large number of distant, small-target traffic signs present in the TT100K dataset, convolutional operations tend to induce a pixel-smoothing effect during downsampling and feature extraction. Consequently, the high-frequency spatial details of small targets (e.g., edges, contours, and specific patterns) are highly susceptible to being submerged by complex background information.

To address the aforementioned issues, a cross-guided feature extraction module, C3k2_CGPEMA, is proposed to replace the C3k2 module in the neck network, achieving a deep coupling between high-frequency detail preservation and multi-scale background suppression. Specifically, the core bottleneck structure within the C3k2_CGPEMA module introduces a channel splitting mechanism. In this structure, the feature extraction branch first captures spatial features via local convolutions and subsequently feeds them into the embedded EMA mechanism, which utilizes cross-dimensional multi-scale feature aggregation to generate high-quality spatial attention masks. Meanwhile, the identity branch bypasses convolutional operations to explicitly preserve the original high-frequency spatial details. Furthermore, the spatial mask generated by the feature extraction branch is broadcast across channels and applied to the identity branch. This forms a cross-branch guided denoising constraint that filters out complex background noise from the identity features, thereby compelling the network to precisely focus on the fine-grained features of tiny targets.

The network architectures of C3k2_CGPEMA and C3k_CGPEMA are illustrated in Figure 2 and Figure 3. The underlying design concept of these configurations is rooted in the Cross Stage Partial (CSP) network architecture, aimed at optimizing gradient flow and reducing computational redundancy. As illustrated in Figure 2 and Figure 3, the input feature maps are split into parallel branches. One branch acts as a residual bypass to preserve rich, unaltered gradient information, while the other branch undergoes deep feature extraction through a series of Bottleneck_CGPEMA modules. Within these bottlenecks, the CGPEMA block serves as the core component, utilizing its spatial detail preservation and cross-dimensional attention mechanisms to precisely capture the features of tiny traffic signs. A 1 × 1 convolution is then employed to fuse the concatenated outputs from both branches.

CGPEMA

When processing feature maps, depthwise or standard convolutions perform spatial filtering across all channels. For tiny traffic signs that occupy only a minimal number of pixels, this operation is highly prone to causing feature degradation. To minimize computational overhead while maximally preserving the high-frequency spatial information of small targets, the channel splitting mechanism of PConv and the multi-scale attention mechanism of EMA are integrated into a CGPEMA module, which is subsequently used to modify the C3k2 module. The architecture of the CGPEMA module is illustrated in Figure 4a.

The CGPEMA module adopts a channel splitting and asymmetric processing strategy. Assuming the total number of channels of the input feature map is 4C, the input features are first divided into two segments along the channel dimension. The feature extraction branch (with C channels) performs a standard 3 × 3 spatial convolution operation to capture local spatial features, which serves as the basis for the subsequent EMA multi-scale attention guidance. Conversely, the identity branch (with 3C channels) completely bypasses the current convolution operation and is preserved directly through identity mapping. In this splitting mechanism, the identity branch retains the original high-frequency spatial details (such as edges and contours) that have not undergone feature smoothing, while the feature extraction branch provides the necessary receptive fields and spatial transformation capabilities.

Although channel splitting effectively preserves the edge features of small-target traffic signs, these signs are frequently subjected to interference from surrounding trees, buildings, or similarly colored background regions in real-world driving scenarios. To guide the model in precisely focusing its limited computational resources on target areas while purifying the identity features, the EMA multi-scale attention mechanism is embedded within the C-channel feature extraction branch. The structure of EMA is illustrated in Figure 4b.

Traditional attention mechanisms (such as CBAM and SE) often focus exclusively on global average pooling along the channel dimension, neglecting information interaction across spatial dimensions. In contrast, the EMA mechanism abandons indiscriminate dimensionality reduction operations on the entire feature map. As detailed in the flowchart in Figure 4b, the module first applies a “Groups” operation to the input features (denoted generally as C × H × W in the diagram, which corresponds to the C extraction branch). This operation divides the input into G sub-features, yielding dimensions of C//G × H × W for each, where C//G denotes the integer division of the total channels by the number of groups G. Subsequently, the C//G × H × W feature map is fed into parallel branches. In the coordinate attention branch, two 1D Global Average Pooling operations (X AvgPool and Y AvgPool) are applied, yielding directional feature vectors of sizes C//G × 1 × W and C//G × H × 1, respectively. These vectors are concatenated, passed through a Concat + Conv(1 × 1) operation, split, and activated by Sigmoid functions to “Re-weight” the grouped features. Simultaneously, the other parallel branch directly applies a 3 × 3 convolution (Conv(3 × 3)) to the C//G × H × W feature map to capture multi-scale local spatial context. Finally, the module enters the “Cross-spatial learning” phase. This phase aggregates features from both branches through Group Norm, Avg Pool, Softmax, and Matrix Multiplication (Matmul), capturing cross-dimensional dependencies to output a refined spatial attention map.

The core principle of cross-guidance and denoising in the CGPEMA module is as follows: In addition to generating enhanced features for the processed branch through cross-dimensional interaction, the EMA mechanism further refines a highly discriminative 2D spatial attention mask via global spatial aggregation. Subsequently, this spatial mask is broadcast and multiplied element-wise with the features of the 3C identity branch. Analytically, the source of performance enhancement is rooted in this cross-branch constraint mechanism rather than a simple increase in representational capacity. In a conventional parallel or stacked module design, the identity branch would bypass the processing block, directly propagating unattenuated background noise to subsequent layers and causing spatial interference during channel concatenation. By utilizing the attention mask generated from the processing branch to modulate the identity branch, the module dynamically filters background noise without introducing any additional learnable parameters to the identity path. This structural guidance preserves high-frequency spatial details while effectively suppressing irrelevant background pixels. Consequently, it is this specific synergistic interaction—rather than the brute-force addition of module capacity—that fundamentally optimizes the feature distribution. Ultimately, the denoised identity features and the enhanced extracted features are concatenated along the channel dimension to construct a more robust feature representation for small targets.
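The data flow of the cross-branch guidance described above can be sketched in a few lines of dependency-free Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the 3 × 3 convolution and the EMA block are replaced by an optional `extract` callable, and the spatial mask is a plain sigmoid of the per-pixel channel mean, which is only a stand-in for the mask that EMA would produce. The names `cross_guided_block` and `extract` are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def cross_guided_block(x, extract=None):
    """x: list of 4C channels, each an H x W list of lists.

    Returns (features, mask): C processed channels followed by 3C masked
    identity channels, plus the H x W mask that was applied.
    """
    n = len(x)
    assert n % 4 == 0, "channel count must be a multiple of 4"
    c = n // 4
    processed, identity = x[:c], x[c:]        # C-channel and 3C-channel split
    if extract is not None:                   # placeholder for 3x3 conv + EMA
        processed = extract(processed)
    h, w = len(processed[0]), len(processed[0][0])
    # Stand-in spatial mask (NOT EMA): sigmoid of the per-pixel channel mean.
    mask = [[sigmoid(sum(ch[i][j] for ch in processed) / c) for j in range(w)]
            for i in range(h)]
    # Broadcast the single H x W mask over every identity channel.
    # No learnable parameter is added to the identity path.
    guided = [[[ch[i][j] * mask[i][j] for j in range(w)] for i in range(h)]
              for ch in identity]
    return processed + guided, mask


# Toy input: one processed channel and three all-ones identity channels.
proc = [[6.0, -6.0], [0.0, 6.0]]
ones = [[1.0, 1.0], [1.0, 1.0]]
out, mask = cross_guided_block([proc, ones, ones, ones])
print(len(out), [[round(m, 3) for m in row] for row in mask])
```

In the toy input, identity pixels at locations where the processed branch responds weakly are attenuated towards zero, while pixels at strongly responding locations pass through almost unchanged.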

4.2. GCD-Optimized Loss Function

In the YOLO11n network model, Complete Intersection over Union (CIoU) is adopted as the bounding box loss function to measure the similarity between the predicted box and the ground truth box. However, CIoU is highly sensitive to the positional offsets and scale variations in small targets. When a small target exhibits a slight positional deviation, the CIoU value drops sharply, making it difficult for the model to obtain effective supervision signals and consequently reducing the detection accuracy for small targets. To address this issue, the Gaussian Combined Distance (GCD) is introduced. By modeling the bounding boxes as 2D Gaussian distributions and constructing a scale-invariant distance metric, the robustness of the model in detecting small targets is significantly enhanced.

First, considering that traffic signs typically feature concentrated central features and substantial background interference at the edges, the predicted box $B_{p}(x_{p}, y_{p}, w_{p}, h_{p})$ and the ground truth box $B_{t}(x_{t}, y_{t}, w_{t}, h_{t})$ are each modeled as a 2D Gaussian distribution $N(\mu, \Sigma)$. Here, $(x, y)$ represents the center coordinates, and $(w, h)$ denotes the width and height. The mean vector $\mu$ and covariance matrix $\Sigma$ are defined as follows:

$$\mu = \begin{bmatrix} x \\ y \end{bmatrix} \tag{1}$$

$$\Sigma = \begin{bmatrix} \frac{w^{2}}{4} & 0 \\ 0 & \frac{h^{2}}{4} \end{bmatrix} \tag{2}$$

Through the aforementioned transformation, the problem of measuring bounding box similarity is converted into a distance measurement problem between two Gaussian distributions, $N_{p}(\mu_{p}, \Sigma_{p})$ and $N_{t}(\mu_{t}, \Sigma_{t})$. Unlike the Normalized Wasserstein Distance (NWD) [21], the GCD introduces normalization terms specifically tailored to the scales of the predicted and ground truth boxes, thereby constructing a symmetric and scale-invariant distance metric. The squared form of the GCD is defined as:

$$\begin{aligned} D_{gc}^{2}(N_{p}, N_{t}) ={} & \frac{1}{2}\left(\frac{(x_{p} - x_{t})^{2}}{w_{p}^{2}} + \frac{(y_{p} - y_{t})^{2}}{h_{p}^{2}}\right) + \frac{1}{2}\left(\frac{(w_{p} - w_{t})^{2}}{4 w_{p}^{2}} + \frac{(h_{p} - h_{t})^{2}}{4 h_{p}^{2}}\right) \\ & + \frac{1}{2}\left(\frac{(x_{t} - x_{p})^{2}}{w_{t}^{2}} + \frac{(y_{t} - y_{p})^{2}}{h_{t}^{2}}\right) + \frac{1}{2}\left(\frac{(w_{t} - w_{p})^{2}}{4 w_{t}^{2}} + \frac{(h_{t} - h_{p})^{2}}{4 h_{t}^{2}}\right) \end{aligned} \tag{3}$$

By employing $w_{p}$, $h_{p}$, $w_{t}$, and $h_{t}$ as denominators for normalization, this formula eliminates the impact of absolute target size on the distance metric. This ensures that small and large targets incur consistent loss weights when producing proportional deviations, thereby guaranteeing the model’s robustness in small target detection.
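The scale invariance and symmetry of Equation (3) can be checked numerically. The following is a minimal sketch that transcribes the formula term by term for boxes given as (x, y, w, h); the function names and the example boxes are illustrative choices, not values from the paper.

```python
def gcd_squared(pred, target):
    """Squared Gaussian Combined Distance of Eq. (3); boxes are (x, y, w, h)."""
    xp, yp, wp, hp = pred
    xt, yt, wt, ht = target
    # Terms normalized by the predicted box size.
    center_p = (xp - xt) ** 2 / wp ** 2 + (yp - yt) ** 2 / hp ** 2
    shape_p = (wp - wt) ** 2 / (4 * wp ** 2) + (hp - ht) ** 2 / (4 * hp ** 2)
    # The same terms normalized by the ground truth box size.
    center_t = (xt - xp) ** 2 / wt ** 2 + (yt - yp) ** 2 / ht ** 2
    shape_t = (wt - wp) ** 2 / (4 * wt ** 2) + (ht - hp) ** 2 / (4 * ht ** 2)
    return 0.5 * (center_p + shape_p + center_t + shape_t)


def scale_box(box, factor):
    """Scale centre and size together, as if the whole scene were enlarged."""
    return tuple(v * factor for v in box)


small_pred, small_gt = (10.0, 10.0, 8.0, 8.0), (12.0, 11.0, 10.0, 10.0)
large_pred, large_gt = scale_box(small_pred, 10), scale_box(small_gt, 10)

# Proportional deviations give the same distance regardless of absolute size.
print(gcd_squared(small_pred, small_gt))
print(gcd_squared(large_pred, large_gt))
```

Both printed values agree up to floating-point rounding, because every numerator and its denominator scale by the same factor.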

Since the value range of $D_{gc}^{2}$ is not bounded within [0,1], a non-linear transformation is applied to map it into a similarity metric $M_{gcd}$. This enables its application in the loss function and ensures compatibility with CIoU:

$$M_{gcd} = \exp\left(-\sqrt{D_{gc}^{2}(N_{p}, N_{t})}\right) \tag{4}$$

The corresponding GCD loss term $L_{gcd}$ is defined as:

$$L_{gcd} = 1 - M_{gcd} = 1 - \exp\left(-\sqrt{D_{gc}^{2}(N_{p}, N_{t})}\right) \tag{5}$$

Finally, by integrating CIoU and GCD, a joint bounding box regression loss function $L_{res}$ is constructed as follows:

$$L_{res} = \alpha L_{CIoU} + (1 - \alpha) L_{gcd} \tag{6}$$

where $\alpha$ is a balancing factor used to regulate the weights of the two losses during joint optimization. Through this joint loss function, the model preserves the excellent perception capability of CIoU for regular targets while significantly enhancing its robustness against positional offsets and scale variations in small targets. This effectively mitigates critical issues in traffic sign detection tasks, such as high miss rates and low detection accuracy for small targets.
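Equations (4) to (6) can be put together in a short scalar sketch. It is written under several assumptions: boxes are (x, y, w, h) with (x, y) the centre, the CIoU term follows the commonly used definition (IoU minus a normalized centre distance and an aspect-ratio penalty), the default balancing factor of 0.6 is the value reported for the loss comparison experiments, and the function names and `eps` constant are illustrative. A training implementation would operate on batched tensors instead.

```python
import math


def gcd_loss(pred, target):
    """L_gcd = 1 - exp(-sqrt(D_gc^2)), Eqs. (3)-(5); boxes are (x, y, w, h)."""
    xp, yp, wp, hp = pred
    xt, yt, wt, ht = target
    d2 = 0.5 * (
        (xp - xt) ** 2 / wp ** 2 + (yp - yt) ** 2 / hp ** 2
        + (wp - wt) ** 2 / (4 * wp ** 2) + (hp - ht) ** 2 / (4 * hp ** 2)
        + (xt - xp) ** 2 / wt ** 2 + (yt - yp) ** 2 / ht ** 2
        + (wt - wp) ** 2 / (4 * wt ** 2) + (ht - hp) ** 2 / (4 * ht ** 2)
    )
    return 1.0 - math.exp(-math.sqrt(d2))


def ciou_loss(pred, target, eps=1e-9):
    """Commonly used CIoU loss, 1 - CIoU, for centre-format boxes."""
    xp, yp, wp, hp = pred
    xt, yt, wt, ht = target
    px1, py1, px2, py2 = xp - wp / 2, yp - hp / 2, xp + wp / 2, yp + hp / 2
    tx1, ty1, tx2, ty2 = xt - wt / 2, yt - ht / 2, xt + wt / 2, yt + ht / 2
    inter = (max(0.0, min(px2, tx2) - max(px1, tx1))
             * max(0.0, min(py2, ty2) - max(py1, ty1)))
    iou = inter / (wp * hp + wt * ht - inter + eps)
    # Squared diagonal of the smallest enclosing box.
    c2 = ((max(px2, tx2) - min(px1, tx1)) ** 2
          + (max(py2, ty2) - min(py1, ty1)) ** 2 + eps)
    rho2 = (xp - xt) ** 2 + (yp - yt) ** 2      # squared centre distance
    v = (4 / math.pi ** 2) * (math.atan(wt / ht) - math.atan(wp / hp)) ** 2
    trade_off = v / (1 - iou + v + eps)         # CIoU's own weight, not alpha
    return 1.0 - (iou - rho2 / c2 - trade_off * v)


def joint_loss(pred, target, alpha=0.6):
    """L_res = alpha * L_CIoU + (1 - alpha) * L_gcd, Eq. (6)."""
    return alpha * ciou_loss(pred, target) + (1 - alpha) * gcd_loss(pred, target)


gt = (50.0, 50.0, 10.0, 10.0)
for dx in (0.0, 2.0, 8.0):
    print(dx, round(joint_loss((50.0 + dx, 50.0, 10.0, 10.0), gt), 4))
```

The loss is approximately zero for a perfect match and grows as the predicted centre drifts away from the ground truth.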

4.3. Detection Layer Reconstruction

In driving scenarios, the vast majority of traffic signs captured by wide-angle cameras are small-sized targets. To ensure that drivers have sufficient reaction time, the real-time detection of distant traffic signs is crucial, which inherently involves the processing of extremely small targets. The YOLO11n architecture employs three detection scales, generating feature maps through 8×, 16×, and 32× downsampling, respectively. However, after multiple downsampling operations, the feature information of distant tiny targets often suffers from severe degradation. Concurrently, the P5 layer (32× downsampling with a resolution of 20 × 20), which is primarily designed for large targets, yields limited benefits in tasks dominated by small targets. Instead, it introduces unnecessary computational overhead and parameter redundancy. To address this issue, the detection scales of the model are reconstructed by directly removing the original P5 layer and introducing a P2 tiny-target detection layer (with a resolution of 160 × 160 and 4× downsampling) into the neck network. Consequently, the improved model comprises a three-scale detection structure (160 × 160, 80 × 80, and 40 × 40), covering extremely small, small, and medium-sized targets.

Specifically, during the feature extraction and fusion stage, the computational branch deepening towards the P5 layer is pruned, which partially offsets the computational overhead introduced by the new P2 layer. Subsequently, cross-scale fusion is executed to integrate the shallow features from the backbone’s P2 layer—which contain high-frequency spatial details—with the semantically rich deep features obtained through upsampling in the neck network. The fused feature map is then fed into the C3k2_CGPEMA module to enhance feature representation and focus on the fine-grained information of tiny targets. Finally, this layer outputs through an independent branch, serving as a dedicated detection head for small targets. This structural optimization mitigates the issue of feature disappearance for tiny targets within deep networks. While maintaining acceptable inference efficiency, it significantly enhances the recall and precision for distant, extremely small targets.
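The effect of the reconstruction on the detection grids can be verified with simple arithmetic. The sketch below assumes a 640 × 640 input, which is implied by the stated resolutions (160 × 160 at 4× downsampling, 20 × 20 at 32×) but should still be treated as an assumption; the 16-pixel sign width is an arbitrary illustrative value and the function names are hypothetical.

```python
def grid_sizes(input_size, strides):
    """Side length of the detection grid for each downsampling stride."""
    return {stride: input_size // stride for stride in strides}


def cells_per_side(object_px, stride):
    """How many grid cells an object of the given pixel width spans."""
    return object_px / stride


baseline = grid_sizes(640, (8, 16, 32))   # P3, P4, P5
improved = grid_sizes(640, (4, 8, 16))    # P2, P3, P4

print("baseline:", baseline)              # {8: 80, 16: 40, 32: 20}
print("improved:", improved)              # {4: 160, 8: 80, 16: 40}

# A 16-pixel-wide sign spans 4 cells on P2 but only half a cell on P5.
for stride in (4, 8, 16, 32):
    print(f"stride {stride:>2}: {cells_per_side(16, stride):.1f} cells")
```

On the P2 grid a tiny sign still covers several cells, whereas on the P5 grid it falls within a fraction of a single cell, which is consistent with the limited benefit of the P5 layer for this task.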

5. Experiments

5.1. Dataset Processing

The experiments utilize the TT100K dataset, which comprises over 100,000 images covering diverse weather conditions, illumination variations, and road scenarios. Consistent with the data preprocessing strategies widely adopted in the literature [22,23,24,25], this study focuses on 45 categories that each contain more than 100 instances. This filtering strategy ensures that each evaluated category possesses sufficient samples for robust feature learning and statistically valid evaluation. Ultimately, a filtered experimental dataset comprising 9738 images and 24,212 annotated targets is obtained. As illustrated in Figure 5, several major categories, such as pn, pne, i5, p11, pl40, and pl50, each contain more than 1000 instances.

Figure 6 illustrates the density distribution of small targets within the filtered dataset. It can be observed that the vast majority of bounding boxes occupy less than 0.0025 of the total image area, indicating a significant prevalence of small targets in the dataset. Finally, during the experimental phase, to ensure a consistent category distribution across the training, validation, and test sets (partitioned at a ratio of 7:2:1), a stratified random sampling strategy is implemented across the 45 categories.
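A 7:2:1 stratified split of this kind can be sketched as follows. The paper does not describe how images containing several sign categories are assigned to a stratum, so the sketch assumes that each image has already been given a single category key; the function name, the seed, and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(items, ratios=(0.7, 0.2, 0.1), seed=0):
    """Split (image_id, category) pairs into train/val/test per category.

    Assumes one category key per image; returns three lists of image ids.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for image_id, category in items:
        by_category[category].append(image_id)

    train, val, test = [], [], []
    for category in sorted(by_category):      # fixed order for reproducibility
        ids = by_category[category]
        rng.shuffle(ids)
        n_train = round(len(ids) * ratios[0])
        n_val = round(len(ids) * ratios[1])
        train += ids[:n_train]
        val += ids[n_train:n_train + n_val]
        test += ids[n_train + n_val:]          # remainder goes to the test set
    return train, val, test


# Toy data: three categories with 100 images each.
items = [(f"{c}_{k}", c) for c in ("pn", "pne", "i5") for k in range(100)]
train, val, test = stratified_split(items)
print(len(train), len(val), len(test))        # 210 60 30
```

Splitting within each category, rather than over the pooled image list, keeps the category proportions approximately equal across the three subsets.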

5.2. Experimental Setup and Evaluation Metrics

5.2.1. Experimental Setup

The experiments were conducted on a platform running the Windows 11 operating system, equipped with an NVIDIA RTX 4070 Ti GPU (12 GB). The software environment was configured with Python 3.10, PyTorch 2.5.2, and CUDA 11.8. The specific training parameters are detailed in Table 1.

5.2.2. Evaluation Metrics

In the experiments, Precision (P), Recall (R), and mean Average Precision (mAP@0.5 and mAP@0.5:0.95) are utilized as evaluation metrics. Their specific calculation formulas are defined as follows:

$$P = \frac{TP}{TP + FP} \tag{7}$$

$$R = \frac{TP}{TP + FN} \tag{8}$$

$$AP = \int_{0}^{1} P(R) \, dR \tag{9}$$

$$mAP = \frac{1}{n} \sum_{i = 1}^{n} AP_{i} \tag{10}$$

$$mAP@0.5 = \frac{1}{n} \sum_{i = 1}^{n} AP@0.5_{i} \tag{11}$$

$$mAP@0.5{:}0.95 = \frac{1}{10} \sum_{t \in T} mAP@t \tag{12}$$

Here, TP, FP, and FN denote the number of true positives, false positives, and false negatives, respectively. The Average Precision (AP) is calculated as the area under the Precision-Recall curve, where P(R) denotes the precision at a given recall level R. Furthermore, n represents the total number of categories, and i is the category index, so that AP_i is the AP of the i-th category. The mean average precision (mAP) is the average of AP across all categories. Specifically, mAP@0.5 refers to the mAP evaluated at an Intersection over Union (IoU) threshold of 0.5. Additionally, mAP@0.5:0.95 represents the average mAP calculated over a predefined set of ten IoU thresholds, denoted as T (T = {0.5, 0.55, 0.6, …, 0.95}), where t represents an individual threshold within this set.
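The metrics in Equations (7) to (12) can be sketched in plain Python. The paper does not state how the integral in Equation (9) is evaluated numerically, so the sketch assumes the widely used all-point interpolation over the monotone precision envelope; the function names and the toy numbers are illustrative only.

```python
def precision_recall(tp, fp, fn):
    """Eqs. (7) and (8); returns 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(recalls, precisions):
    """Area under P(R), Eq. (9), via all-point interpolation (assumed scheme).

    `recalls` must be sorted in increasing order.
    """
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1]
               for i in range(len(mrec) - 1))


def mean_ap(ap_per_class):
    """Eqs. (10) and (11): mean of the per-category AP values."""
    return sum(ap_per_class) / len(ap_per_class)


def map_50_95(map_at_threshold):
    """Eq. (12): mean of mAP over the ten IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = [round(0.5 + 0.05 * k, 2) for k in range(10)]
    return sum(map_at_threshold[t] for t in thresholds) / len(thresholds)


print(precision_recall(tp=8, fp=2, fn=2))             # (0.8, 0.8)
print(average_precision([0.5, 1.0], [1.0, 0.5]))      # 0.75
print(mean_ap([0.75, 0.25]))                          # 0.5
```

Here `map_at_threshold` is a dictionary mapping each IoU threshold to the mAP computed at that threshold.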

5.3. Comparison of C3k2 Improvement Modules

To evaluate the effectiveness of the improvements made to the C3k2 module, comparative experiments involving different improvement strategies were conducted using YOLO11n as the baseline model. The experimental results are presented in Table 2.

As shown in Table 2, upon introducing the EMA mechanism, the number of model parameters increases by 1.49 × 10⁶. Although the precision (P) experiences a slight decrease of 0.5 points, the recall (R) and mAP@0.5 increase by 0.6 and 0.2 points, respectively. This indicates that while EMA assists in the recall of small targets to some extent, it introduces a substantial increase in the parameter count. When incorporating PConv, the parameter count increases by 0.96 × 10⁶, while P, R, and mAP@0.5 are improved by 0.8, 1.7, and 0.7 points, respectively. This demonstrates that PConv can effectively enhance detection performance while simultaneously reducing computational redundancy.

To further address the impact of straightforward module stacking, we introduced a simple fusion configuration (+PConv + EMA) without any cross-guidance mechanism. In this setup, features sequentially pass through PConv and a full-channel EMA module. While this independent combination yields a mAP@0.5 of 78.8%, the parameter count rebounds to 3.59 × 10⁶ due to the EMA module processing all feature channels.

Finally, by systematically integrating EMA and PConv through our proposed Cross-Branch Guidance mechanism, the C3k2_CGPEMA module is designed. Instead of independent processing, the EMA module in CGPEMA acts only on the 1/4 processing branch and uses zero-parameter element-wise masking to modulate the remaining 3/4 identity branch. Compared to the simple fusion (+PConv + EMA), the proposed CGPEMA achieves higher detection metrics (P, R, and mAP@0.5 of 80.7%, 71.2%, and 79.2%, respectively) while reducing the parameter count to 3.56 × 10⁶. These results correspond to improvements of 1.3, 2.5, and 1.4 points over the baseline model, respectively, and show that its overall performance surpasses that of the schemes employing either EMA or PConv individually.

5.4. Comparison of Loss Functions

To verify the enhancement in detection performance brought by the proposed GCD loss function, training was conducted on the TT100K dataset using YOLO11n as the baseline model. Six different bounding box regression loss configurations—CIoU, NWD + CIoU, GCD + CIoU, SIoU, MPDIoU, and Wise-IoU—were evaluated under identical conditions (with the balancing factor set to 0.6 for relevant configurations). The experimental results are presented in Table 3.

Table 3 shows that, compared with using CIoU alone, introducing NWD improves the precision (P), recall (R), and mAP@0.5 of the model by 0.9, 0.1, and 0.2 points, respectively. The other recent scale-oriented regression losses, namely SIoU, MPDIoU, and Wise-IoU, also improve P, R, and mAP@0.5 over the CIoU baseline, with Wise-IoU achieving an mAP@0.5 of 78.6%.

In contrast, when employing the proposed GCD + CIoU, the model's P, R, mAP@0.5, and mAP@0.5:0.95 reach the highest values of 81.1%, 69.6%, 78.8%, and 60.2%, respectively. These values represent improvements of 1.7, 0.9, 1.0, and 0.3 points over the CIoU baseline. GCD + CIoU also outperforms all other recent scale-oriented losses (SIoU, MPDIoU, and Wise-IoU) on every metric. By introducing symmetric normalization terms tailored to the scales of the predicted and ground truth boxes, GCD eliminates the influence of absolute target size on the distance metric. This ensures that small and large targets incur consistent loss weights when experiencing proportional deviations, which gives GCD both a theoretical and an experimental advantage over the other regression losses in handling scale variations for small targets.
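The scale argument can be illustrated with a toy metric. The sketch below is not the GCD formula; it is a simplified stand-in that divides the centre offset by a symmetric scale term (the geometric mean of the predicted and ground truth side lengths, an assumption chosen for illustration). The function names and box values are hypothetical.

```python
import math


def absolute_offset(pred: tuple, gt: tuple) -> float:
    """Plain Euclidean distance between box centres, in pixels."""
    return math.hypot(pred[0] - gt[0], pred[1] - gt[1])


def normalized_offset(pred: tuple, gt: tuple) -> float:
    """Centre distance with each axis divided by a symmetric scale term.

    Stand-in normalizer (an assumption, not the GCD definition):
    the geometric mean of the predicted and ground truth side lengths.
    """
    sx = math.sqrt(pred[2] * gt[2])
    sy = math.sqrt(pred[3] * gt[3])
    return math.hypot((pred[0] - gt[0]) / sx, (pred[1] - gt[1]) / sy)


def scale(box: tuple, s: float) -> tuple:
    """Scale every coordinate of a (cx, cy, w, h) box by the factor s."""
    return tuple(v * s for v in box)


# A small box pair and the same pair enlarged tenfold (proportional deviation).
small_pred, small_gt = (12.0, 10.0, 8.0, 8.0), (10.0, 10.0, 10.0, 10.0)
large_pred, large_gt = scale(small_pred, 10), scale(small_gt, 10)

print(absolute_offset(small_pred, small_gt), absolute_offset(large_pred, large_gt))
print(normalized_offset(small_pred, small_gt), normalized_offset(large_pred, large_gt))
```

The absolute offset grows tenfold with the box size, whereas the normalized offset is the same for both pairs, which is the behaviour the paragraph above attributes to a scale-normalized distance.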

As can be observed from the curves in Figure 7, under the same number of training epochs, the loss function employing GCD + CIoU exhibits the fastest convergence rate and achieves the lowest final loss value among all six evaluated configurations. During the initial training phase (0–100 epochs), the magnitude of the loss reduction for GCD + CIoU is significantly greater than those of CIoU, NWD + CIoU, and the other recent scale-oriented losses (SIoU, MPDIoU, and Wise-IoU). This indicates that it can more efficiently optimize bounding box regression, reduce localization errors, and enhance gradient update efficiency, thereby accelerating model convergence. By the end of the 300 epochs of training, the final loss value of GCD + CIoU remains significantly lower than those of all other five loss functions. This further verifies that the symmetric normalization term introduced by GCD effectively mitigates the impact of absolute target size. Consequently, it makes the loss more robust to the scale variations in both small and large targets, ultimately yielding superior optimization performance and more precise bounding box fitting capabilities.

To evaluate the impact of the weighting factor α in the proposed GCD loss function, a hyperparameter sensitivity analysis was conducted. The value of α was varied from 0.4 to 0.8 with an interval of 0.1. The corresponding experimental results on the TT100K dataset are summarized in Table 4.

As indicated in Table 4, the overall detection performance exhibits an initial increase followed by a slight decline as α scales up, attaining the optimal values at α = 0.6. Under this optimal setting, the Precision (P), Recall (R), mAP@0.5, and mAP@0.5:0.95 reach 81.1%, 69.6%, 78.8%, and 60.2%, respectively. Concurrently, the performance metrics manifest high stability across the tested spectrum. Specifically, the core metric mAP@0.5 fluctuates within a minor margin of only 0.7 points (ranging from 78.1% to 78.8%). Importantly, even at the boundary values of the tested range (e.g., α = 0.4 and α = 0.8), the proposed GCD loss consistently outperforms the CIoU. For instance, the lowest mAP@0.5 across all tested α configurations is 78.1%, maintaining a stable 0.3 points improvement over the baseline. This minimal fluctuation and consistent superiority over the baseline demonstrate the strong robustness of the GCD method regarding parameter variations, indicating that the effectiveness of the proposed loss formulation in small-target localization is primarily derived from its structural scale-invariance rather than specific parameter configurations.
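The spread and margin quoted above follow directly from the tabulated values. A short check, using the mAP@0.5 column of Table 4 and the CIoU row of Table 3 (the variable names are illustrative):

```python
# mAP@0.5 (%) per alpha, copied from Table 4; CIoU baseline from Table 3.
map50_by_alpha = {0.4: 78.1, 0.5: 78.5, 0.6: 78.8, 0.7: 78.6, 0.8: 78.3}
ciou_baseline = 77.8

best_alpha = max(map50_by_alpha, key=map50_by_alpha.get)
spread = max(map50_by_alpha.values()) - min(map50_by_alpha.values())
worst_margin = min(map50_by_alpha.values()) - ciou_baseline

print(f"best alpha: {best_alpha}")
print(f"mAP@0.5 spread across alpha: {spread:.1f} points")
print(f"smallest margin over CIoU: {worst_margin:.1f} points")
```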

5.5. Ablation Experiment

To verify the effectiveness of each proposed improvement and to isolate their individual contributions from potential interactions, an ablation study (Models ①–⑧) was conducted on the TT100K dataset using YOLO11n as the baseline model. The three improvements (reconstructing the detection scales, introducing the C3k2_CGPEMA module, and optimizing the loss function with GCD) are denoted as A, B, and C, respectively. Model ① is the YOLO11n baseline, Models ②–④ apply a single improvement each, Models ⑤–⑦ are pairwise combinations, and Model ⑧ incorporates all three. The experimental results are presented in Table 5.

As shown in Table 5, introducing the individual modules independently (Models ②, ③, and ④) yields consistent performance improvements over the baseline, proving their standalone effectiveness. After introducing the scale reconstruction strategy into the detection layer (Model ②), the model is better equipped to extract and utilize the shallow feature information of small targets. Compared with the YOLO11n baseline model, the precision (P), recall (R), and mAP@0.5 are improved by 1.9, 3.3, and 2.3 points, respectively, highlighting the advantages of this network structural optimization in capturing extremely small targets. When solely introducing the novel feature extraction module C3k2_CGPEMA (Model ③), the improved model exhibits significantly enhanced capabilities in feature focusing and extraction for tiny traffic signs. This is attributed to the synergistic effect between the compulsory high-frequency spatial detail preservation mechanism of PConv and the cross-dimensional feature aggregation capability of the EMA mechanism. Compared with the baseline model, P, R, and mAP@0.5 increase by 1.3, 2.5, and 1.4 points, respectively, effectively reducing the miss rate of traffic signs in complex backgrounds. By introducing the combined CIoU and GCD loss (Model ④), the model utilizes the scale invariance of GCD alongside the joint optimization of bounding box position and morphology, while preserving the stable convergence characteristics of CIoU. Without introducing any additional model parameters or computational overhead, P, R, and mAP@0.5 are improved by 1.7, 0.9, and 1.0 points, respectively. This indicates that the combined loss can effectively optimize the bounding box regression accuracy for small targets.

To examine the interactions among the three components and rule out coincidental gains, Models ⑤–⑦ evaluate their pairwise combinations. Every combination of two modules achieves further gains over both of its single-module counterparts. For example, Model ⑤ (A + B) achieves an mAP@0.5 of 82.0%, which is higher than both Model ② (80.1%) and Model ③ (79.2%). This consistent upward trend across Models ⑤, ⑥, and ⑦ indicates that the proposed modules have low functional redundancy and complement rather than conflict with one another.

Ultimately, the algorithm integrating the C3k2_CGPEMA module, the combined CIoU and GCD loss, and the detection layer scale reconstruction (Model ⑧) achieves the best performance. Compared with the YOLO11n baseline model, P, R, mAP@0.5, and mAP@0.5:0.95 are improved by 5.4, 7.4, 6.6, and 4.6 points, respectively. Additionally, while the computational cost increases to 13.5 GFLOPs, the parameter count is slightly reduced to 2.46 M due to the structural optimization of the scale reconstruction. This step-by-step experiment validates the effectiveness, positive interaction, and superiority of the proposed improvements for the small-target traffic sign detection task on this dataset.
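One way to read the interaction among the components is to compare each measured gain with the sum of the corresponding single-module gains. The sketch below does this for the mAP@0.5 column of Table 5; the dictionary keys and the helper name are illustrative.

```python
# mAP@0.5 (%) from Table 5, keyed by the set of enabled improvements.
map50 = {
    "": 77.8, "A": 80.1, "B": 79.2, "C": 78.8,
    "AB": 82.0, "AC": 81.7, "BC": 81.5, "ABC": 84.4,
}
baseline = map50[""]


def gain(combo: str) -> float:
    """Improvement of a configuration over the baseline, in points."""
    return round(map50[combo] - baseline, 1)


for combo in ("AB", "AC", "BC", "ABC"):
    additive = round(sum(gain(part) for part in combo), 1)
    print(f"{combo:>3}: measured +{gain(combo)}, sum of single gains +{additive}")
```

For every combination the measured gain is larger than the sum of the individual gains (for the full model, 6.6 points against 4.7), which is consistent with the components reinforcing one another rather than overlapping.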

5.6. Comparative Experiments

To verify the comprehensive performance and advantages of the improved algorithm in the small-target traffic sign detection task, comparative experiments were conducted against mainstream object detection algorithms on the TT100K dataset. The compared models encompass the classical two-stage algorithm Faster R-CNN, as well as one-stage algorithms including YOLOv5n, YOLOv7-tiny, YOLOv8n, YOLOv10n, YOLO26n, YOLO11n and YOLO11s. To comprehensively evaluate the feasibility of practical deployment, the frames per second (FPS) metric was measured to assess inference speed. In the FPS testing phase, to ensure accurate measurements, the batch size was set to 1. After a 200-iteration warm-up period for the model, 1000 inference latency tests were executed. The detailed comparative results are presented in Table 6.
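The timing protocol described above (batch size 1, a warm-up phase, then repeated timed runs) can be sketched as follows. This is a minimal harness under stated assumptions, not the measurement script used for Table 6: `measure_fps` and `dummy_infer` are hypothetical names, the dummy workload stands in for a real forward pass, and total elapsed time is averaged rather than per-call latencies being recorded.

```python
import time
from typing import Callable


def measure_fps(infer: Callable[[], object], warmup: int = 200, runs: int = 1000) -> float:
    """Return frames per second for a single-image inference callable."""
    for _ in range(warmup):      # warm-up iterations are not timed
        infer()
    start = time.perf_counter()
    for _ in range(runs):        # timed iterations, one image per call (batch size 1)
        infer()
    elapsed = time.perf_counter() - start
    return runs / elapsed


def dummy_infer() -> int:
    """Hypothetical stand-in for model(image); replace with a real forward pass."""
    return sum(range(1000))


if __name__ == "__main__":
    print(f"{measure_fps(dummy_infer):.1f} FPS")
```

When the model runs on a GPU, kernels are launched asynchronously, so the device should be synchronized before each clock read (for example with `torch.cuda.synchronize()` in PyTorch); otherwise the measured time reflects launch overhead rather than inference time.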

As shown in Table 6, compared with Faster R-CNN, the YOLO series algorithms achieve superior detection accuracy while maintaining lightweight models. Specifically, the mAP@0.5 scores of YOLOv5n and YOLOv7-tiny are 69.5% and 73.7%, respectively. With continuous version iterations, YOLOv8n and YOLOv10n further elevate the mAP@0.5 to 76.4% and 77.6%, while YOLO26n pushes the performance boundaries further, reaching 80.8%.

The baseline model YOLO11n performs acceptably on this dataset, achieving an mAP@0.5 of 77.8% and an inference speed of 173.2 FPS. The scaled-up YOLO11s achieves an mAP@0.5 of 82.0%, but it requires a much higher model complexity with 9.40 M parameters and 21.5 GFLOPs. The improved algorithm (Ours) further increases the mAP@0.5 to 84.4%, outperforming YOLO11s by 2.4 points while using significantly fewer parameters (2.46 M) and GFLOPs (13.5). However, this performance enhancement is accompanied by a shift in computational cost: although the total number of parameters decreases slightly by 0.13 M compared to YOLO11n, the GFLOPs increase from 6.5 to 13.5, and the inference speed consequently drops to 114.5 FPS. The increase in computational complexity primarily stems from the introduction of the high-resolution P2 detection head, a structure that is crucial for capturing extremely small traffic signs. Even so, the inference speed of 114.5 FPS still far exceeds the real-time processing requirements of autonomous driving scenarios (typically > 30 FPS), thereby preserving real-time performance for practical deployment while significantly improving accuracy.
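The cost of the P2 head can be made concrete by counting prediction locations. The sketch below assumes the standard YOLO pyramid strides (4, 8, 16, and 32 for P2 to P5) and the 640 × 640 input from Table 1; the function name is illustrative, and the cell count is only a rough proxy for head cost, not a GFLOPs estimate.

```python
INPUT_SIZE = 640  # input resolution from Table 1

# Standard YOLO pyramid strides; P2 is the level added, P5 the level removed.
STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}


def grid_cells(levels: list[str]) -> dict[str, int]:
    """Number of prediction locations (grid cells) per detection level."""
    return {lvl: (INPUT_SIZE // STRIDES[lvl]) ** 2 for lvl in levels}


baseline = grid_cells(["P3", "P4", "P5"])
proposed = grid_cells(["P2", "P3", "P4"])

print("baseline:", baseline, "total", sum(baseline.values()))
print("proposed:", proposed, "total", sum(proposed.values()))
print("ratio   :", sum(proposed.values()) / sum(baseline.values()))
```

Swapping P5 for P2 quadruples the number of locations the heads must evaluate (33,600 against 8,400), with the 160 × 160 P2 grid alone accounting for 25,600 of them. The reported GFLOPs grow by a smaller factor (6.5 to 13.5) because the detection heads are only part of the total network cost.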

As indicated by the comprehensive performance comparison chart in Figure 8, the improved algorithm enhances the recognition accuracy of the model. It not only mitigates the issue of missed detections prevalent in small-target traffic sign detection but also effectively suppresses interference from similar targets. It exhibits superior object recognition capabilities, making it highly suitable for small-target traffic sign detection tasks. For intuitive comparison, all coordinate axes in the figure are unified such that larger values indicate better performance. Specifically, P, R, mAP@0.5, and mAP@0.5:0.95 are plotted directly using their original percentages. Since FPS inherently follows the “larger is better” principle, it is linearly mapped to the range of 60–90 through normalization to maintain visual consistency. Regarding GFLOPs and parameters, where smaller original values represent higher efficiency, the figure displays the results of their reciprocals after linear scaling (also normalized to the 60–90 range). Therefore, a larger radius reflects lower computational complexity and a smaller parameter count.
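The axis scaling described above can be sketched as a min-max mapping onto the 60–90 range, applied either to the raw values or to their reciprocals. The anchoring to the minimum and maximum of the plotted models is an assumption (the text states only that the mapping is linear), the function name is hypothetical, and only three rows of Table 6 are used for brevity.

```python
def to_radar_range(values: dict[str, float], lo: float = 60.0, hi: float = 90.0,
                   invert: bool = False) -> dict[str, float]:
    """Min-max map values onto [lo, hi]; with invert=True, map reciprocals instead."""
    scores = {k: (1.0 / v if invert else v) for k, v in values.items()}
    s_min, s_max = min(scores.values()), max(scores.values())
    span = s_max - s_min  # assumes at least two distinct values
    return {k: lo + (hi - lo) * (s - s_min) / span for k, s in scores.items()}


# FPS and GFLOPs for three rows of Table 6.
fps = {"YOLO11n": 173.2, "YOLO11s": 146.9, "Ours": 114.5}
gflops = {"YOLO11n": 6.5, "YOLO11s": 21.5, "Ours": 13.5}

print(to_radar_range(fps))                  # larger FPS -> larger radius
print(to_radar_range(gflops, invert=True))  # smaller GFLOPs -> larger radius
```

Under this mapping the best model on each axis sits at 90 and the worst at 60, so every axis reads "further out is better" regardless of the original direction of the metric.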

5.7. Generalization Experiments

To verify the generalization capability of the improved algorithm across different datasets, generalization experiments were conducted on the CCTSDB dataset. The CCTSDB (CSUST Chinese Traffic Sign Detection Benchmark) dataset contains a diverse collection of Chinese traffic sign images captured under various real-world driving conditions, encompassing different weather, lighting, and complex background scenarios. It primarily includes three categories of traffic signs: warning, prohibitory, and mandatory. The experimental results are presented in Table 7 and Table 8.

As shown in Table 7, compared with YOLO11n, the improved algorithm (Ours) increases the precision (P), recall (R), mAP@0.5, and mAP@0.5:0.95 by 1.0, 2.4, 1.1, and 2.2 points, respectively. Its stable performance across different datasets validates the strong generalization capability and practical value of the algorithm.

Table 8 illustrates the comparison of detection performance on mandatory, prohibitory, and warning signs before and after the model improvement. Compared with YOLO11n, the improved algorithm can detect these three categories of signs more accurately, with the mAP@0.5 increasing by 0.6, 1.9, and 0.9 points, respectively.

To intuitively compare the training results of the generalization experiments before and after the model improvements, comparative curves of the generalization experiments are plotted in Figure 9. Throughout the 300 training epochs, the P curve of the improved model remains stably higher than that of the baseline model, and its convergence value in the later stage of training is superior. This indicates that the improved model possesses higher target recognition precision. Furthermore, the R curve of the improved model not only ascends faster but also converges to a higher value than the baseline model, reflecting a stronger sample recall capability and a lower miss rate. Regarding the mAP@0.5 curve, which serves as a comprehensive evaluation metric, the advantage of the improved model is evident. Its curve consistently remains above that of the baseline model, and the margin between the two remains stable throughout the training process, verifying the enhancement in the overall detection performance. In addition, judging from the trends of the three curves, the metrics of both models increase rapidly during the initial training phase (0–100 epochs) and gradually converge after 100 epochs. This demonstrates that while maintaining training stability, the improved model successfully breaks through the performance ceiling of the baseline model. Ultimately, the generalization experimental results fully prove the validity of the proposed model improvement strategies.

5.8. Analysis of Visualization Results

To visually compare detection performance on the TT100K dataset before and after improvements, a visualization analysis was conducted for YOLO11n and the proposed algorithm under three typical scenarios: strong illumination, low light, and dense target distributions. The results are shown in Figure 10 ((a) YOLO11n, (b) the proposed algorithm), where green, blue, and red bounding boxes denote correct, false, and missed detections, respectively.

In strong illumination scenarios, overexposure and reflections cause severe loss of image details, posing a stringent challenge to feature extraction; while YOLO11n exhibits severe missed detections, the proposed algorithm correctly detects all targets in the test samples without any false or missed detections, and its confidence scores are significantly improved, demonstrating excellent capability in preserving features under intense lighting. In low-light scenarios, insufficient illumination degrades traffic sign clarity; YOLO11n yields lower detection accuracy and falsely detects a pl40 sign as a pl50 sign, whereas the proposed algorithm effectively overcomes feature degradation induced by poor illumination, correctly identifying all targets with no missed detections and maintaining high detection accuracy, thereby showing strong robustness under extreme lighting conditions. In scenarios with dense target distributions, the comparative results in Figure 10 reveal that the original model experiences false and missed detections for pm30 signs in densely populated areas and misclassifies a p3 sign as a p26 sign; conversely, the proposed algorithm successfully suppresses mutual interference among densely packed targets and achieves precise recognition of all targets, exhibiting excellent discriminative capability. Overall, the proposed algorithm delivers superior detection performance across all evaluated scenarios.

To demonstrate the architectural superiority for traffic sign detection, Figure 11 visualizes the effective receptive fields (ERFs) extracted from the fully trained weights of both the baseline and proposed models. The top row displays the proposed model (P2, P3, and P4 layers), while the bottom row illustrates the baseline YOLO11n (P3, P4, and P5 layers). By replacing the P5 layer with the high-resolution P2 layer, our model achieves a tightly bounded ERF that precisely envelops small-to-medium signs. This effectively eliminates the severe background clutter (e.g., surrounding trees) caused by the over-diffused P5 activation in the baseline. Furthermore, at equivalent scales (P3 and P4), the proposed ERFs exhibit a significantly higher energy concentration toward the target center compared to the baseline’s scattered patterns. This centralized focus optimally calibrates the receptive field, enhancing the signal-to-noise ratio and directly contributing to the improved detection accuracy.

6. Conclusions

To address the challenges of scarce feature information, complex background interference, and high miss rates in long-range, small-target traffic sign detection, this paper proposes a real-time traffic sign detection algorithm based on an improved YOLO11n network. Through multiple optimization strategies, the perception and localization capabilities of the model for tiny targets are significantly enhanced. First, a novel cross-guided feature extraction module, C3k2_CGPEMA, is designed in the neck network. By embedding the EMA mechanism into the localized feature extraction branch of PConv, deep coupling between high-frequency spatial detail preservation and background noise suppression is achieved via a cross-branch guided mask. Second, a joint bounding box regression loss function combining CIoU and GCD is introduced, which leverages the scale invariance of GCD to overcome gradient instability issues during small-target regression while preserving the stable convergence characteristics of CIoU. Finally, the scale of the detection layer is reconstructed by removing the P5 layer designed for large targets and introducing a high-resolution P2 layer, enabling the network to focus on the fine-grained features of distant, extremely small targets.

Experimental results demonstrate that the proposed algorithm achieves a precision, recall, and mAP@0.5 of 84.8%, 76.1%, and 84.4% on the TT100K dataset, representing improvements of 5.4, 7.4, and 6.6 points over the baseline YOLO11n model, respectively. Generalization experiments on the CCTSDB dataset further validate the robustness of the improved algorithm in complex environments, with the mAP@0.5 reaching 99.3%, a 1.1-point increase compared to that of YOLO11n. Although the computational complexity of the model has increased, it still maintains an inference speed of 114.5 FPS, which far exceeds real-time detection requirements (>30 FPS), thereby achieving an excellent balance between accuracy and efficiency.

Despite the distinct advantages demonstrated by the proposed algorithm, certain limitations remain. Future work will primarily focus on three directions. First, the reliability of the model in extreme weather scenarios (such as heavy rain and snow) will be further enhanced through data augmentation and multimodal fusion to improve the safety of all-weather autonomous driving. Second, model compression techniques, including structural pruning, knowledge distillation, and quantization, will be thoroughly investigated to drastically reduce the model size and computational latency while maintaining detection accuracy. This will facilitate the efficient deployment of the algorithm on resource-constrained edge devices in subsequent stages. Third, while the current evaluation effectively addresses the spatial localization of tiny objects and primarily focuses on the 45 most common categories to ensure statistical stability, real-world traffic signs often exhibit a severe long-tailed distribution. Addressing the semantic sample imbalance of extremely rare classes via independent approaches, such as few-shot learning or long-tail data augmentation, represents another crucial direction for our future research.

Author Contributions

Conceptualization, Y.L. and H.N.; methodology, Y.L.; software implementation, model training and experimental validation, Y.L.; formal analysis and investigation, Y.L.; dataset processing and curation, Y.L., C.N. and Z.D.; resources and project administration, H.N.; writing—original draft preparation, Y.L.; writing—review and editing, Y.L. and H.N.; visualization, Y.L. and J.G.; overall research guidance and correspondence, H.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Research and Development Program of Shaanxi Province, grant number 2023-YBGY-120 (Project Name: Research on Intelligent Vehicle Safety Driving Model Based on Road Alignment).

Data Availability Statement

The publicly available TT100K dataset analyzed in this study is accessible at https://cg.cs.tsinghua.edu.cn/traffic-sign/ (accessed on 22 December 2025).

Conflicts of Interest

The authors declare no conflicts of interest.


Figure 1. Improved YOLO11n network structure.

Figure 2. Architecture of C3k2_CGPEMA.

Figure 3. Architecture of C3k_CGPEMA.

Figure 4. CGPEMA and EMA structures. (a) CGPEMA; (b) EMA.

Figure 5. Category distribution of the TT100K dataset.

Figure 6. Density heatmap of small targets.

Figure 7. Loss comparison curves.

Figure 8. Comprehensive performance comparison. The red stars (★) indicate the metrics where our proposed model achieves the best performance.

Figure 9. Comparison curves of the generalization experiment.

Figure 10. Comparison of visualized detection results. (a) YOLO11n; (b) Proposed algorithm.

Figure 11. ERF visualizations of the proposed multi-scale detection heads versus the baseline.

Table 1. Parameter configuration.

Parameter	Configuration
imgsz	640 × 640
epochs	300
batch	16
workers	8
amp	False
optimizer	SGD
lr0	0.001
momentum	0.937
weight_decay	0.0005
warmup_epochs	3

Table 2. Comparison experiment of C3k2.

Model	P/%	R/%	mAP@0.5/%	Params/10⁶
YOLO11n	79.4	68.7	77.8	2.59
+EMA	78.9	69.3	78.0	4.08
+PConv	80.2	70.4	78.5	3.55
+PConv + EMA	80.3	70.6	78.8	3.59
+CGPEMA	80.7	71.2	79.2	3.56

Table 3. Comparison experiment of loss functions.

Loss	P/%	R/%	mAP@0.5/%	mAP@0.5:0.95/%
CIoU	79.4	68.7	77.8	59.9
NWD + CIoU	80.3	68.8	78.0	59.4
GCD + CIoU	81.1	69.6	78.8	60.2
SIoU	79.8	69.0	78.1	59.7
MPDIoU	80.1	69.2	78.4	60.0
Wise-IoU	80.6	69.4	78.6	60.1

Table 4. Sensitivity analysis of the hyperparameter α in the GCD loss function.

α	P/%	R/%	mAP@0.5/%	mAP@0.5:0.95/%
0.4	80.5	68.9	78.1	59.5
0.5	80.6	69.4	78.5	59.8
0.6	81.1	69.6	78.8	60.2
0.7	80.9	69.5	78.6	60.0
0.8	80.4	69.2	78.3	59.6

Table 5. Ablation Study ("√" and "-" indicate the inclusion and exclusion of the component, respectively).

Model	A	B	C	P/%	R/%	mAP@0.5/%	mAP@0.5:0.95/%	GFLOPs	Params/10⁶
①	-	-	-	79.4	68.7	77.8	59.9	6.5	2.59
②	√	-	-	81.3	72.0	80.1	61.3	10.1	1.95
③	-	√	-	80.7	71.2	79.2	60.4	8.9	3.56
④	-	-	√	81.1	69.6	78.8	60.2	6.5	2.59
⑤	√	√	-	81.6	73.9	82.0	62.5	13.5	2.46
⑥	√	-	√	81.9	74.5	81.7	61.5	10.1	1.95
⑦	-	√	√	81.4	74.1	81.5	61.3	8.9	3.56
⑧	√	√	√	84.8	76.1	84.4	64.5	13.5	2.46

Table 6. Comparison experiment.

Model	P/%	R/%	mAP@0.5/%	mAP@0.5:0.95/%	GFLOPs	Params/10⁶	FPS
Faster R-CNN	62.2	57.4	61.3	41.6	343.77	109.65	28.5
YOLOv5n	69.7	61.8	69.5	48.7	4.4	1.82	203.4
YOLOv7-tiny	72.7	67.9	73.7	53.9	13.6	6.13	140.7
YOLOv8n	78.4	68.4	76.4	59.8	8.1	3.01	182.8
YOLOv10n	78.6	69.1	77.6	58.2	8.3	2.71	164.1
YOLO26n	81.9	72.7	80.8	60.4	5.9	2.52	167.6
YOLO11n	79.4	68.7	77.8	59.9	6.5	2.59	173.2
YOLO11s	82.3	74.6	82.0	62.1	21.5	9.4	146.9
Ours	84.8	76.1	84.4	64.5	13.5	2.46	114.5

Table 7. Generalization experiment.

Model	P/%	R/%	mAP@0.5/%	mAP@0.5:0.95/%
YOLO11n	97.3	95.2	98.2	79.2
Ours	98.3	97.6	99.3	81.4

Table 8. Comparison of detection results for three categories of traffic signs before and after model improvement.

Model	Mandatory P/%	Mandatory R/%	Mandatory mAP@0.5/%	Prohibitory P/%	Prohibitory R/%	Prohibitory mAP@0.5/%	Warning P/%	Warning R/%	Warning mAP@0.5/%
YOLO11n	97.2	95.6	98.7	97.0	93.6	97.3	97.6	96.5	98.5
Ours	98.4	97.9	99.3	98.0	97.0	99.2	98.4	97.8	99.4

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

