Article

Ghost-YOLO-GBH: A Lightweight Framework for Robust Small Traffic Sign Detection via GhostNet and Bidirectional Multi-Scale Feature Fusion

1 College of Traffic & Transportation, Chongqing Jiaotong University, Chongqing 400074, China
2 School of Mechatronics and Vehicle Engineering, Chongqing Jiaotong University, Chongqing 400074, China
3 School of Civil Engineering, Central South University, Changsha 410075, China
4 Jiangxi Provincial Key Laboratory of Traffic Infrastructure Safety, East China Jiaotong University, Nanchang 310013, China
* Author to whom correspondence should be addressed.
Eng 2025, 6(8), 196; https://doi.org/10.3390/eng6080196
Submission received: 26 June 2025 / Revised: 2 August 2025 / Accepted: 4 August 2025 / Published: 7 August 2025
(This article belongs to the Special Issue Artificial Intelligence for Engineering Applications, 2nd Edition)

Abstract

Traffic safety is a significant global concern, and traffic sign recognition (TSR) is essential for the advancement of intelligent transportation systems. Traditional YOLO11s-based methods often struggle to balance detection accuracy and processing speed, particularly in the context of small traffic signs within complex environments. To address these challenges, this study presents Ghost-YOLO-GBH, an innovative lightweight model that incorporates three key enhancements: (1) the integration of a GhostNet backbone, which substitutes the conventional YOLO11s architecture and utilizes Ghost modules to exploit feature redundancy, resulting in a 40.6% reduction in computational load while ensuring effective feature extraction for small targets; (2) the development of a HybridFocus module that combines large separable kernel attention with multi-scale pooling, effectively minimizing background interference and improving contextual feature aggregation by 4.3% in isolated tests; and (3) the implementation of a Bidirectional Dynamic Multi-Scale Feature Pyramid Network (BiDMS-FPN) that allows for bidirectional cross-stage feature fusion, significantly enhancing the accuracy of small target detection. Experimental results on the TT100K dataset indicate that Ghost-YOLO-GBH achieves an impressive 81.10% mean Average Precision (mAP) at a threshold of 0.5, along with an 11.7% increase in processing speed (45 FPS) and an 18.2% reduction in model parameters (7.74 M) compared to the baseline YOLO11s. Overall, Ghost-YOLO-GBH effectively balances accuracy, efficiency, and lightweight deployment, demonstrating superior performance in real-world applications characterized by small signs and cluttered backgrounds. This research provides a novel framework for resource-constrained TSR applications, contributing to the evolution of intelligent transportation systems.

1. Introduction

Traffic safety is of paramount importance, with the World Health Organization reporting approximately 1.19 million fatalities due to road traffic accidents each year. This statistic underscores the persistent global public safety challenge presented by traffic incidents. A safer road environment not only mitigates accidents but also alleviates traffic congestion, thereby enhancing logistics, facilitating the movement of goods and services, and stimulating economic growth [1]. Recent advancements in intelligent transportation systems have prioritized the enhancement of road safety, with traffic sign recognition emerging as a critical element. This technology is indispensable for minimizing violations, improving driving efficiency, and ensuring safety. As vehicles increasingly depend on automated systems, the demand for precise and efficient traffic sign recognition has intensified [2]. However, this undertaking encounters challenges such as recognizing signs from long distances, adapting to varying weather conditions, fluctuating light levels, sign occlusion, and the diversity of sign designs [3,4]. These factors require recognition algorithms that combine high accuracy and robustness with low computational complexity and a small parameter count, so that models can be deployed smoothly on edge devices such as in-vehicle terminals. Consequently, investigating traffic sign recognition under real-world conditions is essential for reducing accidents and improving overall driving safety.
Deep learning techniques have made substantial progress in computer vision, particularly with the advancement of CNN-based object detection algorithms [5]. The YOLO series has gained prominence in object detection tasks due to its efficiency and accuracy [6]. Algorithms including R-CNN, SSD, YOLOv3, YOLOv5, YOLOv7, and YOLOv8 have been widely adopted for traffic sign recognition [7]. Recent studies have proposed various enhancements to these algorithms, including the integration of color and shape information for small traffic sign localization, the use of depthwise separable convolution to reduce model complexity, and the implementation of advanced loss functions to improve detection precision and bounding box localization [8,9,10]. For example, Li and Wang [11] employed the Faster R-CNN framework with a MobileNet architecture to design a detector that refines the localization of small traffic signs, which are often challenging to regress accurately. Wang et al. [12] proposed an improved lightweight traffic sign recognition algorithm based on YOLOv4-Tiny, enhancing detection accuracy and recall by optimizing anchor box generation through K-means clustering and leveraging low-level feature information for small target detection. Haque et al. [13] introduced a novel lightweight CNN architecture named DeepThin for traffic sign recognition, designed to operate without GPU requirements. This architecture employs overlapping max pooling, sparse striding, and ensemble learning to achieve high accuracy while significantly reducing the parameter count. Overall, these advancements reflect a concerted effort to enhance detection accuracy and efficiency in traffic sign recognition, addressing the challenges posed by diverse and complex real-world environments.
Despite these advancements, existing algorithms still struggle to balance detection accuracy and speed in small target recognition tasks, which is essential for practical traffic sign recognition applications. To address this issue, this paper introduces an enhanced YOLO11s model, termed Ghost-YOLO-GBH. The primary contributions of this work are as follows: (1) The backbone network of the original YOLO is replaced with a lightweight and efficient GhostNet, which generates redundant feature maps through the Ghost module. This modification reduces the computational load and significantly improves inference speed on mobile terminals while preserving accuracy. (2) Fast spatial pyramid pooling is integrated with a large separable kernel attention module, which effectively suppresses interference from irrelevant background information, allowing the model to concentrate on the essential characteristics of traffic signs and improving recognition accuracy in complex environments. (3) A novel feature pyramid network, BiDMS-FPN, is developed to replace the traditional feature pyramid; it enhances the detection accuracy of small targets through bidirectional dynamic multi-scale fusion and cross-stage feature reuse. Collectively, these improvements allow the Ghost-YOLO-GBH algorithm to effectively balance detection accuracy and speed, thereby better addressing the needs of practical traffic sign recognition applications.

2. Related Work

2.1. Overview of YOLO Series Algorithms and Their Development

Since its inception, the YOLO algorithm has emerged as a leading and influential method for object detection. Redmon et al. [14] highlighted that YOLO revolutionized traditional object detection by framing the task as a regression problem. Unlike conventional two-stage detection methods such as R-CNN, YOLO directly predicts bounding boxes and class probabilities from the input image. This innovative approach enables YOLO to achieve real-time performance and high efficiency, making it a widely adopted solution across various computer vision applications.
The YOLO series has undergone multiple iterations, each introducing substantial advancements and improvements. The original YOLOv1 employed a single-scale fully convolutional neural network for object detection, demonstrating impressive real-time capabilities and operational efficiency. Building on this foundation, YOLOv2 incorporated multi-scale feature maps and anchor boxes, significantly enhancing detection accuracy and robustness.
YOLOv3 further refined the architecture by integrating a deeper Darknet-53 backbone for feature extraction and incorporating multi-scale predictions alongside cross-scale connections [15]. These enhancements improved both detection speed and precision. Subsequently, YOLOv4 adopted the CSPDarknet53 backbone and spatial pyramid pooling (SPP) structures, resulting in notable gains in detection performance and inference speed.
YOLOv5, developed by Ultralytics, focused on creating a lightweight network architecture and implementing multi-scale training strategies, thereby enhancing detection speed and robustness. More recently, YOLOv7 introduced the expandable efficient layer aggregation network (E-ELAN), novel transition modules, and reparameterization strategies to strengthen feature extraction and semantic information representation, optimizing overall detection effectiveness.
YOLOv8 builds upon the foundations laid by YOLOv5 and incorporates several key advancements. It features the path aggregation feature pyramid network (PA-FPN) architecture, anchor-free detection mechanisms, and decoupled heads, which collectively lead to significant improvements in loss computation and network design.
The YOLO family has consistently aimed to improve detection accuracy and inference speed. YOLO11 enhances feature expression capabilities and increases robustness against complex backgrounds, all while reducing the number of parameters through core innovative architectures such as C3K2 and C2PSA. In object detection algorithms, the Backbone, Neck, and Head form the core modular architecture of the model. This division of labor improves the efficiency of feature extraction, fusion, and prediction and is widely used in object detection algorithms such as YOLO and Faster R-CNN. The Backbone extracts multi-level visual features from the input image; the Neck fuses the multi-scale features produced by the Backbone, enhancing feature expression and bridging scale differences; finally, the Head performs task-specific predictions (classification and localization) based on the features output by the Neck, as illustrated in Figure 1 [16].

2.2. Limitations of YOLO11s in Small Traffic Sign Recognition

Real-time performance is a critical requirement for the practical application of traffic sign recognition, particularly within the context of autonomous and assisted driving systems, and it largely determines whether a model can be deployed advantageously on edge devices such as on-board terminals. To meet this demand, the lightweight YOLO11s model was selected from the YOLO11 series due to its efficient design and architectural enhancements, which significantly reduce both the parameter count and computational cost while still achieving high detection accuracy [17]. However, despite these advantages, the YOLO11s algorithm faces several challenges in specific recognition scenarios, especially concerning small traffic sign recognition.
One of the primary limitations of the YOLO11s model is its reliance on standard convolution operations, where deep convolution stacks can lead to information decay. YOLO11s extracts features through multi-layer downsampling (e.g., 32-fold downsampling), which can result in a substantial loss of detailed information for small traffic signs. This limitation is particularly concerning in real-world scenarios, where small signs, such as yield or speed limit signs, are often critical for safe driving.
Furthermore, the presence of complex background environments introduces significant interference in the detection process. Distracting elements and cluttered surroundings pose considerable challenges for YOLO11s in distinguishing traffic signs from the background. This issue is exacerbated when small-scale traffic signs are involved, as the model’s detection capabilities are often insufficient, leading to an increased incidence of false positives. In such cases, the algorithm may misidentify unmarked objects as traffic signs or fail to recognize actual signs entirely, thereby compromising the reliability of the recognition system. Although YOLO11s employs a feature pyramid network (FPN) and path aggregation network (PAN) structure to fuse multi-scale features, it remains inefficient in transferring semantic information to minimal targets. For instance, in traffic sign detection, the contrast between small signs and their backgrounds is often weak. The shallow features of the FPN lack semantic richness, while the deep features suffer from a loss of spatial detail, resulting in an elevated missed detection rate.
These limitations in the YOLO11s algorithm with respect to handling small traffic signs and complex backgrounds can severely hinder its performance in real-world traffic sign recognition applications. Accurate and reliable detection is paramount for ensuring driving safety and supporting the functionalities of autonomous driving systems.

3. Proposed Approach: Ghost-YOLO-GBH

To address the limitations of the YOLO11s algorithm in recognizing small traffic signs and navigating complex background environments, while enhancing the model’s lightweight characteristics for improved deployment on edge devices such as onboard terminals, an improved model known as Ghost-YOLO-GBH has been proposed. Figure 2 illustrates the architecture of this enhanced detection network. First, the Ghost-YOLO-GBH model replaces the original backbone with a GhostNet architecture. The Ghost module generates redundant feature maps at low cost, effectively reducing the computational load and thereby significantly improving inference speed on mobile terminals while maintaining accuracy.
Secondly, we design a HybridFocus module to replace the normal Spatial Pyramid Pooling Fast (SPPF) module. This model utilizes large kernel convolution and a spatial attention mechanism to extract multi-scale features more effectively, thereby reducing the interference of irrelevant background information on traffic sign detection and enhancing the model’s accuracy in complex environments. Finally, a new bidirectional dynamic multi-scale fusion network BiDMS-FPN is designed to replace the original feature pyramid network, and the accuracy of small target detection is improved by cross-stage feature reuse.
Through these enhancements, the Ghost-YOLO-GBH model achieves significant performance improvements in small-scale traffic sign recognition tasks and can detect traffic signs more accurately at different scales and in complex backgrounds.

3.1. GhostNet Backbone Network

Redundant feature maps can enhance the representation of the key features of traffic signs. For example, for speed limit signs, multiple similar circular feature maps strengthen the model’s ability to recognize circular shapes and improve its robustness under different illumination, viewing angles, and other conditions. At the same time, multi-scale redundant feature maps can capture the details of signs at different sizes, such as distant small signs and nearby large signs. However, feature map redundancy also increases the computational cost of the model. In traffic sign recognition scenarios, computing resources are often limited, for example in on-board equipment and embedded roadside monitoring systems. The original backbone network of YOLO11s has a large computational load and many parameters, which occupies excessive resources, slows down the device, and increases energy consumption. GhostNet reduces the number of parameters and improves execution speed, while maintaining good detection performance, through its innovative generation of redundant features [18].
The key innovation of GhostNet (a lightweight convolutional network through feature reuse) lies in using Cheap Operations to produce feature redundancy. Ghost Module appears as the first convolution module in the GhostNet network, which provides an effective alternative to vanilla convolution. As shown in Figure 3, the Ghost Module first uses ordinary convolution for preliminary feature extraction and then uses a linear transformation operation instead of conventional convolution to considerably improve the computational efficiency. Finally, the feature map is generated by a tensor splicing operation.
The Ghost bottleneck consists of two stacked Ghost modules, as illustrated in Figure 4. The first Ghost module functions as an extension layer, generating the original feature map through a minimal amount of standard convolution. The second Ghost module applies a cost-effective linear transformation to the original feature map using depthwise separable convolution, resulting in the creation of redundant Ghost feature maps. These are then concatenated with the original feature maps to produce the final output.
When the stride is set to 1, input and output are combined using an Identity Shortcut, allowing for direct feature fusion. Conversely, when the stride is set to 2, a depthwise convolution (DWConv) with a stride of 2 is inserted between the two Ghost modules to facilitate downsampling. Following the second Ghost module, the Rectified Linear Unit (ReLU) activation function is disabled to prevent the loss of feature information. Other layers utilize batch normalization (BN) followed by ReLU to ensure robust nonlinear expression capabilities.
This design not only decreases the computing load and the number of parameters but also preserves rich feature information, thereby improving the efficiency and performance of the model.
Consequently, replacing the original backbone network in YOLO11s with the Ghost network constructed from Ghost bottlenecks can significantly decrease both the computational load and the parameter count, leading to improved model efficiency. Its lightweight nature facilitates the deployment of traffic sign recognition models in scenarios of restricted resources, like on-board systems and embedded devices, allowing for efficient feature extraction and retention of deep semantic information while enhancing recognition accuracy and real-time performance.
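To make the Ghost module idea above concrete, the following is a minimal PyTorch sketch rather than the authors' implementation: an ordinary pointwise convolution produces a small set of intrinsic feature maps, a cheap depthwise convolution generates the "ghost" maps from them, and the two are concatenated. The ratio parameter and depthwise kernel size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Minimal Ghost module sketch: a few intrinsic maps from ordinary convolution,
    the remaining 'ghost' maps generated by a cheap depthwise (linear) transform."""
    def __init__(self, in_ch, out_ch, ratio=2, dw_kernel=3):
        super().__init__()
        init_ch = out_ch // ratio            # intrinsic channels from standard convolution
        cheap_ch = out_ch - init_ch          # ghost channels from cheap operations
        # with ratio=2 and even out_ch, cheap_ch == init_ch, so the depthwise conv below is valid
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, 1, bias=False),
            nn.BatchNorm2d(init_ch),
            nn.ReLU(inplace=True),
        )
        self.cheap = nn.Sequential(
            nn.Conv2d(init_ch, cheap_ch, dw_kernel, padding=dw_kernel // 2,
                      groups=init_ch, bias=False),   # depthwise = cheap linear transform
            nn.BatchNorm2d(cheap_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)  # concatenate intrinsic + ghost maps

# Example: expand 64 channels to 128 at roughly half the cost of a standard convolution
x = torch.randn(1, 64, 80, 80)
print(GhostModule(64, 128)(x).shape)   # torch.Size([1, 128, 80, 80])
```

Stacking two such modules, with an optional stride-2 depthwise convolution between them, yields the Ghost bottleneck structure described above.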

3.2. HybridFocus Module for Robust Background Suppression

Traffic sign recognition often faces challenges posed by complex backgrounds, small sign sizes, and diverse shapes. To solve these issues, this paper introduces a novel spatial pyramid pooling structure called HybridFocus. This module enhances the extraction of key features from traffic signs by incorporating Large Separable Kernel Attention [19] into the existing SPPF framework. This approach effectively mitigates the interference from irrelevant background information, improving the efficiency of traffic sign extraction by focusing on the salient features of the signs, thus enhancing recognition accuracy in complex environments.
The original SPPF module of YOLO11s constructs a spatial pyramid by cascading 5 × 5 max pooling operations to achieve multi-scale feature representations, as illustrated in Figure 5. Although this method captures multi-scale context, its cascading strategy with a fixed kernel size limits the expansion of the receptive field. Consequently, the contour details of traffic signs can be diluted during the identification process, and geometric variants of similar semantics (such as circular prohibition signs and triangular warning signs) may experience a high false detection rate due to the lack of long-range spatial association. The introduction of the attention mechanism enables the model to adaptively assign different weights to features at various scales, enhancing its multi-scale feature extraction capability and improving the accuracy of traffic sign recognition, thereby better addressing the geometric sensitivity required for effective traffic sign recognition.
Previous attention mechanisms exhibit several limitations. For example, self-attention demonstrates strong long-range dependencies and adaptability, but it often struggles to effectively model local structures, making it challenging to capture local patterns and structural information in images or text. Traditional large kernel attention mechanisms aim to address the shortcomings of self-attention through considering the two-dimensional structure of images; however, they frequently incur significant computational costs when employing large convolution kernels.
To overcome these challenges, this study introduces the large separable kernel attention mechanism, which builds upon existing designs while significantly reducing computational costs and achieving high performance with relatively low overhead. The large separable kernel attention mechanism separates the two-dimensional convolutional kernel of a deep convolutional layer into one-dimensional convolutional kernels cascaded horizontally and vertically. This decomposition enables the attention module to leverage deep convolution layers with large convolution kernels to generate a preliminary attention map, allowing the model to concentrate on critical regions of the image. This component, in conjunction with SPPF, forms the new HybridFocus module, as depicted in Figure 6.
Once the preliminary attention map is obtained, the large separable kernel attention mechanism employs dilated convolutions with different dilation rates to further capture features. These convolution layers cover a wider receptive field without incurring additional computational costs, thereby facilitating the capture of broader contextual information. By processing image features separately in the horizontal and vertical directions, large separable kernel attention enhances the model’s comprehension of spatial relationships within the image. Following a series of convolution operations, the large separable kernel attention mechanism generates the final attention map by fusing the features obtained from the last convolution layer. The mathematical expressions governing the output of the large separable kernel attention mechanism are provided in Equations (1)–(4).
$\bar{Z}^{C} = \sum_{H,W} W^{C}_{(2d-1)\times 1} * \left( \sum_{H,W} W^{C}_{1\times(2d-1)} * F^{C} \right)$  (1)
$Z^{C} = \sum_{H,W} W^{C}_{(k/d)\times 1} * \left( \sum_{H,W} W^{C}_{1\times(k/d)} * \bar{Z}^{C} \right)$  (2)
$A^{C} = W_{1\times 1} * Z^{C}$  (3)
$\bar{F}^{C} = A^{C} \otimes F^{C}$  (4)
The processing of large separable kernel attention is carried out as follows. First, a one-dimensional convolution $W^{C}_{1\times(2d-1)}$ is applied to the input feature $F^{C}$, followed by another one-dimensional convolution $W^{C}_{(2d-1)\times 1}$ applied to the result. Here, $W^{C}_{\text{shape}}$ denotes the dynamic convolution kernel weights, with subscripts indicating kernel size, and $d$ is the kernel size scaling factor, which controls the receptive field. These two convolutions are executed independently on each channel, allowing the large separable kernel attention module to capture features in each spatial direction separately and thereby improving computational efficiency. In these equations, $\bar{Z}^{C}$ and $Z^{C}$ are intermediate feature maps; the objective is to decompose the large kernel into two directional convolutions to reduce computational complexity. Equations (1) and (2) correspond to the two stages of large-kernel feature extraction. In generating $\bar{Z}^{C}$, the vertical convolution kernel is applied element-wise to the horizontally filtered feature map so that global feature information propagates into the vertical direction; the subsequent dilated directional convolutions then yield the feature map $Z^{C}$, as represented in Equation (2). After $Z^{C}$ is obtained, a $1\times 1$ convolution $W_{1\times 1}$ is applied to generate the attention map $A^{C}$. This step integrates features across different channels and assigns a weight to each channel, adaptively weighting the input features so that important features are emphasized while irrelevant ones are suppressed, as illustrated in Equation (3). Finally, the element-wise product of the attention map $A^{C}$ and the input feature $F^{C}$ produces the final output feature $\bar{F}^{C}$, as represented in Equation (4).
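For illustration, Equations (1)-(4) can be sketched in PyTorch as follows. This is a simplified stand-in rather than the exact HybridFocus component; the kernel length k and dilation d are assumed values.

```python
import torch.nn as nn

class LargeSeparableKernelAttention(nn.Module):
    """Sketch of large separable kernel attention (Eqs. (1)-(4)): a large k x k
    depthwise kernel is decomposed into cascaded 1-D depthwise convolutions plus
    dilated 1-D depthwise convolutions, followed by a 1x1 convolution that
    produces the attention map used to re-weight the input."""
    def __init__(self, channels, k=23, d=3):
        super().__init__()
        p0 = (2 * d - 1) // 2                    # padding for the local 1-D kernels
        kd = k // d                              # dilated kernel length (~ k/d, odd here)
        pd = (kd // 2) * d                       # padding for the dilated 1-D kernels
        self.conv_h = nn.Conv2d(channels, channels, (1, 2 * d - 1),
                                padding=(0, p0), groups=channels)
        self.conv_v = nn.Conv2d(channels, channels, (2 * d - 1, 1),
                                padding=(p0, 0), groups=channels)
        self.dil_h = nn.Conv2d(channels, channels, (1, kd),
                               padding=(0, pd), dilation=d, groups=channels)
        self.dil_v = nn.Conv2d(channels, channels, (kd, 1),
                               padding=(pd, 0), dilation=d, groups=channels)
        self.proj = nn.Conv2d(channels, channels, 1)   # Eq. (3): channel mixing

    def forward(self, f):
        z = self.conv_v(self.conv_h(f))          # Eq. (1): local directional convolutions
        z = self.dil_v(self.dil_h(z))            # Eq. (2): dilated directional convolutions
        attn = self.proj(z)                      # Eq. (3): attention map
        return attn * f                          # Eq. (4): element-wise re-weighting
```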

3.3. New Feature Pyramid Network BiDMS-FPN

In actual traffic sign recognition scenarios, the model often falls short in detail feature extraction and expression, leading to difficulties in accurately identifying small targets. It is essential to detect both the fine texture of nearby small signs and the contours of distant fuzzy signs simultaneously. Traditional pyramid networks struggle to accommodate such large-scale differences in target detection. Research on Trident networks [20] indicates that networks with larger receptive fields are better suited for detecting larger objects, while smaller-scale targets benefit from smaller receptive fields. Therefore, in the FPN stage, different multi-scale convolution kernels are selected for different scale feature layers to adaptively obtain multi-scale receptive field information.
To address this challenge, this paper introduces a novel Bidirectional Dynamic Multi-Scale Feature Pyramid Network (BiDMS-FPN). Inspired by the concept of BiFPN, the network facilitates efficient interaction and adaptive fusion of multi-level features by deeply integrating the lightweight GhostNet backbone, a novel dynamic multi-scale convolution module (DMS-CSPNet), and an enhanced upsampling unit (EUCB). Its core innovations include a gradient-shunting mechanism based on cross-stage partial networks, dynamic feature tuning through progressively enlarged convolutional kernel configurations (kernel groups of 1-3-5, 3-5-7, and 5-7-9 for P3, P4, and P5, respectively), and bidirectional feature propagation paths that strengthen cross-scale fusion of high-level semantics and low-level details. The method maintains real-time inference speed while providing an accuracy-speed balance for multi-scale target detection in complex scenes, making it applicable to real-world settings. The basic architecture of BiDMS-FPN is illustrated in Figure 7.
DMS-CSPNet is designed as the core feature extraction module of BiDMS-FPN, aimed at achieving accurate detection of extreme-scale targets through dynamic multi-scale convolution and cross-stage feature reuse. The workflow of this module is illustrated in Figure 8. The mathematical formulation for feature extraction within DMS-CSPNet is presented in Equations (5)–(7).
The input feature map $X \in \mathbb{R}^{B \times C \times H \times W}$ is divided into a main branch (which preserves the original features) and a sub-branch (which performs multi-scale processing) according to the split ratio $e$:
$X_{main},\ X_{sub} = \mathrm{Split}(X), \quad X_{sub} \in \mathbb{R}^{B \times eC \times H \times W}$  (5)
The sub-branch undergoes processing through multiple serial DMS modules, each containing parallel multi-core convolutions:
$F_{sub} = \mathrm{Concat}\big(\mathrm{DWConv}_{1\times1}(x),\ \mathrm{DWConv}_{3\times3}(x),\ \mathrm{DWConv}_{5\times5}(x)\big)$  (6)
After the main branch and sub-branch features are concatenated, the channel dimensions are adjusted using a 1 × 1 convolution:
$Y = C_{1\times1}\big(\mathrm{Concat}(X_{main},\ F_{sub})\big)$  (7)
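For illustration, the split-transform-fuse pattern of Equations (5)-(7) can be sketched in PyTorch as follows; the split ratio e and the kernel group (1-3-5, as used for P3) are taken from the description above, and the module is a simplified assumption rather than the authors' exact DMS-CSPNet block.

```python
import torch
import torch.nn as nn

class DMSBlock(nn.Module):
    """Sketch of the dynamic multi-scale idea in Eqs. (5)-(7): split channels into
    a main branch and a sub-branch, pass the sub-branch through parallel depthwise
    convolutions with a kernel group, then fuse everything with a 1x1 convolution."""
    def __init__(self, channels, e=0.5, kernels=(1, 3, 5)):
        super().__init__()
        self.sub_ch = int(channels * e)          # Eq. (5): sub-branch width eC
        self.main_ch = channels - self.sub_ch
        self.branches = nn.ModuleList([
            nn.Conv2d(self.sub_ch, self.sub_ch, k, padding=k // 2,
                      groups=self.sub_ch, bias=False)   # one depthwise conv per kernel size
            for k in kernels
        ])
        fused_ch = self.main_ch + self.sub_ch * len(kernels)
        self.fuse = nn.Conv2d(fused_ch, channels, 1, bias=False)   # Eq. (7)

    def forward(self, x):
        x_main, x_sub = torch.split(x, [self.main_ch, self.sub_ch], dim=1)   # Eq. (5)
        f_sub = torch.cat([branch(x_sub) for branch in self.branches], dim=1)  # Eq. (6)
        return self.fuse(torch.cat([x_main, f_sub], dim=1))                    # Eq. (7)
```

For the deeper pyramid levels (P4 and P5), the kernel group argument would simply be changed to (3, 5, 7) or (5, 7, 9), following the configuration stated above.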
The EUCB [21] serves as a core module for high-resolution feature reconstruction within BiDMS-FPN. It recovers detailed features while reducing the number of parameters through a lightweight design that incorporates upsampling, deep convolution, and channel shuffling, effectively addressing the issue of missing detection for small signs. The operational flow is depicted in Figure 9, with the step-by-step working principle and mathematical formulas provided in Equations (8)–(11).
First, the spatial resolution of the input feature map is doubled through an upsampling operation. The input feature map $X \in \mathbb{R}^{B \times C \times H \times W}$ is upsampled using bilinear interpolation:
$X_{up} = \mathrm{Upsample}_{2\times2}(X) \in \mathbb{R}^{B \times C \times 2H \times 2W}$  (8)
Next, spatial features are extracted using depthwise separable convolution to minimize computational load. Each channel is convolved independently using a $3 \times 3$ kernel:
$F_{dw}[c,:,:] = X_{up}[c,:,:] * W_{c}, \quad c = 1, \dots, C$  (9)
where $W_{c} \in \mathbb{R}^{3 \times 3}$ is the convolution kernel for the $c$-th channel, and the output dimension remains $B \times C \times 2H \times 2W$.
Through channel shuffling, the independence between channels is disrupted, enhancing cross-group information interaction. The channels are divided into $g$ groups and rearranged after transposition:
$F_{shuffle}[i,:,:] = F_{dw}\big[\lfloor i/m \rfloor + (i \bmod m)\, g,\ :,\ :\big]$  (10)
where $m = C/g$; when $g = C$, this operation reduces to a channel-dimension transpose.
Finally, point-wise convolution is utilized to dynamically adjust channel weights, completing the feature fusion process. The integration of channel information occurs via a $1 \times 1$ convolution:
$Y = F_{shuffle} * W_{1\times1}, \quad W_{1\times1} \in \mathbb{R}^{C \times C \times 1 \times 1}$  (11)
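The following PyTorch sketch follows Equations (8)-(11) step by step (bilinear upsampling, depthwise convolution, channel shuffle, pointwise fusion). It is an illustrative reconstruction, not the authors' EUCB implementation; the group count g is an assumed choice.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    """Eq. (10): rearrange channels so information crosses depthwise groups."""
    b, c, h, w = x.shape                      # assumes c is divisible by groups
    x = x.view(b, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

class EUCB(nn.Module):
    """Sketch of the enhanced upsampling unit in Eqs. (8)-(11)."""
    def __init__(self, channels, g=4):
        super().__init__()
        self.g = g
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)  # Eq. (8)
        self.dw = nn.Conv2d(channels, channels, 3, padding=1,
                            groups=channels, bias=False)                             # Eq. (9)
        self.pw = nn.Conv2d(channels, channels, 1, bias=False)                       # Eq. (11)

    def forward(self, x):
        x = self.dw(self.up(x))               # upsample, then depthwise spatial filtering
        x = channel_shuffle(x, self.g)        # Eq. (10): cross-group interaction
        return self.pw(x)                     # pointwise fusion of channel information

# Example: a 40x40 P5 map is reconstructed to 80x80 with the channel count preserved
print(EUCB(256)(torch.randn(1, 256, 40, 40)).shape)   # torch.Size([1, 256, 80, 80])
```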
The BiFPN serves as the feature fusion framework of BiDMS-FPN, employing weighted feature fusion and a deep separable convolution mechanism to enhance the model’s robustness in identifying traffic signs across various scales and complex backgrounds. Given the specific requirements of traffic sign recognition, this study replaces the primary path aggregation network (PANet) with BiFPN, optimizing the feature fusion network of YOLO11s.
The specific structure of BiFPN is illustrated in Figure 10 [22]. In this figure, the blue section represents the top-down pathway, primarily transmitting semantic information from high-level features, while the red section depicts the bottom-up pathway, predominantly conveying spatial information from low-level features. The purple section highlights the newly introduced connections between the input and output nodes of the same layer, as previously discussed. The mathematical formulations governing feature fusion within BiFPN are presented in Equations (12) and (13).
$P_{6}^{td} = \mathrm{Conv}\left(\dfrac{w_{1} \cdot P_{6}^{in} + w_{2} \cdot \mathrm{Resize}(P_{7}^{in})}{w_{1} + w_{2} + \epsilon}\right)$  (12)
$P_{6}^{out} = \mathrm{Conv}\left(\dfrac{w_{1}' \cdot P_{6}^{in} + w_{2}' \cdot P_{6}^{td} + w_{3}' \cdot \mathrm{Resize}(P_{5}^{out})}{w_{1}' + w_{2}' + w_{3}' + \epsilon}\right)$  (13)
where $P_{i}^{in}$ represents the input feature map of the $i$-th layer and $P_{i}^{out}$ the corresponding output feature map; $w_{1}$ and $w_{2}$ are the weights associated with the feature maps from the 6th and 7th layers, respectively; $\mathrm{Resize}(P_{7}^{in})$ denotes adjusting the resolution of the 7th layer's feature map to match that of the 6th layer; $\mathrm{Conv}$ refers to the convolution operation applied to the normalized feature maps; and $\epsilon$ is a small constant introduced to prevent division by zero. The weights $w_{1}'$, $w_{2}'$, and $w_{3}'$ correspond to the input feature map of the 6th layer, the intermediate top-down feature map of the 6th layer, and the output feature map of the 5th layer, respectively, and $\mathrm{Resize}(P_{5}^{out})$ adjusts the 5th layer's output feature map to the resolution of the 6th layer.
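A minimal sketch of the weighted fusion in Equations (12) and (13) is shown below: learnable weights are kept non-negative and normalized by their sum plus a small epsilon before the fused map is convolved. Resizing of the inputs to a common resolution is assumed to happen outside the module, and the module itself is an illustrative reconstruction rather than the paper's exact fusion layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fast normalized fusion of several same-resolution feature maps (Eqs. (12)-(13))."""
    def __init__(self, num_inputs, channels, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))   # one learnable weight per input map
        self.eps = eps
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, inputs):
        # inputs: list of feature maps already resized to the same spatial resolution
        w = F.relu(self.w)                               # keep weights non-negative
        w = w / (w.sum() + self.eps)                     # normalize; eps avoids division by zero
        fused = sum(wi * xi for wi, xi in zip(w, inputs))
        return self.conv(fused)

# Usage corresponding to Eq. (12):
# p6_td = WeightedFusion(2, C)([p6_in, F.interpolate(p7_in, scale_factor=2)])
```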

4. Experiments and Results

4.1. Dataset and Evaluation Metrics

To evaluate the performance of the proposed Ghost-YOLO-GBH model, experiments were conducted using the widely recognized Tsinghua-Tencent 100K (TT100K) traffic sign dataset [23]. This dataset comprises over 100,000 images captured in various real-world driving scenarios, featuring a diverse array of weather conditions, lighting situations, and variations in traffic signs. It provides comprehensive annotations for more than 10 different traffic sign categories. Additionally, many of the images include non-traffic signs, making it an ideal benchmark for assessing the model’s capabilities in recognizing small traffic signs and suppressing complex backgrounds. The dataset encompasses a wide range of lighting conditions within intricate backgrounds, including extreme lighting scenarios. Notably, low-light conditions and sharp shadows represent a significant portion of the dataset, manifesting in the following aspects: Low-light images exhibit color distortion, blurred contours, and increased noise; the dynamic range of images under sharp shadow conditions is unbalanced, resulting in lost details in shadowed areas. The model’s recognition abilities are evaluated in terms of its resilience to interference, feature integrity, and cross-domain generalization (from normal to extreme lighting conditions). Figure 11 illustrates examples of extreme lighting conditions.
However, the original dataset contains a wide variety of classes, with some traffic sign instances numbering fewer than five, which can negatively impact the model’s training effectiveness. To address this issue, the dataset was pre-filtered to focus on the first 42 types of small traffic signs. The filtered dataset includes 16,551 images for the training set and 4952 images for the validation set.
A basic analysis of the bounding box sizes for each category in the dataset is illustrated in Figure 12. The distribution of object widths and heights is depicted on the right side of the figure, where the x-axis represents object width and the y-axis object height, both normalized to the image dimensions. It is evident that most objects have relatively small dimensions, concentrated below a normalized width and height of 0.05. According to the definition of small objects, if the ratio of an object’s bounding box width and height to the image width and height is below a certain threshold (e.g., 0.1), the object can be classified as small. The analysis indicates that a significant majority of objects in the cleaned dataset are indeed small.
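As a simple illustration of this smallness criterion, the following hypothetical helper flags a target as small when both normalized box dimensions fall below the threshold:

```python
def is_small_object(box_w, box_h, img_w, img_h, thresh=0.1):
    """Classify a target as 'small' when both its width and height are below
    `thresh` of the corresponding image dimensions (criterion described above)."""
    return (box_w / img_w) < thresh and (box_h / img_h) < thresh

# Example: a 40 x 35 pixel sign in a 2048 x 2048 pixel image counts as small
print(is_small_object(40, 35, 2048, 2048))   # True
```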
To objectively evaluate the performance of the enhanced YOLO11 algorithm in traffic sign detection, we selected a set of metrics across three dimensions: detection accuracy, real-time performance, and computational efficiency. These metrics include mean average precision (mAP), frames per second (FPS), parameter count, and giga floating-point operations (GFLOPs).
This study establishes a multi-dimensional quantitative analysis framework for accuracy assessment, incorporating two fundamental metrics: precision (P) and recall (R), in conjunction with mAP to form a composite evaluation system. The mAP is defined as the average of precision values calculated at various recall levels. Specifically, mAP is derived by computing precision at multiple recall thresholds and averaging these values, thereby providing a comprehensive assessment of the model’s performance across different thresholds. This framework enables a systematic comparison of the algorithm’s overall effectiveness in object localization and classification tasks.
The confusion matrix, presented in Table 1, offers a clear arrangement for precision evaluation. Precision, mAP, and recall are defined in Equations (14)–(16):
$\mathrm{Precision} = \dfrac{TP}{TP + FP}$  (14)
$\mathrm{Recall} = \dfrac{TP}{TP + FN}$  (15)
$\mathrm{mAP} = \dfrac{1}{N} \sum_{i=1}^{N} \mathrm{Precision}(R_{i})$  (16)
where TP denotes targets predicted as positive that are genuinely positive; FP indicates targets predicted as positive that are negative (false alarms); FN refers to targets that are truly positive but were not detected (missed detections).
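For reference, Equations (14)-(16) translate directly into a few lines of Python; the helper names below are illustrative only:

```python
def precision_recall(tp, fp, fn):
    """Eqs. (14)-(15): precision and recall from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def mean_average_precision(per_point_precision):
    """Eq. (16): mean of the precision values Precision(R_i) sampled at N recall levels."""
    return sum(per_point_precision) / len(per_point_precision)

# Example: 90 true positives, 10 false alarms, 20 missed detections
print(precision_recall(90, 10, 20))   # (0.9, 0.818...)
```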
Furthermore, to address the practical requirements of model deployment, this study assesses the model’s usability by examining three engineering metrics: trainable parameter counts, FPS, and GFLOPs. These metrics collectively provide an evaluation of the algorithm’s lightweight characteristics and operational efficiency in scenarios with limited computational resources.

4.2. Implementation Details

This experiment was conducted on a Windows 11 operating system, utilizing an NVIDIA RTX 3090 GPU (NVIDIA, Santa Clara, CA, USA) with 24 GB of memory. The environment was configured with Python 3.12.3, PyTorch 2.6.0, and CUDA 12.4. The Ghost-YOLO-GBH model was implemented using the PyTorch deep learning framework. The GhostNet backbone was pretrained on the ImageNet dataset, after which the entire model was fine-tuned on the TT100K dataset.
Training was carried out using the Stochastic Gradient Descent (SGD) optimizer, with a learning rate of 0.01 and a batch size of 32. The model was trained for a total of 200 epochs, and the parameter variations during training are depicted in Figure 13. By the end of the 200 epochs, the loss value had converged to a low and stable level, indicating that the model’s parameters were approaching an optimal solution; further training was unlikely to yield significant reductions in loss.
To enhance the model’s generalization capabilities, various data augmentation techniques were employed, including random scaling, rotation, and color jittering, applied to the input images. Additionally, to tackle the issue of class imbalance inherent in the traffic sign dataset, where certain sign categories are underrepresented, the weighted focal loss function was utilized. This approach helps to ensure that the model pays more attention to the minority classes, thereby improving detection performance across all categories.
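A minimal sketch of a class-weighted focal loss consistent with this description is shown below; the per-class weights and the focusing parameter gamma are illustrative assumptions, not the settings reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFocalLoss(nn.Module):
    """Class-weighted focal loss for a long-tailed sign distribution: rare classes
    receive larger alpha weights, and well-classified samples are down-weighted
    by the (1 - p_t)^gamma focusing term."""
    def __init__(self, class_weights, gamma=2.0):
        super().__init__()
        self.register_buffer("alpha", torch.as_tensor(class_weights, dtype=torch.float32))
        self.gamma = gamma

    def forward(self, logits, targets):
        # logits: (N, num_classes); targets: (N,) integer class indices
        log_p = F.log_softmax(logits, dim=-1)
        ce = F.nll_loss(log_p, targets, reduction="none")              # per-sample cross-entropy
        p_t = log_p.gather(1, targets.unsqueeze(1)).squeeze(1).exp()   # probability of true class
        alpha_t = self.alpha[targets]                                  # per-class weight
        return (alpha_t * (1.0 - p_t) ** self.gamma * ce).mean()
```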
As illustrated in Figure 14, the application of weighted focal loss significantly enhances the generalization performance of the traffic sign detection model. In the recall-confidence curve prior to the application of this technique (Figure 14a), the model achieves a recall of 0.89 at a confidence level of 0.000. However, as the confidence level increases to 0.6, the recall rate drops to approximately 0.4, indicating that the model tends to over-suppress low-confidence predictions, particularly for rare instances.
In contrast, after the application of weighted focal loss (Figure 14b), the curve maintains a high recall rate of 0.75 at a confidence level of 0.6, while also starting from an initial recall rate of 0.93. This demonstrates a significant improvement in the model’s tolerance to challenging samples. Furthermore, the distribution band of the gray per-class curves in the processed plot converges significantly below the blue main line. In the low-recall interval of 0.01–0.05 on the y-axis, the performance of extremely rare categories whose recall rates were previously below 0.01 improves by more than five-fold. This indicates that weighted focal loss compels the model to optimize all categories equally through a dynamic weighting mechanism, effectively enhancing the detection robustness of rare categories in long-tail data.

4.3. Comparative Experiments

4.3.1. Comparative Experiments of Backbone Network

To explore the influence of lightweight backbone networks on traffic sign detection performance, this study replaces the backbone network of YOLO11s with the lightweight GhostNet and compares it with MobileNetV3, EfficientNet-B0, and FasterNet. The purpose of the experiment is to analyze the differences in parameter count, FLOPs, and FPS among the backbone networks, as well as the influence of these mainstream backbones on traffic sign detection accuracy (mAP@0.5). The experimental results are shown in Table 2.
The experimental results show that the mAP of the GhostNet model is 75.12% at a threshold of 0.5, which is better than the other models; its feature reuse strategy effectively extracts key information, especially for small targets in traffic sign detection tasks. GhostNet also has the lowest parameter count (6.74 M) and GFLOPs (12.9 G), making it well suited to resource-constrained environments such as in-car terminals. Therefore, selecting GhostNet as the backbone of YOLO11s allows the highest detection accuracy to be achieved with minimal computational cost.

4.3.2. Comparative Experiment of SPPF and FPN Networks

In the experiments of this section, the performance of the HybridFocus module and BiDMS-FPN is systematically evaluated against traditional methods to verify their contributions. Each proposed module is compared with a representative contemporary approach as well as the corresponding original YOLO11s module. Specifically, the spatial pyramid comparison uses SimSPPF, while the neck network comparison uses BiFPN; both reference modules have been extensively studied and shown to be effective [22,24,25,26,27,28].
Based on the comparative experimental data presented in this section, the primary reason for selecting the HybridFocus module to replace the traditional SimSPPF module is its significant advancement in feature extraction efficiency. As shown in Table 3, while YOLO11s-SimSPPF (24.50 GFLOPs, 10.26 M parameters) improves on the base YOLO11s model, its gains are limited. In contrast, YOLO11s-HybridFocus achieves further optimization at a similar parameter scale: computational complexity is reduced by 8.98%, detection accuracy is improved by 4.91%, and inference speed is enhanced by 10.19%. This performance is attributed to the co-design of the LSKA large-kernel attention mechanism and spatial pyramid pooling: the former enhances the semantic fusion of multi-scale features through adaptive weight allocation, while the latter improves sensitivity to small targets via dynamic kernel size configuration. Together, these innovations address the feature representation rigidity caused by the fixed kernel size of SimSPPF, making HybridFocus particularly suitable for complex scenes with significant scale variations in traffic sign detection.
In the design of the neck network, the rationale for replacing the traditional BiFPN with BiDMS-FPN lies in its approach to resolving the trade-off between efficiency and accuracy in feature fusion. Experimental results indicate that YOLO11s-BiDMS-FPN reduces the number of parameters by 36.91% and the computational demand by 12.50%, while simultaneously achieving a 7.68 percentage point increase in mAP@50 and a 4.24% improvement in frames per second (FPS). This "reduction and enhancement" property is attributed to the bidirectional dynamic multi-scale fusion mechanism: the proposed DMS-CSPNet structure compresses redundant computation through a gradient-shunting strategy and dynamically allocates convolution kernel groups according to feature hierarchy characteristics. This approach reduces computational complexity and improves efficiency while significantly strengthening the complementarity of cross-scale features and minimizing memory consumption. Compared to the static fusion mode of BiFPN, this design offers a more robust solution for multi-target detection in real-time traffic scenarios.

4.3.3. A Comparative Experiment of Algorithms

To validate the effectiveness of the proposed model, this study compared its performance against several state-of-the-art traffic sign detection models, including Faster R-CNN, YOLOv5s, YOLOv7, YOLOv8, YOLOv9, YOLOv12, RT-DETR, and the original YOLO11s, all under consistent experimental conditions. The results of these comparative experiments are summarized in Table 4. The proposed Ghost-YOLO-GBH model demonstrated superior performance across composite metrics, particularly in the recognition of small traffic signs and in complex background scenarios.
From the experimental results presented in Table 4, the improved model Ghost-YOLO-GBH demonstrates superiority over other mainstream algorithms across three dimensions: mAP@50, parameter count (M), and GFLOPS. This performance advantage can be attributed to the synergy of the three innovative modules. Notably, during the experiment, YOLOv12 was unable to learn the classification task associated with this scenario during the training phase.
The lightweight nature of the model is particularly pronounced, with the number of parameters constituting only 69.42% of that in YOLOv8s, and the computational demand being 20.22% lower than that of YOLOv9s. This “high precision-low consumption” characteristic endows Ghost-YOLO-GBH with significant advantages in resource-constrained environments, thereby validating the effectiveness of the Ghost module’s feature redundancy compression strategy.
In terms of real-time performance, Ghost-YOLO-GBH operates at 45 FPS, which is 40.0% faster than YOLOv8s (32.15 FPS) and significantly outpaces the two-stage Faster R-CNN (24.00 FPS). This frame rate is well above the commonly cited ADAS minimum of 30 FPS, and the single-frame inference time of 22 ms (1/45 s) meets on-board real-time response thresholds (e.g., ≤100 ms for emergency braking). This increase in speed enables the model to meet the millisecond-level response requirements of vehicle terminals, largely due to the efficient operations of GhostNet, which keep the computational cost at 21.30 GFLOPs, 79.40% lower than YOLOv7. It is worth noting that while RT-DETR achieves 98.20 FPS, its parameter count of 19.80 M and computational burden of 57.10 GFLOPs far exceed those of Ghost-YOLO-GBH, making it unsuitable for deployment on edge devices such as those used in vehicular applications.
Figure 15 presents a visual comparison of the detection results of Ghost-YOLO-GBH and the original YOLO11s model; the improved model obtains higher detection scores across various scenarios, showing that the improvements proposed in this paper are effective for traffic sign detection tasks. These advances can be attributed to the key components of the Ghost-YOLO-GBH model: a lightweight GhostNet backbone, a BiDMS-FPN module for improved feature fusion, and a HybridFocus module for efficient background suppression. These innovations enable the model to effectively extract and fuse multi-scale features while emphasizing the key features of traffic signs and reducing interference from irrelevant background information.

4.4. Ablation Study

To further understand the contributions of the individual components, an ablation study was conducted. Table 5 presents the results of the ablation experiments, where each component was removed or replaced, and the corresponding performance changes were observed.

4.4.1. Analysis of Individual Module Effects

The experimental results show that, compared with the benchmark model YOLO11s, the introduction of the GhostNet module (Model 2) improves the mAP@0.5 from 74.80% to 75.12%, an increase of 0.43%. Additionally, the inference speed increases from 40.30 to 46.60 FPS, an increase of 15.63%. The parameter count decreases from 9.46 M to 6.74 M, a decrease of 28.76%, and the GFLOPs are reduced from 21.70 to 12.90, a decrease of 40.56%. These results validate that GhostNet significantly minimizes redundant computation through its feature reuse strategy; however, its impact on accuracy is limited, with the primary advantage being the optimization of lightweight performance.
When the HybridFocus module (Model 3) is utilized alone, the mAP@0.5 significantly increases to 79.10%, an increase of 5.75%. The parameter count rises to 10.51 M, an increase of 11.10%. The GFLOPS increases to 22.30 G, an increase of 2.80%. The FPS improves to 45.40, an increase of 16.66%. This demonstrates that HybridFocus enhances the model’s ability to extract contextual information through multi-scale pooling and large kernel attention mechanisms, all while maintaining low computational overhead, making it particularly effective for preserving features of small targets.
The introduction of the BiDMS-FPN module alone (Model 4) yields an mAP@0.5 of 77.10%, an increase of approximately 3.10% compared to the baseline. The FPS is enhanced to 44.30, an increase of approximately 3.03%. It also reduces the parameter count to 7.76 M and the GFLOPs to 22.40, decreases of 18.0% and 3.2%, respectively, compared to the benchmark model. This indicates that BiDMS-FPN optimizes target localization through cross-scale feature interactions based on bidirectional dynamic multi-scale fusion and the DMS-CSPNet node design, while its lightweight structure restrains parameter growth, demonstrating the high efficiency of dynamic path selection.
To further examine the effectiveness of the improved modules for traffic sign feature extraction, Grad-CAM [29] heat maps are used to visualize and compare the results (Figure 16).
From the perspective of regional attention, as each module is added, the highlighted area of the heat map concentrates increasingly on the physical location of the traffic sign, the correlation with the target features strengthens, and the thermal distribution fits the edge structure and contour characteristics of the signs more closely.

4.4.2. Analysis of Combined Module Effects

The combination of GhostNet and HybridFocus (Model 5) achieves an mAP@0.5 of 79.00%, which is 0.10% lower than that of the HybridFocus module alone (Model 3). However, this configuration significantly reduces the number of parameters to 6.94 M and GFLOPS to 13.10 G, while the FPS decreases to 41.25. This indicates that the lightweight backbone of GhostNet effectively mitigates the expansion of parameter count associated with HybridFocus, though its feature compression may slightly compromise the integrity of multi-scale contextual information, highlighting the inherent trade-off between accuracy and efficiency.
When the HybridFocus and BiDMS-FPN modules are combined (Model 6), the mAP@0.5 increases to 78.10%, an improvement of approximately 4.41% compared to the baseline. However, the FPS drops to 33.80, a decrease of approximately 16.20%. Additionally, the parameter count decreases to 9.03 M, a reduction of approximately 4.50% from the baseline, and the GFLOPS increases to 24.70, an increase of approximately 13.80%. The multi-path dynamic fusion mechanism of BiDMS-FPN enhances detection accuracy but introduces additional computational complexity, leading to a drop in real-time performance.
When the GhostNet and BiDMS-FPN modules are combined (Model 7), the mAP@0.5 reaches 78.01%, an increase of approximately 4.30% compared to the baseline. The FPS improves to 44.64, an increase of approximately 10.80%. Moreover, the parameter count drops to just 4.98 M, the lowest in the study, representing a decrease of approximately 47.30% from the baseline, and the GFLOPS reduces to 12.70, a decrease of approximately 40.10%. This configuration enables efficient detection with low resource consumption by leveraging the lightweight features of GhostNet and the dynamic capabilities of BiDMS-FPN.

4.4.3. Comprehensive Advantages

When the GhostNet, HybridFocus, and BiDMS-FPN modules are integrated (Ghost-YOLO-GBH), the mAP@0.5 reaches 81.10%, an increase of approximately 8.41% compared to the baseline. The FPS increases to 45.00, an improvement of approximately 11.70%. Additionally, the parameter count decreases to 7.74 M, a reduction of approximately 18.20% from the baseline, and the GFLOPS drops to 21.30, a decrease of approximately 1.80%. These results indicate that the combination of GhostNet’s lightweight backbone, the multi-scale contextual enhancement from HybridFocus, and the dynamic cross-scale fusion mechanism of BiDMS-FPN achieves significant performance improvements without a substantial rise in computational demands, thereby strengthening the model’s overall detection capabilities.
BiDMS-FPN facilitates efficient interaction and parameter sharing among cross-level features through its BiFPN and DMS_CSP node design. Its dynamic path selection mechanism, exemplified by the BiFPN mode of the Fusion layer, optimizes feature fusion efficiency, while the DMS_CSP module further enhances feature expression capabilities through multi-scale convolution branches. Although the parameter count increases slightly when used alone, combining it with GhostNet significantly compresses the parameter count, demonstrating the compatibility between the modules.
The parameter count of Ghost-YOLO-GBH is only 85.71% of that of Model 6, while achieving an mAP@0.5 that is 4.74% higher, highlighting the efficiency of feature reuse and parameter sharing between modules. Despite an increase in GFLOPs compared to the standalone GhostNet configuration, the overall performance remains superior to all other combinations, particularly in the detection of small traffic sign targets.
The high execution speed (45 FPS) of the proposed model shows that it improves accuracy and speed simultaneously and can meet the requirements of real-time detection tasks. At the same time, its low parameter count and computational cost (7.74 M / 21.3 GFLOPs) place modest demands on terminal equipment, making it well suited to traffic sign recognition deployments. Through hardware acceleration and multi-threading optimization, the model can be further adapted to scenarios with stringent real-time requirements, such as vehicle terminals and traffic monitoring.

5. Conclusions and Future Work

The proposed Ghost-YOLO-GBH model represents a significant advancement in traffic sign detection, effectively addressing critical challenges related to small target recognition and computational efficiency in resource-constrained scenarios. By integrating the GhostNet backbone, HybridFocus multi-scale context enhancement, and BiDMS-FPN dynamic cross-scale fusion, the method is well suited to multi-scale feature fusion on mobile and edge platforms. The model achieves a notable balance between accuracy, speed, and lightweight performance in small target recognition tasks, particularly under extreme illumination conditions and complex backgrounds.
The GhostNet backbone, through its feature reuse strategy and channel-split operations, drastically reduces redundant parameters (6.74 M) and computational costs (12.90 GFLOPS), while maintaining robust feature extraction capabilities. The HybridFocus module, inspired by multi-scale pooling and large-kernel attention mechanisms, enhances contextual information aggregation, particularly for small targets, achieving a 4.30% mAP@0.5 improvement in standalone tests. The BiDMS-FPN module, leveraging BiFPN and DMS_CSP nodes, optimizes cross-scale feature interactions, improving detection robustness in cluttered backgrounds.
Experimental results on the TT100K dataset validate the model’s superiority: 81.10% mAP@0.5 (+6.30 percentage points over the baseline), 45.00 FPS (an 11.7% speed gain), and 7.74 M parameters (an 18.2% reduction), demonstrating its suitability for edge devices such as vehicle terminals. The synergy among modules ensures efficient feature reuse and dynamic path selection, mitigating computational overhead while preserving accuracy.
Moving forward, further research could explore the integration of additional techniques, such as knowledge distillation and meta-learning, to optimize the model’s performance and expand its adaptability to diverse traffic sign datasets and evolving real-world conditions. Given current resource constraints, future work will prioritize hardware-aware optimizations, including model quantization, hardware-specific compilation (e.g., TensorRT/OpenVINO deployment pipelines), and embedded system deployment tailored for automotive-grade processors like NVIDIA Jetson Orin and Qualcomm Snapdragon Ride platforms. These implementations will address critical latency-power trade-offs through operator fusion, INT8 quantization sensitivity analysis, and memory bandwidth optimization, specifically required for edge deployment scenarios. Continued advancements in this field will undoubtedly contribute to the development of more robust intelligent transportation systems, ultimately enhancing driving safety and the capabilities of autonomous vehicles.

Author Contributions

Conceptualization, F.L. and J.L.; methodology, J.T.; software, M.Z.; validation, B.X., M.Z., and C.H.; formal analysis, C.H.; investigation, J.T. and B.X.; resources, F.L.; data curation, M.Z.; writing—original draft preparation, J.T. and B.X.; writing—review and editing, J.L. and F.L.; visualization, C.H.; supervision, J.L.; project administration, F.L.; funding acquisition, F.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 52208436; Open Fund of Engineering Research Center of Catastrophic Prophylaxis and Treatment of Road and Traffic Safety of the Ministry of Education (Changsha University of Science and Technology), grant number kfj210402.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets presented in this article are not readily available because the data are part of an ongoing study. Requests to access the datasets should be directed to the first author.

Acknowledgments

During the preparation of this work, the authors used DeepSeek and Kimi to improve the language, readability, and standardization of the manuscript. After using these tools, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mogelmose, A.; Trivedi, M.M.; Moeslund, T.B. Vision-Based Traffic Sign Detection and Analysis for Intelligent Driver Assistance Systems: Perspectives and Survey. IEEE Trans. Intell. Transp. Syst. 2012, 13, 1484–1497. [Google Scholar] [CrossRef]
  2. Kheder, M.Q.; Mohammed, A.A. Improved traffic sign recognition system (ITSRS) for autonomous vehicle based on deep convolutional neural network. Multimed. Tools Appl. 2023, 83, 61821–61841. [Google Scholar] [CrossRef]
  3. An, F.; Wang, J.; Liu, R. Road Traffic Sign Recognition Algorithm Based on Cascade Attention-Modulation Fusion Mechanism. IEEE Trans. Intell. Transp. Syst. 2024, 25, 17841–17851. [Google Scholar] [CrossRef]
  4. Khalil, R.A.; Safelnasr, Z.; Yemane, N.; Kedir, M.; Shafiqurrahman, A.; Saeed, N. Advanced Learning Technologies for Intelligent Transportation Systems: Prospects and Challenges. IEEE Open J. Veh. Technol. 2024, 5, 397–427. [Google Scholar] [CrossRef]
  5. Pramudito, D.K. Enhancing Real-Time Object Detection in Autonomous Systems Using Deep Learning and Computer Vision Techniques. J. Acad. Sci. 2024, 1, 788–804. [Google Scholar]
  6. Vijayakumar, A.; Vairavasundaram, S. YOLO-based Object Detection Models: A Review and its Applications. Multimed. Tools Appl. 2024, 83, 83535–83574. [Google Scholar] [CrossRef]
  7. Mahadshetti, R.; Kim, J.; Um, T.-W. Sign-YOLO: Traffic Sign Detection Using Attention-Based YOLOv7. IEEE Access 2024, 12, 132689–132700. [Google Scholar] [CrossRef]
  8. Zhang, L.; Yang, K.; Han, Y.; Li, J.; Wei, W.; Tan, H.; Yu, P.; Zhang, K.; Yang, X. TSD-DETR: A lightweight real-time detection transformer of traffic sign detection for long-range perception of autonomous driving. Eng. Appl. Artif. Intell. 2024, 139, 109536. [Google Scholar] [CrossRef]
  9. Han, Y.; Wang, F.; Wang, W.; Li, X.; Zhang, J. YOLO-SG: Small traffic signs detection method in complex scene. J. Supercomput. 2023, 80, 2025–2046. [Google Scholar] [CrossRef]
  10. Kamal, U.; Tonmoy, T.I.; Das, S.; Hasan, K. Automatic Traffic Sign Detection and Recognition Using SegU-Net and a Modified Tversky Loss Function With L1-Constraint. IEEE Trans. Intell. Transp. Syst. 2019, 21, 1467–1479. [Google Scholar] [CrossRef]
  11. Li, J.; Wang, Z. Real-Time Traffic Sign Recognition Based on Efficient CNNs in the Wild. IEEE Trans. Intell. Transp. Syst. 2018, 20, 975–984. [Google Scholar] [CrossRef]
  12. Wang, L.; Zhou, K.; Chu, A.; Wang, G.; Wang, L. An Improved Light-Weight Traffic Sign Recognition Algorithm Based on YOLOv4-Tiny. IEEE Access 2021, 9, 124963–124971. [Google Scholar] [CrossRef]
  13. Haque, W.A.; Arefin, S.; Shihavuddin, A.; Hasan, M.A. DeepThin: A novel lightweight CNN architecture for traffic sign recognition without GPU requirements. Expert Syst. Appl. 2021, 168, 114481. [Google Scholar] [CrossRef]
  14. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar] [CrossRef]
  15. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018. [Google Scholar] [CrossRef]
  16. Ali, M.L.; Zhang, Z. The YOLO Framework: A Comprehensive Review of Evolution, Applications, and Benchmarks in Object Detection. Computers 2024, 13, 336. [Google Scholar] [CrossRef]
  17. Wang, Z.; Su, Y.; Kang, F.; Wang, L.; Lin, Y.; Wu, Q.; Li, H.; Cai, Z. PC-YOLO11s: A Lightweight and Effective Feature Extraction Method for Small Target Image Detection. Sensors 2025, 25, 348. [Google Scholar] [CrossRef] [PubMed]
  18. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar] [CrossRef]
  19. Lau, K.W.; Po, L.-M.; Rehman, Y.A.U. Large Separable Kernel Attention: Rethinking the Large Kernel Attention design in CNN. Expert Syst. Appl. 2023, 236, 121352. [Google Scholar] [CrossRef]
  20. Li, Y.; Chen, Y.; Wang, N.; Zhang, Z. Scale-aware trident networks for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar] [CrossRef]
  21. Rahman, M.M.; Munir, M.; Marculescu, R. EMCAD: Efficient multi-scale convolutional attention decoding for medical image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024. [Google Scholar] [CrossRef]
  22. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar] [CrossRef]
  23. Zhu, Z.; Liang, D.; Zhang, S.; Huang, X.; Li, B.; Hu, S. Traffic-sign detection and classification in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar] [CrossRef]
  24. Li, T.; Zhang, Y.; Li, Q.; Zhang, T. AB-DLM: An Improved Deep Learning Model Based on Attention Mechanism and BiFPN for Driver Distraction Behavior Detection. IEEE Access 2022, 10, 83138–83151. [Google Scholar] [CrossRef]
  25. Zhou, S.; Yang, D.; Zhang, Z.; Zhang, J.; Qu, F.; Punetha, P.; Li, W.; Li, N. Enhancing autonomous pavement crack detection: Optimizing YOLOv5s algorithm with advanced deep learning techniques. Measurement 2024, 240, 115603. [Google Scholar] [CrossRef]
  26. Sun, Y.; Zheng, J.; Wang, H.; Zhang, Y.; Guo, J.; Ning, H. LKStar-Yolov8n: An autonomous driving object detection algorithm based on large convolution kernel star structure of Yolov8n. Signal Image Video Process. 2025, 19, 1–10. [Google Scholar] [CrossRef]
  27. He, L.; Wei, H.; Wang, Q. A New Target Detection Method of Ferrography Wear Particle Images Based on ECAM-YOLOv5-BiFPN Network. Sensors 2023, 23, 6477. [Google Scholar] [CrossRef] [PubMed]
  28. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022. [Google Scholar] [CrossRef]
  29. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar] [CrossRef]
Figure 1. Network structure diagram of YOLO11.
Figure 2. Network structure diagram of Ghost-YOLO-GBH.
Figure 3. Structure diagram of the Ghost module: (a) convolutional layer and (b) Ghost module.
Figure 4. Structure diagram of Ghost bottlenecks.
Figure 5. Structure diagram of the SPPF module.
Figure 6. Structure diagram of the HybridFocus module.
Figure 7. BiDMS-FPN workflow diagram.
Figure 8. Block diagram of DMS-CSPNet.
Figure 9. EUCB workflow diagram.
Figure 10. Feature network design of BiFPN.
Figure 11. Sample TT100K dataset.
Figure 12. Image features of the TT100K dataset.
Figure 13. Trend of loss with the number of training iterations.
Figure 14. Comparison of pre- and post-processing confidence: (a) pre-processing and (b) post-processing.
Figure 15. Comparison of detection performance before and after improvement: (a) the original YOLO11s model and (b) Ghost-YOLO-GBH.
Figure 16. Heat map visualization results of the Ghost-YOLO-GBH modules: (a) original photo, (b) the original YOLO11s, (c) YOLO11s + Ghost, (d) YOLO11s + Ghost + HybridFocus, and (e) Ghost-YOLO-GBH.
Table 1. Confusion matrix.
Predicted \ Actual | Positive (Target Present) | Negative (Target Absent)
Positive (Predicted as Target) | TP (True Positive) | FP (False Positive)
Negative (Not Predicted as Target) | FN (False Negative) | TN (True Negative)
Table 2. Comparison of backbone networks.
Network | mAP@0.5/% | FPS | Parameter Count (M) | GFLOPS
MobileNet-V3 | 68.00 | 52.20 | 12.5 | 20.20
EfficientNet-B0 | 73.80 | 56.50 | 8.60 | 14.70
FasterNet | 71.60 | 61.20 | 7.54 | 15.80
GhostNet | 75.12 | 46.60 | 6.74 | 12.90
Table 3. Comparative experiment of the innovation modules.
Model | mAP@0.5/% | FPS | Parameter Count (M) | GFLOPS
YOLO11s | 74.80 | 40.30 | 9.46 | 21.50
YOLO11s-SimSPPF | 75.40 | 41.20 | 10.26 | 24.50
YOLO11s-HybridFocus | 79.10 | 45.40 | 10.51 | 22.30
YOLO11-BiFPN | 71.60 | 42.50 | 12.30 | 25.60
YOLO11-BiDMS-FPN | 77.10 | 44.30 | 7.76 | 22.40
Ghost-YOLO-GBH | 81.10 | 45.00 | 7.74 | 21.30
Table 4. Comparison of algorithms.
Model | mAP@0.5/% | FPS | Parameter Count (M) | GFLOPS
Faster R-CNN | 79.30 | 24.00 | 41.30 | 92.20
YOLOv5s | 70.30 | 46.20 | 7.89 | 16.50
YOLOv7 | 71.40 | 38.20 | 37.20 | 101.90
YOLOv8s | 76.70 | 32.15 | 11.15 | 28.80
YOLOv9s | 75.20 | 40.00 | 9.70 | 26.70
YOLOv12 | – | – | 9.30 | 21.40
RT-DETR | 81.10 | 98.20 | 19.80 | 57.10
YOLO11s | 74.80 | 40.30 | 9.46 | 21.50
Ghost-YOLO-GBH | 81.10 | 45.00 | 7.74 | 21.30
Table 5. Comparison of ablation experiment results.
Model | GhostNet | HybridFocus | BiDMS-FPN | mAP (%) | FPS | Parameter Count (M) | GFLOPS
YOLO11s | × | × | × | 74.80 | 40.30 | 9.46 | 21.70
Model 2 | ✓ | × | × | 75.12 | 46.60 | 6.74 | 12.90
Model 3 | × | ✓ | × | 79.10 | 45.40 | 10.51 | 22.30
Model 4 | × | × | ✓ | 77.10 | 44.30 | 7.76 | 22.40
Model 5 | ✓ | ✓ | × | 79.00 | 41.25 | 6.94 | 13.10
Model 6 | × | ✓ | ✓ | 78.10 | 33.80 | 9.03 | 24.70
Model 7 | ✓ | × | ✓ | 78.01 | 44.64 | 4.98 | 12.70
Ghost-YOLO-GBH | ✓ | ✓ | ✓ | 81.10 | 45.00 | 7.74 | 21.30
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

