3.1. Network Architecture of TSD-Net
Figure 2 illustrates the overall architecture of our proposed TSD-Net. To address the aforementioned challenges of low resolution and complex backgrounds, TSD-Net focuses on two key approaches: effective feature extraction and fusion, and receptive field expansion. We built the network architecture on YOLO11 [38]. To enhance the model's expressiveness while keeping it lightweight, we upgraded the C3k2 module in the network to a C3k2-Dynamic module, improving the model's feature representation capability by increasing the parameter count while avoiding a significant increase in FLOPs. To tackle the challenge of complex background interference, we integrated a Feature Enhancement Module (FEM) into the neck, which enhances the model's robustness against complex backgrounds by expanding the receptive field. To address the insufficient resolution in small traffic sign detection, we redesigned the feature pyramid network and constructed a high-resolution detection head to effectively preserve target details. Additionally, we designed the ADFF detection head, which combines the feature map weights generated by ASFF [27] with dynamic convolution [25], achieving efficient integration of low-level and high-level features and significantly improving the network's capability to detect densely distributed small objects.
Compared with recent works, YOLOv8 [39] has established a solid foundation for general object detection, but its inherent limitations in feature representation capability restrict its precision in detecting small traffic signs. To address this issue, our C3k2-Dynamic module implements an adaptive feature extraction mechanism that enhances representational capacity while maintaining computational efficiency. Unlike TSD-YOLO [14], which focuses on long-range attention mechanisms and a dual-branch architecture, our Feature Enhancement Module (FEM) tackles complex background interference through expanded receptive fields. Furthermore, while Transformer-based architectures typically incur substantial computational overhead and suffer from feature misalignment issues [40], our Adaptive Dynamic Feature Fusion (ADFF) detection head achieves multi-scale feature integration within a computationally lightweight architecture.
In summary, our approach addresses the limitations of previous traffic sign detection studies in low-resolution and complex background scenarios through the synergistic integration of three complementary modules. First, the C3k2-Dynamic module establishes a rich feature representation foundation while maintaining computational efficiency. These features are enhanced through the FEM, which leverages receptive field expansion to effectively distinguish traffic signs in complex visual environments. Subsequently, the ADFF detection head coordinates adaptive multi-scale feature fusion to preserve fine-grained spatial details.
Figure 3 shows our component block diagram.
3.2. C3k2-Dynamic Module
In traffic sign detection tasks, feature extraction capability directly impacts detection performance. Traditional convolutional neural networks process all input features with fixed-weight convolutional kernels, lacking adaptability to different feature contents, which limits feature representation capability [41]. To address this issue, we propose the C3k2-Dynamic module, which is based on the CSP structure and incorporates dynamic convolution mechanisms to achieve adaptive processing of different input features.
The overall architecture of the C3k2-Dynamic module is shown in Figure 4a. Consistent with the C3k2 module, when the C3k parameter is True, C3k2-Dynamic will use the C3k module; otherwise, it will use the Bottleneck module. The C3k2-Dynamic module retains the basic CSP structure of C3k2 but replaces the standard Bottleneck with a dynamic convolution Bottleneck, as illustrated in Figure 4b. The dynamic convolution Bottleneck consists of two key components: the first part is a standard convolution for channel dimensionality reduction to decrease subsequent computational costs; the second part is a dynamic convolution layer which, unlike the fixed convolutional kernels in standard convolution, can adaptively adjust its convolutional weights based on the input features, as shown in Figure 5.
There are significant structural differences between conventional convolution and dynamic convolution. Conventional convolution processes the input features $X$ using only a single fixed kernel, while dynamic convolution fuses parameters from multiple expert kernels to create a richer parameter space. For input features $X$, our dynamic convolution layer with $M$ experts is defined as [25]:

$$Y = \left( \sum_{i=1}^{M} \pi_i(X)\, W_i \right) * X,$$

where $*$ is the convolution operation, $W_i$ represents the weight tensor of the $i$-th expert convolution (kernel size $K \times K$), and $\pi_i(X)$ is the dynamically generated weight coefficient. These coefficients are obtained through a routing function:

$$\pi(X) = \sigma\big( W_r \cdot \mathrm{GAP}(X) + b_r \big),$$

where $\mathrm{GAP}(\cdot)$ represents the global average pooling layer, $W_r$ and $b_r$ are learnable parameters, and $\sigma(\cdot)$ denotes the sigmoid function that ensures coefficient values range between 0 and 1. Additionally, our choices regarding the number of expert kernels ($M$), the kernel size ($K$), and the use of the sigmoid function in the routing mechanism will be validated through comprehensive ablation studies presented in Section 4.2.1. Although multiple expert kernels are introduced, the computation does not involve $M$ separate convolution operations. Instead, our approach first generates fusion weights through the routing function, then linearly combines the $M$ expert kernels into a single effective convolution kernel, and finally performs only one standard convolution operation. Consequently, the computational overhead introduced by the parameter generation and weight fusion steps is negligible compared to the convolution operation itself [25].
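As a concrete illustration, the PyTorch sketch below shows one plausible way to realize this fused computation. It is a schematic reading of [25] rather than our exact module: the class and parameter names (DynamicConv2d, num_experts, router) are our own, and the routing head is reduced to a single linear layer followed by a sigmoid.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicConv2d(nn.Module):
    """Illustrative M-expert dynamic convolution: a routing function produces
    sigmoid-gated coefficients, the expert kernels are fused into a single
    effective kernel, and only one standard convolution is executed."""

    def __init__(self, in_ch, out_ch, kernel_size=3, num_experts=4):
        super().__init__()
        self.out_ch, self.k = out_ch, kernel_size
        # M expert weight tensors W_i, each of shape (out_ch, in_ch, K, K).
        self.experts = nn.Parameter(
            torch.randn(num_experts, out_ch, in_ch, kernel_size, kernel_size) * 0.02)
        # Routing function pi(X) = sigmoid(W_r * GAP(X) + b_r).
        self.router = nn.Linear(in_ch, num_experts)

    def forward(self, x):
        b, c, h, w = x.shape
        pi = torch.sigmoid(self.router(x.mean(dim=(2, 3))))        # (B, M)
        # Linearly combine the M expert kernels into one kernel per sample.
        fused = torch.einsum('bm,moikl->boikl', pi, self.experts)  # (B, out, in, K, K)
        # Single grouped convolution: one group per batch element.
        out = F.conv2d(x.reshape(1, b * c, h, w),
                       fused.reshape(b * self.out_ch, c, self.k, self.k),
                       padding=self.k // 2, groups=b)
        return out.reshape(b, self.out_ch, h, w)
```

For example, `DynamicConv2d(64, 64)(torch.randn(2, 64, 80, 80))` returns a tensor of shape (2, 64, 80, 80) while performing only one grouped convolution per forward pass, which is the property exploited above to keep the extra cost of weight fusion negligible.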
With this improvement, the dynamic convolution mechanism enables the model to adapt its convolution strategy to the content of the input features, offering clear advantages over traditional fixed-kernel convolution. In traffic sign detection within road environments, dynamic convolution leverages adaptive parameters to accommodate highly variable contexts and helps mitigate class imbalance in datasets by providing more flexible feature extraction for rare sign categories. Furthermore, its multi-expert architecture substantially expands the model's parameter space without proportionally increasing computational complexity, making it well suited for deployment in autonomous driving systems.
3.3. Feature Enhancement Module (FEM)
Due to the complexity of road environments, small-sized traffic sign detection tasks frequently encounter traffic signs with similar features, while traditional YOLO architectures exhibit limited feature extraction capabilities: the features extracted at this stage contain minimal semantic information and have narrow receptive fields, making it challenging to differentiate small-sized traffic signs from background elements [12,16]. To address this issue, we introduce the Feature Enhancement Module (FEM) [26] to enhance the backbone network's ability to extract features from small-sized traffic signs. FEM effectively enhances small object feature representation and separates targets from the background through a three-fold mechanism: first, it employs a multi-branch convolutional structure to extract multi-dimensional discriminative semantic information, enabling the model to simultaneously attend to various features of traffic signs, including shape, color, and texture [42]; second, it utilizes atrous convolution to obtain more abundant local contextual information, effectively expanding the receptive field; when backgrounds are complex, a larger receptive field helps the model understand the relationship between targets and their surroundings [43]; finally, through a residual structure, it forms an identity mapping of the feature map to preserve critical feature information of small objects, ensuring that original discriminative features are not lost while the feature representation is enhanced [44].
The overall architecture of FEM is illustrated in Figure 6. The module comprises four parallel branches, two of which are equipped with atrous convolution. Each branch initially performs a 1 × 1 convolution operation on the input feature map to optimize the channel dimensions for subsequent processing. The first branch is constructed as a residual structure, forming an identity mapping that preserves critical feature information of small objects. The remaining three branches perform cascaded standard convolution operations with kernel sizes of 1 × 3, 3 × 1, and 3 × 3, respectively; notably, the middle two branches introduce atrous convolution, enabling the extracted feature maps to retain richer contextual information. Finally, the outputs of the three convolution branches are concatenated along the channel dimension and fused with the residual branch through element-wise addition, generating a comprehensive feature representation containing multi-scale and multi-directional information.
The mathematical representation of the FEM can be formalized as follows:

$$F_1 = \mathrm{DConv}\big( \mathrm{Conv}_{1 \times 3}( \mathrm{Conv}_{1 \times 1}(F) ) \big),$$
$$F_2 = \mathrm{DConv}\big( \mathrm{Conv}_{3 \times 1}( \mathrm{Conv}_{1 \times 1}(F) ) \big),$$
$$F_3 = \mathrm{Conv}_{3 \times 3}\big( \mathrm{Conv}_{1 \times 1}(F) \big),$$
$$Y = \mathrm{Concat}(F_1, F_2, F_3) \oplus F,$$

where $\mathrm{Conv}_{1 \times 1}$, $\mathrm{Conv}_{1 \times 3}$, $\mathrm{Conv}_{3 \times 1}$, and $\mathrm{Conv}_{3 \times 3}$ denote standard convolution operations with kernel sizes of 1 × 1, 1 × 3, 3 × 1, and 3 × 3, respectively; $\mathrm{DConv}$ represents atrous convolution with a dilation rate of 5 (the optimal value determined through ablation studies, as detailed in Section 4.2.2); $\mathrm{Concat}(\cdot)$ indicates the concatenation operation of feature maps along the channel dimension; ⊕ denotes element-wise addition of feature maps; $F$ is the input feature map; $F_1$, $F_2$, and $F_3$ represent the output feature maps of the three convolution branches; and $Y$ is the final output feature map of the FEM.
Through this design, FEM effectively addresses the insufficient feature representation problem in small object detection. The multi-branch structure enables the module to extract discriminative features from different dimensions while the atrous convolution increases the receptive field, which is conducive to learning richer local contextual features and enhancing the ability to capture contextual information.
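To make the branch layout concrete, the sketch below implements the structure described above as we read Figure 6. The channel split, the 3 × 3 kernel of the atrous convolutions, and the omission of normalization and activation layers are simplifying assumptions, so this should be treated as a schematic rather than the exact module.

```python
import torch
import torch.nn as nn


class FEMSketch(nn.Module):
    """Illustrative four-branch FEM: an identity (residual) branch plus three
    convolution branches, two of which end in atrous 3x3 convolutions."""

    def __init__(self, channels, dilation=5):
        super().__init__()
        c = channels // 3  # each conv branch keeps roughly a third of the channels
        # Branch 1: 1x1 -> 1x3 -> dilated 3x3
        self.branch1 = nn.Sequential(
            nn.Conv2d(channels, c, 1),
            nn.Conv2d(c, c, (1, 3), padding=(0, 1)),
            nn.Conv2d(c, c, 3, padding=dilation, dilation=dilation))
        # Branch 2: 1x1 -> 3x1 -> dilated 3x3
        self.branch2 = nn.Sequential(
            nn.Conv2d(channels, c, 1),
            nn.Conv2d(c, c, (3, 1), padding=(1, 0)),
            nn.Conv2d(c, c, 3, padding=dilation, dilation=dilation))
        # Branch 3: 1x1 -> 3x3 (no dilation)
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, channels - 2 * c, 1),
            nn.Conv2d(channels - 2 * c, channels - 2 * c, 3, padding=1))

    def forward(self, f):
        # Concatenate the three conv branches, then add the residual branch.
        y = torch.cat([self.branch1(f), self.branch2(f), self.branch3(f)], dim=1)
        return y + f  # element-wise addition with the identity mapping
```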
3.4. Adaptive Dynamic Feature Fusion (ADFF) Detection Head
The Feature Pyramid Network (FPN) [45] enhances the hierarchical feature representation of CNNs through a top-down pathway, significantly improving multi-scale object detection performance. YOLO11 [38] adopts an enhanced FPN structure, achieving efficient multi-scale object detection through multi-level feature fusion. This structure contains bottom-up and top-down feature transmission paths and introduces a PSA attention mechanism to enhance feature representation capability. YOLO11's detection heads are distributed across three feature layers with different resolutions, specifically at 1/8, 1/16, and 1/32 of the input image resolution, corresponding to small-, medium-, and large-scale object detection. However, for most small-sized traffic signs, this FPN architecture still lacks sufficient resolution [46]. Furthermore, YOLO11 primarily employs simple element-wise addition or feature concatenation during feature fusion; these static fusion methods struggle to adaptively adjust fusion strategies according to variations in scenes and object sizes [47]. Although feature maps at different scales contain complementary semantic and spatial information, how to effectively integrate this information to improve detection performance remains a challenge to be addressed.
To solve the aforementioned problems, we redesigned the FPN structure. First, by adding a small object detection layer on the 4× downsampled high-resolution feature map, we enable the backbone network to extract multi-scale features $\{P_2, P_3, P_4, P_5\}$, where the highest-resolution $P_2$ layer contains rich detail information particularly suitable for small-sized traffic sign detection. Furthermore, to address the feature misalignment issue between shallow and deep layers, we propose an Adaptive Dynamic Feature Fusion (ADFF) detection head inspired by ASFF [27]. As shown in Figure 7, ADFF includes two key steps: feature rescaling and adaptive dynamic fusion. The feature rescaling step precisely aligns features from different scales while preserving their unique characteristics. The adaptive dynamic fusion step then determines the optimal contribution of each scale at each spatial location. Unlike existing methods [27], ADFF implements a dual-layer adaptive mechanism with dynamic convolution in the weight-generation network, allowing the model to dynamically adjust fusion weights based on the input content.
Feature Rescaling. In our improved FPN, we denote the feature maps from the different levels of YOLO11 as $X^{n}$, where $n \in \{2, 3, 4, 5\}$ corresponds to the four feature levels $P_2$–$P_5$ at 1/4, 1/8, 1/16, and 1/32 of the input resolution, respectively. To achieve cross-scale feature fusion, we need to adjust all feature maps to the same spatial dimensions. For instance, for the first layer ($l = 2$), we rescale the feature maps from the other levels $n$ to match the shape of $X^{l}$. Given that the four levels of YOLO11 have varying resolutions and channel dimensions, we employ different strategies for the upsampling and downsampling operations. For upsampling operations, we first use a convolutional layer to compress the feature map's channel dimension to match the target level, followed by nearest-neighbor interpolation to enlarge the feature map resolution to the target size. For example, when the target level is $P_2$, the $P_3$ feature map requires 2× upsampling, while $P_4$ requires 4× upsampling. For downsampling operations, we implement a multi-level downsampling strategy. When the downsampling ratio is 2, we utilize a convolutional layer with a stride of 2 to simultaneously adjust the channel dimension and reduce the resolution; when the downsampling ratio is 4, we first apply a max-pooling layer with a stride of 2, followed by a convolutional layer with a stride of 2; larger downsampling ratios (such as 8) are achieved by cascading multiple downsampling operations.
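These rescaling rules can be summarized in a small factory function, sketched below. The kernel sizes, the function name, and the exact ordering of pooling and convolution are illustrative assumptions rather than the paper's implementation.

```python
import torch.nn as nn


def make_rescaler(in_ch, out_ch, ratio):
    """Build a rescaling block. ratio > 1 downsamples by `ratio`;
    ratio < 1 upsamples by 1/ratio (e.g., 0.25 means 4x upsampling)."""
    if ratio == 1:
        return nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
    if ratio == 2:   # stride-2 conv adjusts channels and halves resolution
        return nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)
    if ratio == 4:   # max-pool (stride 2) followed by a stride-2 conv
        return nn.Sequential(
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1))
    if ratio > 4:    # larger ratios: cascade downsampling stages
        return nn.Sequential(
            make_rescaler(in_ch, in_ch, ratio // 2),
            make_rescaler(in_ch, out_ch, 2))
    # ratio < 1: 1x1 conv to match channels, then nearest-neighbor upsampling
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 1),
        nn.Upsample(scale_factor=int(1 / ratio), mode='nearest'))


# Example: bring a P4-level map (1/16 resolution) onto the P2 grid (1/4):
# rescale_p4_to_p2 = make_rescaler(256, 64, ratio=0.25)
```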
Adaptive Dynamic Fusion. For a feature vector at position $(i, j)$ in each rescaled feature map, the fusion process can be represented as:

$$y_{ij}^{l} = \alpha_{ij}^{l} \cdot x_{ij}^{2 \rightarrow l} + \beta_{ij}^{l} \cdot x_{ij}^{3 \rightarrow l} + \gamma_{ij}^{l} \cdot x_{ij}^{4 \rightarrow l} + \delta_{ij}^{l} \cdot x_{ij}^{5 \rightarrow l},$$

where $y_{ij}^{l}$ represents the feature vector of the output feature map $y^{l}$ at position $(i, j)$, $x_{ij}^{n \rightarrow l}$ is the feature vector at position $(i, j)$ of the feature map rescaled from level $n$ to level $l$, and $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, $\gamma_{ij}^{l}$, and $\delta_{ij}^{l}$ denote the spatial importance weights from the four different levels of features. Similar to previous methods [27], we enforce $\alpha_{ij}^{l} + \beta_{ij}^{l} + \gamma_{ij}^{l} + \delta_{ij}^{l} = 1$ and $\alpha_{ij}^{l}, \beta_{ij}^{l}, \gamma_{ij}^{l}, \delta_{ij}^{l} \in [0, 1]$. Therefore, the spatial importance weight $\alpha_{ij}^{l}$ can be defined as:

$$\alpha_{ij}^{l} = \frac{e^{\lambda_{\alpha, ij}^{l}}}{e^{\lambda_{\alpha, ij}^{l}} + e^{\lambda_{\beta, ij}^{l}} + e^{\lambda_{\gamma, ij}^{l}} + e^{\lambda_{\delta, ij}^{l}}}.$$

Similarly, $\beta_{ij}^{l}$, $\gamma_{ij}^{l}$, and $\delta_{ij}^{l}$ are calculated through the softmax function, where $\lambda_{\alpha}^{l}$, $\lambda_{\beta}^{l}$, $\lambda_{\gamma}^{l}$, and $\lambda_{\delta}^{l}$ serve as the corresponding control parameters.
Unlike traditional methods that use fixed 1 × 1 convolutional layers to generate the weight scalar maps, our ADFF method employs the dynamic convolution mechanism described in Section 3.2 to compute these weight scalar maps. Specifically, for each layer's rescaled feature map $x^{n \rightarrow l}$, we utilize a dynamic convolution layer with $M$ dynamic experts to generate the corresponding weight map $\lambda_{n}^{l}$. Compared to traditional convolution with fixed kernels, this dynamic convolution mechanism introduces only negligible computational overhead while significantly enhancing the model's expressive capacity, thereby learning more optimal feature weight distributions.
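The fusion step itself can be sketched compactly: per-level control maps are generated, normalized with a per-pixel softmax so the four weights sum to one at every position, and then used to blend the rescaled features. In the sketch below a plain 1 × 1 convolution stands in for the weight generator; in ADFF itself this generator is the M-expert dynamic convolution of Section 3.2, and all class and variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveFusion(nn.Module):
    """Sketch of adaptive fusion for one target level l: each of the four
    rescaled inputs x^{n->l} yields a one-channel control map lambda_n; a
    per-pixel softmax turns these into weights that sum to 1."""

    def __init__(self, channels):
        super().__init__()
        # One weight-map generator per level (1x1 conv here as a stand-in
        # for the dynamic convolution layer described above).
        self.routers = nn.ModuleList(
            [nn.Conv2d(channels, 1, 1) for _ in range(4)])

    def forward(self, feats):
        # feats: list of four tensors of identical shape (B, C, H, W),
        # already rescaled to the target level l.
        lambdas = torch.cat([r(f) for r, f in zip(self.routers, feats)], dim=1)
        weights = F.softmax(lambdas, dim=1)   # (B, 4, H, W), sums to 1 per pixel
        # y_ij = alpha*x^{2->l} + beta*x^{3->l} + gamma*x^{4->l} + delta*x^{5->l}
        return sum(weights[:, n:n + 1] * feats[n] for n in range(4))
```

The softmax normalization mirrors the constraint that the four spatial importance weights are non-negative and sum to one, so each pixel of the fused map is a convex combination of the four rescaled feature levels.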