Traffic Sign Detection Based on Lightweight Multiscale Feature Fusion Network

Abstract: Traffic sign detection is a research hotspot in advanced assisted driving systems, given the complex backgrounds, lighting changes, and scale variations of traffic sign targets, as well as the slow speed and low accuracy of existing detection methods. To solve these problems, this paper proposes a traffic sign detection method based on a lightweight multiscale feature fusion network. Since a lightweight network model is simple and has fewer parameters, it can greatly improve the detection speed of a target. To learn more target features and improve the generalization ability of the model, a multiscale feature fusion method can be used to improve recognition accuracy during training. Firstly, MobileNetV3 was selected as the backbone network, a new spatial attention mechanism was introduced, and a spatial attention branch and a channel attention branch were constructed to obtain a mixed attention weight map. Secondly, a feature-interleaving module was constructed to convert the single-scale feature map of the specified layer into a multiscale feature fusion map to realize the combined encoding of high-level semantic information and low-level semantic information. Then, a feature extraction base network for lightweight multiscale feature fusion with an attention mechanism was constructed based on the above steps. Finally, a key-point detection network was constructed to output the location information, bias information, and category probability of the center points of traffic signs to achieve the detection and recognition of traffic signs. The model was trained, validated, and tested using the TT100K dataset; the detection accuracy of 36 common categories of traffic signs reached more than 85%, and the detection accuracy of five categories exceeded 95%.
The results showed that, compared with the traditional methods of Faster R-CNN, CornerNet, and CenterNet, traffic sign detection based on a lightweight multiscale feature fusion network had obvious advantages in recognition speed and accuracy, significantly improved the detection performance for small targets, and achieved better real-time performance.


Introduction
Given the various problems existing in the process of road transportation [1], organizations in various countries have invested manpower, material resources, and financial resources to find solutions. Advanced driving assistance systems (ADASs) have become one of the key technologies for addressing these problems. An ADAS collects information about the movement of a vehicle and its surrounding environment and interprets that information to help drivers make decisions. As an important part of an ADAS, traffic sign detection and recognition can convey the guidance, restriction, warning, or prompt information expressed by traffic signs within a certain range to drivers in a timely manner, which ensures the safety of road driving to a certain extent.
In early studies, feature-learning methods were mainly used, which utilize the inherent features of traffic signs and detect and identify them based on machine learning. In references [2][3][4][5], traffic signs are detected by feature-learning methods. However, these methods rely heavily on the shape and color features of traffic signs. Cases with clear colors and standard shapes yield better detection and recognition results, but in complex environments, such as those with lighting changes and shape distortion, these methods suffer from low detection and recognition rates. Yu et al. [6] designed a framework for traffic sign detection in complex scenes based on visual co-visuality that used three visual attention cues, contrast, center deviation, and symmetry, to detect traffic signs. The advantages of this method were the integration of bottom-up and top-down visual processing and the absence of heavy learning tasks. Yu et al. [7] proposed a traffic sign detection method based on saliency maps and Fourier descriptors. This method used frequency tuning to obtain a saliency map and used a binary operation to obtain a binary image and capture the traffic sign area. Based on a visual attention mechanism model, Zhang et al. [8] extracted edge and color information as early visual features, calculated each feature to obtain a visual saliency map, and then used a graph neural network and the K-means algorithm to determine candidate regions containing traffic signs. The visual saliency method mainly relies on modeling of the visual attention mechanism. In complex environments, this kind of method achieves a certain improvement in the detection and recognition rates of traffic signs compared with feature-learning methods. However, this kind of method is relatively deficient in the extraction of deep features of traffic signs, such as features of traffic signs of the same category in different weather environments.
Deep convolutional neural networks are widely used in traffic sign detection and recognition tasks and have achieved good results. Yin et al. [9] designed a novel structure combining intranetwork connections and residual connections and used an efficient GPU to accelerate convolution operations. Zhe et al. [10] created the TT100K traffic sign database based on Tencent Street View and labeled each traffic sign appearing in the images. Xie et al. [11] proposed a two-level cascaded convolutional neural network structure that could effectively improve the classification accuracy of traffic signs. Zhu et al. [12] proposed a novel framework for traffic sign detection and recognition that contained a fully convolutional network to guide traffic sign proposals and a deep convolutional neural network for target classification. Zhang [13] and Zuo [14] each applied Faster R-CNN to traffic sign detection and recognition and improved the efficiency of feature extraction through an RPN network and end-to-end model training. There are many kinds of traffic signs, including not only graphic signs but also text signs. To detect graphic signs and text signs at the same time, Luo et al. [15] first classified the region of interest into two categories to distinguish graphic and text signs and then identified specific categories through a deep feature extraction network. Zhu et al. [16] proposed a text-based traffic sign detection network that solved the multiscale problem of text detection by narrowing the text detection area. In practical applications, small traffic signs are prone to low resolution when imaging and different degrees of blur or deformation, so the images contain less information and the features are not easy to extract. In response to this problem, Peng et al. [17] introduced local regions into a detection network structure to provide more local information for the back-end recognition network and improve the detection and recognition rates of small traffic signs. 
Pei et al. [18] proposed a multiscale deconvolution network that flexibly used a multiscale CNN to form efficient and reliable local traffic sign recognition model training. Li et al. [19] and Heng et al. [20] both used generative adversarial networks to enhance the detection and recognition capabilities of the original network. Xiang et al. [21] proposed an improved multiscale capsule network for traffic sign recognition. Traffic sign images are susceptible to the effects of shooting angle and distance, which change the size of the sign and seriously affect the feature extraction and classification processes. To address this problem, Xie et al. [22] proposed a traffic sign recognition method based on the LeNet-5 network that adjusts target size using bilinear interpolation and then uses the adjusted image for feature extraction. In complex traffic scenarios, some objects are similar to traffic signs in appearance, and a detector may detect them incorrectly. To address this problem, Yuan et al. [23] found that the locations of traffic signs exhibited obvious statistical regularities, so location prior information was introduced into the network to improve the accuracy of traffic sign detection. Lee et al. [24] proposed a novel traffic sign detection system that used a CNN to simultaneously estimate the locations and precise boundaries of traffic signs. Kong et al. [25] reviewed the latest research results and outlined future development trends of machine learning and deep learning in traffic target detection. Zhou et al. [26] proposed a standard dataset and presented a high-resolution traffic sign recognition algorithm for complex environments. There are many small objects in traffic scenes, but their detection remains a challenge due to low resolution and limited information. For this reason, Lian et al. [27] proposed a small object detection method for traffic scenes based on attentional feature fusion to improve small target detection accuracy.
To better apply convolutional neural networks in the field of traffic sign detection and recognition, researchers have proposed many convolutional neural networks with better performance but higher network complexity based on AlexNet [28], such as VGG [29], ResNet [30], and DenseNet [31]. The transportation problem is complex and nonlinear, and a large amount of data must be processed. Because the storage space of embedded hardware devices is very limited, traditional deep neural network models are difficult to deploy on such devices. In response to the high complexity of deep neural networks, which makes them difficult to apply in real life, some scholars have studied lightweight networks that reduce complexity while maintaining network performance and have made breakthrough progress. Research on lightweight networks is mainly divided into two directions: model compression and network structure design. Representative methods of model compression are knowledge distillation, pruning, quantization, and low-rank decomposition, and representative networks of network structure design are SqueezeNet [32], MobileNet [33,34], ShuffleNet [35], and Xception [36].
In summary, traffic sign detection and recognition algorithms based on deep learning can handle more complex application scenarios, but there are still some difficulties in practical application. A traffic sign detection algorithm requires high real-time performance. Additionally, a traffic sign generally occupies a small proportion of the image, and there is a certain scale change. Based on a situation analysis of traffic sign detection and recognition, this paper proposes a traffic sign detection and recognition algorithm based on lightweight multiscale feature fusion and an attention mechanism for intelligent driving applications. Firstly, the convolutional neural network C-MobileNetV3, which is based on the improved MobileNetV3 [37], is adopted as the feature extraction network, and the input images are convolved with different numbers of convolutional kernels to generate single-scale feature maps of different depths. Then, feature maps of different scales are input into the feature-interleaving module, and the top-down connection, bottom-up connection, and layer-by-layer cascade modules are used to obtain multiscale feature fusion maps that fuse information from different scales. Finally, the obtained multiscale feature fusion map is input into the detection network for foreground and background discrimination to complete the detection and recognition of traffic signs.

Lightweight Feature Extraction Network
To efficiently complete the extraction of feature maps, in this paper we selected a modified lightweight network based on MobileNetV3 as the base network for feature extraction and named it C-MobileNetV3. The basic structure of the network was the bottleneck [38]. The core idea was to reduce computational complexity by using depthwise separable convolution instead of standard convolution. In addition, a lightweight attention module was added to enhance the network's learning ability. The bottleneck design consisted of two structures, with step sizes of 1 and 2, respectively. The structure with a step size of 1 included a residual connection. In addition, some bottlenecks included a squeeze-and-excitation (SE [32]) module, as shown in Figure 1. The input image was first mapped from the low-dimensional space to the high-dimensional space by an expanding 1 × 1 convolution to obtain enough information; the number of expansion channels could be set as desired. Next, channel-by-channel convolution was used to process the information of each channel, and point-by-point convolution was used to complete the joint coding between channels and compress the number of channels, which in turn made the network lighter. ReLU6 was used as the activation function for the first half of the model, and Hard-Swish was used for the second half. The overall framework of the lightweight feature extraction base network is shown in Figure 2. The input image was first convolved in 16 dimensions with a step size of 2 and a convolution kernel size of 3 × 3 to extract preliminary information and generate a 1024 × 1024 × 16 feature map. Then, it was input into the feature extraction network and processed by stacked bottlenecks of different specifications.
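The bottleneck structure described above can be sketched in PyTorch as follows: a 1 × 1 expansion convolution, a 3 × 3 depthwise (channel-by-channel) convolution, an optional SE module, and a 1 × 1 point-by-point projection, with a residual connection only when the step size is 1 and the input and output channels match. The class names and channel sizes here are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: reweight channels by a learned gating vector."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                      # squeeze to 1x1
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Hardsigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)  # excitation: per-channel reweighting

class Bottleneck(nn.Module):
    """MobileNetV3-style bottleneck: expand -> depthwise -> (SE) -> project."""
    def __init__(self, in_ch, exp_ch, out_ch, stride=1, use_se=True):
        super().__init__()
        self.use_res = stride == 1 and in_ch == out_ch
        layers = [
            nn.Conv2d(in_ch, exp_ch, 1, bias=False),      # 1x1 expansion
            nn.BatchNorm2d(exp_ch), nn.ReLU6(inplace=True),
            nn.Conv2d(exp_ch, exp_ch, 3, stride, 1,       # 3x3 depthwise conv
                      groups=exp_ch, bias=False),
            nn.BatchNorm2d(exp_ch), nn.Hardswish(inplace=True),
        ]
        if use_se:
            layers.append(SEBlock(exp_ch))
        layers += [nn.Conv2d(exp_ch, out_ch, 1, bias=False),  # 1x1 projection
                   nn.BatchNorm2d(out_ch)]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_res else out           # residual if stride=1
```

Stacking such blocks with different expansion and output channel counts reproduces the staged structure of the feature extraction backbone.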

Multiscale Feature Fusion Network
In this paper, we proposed a feature-interleaving module for feature fusion by borrowing the idea of BiFPN [39]. As shown in Figure 3, the network was generally composed of three parts: a top-down module, a bottom-up module, and a layer-by-layer connection module. We used the summation operation for feature fusion and achieved consistency in the number of channels through a 1 × 1 × 240 convolution kernel. The green arrows in the figure represent bilinear interpolation. The size of the feature map after bilinear interpolation became twice the previous one, and the transformed feature map could then be fused with the underlying feature map. A 3 × 3 convolution kernel was used to re-extract the features to ensure their stability.
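One fusion step of the kind described above can be sketched as follows: 1 × 1 convolutions align both maps to 240 channels, bilinear interpolation doubles the size of the deeper (smaller) map, the two maps are summed, and a 3 × 3 convolution re-extracts the fused features. The class name `FuseStep` and the input channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseStep(nn.Module):
    """One top-down fusion step: align channels, upsample x2, sum, refine."""
    def __init__(self, high_ch, low_ch, out_ch=240):
        super().__init__()
        self.align_high = nn.Conv2d(high_ch, out_ch, 1)  # 1x1, unify channels
        self.align_low = nn.Conv2d(low_ch, out_ch, 1)
        self.refine = nn.Conv2d(out_ch, out_ch, 3, padding=1)  # 3x3 re-extract

    def forward(self, high, low):
        # high: deeper, half-resolution map; low: shallower, full-resolution map
        high = self.align_high(high)
        high = F.interpolate(high, scale_factor=2, mode="bilinear",
                             align_corners=False)        # bilinear upsample x2
        low = self.align_low(low)
        return self.refine(high + low)                   # summation fusion
```

The bottom-up path works symmetrically with stride-2 downsampling in place of interpolation; chaining such steps over the selected backbone layers yields the multiscale fusion maps.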

Hybrid Attention Module
As can be seen from the structure in Figure 4, the input feature map of the attention module [40] generated the channel domain attention feature map after passing through the channel attention branch and generated the spatial attention feature map after passing through the spatial attention branch; the two were then added pixel by pixel to generate the hybrid attention feature map.

Spatial Attention Module
The structure diagram of the spatial attention module is shown in Figure 5, which omits the batch normalization (BN) layer and the sigmoid layer. The main steps of the spatial attention branch subnetwork were as follows:
1. Input the feature map F^(H×W×C), where the superscript indicates the height and width of the feature map and the number of channels.
2. Convolve the input feature map with a kernel of size 3 × 3 to generate the spatial feature map F_S = BN(f^(3×3)(F)), where BN denotes the batch normalization operation, f denotes the convolution operation, and the superscript of f denotes the convolution kernel size and the number of output channels.
3. Compress the channel direction to generate the pooled feature maps C-MaxPool(F_S) and C-AvgPool(F_S).
4. Generate the spatial attention weight map by pooling, convolution, and deformation operations:

M_S = ReShape(Sigmoid(f(Concat(C-MaxPool(F_S), C-AvgPool(F_S)))))

where C-MaxPool is the pixel maximum in the channel direction of the feature map; C-AvgPool is the pixel average in the channel direction of the feature map; Concat is the stitching of feature maps in the channel direction; Sigmoid is the activation function; and ReShape is the deformation operation.
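The spatial branch steps above can be sketched as follows, under the assumption (consistent with the hybrid module description) that the branch outputs its spatial feature map reweighted by the attention map. The 7 × 7 kernel over the two pooled maps is an illustrative choice, not stated in the text.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention branch: 3x3 conv, channel-wise pooling, sigmoid map."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Sequential(          # step 2: 3x3 conv + BN -> F_S
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels))
        self.fuse = nn.Conv2d(2, 1, 7, padding=3)  # conv over pooled maps

    def forward(self, x):
        fs = self.conv(x)
        mx, _ = fs.max(dim=1, keepdim=True)   # C-MaxPool: max over channels
        av = fs.mean(dim=1, keepdim=True)     # C-AvgPool: mean over channels
        w = torch.sigmoid(self.fuse(torch.cat([mx, av], dim=1)))
        return fs * w   # spatial attention feature map (H x W weights)
```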

Channel Attention Module
The channel domain attention mechanism used a network-learning method to obtain feature map channel weights, which complemented the spatial domain attention branch to achieve a reasonable allocation of computing resources. The structure of the channel domain attention branch is shown in Figure 6. The main steps were as follows:
1. Input the feature map F.
2. Perform a convolution on the input feature map F to generate the channel domain basic feature map F_C = BN(f(F)), where BN denotes the batch normalization operation, f denotes the convolution operation, and the superscript of f denotes the convolution kernel size and the number of output channels.
3. Calculate the global maximum pooling MaxPool(F_C) and the global average pooling AvgPool(F_C).
4. Generate the channel attention weight map:

M_C = ReShape(Sigmoid(BN(f(MaxPool(F_C)) ⊕ f(AvgPool(F_C)))))

where ⊕ denotes pixel-by-pixel summation and Sigmoid denotes the activation function. The weight map acts on the feature map by multiplying pixel by pixel, and the outputs of the two attention branches are combined by adding pixel by pixel.
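A minimal sketch of the channel branch, assuming (as in common channel-attention designs) that the pooled vectors pass through a shared bottleneck transformation before the pixel-by-pixel summation and sigmoid; the reduction ratio and shared 1 × 1 convolutions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention branch: conv, global pooling, shared transform, gate."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.conv = nn.Sequential(      # basic channel-domain feature map F_C
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels))
        self.shared = nn.Sequential(    # shared transform over pooled vectors
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1))

    def forward(self, x):
        fc = self.conv(x)
        mx = self.shared(nn.functional.adaptive_max_pool2d(fc, 1))
        av = self.shared(nn.functional.adaptive_avg_pool2d(fc, 1))
        w = torch.sigmoid(mx + av)      # pixel-by-pixel sum, then sigmoid
        return fc * w                   # channel attention feature map
```

The hybrid attention output is then the pixel-by-pixel sum of the spatial and channel branch feature maps, as described for Figure 4.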

Detection Network
Based on the idea of CenterNet [41], this paper used an anchor-free target detection method to detect traffic signs. The center point of the traffic sign ground-truth box in the image was used to represent the current target; that is, a traffic sign was detected as a point. The actual prediction box of the object was obtained by predicting the center point offset and the width and height of the target, and a heat map represented the classification information. The detection network framework is shown in Figure 7. The output of the multiscale fusion feature extraction network was used as the input of the detection network and fed into three branches that predicted the heat map, the width and height, and the position offset of the traffic sign center point. Center points with higher scores were then retained through maximum-value suppression, the obtained results were mapped back to the original image size, and finally, the target position and category were obtained. The loss function for the heat map portion was denoted as follows:

L_k = −(1/N) Σ_xyc { (1 − Ŷ_xyc)^α log(Ŷ_xyc),                      if Y_xyc = 1
                     (1 − Y_xyc)^β (Ŷ_xyc)^α log(1 − Ŷ_xyc),        otherwise }

where Ŷ_xyc represents the predicted value of the heat map at channel c, position (x, y); Y_xyc represents the true value of the corresponding position, which was calculated based on a Gaussian distribution around the ground-truth center point; N represents the number of traffic signs in the picture; the α parameter is used to control the loss weights of the hard and easy classification samples; and the β parameter is used for weighting. In the experiments, α was taken as 2, and β was taken as 4.
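The maximum-value suppression step described above follows the standard CenterNet-style trick: a heat map point is kept only if it equals its local 3 × 3 maximum, and the top-K scoring points become candidate centers. A minimal sketch, where the function name and the value of K are illustrative:

```python
import torch
import torch.nn.functional as F

def extract_peaks(heatmap, k=100):
    """Keep local maxima of a (B, C, H, W) center-point heat map, take top-K."""
    pooled = F.max_pool2d(heatmap, 3, stride=1, padding=1)
    peaks = heatmap * (pooled == heatmap).float()   # zero out non-maxima
    b, c, h, w = peaks.shape
    scores, idx = peaks.view(b, -1).topk(k)         # top-K over classes+pixels
    cls = torch.div(idx, h * w, rounding_mode="floor")  # recover class index
    rem = idx % (h * w)
    ys = torch.div(rem, w, rounding_mode="floor")       # recover row
    xs = rem % w                                        # recover column
    return scores, cls, ys, xs
```

The surviving (x, y) locations are then combined with the predicted offsets and widths/heights to form boxes in the original image coordinates.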
The bias loss function was denoted as follows:

L_off = (1/N) Σ_p |Ô_p − O_p|

where O_p is the actual bias, and Ô_p represents the bias predicted by the network.
The length and width prediction loss was calculated as follows:

L_size = (1/N) Σ_k |Ŝ_k − S_k|

where Ŝ_k refers to the predicted width and height, and S_k refers to the actual width and height. The total loss function of the network was as follows:

L = L_k + λ_off L_off + λ_size L_size

where λ_off and λ_size represent the weights of the different partial loss functions.
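The three loss terms above can be sketched as follows: a focal-style heat map loss with α = 2 and β = 4, plus L1 losses on the offset and size predictions, combined with the weights λ_off and λ_size. The tensor layouts and the example weight values are illustrative assumptions, not the paper's exact settings.

```python
import torch

def heatmap_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
    """Penalty-reduced focal loss over the center-point heat map."""
    pred = pred.clamp(eps, 1 - eps)
    pos = gt.eq(1).float()                    # ground-truth center points
    neg = 1 - pos
    n = pos.sum().clamp(min=1)                # number of targets N
    pos_loss = ((1 - pred) ** alpha * torch.log(pred) * pos).sum()
    neg_loss = ((1 - gt) ** beta * pred ** alpha *
                torch.log(1 - pred) * neg).sum()
    return -(pos_loss + neg_loss) / n

def total_loss(hm_pred, hm_gt, off_pred, off_gt, size_pred, size_gt,
               lam_off=1.0, lam_size=0.1):
    """L = L_k + lam_off * L_off + lam_size * L_size."""
    l_k = heatmap_loss(hm_pred, hm_gt)
    l_off = torch.abs(off_pred - off_gt).mean()     # L1 bias loss
    l_size = torch.abs(size_pred - size_gt).mean()  # L1 width/height loss
    return l_k + lam_off * l_off + lam_size * l_size
```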

Experiment Platform
The dataset used in this chapter was TT100K, and the evaluation metrics were precision, recall, the PR curve, and frames per second (FPS). The experiments were conducted on a machine with an Intel(R) Xeon(R) E5-2680 v3 processor running at 2.5 GHz and an NVIDIA GeForce GTX 1080 Ti graphics card, using Python 3.6 as the programming language and PyCharm as the IDE. The algorithms in this paper were implemented on the PyTorch 1.1.0 deep learning platform with CUDA 9.0 and cuDNN 7.1.

Ablation Study
The models for conducting the ablation experiments were classified as follows. Model 1: The network contained C-MobileNetV3 with a detection network. Model 2: The network contained C-MobileNetV3, a spatial domain attention module, and a detection module.
Model 3: The network contained C-MobileNetV3, a channel domain attention module, and a detection module.
Model 4: The network contained C-MobileNetV3, a hybrid attention module, and a detection module.
Model 5: The network contained C-MobileNetV3, a feature-interleaving module, and a detection module.
Model 6: The network contained C-MobileNetV3, a feature-interleaving module, a spatial domain attention module, and a detection module.
Model 7: The network contained C-MobileNetV3, a feature-interleaving module, a channel domain attention module, and a detection module.
Model 8: The network contained C-MobileNetV3, a feature-interleaving module, a hybrid attention module, and a detection module.
The experimental results are shown in Table 1. An analysis of the horizontal information in the table shows that the recognition effects for medium-sized and large-sized traffic signs were better than those for small-sized traffic signs. This was due to the clear edges and obvious features of medium-sized and large-sized traffic signs, while small-sized traffic signs had fuzzier features, which led to poorer recognition results. A longitudinal comparison of the information in the table shows that, for small-sized traffic signs, supplementation with the attention module and the multiscale feature fusion module could effectively improve the recognition effect.
Specifically, comparing Model 1, Model 2, Model 3, and Model 4 shows that the channel domain attention module and the spatial domain attention module each improved the detection effect to a certain extent, while the hybrid attention module combining the two had a more significant improvement effect on the traffic sign detection task. Comparing Model 1 with Model 5 shows that the introduction of the feature-interleaving module improved the recognition effect from an overall perspective, especially for small sizes, where the improvement reached 1.4%. Model 6, Model 7, and Model 8 added attention modules on top of Model 5. Comparing Model 5, Model 6, and Model 7 shows that adding the channel domain attention module or the spatial domain attention module to the network improved the detection effect for traffic signs within a certain range. Comparing Model 6, Model 7, and Model 8 shows that using both the spatial domain attention module and the channel domain attention module together further improved the detection effect. Figure 8 shows the detection plots of different models for traffic signs.

Structural Experiments
To verify the effectiveness of each module in the proposed network, we designed the following experiments to demonstrate the effectiveness of the proposed method.
The models for conducting structural experiments were classified as follows: Model 1: MobileNetV1 was chosen as the base network for lightweight feature extraction, and other parts of the network remained unchanged.
Model 2: ShuffleNet was selected as the base network for lightweight feature extraction, and other parts of the network remained unchanged.
Model 3: MobileNetV2 was selected as the base network for lightweight feature extraction, and other parts of the network remained unchanged.
Model 4: BiFPN was selected for the multiscale fusion module, and other parts of the network remained unchanged.
Model 5: NAS-FPN [42] was selected for the multiscale fusion module, and other parts of the network remained unchanged.
Model 6: PANet [43] was selected for the multiscale fusion module, and other parts of the network remained unchanged.
Model 7: CBAM [44] was selected as the attention module, and other parts of the network remained unchanged.
Model 8: DANet [45] was selected as the attention module, and other parts of the network remained unchanged.
Model 9: The network proposed in this paper was used. Experiments were conducted on the above models, and the results are shown in Table 2. An analysis of the horizontal information in the table shows that the recognition effects for medium-sized and large-sized traffic signs were better than that for small-sized traffic signs. A longitudinal comparison of Model 1, Model 2, Model 3, and Model 9 shows that the accuracy of the lightweight network using the C-MobileNetV3 proposed in this paper was higher. Compared with MobileNetV1, ShuffleNet, and MobileNetV2, the accuracy for small-scale traffic signs was improved by 1.2-3.2%, that for medium-scale traffic signs by 1-2.6%, and that for large-scale traffic signs by 1-2.8%.
A longitudinal comparison of Model 4, Model 5, Model 6, and Model 9 shows that the multiscale fusion module using the feature-interleaving module proposed in this paper significantly improved the recognition effect. Compared with BiFPN, NAS-FPN, and PANet, the accuracy for small-scale traffic signs was improved by 1-1.8%, that for medium-scale traffic signs by 0.7-1.5%, and that for large-scale traffic signs by 0.4-1.4%.
A longitudinal comparison of Model 7, Model 8, and Model 9 shows that, compared to CBAM and DANet, the accuracy rate of small-scale traffic signs improved by 0.7-1.5%, that of medium-scale traffic signs improved by 0.2-0.3%, and that of large-scale traffic signs improved by 0.3-0.5%.

Comparative Experiment of Similar Algorithm
To verify the proposed algorithm, the Traffic Sign Detection and Recognition Network Combining Multiscale Features and a Hybrid Attention Mechanism (MFHA-TSDR), comparative experiments were conducted with similar algorithms, including Faster R-CNN, CornerNet [46], and CenterNet, on the TT100K dataset in the same hardware environment; the results are shown in Table 3 and Figure 9. An analysis of the information in the table shows that the comprehensive performance of the proposed network for traffic sign detection was excellent, and the precision and recall rates at each scale were clearly improved compared with the other networks. A comparison of the FPS values shows that the proposed network also had a great advantage in speed due to the use of a lightweight network. Overall, the proposed network could meet the requirements for both accuracy and speed of traffic sign detection and reduce the false detection rate caused by scale differences. An analysis of the recognition effects of each algorithm for different scales of traffic signs shows that our proposed algorithm had the highest recognition accuracy, CornerNet and CenterNet came second, and Faster R-CNN was less effective. Especially for the detection of small-scale traffic signs, the detection effect of Faster R-CNN was significantly lower than those of the other two algorithms. In contrast, the proposed network achieved good detection results for all three scales of traffic signs, with better robustness.
The above was a quantitative analysis of the algorithms; next, some actual detection result plots were selected for qualitative analysis, as shown in Figure 10. Column a shows the original images in the TT100K dataset, column b shows the detection results of Faster R-CNN, column c shows the detection results of CornerNet, column d shows the detection results of CenterNet, and column e shows the detection results of MFHA-TSDR (the proposed algorithm in this paper). Comparing the detection results in the first row, we can see that all four methods obtained good detection results for larger-scale traffic signs, while for small-scale targets, Faster R-CNN missed detections. The traffic signs in the second row of images are of large scale, and the results show that all four methods detected them correctly. In the third row, the traffic signs are dense: Faster R-CNN had a large number of missed detections, CornerNet also had some missed detections, and CenterNet had no missed detections but wrongly detected the "minimum speed limit of 80 km/h" sign, whereas the proposed network detected all traffic signs completely and correctly. The traffic signs in the fourth row of images are seriously deformed, and the proposed network could still detect the seriously deformed "drive on the right" sign. In the fifth row, the traffic signs are shaded by trees: Faster R-CNN failed to detect the "speed limit" and "no parking" traffic signs, and CornerNet and CenterNet failed to detect the "no long-time or temporary parking" traffic signs that were obscured by leaves.
The algorithm proposed in this paper detected "no long-time or temporary parking" traffic signs that were heavily obscured by leaves and shadows and achieved the highest detection rate and accuracy.
In summary, all four algorithms could detect traffic signs well when the traffic sign features in the image were obvious. Faster R-CNN, CornerNet, and CenterNet had missed or false detection when the traffic signs in the images were small-scale, densely laid out, deformed, and obscured, while the proposed network in this paper demonstrated better robustness.

Different Types of Traffic Sign Detection
The traffic sign detection and recognition network proposed in this paper could complete the detection and recognition of 221 types of traffic signs, which covers most of the traffic signs that may be encountered in daily traffic scenarios. In this paper, we selected some traffic signs that appeared more frequently in the TT100K dataset for data analysis; the proposed network achieved a recognition rate of more than 80% for all of these categories. The results are shown in Table 4. According to the information in the table, the detection accuracy of the proposed network differed across categories of traffic signs, and most categories reached more than 85%. However, the accuracy for the height limit 5 (ph5) category was only 69.5%, which was not very good. After analysis, we concluded that the reason was that the signs in this height limit category are similar to many other signs, including height limit signs of the same type, such as height limit 5.3 (ph5.3), which have similar appearances and yield similar extracted features, so the recognition accuracy was not high. In contrast, the "120 km/h" speed limit (pl120) signs had obvious features and were easy to distinguish, so their detection accuracy was the highest, reaching 97.5%. The detection accuracy of the "no entry" (pne) sign was the second highest, reaching 96.8%. Overall, the algorithm proposed in this paper achieved good detection results for all kinds of traffic signs.

Discussion
The traffic sign detection method based on a lightweight multiscale feature fusion network proposed in this paper had clear advantages over similar algorithms and achieved good results in real-scenario tests, but there are still some limitations when applying it to actual assisted driving systems. An automated driving system is large and complex, and further research can be carried out on the following aspects in the future.
(1) The meanings of traffic signs are expressed through shapes, colors, graphics, and words, and traffic signs with the same meanings in different countries or regions may have different shapes, colors, graphics, and words. Therefore, it is a challenge for a network to recognize not only the data categories existing in the training set but also data not previously seen. In the next step of this study, the transfer capability of the network should be investigated so that the network can be used more widely. (2) When testing the network, experiments were conducted only under everyday conditions, such as normal conditions, the presence of occlusion, and uneven lighting. However, during actual driving, traffic sign detection results in bad weather are more important for driving safety. Therefore, in future work, more challenging environments, such as the presence of fog, haze, rain, or snow, should be selected for experimental analysis. (3) In the process of driving, a driver pays different amounts of attention to the traffic signs he sees, attending to those that require attention and ignoring unneeded information.
For example, if a driver is at an intersection and his route is straight, traffic signs for a right-turn road may also exist in the driver's field of vision; the driver automatically ignores the information that is not meaningful for behavioral decision making, which reduces the burden of information processing in the human brain. Therefore, the next step of this research should try to construct a network that gives different amounts of attention to the traffic signs in an image to further reduce information redundancy and make the network more suitable for complex traffic environments.

Conclusions
A traffic sign detection and recognition network based on lightweight multiscale feature fusion and an attention mechanism was proposed to address the problems of poor real-time performance, the single scale of extracted information, and the unreasonable allocation of computing resources in currently proposed convolutional neural networks for traffic sign recognition.
(1) To address the problems that traffic sign detection requires high real-time performance and that existing convolutional neural networks have many redundancies, a lightweight feature extraction network was designed, and a key-point detection method was adopted instead of the original anchor-frame traversal detection method. In addition, a feature-interleaving module was designed to realize the multiscale extraction of feature information, addressing the problem that traffic sign sizes in a traffic scene are variable and the semantic information obtained by existing networks is single-scale. (2) To improve the detection effect when a traffic sign occupies a small part of the image, is densely arranged, or is surrounded by too much background information, a hybrid attention module was designed and constructed, divided into a spatial attention branch and a channel attention branch, giving different weights to different spatial locations and different channels, respectively. (3) Experiments showed that the algorithm in this paper achieved 85% recognition accuracy for targets of different scales and most categories. Compared with Faster R-CNN, CornerNet, and CenterNet, the recall and precision rates of the algorithm in this paper were significantly higher, and better real-time performance was achieved. Therefore, the proposed network is robust, has high recognition accuracy, and achieves good real-time performance.

Author Contributions: Validation; X.F. and Q.L.: Writing-review and editing. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.