DSRA-DETR: An Improved DETR for Multiscale Traffic Sign Detection

Abstract: Traffic sign detection plays an important role in improving the capabilities of automated driving systems by addressing road safety challenges in sustainable urban living. In this paper, we present DSRA-DETR, a novel approach focused on improving multiscale detection performance. Our approach integrates a dilated spatial pyramid pooling (DSPP) module and a multiscale feature residual aggregation module (FRAM) to aggregate features at various scales. These modules reduce feature noise and minimize the loss of low-level features during feature map extraction. Additionally, they enhance the model's capability to detect objects at different scales, thereby improving the accuracy and robustness of traffic sign detection. We evaluate our method on two widely used datasets, GTSDB and CCTSDB, and achieve average precision (AP) values of 76.13% and 78.24%, respectively. Compared with other well-known algorithms, our method shows a significant improvement in detection accuracy, demonstrating its superiority and generality. The proposed method shows great potential for improving traffic sign detection in autonomous driving systems and will help in the development of safe and efficient autonomous driving technologies.


Introduction
In sustainable urban living, road safety faces challenges such as distracted drivers and novice drivers' unfamiliarity with traffic signs. To address these challenges, traffic sign detection technology in autonomous driving systems can assist drivers in identifying traffic signs accurately, contributing to road safety. Achieving accurate recognition of small-sized traffic signs is crucial for autonomous vehicles to assess road conditions and ensure safe operation. Researchers are actively working to improve the detection accuracy of small-scale traffic sign images, aiming to enhance the performance and reliability of autonomous driving systems.
Accurate recognition of small-scale traffic signs is essential for advancing autonomous driving technology, because it gives autonomous vehicles sufficient time to respond to changing road conditions. Detecting and interpreting small-scale traffic signs accurately contributes to the safety, efficiency, and reliability of autonomous vehicles, making it a key research area within AI applications for sustainable urban living [1]. Deep learning methods, including the R-CNN [2] series, the YOLO [3] series, the SSD [4] series, and the vision transformer architecture [5], have been widely used for traffic sign detection. The introduction of DETR [6] has paved the way for transformer-based target detectors. However, existing methods still have limitations in overall detection accuracy and in detecting small traffic signs at a distance.
In this paper, we propose a novel traffic sign detection method called DSRA-DETR, which improves upon Anchor-DETR [7] by integrating designed modules. DSRA-DETR utilizes multiscale feature information extracted from the backbone and employs dilated

The YOLO and SSD Series
The YOLO series is considered a standard one-stage algorithm. YOLOv1 [3], the first paper in this series, introduced the core idea of using the entire image as the network input and directly regressing the location and category of bounding boxes in the output layer. However, it falls short of Faster R-CNN in localization accuracy and struggles with detecting small objects. YOLOv2 [18] addresses these limitations in three respects: improved prediction accuracy, faster processing speed, and the ability to recognize more object categories. YOLOv3 [10] further refines the architecture and training techniques introduced in v2 to improve accuracy without compromising inference time. In addition, refs. [19,20] have made notable contributions to further enhancing the YOLO algorithm. Liu et al. introduced the SSD [4] algorithm, which is based on multiscale detection. It achieves a processing speed comparable to that of YOLO and a detection accuracy comparable to that of Faster R-CNN. However, its performance in detecting small targets is still not entirely satisfactory. Refs. [21][22][23] have improved the SSD algorithm from different angles. It is worth mentioning that RetinaNet [24] proposes focal loss to address the severe imbalance between positive and negative samples in one-stage target detection. Moreover, scholars have introduced the CornerNet [25] algorithm, which uses diagonal keypoints to tackle the bounding box (bbox) problem. Building upon this, CenterNet [26] employs central keypoints to further address the bbox problem. Some scholars [27][28][29] use this series of methods for traffic sign detection.

The Image Registration Series
In traffic sign detection, feature descriptors from image registration can be used to extract features in traffic sign images and match them with corresponding features in other images. Among these, it is worth mentioning the FNRG [30] method proposed by Xiao et al., which starts with a novel consistency seed search strategy. This strategy exploits the first neighbor relationship of feature points between two images to achieve consistency matching without any parameters or thresholds. Additionally, the LGF [31] algorithm consists of two components: an effective two-view approximate deterministic sampling algorithm and a simple and effective model selection framework. It obtains a coarse minimum subset of samples using the local neighbor-keeping relationships corresponding to the inputs and then refines these subsets using a global residual optimization strategy. In this way, the same traffic signs appearing in different images can be detected. Some scholars [32,33] use this series of methods for traffic sign detection.

The DETR Series
More recently, the detection transformer (DETR) [34] became the first architecture to apply the transformer [35] to target detection, marking a significant advancement in the vision field. While it demonstrates impressive performance on the COCO [36] dataset, its convergence is relatively slow due to the computational demands of the transformer architecture [5]. To tackle this issue, Deformable-DETR [37] proposes a deformable attention mechanism and Conditional DETR [6] introduces a conditional cross-attention mechanism, both of which make important contributions to reducing the convergence time of DETR. Building on this work, many scholars, including [38][39][40], have enhanced DETR from various perspectives. Anchor-DETR [7], on the other hand, is a deformable-based object detection framework that incorporates anchor points and row-column decoupled attention into DETR. Anchor-DETR is known for its fast convergence and competitive performance compared to other detectors.

Method
DSRA-DETR is a traffic sign detection architecture that builds upon Anchor-DETR. To overcome the challenges posed by small-scale traffic sign detection, DSRA-DETR introduces two components. The first is the dilated spatial pyramid pooling (DSPP) module. By leveraging dilated convolutions, this module filters out extraneous and irrelevant information from the low-level features, so that only the most relevant and discriminative features are retained for further processing. The second is the feature residual aggregation module (FRAM), which aggregates and enhances the representation of low-level feature information. This module combines and refines the extracted features, enabling the model to capture more detailed and context-aware representations of traffic signs. Integrating it into the architecture improves the accuracy and robustness of traffic sign detection, particularly in scenarios where small-scale signs are prevalent.
Figure 1 provides a visual overview of the DSRA-DETR architecture, showing its components and their interactions. The backbone network extracts initial feature representations from the input data. The DSPP module operates on these features, capturing multiscale information and selectively incorporating contextual details. The FRAM then refines the features, enhancing their discriminative power. After being processed by this module, the features are fed into the encoder and decoder layers of the transformer.

Backbone
The backbone plays a crucial role in the target detection task by extracting features from the input image. These features are then used in the later parts of the model. As a result, a strong backbone is essential for our traffic sign detection task. We use ResNet50, pre-trained on ImageNet, as the backbone network for all models. Figure 2 illustrates the structure of the backbone we are using.
ResNet50 consists of four layers: layer1, layer2, layer3, and layer4. Each layer follows the same internal structure but has different downsampling rates: 4×, 8×, 16×, and 32×, respectively. The lower layers contain detailed location feature information, making them suitable for detecting small targets. On the other hand, the higher layers capture abstract semantic features, making them more suitable for detecting larger targets. In our method, we leverage the multilayer features extracted from all four layers for further processing and analysis.
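To make the downsampling rates concrete, the following minimal sketch computes the spatial size of each backbone feature map for a square input. The 800 × 800 input size is the resize value reported in the experimental details; the channel widths (256, 512, 1024, 2048) are the standard ResNet-50 stage outputs rather than values stated in this paper, and the function name is purely illustrative.

```python
# Strides of ResNet50 layer1..layer4 as stated in the text.
STRIDES = {"layer1": 4, "layer2": 8, "layer3": 16, "layer4": 32}

# Standard ResNet-50 stage output widths (assumption: not stated in the paper).
CHANNELS = {"layer1": 256, "layer2": 512, "layer3": 1024, "layer4": 2048}


def feature_map_shapes(height, width):
    """Return {layer: (channels, H / stride, W / stride)} for an input image."""
    shapes = {}
    for name, stride in STRIDES.items():
        # Integer division assumes the input side is a multiple of the stride.
        shapes[name] = (CHANNELS[name], height // stride, width // stride)
    return shapes


shapes = feature_map_shapes(800, 800)
for name, shape in shapes.items():
    print(name, shape)
```

For an 800 × 800 image the four maps have side lengths 200, 100, 50, and 25, which illustrates why the earlier layers retain the fine spatial detail that small signs depend on.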

Dilated Spatial Pyramid Pooling Model
The DSPP module is an essential element of our proposed model and draws inspiration from the design principles of the DeepLabv2 [41] architecture. However, we made several improvements to make it more suitable for our specific needs. The module comprises four convolutional layers: three 3 × 3 dilated convolutional layers with dilation rates of [1,3,6] and one 1 × 1 convolutional layer. Dilated convolutions enlarge the receptive field without adding parameters relative to a regular 3 × 3 convolution, which keeps the computational cost of the module low while maintaining its effectiveness.
To apply the DSPP module to the input features, we first convolve the input with the three dilation rates in parallel. The dilation rate is the spacing between adjacent kernel elements; a rate of 1 corresponds to a regular convolution. Then, we concatenate the resulting feature maps before passing them through the final 1 × 1 convolutional layer, which reduces the feature maps to the desired number of output channels. The DSPP module allows our model to capture features at multiple scales, enabling more accurate detection of traffic signs of varying sizes. The module's structure is depicted in Figure 3.
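The shape bookkeeping of the parallel branches can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: padding equal to the dilation rate (so every branch keeps the input resolution), branches that preserve the channel count, and example sizes (512 input channels, 256 output channels, a 100 × 100 map) that are not taken from the paper. The output-size formula is the usual one for a strided, dilated convolution.

```python
def conv_out_size(size, kernel=3, dilation=1, padding=0, stride=1):
    """Output length along one axis of a (possibly dilated) convolution."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def effective_kernel(kernel, dilation):
    """Extent covered by a dilated kernel: gaps of (dilation - 1) between taps."""
    return kernel + (kernel - 1) * (dilation - 1)


def dspp_shapes(batch, channels, height, width, out_channels, rates=(1, 3, 6)):
    """Shapes of the three branches, their concatenation, and the 1x1 output."""
    branches = []
    for d in rates:
        # Assumption: padding == dilation, which keeps H and W unchanged.
        h = conv_out_size(height, kernel=3, dilation=d, padding=d)
        w = conv_out_size(width, kernel=3, dilation=d, padding=d)
        branches.append((batch, channels, h, w))
    # Concatenation along the channel axis requires identical spatial sizes.
    assert len({(b[2], b[3]) for b in branches}) == 1
    concat = (batch, channels * len(rates), branches[0][2], branches[0][3])
    output = (batch, out_channels, concat[2], concat[3])
    return branches, concat, output


kernels = [effective_kernel(3, d) for d in (1, 3, 6)]
branches, concat, output = dspp_shapes(2, 512, 100, 100, out_channels=256)
print(kernels, branches[0], concat, output)
```

With rates 1, 3, and 6, the three 3 × 3 kernels cover 3, 7, and 13 positions per axis, so the concatenated map mixes three context sizes at the same resolution before the 1 × 1 convolution reduces the channels.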

The input feature map, $F_{in}$, has a shape of $F_{in} \in \mathbb{R}^{B \times C \times H \times W}$, where B represents batches, C represents channels, and H and W represent the height and width, respectively. The mathematical expression for $F_{in}$ after passing through a dilated convolution block with a dilation rate of $i$ is as follows:

$$F_i = \mathrm{ReLU}\left(\mathrm{BN}\left(\mathrm{Conv}^{i}_{3\times3}(F_{in})\right)\right), \quad i \in \{1, 3, 6\}$$

In this expression, ReLU refers to the ReLU activation function, BN represents batch normalization, and $\mathrm{Conv}^{i}_{3\times3}$ denotes a 3 × 3 dilated convolution operation with dilation rate $i$. The overall expression for this module can be expressed as follows:

$$F_{out} = \mathrm{ReLU}\left(\mathrm{BN}\left(\mathrm{Conv}_{1\times1}\left(\mathrm{concat}(F_1, F_3, F_6)\right)\right)\right)$$

Here, $F_{out}$ represents the resulting output feature map. It is obtained by concatenating feature maps $F_1$, $F_3$, and $F_6$, and then applying a 1 × 1 convolution, followed by batch normalization and ReLU activation.
Through the utilization of the DSPP module, we are able to apply targeted noise reduction to the feature information extracted from the backbone. This noise reduction process selectively preserves the relevant traffic sign features while removing unwanted "feature noise" from the feature map. Consequently, the original feature map from the backbone can more effectively carry out the task of traffic sign detection following the integration of the DSPP module. In our ablation experiments, we demonstrate the impact of this process by visualizing the feature maps.

Feature Residual Aggregation Module
The feature residual aggregation module (FRAM) is a crucial component in our proposed model architecture. It addresses the challenge of feature resolution discrepancies encountered in object detection tasks. The FRAM preserves and leverages lower-level features, resulting in significant improvements in the model's detection performance, especially for small-scale traffic signs. Its primary objective is to ensure that the model retains essential information from lower-level features while extracting them hierarchically. This is achieved through a feature residual aggregation process that considers features from different scale layers that have passed through the DSPP module.
Inside the FRAM, the process starts with an absolute value subtraction of the feature matrices at adjacent levels. This step calculates the differences between the feature layers from different levels, allowing the module to discern disparities in content and characteristics. The differences obtained from the layer-wise calculations are then combined with the original high-level features through convolution, which integrates the dissimilarities between the layers with the existing high-level features. By fusing the residuals of the low-level features, richer information on small target features is aggregated in the feature maps used in the subsequent detection part. This is crucial for the improvement in small-target detection performance. The results from the different layers are then concatenated, which consolidates the information obtained from each layer. Finally, the concatenated features pass through a 1 × 1 convolution that reduces the number of output channels to the desired level.
By utilizing the FRAM, our model demonstrates notable improvements in detection capabilities, particularly in recognizing small-scale traffic signs. Handling feature resolution discrepancies and preserving critical information from lower-level features enhances the model's accuracy and robustness, enabling it to detect traffic signs of varying sizes with greater precision and reliability. Please refer to Figure 4 for the structure diagram.
In this module, C2, C3, C4, and C5 represent the feature maps of different layers after passing through the DSPP module. Following the backbone downsampling rates of 4×, 8×, 16×, and 32×, and writing H and W for the input image height and width and C for the channel count after the DSPP module, their shapes are as follows:

$$C_k \in \mathbb{R}^{B \times C \times \frac{H}{2^k} \times \frac{W}{2^k}}, \quad k \in \{2, 3, 4, 5\}$$

The calculation formula for the residuals module can be expressed as

$$F_r = \mathrm{Abs}\left(F_h - \mathrm{Ds}(F_l)\right)$$

Here, $F_r$ represents the output of a residual module, $F_h$ denotes the high-level feature input received by the module, $F_l$ denotes the low-level feature input received by the module, Ds represents the downsampling operation, and Abs represents the absolute value operation. Therefore, the overall calculation formula for this module can be expressed as

$$F_{out} = \mathrm{ReLU}\left(\mathrm{BN}\left(\mathrm{Conv}_{1\times1}\left(\mathrm{concat}(C5, R_{5,4,3,2})\right)\right)\right)$$

In the formula, $R_{5,4,3,2}$ represents the residuals obtained by aggregating the respective layer features.
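The absolute-difference step between a high-level map and a downsampled low-level map can be illustrated on toy data. The sketch below works on single-channel maps stored as nested lists and uses 2 × 2 average pooling for the downsampling operation Ds; the paper does not specify how Ds is implemented, so the pooling choice, the function names, and the example values are all assumptions.

```python
def downsample2x(fmap):
    """2x2 average pooling with stride 2 (an assumed implementation of Ds)."""
    h, w = len(fmap), len(fmap[0])
    return [
        [
            (fmap[r][c] + fmap[r][c + 1] + fmap[r + 1][c] + fmap[r + 1][c + 1]) / 4.0
            for c in range(0, w - 1, 2)
        ]
        for r in range(0, h - 1, 2)
    ]


def abs_residual(high, low):
    """Element-wise |F_h - Ds(F_l)| for single-channel maps."""
    low_ds = downsample2x(low)
    if len(low_ds) != len(high) or len(low_ds[0]) != len(high[0]):
        raise ValueError("downsampled low-level map must match the high-level map")
    return [
        [abs(h_val - l_val) for h_val, l_val in zip(h_row, l_row)]
        for h_row, l_row in zip(high, low_ds)
    ]


# Toy 4x4 low-level map and 2x2 high-level map (illustrative values only).
low = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 4.0],
    [0.0, 0.0, 4.0, 4.0],
]
high = [[0.5, 0.0], [0.0, 1.0]]

residual = abs_residual(high, low)
print(residual)  # [[0.5, 0.0], [0.0, 3.0]]
```

The largest residual appears where the low-level map carries a strong response that the high-level map has mostly lost, which is the kind of information the module is meant to carry forward for small targets.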

Losses
The loss function has a significant impact on target detection. Its purpose is to assess the disparity between the model's predictions and the ground-truth labels. By quantifying this difference, the loss function provides the feedback signal that guides the model's optimization during training. To accurately detect objects in an image, we combine classification and bounding box regression tasks. The classification task predicts the labels of the objects present in the image, while the bounding box regression task locates the objects in the image.
For the classification task, we use the cross-entropy loss function, which is given by

$$L_{cls}(y, p) = -\sum_{c} y_c \log p_c$$

Here, y denotes the true label of the sample and p denotes the predicted probability of the model, with the sum running over the classes c. For the bounding box loss, we use a linear combination of L1 loss and Generalized Intersection over Union (GIoU) loss:

$$L_{box}(X, Y) = \lambda_{iou} L_{iou}(X, Y) + \lambda_{L1} L_{L1}(X, Y)$$

where $L_{iou}(X, Y)$ denotes the GIoU loss function, $L_{L1}(X, Y)$ represents the L1 distance loss function, and $\lambda_{iou}$ and $\lambda_{L1}$ are hyperparameters. The L1 loss function is defined as

$$L_{L1}(X, Y) = \lVert X - Y \rVert_1$$

Furthermore, the GIoU loss function is given by

$$L_{iou}(X, Y) = 1 - \left( \frac{|X \cap Y|}{|X \cup Y|} - \frac{|C \setminus (X \cup Y)|}{|C|} \right)$$

Here, X and Y denote the true and predicted bounding boxes, respectively, and C represents the minimum bounding box containing X and Y. The symbols |.| and \ indicate the area and set difference, respectively. Finally, our overall loss function is expressed as

$$L = \frac{1}{N} \sum_{i=1}^{N} \left[ L_{cls}(y_i, p_i) + L_{box}(X_i, Y_i) \right]$$

where N is the number of samples in the batch. This loss function combines the classification and bounding box losses, encouraging the model to make accurate predictions for both tasks simultaneously.
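As a worked example of the bounding box term, the following sketch evaluates the GIoU loss and the combined box loss for axis-aligned boxes. It assumes boxes are given as (x1, y1, x2, y2) corner coordinates, which is a convenience for illustration; the weights passed to `box_loss` (2.0 and 5.0) are placeholder values, since the paper does not report its settings for λ_iou and λ_L1.

```python
def box_area(box):
    """Area of an (x1, y1, x2, y2) box; empty or inverted boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def giou_loss(x, y):
    """1 - GIoU for two axis-aligned boxes in corner format."""
    inter = box_area((max(x[0], y[0]), max(x[1], y[1]),
                      min(x[2], y[2]), min(x[3], y[3])))
    union = box_area(x) + box_area(y) - inter
    if union <= 0.0:
        raise ValueError("both boxes are degenerate")
    # C: the smallest box enclosing both X and Y.
    enclosing = box_area((min(x[0], y[0]), min(x[1], y[1]),
                          max(x[2], y[2]), max(x[3], y[3])))
    giou = inter / union - (enclosing - union) / enclosing
    return 1.0 - giou


def l1_loss(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def box_loss(x, y, lam_iou, lam_l1):
    """Linear combination of the GIoU and L1 terms."""
    return lam_iou * giou_loss(x, y) + lam_l1 * l1_loss(x, y)


truth = (0.0, 0.0, 2.0, 2.0)
pred = (1.0, 1.0, 3.0, 3.0)
print(giou_loss(truth, pred))           # 68/63, about 1.079
print(box_loss(truth, pred, 2.0, 5.0))  # placeholder weights, not the paper's
```

In this example the intersection is 1, the union is 7, and the enclosing box has area 9, so GIoU = 1/7 − 2/9 = −5/63 and the loss is 68/63. Unlike plain IoU, the enclosing-box term keeps the loss informative even when the boxes do not overlap.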

Datasets
This study uses two well-established datasets, the German Traffic Sign Detection Dataset (GTSDB) and the Chinese Traffic Sign Dataset (CCTSDB), to evaluate traffic sign detection algorithms. The GTSDB consists of 827 images, each with a resolution of 800 × 1360 pixels, covering four types of traffic signs: "prohibitory", "mandatory", "danger", and "other". The size of the traffic signs in this dataset varies from 16 to 126 pixels. The CCTSDB comprises 17,856 images, with resolutions of 760 × 1280 and 768 × 1024 pixels, and includes three types of traffic signs: "prohibitory", "warning", and "mandatory". These datasets provide a comprehensive collection of traffic sign images captured in Germany and China, with diverse weather conditions and road scenarios. Figure 5 shows selected examples from these datasets.

Evaluation Criteria
In this paper, the performance of the algorithm model is evaluated using three metrics from the COCO evaluation protocol: AP, AP50, and AP75. AP50 is the average precision obtained at an intersection over union (IoU) threshold of 0.50, AP75 is the average precision obtained at an IoU threshold of 0.75, and AP averages over IoU thresholds from 0.50 to 0.95 in steps of 0.05. The calculation of AP is based on the precision-recall curve. First, for each category, the predictions are ranked according to their confidence, and each prediction is marked as a true positive (TP) or a false positive (FP). Then, precision and recall are calculated from the cumulative TP and FP counts, and a precision-recall curve is plotted. Finally, the area under the curve (AUC) is calculated, which is the AP value of the category. The final AP value of the model is obtained by averaging the AP values of all categories.
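The ranking and area-under-curve steps can be sketched as follows for a single category at a single IoU threshold. This is a simplified all-point version for illustration: the official COCO tooling samples the curve at 101 recall points, and the matching of predictions to ground-truth boxes (which produces the TP/FP flags) is assumed to have been done already.

```python
def average_precision(detections, num_gt):
    """Area under the precision-recall curve.

    detections: list of (confidence, is_true_positive) pairs.
    num_gt: number of ground-truth objects for this category.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)

    tp = fp = 0
    precisions, recalls = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)

    # Make precision monotonically non-increasing (the usual PR envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas over each increase in recall.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Three detections, two ground-truth signs: TP, FP, TP in confidence order.
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
print(round(ap, 4))  # 0.8333
```

Here the first detection reaches recall 0.5 at precision 1.0 and the third reaches recall 1.0 at precision 2/3, giving AP = 0.5 × 1.0 + 0.5 × 2/3 = 5/6.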

Experimental Details
The experiments detailed in this paper were conducted using the PyTorch deep learning framework on a 64-bit Linux system, utilizing an NVIDIA GeForce RTX3090 (made by NVIDIA in Santa Clara, CA, USA) graphics card with 24 GB of video memory. During the training phase, the datasets were divided into training, test, and validation sets in an 8:1:1 ratio. The images were resized to 800 × 800 pixels.
The learning rate is a critical parameter that significantly influences the convergence of the model. If set too large, it may lead to loss oscillation, while setting it too small may cause the model to converge slowly or settle in a poor local optimum. After careful consideration, we set both the learning rate and the weight decay to 0.0001 and trained the model for a total of 100 epochs. Comparing the SGD optimizer to the AdamW optimizer, we found that the latter yielded better model convergence. Therefore, we opted to use the AdamW optimizer. The batch size was set to 4, and we used 300 query positions.
Data augmentation techniques play a pivotal role in enhancing a model's robustness and preventing overfitting. Hence, we applied scaling, rotation, and random cropping during training.
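For reference, the reported settings can be collected into a small configuration object, together with the arithmetic implied by the 8:1:1 split. The class and function names below are illustrative only, and the rounding rule used for the split (floor for the training and test parts, remainder to validation) is an assumption, since the paper does not state how fractional counts were handled.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as reported in the experimental details."""
    image_size: int = 800
    learning_rate: float = 1e-4
    weight_decay: float = 1e-4
    epochs: int = 100
    batch_size: int = 4
    num_queries: int = 300
    optimizer: str = "AdamW"


def split_counts(num_images, ratios=(8, 1, 1)):
    """Train/test/validation counts; the rounding rule is an assumption."""
    total = sum(ratios)
    train = num_images * ratios[0] // total
    test = num_images * ratios[1] // total
    val = num_images - train - test
    return train, test, val


config = TrainConfig()
train, test, val = split_counts(827)  # GTSDB image count from the text
steps_per_epoch = math.ceil(train / config.batch_size)
print(train, test, val, steps_per_epoch, steps_per_epoch * config.epochs)
```

Under these assumptions, GTSDB yields 661 training images, or 166 iterations per epoch at batch size 4.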

Ablation Study
To assess the efficacy of each component in DSRA-DETR, we conducted ablation experiments on the GTSDB and CCTSDB datasets, evaluating the overall structure of our design as well as the DSPP and FRAM modules using AP, AP50, and AP75 as performance metrics. We used Anchor-DETR as a baseline and incrementally extended it with our DSPP and FRAM modules, evaluating its performance on both datasets at each step. The results are shown in Tables 1 and 2. The baseline model achieved an AP of 73.61% on the GTSDB dataset and 76.92% on the CCTSDB dataset. Incorporating the multiscale features increased the AP by 0.51 and 0.29 percentage points on the two datasets, respectively. Integrating the DSPP module and then the FRAM yielded further gains of 0.86 and 1.15 percentage points on the GTSDB dataset, and 0.33 and 0.70 percentage points on the CCTSDB dataset, respectively. These results suggest that the proposed DSRA-DETR model can effectively improve the detection performance of traffic signs for both datasets.
Table 2 lists the average precision for small targets as APs, medium targets as APm, and large targets as APl. The table shows that training with multilayer features improved the detection performance for all three target sizes on both datasets to varying degrees, including for small targets (APs). Notably, when the DSPP and FRAM modules were used for differential aggregation of the multilayer features, the APs of the model improved from 55.82% to 57.12% on the GTSDB dataset and from 60.23% to 63.04% on the CCTSDB dataset. This implies that using features extracted from multiple layers in combination with the proposed modules allows the complex details of the target to be captured and the useless information to be filtered out from the low-level features, further improving the detection performance for small targets.
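The reported increments can be checked against the final AP values quoted in the abstract. The short sketch below treats the three gains per dataset as successive additions to the baseline, which is an interpretation of the text; decimal arithmetic is used so that the two-decimal percentages add up exactly.

```python
from decimal import Decimal

# Baseline (Anchor-DETR) AP values from the ablation study, in percent.
baseline = {"GTSDB": Decimal("73.61"), "CCTSDB": Decimal("76.92")}

# Gains in percentage points: multiscale features, then DSPP, then FRAM.
gains = {
    "GTSDB": [Decimal("0.51"), Decimal("0.86"), Decimal("1.15")],
    "CCTSDB": [Decimal("0.29"), Decimal("0.33"), Decimal("0.70")],
}

final = {name: baseline[name] + sum(gains[name]) for name in baseline}
print(final)
```

The totals come to 76.13% for GTSDB and 78.24% for CCTSDB, matching the headline figures, so the stated increments are consistent with a cumulative reading.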
In Figure 6, we compare the detection performance of our proposed DSRA-DETR model with the baseline model for the two datasets, CCTSDB and GTSDB. To demonstrate the effectiveness of our model in detecting small targets or multiple small targets, we specifically selected two examples from each dataset for comparative experiments. The results show that our DSRA-DETR model outperforms the baseline model in detecting small targets, which can be attributed to the integration of our proposed FRAM and DSPP modules. Furthermore, to provide further comparative illustration, we present a visual of the feature maps of these examples.  Table 2 lists the average accuracy for small targets as APs, medium targets as APm, and large targets as APl. By analyzing the table, we can observe that the model improved the detection performance for all three sizes of targets for both datasets to varying degrees when trained with multilayer features. Specifically, we can see that the detection performance of the model for small targets (AP) was improved to some extent when multilayer features were added. Notably, when DSPP and FRAMs were used for differential aggregation of multiple features, we observed that the AP metrics of the model improved from 55.82% to 57.12% and from 60.23% to 63.04% for the GTSDB and CCTSDB datasets, respectively. This implies that using features extracted from multiple layers and combining the proposed module allow the complex details of the target to be captured and the useless information to be filtered out from the low-level features. The detection performance of the model for small targets was further improved.
In Figure 6, we compare the detection performance of our proposed DSRA-DETR model with the baseline model for the two datasets, CCTSDB and GTSDB. To demonstrate the effectiveness of our model in detecting small targets or multiple small targets, we specifically selected two examples from each dataset for comparative experiments. The results show that our DSRA-DETR model outperforms the baseline model in detecting small targets, which can be attributed to the integration of our proposed FRAM and DSPP modules. Furthermore, to provide further comparative illustration, we present a visual of the feature maps of these examples.    Figure 7 presents visualizations of the feature maps for the two selected exemplary instances, illustrating the effects of the DSPP and FRAMs on enhancing the representation of traffic signs. The feature maps exhibited noticeable improvements with the application of the DSPP module. This module effectively eliminates extraneous information while emphasizing essential aspects such as the spatial location and edge characteristics of the traffic signs. As a result, the feature maps become more focused and discriminative.
Additionally, the FRAM plays a crucial role in augmenting the feature maps' capacity to represent small-scale targets. This enhancement is particularly significant as it enables the model to concentrate more effectively on extracting and leveraging relevant information from small-scale traffic signs during its operational phase. With the incorporation of the FRAM, the model exhibits an improved ability to discern subtle details and capture the distinctive features associated with smaller traffic signs. These visualizations provide compelling evidence of the efficacy of the proposed DSRA-DETR model. The DSPP and FRAMs effectively refine the feature maps, enhancing their representational power and facilitating accurate detection and localization of traffic signs. The combination of these modules contributes to the overall performance improvements observed in terms of average precision (AP) scores for both the GTSDB and CCTSDB datasets.

Comparison with Previous Methods
We compared DSRA-DETR with several popular algorithms for traffic sign detection and general target detection: YOLOv3, Deformable DETR, CornerNet, and Conditional DETR, all of which use the training and evaluation APIs provided by the COCO dataset. The results are presented in Table 3. Compared to Deformable DETR and Conditional DETR, our algorithm improved the AP on the GTSDB by 2.24% and 2.67%, respectively, and the AP on the CCTSDB by 2.25% and 1.11%, respectively. These improvements can be attributed to the two proposed modules, DSPP and FRAM. The DSPP module refines the original feature map by eliminating redundant features, producing a feature map better suited to the traffic sign detection task. The FRAM, in turn, aggregates rich location information from lower-level features into higher-level features, which improves their small-target detection performance.
Compared to YOLOv3 and CornerNet, our algorithm improved the AP on the GTSDB by 14.85% and 19.38%, respectively, and the AP on the CCTSDB by 16.32% and 20.55%, respectively. These substantial gains can be attributed to the attention mechanism of the transformer architecture. Unlike a traditional CNN, which processes local neighborhoods, attention computes the association between each pixel and every other pixel in the feature map. The DSPP and FRAM modules further enhance the algorithm's overall performance.
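The pixel-to-pixel association described above can be illustrated with a minimal scaled dot-product self-attention sketch. This is a simplified, single-head version without learned query, key, or value projections, written in plain Python; the function names and the toy feature map are assumptions for illustration and do not reproduce the attention layers actually used in DSRA-DETR.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Single-head self-attention without learned projections (simplification).

    features: list of N vectors, one per pixel, each of dimension d.
    Returns (weights, outputs), where weights[i][j] is how strongly
    pixel i attends to pixel j.
    """
    d = len(features[0])
    weights, outputs = [], []
    for query in features:
        # Similarity of this pixel to every pixel, scaled by sqrt(d).
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(d)
            for key in features
        ]
        row = softmax(scores)
        weights.append(row)
        # Output is the attention-weighted sum of all pixel features.
        outputs.append(
            [sum(row[j] * features[j][c] for j in range(len(features)))
             for c in range(d)]
        )
    return weights, outputs


# Toy 2x2 feature map flattened to 4 pixels with 2 channels each (made-up values).
pixels = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
weights, outputs = self_attention(pixels)
```

In this toy input, pixel 0 attends more strongly to the similar pixel 1 than to the dissimilar pixel 2, regardless of where they sit in the image. This global association is what distinguishes attention from the fixed local receptive field of a convolution.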
The precision-recall curves in Figure 8 show that our proposed method outperforms all other compared methods, as its curve encloses the largest area with the coordinate axes, that is, the largest area under the curve. This indicates that our method achieves the best results after training. Notably, all three transformer-based methods surpass the two CNN-based methods, which further confirms the effectiveness of the transformer architecture.
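The link between the precision-recall curve and AP can be sketched as follows: detections are sorted by confidence, precision and recall are accumulated rank by rank, and AP is the area under the resulting curve. The sketch below uses all-point interpolation for simplicity; the COCO evaluation API samples the curve at fixed recall points and averages over several IoU thresholds, so this is an illustration of the idea rather than the exact procedure used in the experiments. The function names and the sample detections are assumptions.

```python
def precision_recall(is_true_positive, num_ground_truth):
    """Cumulative precision and recall.

    is_true_positive: flags for detections sorted by descending confidence.
    num_ground_truth: number of ground-truth objects.
    """
    precisions, recalls = [], []
    true_positives = 0
    for rank, hit in enumerate(is_true_positive, start=1):
        if hit:
            true_positives += 1
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truth)
    return precisions, recalls


def average_precision(precisions, recalls):
    """Area under the precision envelope (all-point interpolation)."""
    # Make precision non-increasing as recall grows (right-to-left maximum).
    envelope = precisions[:]
    for i in range(len(envelope) - 2, -1, -1):
        envelope[i] = max(envelope[i], envelope[i + 1])
    area = 0.0
    previous_recall = 0.0
    for precision, recall in zip(envelope, recalls):
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


# Hypothetical detections: 5 predictions, 4 ground-truth signs.
hits = [True, True, False, True, False]
precisions, recalls = precision_recall(hits, num_ground_truth=4)
ap = average_precision(precisions, recalls)
print(round(ap, 4))  # 0.6875
```

A curve that stays closer to the top-right corner encloses a larger area and therefore yields a higher AP, which is the comparison the figure makes visually.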
The loss-epoch curves depicted in Figure 8 show that our proposed method converges faster than the other methods. This can be attributed to the techniques introduced in Deformable-DETR, Conditional-DETR, and Anchor-DETR, which accelerate convergence. However, transformer-based algorithms typically require more time to complete an epoch than CNN-based algorithms.

Conclusions and Discussion
In this paper, we introduce DSRA-DETR, a novel method for multiscale traffic sign detection. Our method incorporates an efficient feature fusion module to enhance Anchor-DETR. Unlike traditional CNN-based detectors, it leverages the transformer architecture, which has shown great potential in various computer vision tasks. We investigated different feature fusion methods and pyramidal feature map generation and found that integrating multilevel feature maps maximizes their effectiveness in traffic sign detection. Additionally, we integrated the DSPP module to enhance feature information and improve localization capability at each level. Moreover, we employed the FRAM for feature aggregation, enabling the model to capture valuable low-level feature information and further enhance performance.
Extensive experiments on the GTSDB and CCTSDB datasets demonstrate that DSRA-DETR outperforms several advanced target detection methods in terms of accuracy. However, the transformer-based model requires significant memory and computational power, making it challenging to deploy in real self-driving vehicle systems for sustainable urban life. Future research should therefore focus on lightweight design and real-time optimization, aiming to reduce model size and computational requirements without compromising accuracy. This would improve the algorithm and its applicability in sustainable urban living.
In conclusion, our proposed DSRA-DETR method offers a promising solution for multiscale traffic sign detection, showcasing its effectiveness and surpassing existing methods in accuracy. In future research, we aim to explore a lightweight and real-time traffic sign detection algorithm suitable for deployment in autonomous vehicle systems to enhance road safety. Furthermore, we aim to promote the application of artificial intelligence in sustainable urban living, contributing to a safer and more efficient traffic management system.