Article

An Improved Real-Time Detection Transformer Model for the Intelligent Survey of Traffic Safety Facilities

by Yan Wan 1, Hui Wang 2,*, Lingxin Lu 2,3, Xin Lan 2,3, Feifei Xu 1 and Shenglin Li 3

1 School of Civil and Transportation Engineering, Ningbo University of Technology, Ningbo 315211, China
2 Key Laboratory of New Technology for Construction of Cities in Mountain Area, Ministry of Education, School of Civil Engineering, Chongqing University, Chongqing 400045, China
3 College of Artificial Intelligence, Southwest University, Chongqing 400715, China
* Author to whom correspondence should be addressed.
Sustainability 2024, 16(23), 10172; https://doi.org/10.3390/su162310172
Submission received: 15 September 2024 / Revised: 18 November 2024 / Accepted: 19 November 2024 / Published: 21 November 2024
(This article belongs to the Section Sustainable Engineering and Science)

Abstract:
The undertaking of traffic safety facility (TSF) surveys represents a significant labor-intensive endeavor, which is not sustainable in the long term. The subject of traffic safety facility recognition (TSFR) is beset with numerous challenges, including those associated with background misclassification, the diminutive dimensions of the targets, the spatial overlap of detection targets, and the failure to identify specific targets. In this study, transformer-based and YOLO (You Only Look Once) series target detection algorithms were employed to construct TSFR models to ensure both recognition accuracy and efficiency. The TSF image dataset, comprising six categories of TSFs in urban areas of three cities, was utilized for this research. The dimensions and intricacies of the Detection Transformer (DETR) family of models are considerably more substantial than those of the YOLO family. YOLO-World and Real-Time Detection Transformer (RT-DETR) models were optimal and comparable for the TSFR task, with the former exhibiting a higher detection efficiency and the latter a higher detection accuracy. The RT-DETR model exhibited a notable reduction in model complexity by 57% in comparison to the DINO (DETR with improved denoising anchor boxes for end-to-end object detection) model while also demonstrating a slight enhancement in recognition accuracy. The incorporation of the RepGFPN (Reparameterized Generalized Feature Pyramid Network) module has markedly enhanced the multi-target detection accuracy of RT-DETR, with a mean average precision (mAP) of 82.3%. The introduction of RepGFPN significantly enhanced the detection rate of traffic rods, traffic sign boards, and water surround barriers and somewhat ameliorated the problem of duplicate detection.

1. Introduction

The traffic safety facility (TSF) system includes signal lights, traffic signs, pavement markings, guardrails, barriers, lighting equipment, sight-inducing markers, anti-glare facilities, etc., with the functions of traffic management, safety protection, traffic guidance, isolation, and glare prevention [1]. A TSF survey is an important part of road management to ensure the effectiveness and reliability of traffic facilities and minimize the incidence of traffic accidents [2,3]. Due to the variety of types and styles, TSF management is a labor-consuming job, as the relevant facilities’ ledgers may be lost, and errors and omissions are frequent. In addition, weather, accidents, and other factors may also lead to damage and loss of traffic facilities. However, the management departments usually adopt manual surveys to ensure the integrity of TSFs, an approach that suffers from low efficiency and incomplete TSF files. Intelligent detection technology for TSFs can help to solve the above problems [4]. Such technology can be employed to improve the quality of facility ledgers and to promptly identify deficiencies and other issues in TSFs.
Image-based deep learning algorithms have the potential to be applied in the TSF detection area. However, most of the research on existing intelligent recognition methods has been aimed at driverless vehicles, mainly traffic sign recognition (TSR) [5,6,7,8,9] and traffic light recognition (TLR) [10,11,12,13]. Some of the pavement distress detection investigations have incorporated portions of traffic safety and municipal facilities [14,15]. Few of the retrieved studies address TSF recognition (TSFR). Lu et al. [16] built a dataset including six categories of TSFs, and an end-to-end transformer-based detection (DINO) model demonstrated the highest detection accuracy on the TSF-CQU dataset compared with Faster R-CNN (Region-Convolutional Neural Network) and YOLOv7 (You Only Look Once version 7). However, detection rates could not be guaranteed for all types of TSF targets, and a certain number of false positive and false negative issues remained.
This paper presents a further investigation of TSFR, with the principal research framework illustrated in Figure 1.
As illustrated in Figure 1, this paper is developed through four principal stages: literature research, identification of research objectives, algorithm comparison, and algorithm improvement. Firstly, the literature research analyzed the problems faced by the TSFR task and the advanced target detection algorithms that can be applied to this task. Secondly, based on the determined manpower minimization goal, further evaluation indicators were made regarding the efficiency and accuracy goals. Thirdly, a comparison of the detection accuracy and efficiency metrics of advanced algorithms was carried out, and a misclassification analysis was conducted based on the preferred three algorithms that performed the best in terms of accuracy. Finally, the optimum algorithms were improved, and a detailed comparison and case studies were conducted before and after the improvement.

2. Literature Review

The TSFR task includes both TSR and TLR issues, but most of the research has focused on one type of target. A comprehensive system comprising a traffic light detector, a tracker, and a classifier based on deep learning, stereo vision, and vehicle odometry was proposed for TLR and automated driving applications [12]. Arcos-García et al. achieved exceptionally high accuracy in the TSR domain by employing multiple spatial transformer networks (STNs) and comparing stochastic gradient descent optimizers to enhance the efficacy of the classification model and diminish the number of model parameters [5]. To enhance the efficacy of traffic sign target detection, Xie et al. [17] proposed a TSR method that integrates the SSD (Single Shot MultiBox Detector) with FPN (feature pyramid network) and incorporates an attention mechanism. The efficacy of this approach was validated on the CCTSDB (Changsha University of Science & Technology Chinese Traffic Sign Detection Benchmark) dataset, demonstrating enhanced accuracy, classification precision, and resilience to interference. The advent of edge computing has paved the way for the development of intelligent traffic perception systems. The enhanced SSD-ResNet50 and DarkNet-53 algorithms have demonstrated considerable potential for application in real-time TSR research [18]. One of the challenges that TLR often encounters is that the target in the field of view is often too small to be accurately identified. The introduction of traffic light position frequency expert information led to a significant improvement in the efficiency of the TLR model, with an 83% precision and 73% recall rate achieved [11]. Driverless-oriented TLR and TSR research frequently centers on the stability of algorithms and the real-time performance of recognition methods under the influence of environments such as darkness, fog, rain, and snow [19].
In comparison to TSR and TLR for automated driving, the real-time requirements for TSF detection are relatively less demanding, yet the detection objects are more diverse and exhibit significant gaps. Due to the characteristics of construction and maintenance, sign lines are often included in pavement distress surveys by unmanned aerial vehicles (UAVs) [20] or car-mounted cameras [14,15,16]. To address the diversity of roadside traffic objects, Fang et al. [21] developed a straightforward and universal multi-view feature descriptor based on a mobile laser scanning (MLS) system to characterize the global features of a single object. This approach has been proven effective in detecting roadside traffic facilities, including trees, cars, and traffic poles. Thanh and Chaisomphob proposed a method for automatically detecting and classifying TSFs in a highway environment based on MLS point cloud data, which was reported to be effective, robust, reliable, and capable of detecting and marking TSFs with high accuracy [22]. Nevertheless, the method cannot be used to detect large diagonal poles and some small signs with short trunk heights or utility poles with square cross sections. Jiang et al. [23] proposed an enhanced CenterNet target detection model to address the challenges posed by significant target scale variation, a high proportion of small targets, complex backgrounds, and partial occlusion of targets in UAV images. The model was applied to the detection of TSFs, including culverts, alarm posts, traffic signs, isolated gates, curb signs, portal frames, and monitors, and a mAP of 0.867 was achieved.
Due to the scanning angles and visual characteristics, the use of UAV and MLS data for TSFR purposes will inevitably fail to identify some types of TSFs. The vision data obtained from the driving viewpoint has the potential to automatically identify additional facility categories. Liu et al. [24] proposed a novel symmetric TSR model, M-YOLO, for complex scenes. The M-YOLO model demonstrated excellent detection performance in experimental results on the CCTSDB dataset and the small target dataset HRRSD (High-Resolution Remote Sensing Detection), which contain traffic signs in complex scenes. Sanjeewani and Verma [25] presented a comprehensive convolutional neural network (CNN) optimization method for the accurate detection of road safety attributes, including rumble strips, flexible risers, guide posts, and signals. Yang et al. [26] conducted a comprehensive evaluation of the performance of the Mask R-CNN, YOLOX, and YOLOv7 algorithms in detecting multiple classes of TSFs based on the Mapillary dataset, and the results demonstrated that YOLOv7 exhibited superior accuracy compared to the other two networks. It should be noted that the issue of duplicate detection remains unresolved. In [27], a highway TSF detection and pixel-wise segmentation model based on a mask region-based convolutional neural network (Mask-RCNN) with a feature pyramid network (FPN) was proposed. This model was designed to detect various types of highway infrastructures, including retaining walls, noise barriers, rumble strips, guardrails, guardrail anchors, and central cable barriers. A self-attention mechanism based on the generic region of interest extractor (GRoIE) model, in conjunction with the intersection over the minimum area merging (IoMA-Merging) post-processing algorithm, was introduced to address the issue of false positives (FPs) resulting from scale diversity and the high degree of continual appearance variation. Nevertheless, instances of overlapping detections and unrecognized assets persist due to scale diversity. DINO and non-maximum suppression (NMS) [28] have been demonstrated to provide partial solutions to the issue of duplicate detections [14,16].
Although the TSFR method has been developed, the detection of specialized targets is highly repetitive and generates a large amount of carbon emissions. One way to optimize this detection task is to make the detection model lightweight and to share this task in other scenarios such as routine surveys, public transport, etc. Therefore, this study proposes using an existing TSF dataset to carry out lightweight detection algorithm research and explore how to achieve efficient detection of TSFs based on low-cost video data. To conduct TSFR research, some relevant datasets are summarized and compared in Table 1, which includes the mainstream datasets in autonomous driving.
As seen in Table 1, datasets such as Cityscapes [29], KITTI [30], and nuScenes [31], which are used for segmentation or detection, are the most frequently applied autonomous driving datasets. It can also be concluded from Table 1 that Cityscapes [29], KITTI [30], and BDD100K [33] cannot be directly applied to the field of target detection. The labeled target categories in the nuScenes [31] and Waymo Open [32] datasets do not include specific TSFs. The CULane [34] and ApolloScape [35] datasets were collected in China, which fits the context of this study, but CULane focuses only on lane line detection with a single target, while the annotations of ApolloScape are more focused on autopilot-related scenarios rather than TSF asset types. Compared to those datasets for TSR and TLR in automated driving scenarios, TSF-CQU [16] has more TSF categories and can be used for this study. Based on the excellent performance of DINO in the TSFR task in [16], the more lightweight RT-DETR developed in [39] was used as the main framework in this paper. The YOLO series [40,41,42,43,44,45,46,47,48,49] of algorithms were used to construct TSFR models for comparison because of the rapid development of YOLO.

3. Methods

3.1. YOLO Series

3.1.1. YOLOv7–YOLOv11

YOLOv7 proposes several architectural changes and a series of trainable “bag-of-freebies” methods that improve accuracy without affecting inference speed. The backbone network consists of CBS (Convolutional Block with Sigmoid), ELAN (Efficient Layer Aggregation Network), and MP-1 (Multi-Path Downsample Block Type 1). CBS is structured for feature extraction and channel conversion. ELAN splices feature maps together through different branches to facilitate effective learning and convergence of the deeper network. MP-1 fuses feature maps obtained from different downsampling methods to retain more feature information without increasing the amount of computation. The neck network mainly consists of the SPPCSPC (Spatial Pyramid Pooling Cross Stage Partial Network) module and three sub-modules: E-ELAN (Extended ELAN), UPSampling, and the Concatenation (Cat) structure. The SPPCSPC module is used to improve the efficiency and accuracy of feature extraction. The E-ELAN module adds two splicing operations compared to the ELAN module. The UPSampling module is used to achieve the efficient fusion of features at different levels, and the Cat structure aims to further optimize the effect of convolutional layers. The detection head is responsible for the final prediction output of the network: it decouples the feature information processed by the neck, uses the reparameterization module to adjust the number of channels for the three different sizes of features output by the neck, and then applies a 1 × 1 convolution operation to arrive at the prediction of the target object’s position, confidence level, and category [41].
YOLOv7-tiny exhibits a reduced model size and diminished computational effort, rendering it well-suited for deployment on constrained computing platforms. However, YOLOv7-tiny may exhibit constraints in target detection, particularly for diminutive targets at considerable distances. In intricate traffic scenarios, YOLOv7-tiny may encounter challenges in accurately identifying objects, particularly in the detection of traffic signs. We validated the detection potential and compressibility of the algorithm in a relevant domain [14,42,43].
The inputs of YOLOv8 are similar to those of YOLOv7. The structure of YOLOv8 used in the backbone part is Darknet53, which includes the basic convolutional unit Conv, the spatial pyramid pooling module SPPF (Spatial Pyramid Pooling—Fast) that implements the fusion of local and global features at the feature map level, and the C2F (CSP Bottleneck with 2 Convolutions) module that increases the depth of the network and the receptive field to increase the feature extraction capability. The neck network is similar to that of YOLOv5. For loss function computation, a task-aligned assigner positive sample allocation strategy is used. It consists of a weighted combination of three loss functions in two parts: classification loss VFL (Varifocal Loss) and regression loss CIOU (Complete IOU) + DFL (Distribution Focal Loss) [44].
The concept of programmable gradient information (PGI) was introduced in YOLOv9 to cope with the variations required for deep networks to achieve multiple objectives. PGI can provide complete input information for the target task to compute the objective function, thus obtaining reliable gradient information to update the network weights. Generalized Efficient Layer Aggregation Network (GELAN) based on gradient path planning was designed. The architecture of GELAN confirms that PGI achieves excellent results on lightweight models [45].
YOLOv10 focuses on balancing efficiency and accuracy through a series of architectural and training protocol optimizations. A novel approach is consistent dual assignment for training without NMS, where a combination of one-to-many and one-to-one strategies are used during training, ensuring consistency between training and inference and eliminating the need for non-maximal suppression (NMS) during inference. By removing the NMS step, YOLOv10 significantly reduces inference latency. The NMS-free approach enables true end-to-end deployment of the model, simplifying the inference pipeline and potentially improving the efficiency of the overall system. The classification header in YOLOv10 is designed to be lightweight, reducing computational redundancy in the classification process. The spatial-channel decoupled downsampling technique separates spatial and channel information during downsampling to optimize the feature extraction process. By decoupling these aspects, the model can process the input data more efficiently, resulting in better performance with lower computational requirements. Large kernel convolutions are used to improve the model’s ability to capture detailed features over larger spatial regions. These convolutions enable the model to better understand the target context in the image and improve detection accuracy. The Partial Self-Attention module improves accuracy while adding very little additional computational cost. This module helps the model focus on relevant features in the input data, improving its ability to accurately detect and classify targets. YOLOv10 incorporates a variety of advanced techniques to improve overall performance, such as showing greater capability in small target detection, demonstrating lower false prediction rates, and producing higher confidence scores for predictions [46].
YOLOv11 does not have many improvements over YOLOv8, mainly some architectural details. Its enhanced feature extraction uses improved trunk and neck architectures for more accurate target detection and performance on complex tasks. The C2PSA mechanism was proposed, which embeds a multi-head attention mechanism inside a C2 (the predecessor of C2f) mechanism. The classification detection head in the decoupled head adds two DWConv to reduce the computational effort. Another point of difference between YOLOv11 and YOLOv8 is the depth and width of the network [47,48].
The YOLOv7-tiny model has been designed to be more compact and efficient than the original YOLOv7 to optimize performance on edge GPUs. However, this has resulted in a slight reduction in accuracy. The downsampling module of YOLOv8-tiny may result in the loss of fine-grained feature information, which can subsequently affect the accuracy of the detection. Furthermore, the robustness of the network in complex backgrounds requires further enhancement. YOLOv9 addresses the issues of information loss and gradient instability through the introduction of GELAN and PGI, thereby improving the accuracy and computational efficiency of the model. However, it also has a complex network structure, a large number of parameters, and high performance requirements for devices, making it unsuitable for edge terminal devices. YOLOv11 demonstrates a notable enhancement in inference speed and accuracy compared to its predecessor, with the introduction of variants (11n, 11s, 11m, and 11x). Accordingly, YOLOv7-tiny, YOLOv9-tiny, YOLOv10-n, and YOLOv11-n were chosen for TSFR model construction and comparison.

3.1.2. YOLO-World

YOLO-World enhances YOLO’s open-vocabulary detection capabilities through visual-linguistic modeling and pre-training on large-scale datasets. Open-vocabulary detection refers to the ability to detect and recognize object categories that have not been seen during the training phase. YOLO-World is based on YOLOv8 and uses Darknet as its encoder. Through a Path Aggregation Network (PAN), YOLO-World constructs a feature pyramid that fuses feature maps at different scales, enhancing the model’s ability to detect targets of different sizes. YOLO-World encodes input text (e.g., category names, noun phrases, or object descriptions) into text embeddings using the pre-trained CLIP text encoder. CLIP is a visual-linguistic pre-training model capable of mapping textual and image features into the same semantic space. In the training phase, online vocabularies containing both positive and negative nouns are constructed to enhance the model’s ability to recognize large-vocabulary objects. In the inference phase, YOLO-World employs a “cue-detection” paradigm [49].
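As a concrete illustration of the text-embedding step, the minimal sketch below encodes a hypothetical TSF vocabulary with a pre-trained CLIP text encoder via the Hugging Face transformers library; the checkpoint name and category list are assumptions for illustration and do not reproduce YOLO-World’s exact pipeline.

```python
# Sketch only: encode an assumed TSF vocabulary into CLIP text embeddings.
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical TSF category prompts; YOLO-World would match these embeddings
# against image region features in the shared semantic space.
categories = ["traffic sign board", "traffic light", "guardrail",
              "water surround barrier", "traffic rod", "gantry"]

with torch.no_grad():
    tokens = tokenizer(categories, padding=True, return_tensors="pt")
    text_embeds = text_encoder(**tokens).text_embeds            # (num_classes, embed_dim)
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)  # L2-normalize

print(text_embeds.shape)  # e.g., torch.Size([6, 512])
```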

3.2. Detection Based on Transformer

3.2.1. DINO

The extraction of multi-scale features is conducted using ResNet50 backbones, which are subsequently fed into the transformer encoder in conjunction with the corresponding positional embeddings. Subsequently, a novel mixed query selection strategy is proposed for the initialization of anchors as positional queries for the decoder, following the enhancement of features with the encoder layers. The multiscale features are fed to the transformer encoder for enhancement, together with the corresponding position embeddings. The deformable attention mechanism is employed to integrate the output of the features generated by the encoder and to refine the query layer sequentially. The mixed query selection (QS) module is responsible for enhancing the positional information utilizing the top-K features that have been extracted from the encoder. The decoder is subdivided into a contrast denoising training module, which learns by introducing positive and negative samples with noise, and a bipartite graph matching component. This latter component correlates the output prediction with the real target in the input image, thereby obtaining the accurate target detection result. The output sequence of the decoder is passed through the Feed-Forward Network, which generates the predictions of the final category and the bounding box predictions [50].

3.2.2. Real-Time Detection Transformer (RT-DETR)

The DETR (Detection Transformer) family suffers from the problem of high computational cost, which limits its wide use in practical applications. Based on this, the Baidu Flying Paddle team proposed RT-DETR [39] as a new member of the DETR family, which is the first real-time end-to-end target detector that outperforms all the YOLO detectors of the same size in terms of speed and accuracy, making it a breakthrough in the DETR family of models. They designed a hybrid encoder that efficiently handles multi-scale features by decoupling intra-scale interactions and cross-scale fusion to reduce computational costs and enable real-time object detection. RT-DETR supports the flexibility of adjusting the inference speed by using different decoder layers without retraining, which contributes to the practical application of real-time detection. The framework of RT-DETR is shown in Figure 2.
The backbone part of RT-DETR uses CNNs, such as the ResNet family [51] or HGNet [52] developed by Baidu. The RT-DETR backbone network in this paper uses ResNet18, which comprises 18 weighted layers, including the initial convolution, residual blocks, and a fully connected layer. ResNet18 is more lightweight in terms of network depth and the number of parameters, and thus has higher training and inference speed in resource-constrained scenarios. The goal of choosing ResNet18 is to balance accuracy and computational efficiency. RT-DETR draws outputs from the backbone network at three scales with output strides of 8, 16, and 32. Once the outputs of the three scales are extracted from the backbone network, they are fed into the neck network for further computational processing.
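As an illustration, the sketch below taps the three feature maps with output strides of 8, 16, and 32 from torchvision’s stock ResNet18; the layer names follow torchvision’s implementation, and the S3/S4/S5 aliases mirror the notation in Figure 2 rather than the official RT-DETR code.

```python
# Sketch: multi-scale feature extraction from a ResNet18 backbone.
import torch
from torchvision.models import resnet18
from torchvision.models.feature_extraction import create_feature_extractor

backbone = resnet18(weights=None)
extractor = create_feature_extractor(
    backbone, return_nodes={"layer2": "S3", "layer3": "S4", "layer4": "S5"}
)

x = torch.randn(1, 3, 640, 640)          # a typical detection input resolution
feats = extractor(x)
for name, f in feats.items():
    print(name, tuple(f.shape))
# S3 (1, 128, 80, 80)  -> stride 8
# S4 (1, 256, 40, 40)  -> stride 16
# S5 (1, 512, 20, 20)  -> stride 32
```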
For the neck network, the Flying Paddle team designed a series of encoder variants to verify the feasibility of decoupling intra- and inter-scale feature interactions and eventually evolved into an efficient hybrid encoder, which consists of an attention-based intra-scale feature interaction (AIFI) module and a CNN-based cross-scale feature-fusion module (CCFM). The efficient hybrid encoder in RT-DETR can efficiently process multiscale features, which are processed by decoupling the interactions within a single scale and cross-scale fusion. After extracting the multi-scale feature maps from the backbone network, only the feature map with the smallest scale (S5 in Figure 2) is used to do the multi-head self-attention processing. The output is used to do feature fusion with the other two scaled feature maps (S3 and S4 in Figure 2) in the CCFM module.
The computational process of the AIFI module is specifically shown in Figure 3a. The 2D image features are pulled into vectors, and then the positional encoding is generated. The inputs are fed to the AIFI module for processing, and the computation process includes a multi-head self-attention and a Feed-Forward Neural Network (FFN).
As shown in Figure 3a, the computational process of the AIFI module is as follows. In step 1, the input features are summed with the positional encodings, the value obtained after the summation is used for the self-attention Q and K, and V is the original feature input. In Step 2, the output of the self-attention computation is summed and normalized with the original input features. In Step 3, the normalized output goes to the FFN for the next step, and the output of the FFN is summed with the input of the FFN and normalized again to get the result of the encoder layer. In Step 4, the output of Step 3 is adjusted back to 2D features to complete the subsequent “cross-scale feature fusion”, and F5 in Figure 2 is the result of the AIFI computation process.
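The following minimal PyTorch sketch mirrors the four steps above; it is an interpretation of the AIFI layout rather than the reference implementation, and the dimensions are illustrative.

```python
# Sketch of an AIFI-style encoder layer: positions are added to Q and K only,
# V keeps the raw features, and the output is reshaped back to a 2D map (F5).
import torch
import torch.nn as nn

class AIFISketch(nn.Module):
    def __init__(self, dim=256, heads=8, ffn_dim=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, s5, pos):                        # s5: (B, C, H, W), pos: (B, H*W, C)
        b, c, h, w = s5.shape
        x = s5.flatten(2).permute(0, 2, 1)             # Step 1: flatten 2D features into a sequence
        q = k = x + pos                                # positional encoding added to Q and K only
        attn_out, _ = self.attn(q, k, value=x)
        x = self.norm1(x + attn_out)                   # Step 2: add & norm
        x = self.norm2(x + self.ffn(x))                # Step 3: FFN, then add & norm
        return x.permute(0, 2, 1).reshape(b, c, h, w)  # Step 4: back to 2D -> F5

f5 = AIFISketch()(torch.randn(2, 256, 20, 20), torch.randn(2, 400, 256))
print(f5.shape)  # torch.Size([2, 256, 20, 20])
```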
As shown in Figure 2, the inputs of CCFM are S3, S4, and F5. The first fusion module is a computation of S4 and F5. F5 first undergoes 1 × 1 convolution with upsampling to form the same size as S4, and the two are spliced together to enter the computation of the fusion module. As shown in Figure 3b, there are two branches in the fusion; the upper branch undergoes a 1 × 1 convolution computation, and the lower branch undergoes the same 1 × 1 convolution computation and the computation of the RepBlock [53] repeated three times in our experiment. Then, the upper and lower branches perform element-wise summation, which is an element-by-element addition of two tensors of the same shape. The first fusion output is then subjected to the same computational process as S3 to obtain the output of the second fusion module. The output of the second fusion is then downsampled and fused again with the fusion results of F5 and S4. This is followed by another downsampling, and then a final fusion with the result of the 1 × 1 convolution of F5 and upsampling, obtaining the third output.
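A rough sketch of one such fusion block, as interpreted from Figure 3b, is given below; the RepBlock is simplified here to a 3 × 3 convolution with BatchNorm and SiLU, and the channel sizes are assumptions.

```python
# Sketch of a CCFM-style fusion block: concatenate two scales, run a plain 1x1
# branch and a 1x1 + three "RepBlock" branch, then sum the branches element-wise.
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k=1):
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU())

class FusionBlockSketch(nn.Module):
    def __init__(self, c_in, c_out, n_rep=3):
        super().__init__()
        self.branch_a = conv_bn_act(c_in, c_out, 1)                    # upper branch
        self.branch_b = nn.Sequential(conv_bn_act(c_in, c_out, 1),     # lower branch
                                      *[conv_bn_act(c_out, c_out, 3) for _ in range(n_rep)])

    def forward(self, low, high_upsampled):
        x = torch.cat([low, high_upsampled], dim=1)    # splice the two scales together
        return self.branch_a(x) + self.branch_b(x)     # element-wise summation of branches

out = FusionBlockSketch(c_in=512, c_out=256)(torch.randn(1, 256, 40, 40),
                                             torch.randn(1, 256, 40, 40))
print(out.shape)  # torch.Size([1, 256, 40, 40])
```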
From the CCFM module, three feature maps (outputs) at different scales after inter-scale feature fusion processing are obtained. Subsequently, these feature maps are spliced as inputs to the decoder. RT-DETR innovatively introduces an IoU-based query selection mechanism, which further improves the model performance by providing high-quality initialized target queries to the decoder. It employs “IoU soft labeling” as the label for categorization, with the IoU values between the predicted and real frames used as the labels for category prediction. This approach is essentially an in-depth application of IoU perception, which has been validated in several works such as RTMDet [54] and DAMO-YOLO [55]. The purpose of introducing IoU soft labeling is to address the discrepancy between categories and regressions. The traditional unique heat coding approach may lead to “unaligned” situations, i.e., the categories are learned in advance while the localization is still inaccurate. The use of IoUs as category labels allows the learning of categories to be modulated by the regression. The categories are learned accurately enough only when the regression is also learned well enough. Thus, the IoU soft labels act as explicit constraints on the categories during the training process, helping to improve the overall performance of the model.
RT-DETR employs the decoder part of DINO, known as the “DINO Head”. It draws on DINO’s denoising idea and aims to optimize the sample quality and promote fast convergence of training. Besides, the settings of RT-DETR, such as bipartite graph matching and loss function, are the same as DINO, and the loss part uses several loss functions to train the model. The Focal Loss function [56] is used for classification loss, which is usually used to solve the problem of class imbalance in classification problems. The Smooth L1 loss [57] and the GIoU loss [58] functions are combined and used for bounding box regression loss. The overall loss function is a weighted sum of the above categorization loss and the bounding box regression loss.
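The weighted combination described above can be sketched as follows using torchvision’s focal-loss and GIoU helpers; the loss weights and the xyxy box format are illustrative assumptions, not the exact RT-DETR settings.

```python
# Sketch: weighted sum of classification (focal) and box regression (Smooth L1 + GIoU) losses.
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss, generalized_box_iou_loss

def detection_loss(cls_logits, cls_targets, pred_boxes, gt_boxes,
                   w_cls=1.0, w_l1=5.0, w_giou=2.0):
    loss_cls = sigmoid_focal_loss(cls_logits, cls_targets, reduction="mean")
    loss_l1 = F.smooth_l1_loss(pred_boxes, gt_boxes)              # coordinate regression
    loss_giou = generalized_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
    return w_cls * loss_cls + w_l1 * loss_l1 + w_giou * loss_giou

# Toy example: 4 matched predictions, 6 classes, boxes in xyxy pixel format.
logits, soft_targets = torch.randn(4, 6), torch.rand(4, 6)        # e.g., IoU soft labels
pred = torch.tensor([[10.0, 10.0, 50.0, 60.0]]).repeat(4, 1) + torch.rand(4, 4)
gt = torch.tensor([[12.0, 8.0, 52.0, 58.0]]).repeat(4, 1)
print(detection_loss(logits, soft_targets, pred, gt))
```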
Even though RT-DETR was developed to reduce computational costs, its training process may still prove to be more complex and challenging than that of traditional CNN-based models. This may necessitate a greater input of tuning efforts to achieve optimal performance. Besides, RT-DETR employs a Hungarian matching strategy to provide sparse supervision during training, which may result in the encoder and decoder being undertrained, thereby limiting the method’s optimal performance.

3.3. Improved RT-DETR with Reparameterized Generalized Feature Pyramid Network Module (RT-DETR-RepGFPN)

As the six types of TSF targets have scale differences, the corresponding detection accuracy is affected. To further improve the ability to capture multi-scale features, the RepGFPN (Reparameterized Generalized Feature Pyramid Network) module [59] was introduced into RT-DETR, and the improved model framework is shown in Figure 4.
In RepGFPN, feature fusion can be regarded as a weighted summation process, whereby feature maps of varying scales are merged by addition to form a feature representation that is rich in semantic and spatial information. The fused feature map Fout can be expressed as (1)
$F_{out} = F_l + \mathrm{upsample}(F_h)$
where Fl and Fh represent the low-level and high-level feature maps, respectively, and upsample represents the upsampling operation, which is used to adjust the resolution of the feature map to match the low-level feature map.
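In code, Equation (1) amounts to matching resolutions by upsampling and then adding element-wise, as in the short sketch below (NCHW tensors with equal channel counts are assumed).

```python
# Sketch of Equation (1): F_out = F_l + upsample(F_h).
import torch
import torch.nn.functional as F

f_low = torch.randn(1, 256, 40, 40)     # low-level feature map F_l
f_high = torch.randn(1, 256, 20, 20)    # high-level feature map F_h
f_out = f_low + F.interpolate(f_high, size=f_low.shape[-2:], mode="nearest")
print(f_out.shape)  # torch.Size([1, 256, 40, 40])
```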
The feature pyramid network (FPN) has been a pivotal component in the field of target detection, as it can efficiently aggregate disparate resolution features extracted from the backbone network. The Generalized FPN (GFPN) serves as the neck structure, enabling the exchange of high-level semantic information with low-level spatial information, thereby achieving excellent performance. However, GFPN-based models frequently exhibit high latency. A detailed examination of the GFPN structure reveals that this issue is primarily because feature mappings at different scales share the same channel dimensions. Additionally, the fused operations do not align with the requirements of real-time detection models, and the inefficiency of convolution-based cross-scale feature fusion further contributes to the problem. In light of these considerations, the TinyML team put forth a novel, efficient RepGFPN designed to meet the specifications of real-time target detection. The Efficient-RepGFPN in DAMO-YOLO plays a pivotal role in aggregating features of varying resolutions extracted from the backbone network, rendering it more suitable for real-time detection models by enhancing feature interactions and optimizing convolution-based cross-scale feature fusion.
The RepGFPN in RT-DETR serves as a replacement for the CCFM feature fusion module in the neck network, which has the same inputs as the original CCFM module. The three inputs are employed for the fusion of features across different scales through the enhanced queen fusion mechanism. Furthermore, different channel dimensions are employed for the feature maps at different scales, and Cross-Stage Partial Network (CSPNet) [60] and Efficient Long-Range Attention Network (ELAN) [61] were integrated.

4. Experiment

4.1. Data

A total of 1422 images from the TSF-CQU database were utilized in this research project, which includes 230 images from Chongqing, 44 images from Ningbo, and 948 images from Shanghai. Examples of the data are presented in Figure 5. The dataset is labeled with a total of 8410 target samples, which are distributed across different target types. These distributions are presented in Table 2, while the samples themselves are shown in Figure 6.
The data annotation process utilized the more popular target detection dataset annotation software, LabelImg v1.8.0. The software is straightforward to operate and is capable of supporting YOLO (txt format), Pascal VOC (XML format), and CreateML (JSON format) labels. Given that the dataset labels of the DETR series algorithm are in MS COCO (JSON format), and given the subsequent comparison experiments with the YOLO series algorithms, this study employed LabelImg to label the text format labels initially, subsequently combining this with Python 3.8 scripts to convert to MS COCO format labels for utilization by the model.
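The conversion step can be sketched as follows; the category order, image size, and file paths are placeholders (a production script would read the actual image dimensions from the files), so the snippet illustrates the coordinate transformation rather than the exact script used.

```python
# Simplified sketch: convert YOLO txt labels (class xc yc w h, normalized) to MS COCO JSON.
import json, glob, os

CATEGORIES = ["rod", "board", "light", "guardrail", "WSB", "gantry"]   # assumed class order
IMG_W, IMG_H = 1920, 1080                                              # placeholder image size

coco = {"images": [], "annotations": [],
        "categories": [{"id": i, "name": n} for i, n in enumerate(CATEGORIES)]}
ann_id = 0
for img_id, label_path in enumerate(sorted(glob.glob("labels/*.txt"))):
    stem = os.path.splitext(os.path.basename(label_path))[0]
    coco["images"].append({"id": img_id, "file_name": stem + ".jpg",
                           "width": IMG_W, "height": IMG_H})
    with open(label_path) as f:
        for line in f:
            cls, xc, yc, w, h = line.split()
            w_px, h_px = float(w) * IMG_W, float(h) * IMG_H
            x_px = float(xc) * IMG_W - w_px / 2        # YOLO centre -> COCO top-left corner
            y_px = float(yc) * IMG_H - h_px / 2
            coco["annotations"].append({"id": ann_id, "image_id": img_id,
                                        "category_id": int(cls),
                                        "bbox": [x_px, y_px, w_px, h_px],
                                        "area": w_px * h_px, "iscrowd": 0})
            ann_id += 1

with open("annotations_coco.json", "w") as f:
    json.dump(coco, f)
```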

4.2. Experiment Configuration

The experimental environment of the transformer detection series was a Windows 10 system with an 11th-generation Intel Core i9-11900K CPU from Intel Corporation (Santa Clara, CA, USA), which operates at a frequency of 3.5 GHz. The GeForce RTX 3090 graphics card from NVIDIA Corporation (Santa Clara, CA, USA) was used for training, with 24 GB of video memory, CUDA 12.2 configured, and 128 GB of computer memory. The programming language employed was Python, and the deep learning framework utilized was PyTorch 1.12.1.
The experimental environment of the YOLO series was Python 3.10.8, the Pytorch 2.0.1+cu117 framework, and CUDA 11.7. The hardware devices were AMD Ryzen 7 5800H from Advanced Micro Devices, Inc. (Santa Clara, CA, USA) with a Radeon Graphics processor from AMD (Santa Clara, CA, USA) and GeForce RTX 4090 from NVIDIA Corporation.

4.3. Confusion Matrix

The confusion matrix is comprised of four principal elements, which are as follows:
(1)
True positive (TP): the number of samples that the model correctly predicts as positive categories;
(2)
True negative (TN): the number of samples correctly predicted by the model to be in the negative category;
(3)
False positive (FP): the number of samples incorrectly predicted by the model to be in the positive category;
(4)
False negative (FN): the number of samples that the model incorrectly predicts to be in the negative category.

4.4. Evaluation Metrics

For efficiency evaluation, the parameter number (params) and operation volume (GIGA floating point operations, GFLOPs), processing frames per second (FPS), training time (TT), and graphics memory used for training (GMT) were used. For accuracy, average precision (AP) and mean average precision (mAP) with an intersection over union (IoU) threshold of 50% were used, which are calculated by Equations (2)–(6).
$\mathrm{Precision} = \dfrac{TP}{TP + FP}$ (2)
$\mathrm{Recall} = \dfrac{TP}{TP + FN}$ (3)
$AP = \int_{0}^{1} P \, dR$ (4)
$mAP = \dfrac{1}{N}\sum_{i=1}^{N} AP_i$ (5)
$IoU = \dfrac{pix_{pr} \cap pix_{gt}}{pix_{pr} \cup pix_{gt}}$ (6)
where TP denotes true positive case numbers and FP denotes false positive case numbers; Recall is the percentage of all true targets that are detected; AP is the area under the Precision-Recall curve, which measures the performance of the model in a category; $pix_{pr}$ is the region of the prediction result; $pix_{gt}$ represents the region of the real label; and mAP is the mean AP value of all categories.
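For reference, the sketch below evaluates Equations (2), (3), and (6) directly and approximates the integral in Equation (4) with a simple Riemann sum over a toy precision-recall curve; it is not tied to any specific COCO or VOC evaluation implementation.

```python
# Sketch of the evaluation metrics: precision, recall, IoU, AP (approximate), and mAP.
import numpy as np

def precision_recall(tp, fp, fn):
    return tp / (tp + fp), tp / (tp + fn)

def box_iou(a, b):                                   # boxes given as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def average_precision(recalls, precisions):          # crude area under the P-R curve
    order = np.argsort(recalls)
    r = np.concatenate(([0.0], np.asarray(recalls)[order]))
    p = np.asarray(precisions)[order]
    return float(np.sum((r[1:] - r[:-1]) * p))

print(precision_recall(tp=80, fp=10, fn=20))                 # (0.888..., 0.8)
print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))               # 0.1428...
print(average_precision([0.2, 0.5, 1.0], [1.0, 0.8, 0.5]))   # toy AP = 0.69
aps = [0.75, 0.82, 0.90]                                     # per-class APs
print(sum(aps) / len(aps))                                   # mAP over N classes
```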

5. Results and Discussion

5.1. Evaluation Results for Comparison of Different Models

The evaluation results of the models built are compared in Table 3.
As illustrated in Table 3, the transformer series models exhibited superior detection accuracy relative to the YOLO series. However, the former were more substantial in size and necessitated a greater investment of training time. The YOLO series models demonstrated a more streamlined overall performance, with a general reduction in computational complexity. Among these, the YOLO-World model, which uses a CLIP text encoder, exhibited a notable increase in both size and computational complexity. Nevertheless, it exhibited the second highest integrated detection accuracy (mAP = 0.807), as the best-performing RT-DETR obtained a mAP of 0.811. Given the considerable discrepancies in the dimensions and configurations of the target categories, the classification precision of disparate models may exhibit a considerable disparity.
The smaller models employed in the comparative YOLO series demonstrated enhanced potential in terms of computational capacity and the operation of embedded terminals. However, available evidence suggests that both RT-DETR and YOLO-World have compression capabilities by knowledge distillation [42]. In addition to the efficiency of the model operation, post-processing means must be employed as an aid. Based on practical engineering experience, the time consumed by human screening images to add detection targets (8–12 s) was approximately four to six times the time cost (2–3 s) of deleting or revising after false detection. The manual single-image analysis times were considerably longer than the algorithmic single-image inference times. Consequently, the analysis concentrated on the particular issues of missed and false detections. The detection models with mAP exceeding 0.8 were subjected to a further selection process to facilitate a comparison of classification accuracy, as illustrated in Figure 7.
As illustrated in Figure 7, all models exhibited the lowest average accuracy in rod detection, which is primarily attributable to the considerable interference of background information (e.g., tree poles, etc.) associated with the rods. Similarly, traffic lights exhibited relatively low detection accuracy due to their smaller targets. RT-DETR, however, was capable of detecting traffic lights with smaller targets with a high degree of accuracy, likely due to its ability to sense and process information across scales provided by its backbone network AIFI and CCFM modules. For continuous detection targets such as guardrails and WSB, the detection accuracy was generally high, and YOLO-World even achieved very high APs of 0.888 and 0.931, respectively. This demonstrates the advantage of YOLO-World in detecting large targets (with a high pixel percentage) and the existence of the weighting tendency problem.
There was no significant difference in classification detection accuracy between RT-DETR and DINO. The reductions in the detection AP values of WSB and gantry did not affect the overall efficiency. Given that the TSFR task is designed to cover a high percentage of facilities and assist in the inventory of facilities, as well as to further optimize efficiency for the timely detection of facility anomalies, further analyses were conducted by drawing confusion matrices for RT-DETR and YOLO-World, as shown in Figure 8.
As illustrated in Figure 8, the misclassification patterns of the two models exhibited notable similarities. The primary issues encountered were false positives (FPs) and false negatives (FNs). Both models incorrectly identified a small number of traffic lights as traffic signs. RT-DETR demonstrated a lower rate of false negatives in the detection of road signs, boards, traffic lights, and guardrails but exhibited a much higher rate of false positives in the detection of background elements, including a small number of traffic poles that were incorrectly identified as guardrails.
Regarding the FN problems, YOLO-World may have demonstrated worse performance because it may be more inclined to sacrifice some degree of accuracy in pursuit of enhanced speed. Conversely, RT-DETR may have exhibited better efficacy in addressing the FN problem due to its reliance on a transformer architecture, which is better equipped to handle smaller targets and more intricate scenes. However, this also implies that greater computational resources and time are required for the processing of each image.
Regarding the FP problems, RT-DETR exhibited inferior performance, which may be because RT-DETR is more focused on global features in terms of feature extraction, whereas YOLO-World may be more adept at extracting local features. YOLO-World employs a pre-trained CLIP text encoder to encode the input text as text embeddings, thereby enhancing the model’s generalization ability, which may be the primary reason for its capacity to mitigate FP problems.
TSFR solutions need both precision and recall to be high. Moreover, false positives may result from distant targets that fall outside the detection scope. Thus, we focused on the improvement of the FN problems and conducted a model improvement for RT-DETR. The smaller objects included in the dataset (e.g., light category) were more challenging to detect. Due to the design of the data capture field of view, far-distance targets may be small. In some cases, targets that should be classified as background were incorrectly identified as objects (e.g., WSBs and gantries in the opposite direction lanes). Despite the small amount of data available for WSBs and gantries, their distinctive features not only enabled them to be accurately detected but also detected unmarked targets in the opposite lane and far distance. Rods are characterized by thin shapes, which render them susceptible to background interference, thereby increasing the likelihood of false detection by the model. It is recommended that subsequent research should focus on further limiting the range of detection targets to the region of interest within a certain distance in a single direction. Additionally, efforts should be made to improve the identification of rod types.

5.2. Model Improvement Results

The evaluation indexes of the detection results for the improved and baseline models are shown in Table 4.
As evidenced in Table 4, the RT-DETR-RepGFPN model demonstrated an improvement in the detection Recall rate, accompanied by a corresponding reduction in precision. The mean average precision was enhanced by 1.2%. Overall, the improvement in the aggregate detection metrics was not pronounced. A further comparison of the detection recall of each target type is presented in Figure 9.
As illustrated in Figure 9, RT-DETR-RepGFPN has demonstrated a significantly enhanced recall rate for the categories of rod, board, and WSB. The detection recall rates for lights and guardrails were lower than expected. The low detection rate of TLR may be primarily due to small and ultra-long-range targets, while guardrails may be prone to interference from complex backgrounds. To further compare and analyze the effectiveness of the detection models, detected cases are shown in Figure 10 and Figure 11.
As illustrated in Figure 10a, the outline of a tall building in the distance was erroneously identified as a signboard. The introduction of RepGFPN (Figure 10b) effectively mitigated this FP category error but missed a pedestrian traffic light at the corner. The enhanced model depicted in Figure 10d was capable of identifying the distant signage and guardrail, whereas RT-DETR was unable to discern these two targets. The guardrail situated beneath the background interference of the bicycle and the rod supporting the board in Figure 10f were recognized by the enhanced model but were overlooked by RT-DETR in Figure 10e. Figure 10c–f demonstrates that RepGFPN effectively enhanced the capacity to recognize minute targets and the differentiation between the background and foreground. The ability to recognize minute targets may be attributed to the distinct scale channels employed by RepGFPN, whereas the capacity to distinguish the background is primarily attributed to the queen fusion mechanism. Queen fusion represents a cross-scale connectivity approach whereby features of different levels are fused with image features at different positions to facilitate more efficient information transfer. It is also worth noting that the FP problems associated with the repeated detection of the same target were somewhat mitigated, as shown in Figure 10g,h.
As illustrated in Figure 11, RT-DETR-RepGFPN was capable of accurately distinguishing traffic rods from street light poles. RT-DETR is prone to misidentifying street light poles as traffic rods due to the influence of background interference. The deterioration of data quality resulting from illumination had a negligible impact on the detection outcomes. Furthermore, the detection of small targets in distant views was also highly effective. Further studies should focus on narrowing the scope of interest, eliminating systematic duplication, and reducing the number of small targets.
Performance improvement analysis: RepGFPN employed disparate channel dimensions for feature maps at varying scales, thereby optimizing performance under computational resources. By flexibly controlling the number of channels at different scales, it is possible to achieve greater accuracy than would be the case if the same number of channels was used at all scales. RepGFPN enhanced feature interactions through a modified queen fusion mechanism while reducing latency by removing additional upsampling operations. The optimized fusion mechanism served to reduce the occurrence of duplicate detections, as it permitted the model to process and fuse feature information from disparate scales in a more efficacious manner. Furthermore, RepGFPN enhanced feature fusion through the integration of CSPNet (Cross Stage Partial Network) and ELAN with reparameterization. This integration facilitated an improvement in the model’s capacity to distinguish features, thereby reducing the misclassification of the background as a target.
Negative impact analysis of the introduction of RepGFPN: RepGFPN enhanced detection efficacy by optimizing the feature fusion strategy; however, this fusion may prove insufficient when confronted with lateral targets. The distinction of lateral targets may necessitate the identification of finer features, which RepGFPN’s fusion strategy may be insufficiently equipped to capture. Targeted increases in training data for lateral targets may help to ameliorate this problem.

6. Conclusions

The objective of this study was to assess and develop an effective TSFR method. To this end, a comprehensive investigation into the target detection models employed for six categories of traffic safety facility data was conducted. The advanced detection algorithms of the DETR and YOLO series were compared and analyzed. The main contributions are as follows.
  • A comparison of the YOLO and DETR series of models demonstrated that the detection accuracy of RT-DETR and YOLO-World was comparable, with the former exhibiting superior accuracy and the latter higher efficiency. However, the model size and complexity of RT-DETR for TSFR remained considerably higher than those of YOLO-World.
  • The RT-DETR-RepGFPN model was proposed for the TSFR task, improving the mAP to 0.823 while increasing the number of parameters by 4 M and reducing the FPS by only six.
  • The introduction of RepGFPN significantly enhanced recall for the categories of rod, board, and WSB but reduced the detection rate of lights and guardrails.
  • The problem of duplicate detection was somewhat ameliorated.
The proposed method can be used to assist in the archiving of traffic safety facilities, but it still requires considerable manual work. Further enhancements to the precision of the proposed method could be achieved by narrowing the scope of the region of interest for detection, expanding the range of rod types, and incorporating samples of lateral traffic light facilities. Additionally, the use of compressed models obtained through directional distillation could facilitate the process.
Furthermore, RT-DETR and RT-DETR-RepGFPN did not show an absolute accuracy advantage and presented a disadvantage in terms of efficiency when compared to YOLO-World. Further investigation can also be conducted on analyzing and improving the YOLO-World model, with the primary objective of improving rod detection accuracy.

Author Contributions

Conceptualization, L.L., H.W. and Y.W.; methodology, L.L. and H.W.; software, L.L. and X.L.; validation, H.W., Y.W., X.L. and S.L.; formal analysis, L.L., H.W. and Y.W.; investigation, L.L. and H.W.; resources, F.X., H.W. and Y.W.; data curation, L.L., Y.W. and F.X.; writing—original draft preparation, L.L., H.W. and Y.W.; writing—review and editing, H.W., X.L. and S.L.; visualization, L.L. and H.W.; supervision, H.W. and S.L.; project administration, H.W. and Y.W.; funding acquisition, F.X. and Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Ningbo Public Welfare Science and Technology Project, grant number 2023S063.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. GB50688-2011(2019); Code for the Design of Urban Road Traffic Facility. Ministry of Housing and Urban-Rural Development of the People’s Republic of China: Beijing, China, 2019. (In Chinese)
  2. Cui, T. Research on design technology of safety facilities in highway traffic engineering. In IOP Conference Series: Earth and Environmental Science; IOP Publishing: Bristol, UK, 2020; Volume 587, p. 012006. [Google Scholar] [CrossRef]
  3. Chen, R.; Hei, L.; Lai, Y. Image Recognition and Safety Risk Assessment of Traffic Sign Based on Deep Convolution Neural Network. IEEE Access 2020, 8, 201799–201805. [Google Scholar] [CrossRef]
  4. Lv, Z.; Shang, W. Impacts of intelligent transportation systems on energy conservation and emission reduction of transport systems: A comprehensive review. Green Technol. Sustain. 2023, 1, 100002. [Google Scholar] [CrossRef]
  5. Arcos-García, Á.; Álvarez-García, J.A.; Soria-Morillo, L.M. Deep neural network for traffic sign recognition systems: An analysis of spatial transformers and stochastic optimisation methods. Neural Netw. Off. J. Int. Neural Netw. Soc. 2018, 99, 158–165. [Google Scholar] [CrossRef] [PubMed]
  6. Min, W.; Liu, R.; He, D.; Han, Q.; Wei, Q.; Wang, Q. Traffic Sign Recognition Based on Semantic Scene Understanding and Structural Traffic Sign Location. IEEE Trans. Intell. Transp. Syst. 2022, 23, 15794–15807. [Google Scholar] [CrossRef]
  7. Wang, Z.; Wang, J.; Li, Y.; Wang, S. Traffic Sign Recognition With Lightweight Two-Stage Model in Complex Scenes. IEEE Trans. Intell. Transp. Syst. 2022, 23, 1121–1131. [Google Scholar] [CrossRef]
  8. Zhu, Y.; Yan, W.Q. Traffic sign recognition based on deep learning. Multimed. Tools Appl. 2022, 81, 17779–17791. [Google Scholar] [CrossRef]
  9. Zhu, Z.; Liang, D.; Zhang, S.; Huang, X.; Li, B.; Hu, S. Traffic-sign detection and classification in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA; pp. 2110–2118. [Google Scholar] [CrossRef]
  10. Philipsen, M.P.; Jensen, M.B.; Mogelmose, A.; Moeslund, T.B.; Trivedi, M.M. Traffic Light Detection: A Learning Algorithm and Evaluations on Challenging Dataset. In Proceedings of the 2015 IEEE 18th International Conference on Intelligent Transportation Systems, Gran Canaria, Spain, 15–18 September 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 2341–2345. [Google Scholar] [CrossRef]
  11. Almeida, T.; Macedo, H.; Matos, L.; Prado, B.; Bispo, K. Frequency Maps as Expert Instructions to lessen Data Dependency on Real-time Traffic Light Recognition. In Proceedings of the 2020 International Conference on Computational Science and Computational Intelligence (CSCI), Las Vegas, NV, USA, 16–18 December 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1463–1468. [Google Scholar] [CrossRef]
  12. Behrendt, K.; Novak, L.; Botros, R. A deep learning approach to traffic lights: Detection, tracking, and classification. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; IEEE: New York, NY, USA, 2017; pp. 1370–1377. [Google Scholar] [CrossRef]
  13. Wang, Q.; Zhang, Q.; Liang, X.; Wang, Y.; Zhou, C.; Mikulovich, V.I. Traffic lights detection and recognition method based on the improved YOLOv4 algorithm. Sensors 2022, 22, 200. [Google Scholar] [CrossRef]
  14. Ning, Z.; Wang, H.; Li, S.; Xu, Z. YOLOv7-RDD: A Lightweight Efficient Pavement Distress Detection Model. IEEE Transation Intel. Transp. Syst. 2024, 25, 6994–7003. [Google Scholar] [CrossRef]
  15. Yang, Y.; Wang, H.; Xu, Z. A method for surveying road pavement distress based on front-view image data using a lightweight segmentation approach. J. Comput. Civ. Eng. 2024, 38, 04024026. [Google Scholar] [CrossRef]
  16. Lu, L.; Wang, H.; Wan, Y.; Xu, F. A Detection Transformer-Based Intelligent Identification Method for Multiple Types of Road Traffic Safety Facilities. Sensors 2024, 24, 3252. [Google Scholar] [CrossRef]
  17. Xie, F.; Zheng, G. Traffic Sign Object Detection with the Fusion of SSD and FPN. In Proceedings of the 2023 IEEE International Conference on Electrical, Automation and Computer Engineering (ICEACE), Changchun, China, 29–31 December 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 995–998. [Google Scholar] [CrossRef]
  18. Wang, W.; He, F.; Li, Y.; Tang, S.; Li, X.; Xia, J.; Lv, Z. Data information processing of traffic digital twins in smart cities using edge intelligent federation learning. Inf. Process. Manag. 2023, 60, 103171. [Google Scholar] [CrossRef]
  19. Purwar, S.; Chaudhry, R. A Comprehensive Study on Traffic Sign Detection in ITS. In Proceedings of the 2023 International Conference on Disruptive Technologies (ICDT), Greater Noida, India, 11–12 May 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 173–179. [Google Scholar] [CrossRef]
  20. Bu, T.; Zhu, J.; Ma, T. A UAV Photography–Based Detection Method for Defective Road Marking. J. Perform. Constr. Facil. 2022, 36, 04022035. [Google Scholar] [CrossRef]
  21. Fang, L.; Shen, G.; Luo, H.; Chen, C.; Zhao, Z. Automatic Extraction of Roadside Traffic Facilities From Mobile Laser Scanning Point Clouds Based on Deep Belief Network. IEEE Trans. Intell. Transp. Syst. 2021, 22, 1964–1980. [Google Scholar] [CrossRef]
  22. Thanh Ha, T.; Chaisomphob, T. Automated Localization and Classification of Expressway Pole-Like Road Facilities from Mobile Laser Scanning Data. Adv. Civ. Eng. 2020, 2020, 5016783. [Google Scholar] [CrossRef]
  23. Jiang, X.; Cui, Q.; Wang, C.; Wang, F.; Zhao, Y.; Hou, Y.; Zhuang, R.; Mei, Y.; Shi, G. A Model for Infrastructure Detection along Highways Based on Remote Sensing Images from UAVs. Sensors 2023, 23, 3847. [Google Scholar] [CrossRef]
  24. Liu, Y.; Shi, G.; Li, Y.; Zhao, Z. M-YOLO: Traffic Sign Detection Algorithm Applicable to Complex Scenarios. Symmetry 2022, 14, 952. [Google Scholar] [CrossRef]
  25. Sanjeewani, P.; Verma, B. Optimization of Fully Convolutional Network for Road Safety Attribute Detection. IEEE Access 2021, 9, 120525–120536. [Google Scholar] [CrossRef]
  26. Yang, Z.; Zhao, C.; Maeda, H.; Sekimoto, Y. Development of a Large-Scale Roadside Facility Detection Model Based on the Mapillary Dataset. Sensors 2022, 22, 9992. [Google Scholar] [CrossRef]
  27. Zhang, X.; Hsieh, Y.-A.; Yu, P.; Yang, Z.; Tsai, Y.J. Multiclass Transportation Safety Hardware Asset Detection and Segmentation Based on Mask-RCNN with RoI Attention and IoMA-Merging. J. Comput. Civ. Eng. 2023, 37, 04023024. [Google Scholar] [CrossRef]
  28. Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS–improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5561–5569. [Google Scholar] [CrossRef]
  29. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 3213–3223. [Google Scholar] [CrossRef]
  30. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The kitti dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
  31. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11621–11631. [Google Scholar] [CrossRef]
  32. Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B.; et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2446–2454. [Google Scholar] [CrossRef]
  33. Yu, F.; Chen, H.; Wang, X.; Xian, W.; Chen, Y.; Liu, F. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2636–2645. [Google Scholar] [CrossRef]
  34. Pan, X.; Shi, J.; Luo, P.; Wang, X.; Tang, X. Spatial as deep: Spatial CNN for traffic scene understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar] [CrossRef]
  35. Huang, X.; Wang, P.; Cheng, X.; Zhou, D.; Geng, Q.; Yang, R. The Apolloscape open dataset for autonomous driving and its application. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2702–2719. [Google Scholar] [CrossRef] [PubMed]
  36. Johner, F.M.; Wassner, J. Efficient evolutionary architecture search for CNN optimization on GTSRB. In Proceedings of the 2019 18th IEEE International Conference on Machine Learning And Applications (ICMLA), Boca Raton, FL, USA, 16–19 December 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 56–61. [Google Scholar] [CrossRef]
  37. de Charette, R.; Nashashibi, F. Real time visual traffic lights recognition based on Spot Light Detection and adaptive traffic lights templates. In Proceedings of the 2009 IEEE Intelligent Vehicles Symposium, Xi’an, China, 3–5 June 2009; pp. 358–363. [Google Scholar] [CrossRef]
  38. Zhang, J.; Zou, X.; Kuang, L.D.; Wang, J.; Sherratt, R.S.; Yu, X. CCTSDB 2021: A more comprehensive traffic sign detection benchmark. Hum.-Centric Comput. Inf. Sci. 2022, 12, 23. [Google Scholar]
  39. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; Available online: http://arxiv.org/pdf/2304.08069v3 (accessed on 21 July 2024).
  40. Terven, J.; Córdova-Esparza, D.-M.; Romero-González, J.-A. A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716. [Google Scholar] [CrossRef]
  41. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar] [CrossRef]
  42. Jiang, S.; Wang, H.; Ning, Z.; Li, S. Lightweight pruning model for road distress detection using unmanned aerial vehicles. Autom. Constr. 2024, 168, 105789. [Google Scholar] [CrossRef]
  43. Xu, F.; Wan, Y.; Ning, Z.; Wang, H. Comparative Study of Lightweight Target Detection Methods for Unmanned Aerial Vehicle-Based Road Distress Survey. Sensors 2024, 24, 6159. [Google Scholar] [CrossRef]
  44. Wang, G.; Chen, Y.; An, P.; Hong, H.; Hu, J.; Huang, T. UAV-YOLOv8: A small-object-detection model based on improved YOLOv8 for UAV aerial photography scenarios. Sensors 2023, 23, 7190. [Google Scholar] [CrossRef]
  45. Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. Yolov9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
  46. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  47. Boesch, G. YOLOv11: A New Iteration of “You Only Look Once”. Viso.ai, 2024. Available online: https://viso.ai/computer-vision/yolov11/ (accessed on 10 October 2024).
  48. Ultralytics. Ultralytics yolov11. 2024. Available online: https://docs.ultralytics.com/models/yolo11/ (accessed on 10 October 2024).
  49. Cheng, T.; Song, L.; Ge, Y.; Liu, W.; Wang, X.; Shan, Y. Yolo-world: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16901–16911. [Google Scholar]
  50. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. DINO: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
  51. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  52. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5693–5703. [Google Scholar] [CrossRef]
  53. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. Repvgg: Making VGG-style convnets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13733–13742. [Google Scholar]
  54. Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Zhang, S.; Chen, K. Rtmdet: An empirical study of designing real-time object detectors. arXiv 2022, arXiv:2212.07784. [Google Scholar]
  55. Xu, X.; Jiang, Y.; Chen, W.; Huang, Y.; Zhang, Y.; Sun, X. Damo-yolo: A report on real-time object detection design. arXiv 2022, arXiv:2211.15444. [Google Scholar]
  56. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar] [CrossRef]
  57. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar] [CrossRef]
  58. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar] [CrossRef]
  59. Jiang, Y.; Tan, Z.; Wang, J.; Sun, X.; Lin, M.; Li, H. Giraffedet: A heavy-neck paradigm for object detection. arXiv 2022, arXiv:2202.04256. [Google Scholar]
  60. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A New Backbone that can Enhance Learning Capability of CNN. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020. [Google Scholar] [CrossRef]
  61. Zhang, X.; Zeng, H.; Guo, S.; Zhang, L. Efficient long-range attention network for image super-resolution. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 649–667. [Google Scholar] [CrossRef]
Figure 1. Research flow chart of this paper.
Figure 2. The framework of RT-DETR [39].
Figure 3. The key modules: (a) the AIFI module and (b) the fusion module in CCFF [39].
Figure 4. The framework of RT-DETR-GFPN.
Figure 5. Data samples. From top to bottom, the images in rows 1–3 were collected in Chongqing, Ningbo, and Shanghai, respectively.
Figure 6. TSF samples: (a) guardrail; (b) water surround barrier (WSB); (c) light; (d) traffic sign (board) with traffic warning and guide information in Chinese shown on the board; (e) gantry; (f) traffic rod (rod) with traffic warning information in Chinese [16].
Figure 7. Categorized detection precision results.
Figure 8. Confusion matrices: (a) RT-DETR; (b) YOLO-World.
Figure 9. Categorized recall results.
Figure 10. Detection samples of Shanghai data.
Figure 11. Detection samples of Chongqing data. Left: RT-DETR; Right: RT-DETR-RepGFPN.
Table 1. An overview of relevant datasets.
Dataset | Data Collection Area | Description | Data Type | Application Scenarios
Cityscapes [29] | Berlin, Germany, etc. | Cityscape dataset | Images, segmentation labeling | Image segmentation, scene understanding
KITTI [30] | Karlsruhe, Germany, etc. | The largest computer vision algorithm evaluation dataset for autonomous driving scenarios in the world | Images, LiDAR data, inertial measurement unit data | Evaluation of stereoscopic images, 3D object detection, 3D tracking, etc.
nuScenes [31] | Boston, USA; Singapore | Large-scale multimodal dataset for autonomous driving research | Images, LiDAR data, inertial measurement unit data, etc. | Target detection, target tracking, image segmentation, etc.
Waymo Open [32] | Six cities in the USA | Large-scale sensor dataset for autonomous driving research | Images, LiDAR data, inertial measurement unit data | Detection, tracking, motion prediction, and planning
BDD100K [33] | New York and San Francisco, USA | Large-scale dataset for autonomous driving perception and understanding | Images, LiDAR data, inertial measurement unit data | Target detection, image segmentation, behavioral recognition, etc.
CULane [34] | Beijing, China | Lane line detection and tracking dataset for automated driving research | Images, LiDAR data, inertial measurement unit data | Lane line detection and tracking
ApolloScape [35] | Four cities in China | Large-scale multimodal dataset for autonomous driving from Baidu Inc. | Images, LiDAR data, inertial measurement unit data | Target detection, image segmentation, target tracking, etc.
GTSRB [36] | Multiple cities in Germany | German traffic sign recognition benchmark | Images, bounding box labeling | Traffic sign detection
TT100k [9] | Several cities in China | Tsinghua-Tencent traffic sign dataset | Images, bounding box labeling | Traffic sign detection
LaRa [37] | Riga, Latvia | TSR dataset | Images, bounding box labeling | Traffic signal detection
LISA [10] | California, USA | TLR dataset | Images, bounding box labeling | Traffic signal detection
TSF-CQU [16] | Shanghai, Chongqing, and Ningbo, China | Traffic facility dataset | Images, bounding box labeling | Target detection, image segmentation, target tracking, etc.
CCTSDB 2021 [38] | China | TSR dataset | Images; bounding box labeling; and attributes including category meanings (three types), weather conditions (six types), and sign sizes (five types) | Traffic sign detection
Table 2. The data description for the experiment.
Category | Number | Training Set
Rod | 1897 | 1661
Board | 2961 | 2530
Light | 1547 | 1337
Guardrail | 1483 | 1275
WSB | 239 | 200
Gantry | 283 | 241
Total | 8410 | 7244
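The split implied by Table 2 can be verified with simple arithmetic. The minimal Python sketch below recomputes the per-category and overall training shares from the table values; it assumes, only for illustration, that all instances outside the training set form the held-out split, since the table does not restate the validation/test partition.

```python
# Illustrative check of the split in Table 2 (assumption: instances outside the
# training set form the held-out split; the exact partition is not restated here).
counts = {
    "Rod": (1897, 1661), "Board": (2961, 2530), "Light": (1547, 1337),
    "Guardrail": (1483, 1275), "WSB": (239, 200), "Gantry": (283, 241),
}

total = sum(n for n, _ in counts.values())        # 8410 labeled instances
total_train = sum(t for _, t in counts.values())  # 7244 training instances

for name, (n_all, n_train) in counts.items():
    print(f"{name:<10} training share = {n_train / n_all:.1%}")

print(f"Overall: {total_train}/{total} = {total_train / total:.1%} for training, "
      f"{total - total_train} instances held out")
```

Roughly 86% of the labeled instances are used for training, with the minority classes (WSB and gantry) split in approximately the same proportion as the majority classes.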
Table 3. Evaluation results.
Models | Params (M) | FLOPs (G) | TT for 100 Epochs (h) | GMT (G) | mAP
DINO | 46.606 | 279 | 9.58 | 9.75 | 0.806
RT-DETR | 20.094 | 58.6 | 4.2 | 3.48 | 0.811
Yolov7-tiny | 6.021 | 13.1 | 1.62 | 11.7 | 0.800
Yolov9-t | 2.618 | 10.7 | 1.28 | 5.8 | 0.789
Yolov10-n | 2.696 | 8.2 | 0.4 | 5.5 | 0.753
Yolov11-n | 2.583 | 6.3 | 0.27 | 5.2 | 0.788
Yolo-World | 12.749 | 33.3 | 0.58 | 24.6 | 0.807
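As a quick arithmetic check on Table 3, the sketch below (illustrative Python, with the parameter counts and mAP values copied from the table, not the authors' evaluation code) reproduces the roughly 57% parameter reduction of RT-DETR relative to DINO and ranks the models by mAP.

```python
# Consistency check on Table 3 (parameters in millions, mAP as reported).
params = {"DINO": 46.606, "RT-DETR": 20.094, "Yolov7-tiny": 6.021,
          "Yolov9-t": 2.618, "Yolov10-n": 2.696, "Yolov11-n": 2.583,
          "Yolo-World": 12.749}
map_score = {"DINO": 0.806, "RT-DETR": 0.811, "Yolov7-tiny": 0.800,
             "Yolov9-t": 0.789, "Yolov10-n": 0.753, "Yolov11-n": 0.788,
             "Yolo-World": 0.807}

reduction = 1 - params["RT-DETR"] / params["DINO"]
print(f"RT-DETR vs. DINO parameter reduction: {reduction:.0%}")  # about 57%

# Rank the candidates by accuracy while keeping model size visible.
for name in sorted(map_score, key=map_score.get, reverse=True):
    print(f"{name:<12} mAP = {map_score[name]:.3f}, params = {params[name]:.3f} M")
```

The ranking makes the trade-off explicit: RT-DETR attains the highest mAP, while YOLO-World reaches a nearly identical mAP with the shortest training time among the accurate models.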
Table 4. Model improvement results.
Models | Params (M) | FPS | mAP | Precision | Recall
RT-DETR | 20.0941 | 153 | 0.811 | 0.594 | 0.846
RT-DETR-RepGFPN | 24.9633 | 147 | 0.823 | 0.584 | 0.856
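The effect of the RepGFPN neck in Table 4 can also be read as relative changes. The following sketch (illustrative Python, values copied from Table 4; the precision and recall formulas shown in the comments are the generic detection definitions, not the authors' evaluation code) summarizes what the modification costs and what it gains.

```python
# Relative changes between RT-DETR and RT-DETR-RepGFPN (values from Table 4).
baseline = {"params_M": 20.0941, "FPS": 153, "mAP": 0.811,
            "precision": 0.594, "recall": 0.846}
improved = {"params_M": 24.9633, "FPS": 147, "mAP": 0.823,
            "precision": 0.584, "recall": 0.856}

for metric in baseline:
    delta = improved[metric] - baseline[metric]
    print(f"{metric:<10} {baseline[metric]:>9.4f} -> {improved[metric]:>9.4f} "
          f"({delta / baseline[metric]:+.1%})")

# Generic definitions behind the Precision and Recall columns (illustrative only):
#   precision = TP / (TP + FP)
#   recall    = TP / (TP + FN)
```

In this reading, the RepGFPN neck adds about 24% more parameters and costs a few frames per second, in exchange for a 1.2-point gain in mAP and a 1-point gain in recall, which is consistent with the improved detection of rods, boards, and water surround barriers noted in the abstract.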