YOLOv7-RAR for Urban Vehicle Detection

Aiming at the problems of the YOLOv7 algorithm in vehicle detection on urban roads, namely a high missed-detection rate, weak perception of small targets in perspective, and insufficient feature extraction, the YOLOv7-RAR recognition algorithm is proposed. The algorithm improves YOLOv7 in three directions. First, in view of the insufficient nonlinear feature fusion of the original backbone network, the Res3Unit structure is used to reconstruct the backbone network of YOLOv7 so that the network architecture can capture more nonlinear features. Second, because urban roads contain many interfering backgrounds and the original network is weak at localizing targets such as vehicles, a plug-and-play hybrid attention module, ACmix, is added after the SPPCSPC layer of the backbone network to strengthen the network's attention to vehicles and reduce the interference of other targets. Finally, because the receptive field of the original network narrows as the model deepens, leading to a high miss rate for small targets, the Gaussian receptive field scheme of the RFLA (Gaussian-receptive-field-based label assignment) module is used at the connection between the feature fusion area and the detection head to enlarge the receptive field of the model for small objects in the image. Combining the three improvements, and taking the first letter of each, the improved algorithm is named YOLOv7-RAR. Experiments show that on urban roads with crowded vehicles and varied weather, the average detection accuracy of the YOLOv7-RAR algorithm reaches 95.1%, which is 2.4% higher than that of the original algorithm; its AP50:90 performance is 12.6% higher than that of the original algorithm.
The running speed of the YOLOv7-RAR algorithm reaches 96 FPS, which meets the real-time requirements of vehicle detection; hence, the algorithm can be better applied to vehicle detection.


Introduction
Traffic congestion is a common occurrence in cities. On one hand, it is related to urban road design; on the other hand, it is related to human driving. Drivers depend entirely on their driving experience and may drive for long periods, which causes visual fatigue and can lead to accidents. It is therefore of great significance to study a method that assists, or even replaces, the human eye and reliably performs automatic vehicle recognition and detection.
Vehicle detection and recognition is a popular research direction in computer vision with wide application prospects in automatic driving. However, the real-time acquisition of road vehicle images by onboard cameras is affected by camera angles and the distance between vehicle bodies; occlusion, blur, dark lighting, and small target sizes all arise, so the recognition rate is low. In order to improve the recognition rate, Zha et al. [1] studied image information of vehicles in parking lots, trained a classifier on manually extracted features, and matched the vehicle features of interest to obtain better recognition results. Amit et al. [2] proposed a strong classifier based on a machine learning algorithm that uses more features to form a better decision boundary and fewer features to exclude a large number of negative samples. Feature maps are used for prediction, and, similar to YOLOv3, anchors are generated on the feature map at multiple different scales [35]. At present, the YOLO algorithm is widely used in industrial production and has become the mainstream target detection algorithm; therefore, the application framework used in this paper is the YOLO algorithm. The best-performing version at present is YOLOv7, proposed by Wang et al. in 2022 [36]. However, there is still room for improvement in its detection accuracy.
This paper proposes improvements in three directions. The first is the network model architecture. Most literature on high-speed architecture design mainly considers a model's number of parameters, computation amount, and computation density. Ma et al. [37] further analyzed the influence of the input-output channel ratio, the number of architecture branches, and element-wise operations on network inference speed from the perspective of memory access cost. Dollár et al. [38] additionally considered activations when scaling models, that is, the number of elements in the output tensors of convolution layers. These gradient analysis methods enable faster and more accurate inference, and the design of the ELAN network leads to the conclusion that a deeper network can learn and converge effectively when the shortest and longest gradient paths are controlled. Based on this conclusion, the Res3Unit module is proposed in this paper.
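The gradient-path argument above can be illustrated with a toy numerical example (not from the paper): stacking attenuating layers makes the end-to-end gradient vanish, while adding an identity shortcut in every layer keeps the gradient usable regardless of depth.

```python
# Toy illustration of why shortcut branches keep gradients healthy in deep
# stacks: each residual layer adds an identity path, so the end-to-end
# derivative does not collapse to zero as depth grows.

def plain_layer(x, w=0.1):
    return w * x          # an attenuating linear layer

def residual_layer(x, w=0.1):
    return x + w * x      # identity shortcut plus the same layer

def end_to_end_derivative(layer, depth, x=1.0, eps=1e-6):
    """Numerical derivative of `depth` stacked layers at input x."""
    def forward(v):
        for _ in range(depth):
            v = layer(v)
        return v
    return (forward(x + eps) - forward(x - eps)) / (2 * eps)

plain = end_to_end_derivative(plain_layer, depth=20)
resid = end_to_end_derivative(residual_layer, depth=20)
print(f"plain 20-layer gradient:    {plain:.3e}")   # ~0.1**20, vanishes
print(f"residual 20-layer gradient: {resid:.3e}")   # ~1.1**20, stays usable
```

The same reasoning motivates giving every fusion branch a short path back to the input.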
The second is the addition of attention mechanisms. Traditional attention mechanisms, such as CBAM and SENet, are usually used to enhance convolutional modules [39,40]. Recently, self-attention modules such as SAN and BoTNet have been proposed to replace traditional convolutions. However, the relationship between self-attention and convolution had not been discovered and utilized. Xuran Pan et al. [41] found that the two modules depend heavily on the same 1 × 1 convolution operation, so they proposed the ACmix module, which reuses and shares the features obtained by both modules and aggregates the intermediate features. Compared with pure convolution or pure self-attention modules, this module has less overhead. Hence, this article adds ACmix to the network structure.
Finally, in order to enhance the network's receptive field for distant small targets, the RFLAGauss module is introduced. The earliest work on enhancing the receptive field for target objects is the RFBNet model proposed by Songtao Liu et al. [42], first published at ECCV 2018, which mainly adds dilated convolution layers on top of an Inception-style block to effectively enlarge the receptive field. In this paper, we introduce the newer RFLAGauss module, proposed by Chang Xu et al. [43] in 2022, which targets the characteristics of small objects in perspective: they occupy few pixels in the whole image and offer limited features to collect, and their ground-truth boxes overlap almost no anchor boxes (that is, IoU = 0) and contain no anchor points, resulting in a lack of positive samples for small objects. RFLA introduces new prior knowledge based on the Gaussian distribution and establishes a label assignment strategy based on the Gaussian receptive field, which solves the small-object recognition problem. Therefore, this paper introduces this module to improve the detection of small targets.
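The positive-sample starvation that RFLA addresses can be reproduced with a few lines of stand-alone Python. The 640 × 640 image, the stride-32 anchor grid, and the 6 × 6-pixel target below are hypothetical numbers chosen for illustration, not values from the paper:

```python
# Sketch of the positive-sample problem for tiny objects: on a coarse anchor
# grid, a tiny ground-truth box matches every anchor far below the usual 0.5
# IoU threshold and contains no anchor point at all.

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

# Stride-32 grid of 64x64 anchors centred on the grid points of a 640x640 image.
anchors = [(cx - 32, cy - 32, cx + 32, cy + 32)
           for cx in range(16, 640, 32) for cy in range(16, 640, 32)]

tiny_gt = (100, 100, 106, 106)   # a 6x6-pixel distant vehicle
best = max(iou(tiny_gt, a) for a in anchors)
points_inside = sum(1 for cx in range(16, 640, 32) for cy in range(16, 640, 32)
                    if 100 <= cx <= 106 and 100 <= cy <= 106)
print(f"best anchor IoU with the 6x6 target: {best:.4f}")
print(f"anchor points inside the 6x6 target: {points_inside}")
```

With these numbers the best IoU stays far below a 0.5 positive-matching threshold and no anchor point lands inside the box, so the tiny target receives no positive sample under IoU- or point-based assignment.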
In this paper, the YOLOv7 algorithm is improved and the YOLOv7-RAR algorithm is proposed. To address the insufficient fusion of the network's nonlinear features, the Res3Unit structure is proposed to reconstruct the backbone network of YOLOv7. To address the network's weak vehicle localization, the plug-and-play ACmix module is added between the backbone network and the detection head. The RFLAGauss module is used to counter the shrinking of the receptive field as the network model deepens, and four groups of ablation experiments are conducted to compare the performance of the improved model.

Problem Description
At present, vehicle detection and recognition algorithms tend toward high-performance designs; that is, algorithms with both high accuracy and high-speed image processing. Among the current strong object detection algorithms, both single-stage and two-stage detectors perform only moderately in vehicle detection and recognition; in particular, their detection speed is poor, which does not meet vehicles' demand for high real-time performance. Table 1 shows the performance of some algorithms on the UA-DETRAC dataset; as can be seen from the table, the real-time performance of most algorithms is not high, and the detection accuracy of the algorithms with high real-time performance is low. The YOLO algorithm, however, has gradually improved with each model iteration. At present, the YOLO series has been updated to YOLOv7, and its excellent architecture is used here for further improvement in order to obtain better vehicle detection performance.

YOLOv7 Algorithm
The authors of YOLOv7 are Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. One of the improvements of YOLOv7 is that the activation function is changed from LeakyReLU to Swish. Other basic modules are optimized by drawing on the residual design idea, but the basic architecture of the network has not changed much and still includes three parts: backbone, neck, and head.
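For reference, the two activations can be sketched in plain Python; silu below is Swish with β = 1, the smooth, non-monotonic curve that replaces the piecewise-linear LeakyReLU:

```python
import math

# LeakyReLU vs. Swish/SiLU: LeakyReLU is piecewise linear with a small
# negative slope; SiLU is x * sigmoid(x), smooth everywhere.

def leaky_relu(x, slope=0.1):
    return x if x >= 0 else slope * x

def silu(x):
    """Swish with beta = 1, also known as SiLU: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  leaky_relu={leaky_relu(x):+.4f}  silu={silu(x):+.4f}")
```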

Backbone
DarkNet, the basic backbone network of the YOLO algorithm, was built by Joseph Redmon, and later versions of the YOLO algorithm build on its architecture. The backbone network of YOLOv7 includes the CBS, E-ELAN, MP, and SPPCSPC modules. CBS, as the most basic module, is integrated into the other modules.

Feature Fusion Zone
The feature fusion layer enables the network to better learn the features extracted by the backbone network. Features of different granularities are learned separately and then merged centrally, so that as many image features as possible are learned.

Detection Head
The YOLOv7 algorithm retains the advantages of previous versions and keeps three detection heads, which detect and output the predicted category probability, confidence, and predicted frame coordinates of the target object. The detection heads output three feature scales: 20 × 20, 40 × 40, and 80 × 80, which correspond, respectively, to large, medium, and small targets.
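The relation between input resolution, stride, and grid size can be checked with simple arithmetic (a 640 × 640 input is assumed here, matching the training setup described later):

```python
# A 640x640 input downsampled by strides 32, 16, and 8 yields the 20x20,
# 40x40, and 80x80 head grids; each cell of a grid covers stride x stride
# input pixels, which is why the coarse grid suits large targets and the
# fine grid suits small ones.

INPUT = 640
for stride in (32, 16, 8):
    grid = INPUT // stride
    print(f"stride {stride:2d} -> {grid}x{grid} grid, "
          f"one cell spans {stride}x{stride} input pixels")
```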

Backbone
The original backbone network first stacks four CBS modules; four convolution operations are performed on the input image to extract the underlying features, and then fine-grained features are extracted by the MP and E-ELAN modules. However, such a structure still reuses a lot of repeated feature information and loses finer-grained features [44,45], which hinders the network from learning more nonlinear features. In order to further reduce the use of repeated features and deepen the extraction of fine-grained feature information, this paper proposes an improved network module, Res3Unit, based on ELAN. Its main idea is to let the network obtain as many nonlinear features as possible while reducing the reuse of repeated features. The module sets up multiple fusion branches, which reduce the use of repeated features and fuse the features collected by the upper layer at a finer granularity [36]. The Res3Unit structure is shown in Figure 1. A picture of a car was selected for testing, and the results are shown in Figure 2. Three stages of features to be sampled are selected, namely, stage6_E-ELAN_features, stage8_E-ELAN_features, and stage12_SPPCSPC_features. Compared with the original backbone network, the improved backbone network extracts the nonlinear features of the vehicles in the image more fully and clearly, indicating the effectiveness of the improved algorithm.

Mixed Attention Mechanism
The self-attention module uses a weighted average operation based on the context of the input features, dynamically computing attention weights through a similarity function between relevant pixel pairs. This flexibility allows the attention module to adaptively focus on different areas and capture more features. The study of early attention mechanisms such as SENet and CBAM shows that self-attention can be used as an enhancement of the convolution module. Decomposing the operations of the two module types shows that they largely depend on the same 1 × 1 convolution operation.

As shown in Figure 3, the ACmix module is added after the SPPCSPC module of the backbone network to enhance the backbone network's feature perception and location information for small targets of distant vehicles and to reduce attention to the interfering background. As shown in Figure 4 [41], since both modules share the same 1 × 1 convolution operation, the projection needs to be computed only once, and the resulting intermediate feature maps are reused by the two different aggregation operations.
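A schematic, stdlib-only sketch of this sharing idea follows. It is not the authors' implementation: the tiny three-pixel "feature map", the stand-in shift-and-add kernel in the convolution branch, and the fixed 0.5/0.5 mixing weights are all illustrative assumptions.

```python
import math

# ACmix-style sharing: one 1x1 projection is computed once, then aggregated
# two ways (a convolution-like branch and a self-attention branch), and the
# two outputs are mixed.

def project_1x1(feat, w):
    """Shared 1x1 'convolution': the same linear map applied at every pixel."""
    return [[sum(wi * fi for wi, fi in zip(row, f)) for row in w] for f in feat]

def conv_branch(proj):
    # Stand-in for shift-and-add kernel aggregation: average each pixel
    # with its left neighbour (a tiny 1D 'kernel' over the pixel sequence).
    return [[(a + b) / 2 for a, b in zip(proj[max(i - 1, 0)], p)]
            for i, p in enumerate(proj)]

def attention_branch(proj):
    # Dot-product self-attention over the same projected features.
    out = []
    for q in proj:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in proj]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wi * v[d] for wi, v in zip(w, proj)) / z
                    for d in range(len(q))])
    return out

feat = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # 3 'pixels', 2 channels
w = [[0.5, 0.5], [1.0, -1.0]]                    # shared projection weights
proj = project_1x1(feat, w)                      # computed once ...
mixed = [[0.5 * c + 0.5 * a for c, a in zip(cp, ap)]      # ... reused twice
         for cp, ap in zip(conv_branch(proj), attention_branch(proj))]
print(mixed)
```

The point of the sketch is structural: `proj` is produced a single time and consumed by both branches, which is where ACmix saves overhead compared with running a convolution module and an attention module independently.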


Enhancing the Network Receptive Field
The RFB module, proposed by Songtao Liu et al., was the earliest module to enhance the receptive field in this way. It simulates the receptive field of human vision to strengthen the feature extraction ability of the network. It combines different receptive fields by using different convolution kernels and strides, connects 1 × 1 convolutions to reduce dimensionality, and finally forms a hybrid superposition of different receptive fields. Its module structure is shown in Figure 5 [42].
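The receptive-field growth from dilation can be verified with the standard stacked-layer recurrence. The 1-3-5 dilation rates below are illustrative, and the layers are stacked for simplicity, whereas RFB actually combines its dilated branches in parallel:

```python
# Receptive-field arithmetic: dilating a 3x3 kernel widens the receptive
# field without extra parameters. Standard recurrence for stacked layers:
#   rf += (kernel - 1) * dilation * jump;  jump *= stride.

def receptive_field(layers):
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf

plain   = [(3, 1, 1)] * 3                      # three plain 3x3 convs
dilated = [(3, 1, 1), (3, 1, 3), (3, 1, 5)]    # growing dilation rates
print("plain 3x3 stack RF:  ", receptive_field(plain))    # 7
print("dilated 3x3 stack RF:", receptive_field(dilated))  # 19
```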

However, according to the analysis of Guo et al., the feature receptive fields learned at different scales differ, and the superposition and fusion of different receptive fields in the feature fusion area (neck) of YOLOv7 weaken the multi-scale feature expression, resulting in poor detection of smaller targets. For a small target in the image, when the receptive field of its feature points is remapped back to the input image, the effective receptive field is actually Gaussian-distributed. The gap between the prior uniform distribution and the Gaussian-distributed receptive field results in a mismatch between the ground truth and the receptive fields of the feature points assigned to it.
RFLA, published by Xu et al. at ECCV 2022, effectively solves the receptive field problem of small target recognition. Therefore, the RFLAGauss module, based on the Gaussian receptive field, is introduced in this paper to optimize the network. The principle of the model is shown in Figure 6. Firstly, feature extraction is performed, and then convolution is performed with a Gaussian kernel function. After that, the extracted features are integrated into a feature point, and the Gaussian effective receptive field is obtained.
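A simplified sketch of the matching mechanism is given below. It assumes diagonal-covariance 2D Gaussians and an illustrative 1/(1 + KL) score; the exact formulation in the RFLA paper differs in detail, and the box and receptive-field numbers are invented for illustration:

```python
import math

# RFLA-style matching idea: model both the ground-truth box and a feature
# point's effective receptive field as 2D Gaussians, score them with a
# KL-based distance, and prefer the feature point whose receptive field
# best matches the tiny box, even when IoU with every anchor is ~0.

def kl_gauss(mu1, var1, mu2, var2):
    """KL(N1 || N2) for 2D Gaussians with diagonal covariance."""
    return 0.5 * sum(v1 / v2 + (m2 - m1) ** 2 / v2 - 1 + math.log(v2 / v1)
                     for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2))

def rfd(mu1, var1, mu2, var2):
    """Receptive-field distance mapped to (0, 1]: higher = better match."""
    return 1.0 / (1.0 + kl_gauss(mu1, var1, mu2, var2))

gt = ((103.0, 103.0), (9.0, 9.0))              # tiny box as a Gaussian
near_fine  = ((104.0, 104.0), (16.0, 16.0))    # nearby fine-scale field
far_coarse = ((112.0, 112.0), (256.0, 256.0))  # offset coarse-scale field
print("fine-scale match:  ", round(rfd(*gt, *near_fine), 3))
print("coarse-scale match:", round(rfd(*gt, *far_coarse), 3))
```

Because the score is continuous rather than thresholded, a tiny ground-truth box still ranks the candidate feature points, so positive samples can be assigned where hard IoU matching would yield none.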

Dataset Selection
The dataset used in this paper is UA-DETRAC, a large-scale dataset for vehicle detection and tracking. The UA-DETRAC dataset contains 140,000 images (60% for training and 40% for testing) at a scale of 960 × 540. The dataset is mainly taken from road overpasses in Beijing and Tianjin (Beijing-Tianjin-Hebei scene), and 8250 vehicles and 1.21 million target objects are manually labeled. Figure 7 shows the types of vehicles in the dataset, including cars, buses, vans, and other types of vehicles. Figure 8 shows the distribution and vehicle types of the dataset. The weather conditions are divided into four categories, namely, cloudy, night, sunny, and rainy. Figure 9 shows the initial training samples of part of the training dataset.


Environmental Preparation
In this paper, the Pytorch framework is used as the experimental environment for algorithm training. The environment is CUDA v11.2, the Pytorch version is v1.10, the GPU version is NVIDIA GeForce RTX3090, the video memory is 25.4 GB, and the Python version is 3.9. The batch size of each batch of training is set to 32 for a total of 200 training rounds.
In order to verify the effectiveness of the three improved methods proposed in this paper, four groups of ablation experiments are used. In the first group, we retained only the Res3Unit module in the original network architecture and named the model YOLOv7-Res. In the second group, we retained only ACmix and named the improved model YOLOv7-AC. In the third group, we retained only the RFLA module and named the model YOLOv7-RF. In the fourth group, we retained all the improved modules to test the performance of the YOLOv7-RAR model. When training the network, the input image was resized to a uniform size of 640 × 640, the initial learning rate was set to 0.01, and the One Cycle Policy was used to adjust the learning rate. The parameter settings are shown in Table 2. In the experiments, the models are evaluated by the average precision (AP), AP50 (the AP value at an IoU threshold of 0.5), AP50:90, the number of images detected per second (FPS), and the amount of model computation (GFLOPs).
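A minimal sketch of a one-cycle learning-rate schedule over the 200 training epochs is given below. Only the 0.01 rate and the 200 epochs come from the setup above; the warmup fraction, start rate, and final rate are illustrative assumptions, not the paper's exact settings:

```python
import math

# One-cycle schedule sketch: ramp the learning rate up to the configured
# peak, then cosine-anneal it down to a small final value.

def one_cycle_lr(step, total_steps, lr_max=0.01, lr_start=0.001,
                 lr_final=0.0002, warmup_frac=0.3):
    warmup = int(total_steps * warmup_frac)
    if step < warmup:                      # linear ramp up to lr_max
        return lr_start + (lr_max - lr_start) * step / warmup
    # cosine anneal from lr_max down to lr_final
    t = (step - warmup) / (total_steps - warmup)
    return lr_final + (lr_max - lr_final) * 0.5 * (1 + math.cos(math.pi * t))

total = 200                                 # 200 training epochs
for epoch in (0, 30, 60, 100, 199):
    print(f"epoch {epoch:3d}: lr = {one_cycle_lr(epoch, total):.5f}")
```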

Backbone Network Improvement
A comparison of the YOLOv7-RAR backbone network using Res3Unit with the original YOLOv7 backbone network is shown in Figure 10; the most obvious improvement is in model recall. The experimental results are shown in Table 3. After using the Res3Unit module, AP50 increases by 1.7% and AP50:90 increases by 1.8%, indicating that the Res3Unit module brings a considerable improvement to the backbone network performance of the algorithm.
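For reference, the AP figures reported throughout can be computed from a precision-recall sweep, as sketched below with toy detections (not the experiments' output); AP50:90 is then the mean of such APs over a range of IoU thresholds:

```python
# AP from a confidence-sorted detection list: sweep the precision-recall
# curve against the ground-truth count and integrate the monotone
# (all-point interpolated) precision envelope.

def average_precision(dets, num_gt):
    """dets: (confidence, is_true_positive) pairs; num_gt: # of GT boxes."""
    dets = sorted(dets, reverse=True)           # highest confidence first
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in dets:
        tp += is_tp
        fp += not is_tp
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # right-to-left max makes the precision envelope monotone
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev) * p
        prev = r
    return ap

dets = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.5, False)]
print(f"AP = {average_precision(dets, num_gt=4):.4f}")   # 0.6875
```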

Adding the Mixed Attention Mechanism
The mixed attention mechanism ACmix module is added to the MP module and the E-ELAN module of the original network, and the experimental results compared with the original network are shown in Figure 11 and Table 4. Compared with the original network model, the optimization strategy in this paper reduces the model computation by 10.8%, and the model's FPS value is also higher, which fully meets the real-time requirements of urban traffic detection. This shows that the performance of the model is improved after adding the mixed attention mechanism.

Visual Analysis of the Model
The feature information of interest to the network model can be seen from the visual feature map. In order to verify the attention of the added module ACmix to the small target feature, this paper visualizes the feature map output from the first stage and the last layer of the backbone network, as shown in Figure 12. It can be seen that in the first stage, the network focuses on the extraction of the overall features, while in the last convolution layer of the backbone network, it can be seen that the network focuses on some small target features.

Receptive Field Improvement Experiment
The RFLAGauss module based on the Gaussian receptive field is added to the feature fusion area of the original network, and its performance is compared with that of the original network. The experimental results are shown in Figure 13 and Table 5. The GFLOPs of the model are reduced by 2.3% compared with the original network model, and other performance measures differ little from the original model, indicating that adding the RFLAGauss module to the feature fusion area improves the computational efficiency of the model.


Overall Network Improvement
Based on the previous three groups of ablation experiments, the three improvements are combined into a fully improved network model. Compared with the original network model, the experimental results are shown in Figure 14 and Table 6. The average accuracy of the model improves by 1.6% for AP, 2.9% for AP50, and 14.6% for AP50:90, and the running speed of the algorithm is as high as 96 FPS, which basically meets the real-time requirements of urban traffic detection. In this paper, Res3Unit is used to optimize the backbone network, and a hybrid attention mechanism is added to pay more attention to vehicle features and reduce the interference of other input features.
The RFLAGauss module is fused in the feature fusion area to enhance the receptive field of the network for small targets, and four groups of ablation experiments are carried out to prove the effectiveness of the improved model. Figure 15 shows the detection effect of the YOLOv7-RAR algorithm in different urban road environments. Table 7 shows that, compared with other network models, our network model has a faster detection speed and higher detection accuracy.


Summary and Conclusions
In this paper, an accurate and real-time detection algorithm, YOLOv7-RAR, is proposed, and four groups of ablation experiments have been conducted successively. The experiments proved that YOLOv7-RAR could well realize vehicle detection with high accuracy and speed.
Through four groups of ablation experiments, this paper draws the following conclusions:

•
Setting the structure of multiple fusion branches will reduce the use of repeated features in the network and fuse the features collected by the upper layer in a more fine-grained way.

•
The separation of the attention mechanism module and the convolution module can extract the image features as much as possible, and the aggregation use can share the collected feature information to the greatest extent.

•

Enhancing the receptive field for small targets in the distant view can reduce the miss rate of the model for vehicles in the distant view.

•

By combining the three improved mechanisms, the final average accuracy of the model reaches 95.1%, which is 2.4% higher than that of the original model, and the AP50:90 performance is improved by 12.6% compared with the original algorithm.
The image data collected by the camera have a key impact on the prediction performance of the model. In poorly lit scenes, the performance of the model degrades. Solving this acquisition problem would further improve the application results.
We hope to introduce faster and more accurate vehicle recognition algorithms in the future, to use larger datasets, and to let the algorithm recognize more types of vehicles, in order to contribute further to the field of vehicle detection. In addition, the transformer mechanism has ushered in another wave of research and has potential research value in vehicle detection applications.