Research on Deep Learning Automatic Vehicle Recognition Algorithm Based on RES-YOLO Model

With the introduction of concepts such as ubiquitous mapping, mapping-related technologies are gradually being applied in autonomous driving and target recognition. Vision measurement and remote sensing face many problems, such as the difficulty of automatic vehicle discrimination, high miss rates under multiple vehicle targets, and sensitivity to the external environment. This paper proposes an improved RES-YOLO detection algorithm to solve these problems and applies it to the automatic detection of vehicle targets. Specifically, this paper improves the detection performance of the traditional YOLO algorithm by selecting an optimized feature network and constructing an adaptive loss function. The BDD100K data set is used for training and verification. The optimized YOLO deep learning vehicle detection model is then obtained and compared with recent advanced target recognition algorithms. Experimental results show that the proposed algorithm can automatically and effectively identify multiple vehicle targets and can significantly reduce miss and false detection rates, with a local optimal accuracy of up to 95% and an average accuracy above 86% under large-data-volume detection. The average accuracy of our algorithm is higher than that of all five other algorithms tested, including the latest SSD and Faster R-CNN. In average accuracy, RES-YOLO exceeds the original YOLO by 1.0% for the small data volume and 1.7% for the large data volume. In addition, training time is shortened by 7.3% compared with the original algorithm. The network is then tested with five types of locally measured vehicle data sets and shows satisfactory recognition accuracy under different interference backgrounds. In short, the method in this paper can complete the task of vehicle target detection under different environmental interferences.


Introduction
In the 1960s, researchers began to conduct preliminary research on vehicle target detection [1,2]. It is widely agreed that vehicle target detection is the basis of high-end technologies such as traffic big data platform construction, vehicle information management, and automatic driving [3,4]. Automatic vehicle target recognition is highly difficult because of error sources such as vehicle movement, ambient light changes, and camera field-angle changes. Existing recognition methods can hardly distinguish vehicle positions in real scenes, and complex scenes still require manual visual interpretation to distinguish vehicles. Therefore, combining images with deep learning for automatic vehicle discrimination is feasible. In this work, a convolutional neural network (CNN) is mainly used to perform target-recognition tasks in real vehicle scenes.
Mainstream vehicle target detection methods mainly use object features for vehicle target recognition. These methods fall under two categories: the first collects a certain number of feature samples based on traditional manual extraction and then classifies them; the second extracts and classifies features automatically through deep learning networks. To summarize, this paper uses a RES-YOLO deep learning detection network, carries out network training on the BDD100K vehicle detection data set, selects recent improved models [20][21][22] and the latest Faster R-CNN model, compares the optimized RES-YOLO algorithm model with those traditional models, and tests its detection accuracy under small and large data volumes to verify its advantages and reliability for vehicle detection. Finally, the robustness of the algorithm in complex background environments is further proved using real measurement data from different environments.
The method proposed in this paper aims to accurately recognize vehicles in images, a fundamental problem in driverless operation and traffic flow detection. In this paper, a hardware system for small-vehicle recognition is designed and the RES-YOLO data processing algorithm is proposed. The system can be mounted on a vehicle platform for data acquisition and analysis, where it performs the vehicle identification function of a driverless system; it can also be fixed at an intersection for data collection, where it serves traffic flow statistics. Our proposed method focuses on solving the problem of accurate vehicle recognition in the driverless system. It will serve related fields that need vehicle target recognition, providing them with a vehicle recognition method that balances stability and efficiency. Additionally, our method can provide a detailed reference for solving vehicle target recognition based on the vision of tiny mobile devices. For vehicle recognition problems with more background interference, our method also performs well, which contributes to improving the stability of the driverless system.
The structure of the article is as follows. The first section introduces the basic principle of the YOLO model. The second section presents the RES-YOLO vehicle target detection network proposed in this paper. The third section explains the experimental data set, expounds on the experimental process, compares the processing effect of the algorithm with that of other methods, shows the results of the relevant experiments, and tests the RES-YOLO algorithm with localized multi-scene data. Finally, the fourth section gives the critical conclusions of the full text and further discussion.

Basic Principles of YOLO
The full name of YOLO is You Only Look Once. Its essence is to convert an object detection problem into a regression problem and directly obtain the location of the detection boundary and the detection probability from the image. Firstly, n is defined as the scale factor of the feature layer in the detection process, namely, the feature scale factor. Secondly, W and H are defined as the width and height of the image, and S1 = W/n, S2 = H/n. The YOLO algorithm first divides the image data into S1 × S2 detection grids, and each detection grid is assigned multiple target anchor frames.
When detecting each target anchor frame, if the center of the detected object falls into a grid, the surrounding grids are formed into a grid unit, the boundary of the grid unit is predicted, and a confidence score is calculated. The value of this score represents the accuracy of the object detected in the grid unit. The confidence of the detected object is defined in Formula (1):

δ = Pr(Object) × IOU (1)

where Pr(Object) represents the probability that the target object exists in the grid unit and IOU represents the coincidence rate between the frame mark predicted by the model and the actual object.
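Formula (1) can be computed directly once IOU is defined. The sketch below is a minimal illustration; the function names and example boxes are ours, not from the paper's implementation, and boxes are taken as (x1, y1, x2, y2) pixel corners.

```python
# A minimal sketch of Formula (1): confidence = Pr(Object) * IOU.
# Helper names and box format (corner coordinates) are illustrative.

def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def confidence(pr_object, pred_box, true_box):
    """Formula (1): Pr(Object) multiplied by the IOU of prediction and truth."""
    return pr_object * iou(pred_box, true_box)

# Two 10x10 boxes overlapping by half: intersection 50, union 150.
print(confidence(1.0, (0, 0, 10, 10), (5, 0, 15, 10)))
```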
Each target anchor box is determined by five main parameters: x, y, w, h, δ, where (x, y) represents the central coordinate of the target object, (w, h) represents the width and height of the target object, and the confidence δ represents the IOU between the prediction frame and any ground-truth frame. The final detected object boundary frame mark is obtained by Formula (2):

k_x = x + c_x
k_y = y + c_y
k_w = w × e^w
k_h = h × e^h (2)

where k_x, k_y, k_w, k_h represent the corresponding bounding box parameters and c_x, c_y represent the coordinates of the upper left corner of the target anchor box, as shown in Figure 1. The actual pixel coordinates of the object to be detected in the bounding box can be obtained by multiplying the bounding box parameters by the feature scale factor n. After the above process, each target anchor frame outputs N parameters t_i, i ∈ [1, N], and the occurrence probability of these N types of target objects can be calculated with the normalized exponential function, as shown in Formula (3).
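Formula (2) and the pixel-scale step can be sketched as follows. The function name and example values are illustrative; the width/height term w × e^w is reproduced literally as printed in Formula (2) above.

```python
import math

# A sketch of Formula (2) plus the pixel-scale step: (x, y, w, h) are the
# predicted anchor parameters, (c_x, c_y) the cell's upper-left corner, and
# n the feature scale factor. Names are illustrative.

def decode_box(x, y, w, h, c_x, c_y, n):
    k_x = x + c_x            # centre offset by the cell corner
    k_y = y + c_y
    k_w = w * math.exp(w)    # width/height pass through an exponential,
    k_h = h * math.exp(h)    # as printed in Formula (2)
    # Multiply by the feature scale factor n to get pixel coordinates.
    return k_x * n, k_y * n, k_w * n, k_h * n
```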
After obtaining the corresponding data, the index I is calculated according to Formula (4) and sorted; unqualified target anchor frames are filtered out by setting a threshold, and the remaining anchor frames are the target objects to be judged.
The YOLO network architecture is inspired in structure by the GoogLeNet model for image classification. The network has 24 convolution layers followed by two fully connected layers. It does not use GoogLeNet's inception modules, instead using 1 × 1 reduction layers and 3 × 3 convolution layers. The complete network structure flow is shown in Figure 2.

Basic Structure of RES-YOLO Network
The main reasons for choosing the YOLO algorithm are as follows: (1) in dealing with the regression problem, there is no need to execute complex processes, and results can be obtained quickly by running directly on the image to be detected; (2) YOLO detection is based on the whole image rather than a sliding window, and encodes features of the target and its surrounding neighborhood to improve detection accuracy; (3) the YOLO algorithm has a low decomposition ability and strong robustness for non-detection-category data.
The research mainly optimizes the feature extraction and detection networks in YOLO, selecting ResNet50 to replace the traditional YOLO backbone for feature extraction in the target recognition process. The activation_40_relu layer is used as the feature extraction layer, outputting the image data after 16× down-sampling to build the new RES-YOLO network. Some feature layers in YOLO are still used in the detection network, which ensures the excellent performance of the feature extraction network while retaining the advantages of the YOLO network. At the same time, the ResNet50 feature extraction network selected in this paper is consistent with the traditional network structure. The specific reasons for choosing ResNet50 for the improvement are discussed in Section 3.2.
The improved network structure is shown in Figure 3. The RES-YOLO network comprises five parts: the stage 0, stage 1, stage 2, stage 3, and stage 4 layers. After data input, the image of each frame is processed through a multi-layer convolution operation. The operation sequence of each layer is as follows: (1) stage 0 layer: one convolution operation and one maximum pooling operation; (2) stage 1 layer: one BTNK1 operation and two BTNK2 operations; (3) stage 2 layer: one BTNK1 operation and three BTNK2 operations; (4) stage 3 layer: one BTNK1 operation and five BTNK2 operations; (5) stage 4 layer: one BTNK1 operation and two BTNK2 operations, finally processed by the YOLO module. As shown in the rightmost sub-graph of Figure 3, the complex structure of the original ResNet-type network is condensed in the network design, and the central operation part is organized into two operation units, BTNK1 and BTNK2. After reading the data, BTNK1 first performs a two-way convolution operation: one branch continues to convolve twice more and then enters the activation function ReLU, while the other enters ReLU directly, and the results are output after activation. After reading the data, BTNK2 likewise splits into two branches: one is convolved three times and then passed to ReLU, while the other is passed to ReLU directly, and the result is output after activation.
When processing, the two operation units will call a particular memory buffer to effectively carry out the parallel operation to improve operation and processing efficiency. It is also convenient to carry out unit testing and quickly locate the fault node during debugging.
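The branch-and-merge flow of BTNK1 and BTNK2 described above can be sketched structurally as follows. This is an illustration only, not the trained network: the `conv_*` callables are placeholders for the real convolution layers, and the function names are ours.

```python
import numpy as np

# Structural sketch of the two RES-YOLO units: each splits into two branches,
# sums them, and applies ReLU. The conv placeholders stand in for real layers.

def relu(x):
    return np.maximum(x, 0.0)

def btnk1(x, conv_a1, conv_a2, conv_a3, conv_shortcut):
    # Branch 1: the initial convolution plus two more (three in total),
    # then into the ReLU merge.
    a = conv_a3(conv_a2(conv_a1(x)))
    # Branch 2: a single projection convolution, straight to the ReLU merge.
    b = conv_shortcut(x)
    return relu(a + b)

def btnk2(x, conv_a1, conv_a2, conv_a3):
    # Branch 1: three convolutions; Branch 2: identity shortcut to ReLU.
    a = conv_a3(conv_a2(conv_a1(x)))
    return relu(a + x)
```

The identity shortcut in BTNK2 is what lets the unit be stacked repeatedly within each stage without changing the tensor shape, matching the stage sequence listed above.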
Because the vehicle detection part of the YOLO module is generally a tiny CNN whose complexity is much lower than that of the feature extraction network, this paper uses some convolution layers and layers unique to YOLO v2 for optimization. Finally, the RES-YOLO network structure suitable for the vehicle detection task is designed, as shown in Appendix A. After optimization, the number of network layers is slightly reduced and the performance is improved to a certain extent; a total of 150 layers are involved. To provide a more efficient and robust detection network for vehicle detection, YOLO v3/v4/v5 and other networks with complex structures are not used. Subsequent experiments confirm that, with our optimization, there is no need to use a more complex YOLO network structure for detection. This discussion and experiment are presented later.
The details of the RES-YOLO network structure proposed in this paper are given in Appendix A. The local BTNK1 and BTNK2 structures proposed here are characteristic of the RES-YOLO network, as are the stage 0-4 layers of its processing pipeline. Finally, the YOLO module is consistent with the traditional YOLO v3/v4/v5 network structures; only a part of it is spliced in for use.

Data Enhancement
This paper enhances the data used in training. In the training process, random transformations of the original training data are used to enlarge the training set and thereby improve the effect of network training. Data enhancement is mainly carried out by randomly flipping images horizontally together with their corresponding frame label information, expanding the training data set as much as possible while preserving the information entropy of the original data. The enhancement effect is shown in Figure 4.
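The horizontal-flip enhancement with box-label correction can be sketched as follows. The box format (x, y, w, h) follows the anchor parameters of Section 1; the 50% flip probability, function name, and seed are assumptions of this sketch, not taken from the paper.

```python
import numpy as np

# Sketch of the horizontal-flip augmentation: mirror the image columns and
# reflect each box's x-centre so the label still covers the vehicle.
# Boxes are (x, y, w, h) centres in continuous pixel coordinates.

rng = np.random.default_rng(0)

def random_hflip(image, boxes, img_width, p=0.5):
    """image: H x W (x C) array; boxes: list of (x, y, w, h) tuples."""
    if rng.random() >= p:           # with probability 1 - p, leave unchanged
        return image, boxes
    flipped = image[:, ::-1]        # mirror the width axis
    new_boxes = [(img_width - x, y, w, h) for (x, y, w, h) in boxes]
    return flipped, new_boxes
```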

Constructing Adaptive Loss Function
In the deep learning networks of the YOLO series, the loss function is composed of four terms: loss_xy, loss_wh, loss_conf, and loss_class. Their respective influencing factors are the center-position coordinates of the detection frame mark, the size of the detection frame mark, the confidence, and the type of the detected object. The constructed loss function is shown in Formula (5):

Loss = loss_xy + loss_wh + loss_conf + loss_class (5)

where N_T and N_F represent the numbers of positive and negative samples, S^2 = S_w × S_h, the four loss terms carry weight coefficients corresponding to the influence of the center-position coordinates, the size and confidence of the detection frame mark, and the type of the detected object, i represents the corresponding grid number, and j represents the anchor-box number within the grid. The essence of the YOLO algorithm is to grid the whole image for one-pass detection, so the number of positive samples in a detection unit is 0 or 1, while the number of negative samples is much greater than 1. This produces a remarkable imbalance between positive and negative samples across the whole image, with a difference of up to 10^2 to 10^3 in order of magnitude. The imbalance between positive and negative samples reduces the efficiency of model training; at the same time, the large number of negative samples can swamp the positive samples, degrading the trained model. To solve this problem, the detection networks of traditional YOLO and its subsequent versions (YOLO v2, YOLO v3, YOLO v4, etc.) try to improve learning efficiency by controlling two weight proportion coefficients, λ_obj and λ_noobj, to suppress the influence of negative samples on the loss function and increase the contribution of positive samples.
However, simply assigning these two scale coefficients keeps control only within a certain threshold; otherwise, it causes overfitting or training errors. For this reason, this paper introduces an adaptive proportional coefficient to adjust λ_obj and λ_noobj, so as to control the order-of-magnitude difference between the positive and negative samples.
Firstly, from Formula (6), it is found that the confidence loss loss_conf is the core term to be optimized, since it contains both coefficients λ_obj and λ_noobj, as shown in Formula (7).
Differentiating the above formula, we find that the derivative of confidence is positively correlated with the difference between the predicted and true values, so: where ∆W represents the increment of the loss function with respect to the variable and η_W represents the learning rate. According to the above three formulas, if the difference between the numbers of positive and negative samples is too large, the gradient characteristics will not be prominent, and finally the detection will fail. Based on this, the adaptive adjustment coefficient shown below is constructed to balance the impact of this difference.
The adaptive adjustment coefficient is generated adaptively according to the difference between the predicted and actual values. Here, b effectively controls the value of the adjustment coefficient by controlling the exponential part, and α is controlled effectively by using the high-gradient property of the exponential function. When the sample under test is positive, C̃_ij = 1; when it is negative, C̃_ij = 0. Introducing the new adaptive scale factor and differentiating the resulting formula (11), we find that the larger the difference between the predicted and actual values, the larger the derivative of the loss term, and vice versa. The derivative curve of the loss function is shown in Figure 5.
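The printed expression for the adaptive coefficient did not survive reproduction here, so the sketch below is only one plausible focal-style reading consistent with the surrounding text: a weight that grows with the prediction error, raised to γ and scaled by the base b, collapsing to a constant when γ = 0 (matching the statement that γ = 0 recovers the traditional loss). The exact functional form is an assumption.

```python
# Assumed focal-style form of the adaptive adjustment coefficient: larger
# prediction error |C_pred - C_true| gives a larger weight; gamma = 0
# collapses it to the constant b, as in the traditional loss.

def adaptive_coefficient(c_pred, c_true, b=2.0, gamma=2.0):
    return b * abs(c_pred - c_true) ** gamma
```

With the paper's reported best settings b = 2 and γ = 2, a well-predicted negative sample (error 0.1) is down-weighted to 0.02, while a badly predicted one (error 0.9) keeps a weight of 1.62, which is the imbalance-suppressing behavior the text describes.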
From Figure 5, it is not difficult to see that an excessively large b results in an excessively large gradient variation between positive and negative samples, whereas an excessively small b makes the gradient variation insensitive. When γ = 0, the loss function is consistent with the traditional function, showing that Formula (10) is essentially a generalized form of the confidence loss function. Finally, after testing, the best effect is achieved when b = 2 and γ = 2. The final definition of the loss function in this model is shown in Formula (12):

Loss = loss_xy + loss_wh + loss_conf + loss_class (12)

Experimental Data Set and Sensor
The BDD100K data set is used to verify the effectiveness of the algorithm in this paper. Firstly, the data are divided into two groups: small-sample data are used to pre-verify the algorithm, and large-sample data are used to further optimize and evaluate the training model. More details of the data set are shown in Table 1. First, some data are selected from the BDD100K data set as small samples for training and testing, so as to pre-verify the feasibility and effectiveness of the algorithm for vehicle detection. Then, the full BDD100K data set is taken as the large-sample data set for model training and verification, so as to verify and compare the general performance of the algorithm. During model evaluation, the total data set is divided into a training set, a validation set, and a test set at a ratio of 6:1:3. Table 1 also reflects this train/val split.
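The 6:1:3 split described above can be sketched as follows; the function name, shuffle seed, and sample list are illustrative, not taken from the paper's code.

```python
import random

# Sketch of the 6:1:3 train/val/test split: shuffle once with a fixed seed,
# then cut the list by the given ratios.

def split_dataset(samples, ratios=(6, 1, 3), seed=42):
    items = list(samples)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # 600 100 300
```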
The data used in the pre-verification are referred to here as the small data test. This data set selects the small-car category through the annotation file "bdd100k_labels_images_val.json" of the original BDD100K data and reorganizes a customized small-batch training and test data set. For the large-sample data set, referred to here as the large data test, all the data used for vehicle target judgment in the BDD100K data set are used directly. The reconstructed sample data mainly include the path of the picture source and the location information of the vehicle target mark, as shown in Figure 6 below. As shown in Figure 7, a convenient vision sensor, which we built ourselves, is used in this study for image acquisition. The structure of the sensor system is divided into six parts: part A is the image processor, mainly used to detect vehicle targets in the collected images; part B is the terminal block, used to connect the various sensor parts; part C is the power indicator, which displays the working state of the system; part D is the power input unit, which requires a 5 V 2 A power supply to ensure that the system works normally; part E is the steering gear control board, which controls the movement of the steering gear base with two degrees of freedom; part F is the steering gear, which adjusts the viewing-angle position of the camera. Using this sensor device, vehicle image data can be collected over long periods at low power consumption, so as to prepare the data for subsequent experiments.

Determination of Feature Extraction Network
In order to better evaluate the impact of different extraction network settings on the YOLO model, a lightweight, a robust, and a complex extraction network are selected for testing. Based on the two factors of operation time and operation accuracy, the optimal network structure is selected by comparison. The leading information of the test is shown in Tables 2 and 3. Considering the training time, the small sample data group is uniformly selected for training throughout the study, and the trained models are tested respectively. Finally, the optimal network matching is selected for the next experiment.
Through experiments, the detection accuracy-loss curves and accuracy-recall curves of the three groups of network models are obtained, as shown in Figure 8. The accuracy-loss curves of the lightweight, robust, and complex networks all converge within 250 iterations, which shows that each of these networks can complete model training. However, Figure 8c shows that the complex network cannot complete the convergence of the training loss within the first 50 iterations, indicating problems such as over-fitting or an inability to fit during recognition. Therefore, it is preliminarily determined that the complex network will not be considered for vehicle recognition. At the same time, Figure 8a,b show that the lightweight AlexNet and SqueezeNet networks and the robust ResNet50 and ShuffleNet networks converge rapidly in training loss within the first 50 iterations, which confirms that these four networks are more suitable for the vehicle recognition problem. In order to further determine the optimal network, the accuracy-recall curves under the same test conditions are compared, as shown in Figure 9 below.
From Figure 9, except for the robust ResNet50 and DarkNet53 networks, which maintain a detection accuracy above 90% at recall rates greater than 0.8, the detection accuracy of all the other networks decreases significantly before the recall rate reaches 0.8, which fails the task goal of high-precision vehicle target detection. As shown in Figure 9a, when the recall rate of the lightweight networks reaches 0.6, their detection accuracy cannot exceed 85%, and in some instances cannot even exceed 70%. Figure 9b shows that virtually the same happens with the complex networks. Therefore, neither lightweight nor complex networks are suitable for vehicle detection in this paper.
Considering that the ResNet50 network shows good performance in both the precision-loss curve and the precision-recall curve, we determine that the optimal network is the robust ResNet50 network. The specific test conditions of the above networks are shown in Table 3. ResNet50 is a typical residual network structure, in which the output of an earlier layer is carried across multiple layers and added to the input of a later layer, meaning that a previous layer contributes linearly to part of the later feature layer. This design addresses the problem that efficiency and detection accuracy decrease as the network deepens. After testing, as shown in Figure 10, the ResNet50 network, as a robust network, has moderate structural complexity and the highest accuracy among the networks tested. By integrating such a network into YOLO v2 for modification, we can further optimize the detection effect for particular scenes, such as vehicle target detection, while maintaining efficiency. Finally, the ResNet50 network, with a reasonable training time and the highest accuracy, is selected as the feature extraction network for the optimized YOLO algorithm. On this basis, an improved network, RES-YOLO, is proposed for the comprehensive vehicle detection experiment.
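The cross-layer skip connection described above can be sketched in a few lines of NumPy (an illustrative toy block, not the actual ResNet50 implementation; the weights `w1` and `w2` are hypothetical placeholders):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Identity residual block: the input x skips over two weighted
    layers and is added back before the final activation, so an
    earlier layer contributes linearly to the later feature layer."""
    out = relu(x @ w1)       # first transformation
    out = out @ w2           # second transformation (no activation yet)
    return relu(out + x)     # skip connection: add the input back

# Toy usage: a 4-dimensional feature vector through one block.
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))
w1 = rng.normal(size=(4, 4)) * 0.1
w2 = rng.normal(size=(4, 4)) * 0.1
y = residual_block(x, w1, w2)
print(y.shape)  # (1, 4): the skip requires the output to keep the input shape
```

Because the skip path is an identity, gradients can flow directly to earlier layers, which is what mitigates the degradation problem in deeper networks.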

Comprehensive Experimental Process
The workflow of the optimized RES-YOLO automatic recognition method is shown in Figure 11. This method can detect vehicle targets in various image data under a variety of environmental conditions. Part (1) mainly deals with the data sets. After the data are read, they are grouped into a training set, a test set, and a verification set at a ratio of 6:3:1. Then, the anchor frames of the training set are calibrated. After calibration, the training data and their calibrations are augmented in three ways to prepare for subsequent model training.
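The 6:3:1 grouping in part (1) can be sketched as follows (a minimal illustration; the seed and helper name are our own choices, not from the paper):

```python
import random

def split_dataset(samples, seed=42):
    """Split a list of samples into train/test/validation sets at a
    6:3:1 ratio, matching the grouping described in the text."""
    rng = random.Random(seed)       # fixed seed for reproducibility
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * 0.6)
    n_test = int(n * 0.3)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    val = shuffled[n_train + n_test:]   # remaining ~10%
    return train, test, val

train, test, val = split_dataset(list(range(100)))
print(len(train), len(test), len(val))  # 60 30 10
```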
Part (2) mainly creates the network structure for the RES-YOLO deep learning model. To interface with the data sets effectively, the input size specified by the YOLO framework, [224 × 224 × 3], is used directly: the anchor box marks of the network are initialized, and the original data are resized before being input into the model. The ResNet50 network is introduced to extract the data features. After the feature matrix parameters are extracted, the model is trained on the data set to obtain the deep-learning-based vehicle target automatic recognition model and prepare for subsequent testing and task execution.
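Resizing the data to the framework's [224 × 224 × 3] input can be illustrated with a minimal nearest-neighbor sketch (a stand-in only; the actual framework applies its own resizing routine):

```python
import numpy as np

def resize_nearest(img, out_h=224, out_w=224):
    """Resize an H x W x 3 image to the YOLO input size [224, 224, 3]
    using nearest-neighbor sampling via integer index mapping."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h   # source row for each output row
    cols = np.arange(out_w) * w // out_w   # source column for each output column
    return img[rows[:, None], cols]        # fancy indexing keeps the channel axis

img = np.zeros((480, 640, 3), dtype=np.uint8)
print(resize_nearest(img).shape)  # (224, 224, 3)
```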
Part (3) focuses on vehicle target recognition and model testing. First, the ResNet50 feature extraction network and the target recognition model obtained in the second part are preloaded to detect the test set data. Finally, the results are compared against the labeled samples to obtain the accuracy-loss data, test curves, and other statistics.
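When test-set detections are compared against the labeled samples, a detection is typically counted as correct if its box overlaps the label sufficiently. A minimal intersection-over-union check (a standard matching criterion; the paper does not state its exact threshold) might look like this:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as
    (x_min, y_min, x_max, y_max); used to decide whether a
    detection matches a labeled vehicle box."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # overlap area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ≈ 0.142857
```

A detection whose IoU with some label exceeds a chosen threshold (commonly 0.5) is scored as a true positive; otherwise it is a false positive.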

Comprehensive Comparative Experiment
In order to evaluate the performance of the improved RES-YOLO algorithm, the latest dual-detection framework Faster-RCNN and the single detection framework SSD are selected. The algorithms in the new references [20][21][22] are also referenced for comparison. For the convenience of description, the methods of references [20][21][22] are introduced into comparison in Figures 12-17 of this paper. The method in reference [20] is named reference model 1, the method in reference [21] is called reference model 2, and the method in reference [22] is named reference model 3. This paper will mainly evaluate the improved YOLO algorithm from four aspects: accuracy and recall (PR) curve, accuracy-loss curve, average accuracy of test set detection, and actual detection result. When dividing the model data set, 60% of the data are selected for training, 10% are selected for verification, and 30% are used to test the trained model.

Comprehensive Comparative Experiment
In order to evaluate the performance of the improved RES-YOLO algorithm, the latest dual-detection framework Faster-RCNN and the single-detection framework SSD are selected, and the algorithms in the recent references [20][21][22] are also included for comparison. For convenience of description, the methods of references [20][21][22] are introduced into the comparisons in Figures 12-17 of this paper: the method in reference [20] is named reference model 1, the method in reference [21] reference model 2, and the method in reference [22] reference model 3. This paper evaluates the improved YOLO algorithm from four aspects: the precision-recall (PR) curve, the accuracy-loss curve, the average accuracy of test set detection, and the actual detection results. When dividing the model data set, 60% of the data are used for training, 10% for verification, and 30% for testing the trained model. The effect curves of the final test are shown in Figures 12-15.
Through the experimental test of a small amount of data, from the accuracy-recall curves of the models, the detection effect of the models in the literature [20,22] on the vehicle targets is just average. They cannot provide a detection accuracy greater than 50% at a recall rate smaller than 0.5. The model in the literature [21] and the Faster-RCNN model can maintain a detection accuracy of 60~80% at a recall rate smaller than 0.8. The detection effects of the YOLO and RES-YOLO models are the best. Both maintain a detection accuracy greater than 90% at a recall rate smaller than 0.8. The detection accuracy of the RES-YOLO model is also 2% higher than that of the traditional models, as shown in Figure 12. From the accuracy-loss curves of the models, the overall accuracy-loss shows a downward trend, indicating that all models can complete the vehicle detection task to a certain extent. However, generally speaking, the loss curves of the models in the literature [20,22], and the Faster-RCNN models fluctuate considerably, indicating that the detection performance stability is just average, while the loss curves of the model in the literature [21] and the RES-YOLO models fluctuate less strongly, indicating that they are more suitable for vehicle detection, as shown in Figure 13.
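The accuracy-recall comparisons above can be reproduced from raw detection results; a minimal sketch (assuming each detection has already been matched to the ground truth, e.g. by an IoU threshold) is:

```python
def precision_recall_curve(detections, n_ground_truth):
    """Build precision-recall points from score-ranked detections.
    `detections` is a list of (score, is_true_positive) pairs; each
    threshold step yields one (recall, precision) point, as plotted
    in PR curves like those in Figures 12 and 14."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    points = []
    for _score, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / n_ground_truth
        points.append((recall, precision))
    return points

# Toy example: 3 correct detections out of 4, with 4 labeled vehicles.
dets = [(0.9, True), (0.8, True), (0.7, False), (0.6, True)]
points = precision_recall_curve(dets, n_ground_truth=4)
print(points)
```

Sweeping the confidence threshold from high to low traces out the curve; a model whose precision stays high as recall grows (like RES-YOLO in the text) dominates one whose precision collapses early.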
To summarize, as the models in the literature [20,22] perform poorly in the pre-test with a small amount of data, they are far less effective than our algorithm and are therefore excluded from the subsequent large-data tests. The effects of the other models need to be further tested with the large data set group before they can be compared with our algorithm, as shown in Figures 14 and 15 below.
Through the experimental test of the large data group, from the accuracy-recall curves of the models, the Faster-RCNN model performs the worst, followed by the model in the literature [21]. Overall, the PR curve of the YOLO algorithm is more prominent and provides high detection accuracy over a broader range. The improved RES-YOLO algorithm also provides better accuracy than before the modification: at the same recall rate, the RES-YOLO algorithm is about 5% more accurate than the traditional YOLO algorithm, as shown in Figure 14. From the accuracy-loss curves, the accuracy loss of the Faster-RCNN model is too large for it to complete vehicle detection reliably. The model in the literature [21] also suffers from increased accuracy loss when performing vehicle detection tasks in multiple scenes.
In contrast, the YOLO model algorithm is obviously better than these two models. The improved RES-YOLO algorithm still performs well in accuracy-loss even after modification, as shown in Figure 15. The above results also show that by optimizing the network structure and loss function in this paper, the improved RES-YOLO algorithm can theoretically outperform the traditional YOLO algorithm and the other types of optimized algorithms. Next, we are going to perform a real-scene test.
In order to show more intuitively how the RES-YOLO algorithm outperforms the other networks in vehicle detection, measured data from different scenes are randomly collected for an actual test of the algorithm. Here, the four models obtained through big data training are used, and the results are shown in Figure 16.
Figure 16. Actual test results. (a) Sample data 1 [21]; (b) Sample data 2 [21]; (c) Sample data 3 [21].
The results show that: (1) The RES-YOLO algorithm recognizes multi-vehicle targets better; the improved RES-YOLO loss function effectively suppresses the influence of non-target errors on target recognition, as shown in sample data 1. (2) The RES-YOLO algorithm accurately recognizes vehicles even in a dark (harsh) environment, with high robustness; its recognition accuracy is much higher than that of the algorithm in the literature [21] and the Faster-RCNN algorithm, as shown in sample data 2. Optimizing the network structure of the YOLO algorithm makes the improved RES-YOLO algorithm better able to deal with complex environmental information. (3) The RES-YOLO algorithm can identify vehicle positions in short-range vehicle recognition tasks and intuitively show the number of vehicles. Although the accuracy of the bounding-box position still needs to be improved, the recognition effect is better than the other methods, as shown in sample data 3.
The average accuracy, training time, and optimal accuracy of the models are obtained using the 10% validation data set, as shown in Table 4. On this basis, five types of locally measured vehicle data sets are tested with the network. The improved algorithm provides better recognition accuracy under different interference backgrounds, suggesting that our method is more accurate and robust for vehicle target detection than the other methods.
As shown in Figure 17, the RES-YOLO algorithm is superior to the other algorithms in operation efficiency and accuracy. As a large data volume contains more complex vehicle condition data, it is normal that the average accuracy is slightly lower than in the case of a small data volume. Nevertheless, the average accuracy of the RES-YOLO algorithm for the small data volume and the large data volume is 1.0% and 1.7% higher, respectively, than the original algorithm; the training time is 7.3% shorter, and the optimal accuracy is improved by 1% to 95%. Compared with the algorithms in the literature [20,22], the accuracy of our method is improved by 47.8% and 46.8% in the case of small data volume. Compared with the SSD algorithm in the literature [21] and the Faster-RCNN algorithm, the accuracy is improved by 4.1% and 8.5%, respectively, in the case of large data volume. Overall, the improved RES-YOLO algorithm is more suitable for vehicle target detection.

Figure 17. Actual test renderings of six types of networks [20][21][22].

Experimental Test of Local Measured Data
After the above tests, it is found that the RES-YOLO model proposed in this paper is superior to the other target detection networks in the field of vehicle target detection. Nevertheless, to determine the overall performance of the YOLO v2 architecture used in this paper compared with other types of YOLO networks in vehicle detection tasks under complex backgrounds, the image data measured by local researchers are tested. The vehicle conditions under different backgrounds are independently detected, and the training data set is consistent with the test in the previous section. On this basis, the vehicle detection effects of the YOLO v2, YOLO v3, YOLO v4, YOLO v5, and RES-YOLO networks on the local data sets with different backgrounds are compared horizontally. The experimental results on the local test data are shown in Figure 18, and the final effect of the RES-YOLO network on the measured data is shown in Figure 19.
Figure 18. Actual test renderings of five types of networks.
As shown in Figure 19, the model is further tested for six complex scenes, including: (a) urban road environment, (b) tunnel road environment, (c) skyway environment, (d) dark night environment, (e) trees, pedestrians, and other disturbing environments. Through comparative experiments, it is found that the RES-YOLO network shows the best accuracy effect in the actual tests except for the dark environment at night. As there is less background interference in a dark environment, the effect of this network is equivalent to the YOLO v5 network. In other noisy environments, the RES-YOLO network has apparent advantages. Its detection accuracy can stay above 85% in all environments, suggesting high reliability.
Through the actual test experiment on the local data, it is found that the RES-YOLO network can perform vehicle target detection tasks in a variety of interference environments, including the urban road environment, tunnel road environment, skyway environment, dark night environment, and environments disturbed by trees, pedestrians, and other objects. It has a satisfactory extraction effect and is robust for vehicle target recognition in complex background environments.


Conclusions and Discussion
Aiming at the problems encountered in surveying and mapping data processing, such as the difficulty of vehicle target detection, statistical errors in large data sets, and the complexity of automatically judging a vehicle target's position, the YOLO deep learning framework is studied. The traditional YOLO detection algorithm is optimized by introducing an adaptive proportional coefficient to reconstruct the loss function, and a vehicle detection model, the RES-YOLO algorithm, is proposed, then trained and tested on the BDD100K data set. Comparison with the latest detection methods shows that the new algorithm has unique advantages in vehicle target detection. Finally, comparison with other types of YOLO networks demonstrates that the RES-YOLO network is advantageous in performing vehicle target detection tasks. The main work and conclusions of this study are as follows: (1) The YOLO algorithm is superior to the algorithms in the literature in terms of time and accuracy; the RES-YOLO and YOLO algorithms have the highest training efficiency, followed by Faster-RCNN, while the performance of the algorithms in the literature is just average. (2) The RES-YOLO algorithm can effectively overcome background errors caused by the imbalance of positive and negative samples; it can effectively identify vehicle targets in complex backgrounds and greatly increases the usability of the YOLO algorithm.
(3) The RES-YOLO network outperforms even the current mainstream YOLO-type networks, especially in vehicle target detection. It can accurately identify vehicle targets in a variety of environments with good robustness.
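The paper does not reproduce the exact form of the reconstructed loss here, but the role of an adaptive proportional coefficient in balancing positive and negative samples (conclusion (2)) can be illustrated with a rough sketch; every weighting choice below is our own assumption for illustration, not the authors' formula:

```python
import math

def weighted_confidence_loss(preds, labels):
    """Illustrative binary cross-entropy over objectness scores in
    which the weight on negative (background) cells adapts to the
    positive/negative ratio, so abundant background does not swamp
    the few vehicle cells. The weighting scheme is an assumption."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    neg_weight = n_pos / n_neg if n_neg > 0 else 1.0  # adaptive coefficient
    loss = 0.0
    for p, y in zip(preds, labels):
        p = min(max(p, 1e-7), 1 - 1e-7)  # clamp for numerical safety
        if y:
            loss += -math.log(p)                     # positive-sample term
        else:
            loss += -neg_weight * math.log(1 - p)    # down-weighted background term
    return loss / len(labels)

# 2 positive cells, 6 negative: each negative term is scaled by 2/6.
print(weighted_confidence_loss(
    [0.9, 0.8, 0.1, 0.2, 0.1, 0.1, 0.2, 0.1],
    [1, 1, 0, 0, 0, 0, 0, 0]))
```

The key design point is that the background weight shrinks as negatives dominate, which is one common way to keep background error from overwhelming the target signal.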
The research contribution of this paper is that we improved the traditional YOLO network structure, which makes it possible to recognize vehicle targets accurately and suppresses multiple types of environmental background noise effectively; moreover, we provide the optimized RES-YOLO network structure as a useful reference for subsequent modification of YOLO-type networks for particular target recognition tasks. We also optimized the loss function in the YOLO network to improve its ability to suppress background noise, unitized key operations to allow high-efficiency operation, integrated our algorithm into a software application, designed a user-friendly interface, and incorporated a variety of picture and video detection interfaces. The main interface of the vehicle detection system, shown in Figure 20, can be easily installed in MATLAB.
Finally, due to the diversity of conditions and data acquisition equipment, many problems in vehicle target detection for more complex environments remain to be studied.