UAV Imagery for Automatic Multi-Element Recognition and Detection of Road Traffic Elements

Abstract: Road traffic elements are an important part of roads and form the main content of a basic traffic geographic information database, which makes them particularly important for the development of basic traffic geographic information. However, the extraction of traffic elements still suffers from the following problems: insufficient data, complex scenarios, small targets, and incomplete element information. Therefore, a set of road traffic multielement remote sensing image datasets obtained by unmanned aerial vehicles (UAVs) is produced, and an improved YOLOv4 network algorithm combined with an attention mechanism is proposed to automatically recognize and detect multiple road traffic elements in UAV imagery. First, the scale range of the different objects in the datasets is counted, and the size of the candidate boxes is obtained by the k-means clustering method. Second, mosaic data augmentation is used to increase the number of training samples in the road traffic multielement datasets. Then, the efficient channel attention (ECA) mechanism is integrated into the two effective feature layers extracted from the YOLOv4 backbone network and into the upsampling results, so that the network focuses on the feature information, and the network is trained on the datasets. At the same time, the complete intersection over union (CIoU) loss function is used to take into account the geometric relationship between the predicted box and the ground-truth box, to solve the overlapping problem of the anchor boxes of densely juxtaposed elements, and to reduce the rate of missed detection. Finally, the mean average precision (mAP) is calculated to evaluate the experimental results. The experimental results show that the mAP of the proposed method is 90.45%, which is 15.80% higher than that of the original YOLOv4 network. The average detection accuracy of zebra crossings, bus stations, and roadside parking spaces is improved by 12.52%, 22.82%, and 12.09%, respectively. The comparison and ablation experiments show that the proposed method can realize the automatic recognition and detection of multiple road traffic elements and provides a new solution for constructing a basic traffic geographic information database.


Introduction
Road traffic elements, including road centerlines, road intersections, zebra crossings, bus stations, and roadside parking spaces, are an important part of roads. The accurate recognition and detection of road traffic elements provide an essential decision-making basis for automatic driving, improving intelligent transportation systems, promoting smart cities, and updating basic traffic geographic information databases [1]. For the automatic recognition and detection of road traffic elements, most recent research has been based on the detection and recognition of a single element, namely roadside traffic signage [2][3][4]. Inevitably, this approach has many shortcomings, such as the small amount of information acquired, the restriction to a single element, and the large interval distance between elements. It therefore cannot provide a good solution for updating the basic traffic geographic information database. Due to their limited shooting range, traditional vehicle-mounted cameras can obtain only a small portion of the road traffic element information, which is not conducive to the acquisition of large-area traffic element information. In contrast, unmanned aerial vehicle (UAV) images have the advantages of convenient acquisition and high resolution, providing favorable conditions for the acquisition of large-area traffic element information. Therefore, the automatic recognition and detection of multiple road traffic elements are studied through UAV remote sensing images in this paper to improve efficiency and reduce labor costs in updating the basic traffic geographic information database.
Many studies have been carried out on target detection and recognition. With the development of deep learning, target detection methods have started changing from classical machine learning methods to deep learning methods, representing a new paradigm of machine learning. Target detection has been widely used in face detection [5], automatic driving [6], text detection [7,8], and other fields. Traditional target detection methods are based on color or shape features for target extraction. For example, Li et al. [3] proposed the method of detecting traffic signs through color and shape features; however, this method had a poor recognition effect and insufficient overall detection accuracy. Zhao et al. [4] used the Hough transform and shape analysis to detect and recognize road traffic signs; however, this method had an insufficient recognition rate and poor recognition effect. Berkaya et al. [9] used a shape algorithm and color threshold technology to detect and recognize circular traffic signs; however, this method only realized the recognition and detection of circular traffic signs, and its application scope was limited. Shi et al. [10] used a split-space Hough transformation method to achieve road boundary detection, and this method was suitable for boundary detection algorithms for straight and curved roads in general scenes. It is difficult to detect road boundaries in complex environments. He et al. [11] used shape information to detect triangular traffic signs; this method was only suitable for the detection of clear objects and did not detect the presence of fragments or occlusions in natural scenes. Creusen et al. [12] proposed an extended algorithm for traffic sign detection using information from multiple color channels. Most of these traditional methods use the special color and shape of traffic signs for feature extraction and rely on classifiers for classification. 
These methods generally suffer from slow detection speeds and insufficient detection accuracies, making it difficult to achieve the desired goal.
With the development of deep learning [13], increasing numbers of scholars are using deep learning for target detection. Target detection based on deep learning can be divided into two types: one-stage detection, represented by the single-shot detector (SSD) [14] and the "You Only Look Once" (YOLO) family of algorithms [15][16][17][18][19], and two-stage detection, represented by the region-based convolutional neural network (R-CNN) [20], Fast R-CNN [21], Faster R-CNN [22], etc. For example, a small traffic sign detection algorithm based on an improved SSD was proposed by Shan et al. [23], which achieved high accuracy on the test set but was not very applicable to the detection of other road traffic elements. Chen et al. [24] proposed an improved Mask R-CNN method to achieve road traffic sign recognition; however, this method recognized only a single element and provided little information. Lodhi et al. [25] proposed a convolutional neural network (CNN)-based traffic sign recognition system. The authors integrated multilayer convolutional features and multilayer contextual information through a CNN framework for feature extraction. Guo et al. [26] used Faster R-CNN to implement a systematic approach for end-to-end traffic sign recognition. The method performed well in small target detection and classification. Jin et al. [27] addressed the problem of insufficient average detection accuracy and missed detections during target detection in real road scenes. The authors improved the detection accuracy of road targets with an improved YOLOv3 algorithm.
Aerospace 2022, 9, 198
The problems and solutions proposed by the above scholars are useful for updating the basic traffic geographic information database. However, most scholars perform target detection based on a single element, which cannot satisfy the practical application needs for the detection of multiple road traffic elements. Therefore, an improved YOLOv4 [15] network algorithm combined with the efficient channel attention (ECA) mechanism [28] is proposed in this paper to achieve the automatic recognition and detection of multiple road traffic elements in UAV images. First, a set of road traffic multielement datasets is manually labeled on the captured UAV images using roLabelImg [29] (downloadable from https://github.com/cgvict/roLabelImg, accessed 23 June 2020). Second, the optimal candidate box sizes of the target objects are obtained by k-means clustering analysis. Then, the ECA mechanism is integrated into the YOLOv4 backbone network, and the network is trained on the datasets to improve the detection accuracy for multiple road traffic elements. At the same time, the complete intersection over union (CIoU) loss function [30] is introduced to reduce the false detection rate for densely juxtaposed elements and to improve the detection accuracy.
The aim of this paper is to automatically recognize and detect multiple road traffic elements and thereby provide a service for updating the basic traffic geographic information database. The main contributions of this paper are as follows:
(1) In response to the scarcity of road traffic multielement datasets, their restriction to single elements, and the lack of road information, a set of UAV image road traffic multielement datasets is produced in this paper.
(2) Aiming to solve the problem of the insufficient detection accuracy of road elements and the difficult identification of juxtaposed dense elements, the YOLOv4 algorithm integrating the ECA mechanism is proposed.
(3) The comparative experiment and ablation experiment prove the superiority of this method in detecting multiple elements of road traffic and provide a new solution for updating the basic traffic geographic information database.
The rest of this paper is organized as follows. Section 2 describes the related work. Section 3 details the proposed method, followed by the experiments and results in Section 4. The discussion is presented in Section 5. Finally, our conclusion is outlined in Section 6.

Related Work
In recent years, UAVs have shown a wide range of advantages in the field of transportation. In particular, they play an important role in road traffic monitoring, navigation, road damage detection, vehicle tracking for identification, road maintenance, and other traffic components [31][32][33][34][35][36][37][38][39][40][41][42]. With the advantages of fast data collection, high image quality, minimal cost, light weight, and great adaptability, UAVs can be used in road traffic inspection to greatly improve efficiency and reduce maintenance and manpower costs.
Research in the field of transport drones has focused on the problem of cruise route planning for UAVs, road vehicle detection, and the extraction of road information. The following scholars have addressed the problem of UAV cruise-route planning. Liu et al. [31] proposed a multi-objective optimization model for UAV cruise path planning. Additionally, an improved algorithm was designed to solve the UAV cruise path planning problem. Cheng et al. [32] proposed an algorithm for optimizing and modifying the optimal paths for UAVs. The authors also developed a multibase, i.e., a rechargeable and refillable UAV road patrol task allocation model to solve the problem of poor endurance associated with UAVs. Other academics have contributed to the traffic control field by completing real-time monitoring of road traffic information through UAVs. Elloumi et al. [33] proposed a road traffic detection system based on several UAVs. The authors monitored the traffic on urban roads with several UAVs in real time and sent the information to a traffic processing center for traffic control. Yang et al. [34] proposed an artificial intelligence-based solution to implement multi-object detection for intelligent road monitoring. The method provides a good solution for future road monitoring and control of intelligent transportation by combining UAVs, wireless communication, and Internet of Things technologies. Wang et al. [35] proposed a method for handling the loss of contact between a UAV and its operator based on probabilistic model detection. The method enables the UAV to perform surveillance tasks in dangerous environments. Huang et al. [36] proposed a distributed navigation scheme. This scheme achieved road traffic condition detection in different modes by means of real-time UAV detection. Liu et al. [37] proposed a real-time UAV rerouting model and a decomposition-based multi-objective optimization algorithm. 
The model took the dynamic requirements of traffic monitoring into account to achieve dynamic route planning for UAV cruising, making it more suitable for real-life traffic monitoring. UAV technology has the advantages of low cost, high flexibility, and good quality of collected image data. The growing number of UAV applications in the field of transport is reflected in the increasing amount of road image information collected with UAVs. Pan et al. [38] detected asphalt pavement deterioration through drone imagery to provide decision support for road maintenance practices. The authors proposed that a combination of machine learning algorithms, such as support vector machines, artificial neural networks, and random forests, can be used to differentiate between normal and damaged pavements for pavement damage identification. Saad et al. [39] used UAV images to identify ruts and potholes in road surfaces. The authors identified the ruts and potholes extracted from UAV images through site survey and planning, data acquisition, data processing, and data analysis to achieve road condition detection. Roberts et al. [40] proposed a method for generating 3D pavement models from UAV images. These models were used to monitor and analyze the pavement condition and to automate the detection of pavement deterioration. Wang et al. [41] proposed a UAV-based target tracking and recognition system. This system implemented the functions of target tracking, target recognition and detection, and image processing. Liu et al. [42] processed UAV images through a target detection network with multiscale feature fusion, which improved the ability to detect small targets while reducing resource consumption and making the network lighter.
The above paragraph describes the main research on UAVs in the field of transportation. Similar to the abovementioned scholars, this paper also collects information through UAVs. Most scholars currently collect road image information through UAVs primarily to research road damage. However, this paper mainly collects information on road traffic elements, including road centerlines, road intersections, zebra crossings, bus stations, roadside parking spaces, and other similar information. A review of a large amount of literature shows that there is little research on the automatic identification and detection of road traffic elements. However, the extraction of road traffic element information is of great significance for updating the basic traffic geographic information database. Determining how to extract road traffic elements in a low-cost and high-efficiency way is particularly important. Therefore, this paper proposes a deep learning method by fusing the multielement images of road traffic obtained by UAVs. It can achieve the effect of automatic identification and detection with high efficiency, low cost, and high accuracy. The proposed method can provide technical support for updating the basic traffic geographic information database.

Research Method
YOLOv4 combines a number of techniques from previous research with innovations of its own. It performs efficient target detection while using only a single GPU. In the YOLOv4 network, the training process can be optimized to improve accuracy, and better performance can also be achieved at the cost of a small amount of inference time. YOLOv4 thus achieves a good balance between speed and accuracy in target detection tasks.
Because YOLOv4 offers both fast detection and high accuracy, a YOLOv4 network incorporating the ECA mechanism [28] is proposed. First, the k-means [43] clustering method is used to calculate the candidate box sizes that match the datasets. Second, the mosaic data augmentation method is used to increase the number of training samples for multiple road traffic elements in the training datasets. Then, the ECA module is fused into the YOLOv4 network for training. Finally, the detection results are obtained, and the accuracy is evaluated. Figure 1 displays a flow chart of the proposed method.

Overall Framework
In this paper, the YOLOv4 network model is selected as the base algorithm model. YOLOv4, proposed by Bochkovskiy et al. [15], is a one-stage target detection network. It is primarily composed of CSPDarknet53 [15], spatial pyramid pooling (SPP) [44], feature pyramid networks (FPNs) [45], and a path aggregation network (PAN) [46]. Among these components, the CSPDarknet53 structure consists of 5 cross stage partial (CSP) [47] modules, each preceded by a convolution with a 3 × 3 kernel and a stride of 2 that acts as a downsampling layer. Thus, when the input image is 416 pixels × 416 pixels in size, it is downsampled 5 times, once before each CSP module, to obtain a feature map with a size of 13 × 13. CSPDarknet53 reduces the computational consumption and memory costs while also enhancing the learning capability of the CNN and maintaining accuracy. The SPP structure is mainly used to solve the problem of the nonuniform size of the input image: it directly pools feature maps of any size to obtain a fixed number of features. FPN+PAN draws on the approach of PANet [46] by adding a bottom-up feature pyramid to the tail of the FPN structure. This pyramid includes two PAN structures that enable bottom-up communication of strong localization features, making it easier for bottom-level information to reach the top of the hierarchy, while the FPN structure provides top-down communication of enhanced semantic features. The combination of these two components enables feature aggregation from different backbone layers and between detection layers, thus improving the feature extraction capability of the network.
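The downsampling arithmetic described above can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and a padding of 1 for each 3 × 3, stride-2 convolution is an assumption that the text does not state.

```python
from typing import List


def conv_out_size(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def backbone_sizes(input_size: int = 416, num_stages: int = 5) -> List[int]:
    """Feature map size after each of the stride-2 downsampling convolutions."""
    sizes = []
    size = input_size
    for _ in range(num_stages):
        size = conv_out_size(size)  # padding=1 is assumed, see lead-in
        sizes.append(size)
    return sizes


print(backbone_sizes(416))  # [208, 104, 52, 26, 13]
```

Each stage halves the resolution, so five stages reduce the input by a factor of 32, which is how a 416 × 416 input ends at 13 × 13.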
The YOLOv4 method fused with the ECA mechanism proposed in this paper adds the ECA mechanism to the two effective feature layers extracted from the backbone network and to the result after upsampling, as shown in the YOLOv4 model with ECA in Figure 2.

Figure 1. Flow chart of the proposed method (datasets; k-means clustering to obtain candidate box sizes; mosaic data enhancement; YOLOv4 feature extraction network with ECA integrated; SPP+PANet; YOLO head; detection results; precision evaluation).

Introduction to the Datasets
At present, there is a lack of sufficient datasets for multiple road traffic elements, and most of the existing publicly available datasets are roadside traffic signage datasets or road traffic datasets. To meet the demand for updating the basic traffic geographic information database and to solve the problem of an insufficient number of datasets for road traffic elements, a set of datasets for multiple road traffic elements is produced in this paper, including zebra crossings, roadside parking spaces, and bus stations, as shown in the sample datasets in Figure 3. The UAV images were captured by the Harwar MEGA-V8 and the DJI FC6310. The Harwar MEGA-V8 is equipped with a five-lens tilt camera and supports BeiDou, the global positioning system (GPS), and GLONASS with seven-frequency real-time kinematic (RTK) positioning. The horizontal positioning accuracy reaches ±2 cm, and the vertical positioning accuracy reaches ±5 cm. This equipment is characterized by high efficiency, long endurance, and high-precision map formation. The DJI FC6310 UAV has 6 vision sensors, a main camera, 2 sets of infrared sensors, 1 set of ultrasonic sensors, a GPS/GLONASS dual-mode satellite positioning system, an inertial measurement unit (IMU), and dual redundant compass sensors. This equipment helps the drone acquire real-time images as well as depth and positioning information while flying, build a 3D map of its surroundings, and determine its position. The remote sensing images from the two platforms are 7146 pixels × 5364 pixels and 5472 pixels × 3648 pixels in size, with spatial resolutions of 0.05 m and 0.1 m, respectively. A total of 16,872 images were taken. After duplicate areas and areas without road traffic elements were removed, 1128 of these images were manually selected as the original dataset.
The road traffic elements were manually marked with the image labeling software roLabelImg, which can mark either rotated rectangular boxes or upright (axis-aligned) rectangular boxes. This paper uses the upright rectangular boxes. Many elements constitute road traffic information, including road centerlines, road intersections, zebra crossings, bus stations, and roadside parking spaces. The main purpose of this paper is to achieve the automated construction of a basic traffic geographic information database, and research results on the automatic identification and detection of multiple road traffic elements are relatively few. Therefore, representative road traffic elements, namely zebra crossings, roadside parking spaces, and bus stations, are selected as research objects in this paper. More types of elements will be added to the automated detection and recognition in subsequent research. In the datasets, zebra crossings, roadside parking spaces, and bus stations are labeled zebra_crossings, parking_spaces, and bus_stations, respectively. The training data account for 90% of all data, and the rest are test data.
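As an illustration of the 90%/10% split mentioned above, the following sketch shuffles a list of image names reproducibly and cuts it at the training ratio. The file names, the seed, and the rounding rule are assumptions; the text does not state how the split was drawn.

```python
import random
from typing import List, Tuple


def split_dataset(image_ids: List[str], train_ratio: float = 0.9,
                  seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle reproducibly and cut the list into training and test parts."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(round(len(ids) * train_ratio))
    return ids[:n_train], ids[n_train:]


# Hypothetical file names standing in for the 1128 selected UAV images.
image_ids = [f"uav_{i:04d}.jpg" for i in range(1128)]
train_ids, test_ids = split_dataset(image_ids)
print(len(train_ids), len(test_ids))  # 1015 113
```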

Clustering of the Anchor Box
The anchor box sizes of the original YOLOv4 network were obtained from the Visual Object Classes (VOC) datasets [48]. Detection was performed at scales of 19 × 19, 38 × 38, and 76 × 76. The preset candidate boxes were (12, 16), (19, 36), (40, 28), (36, 75), (76, 55), (72, 146), (142, 110), (192, 243), and (459, 401), whose scale sizes are not applicable to the road traffic multielement datasets captured by UAV in this paper. Therefore, to match the target scale range of the road traffic multielement datasets, the k-means clustering method was used to gather scale statistics on the 1128 UAV road traffic multielement remote sensing images. First, the number of clusters for the scale of road traffic elements was set to 9, and an initial center was randomly selected for each cluster. Then, each data point was assigned to the nearest cluster center, and the center point of each of the 9 clusters was computed as the new cluster center. The cluster centers were iterated in this way until the points belonging to the 9 clusters no longer changed. Finally, the size of the target candidate box was set based on the clustering result. The k-means clustering results are shown in Table 1. These results show that the clustered sizes are in line with the target scale of the datasets proposed in this paper.
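The clustering loop described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the use of 1 − IoU between box shapes as the distance is an assumption (it is common practice for YOLO anchors, but the text only says "nearest").

```python
import random
from typing import List, Optional, Tuple

Box = Tuple[float, float]  # (width, height) in pixels


def wh_iou(box: Box, center: Box) -> float:
    """IoU of two boxes that share the same top-left corner."""
    inter = min(box[0], center[0]) * min(box[1], center[1])
    union = box[0] * box[1] + center[0] * center[1] - inter
    return inter / union


def kmeans_anchors(boxes: List[Box], k: int = 9, seed: int = 0,
                   max_iter: int = 100) -> List[Box]:
    """Cluster labeled box sizes into k anchors, sorted by area."""
    rng = random.Random(seed)
    centers = rng.sample(boxes, k)  # random initial centers
    assignment: Optional[List[int]] = None
    for _ in range(max_iter):
        # Assign each box to the nearest center (assumed distance: 1 - IoU).
        new_assignment = [max(range(k), key=lambda j: wh_iou(b, centers[j]))
                          for b in boxes]
        if new_assignment == assignment:  # memberships stopped changing
            break
        assignment = new_assignment
        # Move each center to the mean width and height of its members.
        for j in range(k):
            members = [b for b, a in zip(boxes, assignment) if a == j]
            if members:  # an empty cluster keeps its old center
                centers[j] = (sum(w for w, _ in members) / len(members),
                              sum(h for _, h in members) / len(members))
    return sorted(centers, key=lambda c: c[0] * c[1])


# Synthetic box sizes stand in for the labeled road traffic elements.
data_rng = random.Random(1)
boxes = [(data_rng.uniform(10, 200), data_rng.uniform(10, 200)) for _ in range(500)]
anchors = kmeans_anchors(boxes)
print([(round(w), round(h)) for w, h in anchors])
```

Sorting the resulting centers by area makes it straightforward to hand the three smallest anchors to the finest detection scale and the three largest to the coarsest one.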

Data Augmentation
Mosaic data augmentation is used to enhance the training datasets in the YOLOv4 network. The approach randomly extracts four images containing the labeled boxes of the detection objects from the road traffic multielement datasets; stitches them into a new image by randomly scaling, cropping, and arranging them; obtains the boxes corresponding to the resulting image; and then passes the processed image into the YOLOv4 network for learning. The data from the four images are thus treated as one image in the batch normalization calculation [30]. As shown in the workflow of mosaic data augmentation in Figure 4, such augmentation enriches the datasets with background and small-sample information of the detection objects. Moreover, mosaic data augmentation training does not require high computational performance, even when using only the central processing unit (CPU).
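The box bookkeeping behind the mosaic step can be sketched as follows. The sketch handles only the geometry (where each image lands on the canvas and how its boxes are mapped and clipped); copying the pixels is omitted. The function name, the 2 × 2 quadrant layout, the scaling range, and the minimum box side are all assumptions made for illustration.

```python
import random
from typing import List, Tuple

Box = Tuple[float, float, float, float]      # (x_min, y_min, x_max, y_max)
Sample = Tuple[Tuple[int, int], List[Box]]   # ((width, height), boxes)


def mosaic_boxes(samples: List[Sample], out_size: int = 416,
                 seed: int = 0, min_side: float = 2.0) -> List[Box]:
    """Boxes of a 2x2 mosaic built from exactly four labeled images."""
    if len(samples) != 4:
        raise ValueError("mosaic needs exactly four samples")
    rng = random.Random(seed)
    # A random center splits the output canvas into four quadrants.
    cx = rng.uniform(0.3, 0.7) * out_size
    cy = rng.uniform(0.3, 0.7) * out_size
    quadrants = [(0.0, 0.0, cx, cy), (cx, 0.0, out_size, cy),
                 (0.0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    merged: List[Box] = []
    for ((w, h), boxes), (qx0, qy0, qx1, qy1) in zip(samples, quadrants):
        scale = rng.uniform(0.5, 1.5) * out_size / max(w, h)  # random scaling
        sw, sh = w * scale, h * scale
        # Random arrangement: along each axis the scaled image either lies
        # inside the quadrant or covers it, and the overflow is cropped.
        ox = rng.uniform(min(qx0, qx1 - sw), max(qx0, qx1 - sw))
        oy = rng.uniform(min(qy0, qy1 - sh), max(qy0, qy1 - sh))
        for x0, y0, x1, y1 in boxes:
            # Map the box to canvas coordinates, then clip it to the quadrant.
            nx0 = min(max(x0 * scale + ox, qx0), qx1)
            ny0 = min(max(y0 * scale + oy, qy0), qy1)
            nx1 = min(max(x1 * scale + ox, qx0), qx1)
            ny1 = min(max(y1 * scale + oy, qy0), qy1)
            if nx1 - nx0 >= min_side and ny1 - ny0 >= min_side:
                merged.append((nx0, ny0, nx1, ny1))  # drop boxes cropped away
    return merged


# Four hypothetical 100 x 100 images, each with one box covering the image.
samples = [((100, 100), [(0.0, 0.0, 100.0, 100.0)]) for _ in range(4)]
mosaic = mosaic_boxes(samples)
print(len(mosaic))
```

Boxes that are cropped down to less than `min_side` pixels are discarded, since a sliver of an object is usually not a useful training target.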

Efficient Channel Attention
Attention mechanisms are widely used in deep learning. They can be implemented in many ways, but the core idea is always to make the network focus on the most informative features. Attention mechanisms can be divided into channel attention, spatial attention, and combinations of the two. This paper uses the efficient channel attention (ECA) mechanism. ECANet [28] is an implementation of the channel attention mechanism and can be considered an improved version of SENet [49]. The squeeze-and-excitation (SE) [49] attention mechanism first compresses the channel dimension of the input feature map, but this dimensionality reduction is not conducive to learning the dependencies between channels. ECA therefore avoids dimensionality reduction: it realizes local cross-channel interaction with a one-dimensional convolution whose kernel size is selected adaptively. This guarantees the coverage of local cross-channel interactions and allows the network to gain performance while reducing the complexity of the model. Integrated into YOLOv4, ECA extracts the dependencies between channels and improves the performance of the network, which in turn improves the identification accuracy of road traffic elements in UAV images. The ECA mechanism proceeds in three steps:
(1) Apply global average pooling to the input feature map to obtain the channel descriptor y.
(2) Apply a one-dimensional convolution with kernel size k to y and obtain the weight ω of each channel through the sigmoid activation function. The calculation formula of ω is:
$$\omega = \sigma\left(\mathrm{C1D}_k(y)\right)$$
where σ is the sigmoid function, C1D stands for one-dimensional convolution, and k is the kernel size, i.e., the number of neighboring channels of y involved in computing each weight.
(3) Multiply the weights by the corresponding channels of the original input feature map to obtain the final output feature map.
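The three ECA steps can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this paper: the nested-list tensor layout, the function names, and the convolution weights are hypothetical, and the adaptive kernel-size rule with gamma = 2 and b = 1 is taken from the ECANet heuristic as an assumption, since the text does not state it.

```python
import math

def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive odd kernel size k for a given channel count.
    Assumption: ECANet heuristic with gamma=2, b=1 (not stated in the text)."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 == 1 else t + 1

def eca(feature_map, conv_weights):
    """Apply ECA to a feature map given as a list of C channels,
    each an H x W nested list. conv_weights holds the k (odd) weights
    of the 1D convolution; the values are hypothetical."""
    k = len(conv_weights)
    pad = k // 2
    # Step 1: global average pooling, one scalar per channel
    y = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
         for ch in feature_map]
    # Step 2: 1D convolution across channels (zero padding), then sigmoid
    padded = [0.0] * pad + y + [0.0] * pad
    omega = []
    for c in range(len(y)):
        z = sum(w * padded[c + j] for j, w in enumerate(conv_weights))
        omega.append(1.0 / (1.0 + math.exp(-z)))
    # Step 3: rescale every channel of the input by its weight
    out = [[[omega[c] * v for v in row] for row in ch]
           for c, ch in enumerate(feature_map)]
    return out, omega

# Toy example: 2 channels of size 1 x 2, identity-like kernel of size 3
fm = [[[1.0, 3.0]], [[2.0, 2.0]]]
scaled, weights = eca(fm, [0.0, 1.0, 0.0])
```

Because the convolution slides over the channel axis only, the number of learned parameters is k, independent of the number of channels; this is what keeps the module lightweight compared with the fully connected layers of SE.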

CIoU_Loss
Road traffic elements in UAV images, such as roadside parking spaces, are often densely juxtaposed. The intersection over union (IoU) loss function does not handle this case well; therefore, the CIoU loss function is used instead. CIoU [30] improves regression accuracy and convergence speed by considering the overlapping area, the distance between the centers of the prediction box and the real box, and their aspect ratios, as shown in Figure 5. The penalty term of CIoU is defined as:
$$R_{CIoU} = \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v$$
where v measures the consistency of the aspect ratios and α is the weighting function, respectively defined as:
$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}$$
$$\alpha = \frac{v}{(1 - IoU) + v}$$
Thus, the CIoU loss function can be expressed as:
$$L_{CIoU} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v$$
where c denotes the diagonal length of the smallest rectangle enclosing the prediction box b and the real box b^{gt}, and d = ρ(b, b^{gt}) denotes the distance between the centroids of the real box and the prediction box. IoU is the area intersection ratio of the prediction box and the real box, ρ²(b, b^{gt}) denotes the squared Euclidean distance between the centroids of the prediction box and the real box, and w, h and w^{gt}, h^{gt} are the widths and heights of the prediction box and the real box, respectively.
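As a worked illustration, the CIoU loss (1 minus IoU, plus the squared center distance normalized by the squared enclosing-box diagonal, plus the weighted aspect-ratio term) can be sketched in plain Python. The corner-coordinate box format (x1, y1, x2, y2), the function name, and the small eps stabilizer are assumptions made for this sketch, not details given in the paper.

```python
import math

def ciou_loss(pred, gt, eps=1e-9):
    """CIoU loss for two axis-aligned boxes given as (x1, y1, x2, y2).
    The box format and eps are assumptions of this sketch."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1
    # IoU: intersection area over union area
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    union = pw * ph + gw * gh - inter
    iou = inter / (union + eps)
    # rho^2: squared Euclidean distance between the two box centers
    rho2 = (((px1 + px2) - (gx1 + gx2)) / 2) ** 2 \
         + (((py1 + py2) - (gy1 + gy2)) / 2) ** 2
    # c^2: squared diagonal of the smallest box enclosing both boxes
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps
    # v: aspect-ratio consistency; alpha: its weighting function
    v = (4 / math.pi ** 2) * (math.atan(gw / (gh + eps))
                              - math.atan(pw / (ph + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v

# Two disjoint unit squares: IoU is 0, so only the distance term is added
loss_disjoint = ciou_loss((0, 0, 1, 1), (2, 2, 3, 3))
```

Unlike the plain IoU loss, which is constant at 1 for any pair of non-overlapping boxes, the distance term still changes as the boxes move relative to each other, which is useful when adjacent parking-space boxes sit side by side.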

Experimental Environment
The experiments were run on a computer with an i7-9700K CPU, Windows 10, and a GTX 1070 Ti GPU with 8 GB of video memory. The experimental training platform was PyCharm. The training weight decay coefficient was set to 0.0005, the initial learning rate to 0.001, the confidence level to 0.5, and the IoU threshold to 0.5. Training ran for 100 epochs, with 4000 iterations. The datasets were divided into a training set and a validation set in a 9:1 ratio, and a typical road traffic element was randomly selected as the test set.
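The reported settings and the 9:1 split can be written down as a small sketch. The dictionary keys, the function name, the fixed seed, and the placeholder file names are all hypothetical; only the numeric values come from the text.

```python
import random

# Hyperparameters reported in the text; the key names are hypothetical
config = {
    "weight_decay": 0.0005,
    "initial_learning_rate": 0.001,
    "confidence_threshold": 0.5,
    "iou_threshold": 0.5,
    "epochs": 100,
}

def split_dataset(items, train_ratio=0.9, seed=0):
    """Shuffle the items and split them into training and validation
    subsets. The seed is an assumption added for reproducibility."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(round(len(items) * train_ratio))
    return items[:cut], items[cut:]

# Placeholder image names standing in for the UAV dataset
images = [f"img_{i:04d}.jpg" for i in range(100)]
train_set, val_set = split_dataset(images)
```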

Evaluation Indicators
In the experiments, the mean average precision (mAP) was used as the quantitative evaluation index of detection accuracy. The mAP is defined as:
$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$
where N represents the number of all categories in the test set, i is the ith category, and AP_i is the average precision (AP) of the ith category, which is defined as:
$$AP = \int_{0}^{1} p(r)\,dr$$
where p is the precision and r is the recall; p is treated as a function of r, so that AP equals the area under the precision-recall curve. The recall and precision are defined as:
$$Recall = \frac{TP}{TP + FN}, \qquad Precision = \frac{TP}{TP + FP}$$
where TP represents the positive samples detected correctly, i.e., the number of road traffic elements detected correctly; FP represents the negative samples detected incorrectly, i.e., the number of targets of other classes that were incorrectly detected as road traffic elements; and FN represents the positive samples detected incorrectly, i.e., the number of road traffic elements that were missed or incorrectly detected as other classes.
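The indicators can be computed as in the following sketch. The paper does not say how the area under the precision-recall curve is evaluated numerically; the all-point interpolation used here is an assumption, and the function names and the toy AP values are hypothetical rather than results from the experiments.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(recalls, precisions):
    """Area under the precision-recall curve.
    Assumption: all-point interpolation; recalls must be ascending."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision monotonically non-increasing from right to left
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles between consecutive recall values
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

def mean_average_precision(ap_per_class):
    """mAP: mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Toy values for illustration only
ap_toy = average_precision([0.5, 1.0], [1.0, 0.5])
map_toy = mean_average_precision({"class_a": 0.9, "class_b": 0.7})
```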

Comparison Experiments
In this paper, the proposed method is compared against both classical and state-of-the-art target detection networks. The SSD, RetinaNet, Faster R-CNN, YOLOv3, YOLOv4, and YOLOv5 networks were trained on the road traffic multielement datasets, and their AP, precision, recall, and mAP values were calculated and compared. Table 2 lists the recognition accuracy of road traffic elements under the different network models. The rise points in Table 2 give the mAP improvement of the proposed method over each network.
To verify the effectiveness of the proposed method, ablation experiments were conducted on the road traffic multielement datasets. These experiments compared k-means clustering, mosaic data augmentation, and other attention mechanisms (the SE [49] attention mechanism and the convolutional block attention module (CBAM) [50], each fused into the same network layers as the ECA mechanism) by calculating their AP, precision, recall, and mAP values, as shown in Table 3.
To verify the practicality and effectiveness of the proposed method, UAV images of small scenarios and of large complex scenarios were selected for prediction experiments. The prediction results of the ablation experiments for small scenarios are shown in Figure 6. The prediction data were selected from the road traffic multielement datasets and cover three representative scenarios: the single-element scenario, the multielement scenario, and the juxtaposed dense-element scenario. The single-element scenario contains only one type of traffic element; bus stations were selected as its detection object. The multielement scenario includes zebra crossings, bus stations, and roadside parking spaces. The juxtaposed dense-element scenario involves the detection and recognition of roadside parking spaces. According to the corresponding statistics, there are 2 bus stations in the single-element scenario; 4 zebra crossings, 3 roadside parking spaces, and 1 bus station in the multielement scenario; and 17 roadside parking spaces in the juxtaposed dense-element scenario. The predicted results of the ablation experiments in small scenarios are shown in Table 4.
Combined with the predicted results in Figure 6 and Table 4, it is clear that using k-means clustering or mosaic data augmentation alone to detect multiple road traffic elements leads to missed detections, which shows that improving the algorithm in only one respect does not greatly improve the experimental results. In the ablation experiments, the mosaic data augmentation method has the worst detection accuracy for roadside parking spaces, only 74.08%, followed by the k-means clustering method; the missed detections visible in Figure 6 and Table 4 confirm this. In terms of the overall prediction results, detection improves with the addition of an attention mechanism, and the proposed method achieves the largest number of best results, four. In particular, its detection accuracy reaches 98.25% for zebra crossings in the multielement scenario and 99.88% for roadside parking spaces in the juxtaposed dense-element scenario. Although adding the SE attention mechanism gives the best detection of bus stations in complex scenarios, the proposed method still reaches 98%, only 2% below the SE-based method.
[Figure panels: (a) YOLOv4+k-means; (e) Proposed method.]
The predicted results of the ablation experiment in a large complex scene are shown in Figure 7. The large scene map is an orthomosaic image generated from UAV remote sensing images processed with Pix4D software; it covers a total area of 302,813 m². By count, the large complex scene includes 18 zebra crossings, 5 bus stations, and 58 roadside parking spaces. The specific prediction results are shown in Table 5. Combined with the prediction results in Figure 7 and Table 5, it is clear that several of the above algorithms miss detections in large complex scenarios, especially when detecting roadside parking spaces. The reason is that, in large images, roadside parking spaces are small targets and most of them are covered by greenery; their feature information is therefore not obvious, which results in missed detections. For the other objects, the algorithm in this paper still performs well. The average detection accuracy of zebra crossings and bus stations in large complex scenarios reaches 92.17% and 93.40%, respectively, the best result in the ablation experiment.

Discussion
From the above experimental results and analysis, we find that the proposed method shows a large improvement in mAP over the other methods, with gains ranging from 3.53% to 36.51%, verifying that the YOLOv4 model incorporating the ECA mechanism can effectively improve road traffic multielement detection accuracy. Consistent with the results of previous studies, combining other mainstream attention modules at the same network location also improves on the original YOLOv4 network, which indicates that fusing an attention mechanism has a positive effect on the trained model; however, the improvement is smaller than that of the proposed method, and the YOLOv4 algorithm with the fused ECA mechanism performs best. Moreover, the proposed method still achieves better results than the other methods in complex large scenes, which demonstrates its practicality and superiority: it can be directly applied to image maps of large scenarios and provides a more intelligent and convenient way to update the basic traffic geographic information database. However, in large complex scenarios, roadside parking spaces are still missed in some cases, because they are easily obscured and are small targets. Subsequent research should therefore address how to improve the detection of small targets in large complex scenarios and how to better extract their feature information.

Conclusions
To address the problems of insufficient data extraction, poor automation, and the high demand for traffic element information, this paper proposes an automatic recognition and detection method for multiple road traffic elements in UAV remote sensing images, based on YOLOv4 combined with an attention mechanism. The method achieves 90.45% mAP in the detection of multiple road traffic elements, which is 15.80% better than the original YOLOv4 network. The experimental results verify that the method in this paper provides a new idea for updating and improving the basic traffic geographic information database.
However, the method in this paper also has shortcomings. The experiment focuses only on zebra crossings, bus stations, and roadside parking spaces, and the subsequent work will expand the datasets to complete automatic identification and detection of more elements.