Automatic Recognition of Road Damage Based on Lightweight Attentional Convolutional Neural Network

An efficient road damage detection system can reduce both the risk that road defects pose to motorists and the maintenance costs borne by traffic management authorities. To this end, this paper proposes a lightweight end-to-end road damage detection network for fast, automatic, and accurate identification and classification of multiple types of road damage. The proposed technique consists of a backbone network built from lightweight feature detection modules combined with a multi-scale feature fusion network, which is more beneficial for target identification and classification at different distances and angles than the approaches of other studies. An embedded lightweight attention module was also developed that enhances feature information by assigning weights to multi-scale convolutional kernels, improving detection accuracy with fewer parameters. The proposed model generally has higher performance and fewer parameters than other representative models. According to our practical tests, it can identify many types of road damage from images captured by vehicle cameras and meets the real-time detection requirements when deployed on mobile systems.


Introduction
Pavement damage due to road aging, traffic volume, construction materials, and weather [1,2] is an important threat to driving safety [3][4][5]. Therefore, pavement damage detection helps protect drivers' lives. In addition, road infrastructure is a vital national asset, and understanding its damage level is crucial for its subsequent maintenance [6]. Moreover, road damage detection technology plays a crucial role in the construction of intelligent transportation systems (ITS) and advanced driver-assistance systems (ADAS) [7,8].
Early road damage inspection relied on inspectors travelling along the road on foot or in slow-moving vehicles and visually inspecting the road surface. The inspection results were highly subjective, and the process was time-consuming. In addition, because this inspection had to be performed at slow speeds in the traffic lane, it exposed the staff to potential traffic hazards [9].
Subsequently, some organizations began using sensor-equipped inspection vehicles to collect pavement condition data. These vehicles require expensive equipment such as laser scanning cameras, road profilers, and 3D capture cameras, which significantly increases the cost of such systems [10][11][12][13]. The collected pavement data must then be processed on a workstation, which is still very time-consuming [14,15].
With the development of deep learning techniques, many researchers have started using neural network-based models for road damage detection. Most of these works use convolutional neural networks (CNNs) for pixel-level segmentation of road images. For example, Fan et al. [16] first used a CNN-based classification network to filter images containing cracks, after which the damage was extracted by traditional image processing, namely filtering with adaptive thresholding. Feng et al. [17], on the other hand, preprocessed the images to filter image noise, input them into two different crack segmentation models, and finally used the prediction results to calculate the geometric parameters of the cracks. Subsequently, Nguyen et al. [18] proposed a two-stage CNN network for low-resolution image detection and segmentation, which shortens the processing steps while increasing the efficiency of automated detection. Cheng et al. [19] proposed a computerized road crack detection method based on the structure of U-Net and introduced a distance transform function to assign pixel weights according to the minimum distance to the actual segmentation. Rill-García et al. [20], on the other hand, replaced the original backbone feature extraction network of U-Net (VGG16) with VGG19 to improve the accuracy of road crack segmentation in the presence of incorrect annotations. However, the above methods have certain limitations in three main areas.
• Most pavement damage detection efforts obtain crack results by pixel-level semantic segmentation, which requires high-quality input images captured close to the pavement. This increases the cost and reduces the efficiency of initial image acquisition and makes it difficult to meet the real-time warning required by ADAS.
• Although state-of-the-art work allows pixel-level segmentation of pavement cracks or potholes, it does not consider the classification of other pavement damage. We believe that identifying specific pavement damage types, such as longitudinal or transverse cracks, alligator cracks, and potholes, is essential when performing road damage detection.
• Most related work cannot be automated end-to-end or built as a lightweight network because it requires multi-stage operations, such as image pre-processing or post-processing.
Therefore, applying these research works to practical scenarios is very difficult considering these limitations. To address these issues, in this paper, we propose a lightweight end-to-end road damage detection network with the following main contributions:

1. We designed a backbone feature extraction network using a combination of lightweight feature detection modules to ensure efficient automatic feature extraction while making the model parameters smaller.
2. Our proposed multi-scale fusion network enriches the diversity of road damage features, improves the detection robustness of the algorithm at different scales, and facilitates detection efficiency when the distance and viewpoint change.
3. We propose a lightweight multi-branch channel attention network (LMCA-Net) for the road damage detection task. This embedded attention module can enhance feature information by assigning weights to multi-scale convolutional kernels depending on the object size, aiming to improve detection accuracy with fewer parameters.
Compared to other representative models, our proposed model generally performs better and has fewer parameters. Based on the images captured by vehicle cameras, it can also identify many types of road damage and meet the real-time detection requirements of mobile systems.

Road Damage Detection Methods Based on Traditional Image Processing
Koch and Brilakis [21] proposed a method to automatically detect pavement potholes, using histogram thresholding to segment the pavement damage region and elliptical regression of geometric features to determine the region of interest (ROI). Schiopu et al. [22] proposed a pothole detection method that can eliminate false detections caused by shadows of roadside objects: within the ROI, it sets a threshold value for the geometric features of potholes and labels candidate potholes through a decision tree.
Besides geometric morphology, some studies perform road damage detection from the perspective of image colour. For example, Jakštys et al. [23] outlined the edge contours of road potholes by analysing the B-component in the RGB colour space of road potholes in the ROI region. Akagic et al. [24], on the other hand, analysed the B-component in the RGB colour space by component to perform image segmentation of the asphalt pavement and then detected the pothole areas by processing methods such as cropping, Otsu thresholding, and boundary elimination.
Classical image processing to detect objects tends to segment the object from the background using thresholding, and most prior studies on road damage detection do the same. For example, Akagic et al. [25] proposed a pavement crack detection method based on a combination of the grayscale histogram and Otsu thresholding to search for pavement cracks by dividing the input image into sub-images after the ratio of the maximum histogram to the threshold value obtained. Sari et al. [26] brought results with reasonable accuracy by using the Otsu thresholding algorithm and Gray Level Co-occurrence Matrices (GLCM) for road crack feature detection and extraction, followed by the support vector machine (SVM) algorithm for experimental classification statistics.
Quan et al. [27] proposed an improved Otsu thresholding-based crack detection method that avoids the problem of peak prominence by modifying the weight factor and improves the accuracy compared to the original Otsu thresholding. Chung et al. [28] proposed a method to find the optimal threshold of the image using inverse binary and Otsu thresholding algorithm to meet the real-time pavement pothole detection. They applied the distance transformation of the image using the Watershed algorithm for calculating marker potholes.
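Since several of the works above build on Otsu thresholding, a minimal sketch of the basic (unmodified) algorithm may help. It is not the implementation of any cited study; the function name `otsu_threshold` and the toy pixel values are assumptions made for illustration only.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t that maximises the between-class variance.

    Pixels with value <= t form the dark class, the rest the bright class.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    w0 = 0      # number of pixels in the dark class
    sum0 = 0    # sum of grey levels in the dark class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Proportional to the between-class variance (constant 1/total^2 dropped).
        score = w0 * w1 * (mean0 - mean1) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


# Toy "image": dark crack pixels (around 40) on bright pavement (around 180).
toy_pixels = [38, 40, 42, 41, 39] * 10 + [178, 180, 182, 181, 179] * 30
threshold = otsu_threshold(toy_pixels)
crack_mask = [p <= threshold for p in toy_pixels]
```

On real pavement images the histogram is rarely this cleanly bimodal, which is one reason the cited studies modify or combine the basic algorithm.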
In addition, many studies used the boundary decision capability of SVM to classify road damage. For example, Hoang [29] used the least squares version of SVM (LS-SVM) for supervised learning to establish an automatic classification method for pavement potholes compared to single pavement pothole detection. Gao et al. [30] used a machine learning model based on the library of support vector machines (LIBSVM) to propose a fast detection method that distinguishes potholes, longitudinal cracks, transverse cracks, and complex cracks.
These classical image processing methods have performed very well in the past, as shown in Table 1, with the advantage of not requiring large datasets for manual annotation. However, there are some unavoidable problems, as most of the above methods employ techniques such as colour segmentation, threshold feature detection, SVM, etc., which are limited by illumination variations, occlusions, colour variations, and complex backgrounds. Moreover, the need to design feature algorithms leads to a single type of detected road damage.

Deep Learning-Based Road Damage Detection Methods
With the rapid development of deep learning and artificial intelligence, the CNN has become the mainstream technology for road damage detection. The detection methods are mainly divided into image classification, semantic segmentation, and object detection.
Image classification: The most typical CNN approaches to road damage detection and classification train a neural network consisting of convolutional and fully connected (FC) layers. For example, An et al. [31] classified images into two types, with or without potholes, by replacing the backbone feature extraction network in the CNN and comparing the accuracy of different backbone networks on colour and grayscale frames in a cross-sectional manner. Bhatia et al. [32] developed a method to predict whether an input thermal image is a pothole or a non-pothole, demonstrating that using the residual network as the backbone network can improve the model detection rate in night-time and foggy weather environments. Fan et al. [33] experimentally evaluated 30 CNNs for road crack image classification, where progressive neural architecture search (PNASNet) achieved the best balance between speed and accuracy. However, image classification only labels the image as a whole and does not detect the details of road damage in the image.
Semantic segmentation: To address the shortcoming that image classification only distinguishes images with or without road damage, and to detect road damage at the pixel level more intuitively, Pereira et al. [34] used U-Net for semantic segmentation of road and pothole images. Their network structure is divided into two parts, encoder and decoder, for feature extraction, feature fusion, and result prediction. Building on this, to design a more advanced semantic segmentation model with a higher detection rate, Fan et al. [35] proposed a novel semantic segmentation pothole detection method that used a spatial pyramid pooling module composed of cascaded atrous convolutions to integrate spatial contextual information after enhancing the feature extraction process with a channel attention mechanism, which helped to detect multi-scale road potholes. To address the problem of difficult road crack detection, Zhang et al. [36] improved AD-Net's cracked road detection performance by adding atrous convolution between the encoder and decoder and introducing depth supervision in the decoder stage. Fang et al. [37], on the other hand, improved the performance of AD-Net by configuring a Transformer block at the encoder layer and an external attention mechanism in the coding layer to enhance the feature representation capability and mitigate the impact of interference factors such as shadows and noise on the detection of road cracks.
Object detection: Object detection mainly includes localization and classification of road damage, and the main problem is to improve the accuracy of object localization and classification. At the same time, the processing speed of the whole process needs to be improved to meet the real-time detection conditions required by ITS. Many researchers have previously tried to contribute to these aspects based on classical networks such as SSD [38], Faster-RCNN [39], the YOLO series [40][41][42], and EfficientDet [43]. Wang et al. [44] proposed a method to detect and classify road damage that uses Faster-RCNN as the detection framework and ResNet-152 as the feature extraction network. However, the large number of overall parameters leads to slow processing speed, and the network does not have multi-scale detection capability. Yebes et al. [45], also based on Faster-RCNN, utilized ResNet101, which has a faster processing speed, as the feature extraction network. However, even after relaxing the IoU threshold to 0.4 for evaluation, the accuracy reached 75% while running at only 5-6 fps. Ukhwah et al. [46] and Dharneeshkar et al. [47] trained models based on the YOLOv3 detection framework and tested them on different pothole images, achieving good accuracy. Gupta et al. [48] used SSD and RetinaNet as detection frameworks, with ResNet34 and ResNet50 as feature extraction networks, to propose a method for pothole localization from thermal images that addresses the detection difficulties caused by unfavourable weather with low visibility.
However, most of the methods in Table 2 cannot localize and classify multiple classes of road damage objects, and most of the road images used for detection need to be taken vertically and close to the ground. Although these methods have good processing speed, their application to autonomous driving early warning systems is limited.
This paper presents a lightweight end-to-end road damage detection network that is designed to automatically, quickly, and efficiently detect and classify road damage. Figure 1 shows the flowchart of the proposed road damage detection algorithm, which first obtains a weight model by training a neural network and then feeds the road images captured by vehicle cameras into the weight model to get prediction results. In step 1, the Ghost module was used as the basic module of the backbone feature extraction network to keep the overall network lightweight. In step 2, the diversity of features was enriched using the proposed multi-scale feature fusion network to improve the algorithm's robustness in detecting damaged objects of different scales. In step 3, the resulting feature layers were fed into the lightweight multibranch channel attention network proposed in this paper. This embedded self-attentive module synthesizes feature information of different sizes with only a small number of operations.

Selection and Design of Backbone Network (Step 1)
For algorithmic systems designed for road damage detection, the memory and computational resources of the application device are limited, so the efficiency and light weight of the network are crucial, and reducing the network's computation while ensuring accuracy is one of the main focuses of this research. Since the overall computation of a neural network depends mainly on the number of parameters of the backbone network, it is essential to choose a lightweight backbone network. For example, MobileNet [49] and ShuffleNet [50], with their depthwise convolution or channel shuffle operations, modify only the convolution itself and achieve their light weight through small convolution kernels. The primary role of a backbone network is to extract feature maps, and an image yields many feature maps after passing through each convolution block. Still, many of these feature maps are often highly similar, with almost no variation. Such similar feature maps do not improve the network's performance, yet they require many convolutional layer calculations and consume a lot of computational resources.
In contrast, GhostNet [51] took a different approach: it obtains one of a pair of similar feature maps by applying a cheap transformation to the other, so that one can be regarded as a phantom (ghost) of the other. The Ghost module generates these phantom feature maps with cheap operations, so it produces the same number of feature maps with fewer parameters and fewer computational resources than an ordinary convolutional layer. In this paper, the Ghost module was chosen as the basis of the backbone network because it can improve the execution speed of the model while ensuring efficient feature extraction.
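To make the parameter saving concrete, the following sketch counts the weights of an ordinary convolution and of a Ghost module that produces the same number of output maps. It is an illustrative calculation, not the authors' code: the ratio of 2, the 3 × 3 depthwise cheap operation, the omission of biases, and the example channel counts are all assumptions.

```python
def conv_params(c_in, c_out, k):
    """Number of weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def ghost_module_params(c_in, c_out, k, ratio=2, cheap_k=3):
    """Number of weights in a Ghost module with the same output width.

    An ordinary convolution produces c_out // ratio intrinsic maps; the
    remaining maps are generated by cheap depthwise operations, each of
    which needs a single cheap_k x cheap_k filter.
    """
    intrinsic = c_out // ratio
    ghosts = c_out - intrinsic
    primary = conv_params(c_in, intrinsic, k)
    cheap = ghosts * cheap_k * cheap_k
    return primary + cheap


# Example layer: 40 input channels, 112 output channels, 3 x 3 kernels.
standard = conv_params(40, 112, 3)
ghost = ghost_module_params(40, 112, 3)
saving = standard / ghost  # close to the chosen ratio of 2
```

With these assumed settings the Ghost module needs roughly half the weights of the ordinary layer, and the saving approaches the chosen ratio as the input channel count grows.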
Each feature block of the backbone network consists of two bottleneck structures, A and B, connected in series, as shown in Figure 3. Bottleneck A does not compress the height and width of the input feature layers and uses two Ghost modules for feature extraction and a residual connection to optimize the network. In bottleneck B, a depthwise convolution was added between the Ghost modules and in the residual branch to compress the height and width of the input feature layer. The specific details are shown in Table 3.
The input image was first processed by one standard convolution, and the resulting feature map was passed into the backbone network. The feature map can be compressed and deepened by one Bottleneck B and multiple Bottleneck A blocks according to the characteristics of the two bottleneck structures mentioned above. We took the last three convolutional layers, namely Conv.4, Conv.5, and Conv.6, with shapes (52, 52, 40), (26, 26, 112), and (13, 13, 160), respectively. These three feature layers carry multi-scale feature information and cover three sizes that can be applied to objects near and far.

Multi-Scale Feature Fusion Network (Step 2)
Fusing features at different scales is an important way of improving detection performance. Shallow features have higher resolution and contain more location and detail information, but they are less semantic and noisier. Deeper features have more robust semantic information but have low resolution and poor perception of details. The sizes of road damage detection targets vary, so the proposed multi-scale feature fusion network combines features from different depths. The multi-scale feature fusion process is shown in Table 4, and the new feature layers (13, 13, 512) of Conv.7, (26, 26, 256) of Conv.8, and (52, 52, 128) of Conv.9 were finally obtained. Among them, Conv.7 was obtained directly by expanding the number of channels of Conv.6. Conv.9 was generated by stacking Conv.4 with the upsampled Conv.5. Conv.8 was generated by stacking Conv.5 with the downsampled Conv.4 and the upsampled Conv.6. Such a feature fusion design can deepen the feature network and further enrich the diversity of features.
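The fusion described above can be checked at the level of tensor shapes. The sketch below tracks only (height, width, channels) tuples. It assumes that resampling changes the spatial size by a factor of 2 without changing the channel count and that a final convolution (here called `project`, a hypothetical helper) sets the output channels; the intermediate channel counts follow from these assumptions and are not stated in the text.

```python
def upsample(shape, factor=2):
    h, w, c = shape
    return (h * factor, w * factor, c)


def downsample(shape, factor=2):
    h, w, c = shape
    return (h // factor, w // factor, c)


def stack(*shapes):
    """Concatenate along the channel axis; spatial sizes must agree."""
    h, w, _ = shapes[0]
    assert all(s[:2] == (h, w) for s in shapes), "spatial sizes must match"
    return (h, w, sum(s[2] for s in shapes))


def project(shape, channels):
    """Stand-in for a convolution that only changes the channel count."""
    h, w, _ = shape
    return (h, w, channels)


# Backbone outputs for a 416 x 416 input.
conv4, conv5, conv6 = (52, 52, 40), (26, 26, 112), (13, 13, 160)

conv7 = project(conv6, 512)
conv8_stacked = stack(conv5, downsample(conv4), upsample(conv6))
conv8 = project(conv8_stacked, 256)
conv9_stacked = stack(conv4, upsample(conv5))
conv9 = project(conv9_stacked, 128)
```

The `stack` helper fails loudly if two inputs disagree in height or width, which is the usual failure mode when wiring such a fusion network by hand.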

Lightweight Multibranch Channel Attention Network (Step 3)
Receptive fields of different sizes, i.e., different convolutional kernel sizes, have different effects on objects of different scales. In CNN design, attention mechanisms are often used to assign weights to convolutional kernels to improve their ability to distinguish information. Because object sizes vary in road damage detection, multi-scale convolutional kernels are critical to obtaining more feature information. In this paper, we propose LMCA-Net, a lightweight attention module embedded in the network to improve detection efficiency. The overall structure is shown in Figure 4.
Firstly, the input feature map $F$ was convolved with convolution kernels of size 3 × 3, 5 × 5, and 7 × 7 to obtain three feature maps, $F_1$, $F_2$, and $F_3$, which were then summed to obtain $F'$ of shape C × H × W as in Equation (1).
$$F' = \sum_{n \in \{3, 5, 7\}} \mathrm{DConv}_{n \times n}(F) = F_1 + F_2 + F_3 \tag{1}$$
where DConv is the dilated convolution and $n$ is the convolution kernel size. Next, average pooling was performed along the H and W dimensions. Finally, a 1D vector of information about the feature dimension was obtained, with the shape of C × 1 × 1, as in Equation (2).
$$s_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F'_c(i, j) \tag{2}$$
where $s_c$ is the pooled value of channel $c$. Such a vector can express the importance of the information of each channel. Next, a 1D convolution was used to map the original C dimension into Z dimension information. Following that, three 1D convolutions were used to change from the Z dimension back to the original C. This completes the information extraction of the channel dimension. Compared with the fully connected layer in a linear transform, 1D convolution can effectively capture the information of cross-channel interactions while significantly reducing the number of parameters [52]. Softmax was used for normalization. At this point, each channel corresponds to a score representing its importance, which is equivalent to a mask, as shown in Figure 5.
Finally, the three separately obtained masks are multiplied by the corresponding $F_1$, $F_2$, and $F_3$ to obtain $F'_1$, $F'_2$, and $F'_3$. These three feature modules are summed and combined with the residual of the original feature $F$ for information fusion to obtain the final feature module $F''$ as in Equation (3), which has been refined compared with the original $F$ and fused with information from multiple receptive fields.
$$F''_c = F_c + a_c \cdot F_{1,c} + b_c \cdot F_{2,c} + c_c \cdot F_{3,c}, \quad a_c + b_c + c_c = 1 \tag{3}$$
where $a_c$, $b_c$, and $c_c$ are the weights obtained after the Softmax function's normalization, whose sum is 1.
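The weighting and fusion steps described above can be sketched on small nested lists. This is a simplified illustration rather than the LMCA-Net implementation: the three branch outputs are constant stand-ins instead of real dilated convolutions, and the learned 1D convolutions are replaced by a fixed per-branch scale (`branch_scales`, a hypothetical placeholder). Only the pooling, the Softmax across branches, and the residual fusion follow the text.

```python
import math


def add_maps(*maps):
    """Elementwise sum of several C x H x W feature maps (nested lists)."""
    return [[[sum(m[c][i][j] for m in maps)
              for j in range(len(maps[0][0][0]))]
             for i in range(len(maps[0][0]))]
            for c in range(len(maps[0]))]


def global_average_pool(fmap):
    """Average every channel over H and W, giving one value per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]


def branch_softmax(logits):
    """Softmax across branches, computed separately for each channel.

    logits[b][c] is the score of branch b for channel c.
    """
    n_branches, n_channels = len(logits), len(logits[0])
    weights = [[0.0] * n_channels for _ in range(n_branches)]
    for c in range(n_channels):
        peak = max(logits[b][c] for b in range(n_branches))
        exps = [math.exp(logits[b][c] - peak) for b in range(n_branches)]
        total = sum(exps)
        for b in range(n_branches):
            weights[b][c] = exps[b] / total
    return weights


def fuse(residual, branches, weights):
    """Weight each branch per channel, sum the branches, add the residual."""
    fused = []
    for c, channel in enumerate(residual):
        rows = []
        for i, row in enumerate(channel):
            rows.append([
                value + sum(weights[b][c] * branches[b][c][i][j]
                            for b in range(len(branches)))
                for j, value in enumerate(row)
            ])
        fused.append(rows)
    return fused


def constant_map(value, channels=2, height=2, width=2):
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


# Stand-ins for the input F and the outputs of the 3x3, 5x5 and 7x7 branches.
F = constant_map(1.0)
F1, F2, F3 = constant_map(1.0), constant_map(2.0), constant_map(3.0)

pooled = global_average_pool(add_maps(F1, F2, F3))  # one value per channel
# Placeholder for the learned 1D convolutions: a fixed scale per branch.
branch_scales = (0.1, 0.2, 0.3)
logits = [[scale * s for s in pooled] for scale in branch_scales]
weights = branch_softmax(logits)
F_refined = fuse(F, [F1, F2, F3], weights)
```

Because the Softmax runs across the three branches for each channel, the weights of a channel always sum to 1, so the module redistributes emphasis between kernel sizes rather than rescaling the features as a whole.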

Dataset and Experimental Environment
The road damage detection network proposed in this paper was evaluated using the Global Road Damage Detection Challenge (GRDDC'2020) dataset [53]. The dataset consists of 21,040 annotated images containing damage information collected from three countries, Japan, India, and the Czech Republic, with road damage information composed of the coordinates of bounding boxes and labels describing the type of damage associated with the boxes. We randomly divided the training and validation sets into 18,936 and 2,104 images in a 9:1 ratio. In our experiments, a total of eight types of damage were selected as the detection objects to explore the detection efficiency of the proposed method for multiple types of road damage. Table 5 shows the specific road damage types and their definitions. Figure 6 shows the percentage of ground truth for each object in the dataset.
This paper's experimental setup is summarized in Table 6. The experiment was built using the TensorFlow2 framework, and the results were computed using CUDA kernels. The hardware mainly consisted of a high-performance workstation host. The workstation was equipped with an Intel(R) Core(TM) i5-11400F processor and an RTX 3050 graphics card.
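The 9:1 random split described above can be reproduced along the following lines. The seed, the function name `split_dataset`, and the image identifiers are assumptions for illustration; the text does not say how the authors drew the split.

```python
import random


def split_dataset(items, val_fraction=0.1, seed=0):
    """Shuffle items reproducibly and split them into (train, validation)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_val = round(len(items) * val_fraction)
    return items[n_val:], items[:n_val]


# Hypothetical identifiers standing in for the 21,040 annotated images.
image_ids = [f"img_{i:05d}" for i in range(21040)]
train_ids, val_ids = split_dataset(image_ids)
```

Fixing the seed keeps the validation set identical across runs, which matters when several backbones are compared on the same split.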

Evaluation Metrics and Experimental Details
To test the model's performance, we evaluated it with the average precision (AP) as in Equation (7), the mean average precision (mAP) as in Equation (8), and the F1 score as in Equation (6). The larger the value of these metrics, the higher the agreement of the prediction results with the ground truth.
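The AP is calculated using the interpolated average precision metric, i.e., the area under the precision-recall curve. The equations for precision and recall are shown in Equations (4) and (5).
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{4}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{5}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{6}$$
where T/F denotes true/false, which indicates whether the prediction is correct, and P/N denotes positive/negative, which indicates a positive or negative prediction result.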
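As a sketch of how these metrics are computed in practice, the functions below take counts of true positives, false positives, and false negatives, plus a precision-recall curve for AP. The all-point interpolation used in `average_precision` is an assumption (it is the common Pascal VOC style); the text does not specify which interpolation scheme was used.

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0


def average_precision(recalls, precisions):
    """Area under the interpolated precision-recall curve.

    recalls must be sorted in increasing order; precisions[i] is the
    precision measured at recalls[i].
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Interpolate: each precision becomes the maximum precision found
    # at any equal or higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)


# Toy example: 8 correct detections, 2 false alarms, 2 missed objects.
p = precision(tp=8, fp=2)
r = recall(tp=8, fn=2)
f1 = f1_score(p, r)
ap = average_precision([0.5, 1.0], [1.0, 0.5])
```

In a detection setting, a prediction usually counts as a true positive only when its IoU with a ground-truth box of the same class exceeds a chosen threshold, so the counts fed into these functions depend on that threshold.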
$$AP = \sum_{i=1}^{n-1} (r_{i+1} - r_i)\, P_{\mathrm{interp}}(r_{i+1}) \tag{7}$$
$$mAP = \frac{1}{N} \sum_{k=1}^{N} AP_k \tag{8}$$
where $n$ denotes the number of detection points, $P_{\mathrm{interp}}(r)$ represents the interpolated precision at a recall of $r$, and $N$ is the number of damage classes, with $AP_k$ the AP of class $k$.
The hyperparameters set in the training process are shown in Table 7, where the input image size is 416 × 416, the image batch size is 16, the overall training is 500 epochs, the maximum learning rate of the model is 0.01, and the minimum learning rate is 0.0001. Cosine annealing [54] was used as the learning rate descent method, CIoU [55] was used as the loss function, and mosaic and mix-up methods were used for data augmentation. Using anchor-based prediction, a total of nine prior boxes were set for the three output feature layers, i.e., each feature layer had three different sizes of prior boxes for adjustment in prediction. The sizes of the anchor boxes were calculated by K-means clustering of the dataset, as in Figure 7, and the anchor mask is shown in Table 8.
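The learning rate range and epoch count given above translate into a cosine annealing schedule roughly as follows. This is a minimal sketch that assumes a single decay from the maximum to the minimum rate over all 500 epochs, with no warm-up and no restarts; the actual training script may differ on both points.

```python
import math


def cosine_annealing_lr(epoch, total_epochs=500, lr_max=0.01, lr_min=0.0001):
    """Learning rate for a 0-based epoch index under one cosine decay."""
    progress = epoch / (total_epochs - 1)  # 0.0 at the start, 1.0 at the end
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))


schedule = [cosine_annealing_lr(e) for e in range(500)]
```

The rate falls slowly at first, fastest around the middle of training, and slowly again near the end, which avoids the abrupt drops of a step schedule.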

Ablation Experiments
To validate the rationality and effectiveness of our proposed network, the ablation experiments examine the effect of different backbone networks and module combinations on the results. For a fair comparison, the training dataset, input image size, relevant hyperparameters, training strategy, and experimental environment were the same in all ablation experiments, except for the parameters of the added modules.

First, to evaluate the proposed approach, we replaced the backbone network in our model with several classical backbone networks and compared the results. We selected six types of widely used representative deep neural networks and defined them as baseline models. Among them, Mobilenetv1 [49], Mobilenetv2 [56], and Mobilenetv3 [57] were lightweight backbone networks with faster processing speeds. VGG16 [58] featured a simple structure and was widely used as a feature extraction network for various classical CNN models. Resnet50 [59] had a deeper network that could achieve higher accuracy. Densenet121 [60], on the other hand, achieved feature reuse and improved efficiency through connections on the channel dimension. However, VGG16, Resnet50, and Densenet121 all had many parameters and were very computationally expensive. These baseline models were developed for specific purposes, and all performed well in prior studies, so we compared them with the approach proposed in this paper. Figure 8 shows the trend of the loss function and mAP for each model over 500 training epochs. In Figure 8a, it can be seen that the loss function decreases and converges for every backbone configuration, which supports the reasonableness of the network. It was also found that the proposed method converges faster than the other backbone networks. On the other hand, in Figure 8b, except for VGG16, the mAP of all networks steadily increases as training proceeds. It is worth noting that the mAP in Figure 8b is obtained from the validation set with conservative parameters chosen to speed up the evaluation; it serves only to visualize the change in mAP during training, and the actual mAP for each model is shown in Table 9.
The comparative results in Table 9 show that VGG16 has the lowest detection accuracy, with a mAP of only 0.34, and the largest number of parameters. Among the remaining models, Resnet50 has the highest combined accuracy, but its parameter count is second only to that of VGG16. Our proposed method is similar to the MobileNet series in terms of combined accuracy but is less complex, with FLOPs of only 6.633 G, and it achieves the best detection results for classes such as crosswalk blur (D43) and rutting, bump, and pothole (D40). Figure 9 gives a more intuitive and comprehensive comparison of model performance. The horizontal axis represents model complexity, which is used to assess the algorithm's speed; the vertical axis represents the model's combined accuracy; and the size of each sphere represents the model's number of parameters. Our proposed method lies closer to the upper left than the other methods, indicating better overall performance than the other networks.
We also examined the effect of different module combinations on the results. As shown in Table 10, when the baseline algorithm extracted features using only the backbone network and output results without any additional modules, the mAP was only 31.3%. Adding the multi-scale feature fusion network improved the result to 47.8%. In addition, to assess the performance of our proposed attention mechanism, LMCA-Net, we selected four representative attention mechanisms and configured each of them in step 3 of the network.
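Since the ablation results are reported as mAP, a minimal sketch of how average precision and its mean over classes can be computed may be useful. It assumes all-point interpolation of the precision-recall curve and detections that have already been matched to ground truth; the exact AP variant and matching rule used in these experiments are not stated here, and the function names and toy inputs are illustrative only.

```python
def average_precision(scored_hits, num_ground_truth):
    """Area under the precision-recall curve (all-point interpolation).

    scored_hits: list of (confidence, is_true_positive) pairs for one class.
    num_ground_truth: number of annotated objects of that class.
    """
    if num_ground_truth == 0:
        return 0.0
    # Rank detections from most to least confident.
    ranked = sorted(scored_hits, key=lambda pair: pair[0], reverse=True)
    precisions, recalls = [], []
    tp = fp = 0
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    # Replace each precision with the maximum precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangle areas wherever recall increases.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class):
    """per_class maps a class name to (scored_hits, num_ground_truth)."""
    aps = [average_precision(hits, n) for hits, n in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0


# Hypothetical toy detections for two damage classes (not data from the paper).
toy = {
    "D40": ([(0.9, True), (0.8, False), (0.7, True)], 2),
    "D43": ([(0.9, True), (0.8, True)], 2),
}
toy_map = mean_average_precision(toy)
```

A false positive ranked above a true positive lowers the AP of that class, which is why a module that sharpens confidence ranking can raise mAP even when the set of detected objects barely changes.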
For a fair comparison, LMCA-Net and the four widely used methods were evaluated with the same dataset, input image size, relevant hyperparameters, training strategy, and experimental settings. The experimental results are shown in Figure 10, where it can be seen that almost all methods improve detection accuracy. Owing to its multi-scale receptive field, our approach achieves higher detection accuracy for road damage objects of varying sizes than the channel attention methods SENet [61] and ECA-Net [62], which focus on learning correlations between channels. CBAM [63], which combines channel and spatial attention, and SK-Net [64], which applies attention across convolutional kernels, reach the same level of accuracy, but our method has fewer parameters and shows a stronger trend of performance improvement.

As can be seen from the heat maps in Figure 11, all methods achieve good attention results on crosswalk and white line blur objects because of their large size. For longitudinal linear crack detection, our proposed LMCA-Net is slightly inferior to CBAM. For small objects such as bumps and potholes, LMCA-Net achieves the same attention effect as CBAM and SK-Net. When multiple objects appear together, our method accurately generates more highlighted regions for multi-scale objects. It can thus achieve the same attention effect as the larger models with fewer parameters.
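To make the idea of weighting multi-scale convolution kernels concrete, the following is a minimal sketch of a selective-kernel style fusion, in which each channel receives softmax weights across the outputs of several kernel-size branches. It is not the LMCA-Net implementation: the function names, the use of per-channel logits, and the toy values are assumptions made for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_branches(branch_features, branch_logits):
    """Fuse branch outputs with per-channel softmax weights across branches.

    branch_features: list of B lists, each holding C channel descriptors
                     (for example, one branch per convolution kernel size).
    branch_logits:   list of B lists, each holding C attention logits.
    Returns a list of C fused channel values.
    """
    num_channels = len(branch_features[0])
    fused = []
    for c in range(num_channels):
        # Weights across branches sum to 1 for every channel.
        weights = softmax([logits[c] for logits in branch_logits])
        fused.append(sum(w * feats[c] for w, feats in zip(weights, branch_features)))
    return fused


# Hypothetical example: two branches (small and large kernel), two channels.
features = [[1.0, 2.0], [3.0, 4.0]]
uniform = fuse_branches(features, [[0.0, 0.0], [0.0, 0.0]])
selective = fuse_branches(features, [[50.0, 0.0], [0.0, 0.0]])
```

With equal logits the branches are simply averaged, while a large logit lets one kernel size dominate a channel; this is the mechanism by which such a module can favor small kernels for potholes and large kernels for crosswalk blur.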
Finally, we compared the road damage detection algorithm proposed in this paper with the widely used object detection algorithms SSD [39], Faster-RCNN [40], the YOLO series [41][42][43], and EfficientDet [44]. The quantitative experimental results, obtained with the same dataset and model training methods, are shown in Table 11.

The models in the experiments were also compared after replacing the backbone network, as shown in Figure 12. Almost all the lightweight models were less accurate than our proposed method, because after switching to a smaller network the overall representational capacity is insufficient to capture the detailed features of each detection target. Accuracy increases with the complexity of the network structure; EfficientDet-D4 and YOLOX-L in particular are very accurate, but their parameter counts reach 20.56 M and 54.15 M, respectively. The proposed approach has only 11.04 M parameters, and its model complexity and parameter count are substantially lower than those of other methods with comparable accuracy, owing to the rational network design and the lightweight attention mechanism. The lower the model complexity, the less processing power is required and the faster the prediction. The proposed model runs at 31 frames per second, which is not the fastest among the compared models but is sufficient for real-time detection.
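As a rough illustration of what these figures mean for deployment, the sketch below converts the reported parameter counts into approximate weight storage and the reported frame rate into a per-frame time budget. The 4 bytes per parameter (32-bit floats) is an assumption, since the numeric precision of the deployed models is not stated, and the helper names are hypothetical.

```python
def float32_size_mb(params_millions):
    """Approximate weight storage in MB, assuming 4 bytes per parameter."""
    return params_millions * 4.0


def frame_latency_ms(fps):
    """Average time budget per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


# Parameter counts (in millions) reported in the text above.
sizes = {
    "Proposed": float32_size_mb(11.04),         # about 44.16 MB
    "EfficientDet-D4": float32_size_mb(20.56),  # about 82.24 MB
    "YOLOX-L": float32_size_mb(54.15),          # about 216.6 MB
}
latency = frame_latency_ms(31)  # about 32.26 ms per frame
```

Under this assumption the proposed model needs roughly half the weight storage of EfficientDet-D4 and about one fifth that of YOLOX-L, and 31 frames per second corresponds to a little over 32 ms per frame.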
To compare the models visually, Figure 13 shows detection results for five representative lightweight or high-accuracy models: YOLOv4-Mobilenetv2, EfficientDet-D0, YOLOX-L, EfficientDet-D4, and our approach. The examples were taken from test-set images covering the main types of road damage, including transverse and longitudinal linear cracks, alligator cracks, bumps, potholes, and crosswalk and lane line blur. It can be seen that our proposed method outperforms the other models in terms of both classification and confidence scores. The lightweight models YOLOv4-Mobilenetv2 and EfficientDet-D0, which have similar numbers of parameters, miss more pavement damage such as cracks and potholes. Compared with the more accurate representative models YOLOX-L and EfficientDet-D4, our proposed method not only has a clear advantage in the number of parameters but also classifies small targets such as potholes and transverse linear cracks better.

Table 11. Note that our approach obtains higher accuracy while having less model complexity.
Finally, to evaluate the performance of the proposed model on real roads outside the training dataset, it was tested on 1920 × 1080 pixel images obtained from in-vehicle dashboard cameras in Korean urban and suburban areas. Using the same non-maximum suppression method with a threshold of 0.5 to remove redundant bounding boxes, the video processing speed reaches about 30 fps. Because of the vehicle's high-speed movement and road bumps, some camera images are out of focus, which leads to missed detections. However, as the consecutive-frame detection results in Figure 14 show, even when a target is missed in one frame, a more comprehensive detection result can be obtained by combining multiple frames. In addition, detection is not limited to the lane in which the recording vehicle is traveling; damage in adjacent lanes can also be detected. These visualization results indicate that our proposed model has comprehensive detection capability.
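The box filtering step mentioned above can be sketched as greedy non-maximum suppression. The sketch treats the 0.5 threshold as an IoU cutoff, which is an assumption, and the box format, function names, and toy detections are illustrative rather than taken from the authors' implementation.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (box, score) pairs for a single class.

    Keeps the highest-scoring box, discards every remaining box whose IoU
    with it reaches the threshold, and repeats on what is left.
    """
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining if iou(best[0], d[0]) < iou_threshold]
    return kept


# Hypothetical detections: two overlapping boxes on one pothole, one elsewhere.
detections = [
    ((0, 0, 10, 10), 0.9),
    ((1, 1, 11, 11), 0.8),
    ((50, 50, 60, 60), 0.7),
]
kept = nms(detections)
```

In this toy case the second box overlaps the first with an IoU of about 0.68 and is suppressed, while the distant third box is kept.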

Conclusions
In this paper, we designed a lightweight end-to-end road damage detection network that quickly, automatically, and accurately identifies and classifies specific types of road damage. Such an efficient road damage detection method can reduce the risk that road damage poses to drivers and reduce the budget for road maintenance work. This study's primary contributions are as follows. (1) A feature extraction and multi-scale fusion network that is more advantageous for target recognition and classification at diverse distances and angles. (2) An embedded lightweight attention module that improves detection accuracy with fewer parameters than previous studies by assigning weights to multi-scale convolution kernels. The results of the experiments covering backbone networks, attention mechanisms, and other widely used object detection models show that our approach achieves significant performance improvements with little computation. When applied to real-world tests on high-definition road images, the detection frame rate can be maintained at about 30 fps, so continuous frames can be detected and classified in real time. However, the algorithm has some limitations, such as false detections in shadowed areas, missed detections, and a lack of exploration of detection at night or in low light. Overall, this work provides new ideas for road damage detection and for efforts to make detection models lightweight. In the future, it will be necessary to further enrich the diversity of detection environments and to explore the integration of road damage detection with other monitoring, warning, and tracking techniques.
