An Enhanced Lightweight Network for Road Damage Detection Based on Deep Learning

Abstract: Achieving accurate and efficient detection of road damage in complex scenes has always been a challenging task. In this paper, an enhanced lightweight network, E-EfficientDet, is proposed. Firstly, a feature extraction enhancement module (FEEM) is designed to increase the receptive field and improve the feature expression capability of the network, so that it can extract richer multi-scale feature information. Secondly, to promote the reuse of feature information between different layers in the network and take full advantage of multi-scale context information, four pyramid modules with different structures are designed based on the idea of semi-dense connection, among which the bidirectional feature pyramid network with longitudinal connection (LC-BiFPN) is more suitable for road damage detection. Finally, to support road damage detection tasks under different hardware resource constraints, the E-EfficientDet-D0~D2 networks are proposed based on the compound scaling strategy. Experimental results show that the detection accuracy of E-EfficientDet-D0 improves by 2.41% compared with the original EfficientDet-D0 on the publicly available road damage dataset and outperforms other networks such as YOLOv5s, YOLOv7-tiny, YOLOv4-tiny, Faster R-CNN, and SSD. Meanwhile, the detection speed of E-EfficientDet-D0 reaches 27.0 FPS, which meets the demand for real-time detection, and the model size is only 32.31 MB, which is suitable for deployment on mobile devices such as unmanned inspection carts, UAVs, and smartphones. In addition, the detection accuracy of E-EfficientDet-D2 reaches 57.51%, which is 4.39% higher than that of E-EfficientDet-D0, and the model size is 61.78 MB, which is suitable for practical application scenarios that require higher detection accuracy and have better hardware performance.


Introduction
As public infrastructure, highways are closely related to the development of the national economy and people's livelihoods. As roads are affected by environmental temperature changes, traffic load, rainwater erosion, and other factors during operation, the road surface may develop cracks, potholes, and other types of damage. If such damage is not found and repaired at an early stage, small defects will continue to expand into larger ones, which not only affects the appearance of the pavement and driving comfort but also poses serious safety hazards to traffic. Therefore, timely detection and maintenance of the damage is particularly important, since it reduces both later maintenance costs and the occurrence of traffic accidents.
In the early days, road damage detection was carried out by manual inspection, which was not only costly and inefficient but also depended on the subjective judgment of inspectors. In addition, at intersections or on highways with high traffic flow, manual inspection could not guarantee the personal safety of inspectors. To solve this problem, researchers have explored automated detection of pavement distress using edge and morphology detection [1][2][3], wavelet transform [4,5], and grayscale threshold segmentation [6,7]. However, these methods are easily affected by factors such as light, shadow, occlusion, and water stains in actual detection scenarios, and most of them analyze only a single feature of the distress, resulting in poor robustness and generalization performance that cannot meet the needs of today's practical applications.
In recent years, with the improvement of computer hardware processing power and the rapid development of deep learning technology, the Convolutional Neural Network (CNN) [8] has been gradually applied to the field of road damage detection. By feeding a large number of damage images into a convolutional neural network, fully extracting the shallow texture features and deep high-level semantic features of the damage, suppressing irrelevant background noise, and training a damage detection network with high generalization performance and high detection accuracy, the efficiency of road damage detection can be significantly improved.
Common object detection algorithms are classified into single-stage algorithms and two-stage algorithms. The two-stage algorithms, represented by Faster R-CNN [9], R-FCN [10], and Mask R-CNN [11], first generate candidate regions using the Region Proposal Network (RPN) and then classify and regress the candidate regions using convolutional neural networks. Xu et al. [12] achieved good detection performance with only a small number of crack images by combining the training strategies of Faster R-CNN and Mask R-CNN. However, this method has a slow detection speed, and its generalization performance needs to be improved. He et al. [13] proposed a pavement distress detection method based on Mask R-CNN and transfer learning, which has high detection accuracy. However, the scale of the distress is too small to measure the performance of the model comprehensively, and the detection speed is too low to meet real-time detection requirements. The above two-stage algorithms have an advantage in detection accuracy, but their detection speed is slow and cannot meet the requirements of real-time detection.
The single-stage algorithms, represented by SSD [14] and YOLO [15][16][17], treat target detection as a regression task, eliminate the tedious region selection step, and obtain the category probability and position coordinates of the target directly from the image, which greatly improves the detection speed and better fits real-time detection scenarios. In addition, the Focal Loss function [18] addresses the imbalance between positive and negative samples, which in turn compensates for the lower detection accuracy of single-stage detection algorithms. Lu et al. [19] used an improved SSD network for crack detection. The detection accuracy is improved while meeting the demand for real-time detection. However, the detection of multi-scale cracks in complex scenes needs to be improved. Suong et al. [20] proposed a YOLOv2-based pavement detection method, which has a fast detection speed but less satisfactory detection accuracy. Cao et al. [21] conducted an extensive evaluation of several deep learning-based pavement distress detection methods. Each detection model was trained on 9493 images from multiple databases, and the experimental results showed that the SSD-based detection method achieved the best balance of detection accuracy and speed. However, it is less satisfactory for the detection of small-scale distress.
In response to the problems of multiple categories and large scale differences in highway pavement distress, researchers began to explore how to design fusion networks suited to pavement distress features. Li et al. [22] designed a novel multi-scale convolutional feature fusion module that was able to introduce high-level features from different convolutional layers directly into low-level features, fully exploiting the advantages of feature information at different scales and improving detection accuracy. Nevertheless, the multi-scale convolutional feature fusion module has only a top-down information flow, resulting in the loss of spatial detail information for features at higher levels. Qu et al. [23] proposed a distress detection method with hierarchical feature fusion and connected attention structures: first, a modified DCA-SE-ResNet-50 was used to construct a backbone network; second, a feature fusion module was proposed, which combined depthwise separable convolution and dilated convolution to recover the spatial detail information of the distress and fused the feature maps of low-level and high-level convolutional layers at multiple levels to improve the detection accuracy. However, the high complexity of the network leads to a slow inference speed. Therefore, it is important to design a feature pyramid module suitable for pavement distress detection.
The regression-based damage detection techniques mentioned above have achieved good results in detection accuracy and detection speed after rapid development in recent years, but the number of network parameters is too large for deployment on mobile devices such as smartphones, UAVs, and unmanned inspection vehicles. Therefore, some researchers have focused on lightweight networks to explore how to achieve the best balance between detection accuracy, detection speed, and model size.
Wang et al. [24] proposed a lightweight crack detection method based on bilateral networks that achieved a good balance between inference speed and detection performance. Although the detection speed has improved, some detection accuracy has been sacrificed. In addition, the network is still relatively complex, which is not conducive to deployment on mobile devices. Guo et al. [25] proposed an improved YOLOv5 pavement distress detection method that uses the lightweight MobileNetV3 instead of YOLOv5's backbone network, which reduced the number of model parameters. Meanwhile, a coordinate attention mechanism was introduced to help the network locate the distress target more accurately and improve detection accuracy. Yu et al. [26] proposed a real-time distress detection algorithm based on an improved YOLOv4, which used focal loss to optimize the overall loss and improve the detection accuracy; meanwhile, a pruning algorithm was introduced to reduce the complexity of the model and improve the detection speed. Although it meets the demand for real-time detection, the model is still too complex and consumes too many resources, which makes it unsuitable for deployment on embedded devices.
Based on the above analysis, most existing road damage detection methods, although they have high detection accuracy and meet the demand for real-time detection, are too large and complex to be deployed on mobile devices with limited computing resources. Therefore, to overcome these problems, an enhanced lightweight network, E-EfficientDet, is proposed based on EfficientDet, which achieves a better balance between detection accuracy, detection speed, and model size.
The main contributions of this paper are as follows:
• A novel and enhanced lightweight network, E-EfficientDet, is proposed, and its performance is evaluated on the road damage dataset published by the Global Road Damage Detection Challenge 2020. The experimental results verify the effectiveness of the method proposed in this paper.
• An asymmetric convolutional block (ACB) is introduced, and a FEEM is designed to increase the receptive field and improve the feature representation capability of the network, which can consequently extract richer multi-scale feature information.
• Based on the idea of semi-dense connectivity, a feature pyramid module is proposed that is more suitable for road surface damage detection and more effectively incorporates multi-scale contextual semantic information.
The rest of this paper is organized as follows: Section 2 describes the framework of our proposed detector. Section 3 verifies the effectiveness of E-EfficientDet and provides the experimental results and related analysis. Section 4 gives the conclusion of this paper.

Materials and Methods
In order to balance detection accuracy, detection speed, and model size and to solve the problem of difficult detection caused by the many categories and large scale differences of road damage in complex environments, an enhanced lightweight network, E-EfficientDet, is proposed based on EfficientDet [27], and the overall framework of the network is shown in Figure 1. Firstly, the FEEM is designed to improve the feature expression capability and increase the receptive field of the network without reducing the image resolution, so as to fully extract the multi-scale feature information of road surface damage. At the same time, considering that low-level detail information is particularly important for damage detection, an LC-BiFPN module is proposed based on the idea of semi-dense connection, which more effectively integrates the feature information of different scales.
As can be seen from Figure 1, firstly, the backbone network extracts the feature information of the damage; subsequently, the extracted feature information will be used as the input of the LC-BiFPN; finally, the damage is accurately localized and classified by the classification network and the regression network.

Asymmetric Convolutional Block (ACB)
In the feature extraction process, the weights at the central intersection of a conventional convolutional kernel (i.e., the kernel skeleton) have large magnitudes, while the edges provide only a small amount of feature information, which results in uneven refinement of the features. Therefore, in this paper, ACB [28] is introduced in the backbone feature extraction network to enhance the feature representation of the network. The ACB module consists of three branches, as shown in Figure 2.
Firstly, the 3 × 3 convolution captures feature information with a relatively large receptive field. Secondly, the 1 × 3 convolution and the 3 × 1 convolution form a cross-shaped receptive field to expand the depth of the network while ensuring the relevance of the skeletal features. Then, the feature information extracted from the three different convolutional layers is fused. Finally, a BN layer and a ReLU activation function are used to stabilize the values and apply a nonlinear activation, giving an enhanced feature map. The ACB module is computed as follows:

$$x_o = \mathrm{ReLU}\left(\gamma_{bn}\,\frac{\mathrm{Conv}_{3\times 3}(x_i) + \mathrm{Conv}_{1\times 3}(x_i) + \mathrm{Conv}_{3\times 1}(x_i) - \mu_{bn}}{\sigma_{bn}} + bias\right)$$

where $x_i$ and $x_o$ are the input and output of the ACB module, respectively; $\mathrm{Conv}_{k\times l}(\cdot)$ denotes the convolution branch with a $k \times l$ kernel; $\gamma_{bn}$ and $bias$ are the learnable scaling factor and bias term, respectively; $\mu_{bn}$ and $\sigma_{bn}$ are the mean and standard deviation used for normalization in the BN layer, respectively.
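The three-branch structure described above can be illustrated with a small pure-Python sketch. It is not the authors' implementation: it works on a single-channel 2D map, uses fixed scalar BN statistics, and the helper names (`conv2d_same`, `acb_forward`, `fuse_kernels`) are hypothetical. It also shows the property reported for ACB-style blocks in [28], namely that the sum of the 3 × 3, 1 × 3, and 3 × 1 branches equals a single 3 × 3 convolution whose skeleton (middle row and middle column) has been reinforced.

```python
def conv2d_same(x, k):
    """Zero-padded 'same' cross-correlation of a 2D list x with an odd-sized kernel k."""
    h, w = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            s = 0.0
            for a in range(kh):
                for b in range(kw):
                    ii, jj = i + a - ph, j + b - pw
                    if 0 <= ii < h and 0 <= jj < w:
                        s += x[ii][jj] * k[a][b]
            out[i][j] = s
    return out


def add_maps(*maps):
    """Element-wise sum of equally sized 2D maps."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*maps)]


def bn_relu(y, gamma, bias, mean, std):
    """Inference-style BN with scalar statistics (an assumption), followed by ReLU."""
    return [[max(0.0, gamma * (v - mean) / std + bias) for v in row] for row in y]


def acb_forward(x, k33, k13, k31, gamma=1.0, bias=0.0, mean=0.0, std=1.0):
    """Sum the 3x3, 1x3 and 3x1 branches, then apply BN and ReLU."""
    fused = add_maps(conv2d_same(x, k33), conv2d_same(x, k13), conv2d_same(x, k31))
    return bn_relu(fused, gamma, bias, mean, std)


def fuse_kernels(k33, k13, k31):
    """Fold the 1x3 and 3x1 kernels into the middle row and column of the 3x3 kernel."""
    fused = [row[:] for row in k33]
    for b in range(3):
        fused[1][b] += k13[0][b]
    for a in range(3):
        fused[a][1] += k31[a][0]
    return fused
```

Because convolution is linear, `bn_relu(conv2d_same(x, fuse_kernels(k33, k13, k31)), ...)` gives the same result as `acb_forward(x, k33, k13, k31, ...)` up to floating-point rounding, which is why such a block adds no inference cost once the branches are merged.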

Feature Extraction Enhancement Module (FEEM)
In the process of road damage detection, shallow features have a small receptive field and are suitable for detecting small-scale damage, while deep features have a large receptive field and are suitable for detecting large-scale damage. Meanwhile, in the feature extraction process, the area containing the damage target in the image plays a dominant role as the effective area. However, due to the obvious differences in the scale of damage targets in the images, small-scale damage is easily confused with background noise, resulting in ineffective extraction of damage target features. In addition, although the receptive field of the network can be increased by pooling operations, pooling reduces the resolution of the image, which leads to the loss of spatial detail information. To solve this problem, inspired by ACB and ASPP [29], the FEEM module is proposed in this paper, as shown in Figure 3.
In this module, firstly, a 1 × 1 convolution kernel is used for feature extraction to maximize the global features of the input feature map $f_a$. Secondly, three parallel ACB modules are used to enhance the feature extraction capability of the network. To alleviate the grid effect caused by atrous convolution, which leads to partial loss of feature information, the atrous convolutions with different atrous rates are concatenated and cross-scale skip connections are added, which further reduces information loss while extracting features at different scales. Subsequently, to enable the pixels of the feature map after the atrous convolution to cover the entire input feature map, whose resolution is 1/32 of the original image, the atrous rates are set to 1, 5, and 9, and the output of each layer is given by Equation (3), where $D_{f,d}$ denotes the atrous convolution with a kernel size of $f$ and an atrous rate of $d$, and $\mathrm{ACB}_{3,1}$ denotes the asymmetric convolution with a kernel size of 3 and an atrous rate of 1. Finally, the output results of each layer are fused, and a 1 × 1 convolution is used to reduce the number of channels and the computational effort, giving the output feature map $f_c$.
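The choice of atrous rates 1, 5, and 9 can be checked with a short calculation. The sketch below uses the standard formula for the effective extent of a dilated kernel, k + (k − 1)(d − 1), and assumes a 512 × 512 input (the default size stated in the implementation details) downsampled by 32. The function names are hypothetical.

```python
def effective_kernel_size(kernel_size, rate):
    """Spatial extent covered by one kernel of the given size and atrous rate."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def feature_map_side(input_side, stride=32):
    """Side length of the feature map at 1/stride of the input resolution."""
    return input_side // stride


side = feature_map_side(512)  # 16 for a 512 x 512 input
for rate in (1, 5, 9):
    extent = effective_kernel_size(3, rate)
    print(f"rate={rate}: extent={extent}, covers whole {side}x{side} map: {extent >= side}")
```

Under these assumptions the three branches span 3, 11, and 19 positions, so only the rate-9 branch reaches across the full 16 × 16 map, while the smaller rates keep local detail.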

A Feature Pyramid Module More Suitable for Road Surface Damage Detection
With the increasing number of layers in the CNN, the spatial detail information of the high-level features is gradually lost, which is not conducive to the localization of the damage target. In addition, the shallow spatial detail features are particularly important in the process of road damage detection.
Therefore, in order to make fuller use of the shallow spatial detail information and improve the detection accuracy of road damage, we rethink the feature pyramid module used for multi-scale feature fusion based on BiFPN and the idea of semi-dense connectivity [30] to explore a feature pyramid module that is more suitable for pavement damage detection. Does adding only a same-scale transverse skip connection improve the detection accuracy more, or does adding both same-scale transverse skip connections and longitudinal connections between different scales? Does adding only cross-scale longitudinal skip connections, or adding both cross-scale transverse and cross-scale longitudinal skip connections, bring a better performance improvement to the network? In this paper, four different structures of feature pyramid modules are designed, as shown in Figure 4. Their performance will be demonstrated in Section 3. The LC-BiFPN module shows the best performance and is the most suitable for road damage detection.
Electronics 2023, 12
In the LC-BiFPN module, firstly, top-down and bottom-up bidirectional paths are used to fuse features of different scales. Secondly, nodes that contribute less to the feature fusion network are eliminated to reduce the computational effort. Moreover, considering that input feature maps with different resolutions usually do not contribute equally to the output feature maps, adaptive weights are introduced to weight the input features in the fusion process. Finally, to make fuller use of the spatial detail information at the shallow level, a longitudinal skip connection is added to the original transverse skip connection based on the idea of semi-dense connection. Assuming that the multi-scale features extracted by the backbone network are $\{P_{3\text{-}in}, P_{4\text{-}in}, P_{5\text{-}in}, P_{6\text{-}in}, P_{7\text{-}in}\}$, the LC-BiFPN module computes its output features from these inputs, where $w_{ij}$ is the weight of the $j$-th input node in the $i$-th layer; $P'_{i\text{-}in}$ and $P''_{i\text{-}in}$ are the first and second fusion nodes in the $i$-th layer, respectively; $P_{i\text{-}out}$ is the output feature of the $i$-th layer; $\mathrm{Resize}(\cdot)$ is the upsampling or downsampling operation; and $\mathrm{Conv}(\cdot)$ is the depthwise separable convolution operation.
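The adaptive weighting of input features can be sketched as follows. This is a minimal illustration, assuming the fast normalized fusion used in the original BiFPN (non-negative weights divided by their sum plus a small epsilon); the text here does not state which normalization LC-BiFPN uses, and the exact node wiring of LC-BiFPN is not reproduced. Feature maps are represented as flat lists that have already been resized to a common shape, and `fast_normalized_fusion` is a hypothetical name.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Weighted fusion O = sum_i(w_i * I_i) / (eps + sum_j w_j), with w_i clipped to >= 0."""
    if len(features) != len(weights):
        raise ValueError("one weight is required per input feature")
    w = [max(0.0, wi) for wi in weights]  # ReLU keeps the weights non-negative
    total = sum(w) + eps                  # eps avoids division by zero
    length = len(features[0])
    return [
        sum(wi * f[k] for wi, f in zip(w, features)) / total
        for k in range(length)
    ]


# A node with a transverse input and an additional longitudinal input (illustrative values).
same_scale = [1.0, 1.0, 1.0]
resized_from_other_scale = [3.0, 3.0, 3.0]
fused = fast_normalized_fusion([same_scale, resized_from_other_scale], [1.0, 1.0])
```

In a trained network the weights would be learnable parameters, and the fused result would then pass through the depthwise separable convolution mentioned above.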

Compound Scaling
To achieve damage detection in practical application scenarios under different resource constraints, the E-EfficientDet network uses a compound scaling strategy that is similar to EfficientDet, with uniform scaling of the feature extraction network, LC-BiFPN, classification, and regression networks.
In the compound scaling strategy, separate formulas give the width and depth of LC-BiFPN, the depth of the classification and regression networks, and the input image resolution. Due to hardware resource limitations, only E-EfficientDet-D0~D2 are trained. The configuration details of E-EfficientDet-D0~D2 are shown in Table 1. The performance of E-EfficientDet-D0~D2 will be demonstrated in Section 3.
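For orientation, the scaling rules published for the original EfficientDet can be written as a small function. This is a sketch of those reference rules only: the text says E-EfficientDet follows a similar strategy, so the values in Table 1 may differ, and practical implementations round the width to hardware-friendly channel counts. The function name and dictionary keys are hypothetical.

```python
def efficientdet_scaling(phi):
    """Reference EfficientDet compound scaling for compound coefficient phi."""
    return {
        "bifpn_width": 64 * 1.35 ** phi,      # channels, before rounding
        "bifpn_depth": 3 + phi,               # number of BiFPN layers
        "head_depth": 3 + phi // 3,           # layers in class and box heads
        "input_resolution": 512 + 128 * phi,  # input side length in pixels
    }


for phi in range(3):  # D0 to D2, the range trained in this paper
    print(f"D{phi}:", efficientdet_scaling(phi))
```

With these rules, D0 to D2 use input resolutions of 512, 640, and 768 pixels and 3, 4, and 5 pyramid layers, which illustrates how one coefficient trades accuracy against model size and speed.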

Loss Function
Because the regression-based single-stage detection network does not use a region proposal network, only a very small number of anchor boxes are positive samples containing a target when anchors are used for prediction, and the rest are negative samples, which results in a serious imbalance between positive and negative samples. At the same time, the analysis reveals a serious category imbalance among road pavement damages. To address these issues, we employ a multi-task loss function that consists of a classification loss and a regression loss. It is defined as follows:

$$L = \frac{1}{N_{cls}}\sum_{i} L_{cls}(p_i, p_i^{*}) + \beta\,\frac{1}{N_{reg}}\sum_{i} p_i^{*}\,L_{reg}(f_i, f_i^{*})$$

where $p_i$ and $p_i^{*}$ are the predicted probability and the ground-truth label of the $i$-th anchor containing the object, respectively; $f_i$ and $f_i^{*}$ are the predicted and ground-truth positions of the $i$-th anchor, respectively; $N_{cls}$ is the total number of anchors; $N_{reg}$ is the total number of positive anchors; and $\beta$ is a weighting factor.
The classification loss function is defined as follows:

$$L_{cls} = -\alpha_t\,(1 - p_t)^{\gamma}\,\log(p_t)$$

where $t$ is the category of the object, $p_t$ is the predicted probability of the object, and $\alpha_t$ and $\gamma$ are weighting factors. The regression loss function is defined as follows:

$$L_{reg}(x) = \begin{cases} 0.5\,x^{2}, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$

where $x$ is the difference between the actual position of the candidate box and its predicted position.
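The two loss terms described above can be sketched per sample in plain Python, assuming the standard focal loss of [18] for classification and the smooth L1 form for regression. The default values alpha = 0.25 and gamma = 2.0 are the commonly cited focal-loss defaults and are an assumption here, since this section does not state the values used; the function names are hypothetical.

```python
import math


def focal_loss(p, target, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary focal loss for one anchor: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)  # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def smooth_l1(x):
    """Smooth L1 loss on the offset x between the true and predicted box position."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


# A well-classified positive contributes far less than a badly classified one,
# which is how the focal term reduces the influence of easy samples.
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)
```

Setting gamma to 0 reduces the focal term to an alpha-weighted cross-entropy, which makes the role of the modulating factor (1 − p_t)^gamma easy to see.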

Results
We comprehensively evaluated E-EfficientDet using the public dataset of the Global Road Damage Detection Challenge 2020.To demonstrate the validity of the E-EfficientDet, we conducted an ablation study and compared the experimental data of different models.

Datasets and Evaluation Metrics
The dataset used to train the road pavement damage detection model in this paper is derived from the Global Road Damage Detection Challenge 2020. This dataset, which was also used in [31][32][33], contains road damage images from three different countries (Japan, India, and the Czech Republic), and the types of damage are transverse cracks, longitudinal cracks, mesh cracks, and potholes. Due to the limitation of hardware resources, only the pavement damage images from Japan are selected as the dataset for the proposed detection method, which contains 10,500 images with an image size of 600 × 600 pixels. Considering that transverse and longitudinal cracks can be converted into each other by geometric transformation, they are combined in this paper and collectively called strip cracks, marked as D00; mesh cracks and potholes are marked as D20 and D40, respectively. The principle of dataset division is shown in Table 2. The distribution of the number of road pavement damage samples is shown in Figure 5. As can be seen from Figure 5, there is a serious category imbalance in the road pavement damage samples: the number of strip cracks is about 3.5 times the number of potholes, and the number of mesh cracks is about 2.7 times the number of potholes.
Precision (P), recall (R), average precision (AP), mean average precision (mAP), F1-score, frames per second (FPS), and model size are taken as evaluation metrics in this paper. The specific calculation formulas are shown in Equations (15)-(20):

$$P = \frac{TP}{TP + FP} \quad (15)$$

$$R = \frac{TP}{TP + FN} \quad (16)$$

$$AP = \int_{0}^{1} P(R)\,dR \quad (17)$$

$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i \quad (18)$$

$$F1 = \frac{2 \times P \times R}{P + R} \quad (19)$$

$$FPS = \frac{FrameNum}{ElapsedTime} \quad (20)$$

where TP is the number of samples that are damage and are identified by the model as damage; FP is the number of samples that are not damage but are identified by the model as damage; FN is the number of samples that are damage but are identified by the network as non-damage; N is the number of damage categories. FrameNum is the number of images processed, and ElapsedTime is the time spent processing them.
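The count-based metrics above can be computed directly, as in the following minimal sketch. The function names and the numbers in the usage lines are illustrative only and are not results from this paper; AP itself (the area under the P-R curve) is assumed to be computed elsewhere and passed in per class.

```python
def precision(tp, fp):
    """Fraction of predicted damage instances that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of real damage instances that are detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


def frames_per_second(frame_num, elapsed_time):
    """Number of processed images divided by the processing time in seconds."""
    return frame_num / elapsed_time


# Hypothetical counts for one class.
p = precision(tp=8, fp=2)
r = recall(tp=8, fn=8)
print(p, r, f1_score(p, r), frames_per_second(270, 10.0))
```

Reporting F1 alongside mAP is useful here because the class imbalance described above can make precision and recall diverge for the rarer damage types.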

Implementation Details
The training process was implemented in the PyTorch framework with CUDA 10.2 and run on one NVIDIA GeForce RTX 2080 GPU. During training, the initial learning rate is set to 0.0003, and the batch size is set to 8. Unless otherwise specified, the input image size is 512 × 512. To improve the convergence speed of the model, weights pre-trained on the PASCAL VOC dataset were used, following the idea of transfer learning. The Adam [34] optimizer is used, and the momentum parameter is set to 0.9. The loss curves during model training are shown in Figure 6. The loss value fluctuates within a small range once the model has been trained for 120 epochs, which indicates that the model starts to converge.

ElaspedTime
where TP represents the number of samples for which the target was originally and the model was identified as a disease.FP indicates the number of sample target was originally non-disease and the model identified as disease; FN ind number of samples that were originally identified as disease by the network a ease.FrameNum represents the number of images processed and ElaspedTim sents the time spent processing the image.

Implementation Details
The training process was built on a PyTorch framework with CUDA 10.2 a using one NVIDIV GeForce RTX 2080 GPU.In details, during the process of tra initial learning rate is set to 0.0003, and the batch size is set to 8. If not specified image size is 512 × 512.To improve the convergence speed of the model, p weights on the PASCAL VOC dataset were used based on the idea of migration The Adam [34] optimizer is used, and the momentum parameter is set to 0.9.T curves of loss value during model training are shown in Figure 6.The loss valu a certain range when the model is trained for 120 epochs, which indicates that starts to converge.


Ablation Studies
To comprehensively evaluate the performance of the model and verify the effectiveness of the proposed modules, we conducted ablation studies based on EfficientDet-D0. The experimental results are shown in Table 3.
From the second and sixth rows of Table 3, it can be seen that the mAP and F1-score of the original EfficientDet-D0 network are 50.71% and 46.00%, respectively, while those of EfficientNet-B0 combined with the LC-BiFPN module are 52.33% and 46.67%, respectively. This indicates that the LC-BiFPN module improved the detection accuracy (mAP) of the network by 1.62%, with the most significant improvement for mesh cracks (D20). This verifies that the proposed LC-BiFPN module performs well in multi-scale feature fusion and is superior to the original BiFPN. From the fourth and sixth rows of Table 3, the performance improvement brought by the LC-BiFPN module is greater than that of the FPN-B module. This indicates that, compared with cross-scale vertical skip connections, pyramid modules with horizontal skip connections of the same scale improve detection accuracy more. From the second, third, and fifth rows of Table 3, adding appropriate cross-scale vertical connections can improve the detection performance of the network, but overly complex connections can have the opposite effect. From the sixth and seventh rows of Table 3, when the baseline network EfficientNet-B0 is combined with both the LC-BiFPN module and the FEEM module, the detection accuracy reaches 53.12%. Compared with adding only the LC-BiFPN module to the baseline network, the FEEM module further improved the detection accuracy by 0.79%, with the largest gain for potholes (D40). This verifies that the FEEM module can enhance the network's ability to extract features of road damage.
The performance of the network model can be intuitively measured by Precision-Recall (P-R) curves, which are shown in Figures 7 and 8 for EfficientDet-D0 and E-EfficientDet-D0, respectively. Figures 7 and 8 show that the proposed E-EfficientDet-D0 network is better than the original EfficientDet-D0 for road damage detection.
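The stepwise gains quoted in the ablation discussion can be checked with a few lines of arithmetic. The sketch below uses only the mAP values stated in the text for three of the configurations in Table 3; the variable and function names are hypothetical, and the remaining rows of Table 3 are omitted because their values are not quoted in the text.

```python
# mAP (%) values quoted in the text for three ablation configurations.
ablation_map = [
    ("EfficientDet-D0 (EfficientNet-B0 + BiFPN)", 50.71),
    ("EfficientNet-B0 + LC-BiFPN", 52.33),
    ("EfficientNet-B0 + LC-BiFPN + FEEM (E-EfficientDet-D0)", 53.12),
]


def stepwise_gains(rows):
    """Return (configuration, mAP gain over the previous configuration)."""
    return [
        (name, round(current - previous, 2))
        for (_, previous), (name, current) in zip(rows, rows[1:])
    ]


gains = stepwise_gains(ablation_map)
total_gain = round(ablation_map[-1][1] - ablation_map[0][1], 2)

for name, gain in gains:
    print(f"{name}: +{gain:.2f} mAP")
print(f"Total gain over EfficientDet-D0: +{total_gain:.2f} mAP")
```

The two increments, 1.62 and 0.79 percentage points, sum to the overall 2.41-point improvement of E-EfficientDet-D0 over the original EfficientDet-D0.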
Since the input image resolution affects the detection accuracy of the models, the resolution of the input images is 512 × 512 unless otherwise specified, to ensure the fairness of the experiments. The performance comparison of different models is shown in Table 4. As can be seen from Table 4, the two-stage detector Faster R-CNN has certain advantages for the detection of mesh cracks (D20), but its detection of potholes (D40) is less satisfactory, resulting in a lower mAP value, and its detection speed is too slow to meet real-time detection requirements. In addition, although YOLOv3, YOLOv4, and RetinaNet have high detection accuracy, their models are too large, at 7.3 times, 7.6 times, and 4.3 times the size of the E-EfficientDet-D0 network, respectively, and are not suitable for deployment on embedded devices.
From the eighth, eleventh, twelfth, and thirteenth rows of Table 4, with little difference in model size, our proposed E-EfficientDet-D0 achieves the highest detection accuracy and F1-score compared with the lightweight networks YOLOv5s, YOLOv4-tiny, and YOLOv7-tiny, reaching 53.12% and 47.33%, respectively. Although its detection speed is not the best, it basically meets real-time detection requirements. From the ninth, tenth, and thirteenth rows of Table 4, it can be seen that the mAP and F1-score of the proposed E-EfficientDet-D0 network are better than those of MobileNetv2-YOLOv4 and MobileNetv3-YOLOv4. The performance comparison results of different object detection models are visualized in Figure 9.
From an overall perspective, the mAP value of the E-EfficientDet-D0 network is slightly lower than that of YOLOv4 and RetinaNet but higher than that of the other detection models, with the highest detection accuracy of 70.93% for mesh cracks (D20). In addition, it has significant advantages in model size, which is more conducive to mobile deployments such as UAVs and unmanned carts. Compared with the lightweight detection models, it also has higher detection accuracy while meeting real-time requirements, which is more conducive to the accurate identification of road pavement diseases. Therefore, considering detection accuracy, detection speed, and model size, our proposed E-EfficientDet-D0 network is more suitable for road pavement disease detection.
Meanwhile, to achieve damage detection in practical application scenarios under different resource constraints, we also propose other detection networks of the E-EfficientDet family based on the compound scaling method. A comparison of the performance of different networks in the E-EfficientDet series is shown in Figure 10.
As shown in Table 1 and Figure 10, the detection accuracy of the network gradually increases with increasing input resolution, depth, and width, but the computational effort also increases and the detection speed gradually decreases. The E-EfficientDet-D0 network is suitable for practical applications with relatively high requirements for real-time performance and accuracy and with limited memory resources, while the E-EfficientDet-D2 network is suitable for practical applications with more demanding accuracy requirements, low real-time requirements, and large memory resources.
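The trade-off between the variants can be expressed as a simple selection rule. The following sketch picks the most accurate variant that fits a given model-size budget, using only the mAP and model-size figures reported in this paper for E-EfficientDet-D0 and E-EfficientDet-D2. The `Variant` type and the `pick_variant` helper are hypothetical names introduced for illustration, and E-EfficientDet-D1 is omitted because its figures are not quoted in the text.

```python
from typing import NamedTuple, Optional


class Variant(NamedTuple):
    name: str
    map_percent: float  # detection accuracy (mAP, %)
    size_mb: float      # model size (MB)


# Figures reported in the paper for the two variants quoted in the text.
VARIANTS = [
    Variant("E-EfficientDet-D0", 53.12, 32.31),
    Variant("E-EfficientDet-D2", 57.51, 61.78),
]


def pick_variant(max_size_mb: float) -> Optional[Variant]:
    """Return the most accurate variant whose model size fits the budget."""
    feasible = [v for v in VARIANTS if v.size_mb <= max_size_mb]
    return max(feasible, key=lambda v: v.map_percent, default=None)


small_device = pick_variant(max_size_mb=40.0)
large_device = pick_variant(max_size_mb=100.0)
```

A fuller selection rule would also take the required detection speed into account, since speed decreases as the network is scaled up.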

Visual Analysis Results of Different Models
The inference heat maps of EfficientDet-D0 and E-EfficientDet-D0 in different scenes are shown in Figure 11. The results in Figure 11 show that, compared with EfficientDet-D0, E-EfficientDet-D0 pays more attention to diseases under shadow occlusion and to small-scale diseases. Meanwhile, the detection results of different models for strip cracks (D00), mesh cracks (D20), potholes (D40), and multi-class and multi-scale diseases are visualized in Figures 12-15, respectively.
From Figure 12, it can be seen that SSD, RetinaNet, MobileNetv2-YOLOv4, MobileNetv3-YOLOv4, and YOLOv7-tiny all have a certain degree of missed detection for strip cracks (D00).
As shown in Figure 13, for the detection of mesh cracks (D20) in the shaded environment, YOLOv3, YOLOv4, RetinaNet, MobileNetv2-YOLOv4, MobileNetv3-YOLOv4, YOLOv5s, YOLOv4-tiny, and YOLOv7-tiny all have a certain degree of missed detection, while the detection result of E-EfficientDet-D0 is relatively ideal. This is attributed to the enhanced feature extraction capability of the FEEM module for damages.
As can be seen from Figure 14, for the detection of potholes (D40) under adequate light conditions, YOLOv3, SSD, MobileNetv2-YOLOv4, MobileNetv3-YOLOv4, and YOLOv7-tiny all have a certain degree of missed detection, among which YOLOv3 and YOLOv7-tiny miss the most, and the detection effect of E-EfficientDet-D0 is better than that of models such as EfficientDet-D0, YOLOv5s, and YOLOv7-tiny.
As can be seen from Figure 15, for the detection of multi-class and multi-scale diseases in complex scenes, YOLOv3, SSD, MobileNetv2-YOLOv4, MobileNetv3-YOLOv4, and YOLOv5s all have a certain degree of missed detection. The detection effect of E-EfficientDet-D0 is better than that of YOLOv7-tiny, YOLOv4-tiny, YOLOv5s, and other models. This is because the LC-BiFPN module reuses the multi-scale features extracted from the backbone network across different layers of the network, so that the multi-scale contextual semantic information of diseases is fully fused.

Conclusions
Road pavement damage comes in many categories with significant scale differences and is susceptible to environmental factors such as uneven illumination and shadow occlusion, which leads to less-than-ideal detection results. In addition, most existing road damage detection methods cannot achieve a good balance between detection accuracy, detection speed, and model size. Therefore, an enhanced lightweight network, E-EfficientDet, is proposed in this paper. Firstly, the FEEM is designed to increase the receptive field and improve the feature expression capability of the network without reducing the image resolution, so that richer multi-scale feature information can be extracted. Secondly, a feature pyramid module that is more suitable for pavement disease detection is proposed based on the idea of semi-dense connectivity, so that the contextual semantic information at different scales can be fused more effectively.

The experimental results show that the E-EfficientDet-D0 proposed in this paper outperforms EfficientDet-D0, YOLOv5s, YOLOv7-tiny, YOLOv4-tiny, Faster R-CNN, SSD, and other models in terms of detection accuracy.The detection speed can reach 27.08 FPS,

Figure 1. The framework of the overall network.

Figure 2. The structure of ACB.
of the atrous convolution with a convolution kernel of size f and a dilation rate of d.

Figure 5. Quantity distribution of road pavement damage samples.

Figure 9. Performance comparison of different object detection models.

Figure 10. Performance comparison of different networks in the E-EfficientDet series.

Electronics 2023, 20

Table 2. Division of road pavement damage dataset.

Table 3. Results of ablation studies.

Table 4. Performance comparison of different object detection models.