Article

Research on Road Damage Detection Algorithms for Intelligent Inspection Robots

1 Shandong Key Laboratory of Technologies and Systems for Intelligent Construction Equipment, Shandong Jiaotong University, Jinan 250357, China
2 School of Information Science and Electrical Engineering, Shandong Jiaotong University, Jinan 250357, China
3 Advanced Technology Research Institute, Beijing Institute of Technology, Jinan 250300, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(14), 2762; https://doi.org/10.3390/electronics14142762
Submission received: 4 June 2025 / Revised: 5 July 2025 / Accepted: 8 July 2025 / Published: 9 July 2025

Abstract

Intelligent inspection robots are crucial tools for ensuring road safety. However, current intelligent inspection robots for road damage detection face challenges including insufficient detection accuracy and poor adaptability in complex environments. These issues directly affect the reliability of road damage detection and the effectiveness of its practical application. To address them, this study proposes a road damage detection model based on deep learning. First, the backbone network is augmented with a multi-scale convolutional attention (MSCA) mechanism, which promotes more effective feature extraction and strengthens the model’s perception and representation of features at multiple scales. Second, the traditional SPPF module is replaced with the SPPELAN module, which maintains consistent channel dimensions and facilitates improved multi-scale contextual feature extraction, thereby enhancing detection accuracy and inference efficiency under identical experimental conditions. Finally, the introduction of the WIOU loss function enhances the overall performance of the model. Experimental results on the test dataset demonstrate that the proposed road damage detection model is significantly better than the original model on multiple indicators, with mAP@0.5 increased by 0.6%, accuracy increased by 1.7%, Recall increased by 1.4%, and frames per second (FPS) increased by 25.063 frames. Compared with YOLOv7 and YOLOv9, the mAP@0.5 of this model is higher by 6.3% and 2.2%, the accuracy by 9.3% and 2.4%, and the Recall by 10.8% and 5.4%, respectively. These results indicate that the proposed model offers significant performance improvements for real-time road damage detection and holds promising application prospects.

1. Introduction

Intelligent inspection robots are automated devices used in the field of road transportation, primarily for the detection of road surface conditions, structural damage, and other potential safety hazards. They are capable of adapting to complex road environments and offer high stability and efficiency [1]. By integrating deep learning models and SLAM algorithms, intelligent inspection robots can not only achieve precise damage detection but also autonomously perform environment perception and navigation [2]. Moreover, they can dynamically adjust their working routes based on real-time environmental data, avoiding redundant inspections or missed areas and thereby improving operational efficiency. Road damage detection plays a critical role in promptly identifying and assessing road issues to ensure the normal usage of roads. By detecting common road damage, including cracks, potholes, and alligator cracking, intelligent inspection robots can provide accurate detection data and maintenance recommendations to road maintenance departments, thereby preventing more significant damage. In this work, the term intelligent inspection robot encompasses advanced autonomous wheeled platforms equipped with sensing and navigation capabilities, as demonstrated by the autonomous vehicle platform used for experimental validation.
As deep learning technologies continue to progress, their application in road damage detection is increasingly recognized as an essential approach for modern road maintenance. For instance, Zhang et al. [3] developed a method for identifying and categorizing road damage utilizing convolutional neural networks. The proposed method utilizes a low-cost video data collection strategy to extract features and classify road surface images using a convolutional neural network. In addition, the method demonstrated strong detection accuracy when evaluated on benchmark datasets. Li et al. [4] presented a neural network model called CrackYOLO, specifically designed for extracting cracks from road images. The proposed method employs an improved YOLOv5 architecture enhanced with CrackConv and ADSample modules. These additions strengthen the model’s feature extraction capabilities, enabling more effective crack detection in scenes with shadows and complex backgrounds. Additionally, the introduction of a hybrid attention mechanism (CAS) further enhances the recognition accuracy for fine cracks. Ning et al. [5] proposed a multi-class road damage detection approach that leverages forward-facing video data and applies an enhanced YOLOv7 architecture for road surface identification. The proposed strategy achieves higher detection precision and reduces model complexity by integrating distributed displacement convolution (EDC), an optimized spatial feature pyramid module (SPPCSPD), and the SimAM similarity attention mechanism. Liu et al. [6] introduced an upgraded network called YOLO-SST, which is built upon the YOLOv5 framework and is designed for the identification of diverse road damage types. This method enhances feature extraction capabilities by incorporating the Shuffle Attention mechanism and Swin-Transformer encoding blocks. Wu et al. [7] presented a new road damage detection model based on YOLOv5, which integrates a lightweight feature fusion network (CFPN) and an improved loss function to enhance detection efficiency in complex environments. Experimental results suggest that the model performs excellently under conditions such as shadows and multi-object overlap. Zhang et al. [8] introduced a deep neural network, ECSNet, specifically designed for rapid crack identification and segmentation on road surfaces. The architecture utilizes compact convolution kernels and parallel max-pooling to extract crack-related features efficiently. This approach substantially reduces the model’s parameter count and delivers high detection accuracy. Furthermore, experimental findings indicate that ECSNet simultaneously ensures high detection accuracy and computational efficiency, and demonstrates outstanding results in the task of road crack identification. Ibragimov et al. [9] developed a road damage recognition approach utilizing the Faster R-CNN (fast region-based convolutional neural network) framework. The experimental evaluation demonstrates that the method is highly effective in identifying cracks as well as localized repair markings. Xu et al. [10] introduced an approach called Epd RCNN, which focuses on detecting elongated road damage and is designed to address the challenges posed by the inefficiency of manual inspections and the limited accuracy of traditional detection techniques. The method enhances the feature extraction capability for elongated road damage by introducing a backbone network that reuses low-level features and integrates features from different stages.
Experimental results show that Epd RCNN demonstrates good detection accuracy and robustness under varying lighting conditions. Hou et al. [11] put forward a road damage detection algorithm named FS-Net, aimed at addressing the shortcomings of existing methods in terms of detection accuracy and efficiency. The proposed method enhances the ability to capture two-dimensional spatial features by introducing the FReLU structure to replace traditional activation functions and employs a strip pooling strategy to improve detection capabilities for elongated damage. Based on experimental results, FS-Net improves the average detection accuracy by 4.96% relative to Faster R-CNN and by 3.67% relative to YOLOv3. Wang et al. [12] developed a YOLOv8-attention model for detecting various types of road damage, with the objective of boosting both detection accuracy and operational efficiency. In order to strengthen feature extraction for road damage, this approach integrates a Multi-Head Self-Attention (MHSA) module alongside a Selective Kernel Attention (SKA) module. Balci et al. [13] introduced an automated system for identifying road damage, constructed upon the Faster R-CNN architecture, to improve the efficiency of road surface monitoring and maintenance tasks. Experimental results indicate that the method performs well in the classification and detection of road damage. Luo et al. [14] introduced an enhanced lightweight network, E-EfficientDet, designed to improve detection precision and efficiency for road damage, particularly under complex conditions. Their experiments demonstrate that the proposed framework achieves superior performance compared to existing neural network models on widely used road damage datasets. Ren et al. [15] introduced an enhanced YOLOv5-based approach for detecting road damage utilizing street-level imagery. The architecture employs a Generalized Feature Pyramid Network (Generalized-FPN) to facilitate feature fusion across multiple layers and scales, while adopting the Diagonal Intersection over Union Loss for improved bounding box regression. Furthermore, it incorporates a decoupled head structure, thereby improving the detection accuracy of road damage in complex multi-scale street view image environments. Experimental results indicate that the proposed method achieves an average detection accuracy of 79.8% on the test dataset. Lv et al. [16] introduced a Mask R-CNN-based approach for detecting road crack damage. Through testing and evaluation on different datasets, the model demonstrated good performance in crack damage detection. Riid et al. [17] employed deep convolutional neural networks within computer vision frameworks to automatically detect road surface damage. The dataset used for training comprised orthophotographic images collected through a mobile mapping platform. Compared to earlier research, the image data are of superior quality, with an additional manual preprocessing procedure incorporated into the workflow. Experimental findings demonstrate that the proposed approach yields strong detection performance. Ho et al. [18] proposed an automated image analysis technique utilizing the Simple Linear Iterative Clustering (SLIC) superpixel algorithm, aiming to improve both the efficiency and accuracy of road damage identification.
Combined with in-vehicle camera equipment and Wi-Fi transmission functionality, this method utilizes superpixel clustering technology to achieve automated identification and assessment of road damage types like patches, potholes, and cracks. Dadashova et al. [19] introduced a deep learning approach that utilizes crowdsourced dashcam imagery for the detection of road damage. This experiment employed dashcam images from different users to detect road damage, addressing the high data collection costs associated with traditional road detection methods. Four models are introduced in this experiment, including Single Shot MultiBox Detector (SSD) and Faster R-CNN, combined with MobileNet and Inception technologies, to address the issues of high costs and limited coverage associated with traditional detection methods. Dong et al. [20] introduced a three-phase framework for automated road damage detection and assessment, utilizing an enhanced convolutional neural network to improve detection efficiency and minimize operational expenses. This approach incorporates multi-level contextual cues extracted from a CNN-based classifier to generate discriminative super features, facilitating the rapid identification of both the existence and category of road damages. In recent years, researchers have also focused on the challenges of object detection against near-color backgrounds. For example, Gao et al. investigated the use of deep learning methods to recognize green fruits in backgrounds of similarly colored green foliage, providing valuable insights for improving detection accuracy in complex scenarios [21].
Although the aforementioned deep learning-based road damage detection algorithms have achieved promising results in several domains, they still face numerous challenges in real-world scenarios, such as variations in lighting conditions, strong reflections, wet road reflections, and shadows, as well as a lack of validation for the effectiveness of the models in practical applications. It is crucial to address these challenges. Furthermore, enhancing detection accuracy and reducing processing time have become key research directions. To address these issues and further enhance both model performance and efficiency, this study focuses on optimizing the model architecture to meet the requirements of efficient real-time detection in road inspection applications. The specific research objectives are as follows:
(a)
To overcome the challenges of limited detection accuracy and suboptimal inference speed observed in current models operating in complex environments, this study incorporates a multi-scale convolutional attention mechanism, the SPPELAN module, and the WIOU loss function. These enhancements collectively improve the model’s feature representation, minimize information loss, and refine loss function optimization, thereby ensuring both accurate and real-time road surface detection.
(b)
Existing models exhibit reduced detection performance in challenging scenarios involving reflections and shadows, adversely impacting accuracy. In response, a multi-scale convolutional attention is integrated to strengthen the model’s capacity for key feature extraction, thereby improving its ability to handle complex real-world conditions and enhancing both detection robustness and accuracy.

2. Materials and Methods

2.1. Design of Road Damage Detection Model

The YOLO network model is characterized by high real-time efficiency, high detection accuracy, multi-category target recognition capability, and excellent small object detection capability. At the same time, YOLO’s end-to-end training architecture and low computing resource requirements make it particularly suitable for deployment in embedded devices and real-time monitoring systems. Therefore, in this study, the YOLO network architecture is selected as the base model for road damage detection. YOLO primarily consists of three core components: Backbone, Neck, and Head. The backbone serves as the central module for feature extraction and adopts the CSPDarknet53 architecture, which is composed of multiple convolutional layers, batch normalization (BatchNorm), and activation functions such as SiLU. This structure enables the network to effectively capture low-level features, such as edges, textures, and shapes, from the input image, while progressively extracting higher-level semantic information, thus achieving an efficient transformation from images to multi-scale feature maps. The neck integrates feature maps of different scales produced by the backbone through an FPN (Feature Pyramid Network) and a PAN (Path Aggregation Network), thereby enhancing the model’s capability to detect objects of various sizes. Finally, the detection head outputs the category and location of each target based on the fused feature maps, achieving end-to-end object detection. To improve the adaptability and reliability of road damage detection under complex conditions, multiple enhancements have been made to the model in this study. First, a multi-scale convolutional attention (MSCA) module is incorporated into the feature extraction backbone, aiming to strengthen the model’s ability to capture targets and features at varying scales and thereby improve detection accuracy [22]. Multi-scale convolution applies convolutional filters of varying sizes to the input image, enabling the extraction of features across different scales. These multi-scale features are then fused and weighted through the attention mechanism, so that the model pays more attention to key feature areas in the final output. This approach significantly enhances the model’s detection capability for small objects and in complex environments, while also improving its robustness to targets of different scales. Second, to ensure that the image does not lose too much information during the feature extraction stage, the SPPELAN module is used to replace the traditional SPPF module [23]. The SPPELAN module avoids excessive compression of information by keeping the number of channels unchanged across input, output, and intermediate processing, which effectively retains detailed information in the image and improves information transfer during feature extraction, thereby improving the detection accuracy and robustness of the model. Finally, to further improve the overall performance of the model, balance Recall and Precision, and enhance performance in specific scenarios, the WIOU loss function is introduced to meet the requirements of road damage detection [24]. The network structure of the road damage detection model designed in this study is displayed in Figure 1.

2.2. Multi-Scale Convolutional Attention Mechanism

Accurate road damage detection necessitates the extraction of highly representative information from feature maps. To this end, a multi-scale convolutional attention (MSCA) mechanism is incorporated into the feature extraction network in this study [25]. The core idea of MSCA is to capture multi-scale information of local features in an image through convolution kernels of varying sizes, while utilizing an attention mechanism to automatically adjust the importance of features at different scales, thereby enhancing the representation of key features. Unlike traditional multi-head self-attention modules, MSCA does not employ a self-attention mechanism. Instead, it uses a purely convolutional multi-scale attention module, as shown in Figure 2. This module comprises three components: depthwise separable convolution, multi-branch depthwise separable strip convolution, and 1 × 1 convolution. These components are responsible for aggregating local information, capturing multi-scale contextual features, and modeling inter-channel relationships, respectively. The output of the 1 × 1 convolution acts as attention weights, which are subsequently used to reweight the input features within the MSCA module [26]. The output of the MSCA is given by the following equations:
$$\mathrm{Att} = \mathrm{Conv}_{1 \times 1}\left(\sum_{i=0}^{3} \mathrm{Scale}_i\big(\mathrm{DWConv}(F)\big)\right) \tag{1}$$
$$\mathrm{Out} = \mathrm{Att} \odot F \tag{2}$$
where $F$ represents the input features, $\mathrm{DWConv}$ denotes depthwise separable convolution, $\mathrm{Att}$ refers to the attention map, $\mathrm{Out}$ is the output, and $\odot$ denotes element-wise matrix multiplication. $\mathrm{Scale}_i$ ($i \in \{0, 1, 2, 3\}$) represents the $i$-th branch, where branch 0 is the identity (direct input), while each of the other branches uses two depthwise separable strip convolutions, with kernel sizes set to 7, 11, and 21, respectively. Specifically, a $k \times k$ convolution is decomposed into a $1 \times k$ convolution followed by a $k \times 1$ convolution, which simulates the effect of a large $k \times k$ convolution kernel at a much lower computational cost.
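The following PyTorch snippet is a minimal sketch of the MSCA block defined by Equations (1) and (2), following the SegNeXt design cited in [22]; the 5 × 5 kernel of the initial depthwise convolution and all class and variable names are our assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class MSCA(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Depthwise convolution that aggregates local information.
        self.dw = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        # Three strip-convolution branches: each k x k kernel is decomposed
        # into a 1 x k followed by a k x 1 depthwise conv (k = 7, 11, 21).
        self.branches = nn.ModuleList()
        for k in (7, 11, 21):
            self.branches.append(nn.Sequential(
                nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels),
                nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels),
            ))
        # 1 x 1 convolution models inter-channel relationships.
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u = self.dw(x)
        # Branch 0 is the identity path; the others are strip convolutions.
        att = u + sum(branch(u) for branch in self.branches)
        att = self.pw(att)   # Eq. (1): attention weights
        return att * x       # Eq. (2): element-wise reweighting of the input

if __name__ == "__main__":
    x = torch.randn(1, 64, 80, 80)
    print(MSCA(64)(x).shape)  # torch.Size([1, 64, 80, 80])
```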

2.3. Pyramid Pooling Structure

To facilitate more effective extraction of multi-scale contextual information, the SPPELAN module is employed in place of the SPPF module. As an enhanced variant of the traditional Spatial Pyramid Pooling (SPP) structure [27], SPPELAN incorporates several components, including a CBS module and three sequentially connected MaxPool2d layers. The outputs of the pooling layers are fused in a Concat layer, followed by a final CBS module, thereby constituting the complete SPPELAN module. The architecture of this network component is depicted in Figure 3. Relative to SPPF [28], the main difference in the SPPELAN module lies in the handling of the number of channels. The SPPELAN module maintains the same number of channels for input, output, and intermediate processing, without increasing or lowering the channel count. This ensures that important feature information is neither lost nor compressed during the module’s processing, which is of particular importance when handling high-dimensional features, as reducing the number of intermediate channels may cause some vital detailed information to be overlooked or lost. The multiple pooling operations and convolutional layers within the SPPELAN module enable multi-scale feature extraction while maintaining the same number of channels, so that excessive information is not lost during the feature extraction phase and the overall stability of the network is preserved. Maintaining a consistent number of channels helps the model retain more details across different scales, which is particularly beneficial for road damage detection tasks requiring multi-scale information. Furthermore, by maintaining a consistent number of channels, the SPPELAN module can sustain a high feature extraction capability under limited computational resources. In some cases, reducing the number of channels may result in excessive feature compression, particularly when coping with complex features, which may prevent the model from fully representing the information in the input data. With the channel count unchanged, the module ensures that the information in the feature maps is completely retained and effectively processed.
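Below is a minimal PyTorch sketch of the SPPELAN structure just described: a leading CBS block, three cascaded MaxPool2d layers, concatenation, and a closing CBS that restores the original channel width. The 5 × 5 pooling kernel and all names are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv + BatchNorm + SiLU, the basic YOLO building block."""
    def __init__(self, c_in: int, c_out: int, k: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPPELAN(nn.Module):
    def __init__(self, channels: int, pool_k: int = 5):
        super().__init__()
        # The per-branch channel width stays equal to the input width,
        # as the text emphasizes (no intermediate channel reduction).
        self.cv1 = CBS(channels, channels)
        self.pool = nn.MaxPool2d(pool_k, stride=1, padding=pool_k // 2)
        # Concat of the CBS output and three cascaded poolings -> 4x channels,
        # projected back to the original width by the closing CBS.
        self.cv2 = CBS(4 * channels, channels)

    def forward(self, x):
        y = [self.cv1(x)]
        for _ in range(3):          # three sequentially connected MaxPool2d layers
            y.append(self.pool(y[-1]))
        return self.cv2(torch.cat(y, dim=1))

if __name__ == "__main__":
    x = torch.randn(1, 256, 20, 20)
    print(SPPELAN(256)(x).shape)  # torch.Size([1, 256, 20, 20])
```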

2.4. WIOU Loss Function

To further boost model performance and improve its sensitivity to minority classes, the WIOU loss function is adopted in place of the conventional loss function. The standard Intersection over Union metric only evaluates the overlap between predicted and ground truth boxes, which can lead to biased evaluation results; WIOU additionally takes regional discrepancies into consideration. WIOU introduces a dynamic adjustment mechanism that adaptively calculates weights based on the differences between the predicted and ground truth boxes, enabling the model to show more robust performance under different error distributions. The basic principle of WIOU is to build upon the traditional IOU by introducing a weighting function to adjust the impact of the error between the predicted and ground truth boxes. This helps overcome the shortcomings of traditional IOU when dealing with different target sizes, positional deviations, and shape differences, making the impact of errors more reasonable. WIOU is defined in Equations (3)–(8):
$$L_{WIoUv3} = r \cdot L_{WIoUv1} \tag{3}$$
$$L_{WIoUv1} = R_{WIoU} \cdot L_{IoU} \tag{4}$$
$$L_{IoU} = 1 - IoU \tag{5}$$
$$R_{WIoU} = \exp\left(\frac{(x - x_{gt})^2 + (y - y_{gt})^2}{\left(W_g^2 + H_g^2\right)^{*}}\right) \tag{6}$$
$$r = \frac{\beta}{\delta \alpha^{\beta - \delta}} \tag{7}$$
$$\beta = \frac{L_{IoU}^{*}}{\overline{L_{IoU}}} \tag{8}$$
where $r$ denotes the gradient gain, defined in Equation (7); $\alpha$ and $\delta$ are two hyperparameters that control the coefficient $r$; and $\beta$ is the outlier degree describing the quality of an anchor box, defined in Equation (8). $L_{WIoUv1}$ is the WIoU v1 loss function, defined in Equation (4). $L_{IoU}$ is the standard Intersection over Union loss, defined in Equation (5), which measures the overlap between the predicted and ground truth boxes; $L_{IoU}^{*}$ is its value detached from the computation graph, and $\overline{L_{IoU}}$ is its running mean. For anchor boxes of moderate quality, $R_{WIoU}$, defined in Equation (6), significantly magnifies their $L_{IoU}$. $(x, y)$ and $(x_{gt}, y_{gt})$ denote the center coordinates of the predicted and ground truth boxes, respectively, while $W_g$ and $H_g$ are the width and height of the minimum enclosing box. To prevent $R_{WIoU}$ from producing gradients that hinder convergence during training, $W_g$ and $H_g$ are also detached from the computation graph (detachment is denoted by the superscript *).
Firstly, WIOU introduces a dynamic non-monotonic focusing mechanism, which mainly targets medium-quality samples. This makes the model more sensitive to these samples without excessively concentrating on extremely high-quality or low-quality samples [29]. This mechanism is beneficial for preventing the model from overfitting or ignoring vital samples during training, thereby enhancing its generalization capability. Secondly, a gradient gain allocation strategy is involved in the loss function to adjust the gradient gain differently for anchor boxes of varying quality. For low-quality anchor boxes, WIOU assigns a smaller gradient gain to avoid harmful gradient impacts from these low-quality samples on model training, consequently preventing negative effects on model performance. In terms of medium-quality anchor boxes, WIOU increases the gradient gain, helping the model better optimize the predicted bounding boxes. Based on the aforementioned improvements, WIOU not only enhances the overall performance and stability of the model but also indirectly balances the relationship between Precision and Recall.
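To make Equations (3)–(8) concrete, here is a self-contained sketch of a WIoU v3 computation for axis-aligned boxes in (x1, y1, x2, y2) format. The α and δ defaults follow the Wise-IoU paper [24], and all function and variable names are our own, not a reference implementation.

```python
import torch

def wiou_v3_loss(pred: torch.Tensor, target: torch.Tensor,
                 alpha: float = 1.9, delta: float = 3.0,
                 eps: float = 1e-7) -> torch.Tensor:
    """pred, target: (N, 4) boxes in (x1, y1, x2, y2) format."""
    # Plain IoU and L_IoU (Eq. (5)).
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    l_iou = 1.0 - iou

    # Centre distance over the minimum enclosing box (Eq. (6));
    # W_g and H_g are detached so they do not hinder convergence.
    cxp, cyp = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cxt, cyt = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    wg = (torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])).detach()
    hg = (torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])).detach()
    r_wiou = torch.exp(((cxp - cxt) ** 2 + (cyp - cyt) ** 2) / (wg ** 2 + hg ** 2 + eps))

    l_v1 = r_wiou * l_iou                                  # Eq. (4)

    # Outlier degree beta (Eq. (8)): detached L_IoU over its mean.
    beta = l_iou.detach() / (l_iou.mean().detach() + eps)
    r = beta / (delta * alpha ** (beta - delta))           # Eq. (7)
    return (r * l_v1).mean()                               # Eq. (3)

if __name__ == "__main__":
    p = torch.tensor([[0.0, 0.0, 10.0, 10.0]])
    t = torch.tensor([[1.0, 1.0, 11.0, 11.0]])
    print(wiou_v3_loss(p, t))
```

In practice, $\overline{L_{IoU}}$ is maintained as a running mean across training batches; the sketch simplifies it to a per-batch mean.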

3. Experimental Method

3.1. Experimental Setup and Implementation

Using the Ubuntu 18.04 operating system, this experiment was conducted on a high-performance computing platform. Python 3.8 and the PyTorch 1.8.1 deep learning framework were used for model development. The hardware configuration includes an NVIDIA RTX 2080 Ti GPU (11 GB VRAM; NVIDIA Corporation, Santa Clara, CA, USA), a 12-core Intel Xeon Platinum 8255C CPU (Intel Corporation, Santa Clara, CA, USA), and 40 GB of RAM, running CUDA version 11.1. This computing platform provides robust support for the efficient training of deep learning models and the processing of large-scale datasets. The experiment is based on the publicly available RDD2020 (Road Damage Detection 2020) dataset [30]. To strengthen the model’s adaptability to various types of road damage and variations in complex environments, data augmentation techniques were applied to the original images, including mirror transformations and brightness adjustments. After augmentation, the dataset comprised a total of 34,214 images, encompassing four prevalent categories of road damage: Longitudinal Cracks, Transverse Cracks, Alligator Cracks, and Potholes, which were assigned class labels 0, 1, 2, and 3, respectively. Detailed distributions for each damage type are summarized in Table 1. The data were partitioned into training, validation, and test sets in an 8:1:1 ratio. The initial learning rate was configured as 0.01 and decayed progressively by a final factor of 0.01 to facilitate better model convergence. Model training was performed over 200 epochs using the SGD optimizer, with momentum and weight decay employed to regulate the speed and stability of convergence. The batch size was set to 16, meaning that 16 images were processed per iteration to ensure steady model training on a large dataset. Additionally, three warmup epochs were introduced, during which the learning rate was gradually increased to avoid instability caused by a high initial learning rate. To prevent overfitting, an early stopping strategy was employed, halting training when model performance ceased to improve, thereby enhancing generalization capability.
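The hyperparameters above can be summarized in a YOLOv5-style configuration dictionary; the momentum, weight decay, and early stopping patience values below are not stated in the text, so they are shown at common YOLOv5 defaults purely as assumptions.

```python
# A hedged sketch of the training setup described in Section 3.1.
# Values marked "assumed" are not given in the paper.
train_cfg = {
    "epochs": 200,
    "batch_size": 16,
    "optimizer": "SGD",
    "lr0": 0.01,             # initial learning rate
    "lrf": 0.01,             # final LR fraction: last-epoch LR = lr0 * lrf
    "momentum": 0.937,       # assumed (YOLOv5 default)
    "weight_decay": 0.0005,  # assumed (YOLOv5 default)
    "warmup_epochs": 3,      # gradual LR ramp-up at the start of training
    "patience": 50,          # early stopping window, assumed default
    "split": (0.8, 0.1, 0.1),             # train/val/test ratio
    "augment": ["mirror", "brightness"],  # augmentations applied to raw images
}
```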

3.2. Evaluation Metrics

Precision, Recall, F1-Score, mean Average Precision (mAP@0.5), and frames per second (FPS) were used as metrics in this study to evaluate the model’s performance. Precision describes the proportion of correctly predicted positive samples among all samples predicted as positive. It is defined in Equation (9):
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{9}$$
where $TP$ represents the number of samples correctly predicted as positive by the model, and $FP$ indicates the number of samples incorrectly predicted as positive. Recall measures the proportion of actual positive samples that are correctly predicted as positive by the model. It is expressed in Equation (10):
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{10}$$
where FN denotes the number of samples that were incorrectly predicted as negative by the model. The F1-Score is the harmonic mean of Precision and Recall, applied to comprehensively evaluate the overall performance of the model in detection tasks. The F1-Score effectively balances the trade-off between Precision and Recall, with values ranging from 0 to 1, where values closer to 1 indicate better model performance. It is expressed in Equation (11):
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{11}$$
The mean Average Precision (mAP@0.5) is the mean of the Average Precision (AP) over all classes, calculated with the Intersection over Union (IoU) threshold set to 0.5. AP is defined in Equation (12). mAP@0.5 is a vital metric for evaluating the overall performance of a model across different object detection tasks, integrating the relationship between Precision and Recall; a higher mAP@0.5 indicates more balanced detection performance across classes. It is defined in Equation (13):
$$AP = \int_{0}^{1} P(R)\, dR \tag{12}$$
$$\mathrm{mAP@0.5} = \frac{1}{n} \sum_{i=1}^{n} AP_i \tag{13}$$
where $P(R)$ represents the Precision at a given Recall $R$, $n$ is the number of classes, and $AP_i$ refers to the Average Precision of the $i$-th class.
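As a small worked illustration of Equations (9)–(13), the snippet below computes the counting-based metrics and averages per-class AP values; the numbers are illustrative, not results from this paper.

```python
# Illustrative computation of Precision, Recall, F1, and mAP (Eqs. (9)-(13)).
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def f1_score(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

def mean_ap(ap_per_class):
    # mAP@0.5 is simply the mean of per-class AP at an IoU threshold of 0.5.
    return sum(ap_per_class) / len(ap_per_class)

p, r = precision(tp=97, fp=3), recall(tp=97, fn=3)  # both 0.97 here
print(round(f1_score(p, r), 3))                     # 0.97
print(mean_ap([0.994, 0.982, 0.994, 0.990]))        # 0.99 for four classes
```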
Frames per second (FPS) is utilized to assess the prediction efficiency of the model. This metric quantifies how many images can be processed by the model per second. A higher FPS value signifies enhanced computational efficiency and improved suitability for real-time applications. Therefore, FPS serves as a key indicator of the model’s processing speed and real-time performance in practical scenarios.
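FPS can be estimated by timing repeated forward passes, as in the sketch below, which assumes a PyTorch model and a 640 × 640 single-image input, with warm-up iterations and CUDA synchronization so the timing reflects actual compute. The input shape and iteration counts are assumptions, not values from the paper.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model: torch.nn.Module, device: str = "cuda",
                n_warmup: int = 20, n_iters: int = 200) -> float:
    """Average single-image inference throughput in frames per second."""
    model.eval().to(device)
    x = torch.randn(1, 3, 640, 640, device=device)  # assumed input size
    for _ in range(n_warmup):                       # warm-up: cudnn tuning, caches
        model(x)
    if device.startswith("cuda"):
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(n_iters):
        model(x)
    if device.startswith("cuda"):
        torch.cuda.synchronize()
    return n_iters / (time.perf_counter() - t0)
```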

4. Results and Analysis

4.1. Experimental Results

The publicly accessible RDD2020 dataset was utilized for both training and testing to assess the effectiveness of the proposed model in road damage detection. Performance was evaluated using metrics such as Precision, Recall, and mean Average Precision (mAP@0.5). The model’s robustness and generalizability were further examined via five-fold cross-validation. Experimental results are presented in Table 2. (All intermediate results involved in the calculation process are rounded to three decimal places.) The results show that the model’s Precision and Recall in each fold exceeded 97%, with an average Precision of 0.977 and an average Recall of 0.970. Furthermore, the model’s performance on mAP@0.5 is also outstanding, with an average value of 0.990, demonstrating the model’s excellent accuracy and consistency in detecting different types of road damage.
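The five-fold protocol can be sketched with scikit-learn's KFold over image indices, as below; the random seed is arbitrary and the training and evaluation calls are elided, so this only illustrates how the folds would be formed.

```python
# A sketch of five-fold cross-validation over the dataset's image indices.
import numpy as np
from sklearn.model_selection import KFold

image_ids = np.arange(34214)  # dataset size after augmentation (Section 3.1)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(image_ids)):
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val images")
    # Train on train_idx, then record Precision, Recall, mAP@0.5, and FPS
    # for this fold; Table 2 reports the per-fold values and their averages.
```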
The model’s processing speed (FPS) is particularly noteworthy. The average processing speed across five validations was 268.847 frames per second, with a peak of 270.270 frames per second. Moreover, this result demonstrates that the optimized algorithm not only shows a significant improvement in accuracy but also possesses high real-time processing capability, satisfying the requirements of intelligent inspection robots for real-time road damage detection. As a result, the model holds high potential for practical application and widespread adoption, particularly in guaranteeing both improved detection accuracy and efficiency while maintaining a low false positive and false negative rate.
In this experiment, the introduction of the MSCA mechanism, SPPELAN module, and WIOU loss function significantly enhanced the model’s detection accuracy and processing speed. The representative P-R curve is displayed in Figure 4. As displayed in the figure, the model demonstrated excellent performance in detecting various types of road damage, especially in the detection of Longitudinal Cracks and Alligator Cracks. The mAP@0.5 for both categories reached 0.994, implying that the model exhibits extremely high accuracy in detecting these two types of damage. Regarding pothole detection, the model’s mAP@0.5 also reached 0.990, revealing good performance. Nevertheless, for transverse crack detection, the model’s mAP@0.5 is relatively lower. Although this result is slightly inferior when compared with other types, it still reflects a high level of detection effectiveness, indicating that there is room for further optimization in the detection of Transverse Cracks. In general, the model achieved an mAP@0.5 of 0.990, indicating that it possesses extremely high accuracy and robustness in detecting various types of road damage. Moreover, this makes the model suitable for real-time road damage detection tasks in intelligent inspection robots.
Figure 5 and Figure 6 exhibit the model’s actual performance in detecting road damage under different lighting conditions. Experimental findings indicate that the model maintains robust performance and strong generalization across diverse lighting conditions. Whether under extremely bright sunlight or in low-light environments, the model is able to consistently and accurately identify different types of road damage. This characteristic is of particular importance because, in practical applications, the lighting conditions on road surfaces often vary significantly, and drastic changes in lighting can pose challenges to detection performance. Results indicate that the model performs reliably under diverse illumination conditions, which substantiates its capability to adapt to complex real-world environments. Furthermore, this ensures that the model can provide continuous and reliable support for intelligent inspection robots, therefore safeguarding efficient real-time detection of road damage.
In addition, strong light reflection is one of the common visual challenges confronted by intelligent inspection robots in real-world environments, particularly on wet and slippery road surfaces after rain. These complex conditions can have a significant impact on the robot’s visual system, thus increasing the difficulty of road damage detection. To bolster the model’s performance under varied environmental conditions, the model was trained to handle conditions including strong light reflection and slippery road surfaces. This allows the intelligent inspection robot to maintain a high detection accuracy even under diverse and challenging conditions. Figure 7 and Figure 8 display the model’s detection performance under strong light reflection and slippery road surface conditions. According to the experimental findings, the enhanced model demonstrates notable improvements in robustness to varying lighting conditions. The model can still effectively perform inspection tasks even under extreme lighting conditions. These results underscore the model’s robustness in real-world applications, offering substantial support for intelligent inspection robots and ensuring effective road damage detection in complex scenarios.
Shadow occlusion is another significant visual challenge faced by intelligent inspection robots in real-world applications, especially in environments with uneven lighting, including early mornings and evenings. Shadows cast by buildings, trees, streetlights, and other vehicles on the road surface can make it challenging for the inspection robot to clearly capture road surface details within the shaded areas, influencing overall detection accuracy. To address the existing issue, numerous scene images containing shadow occlusion were introduced into the training dataset to enhance the model’s adaptability to the effects of shadows. With this approach, the model effectively acquires the ability to differentiate shadows from true road damage, resulting in a notable reduction in false alarms. The detection results shown in Figure 9 and Figure 10 attest to the model’s reliable performance under shadowed conditions. These outcomes indicate that the model preserves a high level of accuracy, even when shadows are present in the scene.
The improved model’s excellent performance in complex environments further demonstrates its adaptability, providing stable and reliable support for intelligent inspection robots. Additionally, as shown in Table 2, the model’s stability and consistency were thoroughly validated through five-fold cross-validation. The improved road damage detection model showed excellent performance across multiple metrics, including Precision, Recall, and mean Average Precision (mAP@0.5), with all average values remaining at high levels. Furthermore, this not only proves the model’s efficiency and robustness under different experimental conditions but also showcases its broad applicability and strong generalization ability in various road damage detection tasks.
In conclusion, the experimental results show that the improved model demonstrates outstanding detection performance in road damage detection tasks, especially its high adaptability and robustness in complex environments. This makes it an ideal choice for real-time detection by intelligent inspection robots.

4.2. Ablation Experiment

Ablation studies were performed on the enhanced model to assess the contribution of individual improvement modules to overall performance. These experiments involved incorporating the MSCA, substituting the SPPF module with the SPPELAN module, and adopting the WIOU loss function. Table 3 shows the experimental results. (All intermediate results involved in the calculation process are rounded to three decimal places.) After adding the MSCA to the backbone network, the model showed significant improvements across several key performance metrics, including Precision, Recall, and mAP@0.5. Based on the previous improvements, the original SPPF module was further replaced with the SPPELAN module. The results revealed an increase in Precision from 96.9% to 97.7%, Recall from 95.9% to 96.8%, and the F1-Score from 96% to 97%. Additionally, the model’s processing speed was significantly enhanced, being able to process 263.158 frames per second. Finally, by introducing the WIOU loss function into the improved model, both Precision and Recall exceeded 97%, and mAP@0.5 was further elevated to 99%. Obviously, this significantly enhanced the model’s stability and robustness. The comprehensive results demonstrate that the implemented enhancements substantially elevated the model’s performance, yielding notable gains in both processing speed and detection accuracy.

4.3. Comparative Experiment

To further assess the effectiveness of the improved model, this study compares it with common single-stage object detection methods, including YOLOv7, YOLOv9, and YOLOv5. The results of the experiments are summarized in Table 4. The comparison results indicate that, compared with YOLOv5, the improved network shows significant improvements in Precision, Recall, and mean Average Precision (mAP@0.5). In addition, it also outperforms YOLOv5 in terms of frame rate (FPS) and F1-Score.
Additionally, when compared with YOLOv7, the improved network demonstrates a clear advantage in all key performance metrics. Although YOLOv9 shows some improvements in Precision, the improved network still exhibits a stronger overall performance. Moreover, the improved model significantly enhances inference speed relative to the original model, with a more noticeable improvement in inference efficiency when compared with other network models. It should be noted that the ‘Ours’ model integrates both the proposed architectural enhancements and the WIOU loss function, whereas the baseline YOLO models are used in their standard form with default loss functions. Therefore, the comparison reflects the overall effectiveness of our fully optimized system relative to the default implementations of baseline models.
Figure 11 shows the Precision–Recall (P-R) curves for YOLOv7, YOLOv9, YOLOv5, and the improved network model. By comparing the performance of these network models, the P-R curves indicate that the improved model achieves the best performance. Specifically, the improved model demonstrates a significant improvement in both Precision and Recall across various types of road damage (including Longitudinal Cracks, Transverse Cracks, Alligator Cracks, and potholes). Clearly, in the more challenging detection tasks, such as Transverse Cracks and potholes, the mAP@0.5 for the improved model can reach 0.982 and 0.990, respectively, far outperforming the other models.
In contrast, YOLOv7 shows relatively weaker overall performance, especially in the detection of Transverse Cracks and potholes, where it struggles due to insufficient feature extraction. Although YOLOv9 exhibits some improvement in detection accuracy compared with YOLOv7, it still fails to surpass YOLOv5 in more challenging categories. In addition, the original YOLOv5 model demonstrates relatively stable performance across different types of defects, which can maintain a balance between accuracy and reliability in detection tasks.
In summary, the improved network model demonstrates significant advantages across multiple key performance metrics, particularly in core indicators of object detection quality, including Precision, Recall, and mAP@0.5. When compared with other mainstream network models, it shows clear improvements. Additionally, the optimization in inference speed makes this model more competitive in real-time application scenarios, effectively satisfying the real-time detection needs of intelligent inspection robots.

4.4. Model Validation

To assess the practical applicability of the model, evaluations were performed on roads exhibiting four distinct types of damage. Initially, a range of advanced sensors was employed to guarantee the accuracy and completeness of the collected data. Specifically, the Daheng industrial camera was responsible for high-precision image capture, which could clearly document the details of the road damage. Through image processing techniques, it accurately identified and classified different types of road damage. Meanwhile, the integrated navigation system utilized the npos220 sensor to provide real-time and accurate location information for the damages, offering efficient spatial positioning support for road damage detection.
Additionally, to achieve real-time detection of road damage, tests were performed on the autonomous driving platform developed by the Advanced Technology Research Institute of Beijing Institute of Technology. All components are integrated into a wheeled vehicle platform. The onboard computing unit enables real-time image inference and defect detection, allowing the system to efficiently perform deep learning-based road surface defect detection during motion. Figure 12 shows that this platform provided robust technical support for real-time data collection.
In addition, this study designed a comprehensive road damage detection system, which mainly includes two key functions: real-time detection and data query. The real-time detection function is capable of promptly capturing and accurately classifying various types of damage that appear on the road surface. The data query function allows users to review and analyze historical damage data, providing reliable support for subsequent repair and maintenance decisions. The data query interface is displayed in Figure 13. Moreover, this system will offer a solid technological foundation for future road surface monitoring and damage early warning.
Comprehensive testing of the model in a real-world road environment was conducted. The experimental results indicate that the model efficiently recognizes multiple types of road damage, with both detection accuracy and response speed meeting the expected requirements. By taking a vehicle speed of 60 km per hour as an example, Figure 14 illustrates the model’s real-time detection capabilities for Transverse Cracks across various lighting conditions. The test results demonstrate the feasibility and efficiency of the system in practical applications, providing technical support and valuable experience for the large-scale deployment of road damage detection technology.
Figure 15 shows the detection results of Longitudinal Cracks under different lighting conditions. The experimental results demonstrate that regardless of changes in lighting intensity, the system can accurately identify Longitudinal Cracks and precisely mark their locations, further proving the robustness and environmental adaptability of the system.
Figure 16 shows the detection outcomes for Alligator Cracks across a range of lighting conditions. According to experimental results, regardless of changes in lighting conditions, the system is able to accurately identify the Alligator Cracks and clearly mark their distribution area and location.
Figure 17 shows the detection results of potholes under different lighting conditions. The experimental results meet the expected outcomes.
In summary, by performing comprehensive testing under various road conditions and lighting scenarios, the road damage detection model has exhibited exceptional performance and adaptability. The experimental results validate the model’s feasibility and effectiveness in real-world applications, providing vital technical support and practical experience for its future large-scale deployment. This achievement plays a significant role in advancing the development of automated road damage detection technology.

5. Discussion

Road damage detection is vital for intelligent inspection robots as it ensures the stable operation of road systems. However, several challenges still exist in performing road damage detection. Firstly, images captured from the road may be impacted by strong light reflections, shadows, or low-light environments, making it difficult to discern road features and thereby impacting the accuracy of damage detection. Secondly, the road environment is complex, with vehicles, signs, and other objects on the road potentially interfering with damage detection. In addition, different types of road damage exhibit varying characteristics, which can further increase the difficulty of detection.
In response to these challenges, improvements are made to the model’s backbone through the integration of an MSCA, the SPPELAN module, and the WIOU loss function, thereby boosting overall model performance. Although the improvement in mAP@0.5 from 0.984 to 0.990 appears relatively small in absolute terms, it is significant within the high-performance range. In safety-critical and accuracy-demanding applications such as road damage detection, even marginal enhancements can lead to meaningful reductions in false positives or missed detections of cracks, potholes, and other surface defects. These incremental improvements contribute to the robustness of the system and enhance its reliability in real-world deployments.
While the model demonstrates strong performance in both detection accuracy and processing efficiency, it still has certain limitations and needs further optimization to strengthen its practicality and robustness. First, there is still room to optimize the model’s computational requirements, especially for deployment in embedded or edge computing environments, where computational complexity and energy consumption need to be lowered. A lightweight network could be employed to increase inference speed and reduce storage usage, making the model more suitable for practical application scenarios, including vehicle-mounted equipment and robot inspections. Second, different categories of road damage (such as cracks and potholes) in the dataset may suffer from uneven sample distribution, affecting the balanced detection ability of the model. Therefore, data augmentation, undersampling/oversampling, and generative adversarial network (GAN) synthetic data can be applied to make the model more stable in detecting various types of damage [31]. In the future, combined with real-time detection needs, methods based on multimodal fusion (such as combining lidar and image data) can be explored to improve detection performance in severe weather or low-light conditions, enhancing the practicality of intelligent inspection robots and providing more efficient and accurate technical support for road safety maintenance. Recent advances in micro-target recognition provide valuable insights for addressing the challenges of road damage detection. For example, Gao et al. [32] investigated the detection of tiny apple leaf diseases and proposed a lightweight multi-branch convolutional module and attention mechanisms to improve the recognition of small and inconspicuous targets. These approaches effectively enhance detection performance for objects with low contrast or small size against complex backgrounds. Drawing on these findings, further exploration of lightweight architectures and attention-based modules in road damage detection can help improve the system’s robustness, especially for subtle or small-scale road defects that are easily overlooked under real-world conditions.

6. Conclusions

To conclude, this study proposes a deep learning-based road damage detection model based on the YOLO framework, applied to intelligent inspection robots. By incorporating the MSCA module into the backbone network, the model achieves enhanced feature extraction capabilities without a substantial increase in computational overhead. Additionally, the SPPELAN module is employed to replace the traditional SPPF module, allowing for better extraction of multi-scale contextual information. This enhances the model’s detection capability when addressing complex road damages, significantly improving both detection accuracy and processing speed. For additional performance gains, the study replaces standard loss functions with the WIOU loss function. In comparison with the baseline model, the improved model achieves a 1.7% increase in accuracy, a 1.4% increase in Recall, and a 0.6% improvement in mAP@0.5, while the FPS improved from 238.095 to 263.158, indicating a gain of 25.063 frames, which corresponds to an approximate 10.53% improvement over the baseline. The model’s parameter count only increased slightly. Experimental results suggest that the model proposed in this study outperforms other existing YOLO series object detection algorithms in road damage detection tasks, demonstrating higher detection accuracy and faster processing speed. It exhibits significant potential for application in intelligent inspection robots.

Author Contributions

Conceptualization, H.T. and F.Z.; methodology, H.T.; software, H.T., H.C., J.Z., S.S. and D.Y.; validation, H.T.; formal analysis, H.T., H.C., J.Z., S.S. and D.Y.; investigation, H.T.; resources, F.Z.; data curation, H.T.; writing—original draft preparation, H.T.; writing—review and editing, F.Z.; visualization, H.T.; supervision, H.T.; project administration, F.Z.; funding acquisition, F.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Shandong Provincial Department of Transportation Science and Technology Plan Project (2023B78-06).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, Y. Constructing the intelligent expressway traffic monitoring system using the internet of things and inspection robot. J. Supercomput. 2024, 80, 8742–8766. [Google Scholar] [CrossRef]
  2. Wang, X.; Ma, X.; Li, Z. Research on SLAM and path planning method of inspection robot in complex scenarios. Electronics 2023, 12, 2178. [Google Scholar] [CrossRef]
  3. Zhang, C.; Nateghinia, E.; Miranda-Moreno, L.F.; Sun, L. Pavement distress detection using convolutional neural network (CNN): A case study in Montreal, Canada. Int. J. Transp. Sci. Technol. 2022, 11, 298–309. [Google Scholar] [CrossRef]
  4. Li, L.; Sun, S.; Song, W.; Zhang, J.; Teng, Q. CrackYOLO: Rural pavement distress detection model with complex scenarios. Electronics 2024, 13, 312. [Google Scholar] [CrossRef]
  5. Ning, Z.; Wang, H.; Li, S.; Xu, Z. YOLOv7-RDD: A Lightweight Efficient Pavement Distress Detection Model. IEEE Trans. Intell. Transp. Syst. 2024, 25, 6994–7003. [Google Scholar] [CrossRef]
  6. Liu, Y.; Liu, F.; Liu, W.; Huang, Y. Pavement distress detection using street view images captured via action camera. IEEE Trans. Intell. Transp. Syst. 2023, 25, 738–747. [Google Scholar] [CrossRef]
  7. Wu, P.; Wu, J.; Xie, L. Pavement distress detection based on improved feature fusion network. Measurement 2024, 236, 115119. [Google Scholar] [CrossRef]
  8. Zhang, T.; Wang, D.; Lu, Y. ECSNet: An accelerated real-time image segmentation CNN architecture for pavement crack detection. IEEE Trans. Intell. Transp. Syst. 2023, 24, 15105–15112. [Google Scholar] [CrossRef]
9. Ibragimov, E.; Lee, H.J.; Lee, J.J.; Kim, N. Automated pavement distress detection using region based convolutional neural networks. Int. J. Pavement Eng. 2022, 23, 1981–1992.
10. Xu, H.; Chen, B.; Wang, J.; Chen, Z.; Qin, J. Elongated pavement distress detection method based on convolutional neural network. J. Comput. Appl. 2022, 42, 265–272.
11. Hou, Y.; Dong, Y.; Zhang, Y.; Zhou, Z.; Tong, X.; Wu, Q.; Qian, Z.; Li, R. The application of a pavement distress detection method based on FS-Net. Sustainability 2022, 14, 2715.
12. Wang, Z.; Abbas, M.; Wang, L. An attention-based improved YOLOv8 method for pavement distress detection. In Proceedings of the Transportation Research Board Annual Meeting, Washington, DC, USA, 7–11 January 2024.
13. Balci, F.; Yilmaz, S. Faster R-CNN structure for computer vision-based road pavement distress detection. J. Polytech. 2022, 26, 701–710.
14. Luo, H.; Li, C.; Wu, M.; Cai, L. An enhanced lightweight network for road damage detection based on deep learning. Electronics 2023, 12, 2583.
15. Ren, M.; Zhang, X.; Chen, X.; Zhou, B.; Feng, Z. YOLOv5s-M: A deep learning network model for road pavement damage detection from urban street-view imagery. Int. J. Appl. Earth Obs. Geoinf. 2023, 120, 103335.
16. Lv, Z.; Cheng, C.; Lv, H. Automatic identification of pavement cracks in public roads using an optimized deep convolutional neural network model. Philos. Trans. R. Soc. A 2023, 381, 20220169.
17. Riid, A.; Lõuk, R.; Pihlak, R.; Tepljakov, A.; Vassiljeva, K. Pavement distress detection with deep learning using the orthoframes acquired by a mobile mapping system. Appl. Sci. 2019, 9, 4829.
18. Ho, M.C.; Lin, J.D.; Huang, C.F. Automatic image recognition of pavement distress for improving pavement inspection. GEOMATE J. 2020, 19, 242–249.
19. Dadashova, B.; Dobrovolny, C.S.; Tabesh, M. Detecting Pavement Distresses Using Crowdsourced Dashcam Camera Images; Technical Report; Safety through Disruption (Safe-D) University Transportation Center (UTC): College Station, TX, USA, 2021.
20. Dong, H.; Song, K.; Wang, Y.; Yan, Y.; Jiang, P. Automatic inspection and evaluation system for pavement distress. IEEE Trans. Intell. Transp. Syst. 2021, 23, 12377–12387.
21. Ang, G.; Zhiwei, T.; Wei, M.; Yuepeng, S.; Longlong, R.; Yuliang, F.; Jianping, Q.; Lijia, X. Fruits hidden by green: An improved YOLOv8n for detection of young citrus in lush citrus trees. Front. Plant Sci. 2024, 15, 1375118.
22. Guo, M.H.; Lu, C.Z.; Hou, Q.; Liu, Z.; Cheng, M.M.; Hu, S.M. SegNeXt: Rethinking convolutional attention design for semantic segmentation. Adv. Neural Inf. Process. Syst. 2022, 35, 1140–1156.
23. Qiu, Z.; Huang, Z.; Mo, D.; Tian, X. GSE-YOLO: A Lightweight and High-Precision Model for Identifying the Ripeness of Pitaya (Dragon Fruit) Based on the YOLOv8n Improvement. Horticulturae 2024, 10, 852.
24. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv 2023, arXiv:2301.10051.
25. Qian, L.; Qian, W.; Tian, D.; Zhu, Y.; Zhao, H.; Yao, Y. MSCA-UNet: Multi-scale convolutional attention UNet for automatic cell counting using density regression. IEEE Access 2023, 11, 85990–86001.
26. Yu, C.C.; Chen, Y.D.; Cheng, H.Y.; Jiang, C.L. Semantic Segmentation of Satellite Images for Landslide Detection Using Foreground-Aware and Multi-Scale Convolutional Attention Mechanism. Sensors 2024, 24, 6539.
27. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
28. Xue, Z.; Lin, H.; Wang, F. A small target forest fire detection model based on YOLOv5 improvement. Forests 2022, 13, 1332.
29. Saydirasulovich, S.N.; Mukhiddinov, M.; Djuraev, O.; Abdusalomov, A.; Cho, Y.I. An improved wildfire smoke detection based on YOLOv8 and UAV images. Sensors 2023, 23, 8374.
30. Arya, D.; Maeda, H.; Ghosh, S.K.; Toshniwal, D.; Sekimoto, Y. RDD2020: An annotated image dataset for automatic road damage detection using deep learning. Data Brief 2021, 36, 107133.
31. Maeda, H.; Kashiyama, T.; Sekimoto, Y.; Seto, T.; Omata, H. Generative adversarial network for road damage detection. Comput.-Aided Civ. Infrastruct. Eng. 2021, 36, 47–60.
32. Ang, G.; Han, R.; Yuepeng, S.; Longlong, R.; Yue, Z.; Xiang, H. Construction and verification of machine vision algorithm model based on apple leaf disease images. Front. Plant Sci. 2023, 14, 1246065.
Figure 1. Overall structure of the proposed model.
Figure 2. Structure of the multi-scale convolutional attention mechanism.
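For readers prototyping the block in Figure 2, the sketch below follows the multi-scale convolutional attention design published with SegNeXt [22]; the depthwise 5 × 5 stem and the strip-convolution branch sizes (7, 11, 21) are that paper's defaults rather than values confirmed by this article, so treat them as assumptions.

```python
import torch
import torch.nn as nn

class MSCA(nn.Module):
    """Multi-scale convolutional attention, per the SegNeXt design.

    Branch kernel sizes (7, 11, 21) are the published defaults and an
    assumption here; the variant used in this article may differ.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Depthwise 5x5 stem that produces the base attention map.
        self.conv0 = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        # Paired 1xk / kx1 depthwise strip convolutions, one pair per scale.
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels),
                nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels),
            )
            for k in (7, 11, 21)
        )
        self.conv1 = nn.Conv2d(channels, channels, 1)  # channel mixing

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.conv0(x)
        attn = attn + sum(branch(attn) for branch in self.branches)
        return self.conv1(attn) * x  # reweight the input features
```

A quick shape check such as `MSCA(64)(torch.randn(1, 64, 80, 80))` returns a tensor of the same shape, which is what allows the block to be inserted into the backbone without changing downstream dimensions.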
Figure 3. Structure of the SPPELAN module replacing SPPF for enhanced feature retention.
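The Figure 3 replacement for SPPF can be sketched in the same spirit. The layout below (one 1 × 1 projection, three cascaded stride-1 5 × 5 max-pools, concatenation, and a 1 × 1 fusion) follows the SPPELAN block published with YOLOv9, with bare convolutions standing in for the usual Conv-BN-SiLU units, so it is an illustration rather than this article's exact module.

```python
import torch
import torch.nn as nn

class SPPELAN(nn.Module):
    """SPPELAN-style block: SPP pooling with ELAN-style aggregation.

    Channel widths and the 5x5 pooling kernel follow the YOLOv9 default
    and are assumptions here.
    """

    def __init__(self, c_in: int, c_out: int, c_mid: int):
        super().__init__()
        self.cv1 = nn.Conv2d(c_in, c_mid, 1)  # project to the working width
        # Stride-1 pooling keeps the spatial size, so all branches stay aligned.
        self.pools = nn.ModuleList(nn.MaxPool2d(5, stride=1, padding=2) for _ in range(3))
        self.cv2 = nn.Conv2d(4 * c_mid, c_out, 1)  # fuse all four scales

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = [self.cv1(x)]
        for pool in self.pools:
            y.append(pool(y[-1]))  # each pass enlarges the receptive field
        return self.cv2(torch.cat(y, dim=1))
```

Because every intermediate tensor keeps `c_mid` channels, the concatenated width is fixed at 4 × `c_mid`, which matches the caption's point about consistent channel dimensions.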
Figure 4. P-R curve of the improved model.
Figure 5. Detection results under bright lighting conditions.
Figure 6. Detection results under dim lighting conditions.
Figure 7. Detection results under strong light reflections.
Figure 8. Detection results under slippery road surface conditions.
Figure 9. Detection results under shadow occlusion.
Figure 10. Detection results under shadow occlusion.
Figure 11. P-R curves of different detection models.
Figure 12. Road damage detection platform integrating GPS (NovAtel Inc., Calgary, Canada), LiDAR (Hesai Technology Co., Ltd., Shanghai, China), and camera (1920 × 1080 resolution; Daheng Imaging Co., Ltd., Beijing, China) sensors.
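To illustrate how the Figure 12 platform could feed camera frames to the detector in real time, the hypothetical loop below uses the public YOLOv5 Torch Hub API; the weights path `best.pt` and the camera index 0 are placeholders, not details reported in this study.

```python
import cv2
import torch

# Load custom-trained YOLOv5 weights via Torch Hub (path is a placeholder).
model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")
model.conf = 0.25  # confidence threshold for reported detections

cap = cv2.VideoCapture(0)  # on-board camera; the index is platform-specific
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = model(frame)           # per-frame road damage inference
    annotated = results.render()[0]  # draw boxes and labels onto the frame
    cv2.imshow("road damage", annotated)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```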
Figure 13. Data query interface.
Figure 14. Detection results of Transverse Cracks under low-light/high-light conditions.
Figure 15. Detection results of Longitudinal Cracks under low-light/high-light conditions.
Figure 16. Detection results of Alligator Cracks under low-light/high-light conditions.
Figure 17. Detection results of potholes under low-light/high-light conditions.
Table 1. The number of instances for each type of damage.

| Type of Damage | Longitudinal Cracks | Transverse Cracks | Alligator Cracks | Potholes |
|---|---|---|---|---|
| Number | 17,356 | 13,746 | 20,596 | 18,152 |
| Percentage (%) | 24.85 | 19.68 | 29.49 | 25.98 |
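The percentage row follows directly from the instance counts; a minimal check in Python (counts copied from Table 1):

```python
# Per-class instance counts from Table 1.
counts = {
    "Longitudinal Cracks": 17356,
    "Transverse Cracks": 13746,
    "Alligator Cracks": 20596,
    "Potholes": 18152,
}

total = sum(counts.values())  # 69,850 annotated instances in all
for damage_type, n in counts.items():
    # Rounded share of all instances; Potholes prints 25.99% here, while
    # Table 1 reports 25.98%, consistent with truncation of the last digit.
    print(f"{damage_type}: {n} ({100 * n / total:.2f}%)")
```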
Table 2. Results of a five-fold cross-validation experiment.

| Fold | Precision | Recall | mAP@0.5 | F1 | FPS |
|---|---|---|---|---|---|
| 1 | 0.976 | 0.966 | 0.989 | 0.971 | 263.157 |
| 2 | 0.978 | 0.970 | 0.989 | 0.974 | 270.270 |
| 3 | 0.977 | 0.970 | 0.990 | 0.973 | 270.270 |
| 4 | 0.977 | 0.970 | 0.991 | 0.973 | 270.270 |
| 5 | 0.978 | 0.970 | 0.989 | 0.974 | 270.270 |
| Average | 0.977 | 0.970 | 0.990 | 0.973 | 268.847 |
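Assuming F1 is the usual harmonic mean of precision and recall and the Average row is an arithmetic mean over folds, the table's derived columns can be reproduced as follows:

```python
# Per-fold (precision, recall, FPS) values copied from Table 2.
folds = [
    (0.976, 0.966, 263.157),
    (0.978, 0.970, 270.270),
    (0.977, 0.970, 270.270),
    (0.977, 0.970, 270.270),
    (0.978, 0.970, 270.270),
]

def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

for i, (p, r, _) in enumerate(folds, start=1):
    print(f"Fold {i}: F1 = {f1(p, r):.3f}")  # matches the F1 column

# Mean throughput over the five folds: 268.847, as in the Average row.
print(f"Mean FPS: {sum(fps for *_, fps in folds) / len(folds):.3f}")
```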
Table 3. Ablation experiment results.

| Methods | Precision | Recall | mAP@0.5 | FPS | F1 |
|---|---|---|---|---|---|
| YOLOv5s | 0.958 | 0.956 | 0.984 | 238.095 | 0.957 |
| YOLOv5s-MSCA | 0.969 | 0.959 | 0.986 | 227.273 | 0.964 |
| Δ | +0.011 | +0.003 | +0.002 | −10.822 | +0.007 |
| YOLOv5s-MSCA-SPPELAN | 0.977 | 0.968 | 0.990 | 263.158 | 0.972 |
| Δ | +0.008 | +0.009 | +0.004 | +35.885 | +0.008 |
| YOLOv5s-MSCA-SPPELAN-WIOU | 0.975 | 0.970 | 0.990 | 263.158 | 0.973 |
| Δ | −0.002 | +0.002 | 0 | 0 | +0.001 |
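For context on the final ablation row, the sketch below implements the v1 formulation of Wise-IoU [24], in which the IoU loss is amplified by a detached, center-distance-based attention term; the paper also defines v2/v3 variants with a dynamic focusing mechanism, so the exact loss variant used here is an assumption.

```python
import torch

def wiou_v1_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """WIoU v1 sketch for (x1, y1, x2, y2) boxes of shape (N, 4)."""
    # Intersection and union.
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared distance between predicted and target box centers.
    rho2 = (((pred[:, :2] + pred[:, 2:]) - (target[:, :2] + target[:, 2:])) ** 2).sum(dim=1) / 4

    # Squared diagonal of the smallest enclosing box, detached so the
    # attention term scales the loss without back-propagating through it.
    enc_wh = torch.max(pred[:, 2:], target[:, 2:]) - torch.min(pred[:, :2], target[:, :2])
    diag2 = (enc_wh ** 2).sum(dim=1).detach() + eps

    r_wiou = torch.exp(rho2 / diag2)  # amplifies poorly localized boxes
    return (r_wiou * (1.0 - iou)).mean()
```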
Table 4. Comparison of different models.

| Methods | Precision | Recall | mAP@0.5 | FPS | F1 |
|---|---|---|---|---|---|
| YOLOv5s | 0.958 | 0.956 | 0.984 | 238.095 | 0.957 |
| YOLOv7 | 0.882 | 0.862 | 0.927 | 147.060 | 0.870 |
| YOLOv9 | 0.951 | 0.916 | 0.968 | 41.840 | 0.930 |
| Ours | 0.975 | 0.970 | 0.990 | 263.158 | 0.973 |
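For intuition, the FPS column is simply the reciprocal of per-frame latency, so 263.158 FPS corresponds to about 3.8 ms per image; a trivial conversion (the function name is illustrative):

```python
def fps_to_latency_ms(fps: float) -> float:
    """Per-frame latency in milliseconds implied by a throughput in FPS."""
    return 1000.0 / fps

# Throughputs from Table 4; e.g., the proposed model runs at ~3.8 ms/frame.
for name, fps in [("YOLOv5s", 238.095), ("YOLOv7", 147.060), ("YOLOv9", 41.840), ("Ours", 263.158)]:
    print(f"{name}: {fps_to_latency_ms(fps):.1f} ms/frame")
```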
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
