Applied Sciences
  • Article
  • Open Access

16 April 2024

RDD-YOLO: Road Damage Detection Algorithm Based on Improved You Only Look Once Version 8

School of Computer Science and Technology, Dong Hua University, Shanghai 201620, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Deep Learning for Object Detection

Abstract

The detection of road damage is highly important for traffic safety and road maintenance. Conventional detection approaches frequently require significant time and expenditure, cannot guarantee detection accuracy, and are prone to misdetection or omission. Therefore, this paper introduces an enhanced version of the You Only Look Once version 8 (YOLOv8) road damage detection algorithm called RDD-YOLO. First, the simple attention mechanism (SimAM) is integrated into the backbone, which improves the model’s focus on crucial details within the input image, enabling it to capture features of road damage more accurately and thus enhancing precision. Second, the neck structure is optimized by replacing traditional convolution modules with GhostConv. This reduces redundant information, lowers the number of parameters, and decreases computational complexity while maintaining the model’s strong performance in damage recognition. Last, the upsampling algorithm in the neck is improved by replacing nearest interpolation with more accurate bilinear interpolation. This enhances the model’s capacity to preserve visual details, providing clearer and more accurate outputs for road damage detection tasks. Experimental findings on the RDD2022 dataset show that the proposed RDD-YOLO model achieves an mAP50 and mAP50-95 of 62.5% and 36.4% on the validation set, respectively, improvements of 2.5 and 5.2 percentage points over the baseline. The F1 score on the test set reaches 69.6%, 2.8 percentage points above the baseline. The proposed method can accurately locate and detect road damage, save labor and material resources, and offer guidance for the assessment and upkeep of road damage.

1. Introduction

Roads constitute the backbone of transportation and are a crucial component of urban and rural infrastructure. Well-maintained road infrastructure is a prerequisite for urbanization and economic development. With the ongoing development of road construction in China, the scale and complexity of the road network have grown explosively. As of the end of 2022, the total road mileage in China had reached 5.35 million kilometers, an increase of 1.12 million km over the past decade. Of this total, highway mileage reached 177,000 km, ranking first globally [1]. However, as the road network expands, road damage issues become increasingly prominent. Owing to differences in vehicle loads and service lifespans, these roads suffer varying degrees of damage. Road damage such as potholes and cracks not only reduces the efficiency of road traffic but also increases the instability of vehicle travel, raising the probability of traffic accidents and posing a threat to the safety of vehicles and passengers. Therefore, to maintain road infrastructure and improve road safety, it is crucial to detect and repair such road damage in a timely manner, which places strict requirements on the task of road damage detection.
Although the currently proposed deep learning-based object detection techniques have demonstrated remarkable success, there still remain challenges and room for enhancement in the field of road damage detection. Firstly, road damage exhibits diverse types with irregular shapes and sizes, making the detection task inherently challenging. Secondly, road background images encompass various elements such as different road surface textures, vehicles, pedestrians, and buildings, which may interfere with the accurate detection of road damage and impact the model’s precision. Furthermore, the appearance of road damage varies under different lighting and weather conditions, adding complexity to the road damage detection task. To tackle these obstacles and improve the accuracy and robustness of detection, this paper improves the state-of-the-art (SOTA) YOLOv8 model and applies it to road damage detection, resulting in improved precision and performance. The main contributions of this paper are as follows:
  • The introduction of the SimAM [2] into the backbone filters out noise and amplifies the model’s focus on important information within the input image. This enhancement significantly improves the model’s performance and generalization ability, enhancing its ability to effectively handle road damage detection tasks.
  • The neck structure is enhanced by replacing traditional convolution modules with GhostConv [3]. This not only successfully reduces redundant information in the model, lowering the number of parameters and computational complexity, but also maintains the model’s outstanding performance in recognizing damage.
  • To ensure the provision of more accurate images, the nearest interpolation in the neck is replaced with a more accurate bilinear interpolation. This modification enhances the model’s capability to retain complex details in the image, producing a clearer, more accurate output that helps minimize the impact of environmental factors on detection results.

3. The Proposed Method

This paper makes several improvements to YOLOv8, as depicted in Figure 4. The SimAM is added to the backbone network so that the model can focus more effectively on the important information in the input image, improving the model’s performance and outputs; the convolution modules of the neck are replaced by GhostConv, which reduces redundant information, the number of model parameters, and computational complexity while retaining the model’s strong recognition performance; and the upsampling algorithm of the neck is changed from nearest interpolation to bilinear interpolation, which yields higher-precision upsampled features.
Figure 4. Improved YOLOv8 network architecture.

3.1. SimAM

In deep learning, an attention mechanism is an approach that mimics the human visual and cognitive system. It assists a model in assigning different weights to each part of the input, so that more crucial and significant information is extracted and the model can make more precise assessments, without adding notable computational or storage overhead [24]. This enhances the model’s performance and generalization ability.
Most existing attention modules typically generate one-dimensional or two-dimensional weights based on the feature X. As shown in Figure 5a,b, these generated weights are expanded into the channel attention mechanism and spatial attention mechanism to refine features in the channel or spatial dimensions. This approach limits their flexibility to learn different attention weights in both channel and spatial dimensions. Moreover, these mechanisms are often constructed based on a series of complex factors, such as the choice of pooling. This paper introduces the SimAM into the backbone network of YOLOv8. The SimAM is a straightforward, parameter-free plug-and-play module [25]. As shown in Figure 5c, it simultaneously considers channel and spatial dimensions, calculates three-dimensional weights through its own defined neuron energy function, and guides the model to focus more on important information with high energy and ignore noise information. This enhances the model’s performance and generalization ability without adding any parameters. The calculation steps of the SimAM are as follows:
Figure 5. Comparisons of different attention in different dimensions [2]. The identical color signifies the use of a sole scalar applied to every channel, spatial position, or individual point across those features.
SimAM directly infers three-dimensional weights from neurons and then refines these neurons in turn. SimAM defines the following energy function for each neuron:
e_t(w_t, b_t, \mathbf{y}, x_i) = \left(y_t - \hat{t}\right)^2 + \frac{1}{M-1}\sum_{i=1}^{M-1}\left(y_0 - \hat{x}_i\right)^2, \qquad (1)
where $\hat{t} = w_t t + b_t$ and $\hat{x}_i = w_t x_i + b_t$ are linear transformations of $t$ and $x_i$; $t$ and $x_i$ are the target neuron and the other neurons in a single channel of the input feature $X \in \mathbb{R}^{C \times H \times W}$; $i$ is the index over the spatial dimension; $M = H \times W$ is the number of neurons on the channel; and $w_t$ and $b_t$ are the weight and bias of the transformation. All values in Equation (1) are scalars. Equation (1) attains its minimum when $\hat{t}$ equals $y_t$, all $\hat{x}_i$ equal $y_0$, and $y_t \neq y_0$. Minimizing this equation amounts to finding the linear separability between the target neuron $t$ and the other neurons in the same channel. For simplicity, binary labels are used (i.e., $y_t = 1$ and $y_0 = -1$) and a regularization term is added to Equation (1), yielding the final energy function:
e_t(w_t, b_t, \mathbf{y}, x_i) = \frac{1}{M-1}\sum_{i=1}^{M-1}\left(-1 - (w_t x_i + b_t)\right)^2 + \left(1 - (w_t t + b_t)\right)^2 + \lambda w_t^2. \qquad (2)
The analytical solution obtained after simplifying the above formula is
w_t = -\frac{2\left(t - \mu_t\right)}{\left(t - \mu_t\right)^2 + 2\sigma_t^2 + 2\lambda}, \qquad (3)
b_t = -\frac{1}{2}\left(t + \mu_t\right) w_t, \qquad (4)
where $\mu_t = \frac{1}{M-1}\sum_{i=1}^{M-1} x_i$ and $\sigma_t^2 = \frac{1}{M-1}\sum_{i=1}^{M-1}\left(x_i - \mu_t\right)^2$ are the mean and variance computed over all neurons in the channel except $t$. The minimum energy can then be calculated with the following formula:
e_t^{*} = \frac{4\left(\hat{\sigma}^2 + \lambda\right)}{\left(t - \hat{\mu}\right)^2 + 2\hat{\sigma}^2 + 2\lambda}, \qquad (5)
where $\hat{\mu} = \frac{1}{M}\sum_{i=1}^{M} x_i$ and $\hat{\sigma}^2 = \frac{1}{M}\sum_{i=1}^{M}\left(x_i - \hat{\mu}\right)^2$. Equation (5) indicates that the lower the energy $e_t^{*}$, the more distinct and linearly separable the neuron $t$ is from the surrounding neurons, and the more important it is for visual processing. The importance of each neuron can therefore be expressed by $1/e_t^{*}$. Finally, the features are refined according to the definition of attention mechanisms:
\tilde{X} = \operatorname{sigmoid}\!\left(\frac{1}{E}\right) \odot X, \qquad (6)
where $E$ groups all $e_t^{*}$ values across the channel and spatial dimensions. The sigmoid function restricts overly large values of $1/E$ and, being monotonic, does not change the relative importance of each neuron.
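To make the computation above concrete, the following is a minimal PyTorch sketch of the SimAM refinement in Equations (5) and (6). It is an illustrative implementation rather than the authors’ code; the regularization coefficient lambda is assumed to be a small constant (1e-4 here).

```python
import torch
import torch.nn as nn


class SimAM(nn.Module):
    """Parameter-free attention: weights each neuron by the inverse of its
    minimal energy e_t* (Equation (5)) and rescales the input (Equation (6))."""

    def __init__(self, lam: float = 1e-4):
        super().__init__()
        self.lam = lam  # regularization coefficient lambda (assumed value)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (N, C, H, W); each channel contains M = H * W neurons.
        n = x.shape[2] * x.shape[3] - 1
        # (t - mu)^2 for every neuron, computed per channel
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        # channel variance estimated over the remaining M - 1 neurons
        v = d.sum(dim=(2, 3), keepdim=True) / n
        # inverse energy 1 / e_t* (up to a constant); low energy -> large weight
        e_inv = d / (4 * (v + self.lam)) + 0.5
        # Equation (6): sigmoid-bounded 3-D weights applied element-wise
        return x * torch.sigmoid(e_inv)


# Example: refine a backbone feature map without adding any learnable parameters.
features = torch.randn(1, 64, 80, 80)
refined = SimAM()(features)
print(refined.shape)  # torch.Size([1, 64, 80, 80])
```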

3.2. GhostConv Convolution Module

Popular convolution modules often generate rich feature maps to guarantee a comprehensive representation of the input data. However, this inevitably leads to feature redundancy, resulting in unnecessary computational overhead. For a given input $X \in \mathbb{R}^{c \times h \times w}$, where $c$ is the number of input channels and $h$ and $w$ are the height and width of the input, the operation by which a conventional convolutional layer produces $n$ feature maps can be formulated as
Y = X * f + b, \qquad (7)
where $*$ denotes the convolution operation, $b$ is the bias term, $Y \in \mathbb{R}^{h' \times w' \times n}$ is the output feature map with $n$ channels, and $f \in \mathbb{R}^{c \times k \times k \times n}$ are the convolution filters of this layer. Here, $h'$ and $w'$ are the height and width of the output, and $k \times k$ is the kernel size of the filters $f$. The number of floating-point operations (FLOPs) required by this convolution is $n \cdot h' \cdot w' \cdot c \cdot k \cdot k$; since the number of filters $n$ and the number of channels $c$ are usually large, the FLOP count is also large.
To address this issue, in this paper, we replace the Conv module in the YOLOv8 neck network with GhostConv. GhostConv is a convolutional module from GhostNet, a lightweight network designed by Noah’s Ark Laboratory. Compared to regular convolutional neural networks, GhostConv decreases the number of parameters and computational complexity while maintaining the size of the output feature map. Additionally, it maintains superior recognition performance.
GhostConv first uses a regular convolution to generate $m$ intrinsic feature maps $Y' \in \mathbb{R}^{h' \times w' \times m}$:
Y' = X * f', \qquad (8)
where $f' \in \mathbb{R}^{c \times k \times k \times m}$ are the filters used, $m \le n$, and the bias term is omitted for simplicity. To keep the spatial size of the output feature maps (i.e., $h'$ and $w'$) unchanged, hyperparameters such as filter size, stride, and padding are the same as in the regular convolution of Equation (7). To obtain the desired $n$ feature maps, a series of inexpensive linear operations is applied to each intrinsic feature map in $Y'$, generating $s$ ghost feature maps according to the following function:
y_{ij} = \Phi_{i,j}\left(y'_i\right), \quad i = 1, \ldots, m, \; j = 1, \ldots, s, \qquad (9)
where $y'_i$ is the $i$-th intrinsic feature map in $Y'$, and $\Phi_{i,j}$ is the $j$-th linear operation (excluding the last one) used to generate the ghost feature map $y_{ij}$. In other words, each $y'_i$ can have one or more ghost feature maps $\{y_{ij}\}_{j=1}^{s}$, and the final operation $\Phi_{i,s}$ is the identity mapping, which preserves the intrinsic feature map. Using Equation (9), we obtain $n = m \cdot s$ feature maps $Y = [y_{11}, y_{12}, \ldots, y_{ms}]$ as the output of the Ghost module, as illustrated in Figure 6.
Figure 6. Comparisons of Conv and GhostConv [3]. (a) The convolutional layer. (b) The Ghost module.
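As an illustration of this idea, below is a minimal PyTorch sketch of a Ghost-style convolution block. It is a simplified stand-in rather than the exact GhostNet or YOLOv8 implementation; the ratio s = 2 and the 5 × 5 depthwise convolution used as the cheap operation Φ are assumptions.

```python
import torch
import torch.nn as nn


class GhostConv(nn.Module):
    """Ghost-style convolution: a primary conv produces m = out_ch // s intrinsic
    maps, cheap depthwise convs (the linear operations Phi) generate the ghost
    maps, and both groups are concatenated to give n = m * s output channels."""

    def __init__(self, in_ch: int, out_ch: int, k: int = 1, s: int = 2, dw_k: int = 5):
        super().__init__()
        m = out_ch // s  # number of intrinsic feature maps
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, m, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(m),
            nn.SiLU(),
        )
        # Depthwise convolution as the cheap linear operation (groups = m).
        self.cheap = nn.Sequential(
            nn.Conv2d(m, out_ch - m, dw_k, padding=dw_k // 2, groups=m, bias=False),
            nn.BatchNorm2d(out_ch - m),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.primary(x)                      # intrinsic maps Y'
        return torch.cat([y, self.cheap(y)], 1)  # identity mapping + ghost maps


# Rough parameter comparison against a plain 3 x 3 convolution.
plain = nn.Conv2d(256, 256, 3, padding=1, bias=False)
ghost = GhostConv(256, 256, k=3)
print(sum(p.numel() for p in plain.parameters()))  # 589,824
print(sum(p.numel() for p in ghost.parameters()))  # roughly half as many
```

For a 256-channel 3 × 3 layer, this sketch uses roughly half the parameters of the plain convolution, which illustrates where the parameter and computation savings come from.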

3.3. Bilinear Interpolation

In YOLOv8, nearest interpolation is used for upsampling and enlarging images. As shown in Figure 7, this method maps each pixel of the target image back to the coordinate system of the original image, finds the original pixel closest to that mapped position, and assigns its value to the target pixel. Nearest interpolation is the simplest of all interpolation algorithms and requires the least processing time [26]. However, it simply selects the nearest pixel value without considering the information of surrounding pixels, and it does not generate new pixel values. As a result, it cannot achieve smooth transitions between pixels during upsampling, leading to jagged edges or blurred details in the image. This effect is particularly noticeable for detail-rich images and may lead to image distortion.
Figure 7. Nearest interpolation diagram.
Therefore, this paper uses bilinear interpolation for the upsampling operation. Bilinear interpolation takes a weighted average of the four neighboring pixels to calculate each interpolated value [27], which better preserves details, producing smoother images and more realistic results. There are two alignment methods for bilinear interpolation, and the scaling ratios differ depending on the alignment method, as shown in Figure 8. In the corner-alignment method, the center points of the pixels at the four corners of the target image align with the center points of the pixels at the four corners of the original image. In the edge-alignment method, the edges of the target image align with those of the original image after scaling. The bilinear interpolation used in this paper employs the corner-alignment method. The calculation steps of bilinear interpolation are as follows.
Figure 8. Comparison diagram of corner alignment and edge alignment.
As shown in Figure 9, suppose we require the pixel value of a point $P$ located at $(x, y)$ in the scaled target image, and the four nearest points in the original image, $Q_{11}$, $Q_{12}$, $Q_{21}$, and $Q_{22}$, are located at $(x_1, y_1)$, $(x_1, y_2)$, $(x_2, y_1)$, and $(x_2, y_2)$, respectively, with known pixel values $f(Q_{11})$, $f(Q_{12})$, $f(Q_{21})$, and $f(Q_{22})$.
Figure 9. Bilinear interpolation diagram.
To begin with, linear interpolation is performed in the x-axis direction for the pairs $Q_{11}, Q_{21}$ and $Q_{12}, Q_{22}$, yielding the intermediate points $R_1 = (x, y_1)$ and $R_2 = (x, y_2)$:
f(x, y_1) = \frac{x_2 - x}{x_2 - x_1} f(Q_{11}) + \frac{x - x_1}{x_2 - x_1} f(Q_{21}), \qquad (10)
f(x, y_2) = \frac{x_2 - x}{x_2 - x_1} f(Q_{12}) + \frac{x - x_1}{x_2 - x_1} f(Q_{22}). \qquad (11)
After that, linear interpolation is performed on $R_1$ and $R_2$ in the y-axis direction to obtain the pixel value of point $P$:
\begin{aligned}
f(x, y) &= \frac{y_2 - y}{y_2 - y_1} f(x, y_1) + \frac{y - y_1}{y_2 - y_1} f(x, y_2) \\
&= \frac{1}{(x_2 - x_1)(y_2 - y_1)} \Big[ f(Q_{11})(x_2 - x)(y_2 - y) + f(Q_{21})(x - x_1)(y_2 - y) \\
&\qquad + f(Q_{12})(x_2 - x)(y - y_1) + f(Q_{22})(x - x_1)(y - y_1) \Big] \\
&= \frac{1}{(x_2 - x_1)(y_2 - y_1)}
\begin{bmatrix} x_2 - x & x - x_1 \end{bmatrix}
\begin{bmatrix} f(Q_{11}) & f(Q_{12}) \\ f(Q_{21}) & f(Q_{22}) \end{bmatrix}
\begin{bmatrix} y_2 - y \\ y - y_1 \end{bmatrix}. \qquad (12)
\end{aligned}
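As a small numerical illustration of the difference between the two upsampling modes (a sketch, not the paper’s code), the following PyTorch snippet upsamples a 2 × 2 patch by a factor of two; align_corners=True corresponds to the corner-alignment method described above.

```python
import torch
import torch.nn.functional as F

# A single-channel 2 x 2 feature patch.
x = torch.tensor([[[[1.0, 2.0],
                    [3.0, 4.0]]]])

# Nearest interpolation copies the closest source pixel: blocky, no new values.
nearest = F.interpolate(x, scale_factor=2, mode="nearest")

# Bilinear interpolation with corner alignment: output corners coincide with the
# input corners, and interior values are weighted averages of the four
# neighbours, as in Equation (12).
bilinear = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=True)

print(nearest.squeeze())   # rows of repeated 1, 2 / 3, 4
print(bilinear.squeeze())  # smooth gradient between 1 and 4
```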

4. Experimental Results and Analysis

4.1. Dataset

This paper uses the RDD2022 dataset [28] from the open-source Crowdsensing-based Road Damage Detection Challenge (CRDDC’2022) [29] for model training. The RDD2022 dataset consists of 47,420 road images collected from six countries: Japan, India, the Czech Republic, Norway, the United States, and China. The dataset is prepared following the format of the PASCAL Visual Object Classes (VOC) datasets [30], with XML files providing annotations for over 55,000 instances of road damage. As illustrated in Figure 10, the dataset encompasses four damage types: longitudinal cracks (D00), transverse cracks (D10), alligator cracks (D20), and potholes (D40), with respective counts of 26,016, 11,830, 10,617, and 6544 instances. The distribution of these damage types reveals that longitudinal cracks are the most prevalent, while potholes are relatively less frequent.
Figure 10. Example images for road damage categories.
The RDD2022 dataset is segmented into two sets: the training set and the test set. The distribution of images and labels across different countries in the dataset is illustrated in Figure 11. Images in the training set are accompanied by annotation files, while images in the test set do not include annotations. Therefore, the detection results need to be converted into files in a specific format and submitted to the CRDDC’2022 official website (https://crddc2022.sekilab.global/ (accessed on 1 March 2024)) for evaluation. Consequently, we divided the data from the RDD2022 training set into our experimental training set and validation set in a 7:3 ratio, using the RDD2022 test set as our experimental test set. We then tested the trained model on the test set and submitted the results to the official website to view the outcomes.
Figure 11. RDD2022 data statistics: distribution of images and labels based on countries.
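For reproducibility, a minimal sketch of such a 7:3 split is given below. It assumes the images and YOLO-format label files already sit in flat images/ and labels/ directories; these paths and the random seed are hypothetical and not taken from the paper.

```python
import random
import shutil
from pathlib import Path

random.seed(0)  # hypothetical seed for a repeatable split
images = sorted(Path("images").glob("*.jpg"))
random.shuffle(images)

split = int(0.7 * len(images))  # 7:3 train/validation split used in this paper
subsets = {"train": images[:split], "val": images[split:]}

for name, files in subsets.items():
    (Path(name) / "images").mkdir(parents=True, exist_ok=True)
    (Path(name) / "labels").mkdir(parents=True, exist_ok=True)
    for img in files:
        shutil.copy(img, Path(name) / "images" / img.name)
        label = Path("labels") / (img.stem + ".txt")
        if label.exists():  # copy the matching annotation file
            shutil.copy(label, Path(name) / "labels" / label.name)
```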

4.2. Experimental Environment

The experimental setup for this paper includes the following hardware and software configurations: the server operating system is Ubuntu 18.04, equipped with an Intel(R) Xeon(R) E5-2667 v4 CPU and an NVIDIA RTX 4090 GPU with 24 GB of VRAM. The training environment is configured with PyTorch 2.0.1, Python 3.8, and CUDA 11.7.
The experimental parameters in this paper are configured as follows: The input image resolution is set to 640 × 640, the initial learning rate is 0.01, the learning rate momentum is 0.937, and the Stochastic Gradient Descent (SGD) optimizer is used for training.
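A sketch of how this configuration might be launched with the Ultralytics training API is shown below. The dataset file name rdd2022.yaml is hypothetical, the 180-epoch setting is taken from Section 4.4, and the call trains the unmodified YOLOv8x baseline rather than the modified RDD-YOLO network.

```python
from ultralytics import YOLO

model = YOLO("yolov8x.yaml")  # YOLOv8x architecture used as the baseline
model.train(
    data="rdd2022.yaml",  # hypothetical dataset config (paths + 4 damage classes)
    imgsz=640,            # input resolution 640 x 640
    epochs=180,           # 180 training rounds (Section 4.4)
    optimizer="SGD",      # SGD optimizer
    lr0=0.01,             # initial learning rate
    momentum=0.937,       # learning-rate momentum
)
```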

4.3. Evaluation Metrics

This paper follows the evaluation metrics for road damage detection in CRDDC’2022, using the F1 score and mean average precision (mAP) to assess the performance of the trained model. The F1 score is the harmonic mean of precision and recall; maximizing the F1 score ensures reasonably high precision and recall [31]. Precision is the ratio of true positives to all predicted positives, while recall is the ratio of true positives to all actual positives. Both precision and recall are based on the Intersection over Union (IoU), defined as the overlap area between the predicted bounding box and the true bounding box divided by the area of their union. It measures the degree of overlap between the predicted and true target boxes, as shown in Figure 12. Since precision and recall trade off against each other (increasing one usually decreases the other), the F1 score is employed as the final evaluation metric to balance both. mAP is the mean of the average precision (AP) over the different classes. This paper uses the mAP50 and mAP50-95 metrics for evaluation. mAP50 is the average precision at an IoU threshold of 0.5, and mAP50-95 extends mAP50 by averaging the precision over IoU thresholds from 0.5 to 0.95 in steps of 0.05. It offers a comprehensive assessment of the model’s accuracy in predicting target boxes under different IoU thresholds. The detailed definitions are as follows (a minimal computational sketch of IoU and the F1 score is given after Figure 12):
  • True Positive (TP): A road damage instance exists in the ground truth, the predicted label is correct, and the predicted bounding box overlaps the ground-truth box with IoU > 0.5.
  • False Positive (FP): The model predicts a road damage instance at a position where none exists in the ground truth; this also includes cases where the predicted label does not match the actual label.
  • False Negative (FN): A road damage instance exists in the ground truth, but the model fails to predict the correct label or bounding box for it.
  • Recall:
    \mathrm{Recall} = \frac{TP}{TP + FN}.
  • Precision:
    \mathrm{Precision} = \frac{TP}{TP + FP}.
  • F1 Score:
    \mathrm{F1\ Score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.
  • AP: The precision–recall (PR) curve $p(r)$ can be plotted from the precision and recall values, and the AP is the average of $p(r)$ over the domain $r \in [0, 1]$:
    \mathrm{AP} = \int_{0}^{1} p(r)\, dr.
  • mAP: The APs of the individual categories are averaged:
    \mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{AP}_i.
  • Params: The total quantity of learnable parameters in the neural network, including model weights and biases, is used to measure the spatial complexity and size of the model.
  • Giga-FLOPs (GFLOPs): The number of floating-point operations, in billions, required by one forward pass of the model. GFLOPs is used to evaluate the computational complexity and efficiency of the model.
Figure 12. IoU calculation diagram.
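The sketch below gives minimal reference implementations of the IoU and F1 computations defined above (illustrative only, not the CRDDC’2022 evaluation code).

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def f1_score(tp, fp, fn):
    """F1 score as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# A prediction counts as a TP when its class matches and IoU > 0.5.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143 -> not a match
print(f1_score(tp=70, fp=20, fn=40))        # ≈ 0.700
```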

4.4. Experimental Implementation and Experimental Results

In pursuit of higher accuracy, this paper adopts YOLOv8x as the baseline and conducts comparative experiments between RDD-YOLO and other classic YOLO algorithms, training each model for 180 epochs. Figure 13 displays the mAP50 and mAP50-95 curves on the validation set. The performance of YOLOv5x and YOLOv7x is comparable, and YOLOv8x performs somewhat better. Compared with the other YOLO algorithms, the proposed RDD-YOLO model converges faster and reaches higher mAP50 and mAP50-95 values.
Figure 13. mAP50 and mAP50-95 curves of different YOLO algorithms in validation set.
The final mAP50 and mAP50-95 of RDD-YOLO and YOLOv8x on the validation set for the different road damage categories are given in Table 1. According to the table, both RDD-YOLO and YOLOv8x detect the D20 category best, owing to the distinct features of alligator cracks. In contrast, because potholes are irregularly shaped, small, and comparatively rare in the dataset, detection performance for D40 is lower. RDD-YOLO achieves final mAP50 and mAP50-95 scores of 62.8% and 36.7%, respectively, improvements of 2.6 and 5.2 percentage points over YOLOv8x. The gain is largest for D40, with increases of 4.5 percentage points in mAP50 and 6.2 percentage points in mAP50-95, and the model also shows notable improvements for the other types of road damage. By better retaining image details and focusing on key information, RDD-YOLO improves performance on road damage detection tasks.
Table 1. Comparison of road damage detection results for various types of roads in validation set.
To validate the performance of the algorithm presented in this paper, we compare RDD-YOLO with high-accuracy models proposed in recent years, such as YOLOv5x, YOLOv7x, and YOLOv8x. We also include p-YOLOv5x for comparison, a model that uses pre-trained weights from the winning solution of the Global Road Damage Detection Challenge (GRDDC’2020) [32] proposed by Hegde et al. [33]. As Table 2 shows, newer YOLO versions have smaller parameter counts than older ones, and YOLOv8x outperforms the lower versions. p-YOLOv5x performs well on the Indian and Japanese subsets because its pre-trained weights come from RDD2020 [34], which contains images only from Japan, India, and the Czech Republic; its overall F1 score, however, is still lower than that of YOLOv5x. Notably, the proposed RDD-YOLO performs well on both the overall dataset and the individual country subsets while having a smaller parameter count. Although RDD-YOLO requires fewer GFLOPs than YOLOv8x, its GFLOPs are still higher than those of the other models, so it may require more computational resources and time. In summary, RDD-YOLO excels in both performance and model size, surpassing the other models, which provides strong support for future studies and practical applications.
Table 2. Comparison results of different algorithms.
To further validate the effectiveness of the introduced improvements in RDD-YOLO, a series of ablation experiments was conducted, as shown in Table 3. Comparing the results shows that the improvements have a positive impact on performance while barely affecting model size and inference speed. First, the SimAM and the replacement with bilinear interpolation improve the F1 score of the original model by 1.5 and 0.6 percentage points, respectively, indicating that these modifications noticeably enhance the model’s ability to detect road damage without increasing its complexity or computational overhead. Second, the GhostConv structure improves the F1 score by 0.3 percentage points while also reducing the model’s parameters and computations. Finally, with all improvements combined, the proposed method achieves a 2.8 percentage-point increase in F1 score over the baseline YOLOv8x, while also reducing model parameters and computation to a certain degree.
Table 3. Model improvement ablation experiment.
Figure 14 shows the detection performance of our model on the test set. The subplots correspond to the four common types of road damage described above, and several features of the detection results can be observed. First, alligator cracks and potholes have relatively distinct features, so the predicted boxes for them carry high confidence scores, indicating that the model detects these more obvious damage types well and marks their positions accurately. Longitudinal and transverse cracks, by contrast, are usually small and narrow; because of their finer morphology, the model tends to produce lower confidence scores when detecting them. Nevertheless, the model still performs satisfactorily, locating and fitting these elongated cracks accurately. Overall, the model handles cracks of different shapes well and generalizes well: it adapts not only to the more obvious and common damage types but also to finer cracks, providing a reliable solution for comprehensive road damage detection, which is of positive significance for ensuring road safety.
Figure 14. Sample images of road damage detection for each category.

5. Conclusions

To enhance the precision of road damage detection and minimize the labor and material expenses associated with road damage assessment and maintenance, this paper proposes a road damage detection algorithm based on improved YOLOv8, named RDD-YOLO. The algorithm introduces several key optimizations. First, the SimAM is introduced into the backbone network to strengthen the focus on key information in the input image, allowing the model to concentrate on critical areas, capture relevant road damage features more effectively, and thus significantly improve detection accuracy. Furthermore, the neck structure is optimized by replacing the traditional convolution module with GhostConv, which makes the model lighter and more efficient while maintaining excellent damage recognition performance. Lastly, the upsampling algorithm of the neck is improved by replacing nearest interpolation with bilinear interpolation, which better restores subtle image features and improves the accuracy of the detection results through finer interpolation. Experimental results demonstrate that RDD-YOLO achieves an F1 score of 69.6% on the RDD2022 dataset, surpassing the other algorithms in detection performance while using fewer parameters. Although RDD-YOLO performs well, there is still room for improvement. Road damage in reality comes in many varieties, while RDD2022 covers only four common types, so our model can only identify these four types. In addition, in pursuit of higher F1 scores, we chose the large-scale YOLOv8x as the baseline model; despite the reductions in parameter count, the model remains large and may be difficult to deploy on resource-constrained mobile and edge devices with high real-time requirements. In future work, we will seek more comprehensive datasets covering a greater variety of road damage types, further optimize the network, reduce model size and complexity, increase detection speed, and strive to further improve detection accuracy, so that the model can be better applied to practical tasks such as road surface crack detection.

Author Contributions

Conceptualization, Y.L. (Yue Li) and Y.L. (Yutian Lei); methodology, Y.L. (Yue Li) and Y.Y.; software, C.Y. and J.Z.; investigation, Y.L. (Yutian Lei) and Y.Y.; data curation, J.Z.; writing—original draft preparation, C.Y.; writing—review and editing, C.Y.; visualization, Y.L. (Yue Li) and J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Shanghai Science and Technology Innovation Action Plan Project: 22511100700.

Data Availability Statement

The open-source dataset RDD2022 can be found at https://github.com/sekilab/RoadDamageDetector (accessed on 1 March 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Available online: https://baike.baidu.hk/item/%E5%85%AC%E8%B7%AF/7265058 (accessed on 3 April 2024).
  2. Yang, L.; Zhang, R.-Y.; Li, L.; Xie, X. SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks. In Proceedings of the 38th International Conference on Machine Learning; PMLR, Virtual Event, 18–24 July 2021; pp. 11863–11874. [Google Scholar]
  3. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More Features from Cheap Operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1580–1589. [Google Scholar]
  4. Lim, R.S.; La, H.M.; Shan, Z.; Sheng, W. Developing a Crack Inspection Robot for Bridge Maintenance. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 6288–6293. [Google Scholar]
  5. Kapela, R.; Śniatała, P.; Turkot, A.; Rybarczyk, A.; Pożarycki, A.; Rydzewski, P.; Wyczałek, M.; Błoch, A. Asphalt Surfaced Pavement Cracks Detection Based on Histograms of Oriented Gradients. In Proceedings of the 2015 22nd International Conference Mixed Design of Integrated Circuits & Systems (MIXDES), Torun, Poland, 25–27 June 2015; pp. 579–584. [Google Scholar]
  6. Wang, G.-L.; Hu, J.; Qian, J.-G.; Wang, Y.-Q. Simulation in Time Domain for Nonstationary Road Disturbances and Its Wavelet Analysis. Zhendong Yu Chongji (J. Vib. Shock) 2010, 29, 28–32. [Google Scholar]
  7. Fan, Z.; Wu, Y.; Lu, J.; Li, W. Automatic Pavement Crack Detection Based on Structured Prediction with the Convolutional Neural Network. arXiv 2018, arXiv:1802.02208. [Google Scholar]
  8. Cao, M.-T.; Tran, Q.-V.; Nguyen, N.-M.; Chang, K.-T. Survey on Performance of Deep Learning Models for Detecting Road Damages Using Multiple Dashcam Image Resources. Adv. Eng. Inform. 2020, 46, 101182. [Google Scholar] [CrossRef]
  9. Kang, D.; Benipal, S.S.; Gopal, D.L.; Cha, Y.-J. Hybrid Pixel-Level Concrete Crack Segmentation and Quantification across Complex Backgrounds Using Deep Learning. Autom. Constr. 2020, 118, 103291. [Google Scholar] [CrossRef]
  10. Mandal, V.; Mussah, A.R.; Adu-Gyamfi, Y. Deep Learning Frameworks for Pavement Distress Classification: A Comparative Analysis. In Proceedings of the 2020 IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 10–13 December 2020; pp. 5577–5583. [Google Scholar]
  11. Chen, Q.; Gan, X.; Huang, W.; Feng, J.; Shim, H. Road Damage Detection and Classification Using Mask R-CNN with DenseNet Backbone. Comput. Mater. Contin. 2020, 65, 2201–2215. [Google Scholar] [CrossRef]
  12. Yuan, Y.; Yuan, Y.; Baker, T.; Kolbe, L.M.; Hogrefe, D. FedRD: Privacy-Preserving Adaptive Federated Learning Framework for Intelligent Hazardous Road Damage Detection and Warning. Future Gener. Comput. Syst. 2021, 125, 385–398. [Google Scholar] [CrossRef]
  13. Zhang, Y.; Zuo, Z.; Xu, X.; Wu, J.; Zhu, J.; Zhang, H.; Wang, J.; Tian, Y. Road Damage Detection Using UAV Images Based on Multi-Level Attention Mechanism. Autom. Constr. 2022, 144, 104613. [Google Scholar] [CrossRef]
  14. Wang, J.; Gao, X.; Liu, Z.; Wan, Y. GSC-YOLOv5: An Algorithm Based on Improved Attention Mechanism for Road Creak Detection. In Proceedings of the 2023 IEEE 12th Data Driven Control and Learning Systems Conference (DDCLS), Xiangtan, China, 12–14 May 2023; IEEE: Washington, DC, USA, 2023; pp. 1664–1671. [Google Scholar]
  15. Ni, Y.; Mao, J.; Fu, Y.; Wang, H.; Zong, H.; Luo, K. Damage Detection and Localization of Bridge Deck Pavement Based on Deep Learning. Sensors 2023, 23, 5138. [Google Scholar] [CrossRef] [PubMed]
  16. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 21–26 July 2016; pp. 779–788. [Google Scholar]
  17. Xiao, B.; Nguyen, M.; Yan, W.Q. Fruit Ripeness Identification Using YOLOv8 Model. Multimed. Tools Appl. 2024, 83, 28039–28056. [Google Scholar] [CrossRef]
  18. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  19. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  20. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  21. Hussain, M. YOLO-v1 to YOLO-v8, the Rise of YOLO and Its Complementary Nature toward Digital Manufacturing and Industrial Defect Detection. Machines 2023, 11, 677. [Google Scholar] [CrossRef]
  22. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  23. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar]
  24. Gao, G.-S. Survey on Attention Mechanisms in Deep Learning Recommendation Models. Comput. Eng. Appl. 2022, 58, 9–18. [Google Scholar]
  25. Zhang, Y.; Shen, S.; Xu, S. Strip Steel Surface Defect Detection Based on Lightweight YOLOv5. Front. Neurorobot. 2023, 17, 1263739. [Google Scholar] [CrossRef] [PubMed]
  26. Parsania, P.-S.; Virparia, P.-V. A Comparative Analysis of Image Interpolation Algorithms. Int. J. Adv. Res. Comput. Commun. Eng. 2016, 5, 29–34. [Google Scholar] [CrossRef]
  27. Patel, V.; Mistree, K. A Review on Different Image Interpolation Techniques for Image Enhancement. Int. J. Emerg. Technol. Adv. Eng. 2013, 3, 129–133. [Google Scholar]
  28. Arya, D.; Maeda, H.; Ghosh, S.K.; Toshniwal, D.; Sekimoto, Y. RDD2022: A Multi-National Image Dataset for Automatic Road Damage Detection. arXiv 2022, arXiv:2209.08538. [Google Scholar]
  29. Arya, D.; Maeda, H.; Ghosh, S.K.; Toshniwal, D.; Omata, H.; Kashiyama, T.; Sekimoto, Y. Crowdsensing-Based Road Damage Detection Challenge (CRDDC’2022). In Proceedings of the 2022 IEEE International Conference on Big Data (Big Data), Osaka, Japan, 17–20 December 2022; IEEE: Washington, DC, USA, 2022; pp. 6378–6386. [Google Scholar]
  30. Everingham, M.; Eslami, S.M.A.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes Challenge: A Retrospective. Int. J. Comput. Vis. 2015, 111, 98–136. [Google Scholar] [CrossRef]
  31. Arya, D.; Maeda, H.; Ghosh, S.K.; Toshniwal, D.; Mraz, A.; Kashiyama, T.; Sekimoto, Y. Transfer Learning-Based Road Damage Detection for Multiple Countries. arXiv 2020, arXiv:2008.13101. [Google Scholar]
  32. Arya, D.; Maeda, H.; Kumar Ghosh, S.; Toshniwal, D.; Omata, H.; Kashiyama, T.; Sekimoto, Y. Global Road Damage Detection: State-of-the-Art Solutions. In Proceedings of the 2020 IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 10–13 December 2020; pp. 5533–5539. [Google Scholar]
  33. Hegde, V.; Trivedi, D.; Alfarrarjeh, A.; Deepak, A.; Ho Kim, S.; Shahabi, C. Yet Another Deep Learning Approach for Road Damage Detection Using Ensemble Learning. In Proceedings of the 2020 IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 10–13 December 2020; pp. 5553–5558. [Google Scholar]
  34. Arya, D.; Maeda, H.; Ghosh, S.K.; Toshniwal, D.; Sekimoto, Y. RDD2020: An Annotated Image Dataset for Automatic Road Damage Detection Using Deep Learning. Data Brief 2021, 36, 107133. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
