1. Introduction
Pavement distresses have become increasingly frequent and severe, imposing substantial burdens on maintenance operations and heightening traffic risks. As the most prevalent type of road surface distress, cracks significantly influence load-bearing capacity, durability, driving speed, fuel consumption, driving safety, and ride comfort. Furthermore, crack propagation accelerates pavement deterioration and increases maintenance and repair costs. Therefore, the prompt and accurate acquisition of information on road surface distress has become a critical issue in road maintenance.
With its advantages of high efficiency and reliability, pavement crack detection based on machine vision [1] has received extensive attention from both the academic and engineering communities. Its technical routes fall mainly into two categories. The first is detection based on traditional digital image processing [2]. The collected pavement images are first preprocessed (gray-scale conversion, denoising, binarization, image enhancement, etc.), and representative features of the distress target are then extracted by combining computer vision and image processing techniques. Chou et al. [3] used fuzzy techniques to eliminate the noise caused by illumination variations and achieved reliable detection accuracy. To improve the accuracy of pavement crack recognition, Jia et al. [4] used an improved CV model to preprocess the segmented images, overcoming environmental interference and yielding clearer camera images. Li et al. [5] improved the traditional edge detection operator, filtering algorithm, and image processing algorithm to obtain continuous and clear crack edge features, and applied the improved algorithm to various crack types with good results. However, because road surface distresses are diverse and complex, such feature extraction is not universally applicable. Furthermore, the entire procedure requires manual processing of road surface image data, which is time-consuming and prone to reduced accuracy.
The second is pavement crack detection based on deep learning [6]. Compared with traditional methods, these approaches reduce manual intervention and improve the accuracy and robustness of crack detection. Given the special morphological characteristics of cracks, scholars mainly adopt pixel-level semantic segmentation frameworks [7] to achieve fine-grained delineation. Cha et al. [8] partitioned road images into 256 × 256-pixel local samples to form a training set and trained a convolutional neural network on it. The trained network was then tested on images outside the training set and achieved 98% accuracy on the test set. Tong et al. [9] randomly divided gray-scale pavement images into a training set and a test set, designed a DCNN architecture, and successfully applied deep convolutional neural networks to automatic pavement crack detection. He et al. [10] proposed the ResNet structure with cross-layer connections, which alleviates the vanishing gradient problem in deep networks. Sha et al. [11] combined two CNNs to extract the characteristics of pavement distresses, achieving the identification and measurement of pavement cracks. With the development of deep learning, object detection algorithms [12] have also been increasingly introduced into road engineering. Object detection uses deep neural networks to generate bounding boxes automatically, determining both the position and the category of each object. Sun et al. [13] proposed a pavement potting crack detection method based on an improved Faster R-CNN [14]. This algorithm fuses the feature extraction layers of multiple networks with Faster R-CNN to improve detection accuracy, and it further adjusts the aspect ratios of the candidate boxes, which improves both detection and localization. Zhang et al. [15] introduced Multi-level Attention Blocks (MLAB) into YOLOv3 and trained the model on road surface images captured by unmanned aerial vehicles, achieving satisfactory results. To adapt detection to the characteristics of road surface defects, He et al. [16] proposed an improved YOLOv5 detection model, Pavement Damage–YOLO (PD-YOLO), which enhances feature extraction and multi-scale feature fusion. Based on the YOLOv8 model, Hou et al. [17] introduced the CPCA attention mechanism and replaced the neck network with a weighted BiFPN. This not only made the model more lightweight but also improved its accuracy and Recall for pavement crack detection.
Although the aforementioned research has achieved significant advancements, challenges remain in achieving robust and precise detection of highly diverse and irregular pavement cracks in real-world scenarios. First, while individual techniques like attention mechanisms, advanced pooling modules, and improved loss functions have been explored separately in pavement crack detection, there is a lack of systematic investigation into their synergistic integration and optimization within a unified framework like YOLOv8s. The interaction effects and potential trade-offs among these modules for this specific task are not well understood. Second, many studies focus on algorithmic improvements but pay less attention to practical engineering constraints, such as the balance between accuracy and computational efficiency for potential deployment on mobile platforms like UAVs. Third, the performance evaluation often lacks comprehensive efficiency metrics and failure analysis under challenging conditions. Therefore, this study focuses on pavement cracks as the primary target and utilizes pavement images as the research subject. It proposes an intelligent detection method for asphalt pavement cracks based on an enhanced version of YOLOv8s. The main contributions are: (1) A systematically improved YOLOv8s model that integrates SPPFCSPC for multi-scale feature enhancement, CBAM for targeted feature refinement, and EIoU loss for precise bounding box regression, specifically optimized for pavement crack characteristics. (2) A thorough experimental analysis including ablation studies, comparisons with state-of-the-art models, computational complexity evaluation, and visual failure case analysis. (3) The construction of a balanced Crack_Dataset and the development of a user-friendly detection system based on PyQt5, bridging the gap between algorithm research and practical application.
2. The Improved Structural Design of the YOLOv8s Algorithm
As shown in Figure 1, the YOLOv8 algorithm is an advanced iteration within the YOLO series of object detection methods. It is composed of four components: the input, the backbone, the neck, and the head. YOLOv8 introduces a range of new features and enhancements over its predecessors, resulting in state-of-the-art accuracy and detection speed. Nevertheless, its detection accuracy and Recall for road crack detection still require improvement. In this study, we adopt the small-scale (S-scale) YOLOv8 model and tailor it to the specific characteristics of cracks to enhance the performance of the YOLOv8s model.
2.1. SPPFCSPC Module
Given the complexity of feature extraction and the substantial background noise present in the dataset, the original SPPF module in the YOLOv8s backbone was replaced with the SPPFCSPC module. As shown in Figure 2, the SPPFCSPC combines the structural advantages of SPPF and SPPCSPC by splitting the input into two branches, each performing independent convolution operations. The lower branch preserves features associated with fine cracks and small targets. Finally, the original features are fused with the features processed at the various stages to achieve multi-scale feature integration, thereby improving representational capacity.
The SPPFCSPC module stacks three max pooling operations with 5 × 5 kernels in sequence and fuses the outputs of the three pooling stages in a feature-pyramid fashion. This enlarges the receptive field, improves the extraction of multi-scale crack features, and strengthens the detection of targets of varying scales.
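The receptive-field effect of the stacked poolings can be checked numerically. The following is a minimal pure-Python sketch, not the authors' implementation: it assumes stride 1 and "same" padding (window clipped at the border), and the helper `max_pool2d` and the random 12 × 12 map are illustrative. Under these assumptions, two and three stacked 5 × 5 max poolings give the same result as single 9 × 9 and 13 × 13 poolings.

```python
import random


def max_pool2d(x, k):
    """Stride-1 max pooling with 'same' padding on a 2-D list.

    The window is clipped at the border, which is equivalent to
    padding with minus infinity.
    """
    h, w, r = len(x), len(x[0]), k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            row.append(max(
                x[a][b]
                for a in range(max(0, i - r), min(h, i + r + 1))
                for b in range(max(0, j - r), min(w, j + r + 1))
            ))
        out.append(row)
    return out


random.seed(0)
feat = [[random.random() for _ in range(12)] for _ in range(12)]  # toy feature map

p1 = max_pool2d(feat, 5)  # receptive field 5 x 5
p2 = max_pool2d(p1, 5)    # receptive field 9 x 9
p3 = max_pool2d(p2, 5)    # receptive field 13 x 13

print(p2 == max_pool2d(feat, 9), p3 == max_pool2d(feat, 13))
```

Stacking small kernels in this way reaches the larger receptive fields while reusing the intermediate results, which is why the three outputs can be fused as a feature pyramid at little extra cost.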
2.2. Attention Mechanism
The human visual system demonstrates selective perceptual capabilities during information processing. By optimizing the allocation of attentional resources, it prioritizes the processing of essential information [18]. This cognitive mechanism enhances the efficiency and accuracy of information processing. Drawing inspiration from this biological principle, the attention mechanism in deep learning dynamically adjusts feature weights to strengthen the representation of key features while suppressing non-key features. The CBAM attention module [19] was incorporated into the neck network of YOLOv8s to enhance the model’s ability to represent features of pavement cracks.
The CBAM consists of a channel attention module (CAM) and a spatial attention module (SAM), which are arranged in series. It was selected over other attention mechanisms (e.g., SE, ECA) for two main reasons. First, its combined design (channel + spatial) is particularly suited for capturing the long, thin, and irregular shapes of pavement cracks, which require both channel-wise feature recalibration and spatial region emphasis. Second, CBAM introduces minimal computational overhead compared to more complex attention modules, making it a practical choice for maintaining a favorable speed-accuracy trade-off, which is crucial for real-time applications. The specific structure of this module is shown in Figure 3. For the input feature map $F \in \mathbb{R}^{C \times H \times W}$, the calculation process of this module is represented by Equation (1):
$$F' = M_c(F) \otimes F, \qquad F'' = M_s(F') \otimes F' \tag{1}$$
where $M_c$ represents the channel attention weights; $M_s$ represents the spatial attention weights; $F'$ is the channel-refined feature map and $F''$ is the final output; and ⊗ represents element-by-element multiplication.
In the channel attention module, global average pooling and global max pooling are applied in parallel over the spatial dimensions of each feature channel. The two pooled descriptors are each passed through the fully connected layer and then summed, and the channel attention weights are obtained with the Sigmoid activation function. In the spatial attention module, average pooling and max pooling are applied in parallel along the channel dimension. The two resulting maps are concatenated and processed by a convolution followed by a Sigmoid activation function to obtain the spatial attention weights.
Within the CBAM framework, the channel attention module evaluates the importance of feature channels, enhancing those corresponding to crack texture and edges while suppressing irrelevant information. The spatial attention module emphasizes the spatial features, including the shape and position of cracks. It mitigates the influence of the road surface background and enhances focus on the target of road surface cracks. Both the fully connected layers in the channel attention module and the convolutional layers in the spatial attention module introduce a limited number of additional parameters, resulting in only a modest increase in computational cost. In this study, the CBAM is integrated after the C2f module of the neck network and prior to feature fusion. This arrangement ensures that high-resolution features obtained from upsampling are initially processed by the attention mechanism, followed by feature fusion and other operations before being forwarded to the model’s detection head. By optimizing dual-dimensional features in both channel and spatial dimensions, we enhance the detection accuracy of pavement cracks.
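The channel-then-spatial computation described in this section can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the function names, the random weights `W1`, `W2`, `K`, the reduction ratio of 2, and the 3 × 3 spatial kernel are all illustrative choices (the kernel size used in the model is not given in this section), and biases are omitted.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def channel_attention(F, W1, W2):
    """One weight per channel: sigmoid(MLP(avg_pool) + MLP(max_pool))."""
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in F]
    mx = [max(map(max, ch)) for ch in F]

    def mlp(v):  # two fully connected layers shared by both descriptors
        hidden = [max(0.0, sum(w * x for w, x in zip(row, v))) for row in W1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in W2]

    return [sigmoid(a + m) for a, m in zip(mlp(avg), mlp(mx))]


def spatial_attention(F, K):
    """H x W weight map: sigmoid(conv([mean over C; max over C]))."""
    C, H, W = len(F), len(F[0]), len(F[0][0])
    pooled = [
        [[sum(F[c][i][j] for c in range(C)) / C for j in range(W)] for i in range(H)],
        [[max(F[c][i][j] for c in range(C)) for j in range(W)] for i in range(H)],
    ]
    k = len(K[0])
    r = k // 2
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            s = 0.0
            for m in range(2):
                for di in range(k):
                    for dj in range(k):
                        a, b = i + di - r, j + dj - r
                        if 0 <= a < H and 0 <= b < W:  # zero padding
                            s += K[m][di][dj] * pooled[m][a][b]
            out[i][j] = sigmoid(s)
    return out


def cbam(F, W1, W2, K):
    """Apply channel attention first, then spatial attention (in series)."""
    Mc = channel_attention(F, W1, W2)
    F1 = [[[Mc[c] * v for v in row] for row in F[c]] for c in range(len(F))]
    Ms = spatial_attention(F1, K)
    return [[[Ms[i][j] * v for j, v in enumerate(row)] for i, row in enumerate(ch)]
            for ch in F1]


random.seed(0)
C, H, W, r, k = 4, 6, 6, 2, 3  # toy sizes; r is the channel reduction ratio
F = [[[random.uniform(-1, 1) for _ in range(W)] for _ in range(H)] for _ in range(C)]
W1 = [[random.uniform(-0.5, 0.5) for _ in range(C)] for _ in range(C // r)]
W2 = [[random.uniform(-0.5, 0.5) for _ in range(C // r)] for _ in range(C)]
K = [[[random.uniform(-0.5, 0.5) for _ in range(k)] for _ in range(k)] for _ in range(2)]

out = cbam(F, W1, W2, K)
print(len(out), len(out[0]), len(out[0][0]))
```

Because both attention maps pass through a Sigmoid, every weight lies between 0 and 1, so the module only rescales features and leaves the shape of the feature map unchanged. The added parameters are limited to the two small fully connected layers and one convolution kernel.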
2.3. The Improved Loss Function
In object detection, Intersection over Union (IoU) measures the overlap between a predicted bounding box and a ground-truth box and is used to evaluate localization accuracy. A threshold of 0.5 is typically applied both for evaluation and for non-maximum suppression (NMS) during post-processing. In NMS, the purpose of the IoU threshold is to eliminate redundant prediction boxes; a smaller threshold removes duplicate boxes more aggressively. The schematic diagram of IoU is shown in Figure 4. The calculation formula is as follows:
$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}$$
where $A$ and $B$ denote the predicted box and the ground-truth box, respectively.
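To make the role of the threshold concrete, the following minimal sketch computes IoU as the intersection area divided by the union area of two axis-aligned boxes and then applies a greedy NMS. It is an illustration only: the box format `(x1, y1, x2, y2)`, the function names, and the example boxes and scores are assumptions, not values from this study.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS: keep a box only if it overlaps no kept box by more than iou_thr."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping detections of one crack plus one separate detection.
boxes = [(10, 10, 110, 60), (20, 10, 120, 60), (200, 40, 260, 200)]
scores = [0.90, 0.80, 0.75]

print(nms(boxes, scores, iou_thr=0.5))  # duplicate of box 0 is suppressed
print(nms(boxes, scores, iou_thr=0.9))  # looser threshold keeps all three
```

The first two boxes overlap with an IoU of about 0.82, so the lower threshold removes the lower-scoring duplicate while the higher threshold retains it.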
In YOLOv8, both the matching of positive and negative samples and the calculation of loss values employ Complete Intersection over Union (CIoU) to quantify the degree of overlap between the predicted bounding box and the ground-truth box. CIoU adds a penalty term for aspect ratio to Distance Intersection over Union (DIoU), while also considering both the deviation in center point positions and the length of the diagonal. Consequently, it provides a more accurate measurement of similarity between two bounding boxes. The calculation formula is as follows:
$$\mathrm{CIoU} = \mathrm{IoU} - \frac{\rho^2(b, b^{gt})}{c^2} - \alpha v$$
$$\alpha = \frac{v}{(1 - \mathrm{IoU}) + v}, \qquad v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$$
where $b$ and $b^{gt}$ respectively represent the center point coordinates of the predicted box and the ground-truth box; $\rho^2(b, b^{gt})$ represents the squared Euclidean distance between the two center points; $c$ represents the diagonal length of the minimum bounding rectangle of the two boxes; $\alpha$ is the weight function; $v$ measures the similarity of aspect ratio; $w$ and $w^{gt}$ respectively represent the widths of the predicted box and the ground-truth box; and $h$ and $h^{gt}$ respectively represent the heights of the predicted box and the ground-truth box. The final CIoU loss is defined as:
$$L_{CIoU} = 1 - \mathrm{IoU} + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v$$
However, $v$ in the CIoU formula reflects the difference in aspect ratio between the predicted box and the ground-truth box, rather than the actual differences in width and height between the two boxes. The model therefore learns the size characteristics of distresses only indirectly, through IoU and the aspect ratio, which reduces learning efficiency and somewhat impedes optimization. To address this issue, the EIoU loss is introduced to establish an absolute size matching mechanism between the predicted box and the ground-truth box. In our improved model, only the bounding box regression loss is replaced by EIoU loss; the classification loss and the distribution focal loss (DFL) remain unchanged from the original YOLOv8s implementation. This allows us to directly evaluate the impact of EIoU on localization accuracy. EIoU replaces the aspect-ratio penalty in CIoU with separate penalties on the width and height differences between the predicted box and the ground-truth box, so the size of the predicted box is aligned more directly with that of the actual box. The gradient direction is more clearly defined, enabling the model to learn the shape features of pavement cracks more effectively. The resulting localization loss incorporates three constraints: overlap loss, center point distance loss, and width and height loss. The first two terms follow CIoU, while the third improves the model’s learning of crack size features and accelerates convergence. Compared with CIoU, EIoU does not require inverse trigonometric computations, resulting in faster calculation and more efficient model training. The EIoU loss formula is as follows:
$$L_{EIoU} = 1 - \mathrm{IoU} + \frac{\rho^2(b, b^{gt})}{c^2} + \frac{\rho^2(w, w^{gt})}{w_c^2} + \frac{\rho^2(h, h^{gt})}{h_c^2}$$
where $w_c$ and $h_c$ are the width and height of the minimum bounding rectangle enclosing the ground-truth box and the predicted box, so that $c^2 = w_c^2 + h_c^2$.
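The difference between the two losses can be illustrated with a small worked example. The sketch below is a plain-Python rendering of the standard CIoU and EIoU definitions for a single pair of boxes, not the training code used in this study; the box format `(x1, y1, x2, y2)`, the helper `_geometry`, and the example boxes are assumptions, and both boxes are assumed to have positive width and height.

```python
import math


def _geometry(p, g):
    """Shared terms for a predicted box p and a ground-truth box g."""
    pw, ph = p[2] - p[0], p[3] - p[1]
    gw, gh = g[2] - g[0], g[3] - g[1]
    iw = max(0.0, min(p[2], g[2]) - max(p[0], g[0]))
    ih = max(0.0, min(p[3], g[3]) - max(p[1], g[1]))
    inter = iw * ih
    iou = inter / (pw * ph + gw * gh - inter)
    # Width and height of the minimum rectangle enclosing both boxes.
    wc = max(p[2], g[2]) - min(p[0], g[0])
    hc = max(p[3], g[3]) - min(p[1], g[1])
    # Squared distance between the two box centers.
    rho2 = ((p[0] + p[2] - g[0] - g[2]) ** 2 + (p[1] + p[3] - g[1] - g[3]) ** 2) / 4
    return iou, rho2, wc, hc, pw, ph, gw, gh


def ciou_loss(p, g, eps=1e-9):
    iou, rho2, wc, hc, pw, ph, gw, gh = _geometry(p, g)
    v = 4 / math.pi ** 2 * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    alpha = v / (1 - iou + v + eps)  # eps avoids 0/0 for identical boxes
    return 1 - iou + rho2 / (wc ** 2 + hc ** 2) + alpha * v


def eiou_loss(p, g):
    iou, rho2, wc, hc, pw, ph, gw, gh = _geometry(p, g)
    return (1 - iou + rho2 / (wc ** 2 + hc ** 2)
            + (pw - gw) ** 2 / wc ** 2 + (ph - gh) ** 2 / hc ** 2)


# Same aspect ratio (1:1) but different size: the CIoU aspect term vanishes.
pred, gt = (0, 0, 2, 2), (0, 0, 4, 4)
print(ciou_loss(pred, gt), eiou_loss(pred, gt))
```

In this example the two boxes share an aspect ratio, so $v = 0$ and CIoU contributes no size penalty beyond the overlap and center terms, whereas EIoU still penalizes the width and height gaps directly.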
2.4. The Improved Structure of YOLOv8s
Considering the specific characteristics of pavement crack detection targets [20], three modifications were applied to the YOLOv8s architecture: the SPPF module in the backbone was replaced with SPPFCSPC, the CBAM was added to the neck, and the CIoU loss in the head was substituted with EIoU loss. The resulting improved network structure is illustrated in Figure 5.
3. Data Training and Parameter Setting
3.1. Data Acquisition and Preprocessing
Choosing an appropriate dataset is crucial for the development of a pavement distress detection system. A representative dataset containing sufficient samples of pavement distress is essential for reliable model training and evaluation, and it improves adaptability and accuracy under varying road conditions. To enhance the robustness of the model, this study uses training data sourced from the public Road Damage Dataset 2022 (RDD2022). This dataset primarily encompasses three types of pavement distress: longitudinal cracks, transverse cracks, and mesh cracks. The dataset also contains annotations for blurred white lines and indistinct pedestrian crossings; however, this study focuses exclusively on the three crack types.
The dataset was assembled from the China_MotorBike and United_States subsets of RDD2022, together with self-collected images captured by an unmanned aerial vehicle. The self-collected images of Kunlun Mountain Road were used only in the validation set. After data annotation, data cleaning, and data augmentation, the resulting dataset was named Crack_Dataset. It comprises a total of 4385 images, divided into a training set and a validation set at an approximate ratio of 8:2.
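One way to realize such a split is sketched below. This is a hypothetical illustration rather than the procedure actually used: it assumes that the total of 4385 images includes the 57 self-collected images, that all self-collected images are forced into the validation set, and that the remaining validation slots are filled by a seeded random draw from the public images. The file names and the function `split_dataset` are placeholders.

```python
import random


def split_dataset(public_images, uav_images, val_ratio=0.2, seed=0):
    """Split into train/val with every UAV image placed in the validation set."""
    total = len(public_images) + len(uav_images)
    n_val = round(total * val_ratio)
    pool = list(public_images)
    random.Random(seed).shuffle(pool)  # seeded shuffle for reproducibility
    n_public_val = n_val - len(uav_images)
    val = list(uav_images) + pool[:n_public_val]
    train = pool[n_public_val:]
    return train, val


# Placeholder file names; the counts follow the assumption stated above.
uav = [f"uav_{i:03d}.jpg" for i in range(57)]
public = [f"rdd_{i:05d}.jpg" for i in range(4385 - 57)]

train_set, val_set = split_dataset(public, uav)
print(len(train_set), len(val_set))
```

Keeping the self-collected images out of training means the validation score also reflects how well the model transfers to imagery from a source it has not seen.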
3.1.1. A Method for Collecting Pavement Disease Data Based on Unmanned Aerial Vehicles
As shown in Figure 6, a DJI Mavic 3E unmanned aerial vehicle (DJI, Shenzhen, China) was employed to capture road surface images of specific sections along Kunlun Mountain Road in the Huangdao District of Qingdao City. The technical parameters of this unmanned aerial vehicle are presented in Table 1. A total of 57 images were collected and incorporated into the validation set. An example of the collected road surface images is shown in Figure 7.
3.1.2. Data Annotation
This investigation primarily focuses on three types of pavement distress: longitudinal cracks, transverse cracks, and mesh cracks. The location, size, category, and other relevant information of each crack are labeled in the pavement images collected from Kunlun Mountain Road using the unmanned aerial vehicle. For this purpose, LabelImg (Version 1.8.6) is employed to annotate the pavement crack targets. LabelImg is an open-source data annotation tool that supports three labeling formats: PascalVOC, YOLO, and CreateML. The interface of LabelImg is shown in Figure 8.
3.1.3. Data Cleaning
Data cleaning operations represent a critical phase in the creation of datasets. In this study, such processes are primarily applied to image data derived from the RDD2022 dataset. The objective is to rectify or eliminate any errors, as well as address missing or duplicate annotations that may be present within the dataset, thereby ensuring the integrity and quality of the data utilized for model training.
The RDD2022 public dataset is released with annotations in VOC format. In this study, a Python program was written to convert the VOC annotations into the YOLO annotation format, which is stored as plain txt files. During initial model training, it was found that some distress instances in the RDD2022 images had not been labeled. Missing labels can reduce training efficacy and compromise prediction accuracy. To address this issue, the LabelImg tool was used for secondary annotation to supply the missing labels and improve the Crack_Dataset.
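The format conversion can be sketched as follows. This is an illustrative version, not the program written for this study: the function `voc_to_yolo`, the sample XML, and the class mapping are assumptions. The mapping assumes the RDD label codes D00, D10, and D20 for longitudinal, transverse, and mesh (alligator) cracks, and any other label is skipped.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from RDD2022 label codes to YOLO class indices.
CLASS_IDS = {"D00": 0, "D10": 1, "D20": 2}


def voc_to_yolo(xml_text, class_ids=CLASS_IDS):
    """Convert one VOC annotation (XML string) to a list of YOLO label lines."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    img_w = float(size.find("width").text)
    img_h = float(size.find("height").text)
    lines = []
    for obj in root.iter("object"):
        name = obj.find("name").text.strip()
        if name not in class_ids:  # skip categories outside this study
            continue
        box = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (
            float(box.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        # YOLO stores the normalized box center and size.
        xc = (xmin + xmax) / 2 / img_w
        yc = (ymin + ymax) / 2 / img_h
        w = (xmax - xmin) / img_w
        h = (ymax - ymin) / img_h
        lines.append(f"{class_ids[name]} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}")
    return lines


sample = """
<annotation>
  <size><width>600</width><height>600</height><depth>3</depth></size>
  <object><name>D00</name>
    <bndbox><xmin>100</xmin><ymin>200</ymin><xmax>300</xmax><ymax>400</ymax></bndbox>
  </object>
  <object><name>D40</name>
    <bndbox><xmin>10</xmin><ymin>10</ymin><xmax>50</xmax><ymax>50</ymax></bndbox>
  </object>
</annotation>
"""

print("\n".join(voc_to_yolo(sample)))
```

Each output line would be written to a txt file that shares the image's base name, which is the layout YOLO training tools expect.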
3.1.4. Data Augmentation
The training set of the Crack_Dataset contains three types of pavement cracks. The data distribution of the training set is illustrated in Figure 9a. Longitudinal crack samples constitute 60.15% of the training set, while transverse crack samples account for 28.11%, and mesh crack samples represent only 9.03%. The proportion of longitudinal cracks exceeds 50%, indicating that this category has a significantly higher sample count than the other types, particularly mesh cracks. This imbalanced distribution may lead to an overemphasis on longitudinal cracks during model training, which could adversely affect the identification of transverse cracks. Furthermore, it increases the likelihood of overlooking important yet rare defect types such as mesh cracks, ultimately resulting in suboptimal detection performance.
Based on this, the present investigation applies data augmentation primarily to images containing transverse and mesh cracks to increase their sample size and proportion, while preserving all original longitudinal crack samples to maintain the real-world distribution. As depicted in Figure 10, the augmentation methods include horizontal flipping, vertical flipping, random brightness adjustment, and the addition of Gaussian and pepper noise. As depicted in Figure 9b, after data augmentation, longitudinal crack samples constitute 37.22%, transverse crack samples account for 27.66%, and mesh crack samples represent 26.19% of the training set. Notably, the proportion and sample size of mesh cracks have increased approximately threefold.
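The augmentation operations listed above can be sketched in a few lines. The code below is a simplified illustration, not the pipeline used in this study: it assumes a gray-scale image stored as a nested list of 0 to 255 values and YOLO-format labels `(class, x_center, y_center, width, height)` with normalized coordinates; the function names and parameter values are placeholders. Note that flips must also update the box centers, while brightness and noise leave the labels unchanged.

```python
import random


def hflip(image, labels):
    """Mirror left-right; the normalized x center becomes 1 - x."""
    return [row[::-1] for row in image], [(c, 1.0 - x, y, w, h) for c, x, y, w, h in labels]


def vflip(image, labels):
    """Mirror top-bottom; the normalized y center becomes 1 - y."""
    return image[::-1], [(c, x, 1.0 - y, w, h) for c, x, y, w, h in labels]


def adjust_brightness(image, factor):
    """Scale pixel intensities and clamp to the valid range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to every pixel."""
    return [[min(255, max(0, round(p + rng.gauss(0, sigma)))) for p in row] for row in image]


def add_pepper_noise(image, prob, rng):
    """Set each pixel to black with probability prob."""
    return [[0 if rng.random() < prob else p for p in row] for row in image]


rng = random.Random(0)
img = [[10, 20, 30], [40, 50, 60]]
labels = [(2, 0.25, 0.75, 0.1, 0.2)]  # one mesh-crack box (class index assumed)

flipped_img, flipped_labels = hflip(img, labels)
brighter = adjust_brightness(img, rng.uniform(0.7, 1.3))
noisy = add_pepper_noise(add_gaussian_noise(img, 5.0, rng), 0.02, rng)
print(flipped_img, flipped_labels)
```

Applying these transforms only to images that contain the under-represented classes raises their share of the training set without discarding any original samples.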
3.2. Experimental Environment and Parameter Setting
3.2.1. Experimental Environment Setup
To ensure the reliability of the experimental results, all experiments were conducted in a fixed software and hardware environment. The hardware was chosen to meet the training and inference requirements of the YOLOv8s model, and the operating system, programming language, and deep learning framework were selected for their suitability for deep learning tasks. The detailed configuration is shown in Table 2.
3.2.2. Parameter Setting
To ensure the rigor of the experiments and the comparability of the training results, the hyperparameter settings for all models are kept consistent, except for the number of training epochs, as detailed in Table 3. The batch_size was set to 8 due to GPU memory constraints. To mitigate the training instability associated with small batch sizes, we employed the cosine annealing learning rate scheduler. All training hyperparameters were determined by considering both model performance and the training environment. The number of training epochs was increased where necessary so that every model trained to convergence.
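For reference, the general form of a cosine annealing schedule is sketched below. The function name and the numerical values (initial rate 0.01, final rate 0.0001, 200 epochs) are placeholders for illustration and are not the settings in Table 3; the exact variant applied by the training framework may also differ in detail.

```python
import math


def cosine_annealing_lr(epoch, total_epochs, lr_max, lr_min):
    """Learning rate that decays from lr_max to lr_min along a half cosine."""
    cos_term = 1 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term


# Placeholder values, not the hyperparameters used in this study.
LR_MAX, LR_MIN, EPOCHS = 0.01, 0.0001, 200

schedule = [cosine_annealing_lr(e, EPOCHS, LR_MAX, LR_MIN) for e in range(EPOCHS + 1)]
print(schedule[0], schedule[EPOCHS // 2], schedule[-1])
```

The rate changes slowly at the start and end of training and fastest in the middle, which gives noisy small-batch gradients progressively less influence as training approaches convergence.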
3.3. Design of Road Crack Detection and Identification System
An interactive interface for the automatic pavement crack detection system was developed using PyQt5 to enable user-friendly operation of the Python-based improved YOLOv8s model. This interface connects the detection model to the user and supports efficient defect detection.
The interface accepts images, videos, and real-time camera feeds as input and runs detection on each of them. Recognition results can be stored for subsequent data analysis, offering an effective solution for the automated identification of pavement cracks.
5. Conclusions
This investigation aims to enhance the efficiency of intelligent detection of pavement defects and proposes an advanced method for detecting asphalt pavement cracks based on an improved YOLOv8s framework. The main conclusions are as follows:
(1) To further enhance the detection accuracy of the model, this study replaces the SPPF module in the YOLOv8s backbone network with the SPPFCSPC module. This modification improves both complex feature extraction and multi-scale feature fusion. Additionally, a CBAM attention mechanism is integrated into the neck network of YOLOv8s to strengthen the model’s focus on critical features. Furthermore, the CIoU loss in the head network is substituted with EIoU loss to improve localization accuracy. Together, these changes form the improved YOLOv8s pavement crack detection model proposed in this study.
(2) The dataset used in the experiments combines self-collected road images with selected images from the public RDD2022 dataset, providing high-quality training data that enhances the model’s generalization ability. Reasonable evaluation metrics were also adopted.
(3) The detection performance of the proposed improved YOLOv8s model was verified and quantified through ablation studies. The findings indicate that the improved YOLOv8s outperforms both the baseline models of the YOLO series and other mainstream object detection algorithms in Precision, Recall, and mAP@0.5.