An Improved YOLOv8 with TransXNet Backbone for Pavement Crack Detection

Du, Zitao; Yin, Yuna; Yang, Wenbo

doi:10.3390/app16083982



by Zitao Du *, Yuna Yin and Wenbo Yang

School of Civil and Transportation Engineering, Hebei University of Technology, Tianjin 300401, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(8), 3982; https://doi.org/10.3390/app16083982

Submission received: 20 March 2026 / Revised: 15 April 2026 / Accepted: 17 April 2026 / Published: 20 April 2026

(This article belongs to the Special Issue Defect Evaluation and Nondestructive Testing)


Abstract

To improve the accuracy and efficiency of road crack detection, this paper proposes an enhanced model based on You Only Look Once version 8 (YOLOv8). The Cross Stage Partial DarkNet (CSPDarkNet) backbone network is replaced with TransXNet, which integrates a Transformer-based self-attention mechanism with convolutional operations to better capture local features. By embedding the attention mechanism into its core module, namely the dual dynamic token mixer (D-Mixer), the TransXNet architecture is further optimized, thereby enabling coordinated attention-based feature selection and enhancement across both global and local dimensions. In addition, the Gaussian Error Linear Unit (GELU) is employed to enhance nonlinear representation capability and training stability. Experiments were conducted on the public Road Damage Dataset 2022 (RDD2022), which contains approximately 47,000 road images. The results demonstrate that, compared with the baseline model, the precision, recall, and mean Average Precision (mAP) of the proposed method are improved by 9.6, 11.7, and 8.2 percentage points, respectively; moreover, the detection accuracy and computational efficiency also outperform those of other methods.

Keywords: pavement cracks; object detection; YOLOv8; backbone network; attention mechanism

1. Introduction

Road inspection and maintenance management are of paramount importance to transportation infrastructure, particularly with the rapid development of highway projects [1,2]. After a road is put into service, the pavement is affected by traffic loads, temperature, humidity, groundwater seepage, solar radiation, and other factors, leading to gradual deterioration of the pavement and a progressive reduction in the road’s load-bearing capacity. Pavement cracks are considered an important indicator of pavement condition and are often the earliest manifestation of most pavement distresses. The presence of cracks not only shortens the service life of roadways but also poses a threat to traffic safety. As an important form of pavement deterioration, road cracks affect the structural integrity and performance of roads and may be accompanied by severe structural damage if not repaired in a timely manner. Therefore, crack detection and repair are crucial for ensuring road safety and maintaining road performance [3]. Recent studies have demonstrated that the type, length, and spatial distribution of cracks are closely associated with road maintenance costs and traffic safety, and that early detection and timely repair can significantly reduce maintenance expenses and accident risk [4,5,6].

Early digital image processing approaches [7] employ computer vision and image processing techniques to analyze pavement images and extract useful crack features for detection and localization. These methods typically include image filtering [8], edge detection [9], morphological processing [10], and feature extraction [11]. Although these methods can achieve superior detection performance compared with manual inspection, they are susceptible to interference from noise, shadows, humidity, and illumination variations in complex road conditions. This often results in a large number of redundant candidate regions, increased computational cost, reduced detection speed, and limited generalization and robustness [12,13].

With the rapid development of computer vision and deep learning technologies, pavement crack detection has achieved significant progress in recent years [14,15]. Among object detection frameworks, the You Only Look Once (YOLO) series has attracted considerable research attention due to its high detection accuracy and real-time performance. Through successive improvements and multiple iterations, YOLO models have been widely applied in pavement crack detection [16]. Lightweight networks, such as MobileNet and EfficientDet, as well as Transformer architectures, have also demonstrated potential in crack detection by reducing computational cost while maintaining accuracy and improving the detection of small-scale cracks [17,18,19]. Zhen et al. [20] improved YOLOv3 by incorporating the ResNet50vd backbone network for detecting hidden cracks in asphalt pavement. By incorporating a Bayesian search-based hyperparameter optimization method, the enhanced architecture improves the model’s detection capability. Liu et al. [21] proposed a YOLOv3 model incorporating four-scale detection layers, which effectively reduces the missed detection rate of small-scale cracks. Guo et al. [22] replaced the backbone of YOLOv5 with MobileNetV3, thereby reducing both the number of model parameters and computational complexity, and applied a K-means algorithm to optimize anchor boxes, improving the model’s adaptability to the dataset and enhancing detection accuracy. Ye et al. [23] proposed an improved YOLOv7-based network for detecting and classifying concrete cracks. By incorporating instance segmentation, the proposed method mitigates the limitations of traditional detection methods and achieves more accurate crack detection. The network integrates three customized modules to address issues such as missing feature information, small objects, and gradient propagation, thereby improving detection accuracy. 
Li [24] further optimized the YOLOv8n model by adopting Wise Intersection over Union, a bounding box loss function with a dynamic focusing mechanism, together with an EMA-based multi-scale attention module. Moreover, the G-Ghost bottleneck block was used to make the model more lightweight. The enhanced model was deployed on personal computers and embedded systems, where it achieved high detection accuracy and real-time detection speed, demonstrating strong practical application potential.

Despite these advancements, the original YOLOv8 model still exhibits limitations in fine-grained pavement crack detection, including insufficient global semantic modeling, loss of small-object features, and suboptimal suppression of background interference [25]. Recently, CNN–Transformer hybrid architectures have shown significant potential in crack detection [26], as they integrate the local feature extraction capability of CNNs with the long-range dependency modeling capability of Transformers, enabling more effective handling of complex crack patterns. Furthermore, approaches that combine multi-scale feature fusion and attention mechanisms have been demonstrated in several road detection studies to effectively enhance the detection of small and low-contrast cracks [27,28], thereby providing a theoretical foundation for the improvement strategies proposed in this study.

To address these issues, this paper proposes an improved YOLOv8-based pavement crack detection model. The proposed model replaces the CSPDarkNet backbone of YOLOv8 with the TransXNet visual backbone to combine the advantages of the Transformer self-attention mechanism and convolutional neural networks. It retains the ability of convolutional operations to capture local texture features while modeling long-range semantic dependencies through the self-attention mechanism, thereby achieving more stable detection of discontinuous and small-scale cracks. The Gaussian Error Linear Unit (GELU) is adopted as the activation function. Previous studies have demonstrated that GELU exhibits superior capability in detecting small targets and capturing smooth regions in segmentation tasks, while also showing enhanced robustness to noise and visual artifacts [29]. Moreover, within the Vision Transformer (ViT) architecture, GELU aligns well with the feature distribution characteristics of Transformer-based models [30], thereby further improving the model’s nonlinear representation capacity and training stability, while also alleviating the vanishing gradient problem in deep networks. Therefore, the improved model can more accurately extract feature information from low-contrast cracks and small-scale pavement defects, while suppressing background interference from complex road surfaces.

The main contributions of this study are summarized as follows:

To address the challenges of irregular crack morphology, scattered distribution, and complex backgrounds in pavement crack detection, the CSPDarkNet backbone in YOLOv8 is replaced with the TransXNet architecture, which integrates Transformer-based self-attention mechanisms with convolutional operations. This modification enhances the model’s global representation capability and alleviates the limitations of convolutional networks in modeling long-range dependencies, as well as the issue of small-object feature dilution during deep downsampling. To the best of our knowledge, this is the first study to apply TransXNet to pavement crack detection.

Based on the backbone modification, the original Sigmoid Linear Unit (SiLU) activation function is replaced with GELU to ensure architectural consistency. Experimental results demonstrate that this replacement further improves model performance, with a more pronounced improvement in recall than in precision, indicating that GELU is more effective in detecting challenging crack samples.

Ablation studies and five-fold cross-validation are conducted to evaluate the contributions of each improvement module. The results demonstrate that the proposed model significantly outperforms the YOLOv8n baseline in terms of precision, recall, and mean average precision (mAP).

2. YOLOv8 Object Detection Algorithm

YOLOv8, released by Ultralytics in 2023, formulates object detection as an end-to-end deep learning task and achieves fast and accurate detection performance [31]. It incorporates numerous architectural improvements over previous YOLO versions that enhance feature extraction capacity while keeping computational complexity comparable. In the backbone network, the CSP Bottleneck with three convolutions (C3) module used in YOLOv5 is replaced by the Cross Stage Partial Bottleneck with two convolutions (C2f) module, which is a more lightweight structure. The original C3 module combines the Cross Stage Partial Network with residual connections, propagating feature information through a branching structure [32].

The C2f module integrates the characteristics of the C3 module and the Efficient Layer Aggregation Network architecture by enhancing gradient flow, enabling better feature reuse without compromising the lightweight nature of the model. Furthermore, the decoupled head structure in YOLOv8 separates classification and localization branches, ensuring independent processing of classification and regression tasks without parameter sharing. This structure enhances task-specific learning efficiency and improves detection performance. Additionally, the model adopts the Task Alignment Learning dynamic label assignment scheme to enhance the alignment between classification and regression tasks [16]. For bounding box regression, YOLOv8 employs a combination of Distribution Focal Loss and Complete Intersection over Union, which enhances localization accuracy and overall detection performance. The overall architecture of YOLOv8 is illustrated in Figure 1.

The overall architecture of the YOLOv8 algorithm consists of three components: a backbone network, a neck network, and a detection head. The YOLOv8 series provides five model versions (YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x), which differ in terms of model parameters and computational complexity. YOLOv8n is selected as the baseline model for this study based on its balance between detection accuracy and computational efficiency in pavement crack detection.

3. Improvements to the YOLOv8 Object Detection Algorithm

3.1. Backbone Network

The backbone network of the YOLOv8 is the primary component for feature extraction. Its CSPDarkNet architecture transforms raw input images into multi-scale feature representations through a series of convolution and down-sampling operations, achieving a favorable trade-off between computational efficiency and gradient optimization. Shallow feature layers retain abundant spatial and texture information, which facilitates the localization and recognition of small targets. Deeper feature layers capture more abstract and global semantic information, which facilitates the classification and detection of larger targets. Moreover, cross-stage connections and efficient activation functions enhance feature representation while maintaining a lightweight network structure. These characteristics of the backbone network provide strong support for multi-scale feature fusion in the neck network and accurate detection in the head network, enabling the model to effectively handle complex detection scenarios involving multi-scale and overlapping targets.

However, because the CSPDarkNet architecture relies on the local receptive fields of convolutional operations, it is unable to model long-range semantic dependencies, particularly when detecting irregularly shaped, sparse objects or objects in complex backgrounds. During deep-layer downsampling, features of small objects may be weakened or lost, and the network cannot effectively suppress interference from complex background information. These deficiencies are particularly evident in fine-grained and low-contrast targets, such as pavement cracks.

In this paper, the TransXNet visual backbone is adopted to replace the original YOLOv8 backbone, as it integrates the self-attention mechanism of Transformers with the strengths of convolutional neural networks to provide more accurate feature representations. The Cross-Scale Attention Module at the core of the network overcomes the limitations of local receptive fields in convolutional operations and dynamically captures long-range dependencies between features, enabling the network to model global semantic relationships for discontinuous targets, such as intermittent cracks.

Furthermore, the lightweight decoupled attention mechanism in TransXNet enables adaptive allocation of feature weights to suppress interference from complex road backgrounds while preserving fine-grained details of small targets, without introducing additional computational overhead. The modified model contains 3.0 million parameters, representing a reduction of 0.2 million compared with YOLOv8n. This enhancement effectively addresses the limitation of the original backbone network in small-object feature preservation and global semantic modeling. Therefore, the proposed architecture improves model robustness in noisy environments and is more suitable for pavement crack detection tasks requiring both local and global feature representation. As shown in Figure 2, TransXNet adopts a four-stage hierarchical architecture, where each stage consists of an image patch embedding layer and several sequentially stacked blocks. A 7 × 7 convolutional layer is employed to construct the initial image patch embedding layer, with a stride of 4, padding of 3, and an output channel dimension of 64, followed by batch normalization. In the subsequent stages, the image patch embedding layers are constructed using 3 × 3 convolutional layers with a stride of 2 and padding of 1, each followed by batch normalization.

The TransXNet module consists of three main components: the dynamic position encoding layer, the dual dynamic token mixer, and the multi-scale feedforward network. The number of modules and output channels varies across stages. Specifically, Stage 1 consists of two modules with 64 output channels; Stage 2 includes two modules with 128 output channels; Stage 3 comprises six modules with 256 output channels; and Stage 4 contains two modules with 512 output channels.
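The patch embedding settings described above determine the feature map resolution at each stage. The following is a minimal sketch of that arithmetic, assuming a square 640 × 640 input (an illustrative choice, not a value stated in this section) and the standard convolution output-size formula; the function names are hypothetical.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


# (kernel, stride, padding, out_channels) of the four patch embedding layers,
# taken from the description in the text.
PATCH_EMBEDS = [(7, 4, 3, 64), (3, 2, 1, 128), (3, 2, 1, 256), (3, 2, 1, 512)]


def stage_shapes(input_size: int = 640):
    """Return (channels, height, width) after each stage's patch embedding.

    input_size=640 is an assumption made for illustration only.
    """
    shapes = []
    size = input_size
    for kernel, stride, padding, channels in PATCH_EMBEDS:
        size = conv_out(size, kernel, stride, padding)
        shapes.append((channels, size, size))
    return shapes


for stage, shape in enumerate(stage_shapes(), start=1):
    print(f"Stage {stage}: channels={shape[0]}, size={shape[1]}x{shape[2]}")
```

Under these assumptions the four stages downsample the input by factors of 4, 8, 16, and 32.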

3.2. Attention Mechanism

Attention mechanisms have emerged as a key technique for improving the performance of deep learning models. By dynamically weighting different features, attention mechanisms enable models to focus on important information while suppressing background interference, thereby significantly enhancing feature representation and improving task performance, particularly in complex scenarios. However, the original YOLOv8 primarily focuses on lightweight design and high inference efficiency and lacks a dedicated attention module. In contrast, TransXNet incorporates attention into its core building block, the Dual Dynamic Token Mixer (D-Mixer), enabling the model to selectively enhance features through coordinated global and local feature interactions. As shown in Figure 3, the lightweight token-mixing feature extraction module consists of three components: overlapping spatial dimension-reduction attention, dynamic depthwise convolutions, and compressed token enhancers. Through the joint interaction between global context modeling and local feature perception, D-Mixer significantly enhances the model’s ability to extract complex and discriminative features.

The Overlapping Spatial Reduction Attention (OSRA) module enhances spatial attention performance by introducing an overlapping spatial reduction strategy. The feature subgraph is processed using large overlapping patch units, which capture global semantic relationships, enhance spatial boundary structures, suppress background texture interference, and improve crack detection performance. In this module, the spatial reduction ratio is set to 16, and the number of attention heads is set to 8. In addition, a Dynamic Depthwise Convolution (DDC) module is introduced to aggregate contextual information by using adaptive average pooling to dynamically generate convolution kernel weights, enabling adaptive local feature extraction based on input features and improving the model’s ability to recognize low-contrast cracks. The convolution kernel size is set to 5 × 5, and the number of groups is equal to the number of input channels.

During the computing process, the input feature map X is initially split into two feature groups along the channel dimension, X_{1} and X_{2}. Subsequently, the two feature maps are fed into the OSRA branch and the DDC branch for global feature extraction and local feature extraction, respectively. The resulting features are then concatenated along the channel dimension. The fused feature is then fed into the Compress Token Enhancer module to aggregate local tokens and enhance feature representation, yielding a refined feature map X′. The refined feature map X′ provides enhanced feature representation for subsequent crack classification and localization. Meanwhile, the computing process of D-Mixer can be formulated as follows:

\begin{aligned} X_{1}, X_{2} &= \mathrm{Split}(X) \\ D &= \mathrm{Concat}[\mathrm{OSRA}(X_{1}), \mathrm{IDConv}(X_{2})] \\ X' &= \mathrm{STE}(D) \end{aligned}

(1)
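The data flow of the D-Mixer (channel split, two parallel branches, concatenation, token enhancement) can be illustrated with a toy sketch. This is not the TransXNet implementation: each channel is represented by a single number, and the three branch functions are hypothetical placeholders that the caller supplies.

```python
from typing import Callable, List

Channels = List[float]  # toy stand-in: one number per channel
Branch = Callable[[Channels], Channels]


def d_mixer(x: Channels, global_branch: Branch, local_branch: Branch,
            enhancer: Branch) -> Channels:
    """Split channels in half, process each half, concatenate, then enhance.

    global_branch, local_branch and enhancer are placeholders for the
    attention branch, the depthwise convolution branch and the token
    enhancer; their real computations are not modeled here.
    """
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]                 # split along the channel axis
    d = global_branch(x1) + local_branch(x2)    # list concatenation of channels
    return enhancer(d)


# Usage with trivial placeholder branches.
out = d_mixer(
    [1.0, 2.0, 3.0, 4.0],
    global_branch=lambda c: [2 * v for v in c],
    local_branch=lambda c: [10 * v for v in c],
    enhancer=lambda c: c,
)
print(out)
```

The sketch only shows that the output keeps the input channel count, since each branch handles half of the channels before concatenation.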

3.3. Activation Functions

During the replacement of the original YOLOv8 backbone with TransXNet, the activation function, as a key component for nonlinear representation, needs to be appropriately adjusted to be compatible with the nonlinear feature mapping requirements of the TransXNet architecture, while also enhancing the model’s representation capability for pavement crack detection tasks. In the original YOLOv8 model, the SiLU is adopted as the default activation function in the convolution modules. Although SiLU enables smooth gradient propagation and performs well in lightweight network design, it has limitations in modeling the non-linear characteristics of complex fine-grained features such as cracks. In contrast, the TransXNet backbone integrates Transformer self-attention layers with convolutional operations, and adopts GELU as the activation function in its core modules, which enables smoother probabilistic activation behavior and better modeling of complex feature distributions. GELU is formally defined as Equation (2), where x denotes the input and Φ(x) represents the cumulative distribution function of the standard normal distribution.

\mathrm{GELU}(x) = x \cdot \Phi(x), \qquad \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{t^{2}}{2}} \, dt

(2)

The GELU activation function provides a smooth nonlinear representation and better aligns with the feature distribution characteristics of the global attention module in TransXNet compared with the SiLU function, thereby helping to alleviate the vanishing gradient problem in deep networks and improving the network’s sensitivity to the complex texture features of pavement cracks. As shown in Figure 4, the GELU function exhibits a continuously differentiable and non-monotonic curve. From the perspective of functional form, GELU retains small nonzero activation values for moderately negative inputs, which helps maintain gradient flow. SiLU is also smooth and non-monotonic, but it gates the input with the logistic sigmoid rather than the standard normal cumulative distribution function, so the two functions differ in how strongly they suppress negative inputs. This Gaussian gating is credited with further enhancing the model’s nonlinear representation capability and improving feature learning efficiency for pavement crack detection tasks.
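The behavior of the two activation functions in the negative region can be checked numerically. The sketch below evaluates GELU exactly as defined in Equation (2), using the identity Φ(x) = ½(1 + erf(x/√2)), and SiLU as x·sigmoid(x); the sample points are arbitrary choices for illustration.

```python
import math


def gelu(x: float) -> float:
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def silu(x: float) -> float:
    """SiLU(x) = x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


# Arbitrary sample inputs; both functions are evaluated side by side.
for x in (-4.0, -2.0, -1.0, -0.5, 0.0, 1.0):
    print(f"x={x:+.1f}  GELU={gelu(x):+.6f}  SiLU={silu(x):+.6f}")
```

Printing both columns side by side makes it easy to compare how each function treats negative inputs at the chosen points.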

4. Experimental Design and Results Analysis

4.1. Experimental Dataset

This study employs the RDD2022 road damage dataset, which contains 47,420 images collected from six countries, with over 55,000 annotated damage instances. The dataset encompasses multiple categories of road defects, including longitudinal cracks, transverse cracks, alligator cracks, and potholes.

Since the annotation files of the RDD2022 dataset are provided in PASCAL VOC XML format, whereas the YOLO-based model adopted in this study requires labels in a specific TXT format, a format conversion process is necessary. First, the LabelImg tool is employed to re-annotate samples with inconsistent labels and remove images without annotations. Subsequently, a custom script is used to perform batch conversion of annotation formats.

The LabelImg tool supports rectangular bounding box annotation, enabling efficient localization of crack regions in images and generating annotation files that contain both category labels and spatial coordinates. Upon completion of annotation, LabelImg automatically generates corresponding TXT files, which record the category information and positional coordinates of crack targets in each image, as summarized in Table 1. The total number of annotated crack instances after processing is reported in Table 2.

During the format conversion process, the bounding box coordinates must be normalized as required by the YOLO label format, with the transformation defined in Equation (3). Specifically, the top-left and bottom-right coordinates of the bounding box are denoted as (x_{left}, y_{left}) and (x_{right}, y_{right}), respectively, while the original image width and height are represented by w and h. After normalization, the center coordinates of the target are expressed as (X, Y), and the corresponding width and height of the bounding box are denoted as W and H.

\begin{cases} X = \dfrac{x_{right} + x_{left}}{2w} \\ Y = \dfrac{y_{right} + y_{left}}{2h} \\ W = \dfrac{x_{right} - x_{left}}{w} \\ H = \dfrac{y_{right} - y_{left}}{h} \end{cases}

(3)
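Equation (3) translates directly into code. The following is a minimal sketch of the conversion for a single box; it is not the custom script used in the study, and the function names and the integer class ID are hypothetical.

```python
def voc_to_yolo(x_left, y_left, x_right, y_right, w, h):
    """Apply Equation (3): corner coordinates in pixels -> normalized (X, Y, W, H).

    w and h are the image width and height in pixels.
    """
    X = (x_right + x_left) / (2 * w)
    Y = (y_right + y_left) / (2 * h)
    W = (x_right - x_left) / w
    H = (y_right - y_left) / h
    return X, Y, W, H


def yolo_line(class_id, box, w, h):
    """Format one label line as 'class X Y W H' (class_id mapping is assumed)."""
    X, Y, W, H = voc_to_yolo(*box, w, h)
    return f"{class_id} {X:.6f} {Y:.6f} {W:.6f} {H:.6f}"


# Example: a 200 x 200 pixel box in a 1000 x 800 image.
print(yolo_line(0, (100, 200, 300, 400), 1000, 800))
```

All four outputs fall in the range [0, 1] as long as the box lies inside the image.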

The preprocessed dataset is further divided into training and validation sets with a ratio of 7:3, comprising 16,634 images in the training set and 7133 images in the validation set. The preprocessing step significantly improves the overall quality of the dataset and ensures reliable data for subsequent model training. To enhance the robustness and generalization capability of the algorithm, the dataset is augmented using transformations such as translation, rotation, and color variations. In addition, Gaussian noise is introduced to simulate uncertainties present in real-world scenarios. Representative examples of the data augmentation process are illustrated in Figure 5.

4.2. Experimental Design

The experiments were conducted on a workstation running Windows 10 Pro, equipped with an Intel Core i7-6900K CPU and 48 GB RAM. Model training was performed on a GPU with CUDA 12.4 support. The training environment is configured with Python 3.8.2 and the PyTorch 2.0.1 deep learning framework.

To ensure optimal model performance and reproducibility, a systematic grid search was conducted to optimize key hyperparameters. Specifically, several candidate combinations of batch size and learning rate were evaluated, and mAP@0.5 was calculated on a subset of the validation set for each combination. The validation results for all combinations are presented in Table 3. Based on these comparisons, the optimal configuration was determined to be a batch size of 32 and a learning rate of 0.01, achieving an mAP@0.5 of 75.2% on the validation subset while balancing model accuracy and training stability. The momentum parameter was set to 0.937 following typical YOLO configurations, while the number of training iterations was determined by monitoring the convergence of the validation loss. After comparing 200, 300, 400, and 500 training iterations, 400 iterations were selected as the optimal value to ensure sufficient convergence without noticeable overfitting. The grid search results presented in Table 3 were all obtained with the number of training iterations fixed at 400.

The number of parameters, Floating Point Operations (FLOPs), and inference speed of the improved model were evaluated and compared with those of the official YOLOv8n model, as presented in Table 4. This comparison not only quantifies the advantages of the improved model in terms of computational resources and inference efficiency but also facilitates the assessment of potential computational costs associated with maintaining high detection accuracy, thereby providing a reference for practical deployment.

4.3. Experimental Evaluation Index

1. Precision

Precision is defined as the proportion of samples predicted as positive that are actually positive. Its calculation is given in Equation (4), where true positives (TP) denote the number of instances correctly predicted as positive, and false positives (FP) denote the number of instances incorrectly predicted as positive while actually belonging to the negative class.

\mathrm{Precision} = \frac{TP}{TP + FP}

(4)

Precision reflects a model’s ability to minimize false positives: the fewer the false positives (FP), the higher the precision. High precision means that few instances from other classes are incorrectly classified as the target class, so the predicted positive samples are more reliable.

2. Recall

Recall is defined as the proportion of actual positive samples that are correctly identified as positive by the model. Its calculation is given in Equation (5), where true positives (TP) denote the number of instances correctly predicted as positive, and false negatives (FN) denote the number of instances incorrectly predicted as negative while actually belonging to the positive class.

\mathrm{Recall} = \frac{TP}{TP + FN}

(5)

Recall reflects the model’s ability to minimize false negatives. A higher recall corresponds to a smaller number of false negatives (FN). In this case, fewer positive instances are incorrectly classified as negative, resulting in a lower rate of missed detections.

3. mAP

Average Precision (AP) reflects the prediction accuracy for each category and is calculated as the area under the precision–recall (P–R) curve for each category. Mean Average Precision is computed as the mean of AP values across all categories and is used to evaluate the overall model performance. A higher mAP indicates better overall detection performance, reflecting improved precision and recall.

The mAP@0.5 denotes the mAP value computed at an Intersection over Union (IoU) threshold of 0.5. IoU is used to measure the degree of overlap between the predicted bounding box and the ground truth box. An IoU value closer to 1 indicates a higher degree of overlap between the two boxes, whereas a value closer to 0 indicates less overlap. In practice, IoU is compared with a predefined threshold, typically set to 0.5, to determine whether a prediction is correct. When the IoU between the predicted box and the ground truth box exceeds the threshold, the prediction is considered correct. mAP@0.5 is computed based on this criterion.
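The IoU criterion described above can be sketched as follows, assuming boxes are given as (x1, y1, x2, y2) corner coordinates. The function names are illustrative, and the matching of predictions to ground truth boxes that a full mAP evaluation requires is not covered.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred_box, gt_box, threshold=0.5):
    """A prediction counts as correct when its IoU exceeds the threshold."""
    return iou(pred_box, gt_box) > threshold


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))            # overlap 1, union 7
print(is_correct((0, 0, 10, 10), (0, 0, 10, 8)))  # IoU 0.8 at threshold 0.5
```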

4. F1 score

The F1 score is the harmonic mean of precision and recall and is used to comprehensively evaluate the classification performance of a model. Its calculation is given in Equation (6). By combining precision and recall into a single metric through the harmonic mean, the F1 score provides a balanced assessment of the model’s performance with respect to both precision and recall.

F1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

(6)
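Equations (4) to (6) can be computed together from the detection counts. The sketch below uses made-up counts purely for illustration; returning 0.0 when a denominator is zero is an assumed convention, not one specified in the text.

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Return (precision, recall, F1) from TP, FP and FN counts.

    Zero denominators yield 0.0, which is an assumed convention.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative counts only, not results from the paper.
p, r, f1 = detection_metrics(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values, so a model cannot score well on F1 by excelling at only one of precision or recall.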

4.4. Ablation Experiment

This study conducts staged ablation experiments for crack detection to systematically evaluate the contributions of TransXNet, the GELU activation function, and training strategy optimization. To mitigate the randomness caused by a single data split, five-fold cross-validation is performed on both the YOLOv8 baseline and the final improved model, with results reported as mean ± standard deviation. All intermediate configurations are trained once using a fixed random seed of 56. As summarized in Table 5, the mAP@0.5 values of the baseline and the improved model are 86.9% ± 0.16 and 95.3% ± 0.14, respectively. The standard deviations across all folds are below 0.2, indicating stable and reliable performance.

Replacing the CSPDarkNet backbone in YOLOv8 with TransXNet leads to consistent improvements across all evaluation metrics, suggesting that the hybrid architecture enhances feature representation. The integration of Transformer and convolutional branches enables effective modeling of both global context and local details, thereby reducing background interference and improving the detection of crack boundaries and textures. Building on this modification, substituting SiLU with GELU further improves performance, with mAP@0.5 increasing to 91.2% and precision and recall improving by 0.8 and 1.2 percentage points, respectively. This gain can be attributed to the smoother nonlinear characteristics of GELU, which facilitate more refined feature discrimination, particularly for low-contrast and blurred cracks.

Further improvements are achieved by incorporating cosine annealing and transfer learning, resulting in a final mAP@0.5 of 95.3%. Transfer learning provides informative initial representations that accelerate convergence, while cosine annealing stabilizes the optimization process and enhances convergence in later training stages. Overall, the proposed model achieves an improvement of 8.4 percentage points over the baseline. The results demonstrate that backbone modification is the primary driver of performance gains, while activation function replacement and training strategy optimization provide complementary improvements.
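For reference, a commonly used form of cosine annealing can be sketched as below. The initial learning rate of 0.01 and the 400 training iterations come from Section 4.2; the minimum learning rate and the absence of warm-up or restarts are assumptions, so this is not necessarily the exact schedule used in the experiments.

```python
import math


def cosine_lr(step: int, total_steps: int = 400,
              lr_max: float = 0.01, lr_min: float = 0.0001) -> float:
    """Cosine-annealed learning rate at a given training step.

    lr_min=0.0001 is an assumed value chosen for illustration.
    """
    cosine = math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)


for step in (0, 100, 200, 300, 400):
    print(f"step {step:3d}: lr = {cosine_lr(step):.6f}")
```

The rate decays slowly at first, fastest around the midpoint, and slowly again near the end, which is the property usually cited for stabilizing late-stage convergence.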

4.5. Comparative Experiments

To verify the detection performance of the improved YOLOv8 model for pavement cracks, comparative experiments were performed against the original YOLOv8 model and two widely used object detection models, YOLOv7 and Faster R-CNN, on the RDD2022 dataset and a self-constructed pavement crack dataset. The same training–testing data splits, data augmentation strategies, and evaluation protocols were used for all models. The main evaluation metrics include precision, recall, mAP@0.5, and mAP@0.5:0.95. The experimental results are presented in Table 6.

Experiments demonstrate that the detection performance of the proposed improved YOLOv8 model surpasses that of the compared models on both datasets. The proposed model outperforms the two-stage Faster R-CNN detector by 14.0 percentage points in mAP@0.5, demonstrating the superiority of the YOLO-based single-stage detection framework for fine-grained pavement crack detection. Compared with the original YOLOv8, the proposed model achieves improvements of 8.2 percentage points and 16.9 percentage points in mAP@0.5 and mAP@0.5:0.95, respectively. The precision reaches 95.8%, representing an improvement of 9.6 percentage points over the baseline YOLOv8 model. The experimental results demonstrate that the proposed method improves detection accuracy while maintaining stable recall performance, effectively reducing false detection caused by background interference and enhancing the detection of subtle cracks.
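
The percentage-point gains quoted above follow directly from Table 6. The sketch below recomputes them as plain differences; the dictionary keys and the `gain` helper are hypothetical names introduced here for illustration.

```python
# Values copied from Table 6 (all in percent).
table6 = {
    "YOLOv7":       {"P": 83.8, "R": 75.9, "mAP50": 84.8, "mAP50_95": 54.0},
    "YOLOv8":       {"P": 86.2, "R": 79.5, "mAP50": 87.0, "mAP50_95": 57.6},
    "Faster R-CNN": {"P": 80.0, "R": 73.0, "mAP50": 81.2, "mAP50_95": 50.4},
    "Improved":     {"P": 95.8, "R": 91.2, "mAP50": 95.2, "mAP50_95": 74.5},
}

def gain(metric: str, reference: str, model: str = "Improved") -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(table6[model][metric] - table6[reference][metric], 1)

# Gains of the improved model over each compared model.
for ref in ("YOLOv7", "YOLOv8", "Faster R-CNN"):
    print(ref, {m: gain(m, ref) for m in ("P", "R", "mAP50", "mAP50_95")})
```

Reporting these differences in percentage points rather than percent avoids confusion with relative improvement.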

4.6. Visual Analysis of Experimental Results

Precision and recall are important metrics for evaluating the reliability and coverage of detection models. Precision indicates the reliability of the predictions generated by the detection model. Recall indicates the extent to which true targets are successfully detected.

As shown in the precision curve in Figure 6, alligator cracking maintains relatively high precision across a wide range of confidence scores. This can be attributed to the characteristic morphology of alligator cracking, in which transverse and longitudinal cracks intersect to form closed or semi-closed block-like patterns with relatively large areas. These cracks are typically wide and deep, exhibiting strong contrast with the surrounding pavement, which facilitates model learning. Transverse and longitudinal cracks achieve moderate precision, whereas potholes exhibit relatively lower precision at higher confidence levels. This is primarily due to the highly variable shapes of potholes and the similarity of their edges to shadows, stains, and other artifacts on the pavement, which increases the difficulty of accurate discrimination by the model.

Meanwhile, the recall curves in Figure 7 indicate that alligator cracking achieves the highest recall among the four categories, demonstrating that the model detects a large proportion of true instances even at relatively low confidence levels. This suggests that the model has effectively learned the morphological and textural features of alligator cracking. In contrast, transverse and longitudinal cracks are predominantly linear, sparse, or isolated, lacking prominent contextual features, which increases detection difficulty. Potholes exhibit relatively low recall overall, likely due to limited sample size and high variability in shape, indicating that feature learning for this category could be further improved.

To evaluate model performance systematically, a precision–recall (P–R) curve was generated. As shown in Figure 8, the improved model achieves a large area under the P–R curve; even at high recall levels, it maintains high precision, with an mAP@0.5 of 95.2%, indicating that the model effectively controls the false positive rate while minimizing missed detections. Further analysis shows that the average precision across typical crack categories exceeds 90%, indicating strong detection performance and enhanced adaptability to large-area targets.
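
The link between the area under the P–R curve and average precision (AP) can be sketched as follows. mAP@0.5 is the mean of the per-class AP values at an IoU threshold of 0.5. This sketch uses all-point interpolation in the Pascal VOC style; evaluation toolkits differ in their interpolation scheme, so it is not claimed to reproduce the exact procedure behind the reported figures. The function name and inputs are assumptions for illustration.

```python
def average_precision(recalls, precisions):
    """Area under a P-R curve using the monotone precision envelope.

    `recalls` must be sorted in ascending order, with `precisions` aligned to it.
    """
    # Sentinel points so the curve spans recall 0 to 1.
    r = [0.0] + list(recalls) + [1.0]
    p = [1.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangle areas between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

def mean_average_precision(per_class_ap):
    """mAP is the unweighted mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)

# Toy curve: precision 1.0 up to recall 0.5, then 0.5 up to recall 1.0.
print(average_precision([0.5, 1.0], [1.0, 0.5]))
```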

In the precision and recall curves (Figures 6 and 7), the horizontal axis represents the confidence score of the predicted bounding box, indicating the model’s confidence level for each prediction. Each curve is generated by using each confidence value as a threshold to filter predicted bounding boxes and compute the corresponding precision or recall. When the confidence threshold is low, more predicted bounding boxes are retained, resulting in higher recall but potentially lower precision. When the confidence threshold is high, only high-confidence predictions are retained, which may improve precision but reduce recall. By analyzing the variation in precision and recall under different confidence thresholds, the detection performance of the model can be evaluated, and an appropriate threshold can be selected to balance precision and recall. Based on this, precision and recall can be further integrated into a single metric through the F1-score curve, enabling a more intuitive evaluation of the overall detection performance under different confidence thresholds.
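
The thresholding procedure described above can be sketched in a few lines. The detections below are hypothetical toy values, not data from the paper, and the true-positive flags are assumed to come from a prior IoU-matching step that is not shown.

```python
def precision_recall_at(threshold, detections, num_ground_truth):
    """Precision and recall after keeping detections with confidence >= threshold.

    detections: list of (confidence, is_true_positive) pairs.
    """
    kept = [is_tp for conf, is_tp in detections if conf >= threshold]
    tp = sum(kept)               # retained detections that match a ground-truth box
    fp = len(kept) - tp          # retained detections that match nothing
    precision = tp / (tp + fp) if kept else 1.0  # convention when nothing is kept
    recall = tp / num_ground_truth
    return precision, recall

# Hypothetical toy detections with 4 ground-truth cracks in total.
toy = [(0.95, True), (0.90, True), (0.70, False), (0.60, True), (0.30, False)]
for t in (0.2, 0.5, 0.8):
    p, r = precision_recall_at(t, toy, num_ground_truth=4)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

In this toy case, raising the threshold from 0.2 to 0.8 lifts precision from 0.60 to 1.00 while recall falls from 0.75 to 0.50, which is the trade-off discussed above.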

The F1 score ranges from 0 to 1. A higher F1 score indicates fewer missed detections (high recall) and fewer false detections (high precision), resulting in better overall detection performance. As shown in Figure 9, the F1 score for crack detection is closest to 1, indicating the best detection performance. The overall F1 score across all categories is 0.93, indicating that the model has effectively learned features for different crack types and achieves balanced detection performance.
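
As a quick check, the F1 score is the harmonic mean of precision and recall. The sketch below applies that definition to the precision and recall of the improved model in Table 6; note that a value read from an F1 curve depends on the confidence threshold at which it is taken, so the two need not match exactly.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall of the improved model from Table 6, as fractions.
print(round(f1_score(0.958, 0.912), 3))
```

The result is approximately 0.93, in line with the overall F1 score reported above.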

The confusion matrix of the model for pavement crack classification is presented in Figure 10. It is computed on the validation set and objectively reflects the misclassification distribution among different crack types on unseen data. The diagonal elements are relatively large, indicating that the model has strong discrimination capability for major crack types, such as longitudinal, transverse, and alligator cracks. Therefore, the model achieves high overall classification accuracy. This is partly due to the larger number of longitudinal and transverse crack samples in the dataset. However, the off-diagonal elements still exhibit non-negligible values, indicating a certain degree of misclassification between transverse and alligator cracks. This is due to their similarity in local morphological structures and textures, which increases the difficulty of distinguishing between them.

Further analysis of the confusion matrix shows nonzero elements at the intersections between the background column and the crack categories, indicating that a small number of crack instances are missed. Likewise, the nonzero values for crack categories in the background row reveal that some background textures and artifacts, particularly complex ones, are occasionally mistaken for cracks.
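
The reading of the confusion matrix described above can be made concrete with a small sketch. The counts are hypothetical and are not the values in Figure 10. The layout is also an assumption: rows are taken as true classes and columns as predicted classes, so that the background column holds missed cracks and the background row holds false alarms, matching the description above.

```python
# Hypothetical counts for illustration only; not the values in Figure 10.
# Assumed convention: rows = true class, columns = predicted class.
labels = ["longitudinal", "transverse", "alligator", "pothole", "background"]
matrix = [
    [90,  2,  3,  0,  5],   # true longitudinal
    [ 3, 80, 10,  0,  7],   # true transverse
    [ 2,  8, 85,  1,  4],   # true alligator
    [ 0,  0,  1, 70,  9],   # true pothole
    [ 4,  3,  2,  6,  0],   # true background (false alarms)
]

def per_class_stats(matrix, labels):
    """Precision, recall, misses, and false alarms for each non-background class."""
    bg = labels.index("background")
    stats = {}
    for i, name in enumerate(labels):
        if i == bg:
            continue
        true_total = sum(matrix[i])                 # ground-truth instances of this class
        pred_total = sum(row[i] for row in matrix)  # predictions of this class
        stats[name] = {
            "precision": matrix[i][i] / pred_total,
            "recall": matrix[i][i] / true_total,
            "missed": matrix[i][bg],        # background column entry
            "false_alarms": matrix[bg][i],  # background row entry
        }
    return stats

for name, s in per_class_stats(matrix, labels).items():
    print(name, {k: round(v, 2) for k, v in s.items()})
```

In these toy counts, the largest off-diagonal entries lie between the transverse and alligator classes, mirroring the kind of confusion discussed above.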

In general, the proposed model demonstrates strong capability in distinguishing among multiple crack categories (beyond the conventional binary crack detection), but remains sensitive to interference caused by morphological similarity and background complexity. Future work will focus on enhancing discriminative local feature extraction and improving background suppression.

5. Conclusions

This paper addresses the issues of low detection efficiency and poor accuracy in traditional pavement crack detection methods, as well as the limitations of the existing YOLOv8 model in such tasks, including insufficient global semantic modeling ability, loss of small-object features, and inadequate suppression of background interference. To address these problems, this study conducts a comprehensive investigation into improvements of the YOLOv8 model and its application to pavement crack detection. Through model optimization and validation, the proposed method successfully achieves the intended research objectives. The main conclusions are summarized as follows:

The proposed efficient YOLOv8 model addresses the limitation of traditional convolutional networks in modeling long-range dependencies by replacing the original CSPDarkNet backbone with the TransXNet visual backbone. TransXNet integrates Transformer-based self-attention mechanisms with convolutional operations to achieve more effective feature representation. The cross-scale attention module captures long-range semantic dependencies of discontinuous cracks, while the lightweight decoupled attention design suppresses background interference and preserves fine-grained details of small targets, thereby providing more reliable feature representations for pavement crack detection. In addition, the SiLU activation function is replaced with GELU to better align with the feature distribution characteristics of TransXNet. This modification improves the nonlinear representation capability and training stability of the network, alleviates the vanishing gradient problem, and enhances feature extraction for subtle and low-contrast cracks. These combined improvements lead to an overall enhancement in network performance.

The experimental results demonstrate that the improved model achieves superior overall performance. On the RDD2022 dataset, the model achieves an mAP@0.5 of 95.2%, representing an improvement of 8.2 percentage points over the baseline YOLOv8 model. Moreover, the model achieves a precision of 95.8% and a recall of 91.2%. Compared with other widely used detection models, such as YOLOv7 and Faster R-CNN, the proposed method achieves the best performance across all core evaluation metrics. This improved model effectively reduces both false positives and false negatives while demonstrating strong adaptability to complex pavement surface conditions. Moreover, the ablation study confirms the effectiveness of each improvement strategy, with backbone replacement contributing the most significant performance gain. These results further validate the effectiveness and soundness of the proposed improvement strategies.

In summary, the proposed model demonstrates strong performance in pavement crack detection, achieving high precision and stable recall under typical conditions. It exhibits particularly strong recognition capability for crack types with distinct directional features, especially longitudinal cracks. Meanwhile, the training process exhibits smooth convergence without noticeable overfitting, indicating good practical applicability and generalization ability. However, several limitations remain. The model still struggles to distinguish between morphologically similar crack types, such as transverse and alligator cracks. Additionally, the detection of small-scale and low-contrast cracks in complex backgrounds with strong interference requires further improvement.

Therefore, future work should focus on enhancing fine-grained feature discrimination mechanisms, developing more effective data augmentation strategies, and optimizing background-aware methods. In addition, exploring lightweight model deployment and integrating the proposed model with embedded devices for real-time detection are expected to further promote its large-scale application in practical road maintenance systems.

Author Contributions

Conceptualization, Z.D. and Y.Y.; methodology, Z.D.; software, Z.D.; validation, Y.Y., W.Y. and Z.D.; formal analysis, W.Y. and Z.D.; investigation, Z.D.; resources, Z.D.; data curation, Y.Y.; writing—original draft preparation, Y.Y.; writing—review and editing, Y.Y., W.Y. and Z.D.; visualization, Y.Y. and W.Y.; supervision, Z.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study were obtained from the public Road Damage Dataset 2022 (https://github.com/sekilab/RoadDamageDetector (accessed on 18 March 2026)). No new raw data were generated in the present study.

Acknowledgments

The authors wish to thank the providers of the RDD2022 for making the dataset publicly available. The authors also acknowledge the anonymous reviewers for their valuable comments and suggestions. All authors are grateful for their contributions to this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

YOLO	You Only Look Once
CSPDarkNet	Cross Stage Partial DarkNet
GELU	Gaussian Error Linear Unit
C3	CSP Bottleneck with 3 convolutions
C2f	Cross Stage Partial Bottleneck with two convolutions
D-Mixer	Dual Dynamic Token Mixer
OSRA	Overlapping Spatial Reduction Attention
DDC	Dynamic Depthwise Convolution
SiLU	Sigmoid Linear Unit
RDD2022	Road Damage Dataset 2022

Figure 1. YOLOv8 Network Architecture Diagram.

Figure 2. Overall Architecture Diagram of TransXNet.

Figure 3. D-Mixer Block Diagram.

Figure 4. GELU Activation Function.

Figure 5. Data Augmentation Example.

Figure 6. Precision Curve.

Figure 7. Recall Curve.

Figure 8. P–R Curve.

Figure 9. F1 score Curve.

Figure 10. Confusion Matrix.

Table 1. Annotation Information of LabelImg.

Crack Type	Center x	Center y	Width w	Height h
0	0.665039	0.270508	0.052734	0.537109
1	0.541016	0.534180	0.117188	0.033203
1	0.698242	0.561523	0.158203	0.033203
1	0.895508	0.553711	0.208984	0.044922

Table 2. Statistical Summary of Annotated Instances for Each Crack Category.

Crack Category	Number of Instances	Proportion (%)
Longitudinal Crack	26,196	45.77
Transverse Crack	11,875	20.75
Alligator Cracking	12,617	22.05
Pothole	6544	11.43

Table 3. Results of Hyperparameter Optimization via Grid Search.

Learning Rate	Batch Size	mAP@0.5 (%)
0.001	16	72.3
0.001	32	73.1
0.001	64	72.8
0.01	16	74.5
0.01	32	75.2
0.01	64	74.9
0.1	16	71.6
0.1	32	70.8
0.1	64	69.5

Table 4. Efficiency Comparison between the Improved Model and YOLOv8n.

Model	Image Size (Pixels)	Parameters (M)	FLOPs (B)	Inference Speed (ms/Image)
YOLOv8n	640	3.2	8.7	0.99
Improved Model	640	3.0	7.9	0.87

Table 5. Ablation Experiment Results.

Model	P (%)	R (%)	mAP@0.5 (%)	mAP@0.5:0.95 (%)
YOLOv8	86.2 ± 0.15	79.5 ± 0.17	86.9 ± 0.16	57.8 ± 0.16
Replace with TransXNet	90.1	83.5	90.6	63.2
Replace the activation function with GELU	90.9	84.7	91.2	64.3
TransXNet + GELU + Cosine annealing	92.9	87.7	93.3	68.3
TransXNet + GELU + Transfer learning	94.8	89.1	94.3	70.9
Completely improved model	95.4 ± 0.12	91.0 ± 0.18	95.3 ± 0.14	74.6 ± 0.15

Table 6. Model Comparison Experimental Results.

Model	P (%)	R (%)	mAP@0.5 (%)	mAP@0.5:0.95 (%)
YOLOv7	83.8	75.9	84.8	54.0
YOLOv8	86.2	79.5	87.0	57.6
Faster R-CNN	80.0	73.0	81.2	50.4
Improved model	95.8	91.2	95.2	74.5

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
