1. Introduction
By the end of 2024, China’s total highway mileage had reached 5,490,400 km, with expressways accounting for 190,700 km; this total ranks first globally [1]. However, the pavement is the first component of the road structure to bear external forces, enduring loads from passing vehicles while also withstanding temperature fluctuations, corrosion, and human damage. The intensification of these factors inevitably leads to road deterioration. In the “14th Five-Year Plan for Highway Maintenance Management Development,” the Ministry of Transport proposed leveraging digital technologies to advance highway maintenance and management techniques, accelerate the R&D of inspection equipment, and enhance the automation of road inspection and maintenance [2]. Pavement defect detection serves as a critical preliminary step in road maintenance, playing a pivotal role in ensuring safe and efficient daily transportation while fostering stable socioeconomic development [3].
Common pavement defects include longitudinal cracks, transverse cracks, and crocodile cracks. Early pavement crack detection methods primarily relied on traditional image processing techniques, encompassing both 2D and 3D approaches. Typical 2D methods, such as edge detection based on the Canny operator [4] and threshold segmentation using Otsu’s method [5], are highly sensitive to environmental factors like lighting variations and pavement conditions. These methods not only require frequent manual intervention to adapt to different scenarios, which significantly reduces automation and detection efficiency, but also tend to introduce biases during crack feature extraction, leading to frequent false positives and false negatives. Although 3D detection models process richer information, they are also susceptible to environmental interference, require complex denoising, and suffer from limited application scenarios. Moreover, these models are computationally intensive, inefficient, and demand high hardware specifications [6].
In recent years, with the continuous advancement of machine learning and artificial intelligence, computer vision has been increasingly applied to object detection, which is primarily divided into two-stage and single-stage models. Two-stage models, typified by the R-CNN series [7], first generate candidate boxes and then perform secondary feature extraction and multi-task learning to achieve object detection. They have demonstrated outstanding performance in medical image lesion segmentation, 3D obstacle detection in autonomous driving, and military target detection in satellite imagery. Single-stage object detection models enable direct end-to-end prediction of both object category and location, offering faster detection speeds. Classic single-stage detection models include the SSD and YOLO series [8]. Among these, the YOLO series [9] stands as a quintessential single-stage model, achieving a remarkable balance between real-time performance and accuracy, and has become the mainstream tool for road crack detection.
Duo Ma et al. [10] proposed the YOLO-MF method based on an improved YOLOv3. They employed a PCGAN to generate realistic crack images and optimized the approach using accelerated algorithms and the Median Flow (MF) algorithm. However, although the PCGAN-generated data alleviates data scarcity, it cannot fully capture the diversity and complexity of real-world cracks, thereby limiting the model’s generalization capability. An Xue-Gang et al. [11] enhanced the detection accuracy for pavement defects by improving YOLOv4 with adaptive spatial feature fusion and modifications to the Focal Loss function. Sanchez et al. [12] annotated bounding boxes based on a nine-category list of damaged and undamaged objects. Beyond detecting cracks amidst background noise on asphalt surfaces, they designed six augmented scenarios by applying horizontal and vertical flipping to evaluate model performance; the YOLOv5 model demonstrated consistent detection of well-defined defects. Huantong Geng et al. [13] proposed the Selective Dynamic Feature Compensation YOLO (SDFC-YOLO) algorithm, which introduced a Dynamic Downsampling Module (DDM) to adaptively adjust the sampling positions of convolutional kernels during feature extraction, alongside a novel feature fusion method. However, integrating multiple sophisticated modules increased the complexity of the training process, demanding greater computational resources and time for parameter optimization. Zhang et al. [14] integrated the Convolutional Block Attention Module (CBAM) into YOLOv7 to enhance accuracy; however, relying exclusively on attention mechanisms can lead to false detections in complex road environments. Li Song et al. [15] proposed an improved lightweight road damage detection algorithm, YOLOv8-RD, which combines the strengths of CNN and Transformer architectures; by introducing the BOT module and a coordinate attention mechanism, detection efficiency was improved. Although this model achieved better performance for small objects to some extent, its accuracy remained insufficient for detecting extremely fine cracks. Yuan Hongshuai et al. [16] adopted the scale sequence feature fusion module and the triple feature encoder module from the ASF-YOLO architecture to enhance the detection of multi-scale cracks and improve target feature perception. They also incorporated the Coordinate Attention (CA) mechanism, embedding positional information into channel attention to bolster crack feature extraction; however, the increased model complexity resulted in slower detection speeds compared to other models. For lightweight applications, Xu Tiefeng et al. [17] proposed the DGE-YOLO-P crack detection model based on YOLOv8, designing the C2f-DCNv3 module to enhance modeling capacity and reduce the dimensionality of input features, effectively decreasing the number of model parameters and the computational complexity.
However, pavement cracks, characterized by weak textures, high aspect ratios, and significant scale variations, are easily confused with environmental noise such as oil stains, repair marks, and tree roots. As highlighted by Dong et al. [18] in their 2025 study on YOLO11-based bridge crack detection, YOLO series models still face “considerable difficulties” when detecting narrow, elongated cracks with low contrast, including significant background false positives and missed detections. These limitations indicate that practical applications of YOLO models remain constrained in this domain. As YOLO undergoes continuous updates and iterations, road crack detection must integrate crack-specific features while continually refining the network architecture. In 2024, the Ultralytics team introduced YOLO11 [19], which achieves high detection accuracy and a relatively lightweight network architecture while maintaining real-time performance, making it well suited for targeted improvements in road crack detection. Zhang, Y. et al. [20] proposed the GLNET-YOLO framework based on cross-modal deep feature fusion, integrating visible and infrared image features. This framework extends the YOLO11 architecture by introducing the FM module for global feature fusion and enhancement, and the DMR module for local feature separation and interaction. It significantly improves detection accuracy and algorithm robustness under low-light and complex background conditions, with its effectiveness further validated on the KAIST dataset.
Building on this, this paper analyzes the YOLO11 model and optimizes it through three key improvements. A Feature Fusion Backbone Network (MFFBN) is designed to enhance the recognition and extraction of pavement crack features in complex environments. The BiFPN weighted bidirectional feature pyramid network is combined with the MCA multimodal cross-attention mechanism to propose BiMCNet (Multi-Channel Attention Bifurcate Network), which replaces the Concat layer in the original network architecture and thereby optimizes the model’s detection capability for fine cracks. The CGeoCIoU (Crack Geometrically Improved Complete IoU) replaces the original model’s CIoU, improving localization accuracy through three distinct penalty terms. Together these yield a novel pavement crack detection method, YOLO11-MBC (YOLO11-MFFBN-BiMCNet-CGeoCIoU), which addresses the issues of low recognition accuracy and high false positive/miss rates in complex road conditions.
2. Materials and Methods
2.1. Basic YOLO11 Model Architecture
YOLO11, released by Ultralytics on 30 September 2024, is the latest version in the YOLO series of real-time object detection algorithms, achieving significant improvements in speed, accuracy, and efficiency over its predecessors. Developed through a series of optimizations based on YOLOv8 [21], the model incorporates several innovative enhancements, including the C3k2 module, the Spatial Pyramid Pooling-Fast (SPPF) module, and the Cross-Stage Partial Spatial Attention (C2PSA) mechanism. The C3k2 module employs a dual-branch design: a 3 × 3 convolutional branch captures local features, while a 1 × 1 convolutional branch facilitates channel interaction for feature extraction. The SPPF module fuses spatial features at different granularities through multi-scale pooling and aggregates multi-scale contextual information via repeated max-pooling operations (three successive 5 × 5 poolings), thereby reducing computational complexity compared to the traditional Spatial Pyramid Pooling (SPP) module. The C2PSA mechanism extracts features in parallel using multi-scale convolutional kernels, generating multi-scale feature maps that enhance the model’s focus on critical regions. The basic YOLO11 model architecture is illustrated in Figure 1.
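For reference, the following is a minimal PyTorch sketch of the SPPF idea described above: three chained 5 × 5 max-pooling operations whose outputs are concatenated, approximating SPP’s parallel 5/9/13 pools at lower cost. The module and parameter names follow common Ultralytics conventions and are illustrative rather than the exact YOLO11 source.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Spatial Pyramid Pooling-Fast: three chained 5x5 max-pools whose
    outputs are concatenated with the input projection."""
    def __init__(self, c_in: int, c_out: int, k: int = 5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_hidden, 1, 1)
        self.cv2 = nn.Conv2d(c_hidden * 4, c_out, 1, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.cv1(x)
        y1 = self.pool(x)    # effective 5x5 receptive field
        y2 = self.pool(y1)   # effective 9x9
        y3 = self.pool(y2)   # effective 13x13
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```

Because the pools are chained rather than parallel, each additional pooling reuses the previous result, which is where the computational saving over classic SPP comes from.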
However, the basic YOLO11 model faces several challenges in road crack detection: (1) Although the C3k2 module excels in feature extraction, its excessive focus on local features may cause it to overlook the overall continuity of cracks when processing such fine and complex textures, leading to inaccurate detection results. (2) The width and length of cracks can vary significantly, and the SPPF module may struggle to adapt to these variations, resulting in incomplete crack detection. (3) While the C2PSA mechanism enhances the perception of target details, it may be inadequate for handling the substantial variations in crack shape and position, leading to detection inaccuracies. Additionally, the high computational complexity of the C2PSA mechanism may compromise the model’s real-time performance.
2.2. Improved YOLO11-MBC Framework
To address the shortcomings of C3k2 in extracting features of minute pavement cracks, the SPPF module’s incomplete crack detection coverage, and C2PSA’s high computational complexity, improvements were made to propose the YOLO11-MBC integrated network architecture, as shown in Figure 2. The main enhancements include:
Design of the MFFBN Backbone: Since cracks typically occupy less than 5% of the image area and are easily obscured by repeated downsampling, we designed a Feature Fusion Backbone Network (MFFBN). This backbone integrates the MFFM [22] with the principles of the ECA mechanism [23] to enhance the perception of target boundaries and minute defects. It enables the model to focus on critical regions within the target area while filtering out noise such as road surface reflections and oil stains in the spectral dimension. This improvement ensures accurate identification of road cracks and potholes in complex detection environments.
Design of the BiMCNet Module: Given that cracks on road surfaces often exhibit slender, elongated topologies and are susceptible to occlusion and disruption from uneven lighting, we designed the BiMCNet architecture. This structure is centered on the BiFPN module and incorporates the Multimodal Cross-Attention (MCA) mechanism. MCA allows queries from local breakpoints to align with globally semantic keys, locating pixels with consistent orientation to achieve spatial stitching. Consequently, it reconstructs occluded, elongated cracks in the spatial dimension while suppressing the influence of false targets such as lane markings, repair marks, and tree roots on the detection results.
Introduction of the CGeoCIoU Loss Function: When the crack aspect ratio exceeds 10:1 and orientation is arbitrary, CIoU still treats the target as an “axis-aligned bounding box.” The CGeoCIoU loss function is therefore introduced to enhance the model’s perception of crack boundaries.
2.3. Design of the MFFBN Backbone
Individual cracks typically occupy less than 5% of an image’s area. These defects are characterized by minute dimensions and low contrast, making them easily overlooked during deep convolution processes. Additionally, complex road surfaces may exhibit crack-like textures, which blur the boundaries of target cracks and compromise the model’s detection accuracy and robustness. Although YOLO11 enhances feature expression through multiple residual connections and local convolutional operations, it lacks sufficient sensitivity to small objects and specialized scale adaptation mechanisms. Furthermore, real-world roads exhibit imbalanced distributions of defect categories due to usage patterns and environmental factors.
Therefore, this paper proposes a Feature Fusion Backbone Network (MFFBN), which integrates the C3k2 module with a custom-designed Feature Fusion Module (MFFM) to achieve refined feature extraction. This architecture focuses on detecting minute cracks while simultaneously mitigating class imbalance, thereby enabling effective detection of road surface crack defects. The structure of the MFFM is illustrated in Figure 3, where the left part performs multi-scale cross-layer fusion and the right part conducts dynamic channel weight adjustment.
- (1)
Multi-scale cross-layer fusion simultaneously preserves high-resolution edge localization information and deep semantic discrimination information. It is implemented with multiple 1 × 1 and 3 × 3 convolutions: the 1 × 1 convolutions capture fine local features, such as minor road cracks, while eliminating redundant information, and the 3 × 3 convolutions expand the receptive field to extract contextual features. Through successive convolutions, these operations merge shallow semantic information with high-frequency features across varying scales into a unified representation. The feature fusion process is described by Equation (1):

$$Y = \varphi\big(\mathrm{Concat}(F_1, F_2, F_3)\big) \quad (1)$$

In the above equation, $F_1$ represents features derived from the upper-layer output after one convolution operation, providing additional local detail information; $F_2$ denotes features processed through one 1 × 1 convolution and two 3 × 3 convolutions, delivering higher-level semantic information; and $F_3$ represents features obtained after two convolutions, supplying surrounding semantic information. The operator $\varphi(\cdot)$ performs the convolution, batch normalization, and activation operations. The three feature maps are concatenated along the channel dimension, and $Y$ is the output obtained after applying $\varphi(\cdot)$ to the concatenated feature. Through multi-scale cross-layer fusion of features, the model can better capture the boundary and detail characteristics of small objects such as minor cracks on complex roads. This significantly enhances the model’s ability to express features across different scales, making it more suitable for feature extraction in complex scenes where such details might otherwise be overlooked. An implementation sketch is given below.
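Since the paper does not provide reference code, the following PyTorch sketch illustrates one plausible realization of Equation (1); the branch depths, channel widths, and SiLU activation are assumptions based on the description above.

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in: int, c_out: int, k: int) -> nn.Sequential:
    """Convolution followed by batch normalization and activation (phi)."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class MFFM(nn.Module):
    """Sketch of the multi-scale cross-layer fusion in Equation (1):
    three branches are concatenated along channels and fused by phi."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.branch1 = conv_bn_act(c_in, c_out, 1)                  # F1: local detail
        self.branch2 = nn.Sequential(conv_bn_act(c_in, c_out, 1),
                                     conv_bn_act(c_out, c_out, 3),
                                     conv_bn_act(c_out, c_out, 3))  # F2: deeper semantics
        self.branch3 = nn.Sequential(conv_bn_act(c_in, c_out, 3),
                                     conv_bn_act(c_out, c_out, 3))  # F3: context
        self.fuse = conv_bn_act(3 * c_out, c_out, 1)                # phi in Eq. (1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = torch.cat([self.branch1(x), self.branch2(x), self.branch3(x)], dim=1)
        return self.fuse(y)
```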
- (2)
Dynamic channel weight adjustment: In the task of road crack detection, the proposed Feature Fusion Module (MFFM) introduces a dynamic channel weight adjustment mechanism inspired by the Efficient Channel Attention (ECA) mechanism. This mechanism replaces cumbersome fully connected layers with lightweight 1D convolutional operations, thereby reducing model complexity while maintaining computational efficiency. This design optimizes the feature extraction process, enhancing the model’s ability to recognize pavement crack features.
A key advantage lies in its adaptive enhancement of crack-related feature channels while suppressing interference from non-target features such as background noise and lighting variations. This is crucial for enhancing the model’s sensitivity to subtle cracks, particularly when crack features resemble or are indistinguishable from the surrounding environment. Furthermore, the mechanism excels at addressing class imbalance by automatically boosting feature weights for underrepresented classes. This prevents these important but scarce categories from being overwhelmed by abundant information from common classes during feature learning.
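A minimal sketch of this ECA-style dynamic channel weighting is given below, assuming the standard ECA formulation (global average pooling followed by a 1D convolution across the channel axis); the kernel size k is a tunable assumption.

```python
import torch
import torch.nn as nn

class ECAWeighting(nn.Module):
    """ECA-style channel weighting: a 1D convolution over the pooled
    channel descriptor replaces fully connected layers."""
    def __init__(self, k: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        y = x.mean(dim=(2, 3))                    # (B, C) global descriptor
        y = self.conv(y.unsqueeze(1)).squeeze(1)  # local cross-channel interaction
        w = self.sigmoid(y).view(b, c, 1, 1)      # per-channel weights
        return x * w                              # amplify crack-relevant channels
```

Because the 1D convolution only models local cross-channel interaction, its parameter count is independent of the channel dimension, which is what keeps the mechanism lightweight.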
2.4. Design of the BiMCNet Module
To enhance the performance of pavement crack detection, particularly for identifying fine cracks, the Bidirectional Feature Pyramid Network (BiFPN) module was introduced into the YOLO11-MBC pavement crack detection model. The Neck network serves as a critical component within object detection frameworks, efficiently reorganizing the multi-level feature maps output by the backbone network. This process fuses high-resolution spatial details with low-resolution semantic abstractions, thereby providing the detection head with multi-scale representations that enhance discriminative power. Traditional Feature Pyramid Networks (FPN) propagate semantic information solely through unidirectional top-down paths. PANet introduces bottom-up branches to supplement spatial details, yet still suffers from information loss and insufficient balance. BiFPN, proposed in EfficientDet [21,24], achieves more comprehensive and flexible multi-scale information exchange within the same computational budget through bidirectional cross-scale connections and a learnable weighted fusion mechanism. Embedding BiFPN into the YOLO11 neck significantly enhances the model’s detection accuracy and robustness for cross-scale pavement defects, ranging from minute cracks to large-area damage, without substantially increasing inference latency (see Figure 4).
Although the PAN-FPN architecture in YOLO11 possesses bidirectional multi-scale fusion capabilities, the repeated upsampling and downsampling processes generate feature redundancy and additional computational overhead, leading to increased computational demands. BiFPN partially alleviates this computational pressure through learnable weight fusion; however, its receptive field remains confined to local neighborhoods. This limitation hinders the capture of global context, resulting in inadequate suppression of road surface background noise. Furthermore, BiFPN lacks a fine-grained filtering mechanism, exhibiting weak selectivity for the spatial orientation features of sub-pixel cracks and small potholes. Consequently, it suffers from representation gaps and localization errors in complex textured backgrounds.
To dynamically optimize multi-scale feature processing, enhance target discrimination and contextual awareness, strengthen the capture of spatial orientation features for small cracks and potholes, and maintain lightweight and efficient computation, a novel structure named BiMCNet is embedded into the YOLO11 model. Its specific implementation steps are shown in Figure 5. When a feature map has only one input path and undergoes no further processing, it typically contributes little to the feature network. For feature maps with two input paths at the same scale, an additional path is added from the backbone features and fused with the features from the PAN path. This processing method achieves enhanced fusion without introducing extra parameter costs.
In traditional Feature Pyramid Networks (FPN), input features are typically treated equally without considering their varying contributions to the output. In contrast, BiMCNet optimizes the contribution of features at different resolutions by assigning unique weights to each channel, thereby improving the feature fusion effect. Its definition is given by Equation (2).
$$O = \sum_{i} \frac{w_i}{\epsilon + \sum_{j} w_j} \cdot I_i \quad (2)$$

In Equation (2), $O$ represents the output, $I_i$ denotes the inputs at each node, and $w_i$ signifies the learnable weights of these inputs. To ensure computational stability, a small value $\epsilon$ is introduced. Additionally, each bidirectional path (including both top-down and bottom-up pathways) is treated as an independent module and can be reused to enhance feature integration. Taking the i-th level as an example, the two feature fusion formulas in BiMCNet are shown in Equations (3) and (4), respectively:

$$P_i^{td} = \mathrm{Conv}\left(\frac{w_1 \cdot P_i^{in} + w_2 \cdot \mathrm{Resize}\left(P_{i+1}^{td}\right)}{w_1 + w_2 + \epsilon}\right) \quad (3)$$

$$P_i^{out} = \mathrm{Conv}\left(\frac{w_1' \cdot P_i^{in} + w_2' \cdot P_i^{td} + w_3' \cdot \mathrm{Resize}\left(P_{i-1}^{out}\right)}{w_1' + w_2' + w_3' + \epsilon}\right) \quad (4)$$
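The fast normalized fusion of Equation (2) can be sketched in PyTorch as follows; the ReLU clamp on the learnable weights follows the EfficientDet formulation, and the class name WeightedFusion is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fast normalized fusion (Eq. 2): each same-shape input feature map
    gets a learnable non-negative weight, normalized by the weight sum."""
    def __init__(self, n_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps

    def forward(self, inputs):                # list of same-shape tensors
        w = F.relu(self.w)                    # keep weights non-negative
        w = w / (w.sum() + self.eps)          # normalize (Eq. 2 denominator)
        return sum(wi * xi for wi, xi in zip(w, inputs))
```

Compared with softmax-based normalization, this form avoids the exponential and is therefore cheaper, which is why EfficientDet adopted it.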
To further enhance contextual information capture and feature selectivity, the BiMCNet architecture also integrates the MCA multimodal cross-attention mechanism. As shown in Figure 6, multimodal cross-attention (MCA) serves as a core mechanism for fusing multi-source heterogeneous data. Its objective is to automatically establish cross-modal correlations within a unified semantic space, achieving information filtering and fusion through learnable attention weights. Taking vision-language tasks as an example, MCA establishes bidirectional dependencies between image and text features: on one hand, textual queries guide the model to focus on semantically relevant regions within images; on the other hand, image context reciprocally enhances the interpretation of textual semantics. In implementation, features from each modality are first mapped into a unified Query-Key-Value (Q-K-V) representation. For image-text pairs, visual features serve as the Query, while linguistic features act as both Key and Value. Attention weights are computed by measuring the similarity between Query and Key (typically using dot product or cosine similarity), followed by softmax normalization. These weights directly determine the weighted aggregation of Values, dynamically amplifying key information.
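A minimal sketch of the Q-K-V cross-attention computation described above is shown below; single-head attention and the class name CrossAttention are simplifying assumptions.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Minimal single-head cross-attention: queries from one source attend
    over keys/values from another, then aggregate the values."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x_query: torch.Tensor, x_context: torch.Tensor) -> torch.Tensor:
        q = self.q(x_query)                       # (B, Nq, D)
        k = self.k(x_context)                     # (B, Nk, D)
        v = self.v(x_context)                     # (B, Nk, D)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v                           # weighted aggregation of values
```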
In the figure, the feature tensor F is transformed into an output feature tensor F′ of the same size. The output process involves two stages, coordinate attention embedding and coordinate attention generation, and the fusion performed by the MCA multimodal cross-attention mechanism can be divided into four steps:
Step A: First, global pooling is performed on the input tensor F to obtain global feature information. The input feature map is encoded along the horizontal and vertical spatial directions before pooling, yielding one-dimensional features in the height and width directions; this enables subsequent global feature extraction while preserving coordinate information. The encoding of the c-th channel at height h and at width w can be described by the following equations:

$$z_c^{h}(h) = \frac{1}{W}\sum_{0 \le i < W} F_c(h, i), \qquad z_c^{w}(w) = \frac{1}{H}\sum_{0 \le j < H} F_c(j, w)$$

Step B: This transformation establishes long-range dependencies along the horizontal direction, enabling attention to focus on lateral global features while preserving precise positional coordinates vertically. Subsequently, the two feature maps with global receptive fields are concatenated and fed into a 1 × 1 convolution. Let $F_1$ denote the 1 × 1 convolution operator and $R$ denote the nonlinear activation function, as detailed in the following equation:

$$Z = R\left(F_1\left(\left[z^{h}, z^{w}\right]\right)\right)$$

Step C: The concatenated feature tensor $[z^{h}, z^{w}]$ obtained in the previous step is an intermediate feature map containing horizontal and vertical feature information. Two 1 × 1 convolution operators, $F_h$ and $F_w$, split $Z$ into two feature tensors $Z^{h}$ and $Z^{w}$ in the horizontal and vertical directions, respectively, transforming them into tensors matching the size of the input feature in each dimension. These are then fed into the sigmoid activation function $\sigma$, as defined by the following equations:

$$g^{h} = \sigma\left(F_h\left(Z^{h}\right)\right), \qquad g^{w} = \sigma\left(F_w\left(Z^{w}\right)\right)$$

Step D: Finally, the attention weights along the two spatial dimensions are applied to the channel features to obtain the output feature map F′. The entire computation process is summarized in the following equation:

$$F'_c(i, j) = F_c(i, j) \times g_c^{h}(i) \times g_c^{w}(j)$$
The attention module not only acquires channel information during encoding but also incorporates positional information. Specifically, the coordinate information retained in the vertical direction enables the network to precisely determine the starting and ending positions of cracks or potholes along the longitudinal axis. Meanwhile, the horizontal dimension facilitates the detection of elongated cracks extending laterally or contiguous damaged areas. This significantly enhances the localization accuracy of multi-scale cracks.
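The four steps can be sketched as a single PyTorch module as follows; the channel reduction ratio and the ReLU activation are assumptions, since the text specifies the operators but not their hyperparameters.

```python
import torch
import torch.nn as nn

class MCABlock(nn.Module):
    """Sketch of Steps A-D: directional global pooling (A), joint 1x1
    encoding (B), per-direction 1x1 convs with sigmoid (C), reweighting (D)."""
    def __init__(self, c: int, reduction: int = 16):
        super().__init__()
        c_mid = max(8, c // reduction)
        self.f1 = nn.Sequential(nn.Conv2d(c, c_mid, 1),
                                nn.BatchNorm2d(c_mid), nn.ReLU())
        self.f_h = nn.Conv2d(c_mid, c, 1)
        self.f_w = nn.Conv2d(c_mid, c, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                       # Step A: (B, C, H, 1)
        z_w = x.mean(dim=2, keepdim=True)                       # Step A: (B, C, 1, W)
        z = torch.cat([z_h, z_w.permute(0, 1, 3, 2)], dim=2)    # concat along H+W
        z = self.f1(z)                                          # Step B: F1 then R
        z_h, z_w = torch.split(z, [h, w], dim=2)                # Step C: split directions
        g_h = torch.sigmoid(self.f_h(z_h))                      # (B, C, H, 1)
        g_w = torch.sigmoid(self.f_w(z_w.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        return x * g_h * g_w                                    # Step D: F' = F * g_h * g_w
```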
2.5. Introduction of the CGeoCIoU Loss Function
Since typical road surface cracks exhibit aspect ratios exceeding 10:1 and arbitrary orientations, the original model’s CIoU still treats targets as “axis-aligned rectangles.” Therefore, the CGeoCIoU (Crack Geometrically Improved Complete IoU) loss function is introduced to enhance the model’s perception of crack boundaries.
YOLO11 employs CIoU as its bounding box loss function, which accounts for the overlap area, center-point distance, and aspect-ratio consistency between the predicted and target boxes. Its calculation is defined by Equations (11)–(13):

$$L_{CIoU} = 1 - IoU + \frac{\rho^{2}\left(b, b^{gt}\right)}{c^{2}} + \alpha v \quad (11)$$

$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2} \quad (12)$$

$$\alpha = \frac{v}{(1 - IoU) + v} \quad (13)$$

where $\rho(b, b^{gt})$ represents the Euclidean distance between the centers of the two boxes, $c$ denotes the diagonal length of the smallest enclosing box covering both, $v$ measures the consistency of width and height between the two boxes, and $\alpha$ is a weight parameter.
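As a reference implementation, Equations (11)–(13) can be computed directly as below; the (x1, y1, x2, y2) box format and the small constant eps are implementation assumptions.

```python
import math
import torch

def ciou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """CIoU loss for (x1, y1, x2, y2) boxes, following Equations (11)-(13)."""
    # Intersection and IoU
    ix1 = torch.max(pred[..., 0], target[..., 0]); iy1 = torch.max(pred[..., 1], target[..., 1])
    ix2 = torch.min(pred[..., 2], target[..., 2]); iy2 = torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    w1, h1 = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    w2, h2 = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    iou = inter / (w1 * h1 + w2 * h2 - inter + eps)
    # Squared center distance rho^2 and enclosing-box diagonal c^2
    rho2 = ((pred[..., 0] + pred[..., 2] - target[..., 0] - target[..., 2]) ** 2 +
            (pred[..., 1] + pred[..., 3] - target[..., 1] - target[..., 3]) ** 2) / 4
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio term v (Eq. 12) and trade-off weight alpha (Eq. 13)
    v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v
```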
In crack detection, due to the characteristics of cracks themselves—such as being slender, highly directional, and having irregular boundaries—traditional IoU-based bounding box loss functions often fail when aligning complexly shaped or inconsistent-scale crack targets. While CIoU, adopted in YOLO11, considers factors like center point distance, aspect ratio, and bounding box distance, it still struggles to address the following issues:
Even when the centers of the two boxes coincide, there may be noticeable differences in corner positions, orientations, or boundary contours;
Angular deviations in elongated crack targets significantly impact detection quality, yet CIoU fails to capture this “rotational misalignment”;
Smaller crack targets are sensitive to geometric variations, yet uniform weighting may impose unfair penalties.
To address this, an improved direction-aware geometric IoU loss function is proposed based on CIoU. The enhanced design is shown in Figure 7, incorporating three optimization terms as follows:
- 1.
Corner Distance Penalty:
Measures the Euclidean distances between the top-left and bottom-right corners of the predicted box and those of the ground-truth box, respectively, and incorporates them as a penalty term.
- 2.
Contour Alignment Penalty:
Based on the minimum bounding contours of the two boxes, a boundary consistency metric is introduced. This term penalizes cases where the predicted box’s edge shape deviates significantly from that of the ground-truth box, and is particularly effective for slender targets such as cracks.
- 3.
Dynamic Weighting Strategy:
To prevent small targets from being overlooked, a target-aware weight adjustment is introduced.
Here, λ is the regularization coefficient, which increases the weight of the penalty terms as the IoU decreases. In summary, the final loss function combines the original CIoU loss with the three penalty terms described above.
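Since the exact penalty formulas are not reproduced here, the sketch below illustrates one plausible form of CGeoCIoU consistent with the descriptions above: a corner-distance penalty normalized by the enclosing-box diagonal, an edge-length-based stand-in for the contour-alignment metric, and an IoU-dependent dynamic weight controlled by λ. It reuses the ciou_loss sketch from earlier in this subsection; all functional forms are assumptions, not the paper’s exact definitions.

```python
import torch

def cgeo_ciou_loss(pred: torch.Tensor, target: torch.Tensor,
                   lam: float = 1.0, eps: float = 1e-7) -> torch.Tensor:
    """Illustrative CGeoCIoU for (x1, y1, x2, y2) boxes: CIoU plus corner,
    contour, and dynamically weighted penalties (forms are assumptions)."""
    base = ciou_loss(pred, target, eps)  # CIoU sketch defined above
    # Enclosing-box diagonal used for normalization
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # (1) Corner distance penalty: top-left and bottom-right corner offsets
    d_tl = (pred[..., 0] - target[..., 0]) ** 2 + (pred[..., 1] - target[..., 1]) ** 2
    d_br = (pred[..., 2] - target[..., 2]) ** 2 + (pred[..., 3] - target[..., 3]) ** 2
    p_corner = (d_tl + d_br) / c2
    # (2) Contour alignment penalty: relative deviation of box edge lengths,
    # a crude stand-in for the boundary-consistency metric
    w1, h1 = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    w2, h2 = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    p_contour = (torch.abs(w1 - w2) + torch.abs(h1 - h2)) / (cw + ch + eps)
    # (3) Dynamic weighting: penalties count more as IoU decreases
    ix1 = torch.max(pred[..., 0], target[..., 0]); iy1 = torch.max(pred[..., 1], target[..., 1])
    ix2 = torch.min(pred[..., 2], target[..., 2]); iy2 = torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    iou = inter / (w1 * h1 + w2 * h2 - inter + eps)
    weight = 1 + lam * (1 - iou)
    return base + weight * (p_corner + p_contour)
```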
This loss function significantly enhances the representation of directionality, detailed position alignment, and boundary contour morphology in crack detection while preserving the stability of the original CIoU. Its effectiveness is further validated in Section 4.1 below.
5. Conclusions
Based on deep learning, this paper investigates pavement crack detection algorithms, proposes optimization improvements to the YOLO11 model, and designs experiments to validate the detection performance of the enhanced model. The research findings are as follows:
This paper addresses the challenges of detecting crack targets in complex road scenarios, including weak textures, high aspect ratios, large scale variations, and strong background interference. Building upon YOLO11 as the baseline, we propose an innovative framework named YOLO11-MBC. The proposed approach incorporates three key improvements: a Feature Fusion Backbone Network (MFFBN) is designed to suppress noise such as oil stains and reflections in the spectral dimension; the BiMCNet neck employs BiFPN combined with MCA cross-modal attention to reconstruct occluded slender cracks in the feature space; and the CGeoCIoU loss with triple penalties for direction, corner points, and contours is introduced to enhance crack boundary perception.
Experiments on the public RDD2022 dataset demonstrate that YOLO11-MBC achieves a 22.5% improvement in F1-score and an 8% improvement in mAP50, while retaining 515 FPS on a Tesla A100 with only approximately 9% additional GFLOPs. These results outperform YOLOv8, YOLOv10, and existing crack detectors. Ablation studies and visualizations further confirm that the three proposed modules act synergistically to reduce missed detections and false boxes in complex road scenes.
While the improvements to the pavement crack detection model presented in this study have achieved preliminary results, the following limitations remain:
While this study has addressed the multi-scale nature of pavement cracks, it has not yet incorporated multimodal features. Future work will focus on integrating multimodal features to further enhance the model.
Existing datasets mainly consist of images captured under clear weather conditions, with a scarcity of samples from challenging climates such as rain, snow, and fog. Future research will expand the dataset by including images from diverse weather conditions and performing generalization tests to improve the model’s adaptability and robustness in real-world environments.
Future work will continue to refine the model architecture to boost detection performance and to enhance the model’s versatility and robustness under the complex, variable conditions encountered in real-world scenarios, with the ultimate goal of developing a more lightweight, efficient, and practical solution.