1. Introduction
Rural roads are extensive and widely distributed, and frequent pavement distress detection plays a key role in prolonging their service life and in enabling them to contribute to local economic revitalization. Compared with national and provincial trunk highways, rural roads are managed and maintained to a lower standard: overgrown roadside vegetation casts dense shadows, ground cover such as weeds and soil is abundant, and cracks vary widely in size. These factors make the automated detection of pavement distress very difficult [1,2]. The pavement condition of typical national and provincial trunk highways and rural roads in the northern region of China is illustrated in Figure 1. National and provincial trunk highways have good pavement quality and little interference owing to periodic maintenance and inspection, whereas rural roads are less maintained and have many interfering objects on the pavement. Furthermore, the complex shape of the cracks is challenging for the detection network.
As illustrated in Figure 1, rural pavement is obscured by shadows from roadside objects such as tree branches, which produces a complex image background and makes pavement cracks difficult for the detection model to recognize [3,4,5,6,7]. In addition, rural pavements carry fine interfering objects, such as weeds, tree branches, and dirt, whose texture resembles that of cracks, leading to misrecognition by the detection network. Furthermore, cracks of different scales may appear in the same image, which further increases detection difficulty. Therefore, a pavement crack detection algorithm with strong anti-interference and multi-scale detection capability is very important for rural pavement detection tasks.
With the development of image processing technology in recent years, road crack detection algorithms based on deep learning [4,8,9,10,11,12,13,14,15] have become the focus of research. These algorithms can be divided into two types: those based on semantic segmentation [16,17,18,19,20] and those based on object detection. Rural pavement crack detection requires reliable accuracy and good detection efficiency. Semantic segmentation-based crack detection algorithms extract crack edge information more accurately, but their detection efficiency is lower and they are better suited to high-grade urban pavements, so they cannot meet the detection needs of rural pavements. Therefore, rural pavement crack detection tasks are better served by object detection algorithms with high detection efficiency and reliable accuracy. Yuan et al. [21] proposed the FedRD model for the rapid detection of pavement distress, which detects pavement distress with high accuracy and good reliability. The FedRD model offers fast detection and remains effective when edge data are limited. However, it does not classify pavement distress in detail, and its detection accuracy is not high for minor pavement distress. Wu et al. [22] proposed a lightweight detection model, YOLO-LWNet, for mobile terminal devices, which uses a lightweight backbone and an efficient feature fusion network built on LWC as the basic block. They compared the model with other models on the RDD public dataset. The model has low computational complexity, but its detection accuracy is poor. Li et al. [23] proposed a model based on U-net that outputs both the segmentation and detection results of cracks. They used a multi-head detection method to output the two kinds of results, thereby constructing a more feature-rich detection model. However, this model performs better on segmentation, and it performs poorly in target classification and detection efficiency. Pham et al. [24] proposed a road crack detection algorithm based on YOLOv7, which combines coordinate attention with label smoothing, integration, and other related accuracy fine-tuning techniques to train a deep learning model. Tests on the RDD public dataset showed that it achieves better detection performance. However, its training method is complex, and its practical applicability in engineering is poor. Deng et al. [25] applied the YOLOv2 recognition network to identify concrete surface cracks. Compared with a region-based convolutional neural network, YOLOv2 has better detection efficiency and accuracy, but compared with current object detection models, it is more complex and less accurate. Liu et al. [26] proposed using nondestructive ground-penetrating radar (GPR) to collect road crack images and processing the collected data with an improved YOLOv3, which achieved high recognition accuracy; however, the sensor is expensive, which makes the approach hard to replicate and limits its practical applicability. Although pavement distress detection algorithms based on deep learning object detection have made progress, there are few academic studies on rural pavement distress detection, and the field lacks large-scale, multi-scenario training datasets covering all distress types. A comparison between the existing pavement distress detection models and the model proposed in this paper is given in Table 1.
To address the aforementioned issues, this study selects the YOLOv5 detection model, which is widely used for pavement crack detection [27,28], as the base model. It uses CrackConv and ADSample in the backbone, and ADSample in the feature fusion section, to reduce the influence of complex background interference on the network. The CAS module is added to the feature fusion section to improve the network’s anti-interference ability for small cracks. The MSConv and MSHead modules are introduced to enhance the network’s multi-scale detection ability, and the resulting network forms the rural pavement detection system. To verify the effectiveness of the algorithm, this study uses a pavement condition monitoring vehicle to collect rural road images in the field, builds the LNTU_RDD_NC dataset for research on rural road crack detection, and validates the proposed algorithm on it.
  2. CrackYOLO Detection Model
The CrackYOLO detection model is primarily composed of the backbone, feature fusion section, and head. The network architecture of the CrackYOLO model is illustrated in Figure 2.
Because shadows obscure rural pavement cracks, the original network extracts target information weakly in the backbone and loses too much target information during downsampling in the feature fusion section. In this study, leveraging the elongated, tubular geometry of cracks, we introduce CrackConv in the backbone to extract crack features more accurately. We also design ADSample, whose adaptive weights can adjust the weight of cracks obscured by shadows, to reduce the loss of shadow-covered crack features during downsampling while enlarging the receptive field. To address the reduced recognition accuracy caused by the blending of interference (e.g., branches or weeds) with crack edge textures, we introduce the CAS adaptive attention module between the backbone and neck and after up-sampling. This allows the model to focus on crack features and strengthens its resistance to interference. To mitigate the misidentification caused by the variable scales of cracks on rural roads, we introduce MSConv to refine the extraction of crack features, thereby enhancing the model’s ability to extract fine cracks. We also replace the detection head with MSHead and feed it the output of MSConv. A scale-aware mechanism is introduced that adaptively allocates semantic weights to feature maps with large resolution differences. This ensures that the model adapts better to targets with significant scale differences.
  2.1. Improvement for Complex Backgrounds
Improvements to the backbone are designed around the geometric shape of cracks. They use CrackConv, a deformable convolution that builds on previous work and helps the model focus on key features. Detecting the minute tubular structure of cracks, which are locally elongated and slender, has long been considered challenging in the field of object detection. Moreover, cracks occupy a relatively small proportion of the entire pavement image and are susceptible to complex background interference, such as shadow coverage, which further complicates feature extraction. We initially expected that feature extraction based on Snake Convolution [29] and deformable convolution would improve the extraction accuracy of occluded cracks. However, when we applied that model directly, no accuracy improvement was observed, so we restructured it and designed CrackConv to guide the model to focus on the key characteristics of the crack itself and to increase the information weight of the crack. The CrackConv structure is shown in Figure 3.
Assuming the central coordinates are represented by $K_i = (x_i, y_i)$, the 3 × 3 kernel $K$ is expressed as follows:

$$K = \{(x_i - 1, y_i - 1), (x_i - 1, y_i), \ldots, (x_i + 1, y_i + 1)\} \quad (1)$$
To enhance the convolution’s focus on complex geometric features and prevent the receptive field from deviating from the target, an offset correction $\Delta$ is employed. This allows the convolution, based on the morphological knowledge of tubular structures, to adapt and concentrate on the local features of elongated and curved cracks. The position of each grid in $K$ is represented as $K_{i \pm c} = (x_{i \pm c}, y_{i \pm c})$, where $c \in \{0, 1, \ldots, 4\}$ indicates the horizontal distance from the central grid. The selection of each grid position $K_{i \pm c}$ in the convolutional kernel $K$ is an accumulative process starting from $K_i$: $K_{i+1}$ adds an offset $\Delta = \{\delta \mid \delta \in [-1, 1]\}$ relative to $K_i$. Therefore, the offsets need to be accumulated ($\Sigma$) to ensure that the convolutional kernel conforms to a linear morphological structure. The construction of coordinates is illustrated in Figure 4.
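The accumulation of offsets can be illustrated numerically. The following is a minimal sketch rather than the authors' implementation: the function name `snake_coords`, the split into a right and a left arm, and the sample offsets are all hypothetical, and only the x-axis direction is shown.

```python
import numpy as np

def snake_coords(center, offsets_pos, offsets_neg):
    """Return kernel coordinates K_{i-c} ... K_{i+c} along the x-axis.

    center      -- (x_i, y_i) of the central grid
    offsets_pos -- per-step vertical offsets for c = 1..C on the right arm
    offsets_neg -- per-step vertical offsets for c = 1..C on the left arm
    """
    x_i, y_i = center
    # Each single offset is restricted to [-1, 1], as stated in the text.
    pos = np.clip(np.asarray(offsets_pos, dtype=float), -1.0, 1.0)
    neg = np.clip(np.asarray(offsets_neg, dtype=float), -1.0, 1.0)
    # Offsets accumulate: the position at distance c depends on all
    # offsets between the centre and c, which keeps the kernel connected.
    y_pos = y_i + np.cumsum(pos)
    y_neg = y_i + np.cumsum(neg)
    right = np.stack([x_i + np.arange(1, len(pos) + 1), y_pos], axis=1)
    left = np.stack([x_i - np.arange(1, len(neg) + 1), y_neg], axis=1)[::-1]
    return np.vstack([left, [[x_i, y_i]], right])

# Hypothetical offsets for a kernel that extends four grids on each side.
coords = snake_coords((10, 10), [0.5, 0.5, -1.0, 0.25], [1.0, 1.0, 0.0, -0.5])
```

Because every step moves the vertical coordinate by at most one grid, neighbouring sampling points never drift apart, which is the property the accumulation is meant to guarantee.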
In Figure 4, the variations along the $x$-axis and $y$-axis within the receptive field are given by the following equations:
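Under the assumption that CrackConv retains the coordinate construction of the Snake Convolution [29] on which it builds, the two directions can be written as follows. The summation limits and the symbols $\Delta x$ and $\Delta y$ are assumptions of this sketch, not confirmed notation.

$$K_{i \pm c} = \begin{cases} (x_{i+c},\, y_{i+c}) = \left(x_i + c,\; y_i + \sum_{i}^{i+c} \Delta y\right) \\ (x_{i-c},\, y_{i-c}) = \left(x_i - c,\; y_i + \sum_{i-c}^{i} \Delta y\right) \end{cases}$$

$$K_{j \pm c} = \begin{cases} (x_{j+c},\, y_{j+c}) = \left(x_j + \sum_{j}^{j+c} \Delta x,\; y_j + c\right) \\ (x_{j-c},\, y_{j-c}) = \left(x_j + \sum_{j-c}^{j} \Delta x,\; y_j - c\right) \end{cases}$$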
Given these variations along the $x$-axis and $y$-axis, CrackConv covers a 9 × 9 receptive field during feature extraction, as illustrated in Figure 4 (right). Beyond the backbone, complex backgrounds can cause the detection model to lose information about the cracks themselves during downsampling. This issue arises because convolutional kernels use the same parameters for feature extraction within each receptive field without considering the differences in target information at different positions.
The original network’s convolution operation does not sufficiently weight crack-specific features, which further compromises the extraction of crack features. We designed ADSample based on the receptive-field expansion idea of RFAConv [30]. It allows the model to adjust the receptive field while preserving as much information about the crack itself as possible, and it replaces the downsampling operation of the original network to improve the model’s resistance to complex background interference. First, global information is extracted from the input features using AvgPool operations that aggregate the features within each receptive field. Next, 1 × 1 convolution operations facilitate the interaction of information within the receptive field. The SoftMax function is then applied to highlight the importance of each feature within the receptive field. Finally, the resulting weights are fused with the spatial features of the receptive field to adjust the weights of the convolutional parameters and output the features. The implementation process is illustrated in Figure 5.
The input feature map can be represented as follows:
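Assuming ADSample keeps the receptive-field attention form of RFAConv [30], the computation described above can be written as follows. The ReLU activation and the exact operator order are assumptions taken from that formulation rather than from this text.

$$F = \mathrm{Softmax}\left(g^{1 \times 1}\left(\mathrm{AvgPool}(X)\right)\right) \times \mathrm{ReLU}\left(\mathrm{Norm}\left(g^{k \times k}(X)\right)\right) = A_{rf} \times F_{rf}$$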
In the above expression, $g^{i \times i}$ denotes a group convolution of size $i \times i$, $k$ represents the size of the convolutional kernel, Norm indicates normalization, and $X$ represents the input feature map. The output is the product of the attention map ($A_{rf}$) and the transformed spatial features ($F_{rf}$) from the receptive field.
ADSample prioritizes the spatial features within the receptive field: it assigns a weight to each feature in the convolution window, multiplies each weight by the corresponding input feature, and sums the results. After its shape is adjusted, the feature map produced by ADSample no longer contains overlapping receptive-field spatial features. Consequently, the learned attention map aggregates the feature information of each receptive-field sliding window, extracting the regions that contain crack information.
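The weighting scheme can be sketched in NumPy as follows. This is an illustrative approximation under stated assumptions, not the ADSample implementation: the function name `rf_attention_downsample`, the per-channel weights `w_attn` standing in for the grouped 1 × 1 convolution, the depthwise kernel `w_conv`, and the stride of 2 are all hypothetical, and normalization layers are omitted.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def rf_attention_downsample(x, w_attn, w_conv, k=3, stride=2):
    """x: (C, H, W) feature map -> (C, ceil(H/stride), ceil(W/stride)).

    w_attn -- (C, k*k) weights turning one pooled value into k*k logits
    w_conv -- (C, k, k) depthwise kernel applied to every window
    """
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    # Unfold into k x k receptive fields, then keep every stride-th window.
    win = sliding_window_view(xp, (k, k), axis=(1, 2))[:, ::stride, ::stride]
    # AvgPool aggregates each receptive field into a single value.
    pooled = win.mean(axis=(-2, -1))                        # (C, Ho, Wo)
    # Stand-in for the 1x1 convolution: one logit per window position.
    logits = pooled[..., None] * w_attn[:, None, None, :]   # (C, Ho, Wo, k*k)
    # Softmax ranks the positions inside each receptive field.
    attn = softmax(logits, axis=-1).reshape(win.shape)      # (C, Ho, Wo, k, k)
    # Weight each position, apply the kernel, and sum over the window.
    return np.einsum("chwij,chwij,cij->chw", attn, win, w_conv)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))
y = rf_attention_downsample(x, rng.normal(size=(4, 9)), rng.normal(size=(4, 3, 3)))
```

Unlike an ordinary strided convolution, the effective kernel here differs from window to window, because the attention weights depend on the content of each receptive field.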
  2.2. Improvement for Fine Interference
To mitigate the impact of fine interference such as weeds, soil, and branches on the model, this study first introduced a hybrid attention mechanism, because attention mechanisms can make convolutional neural networks focus on key information [31,32,33]; however, in actual experiments, the network did not focus on the cracks themselves. We therefore designed the CAS adaptive attention module on the basis of that structure. According to the crack characteristics detected by the network, it adaptively increases the weight of crack information and reduces the weight of interference information, such as branches and weeds. This addresses the decrease in accuracy caused by the blending of rural road crack texture with surrounding interference, thereby improving the model’s resistance to interference. The structure of the CAS module is depicted in Figure 6.
The effective feature map $F$ obtained from the backbone extraction network is first passed through a channel attention mechanism to calculate a weight $Q_C$. The feature map $F$ is then multiplied by $Q_C$, assigning a corresponding weight to each channel, to obtain the channel attention feature $F_C$. Next, a spatial attention mechanism is applied to obtain the weight $Q_S$, and the feature map $F$ is multiplied by this weight to yield the refined spatial features $F_S$. Finally, the output feature map $F'$ is obtained. The transformation of the feature map by the CAS hybrid attention module is expressed by Equation (5).
        
In the above equation, $F$ represents the input feature map, $F_C$ denotes the feature map with assigned channel weights, $F_S$ represents the feature map with assigned spatial weights, and $F'$ represents the feature map output by the CAS hybrid attention mechanism. The channel attention feature $F_C$ in the CAS module is calculated as follows: the input feature $F$ undergoes global average pooling and global max pooling, and each pooled result is processed by a multi-layer perceptron (MLP). The two results are combined, a sigmoid is applied, and the final channel attention feature $F_C$ is obtained. The computation is expressed by Equation (6).
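Assuming the channel branch follows the widely used CBAM-style layout, in which the two MLP outputs are merged by element-wise addition (the text does not name the exact operator), the channel weight would take the following form. The symbols $F^{c}_{avg}$ and $F^{c}_{max}$ for the pooled features are notation chosen for this sketch.

$$Q_C(F) = \sigma\left(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\right) = \sigma\left(W_1\left(W_0\left(F^{c}_{avg}\right)\right) + W_1\left(W_0\left(F^{c}_{max}\right)\right)\right)$$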
        
In the aforementioned equation, $F$ represents a feature map of dimensions $H \times W \times C$, $W_0$ and $W_1$ are the weight coefficients of the MLP, and the two pooled terms denote the feature layers obtained after global average pooling and global maximum pooling, respectively. The spatial attention feature $F_S$ of the CAS module is obtained as follows:
For each feature point in the obtained channel attention feature $F_C$, the maximum value $F_{max}$ and average value $F_{avg}$ along the channel dimension are computed. These values are then concatenated, and a convolution with a single output channel is applied to adjust the channel dimension. Finally, a sigmoid function is employed to obtain the spatial attention feature, as expressed in Equation (7):
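From the components named in the text (a 3 × 3 dilated convolution, channel-wise concatenation, and a sigmoid), the spatial weight plausibly has the following form. The symbol $f^{3 \times 3}$ for the convolution and the order of concatenation are assumptions.

$$Q_S(F_C) = \sigma\left(f^{3 \times 3}\left([F_{avg}; F_{max}]\right)\right)$$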
In this equation, the convolution term represents a 3 × 3 dilated convolutional layer, and [;] denotes the concatenation operation. $F_{avg}$ signifies the feature layer obtained after average pooling along the channel dimension, while $F_{max}$ denotes the feature layer obtained after maximum pooling along the channel dimension.
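The two attention stages can be sketched together in NumPy. This is a simplified illustration, not the CAS module itself: the function names, the ReLU inside the MLP, the additive merge of the two MLP outputs, the dilation rate of 2, and the choice to apply the spatial weight to the channel-refined map are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_weight(f, w0, w1):
    """f: (C, H, W); w0: (C/r, C); w1: (C, C/r). Returns (C,) weights."""
    def mlp(v):
        return w1 @ np.maximum(w0 @ v, 0.0)   # shared MLP, ReLU assumed
    avg = f.mean(axis=(1, 2))                 # global average pooling
    mx = f.max(axis=(1, 2))                   # global max pooling
    return sigmoid(mlp(avg) + mlp(mx))

def spatial_weight(fc, kernel, dilation=2):
    """fc: (C, H, W); kernel: (2, 3, 3). Returns (H, W) weights."""
    # Average and maximum along the channel dimension, then concatenate.
    maps = np.stack([fc.mean(axis=0), fc.max(axis=0)])    # (2, H, W)
    pad = dilation                    # keeps H x W for a dilated 3x3 kernel
    mp = np.pad(maps, ((0, 0), (pad, pad), (pad, pad)))
    h, w = fc.shape[1:]
    out = np.zeros((h, w))
    for i in range(3):
        for j in range(3):
            di, dj = i * dilation, j * dilation
            tap = kernel[:, i, j, None, None] * mp[:, di:di + h, dj:dj + w]
            out += tap.sum(axis=0)    # single output channel
    return sigmoid(out)

def cas(f, w0, w1, kernel):
    fc = channel_weight(f, w0, w1)[:, None, None] * f     # channel stage
    return spatial_weight(fc, kernel)[None] * fc          # spatial stage

rng = np.random.default_rng(0)
f = rng.normal(size=(8, 6, 6))
out = cas(f, rng.normal(size=(2, 8)), rng.normal(size=(8, 2)), rng.normal(size=(2, 3, 3)))
```

Both weights lie in (0, 1), so the module can only attenuate responses; interference is suppressed by receiving smaller weights than crack regions.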
  2.3. Improvement for Multi-Scale Target Detection
The original YOLOv5 detection model incorporates a feature pyramid structure in the neck section to address multi-scale target detection, enabling feature extraction at three different scales. However, for targets with significant scale variations, such as rural pavement cracks, the multi-scale structure of the feature pyramid may still fail to extract all the fine-grained details of the targets. To address the insufficient multi-scale detection capability of the original detection model, this study improves the feature pyramid section by subdividing the channels of its convolutional layers to extract key information for targets of different sizes. In addition, the head is given a greater ability to adjust the weighting of feature maps with different resolutions.
Inspired by the concept proposed in Scale-Aware Modulation [34], this study introduces a multi-scale convolution, MSConv, to transform the original network’s feature pyramid structure. This modification enables finer extraction of spatial features across multiple scales. First, the channels of the input layer are divided into four parts, and 1 × 1, 3 × 3, 5 × 5, and 7 × 7 convolutional layers are applied to the corresponding parts to extract features. The extracted channel features are then refined using pointwise convolution [35], which exchanges information across channels. Finally, the refined features are embedded into the network’s feature fusion convolution. The implementation principle is illustrated in Figure 7.
The input is first split into four equal parts along the channel dimension. Channel features are then exchanged using 1 × 1 convolutions, and the resulting features are concatenated to reconstruct the feature map.
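A minimal NumPy sketch of this split, multi-scale convolution, and channel-exchange pipeline is given below. It is illustrative only: the helper names `depthwise_conv` and `ms_conv`, the use of depthwise kernels for each part, and the ordering (multi-scale convolution, concatenation, then pointwise convolution) are assumptions, and normalization and activation layers are omitted.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def depthwise_conv(x, kernel):
    """x: (C, H, W); kernel: (C, k, k) with odd k; zero padding keeps H x W."""
    k = kernel.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    win = sliding_window_view(xp, (k, k), axis=(1, 2))    # (C, H, W, k, k)
    return np.einsum("chwij,cij->chw", win, kernel)

def ms_conv(x, kernels, w_point):
    """kernels: one (C/4, k, k) array per part, for k = 1, 3, 5, 7.
    w_point: (C_out, C) weights of the pointwise (1x1) convolution."""
    parts = np.split(x, len(kernels), axis=0)             # equal channel split
    feats = [depthwise_conv(p, kern) for p, kern in zip(parts, kernels)]
    fused = np.concatenate(feats, axis=0)                 # back to (C, H, W)
    # The pointwise convolution mixes information across all channels.
    return np.einsum("oc,chw->ohw", w_point, fused)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 10, 10))
kernels = [rng.normal(size=(2, k, k)) for k in (1, 3, 5, 7)]
y = ms_conv(x, kernels, rng.normal(size=(8, 8)))
```

Each quarter of the channels sees a different receptive field, so fine and coarse crack structures are captured in parallel before the pointwise convolution lets them interact.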
        
After processing by MSConv, the feature maps cover a richer range of scales than those output by the original network. However, the original network’s head tends to assign lower weights to feature maps with particularly large or small resolutions, leading to a persistent failure to recognize small cracks. To address this problem, this study introduces MSHead [36], which stacks the input feature pyramid and incorporates a scale-aware mechanism. This mechanism reassigns semantic information weights to feature map layers with significant resolution disparities, thereby enhancing multi-scale detection capability. The structure of MSHead is illustrated in Figure 8.
The input feature map tensor is reshaped into a three-dimensional tensor. The generalized form of self-attention can be expressed by the following formula:
$\Pi_S$ represents scale-aware attention, which fuses features across different scales according to their semantic significance. The input feature map layer undergoes global pooling to obtain the maximum value of the feature map. A 1 × 1 convolution, used to approximate a linear function, integrates all channels. This is followed by the ReLU and sigmoid activation functions. $\Pi_S$ can be represented by the following formula:
where $f(\cdot)$ is a linear function and $\sigma(\cdot)$ is a sigmoid activation function.
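The description above matches the scale-aware attention used in dynamic detection heads. Assuming MSHead follows that formulation, the quantities in this subsection can be written as follows, where $L$ is the number of pyramid levels, $S = H \times W$, $\pi(\cdot)$ is an attention function, and $\mathrm{Pool}_{S,C}$ denotes global pooling over the spatial and channel dimensions. All of these symbols are assumptions of this sketch.

$$F \in \mathbb{R}^{L \times H \times W \times C} \;\longrightarrow\; F \in \mathbb{R}^{L \times S \times C}, \qquad W(F) = \pi(F) \cdot F$$

$$\Pi_S(F) \cdot F = \sigma\left(f\left(\mathrm{Pool}_{S,C}(F)\right)\right) \cdot F$$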
  4. Analysis of Model Method Effectiveness
To demonstrate that the CrackYOLO detection model meets the requirements of rural road crack detection and to validate the effectiveness of the proposed model, seven experiments are conducted by building upon the YOLOv5 base model. The experiments incorporate, in turn, CrackConv, ADSample, the CAS module, MSConv, the MSHead detection head, and the complete CrackYOLO network. The models are compared based on their detection performance. Additionally, two control experiments are conducted to evaluate the model’s resistance to complex backgrounds and its multi-scale detection capability. CrackConv and ADSample, which both aim to reduce the impact of complex backgrounds, are added to the model together as the eighth experiment. MSConv and MSHead, which both provide multi-scale detection capability, are added together as the ninth experiment. The experiments are conducted on the test set allocated according to Table 3, and the recognition results are presented in Figure 14.
The improvements to the main feature extraction network using CrackConv, compared to the original YOLOv5 detection model, resulted in enhanced feature extraction capabilities. The average recognition accuracy increased by 2.82%, and the accuracy for reticular cracks, transverse cracks, and longitudinal cracks increased by 2.74%, 2.89%, and 2.83%, respectively.
Improving the neck with ADSample reduced the loss of shadow-covered crack feature information during downsampling relative to the original YOLOv5 detection model. The average recognition accuracy increased by 0.58%; the accuracy for reticular cracks remained the same, while it increased by 0.58% for transverse cracks and by 1.18% for longitudinal cracks.
Combining the CrackConv and ADSample modules in the base network to resist complex background interference resulted in an average recognition accuracy increase of 4.75%, with increases of 3.40% for reticular cracks, 5.51% for transverse cracks, and 5.36% for longitudinal cracks. This indicates that the CrackConv and ADSample modules designed in this study perform well in rural road crack recognition.
Adding the CAS adaptive attention mechanism to resist small disturbances resulted in an average recognition accuracy increase of 0.87%, with increases of 1.73% for reticular cracks, 0.80% for transverse cracks, and 0.10% for longitudinal cracks. The CAS module therefore provides a modest benefit in rural road crack detection under small interference.
Introducing the multi-scale convolution (MSConv) in the neck enhanced the model’s fine detection capability. Compared to the original YOLOv5 detection model, the average recognition accuracy increased by 1.56%, and the accuracy for reticular cracks, transverse cracks, and longitudinal cracks increased by 3.12%, 0.85%, and 0.72%, respectively.
Replacing the detection head with the multi-scale detection head (MSHead) resulted in an average recognition accuracy increase of 2.69%, with increases of 1.24% for reticular cracks, 3.87% for transverse cracks, and 3.02% for longitudinal cracks.
Adding both the MSConv and MSHead modules to the base network to improve the recognition of multi-scale cracks resulted in an average recognition accuracy increase of 5.44%, with increases of 5.20% for reticular cracks, 6.02% for transverse cracks, and 5.11% for longitudinal cracks. The combination of the MSConv and MSHead modules significantly enhanced rural road crack detection.
Compared to the original YOLOv5 detection model, the CrackYOLO detection model achieved an average recognition accuracy increase of 6.70% and an accuracy increase of 6.16% for reticular cracks, 7.10% for transverse cracks, and 6.85% for longitudinal cracks. This demonstrates that the improved model in this study provides higher accuracy in rural road crack detection.
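The reported gains can be cross-checked with simple arithmetic. The sketch below assumes that the average recognition accuracy is the unweighted mean over the three crack classes, which the text does not state explicitly; the helper names `mean_gap` and `synergy` are hypothetical.

```python
# Reported gains over YOLOv5 in percent:
# (average, reticular, transverse, longitudinal)
gains = {
    "CrackConv": (2.82, 2.74, 2.89, 2.83),
    "ADSample": (0.58, 0.00, 0.58, 1.18),
    "CrackConv+ADSample": (4.75, 3.40, 5.51, 5.36),
    "CAS": (0.87, 1.73, 0.80, 0.10),
    "MSConv": (1.56, 3.12, 0.85, 0.72),
    "MSHead": (2.69, 1.24, 3.87, 3.02),
    "MSConv+MSHead": (5.44, 5.20, 6.02, 5.11),
    "CrackYOLO": (6.70, 6.16, 7.10, 6.85),
}

def mean_gap(name):
    """Distance between the reported average and the mean of class gains."""
    avg, *per_class = gains[name]
    return abs(sum(per_class) / len(per_class) - avg)

def synergy(combo, parts):
    """Combined gain minus the sum of the individual average gains."""
    return gains[combo][0] - sum(gains[p][0] for p in parts)

for name in gains:
    print(f"{name:20s} gap to class mean: {mean_gap(name):.3f}")

print("CrackConv+ADSample synergy:",
      round(synergy("CrackConv+ADSample", ["CrackConv", "ADSample"]), 2))
print("MSConv+MSHead synergy:",
      round(synergy("MSConv+MSHead", ["MSConv", "MSHead"]), 2))
```

Under this assumption, every reported average lies within 0.03 of the class mean, and both paired configurations gain more than the sum of their individual modules, which is consistent with the modules being complementary.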
To address the influence of complex backgrounds, such as shadow coverage, this study designed the CrackConv and ADSample modules to encourage the model to focus more on the features of the cracks themselves. To verify the effectiveness of these modules, the original YOLOv5 detection network was used as the baseline, and the detection results of the aforementioned modules were summarized. Representative images were selected, and heat maps of the network were drawn. The network with both CrackConv and ADSample modules is abbreviated as (V5 + CC + AD), and the summarized results are as follows:
In the sample shown in Figure 15a, the YOLOv5 network missed two cracks covered by tree branch shadows. The network with the ADSample module successfully recognized small cracks under shadow coverage, and the CrackConv module showed good resistance to shadow coverage, resulting in higher recognition confidence.
In Figure 15b, the original network failed to recognize cracks covered by shadows, while the network with the CrackConv and ADSample modules successfully identified cracks under shadow coverage.
Figure 15c shows that the networks with either the CrackConv or the ADSample module successfully recognized cracks covered by shadows but failed to identify the small branches of the cracks. When both modules were added to the network, even the small crack branches were successfully identified.
The CAS module, proposed in this study to address small disturbances such as weeds and branches, was compared with the original network. The performance was analyzed based on the network heat maps, and the comparative detection results are as follows:
In the sample shown in Figure 16a, the original network failed to identify transverse cracks surrounded by soil and animal feces, whereas the network with the CAS module effectively avoided interference from small branches near the cracks.
In Figure 16b, the sample is an enlarged view of small disturbances. The original network misidentified branches in the image as longitudinal cracks, whereas the network with the CAS module was not disturbed by these branches.
This study addressed the difficulty of recognizing cracks when an image contains cracks at multiple scales by designing the MSConv and MSHead modules to improve the multi-scale detection performance of the detection model. To verify the effectiveness of these modules, the original YOLOv5 detection network was used as the baseline, and the detection results of the above modules were summarized. Representative images were selected, and heat maps of the network were generated. The network using both MSConv and MSHead modules is denoted as (V5 + MSC + MSH). The summarized results are as follows:
In the sample shown in Figure 17a, all four detection models exhibited some multi-scale detection capability. However, the original network paid less attention to small cracks. With the addition of the MSConv and MSHead modules, the confidence for the identified small cracks improved.
Figure 17b shows a sample with four cracks of different sizes. The original network could only recognize the larger cracks. After adding MSConv, it recognized two small cracks, and after adding MSHead, it recognized three small cracks. Adding both modules enabled the detection of cracks at all scales. Embedding the MSConv and MSHead modules into the model together therefore significantly increased its multi-scale detection capability.
In Figure 17c, the sample features a larger transverse crack surrounded by smaller longitudinal cracks. The original network failed to identify both scales of cracks. With MSConv alone, the network still struggled to identify the smaller cracks close to the transverse crack. With MSHead added, it successfully identified both scales of cracks.
  5. Conclusions
Rural road crack detection often faces obstacles, including complex backgrounds, disturbances, and variable crack scale, that make existing road crack detection models ineffective. To solve this problem, this study designed the CrackYOLO rural road crack detection model, introducing the CrackConv, ADSample, CAS, MSConv, and MSHead modules to improve crack feature extraction and to handle shadows and variable crack scale. Compared with the baseline model, the proposed model performs well, especially in complex rural environments. To verify the effectiveness of the proposed algorithm, this study created the LNTU_RDD_NC dataset and used it to conduct experiments. The experimental results show that the CrackYOLO model has significant advantages over other commonly used road crack detection models for crack detection on rural pavement. We also conducted experiments on public pavement crack detection datasets covering other scenarios, such as urban pavement. The results show that the CrackYOLO detection model performs well not only in rural pavement scenarios but also in urban pavement and other scenarios.
However, this study has limitations, particularly in the range of targets and scenarios it covers. Future work should extend the model to other detection tasks, such as pothole detection, to broaden its application scenarios.