1. Introduction
Pavement crack detection is essential for traffic safety and road infrastructure maintenance. With more than 64 million kilometers of roads worldwide [
1], pavement defects such as cracks and potholes are major contributors to vehicle damage and traffic accidents [
2]. According to the World Health Organization, road traffic crashes claim over 1.3 million lives each year, and adverse weather conditions—including rain and snow—can increase accident rates by up to 50% [
3]. These extreme weather events not only accelerate pavement deterioration but also significantly complicate crack detection, as visibility is reduced and road surfaces become obscured by precipitation [
4]. In such environments, rain streaks and snowflakes act as visual noise, masking surface textures and interfering with conventional image-based detection algorithms [
5]. Consequently, existing detection systems often struggle to reliably recognize cracks under adverse weather conditions [
6]. This underscores the urgent need for robust and adaptive crack detection methods that can maintain high accuracy and stability across a wide range of challenging weather scenarios.
Conventional approaches for identifying pavement cracks—such as manual surveys or basic image analysis—have long been the norm, but these methods are inefficient and easily influenced by human subjectivity and environmental changes [
7,
8]. As technology has advanced, deep learning and computer vision have transformed the field with models like YOLO [
9,
10,
11,
12,
13,
14,
15,
16] becoming popular for their speed and accuracy in detecting pavement damage. These models excel at handling large datasets and learning diverse visual patterns without handcrafted features [
17]. Despite their strengths, most current models are developed and validated using datasets captured in good weather [
18], leaving them vulnerable when applied to images from rainy or snowy conditions. Rain, snow, and lighting variations introduce complex visual disturbances, further reducing the reliability of crack detection in real-world scenarios [
19]. This gap underlines the need for robust detection solutions that remain effective regardless of the weather [
20].
Object detection constitutes a cornerstone of computer vision research and continues to evolve rapidly, which is propelled by the growing demand for high-precision and real-time automated inspection systems [
21,
22]. The YOLO (You Only Look Once) family of detectors [
23] has played a pivotal role in this evolution. Meanwhile, methods like EfficientDet [
24] have introduced compound scaling and attention-based modules to optimize detection efficiency, and transformer-based [
25] frameworks such as DETR [
26] have redefined detection paradigms through end-to-end architectures. In the specific context of pavement crack detection, these object detection models enable the rapid, accurate localization of surface defects across varied and challenging environments. Compared to segmentation-based approaches [
27], detection models provide superior inference speed and adaptability, greatly improving the practicality and reliability of large-scale road monitoring systems. Collectively, these methodological advances establish a strong technological foundation for robust crack detection, especially in scenarios requiring resilience to complex weather-induced visual disturbances [
28]. However, even with these advances, there remains a need for further architectural improvements to address the unique challenges posed by extreme weather in pavement crack detection.
The pavement crack detection task involves identifying several common types of pavement distress. In this study, the dataset [
29] is divided into four defect categories: class_0 (longitudinal cracks), class_1 (transverse cracks), class_2 (alligator cracks), and class_3 (potholes), which represent the most prevalent and safety-critical forms of road surface damage encountered in real-world scenarios. In practical road monitoring applications, extreme weather presents several key challenges to accurate crack detection, as illustrated in
Figure 1. These challenges are summarized as follows: Low visibility refers to situations where rain, fog, or other adverse weather conditions blur or darken the scene, reducing the clarity and visibility of cracks. Low visibility often leads to missing or incomplete crack region detection, particularly for fine or shallow cracks [
30,
31,
32]. Background noise is characterized by the presence of extraneous visual elements such as raindrops, snowflakes, shadows, vehicles, or strong road markings. These background elements act as visual occluders, masking or confusing the true crack features, and they can result in false detections or missed cracks. Complex cracks refers to defects that are irregular, fragmented, or highly interconnected—such as alligator cracks. Their irregular shapes and spatial distribution increase the difficulty of both detection and accurate localization, especially under adverse environmental conditions.
Figure 1 provides representative examples of these challenging scenarios. The first group of images demonstrates occlusion and background noise caused by precipitation and other environmental factors; the second group illustrates the reduced contrast and blurred features in low-visibility conditions; and the third group shows the challenges associated with complex and fragmented crack patterns. The green bounding boxes in each image indicate the annotated defect regions. These challenges highlight the urgent need for robust pavement crack detection frameworks capable of addressing environmental interference and preserving detection accuracy in real-world, weather-perturbed environments.
To address the aforementioned challenges, this paper proposes an improved pavement crack detection method based on YOLOv12. The method systematically incorporates three key modules aimed at enhancing feature robustness, spatial detail preservation, and adaptability in adverse weather conditions. Specifically, we replace the standard downsampling convolutions in the backbone with the Haar Wavelet Downsampling Block (HWDB) [
33], which effectively preserves both low-frequency and high-frequency information during spatial reduction. In both the backbone and neck, all Cross-Stage Partial modules with 2 × 2 convolutional kernels (C3K2) are replaced by the Strip Pooling Bottleneck Block (SPBB) [
34], which achieves multi-scale and directional context aggregation and improves the detection of elongated, fragmented, and low-contrast cracks. In the upsampling stage, the Dynamic Sampling Upsampling Block (DSUB) [
35] is introduced in place of conventional upsampling operators, enabling content-adaptive spatial feature reconstruction and enhanced boundary detail recovery.
The main contributions of this paper are summarized as follows:
(1) To address the scarcity of extreme weather samples, we selected 2600 images from the RDD2022 [
29] Japan dataset and applied simulation algorithms to generate realistic rain and snow conditions for each image. This process resulted in a diverse experimental dataset encompassing clear, rainy, and snowy scenarios, which provides a solid data foundation for robust crack detection research in complex and adverse environments.
(2) We propose a novel pavement crack detection method based on YOLOv12, integrating the HWDB, SPBB, and DSUB modules, which significantly enhances detection performance under complex and adverse weather conditions.
(3) To address the challenges of information loss and poor recognition of low-contrast cracks caused by precipitation and occlusion, we design the HWDB module, which effectively preserves detailed and edge information through Haar wavelet decomposition.
(4) To tackle the problems of elongated, fragmented, and directionally ambiguous cracks, the SPBB module is introduced in both the backbone and neck, leveraging strip pooling and multi-branch context aggregation to enhance the representation of directional and global features.
(5) To improve the recovery of fine details and crack boundaries in the upsampling stage, we construct the DSUB module, which enables a content-aware reconstruction of high-resolution feature maps via the adaptive prediction of spatial offsets and scaling.
(6) Extensive experiments on the aforementioned multi-scenario datasets as well as the Norwegian dataset of RDD2022 show that our approach achieves superior accuracy and robustness compared to mainstream baseline methods, especially under severe weather conditions.
3. Methodology
3.1. Architecture of the Proposed Method
As illustrated in
Figure 2a, we propose a substantially enhanced pavement crack detection framework, built upon YOLOv12 [
15], that is specifically tailored for the challenging conditions encountered in adverse weather scenarios. Recognizing that standard detection architectures often fail to maintain accuracy in the presence of reduced visibility, precipitation-induced noise, and irregular crack morphologies, we systematically redesign the backbone and upsampling stages of YOLOv12. By introducing specialized modules at these key locations, our approach significantly improves the network’s robustness and detection performance under real-world rainy and snowy environments.
In the backbone, we replace all standard convolutional downsampling layers with the Haar Wavelet Downsampling Block (HWDB) [
33]. Conventional downsampling operations, such as strided convolution or pooling, frequently result in the loss of crucial edge and texture details—a limitation that becomes even more severe in the presence of low-contrast or partially occluded regions caused by adverse weather. To address this challenge, the HWDB leverages Haar wavelet transformation to decompose feature maps into both low- and high-frequency components, enabling the network to capture subtle crack boundaries and fine structural cues that would otherwise be eliminated by traditional methods. This targeted integration markedly enhances the network’s capacity to preserve and propagate essential features, leading to the more reliable detection of faint, blurred, or partially obscured cracks in complex real-world conditions.
For multi-scale and contextual feature fusion, we replace all C3K2 modules with the Strip Pooling Bottleneck Block (SPBB) [
34]. While the original C3K2 modules are computationally efficient, they exhibit clear limitations in capturing long-range dependencies and directional characteristics—an ability that is critical for accurately detecting fragmented or elongated cracks, particularly when these cracks are partially obscured by precipitation such as raindrops or snowflakes. By adopting a dual-branch design that incorporates both strip pooling and global average pooling, the SPBB module enables the aggregation of local and global contextual features with specific emphasis on enhancing representations along horizontal and vertical orientations. This targeted enhancement substantially boosts the network’s ability to model directionality and maintain spatial continuity, thereby improving the detection and structural understanding of cracks under complex and visually challenging conditions.
In the upsampling stage, we replace all traditional upsampling operations, such as nearest neighbor or bilinear interpolation, with the Dynamic Sampling Upsampling Block (DSUB) [
35]. Conventional upsampling methods typically employ fixed, content-independent sampling strategies, which often result in blurred crack boundaries and the loss of fine structural details—issues that are particularly pronounced for small or low-contrast cracks under adverse weather. The DSUB module addresses these challenges by predicting adaptive offsets and scales for each spatial position, thus enabling content-aware dynamic sampling that accurately reconstructs feature details and preserves essential edge information. This targeted approach significantly enhances the recovery of spatial details during upsampling, yielding sharper crack boundaries and more precise localization, even in visually degraded or complex real-world scenarios.
Our improved network first extracts features using the HWDB-augmented backbone, then fuses multi-scale and directional features with SPBB modules, and finally restores spatial details with the DSUB module before outputting high-quality crack detection and classification results. Each structural innovation we design closely corresponds to a key challenge in crack detection under adverse weather conditions, ensuring both theoretical soundness and practical robustness.
3.2. Strip Pooling Bottleneck Block
In adverse weather conditions such as rain and snow, the visibility of pavement cracks is significantly reduced. Cracks often appear as elongated, fragmented, or low-contrast structures, sometimes partially occluded, which greatly increases the difficulty of automatic detection. The main challenges lie in the fact that conventional convolutional architectures have limited capacity for global context modeling and long-range dependency, making it difficult to accurately detect or recover cracks that are blurred or occluded. Moreover, cracks usually exhibit strong directional patterns (e.g., horizontal or vertical extension), while ordinary networks struggle to effectively capture such directional information. Therefore, designing an efficient feature extraction module that can aggregate multi-scale context, directional information, and global semantics is critical to improve crack detection performance under adverse weather.
To address these issues, we introduce the Strip Pooling Bottleneck Block (SPBB), as illustrated in
Figure 2b. This block utilizes multi-branch aggregation and double residual connections to efficiently integrate global, local, and direction-aware features, significantly enhancing the robustness and representation power of the network for complex pavement crack scenarios. The detailed process is as follows:
Given the input feature
X, a
convolution is first applied to extract spatial structure information, resulting in feature
F. Then, a
convolution is performed for channel adjustment and feature mixing, and the result is used as input to the upper and lower branches, which are denoted as
and
:
where
X is the input feature map,
F is the output after the
convolution,
and
denote
and
convolution operations, and
,
are the inputs to the upper and lower branches, respectively.
In the upper branch,
is passed through three sub-paths. The first and third sub-branches perform global average pooling (GAP) followed by
convolution, while the second sub-branch applies a
convolution directly. The outputs are then concatenated along the channel dimension to produce
:
where
denotes global average pooling,
and
are global context features,
is the local detail feature,
indicates channel concatenation, and
is the output of the upper branch.
In the lower branch,
is processed by horizontal and vertical strip pooling, i.e., average pooling along the width and height, respectively, which are each followed by
convolution. The resulting features are then concatenated to obtain
:
where
and
denote average pooling along the height and width, respectively;
and
are the features aggregated in the horizontal and vertical directions, and
is the output of the lower branch.
The outputs of the upper and lower branches are concatenated along the channel dimension to form
, which is then added to the output
F of the first
convolution to obtain the first residual output
:
where
is the concatenated multi-branch feature,
F is the
convolution output, and
is the first residual output.
Finally, the first residual output
is added to the original input
X to obtain the final output
Y of the SPB module:
where
is the output after the first residual connection,
X is the input feature, and
Y is the final output of the SPB module.
In summary, the SPBB module effectively integrates global context, local spatial information, and direction-aware features, significantly improving the network’s ability to detect elongated, fragmented, low-contrast, and strongly directional cracks. The dual residual connections ensure sufficient information flow and feature diversity throughout the module, providing a robust and expressive high-level semantic feature for subsequent detection heads. Overall, the proposed SPB module lays a solid foundation for crack detection in extreme environments and greatly enhances the model’s generalization capability in complex scenarios.
3.3. Haar Wavelet Downsampling Block (HWDB)
Conventional downsampling techniques, such as max or average pooling, inevitably result in the loss of critical spatial details and high-frequency information—an issue particularly detrimental for detecting small, faint, or low-contrast pavement cracks in adverse weather conditions such as rain and snow. Under these circumstances, cracks often exhibit blurred contours, weak intensity, and are prone to occlusion or distortion by environmental noise, as shown in
Figure 2a. Preserving both global structure and fine-grained edge cues during downsampling is thus essential for robust and reliable crack detection in real-world outdoor scenes.
To address these challenges, we introduce an information-preserving the Haar Wavelet Downsampling Block (HWDB), as illustrated in
Figure 2c. Rather than reducing spatial resolution through direct aggregation, the HWDB module leverages an orthogonal frequency decomposition to retain vital spatial information. Specifically, for each
spatial region in the input feature map
X, the HWDB module computes one low-frequency and three high-frequency components according to
where
are the values of each
patch in
X,
denotes the low-frequency (approximation) coefficient, and
represent horizontal, vertical, and diagonal high-frequency components, respectively.
These four sub-bands, each retaining the spatial dimensions
and
C channels, are concatenated along the channel axis to form a frequency-aware feature representation:
with
. To further promote adaptive feature fusion and dimensionality reduction, the HWDB module employs a
convolution followed by batch normalization and a ReLU activation:
where
projects
channels back to
C,
denotes batch normalization, and
is the activation function. The resulting
serves as the downsampled feature, preserving both global context and high-frequency edge cues vital for subsequent crack detection stages.
This orthogonal transform enables the HWDB module to capture and transmit spatial structure, boundary information, and textural details that would otherwise be lost in traditional downsampling. By systematically integrating the HWDB module into all downsampling layers of our detection backbone, the network gains the capacity to robustly localize and delineate cracks—even under heavy rain, snow, and low-visibility conditions—thereby providing a reliable and information-rich feature foundation tailored for adverse-weather pavement crack detection.
3.4. Dynamic Sampling Upsampling Block (DSUB)
Under complex conditions such as rain and snow, pavement cracks often exhibit elongated, fragmented, and low-contrast characteristics, making the spatial structural recovery of feature upsampling crucial for accurate detection. Traditional upsampling methods, such as bilinear interpolation, cannot adaptively adjust sampling locations based on content, leading to blurred edges and the loss of details. As shown in
Figure 2d, to address this, we introduce the Dynamic Sampling Upsampling Block into our method, which implements content-adaptive dynamic upsampling and effectively enhances the recovery of structural and fine details.
Specifically, given an input feature map
, we first use convolutional branches to predict the offset
O and scale
S for each spatial location as
where
O is the output of the offset branch with shape
, and
S is the output of the scale branch with shape
or
.
Then, the offset and scale are multiplied element-wise to obtain a content-adaptive scaled offset:
where
represents the adaptively scaled offset with the same shape as
O.
Next,
is added to the initial reference grid
of the target space to produce the content-adaptive sampling points:
where
denotes the standard sampling positions for each pixel in the target space (typically generated by meshgrid and normalized to
).
The sampling points
P are then reshaped to match the target high-resolution space. After that, the reshaped
P is added to the standard coordinate grid
of the target space, yielding the final dynamic sampling coordinates:
where
indicates the standard coordinate grid for the upsampled output space with shape
.
Afterwards, the features are passed through a pixel shuffle operation to reorganize spatial and channel information:
Finally, the
function is employed, using
G as the sampling coordinates to perform dynamic content-aware sampling on the input features, thereby producing the final high-resolution output:
where
denotes the pixel shuffle operation,
is the feature map after shuffling, and
is the final output with shape
.
By following this content-adaptive dynamic sampling process, the introduced DySample module allows each upsampled pixel to extract features from the optimal spatial location, substantially enhancing the recovery of spatial details and edge information. This provides a robust feature foundation for high-resolution crack detection in complex environments.
4. Experiment
4.1. Datasets and Evaluation Metrics
To comprehensively evaluate the robustness and generalization ability of our method in complex real-world environments, we construct three datasets based on the RDD2022 [
29] Japan subset. First, the original RDD2022 Japan dataset, which contains a variety of pavement types and multiple crack categories (such as longitudinal cracks, transverse cracks, alligator cracks, and potholes), collected under normal weather conditions, is used as the clear-weather baseline. In addition to the rain/snow-augmented Japan dataset, we introduce the Norway subset of RDD2022 as an independent test set to assess the generalization performance of our model. The Norway dataset was collected by the Norwegian Public Road Administration using high-resolution vehicle-mounted cameras and covers both expressways and county roads. Notably, the Norway images capture a wide range of post-rain, snow-covered, and overcast conditions, closely resembling the challenging real-world scenarios our method is designed to address. As highlighted in the original RDD2022 paper, the inclusion of the Norway subset enhances the dataset’s diversity and provides an excellent testbed for evaluating model robustness and transferability across geographical regions and adverse environmental conditions.
Table 1 summarizes the key statistics of the two datasets used in this study, including the number of training and validation images and the number of annotated bounding boxes for each defect category.
Figure 3 shows representative image samples from each of the four major defect classes, providing a clear visualization of the differences in appearance and annotation style.
To simulate realistic rainy conditions, we design a procedural rain synthesis pipeline that overlays multiple layers of randomly distributed lines with varying lengths and angles onto the original images, simulating rain streaks of different intensity and direction. Each layer of rain streaks is further enhanced by applying motion blur along the main rain direction, resulting in a more natural dynamic effect. The rain layers are then fused with the original image using adjustable transparency to control the severity of the rain effect. By adjusting the relevant parameters, we are able to synthesize a wide range of rainfall scenarios, from light drizzle to heavy downpour, providing challenging samples for crack detection under various precipitation intensities. Finally, for snowy conditions, we employ a two-stage synthesis approach. First, we generate a transparent snowflake mask for each image, consisting of thousands of randomly distributed snow particles with variable shapes, sizes, transparency, and trailing effects, capturing the spatial layering and depth of real snowfall. The snow mask is then blended with the clear-weather image at a specified transparency level, resulting in realistic snow occlusion where cracks and pavement may be partially covered. All datasets are split into training and validation sets at a ratio of 10:3, ensuring a balanced distribution of crack categories and weather conditions and providing a solid foundation for the evaluation of our detection framework.
For evaluation, we adopt several standard metrics to assess the effectiveness of our method in pavement crack detection, including Precision, mAP@0.5, mAP@0.5:0.95, and category-wise average precision for each type of crack. Precision measures the proportion of correctly predicted crack instances among all positive detections, reflecting the ability of model to reduce false positives. In addition, we further used precision–recall (PR) curves to compare the performance of the baseline model and the proposed CrackNet-Weather at different IoU thresholds to better characterize the stability and recall performance in low-contrast scenarios and the fine crack detection of the model. mAP@0.5 is a widely used benchmark in object detection, while mAP@0.5:0.95 provides a comprehensive evaluation by averaging performance across multiple IoU thresholds. In addition to the overall results, we also report the average precision for each individual crack category, including longitudinal cracks, transverse cracks, alligator cracks, and potholes, to evaluate the detection accuracy for different defect types. We further report the number of model parameters to indicate model size and resource requirements, and we use GFLOPs to reflect computational complexity and efficiency. All experimental results are evaluated on the validation set to ensure a fair and objective comparison.
4.2. Implementation Details
All experiments were conducted using the Ultralytics YOLOv12 framework on a single NVIDIA RTX 3090 GPU.(NVIDIA Corporation, Santa Clara, CA, USA) During training, the model was trained for 220 epochs with a batch size of 32, and all input images were resized to 640 × 640 pixels. The optimizer used was stochastic gradient descent (SGD) with the default learning rate as provided by the official implementation. To improve the generalization capability of the model, a range of data augmentation strategies were employed during training, including scale augmentation (with the scale factor set to 0.5), mosaic augmentation (set to 1.0), and copy-paste augmentation (set to 0.1); mixup augmentation was not used. Mixed precision training (AMP) was disabled to ensure stable convergence. The dataset was organized in COCO format, and the ratio of training set to validation set was 10:3. Unless otherwise specified, all other hyperparameters and training procedures followed the official Ultralytics YOLOv12 implementation.
4.3. Ablation Studies
To systematically evaluate the effectiveness of each proposed module, we conducted ablation experiments based on the YOLOv12-s. We analyzed the impact of the HWDB [
33], SPBB [
34], and DSUB [
35] modules, both individually and in combination, on detection accuracy and model complexity. The experiments were performed under two adverse weather conditions (rain and snow) and covered four types of pavement cracks. All of the detection results reported in
Table 2 are mAP@0.5:0.95 in percentage form. The corresponding computational complexity (FLOPs) and number of parameters for each configuration are listed in
Table 3.
Effect of the Strip Pooling Bottleneck Block (SPBB). After incorporating the Strip Pooling Bottleneck Block (SPBB), the overall detection accuracy (mAP) increases significantly: from 16.0 to 17.0 under rain and from 15.1 to 16.3 under snow. Notably, the performance for transverse and alligator cracks under rain is markedly enhanced with the mAP for transverse cracks increasing from 7.5 to 8.8 and for alligator cracks from 27.0 to 28.8. While accuracy improves, the FLOPs rise moderately from 21.5 G to 29.5 G, and the parameter count slightly decreases from 9.25 M to 9.10 M. This indicates that the SPBB module enhances the ability of model to capture directional and long-range contextual features with only a slight reduction in computational efficiency. Considering that real-world road monitoring often involves many elongated, directional, and occluded cracks, the SPBB module greatly improves the robustness of the model in complex and challenging environments.
Effect of the Haar Wavelet Downsampling Block (HWDB). The Haar Wavelet Downsampling Block (HWDB) also yields outstanding results with the mAP rising to 17.3 under rain and 16.2 under snow. The most significant gain appears in the detection of potholes, where the mAP increases from 16.0 to 18.1 under rain. Importantly, the HWDB module drastically reduces the number of model parameters (from 9.25 M to 8.19 M) and lowers FLOPs to 18.9 G, making the model more lightweight. This demonstrates that the HWDB module can effectively retain important structural information during downsampling while considerably reducing model complexity, enhancing the detection of low-contrast and weak-texture cracks. The HWDB module is thus well suited for resource-constrained or large-scale road monitoring applications.
Effect of the Dynamic Sampling Upsampling Block (DSUB). The Dynamic Sampling Upsampling Block (DSUB) primarily improves the recovery of fine-grained structural information, especially for transverse cracks, where the mAP under rain increases substantially from 7.5 to 10.4. The overall mAP reaches 16.9/15.9 (rain/snow). The introduction of the DSUB has a negligible effect on FLOPs (21.3 G) and does not increase the parameter count (9.25 M). This shows that the DSUB, by means of content-adaptive spatial feature reconstruction, greatly enhances the perception of crack boundaries and small objects in complex backgrounds for the model while maintaining practical inference speed. This capability is essential for the timely detection of early-stage pavement crack.
In conclusion, the systematic ablation study demonstrates that the proposed three modules not only lead to significant improvements in detection accuracy across different crack categories and adverse weather conditions but also maintain a well-balanced computational complexity. Specifically, the SPBB, HWDB, and DSUB modules each provide targeted structural enhancements for challenging crack features such as occlusions, low contrast, and fine or fragmented patterns that frequently occur in real-world road environments. Regardless of the type—longitudinal, transverse, alligator, or potholes—these innovative modules show strong adaptability and robustness, meeting the dual demands for complex scene recognition and efficient inference in practical road monitoring. Importantly, the overall model maintains a reasonable parameter size and FLOPs while achieving improved accuracy, which is beneficial for deployment on edge devices or in resource-constrained engineering scenarios. Taken together, the experimental results confirm that the improved architecture proposed in this study offers a more reliable and efficient solution for pavement crack detection under adverse weather and complex scene conditions.
4.4. Comparisons
To further validate the effectiveness and generalization capability of our proposed method in the field of pavement crack detection, we conducted comparisons with several representative mainstream object detection models, including YOLOv8-s, YOLOv10-s, YOLOv11-s, YOLOv12-s, and YOLOv13-s. All models were trained and evaluated on the same rainy and snowy pavement crack detection dataset. The experimental results are presented in
Table 4, where mAP@0.5:0.95, FLOPs, and the number of parameters comprehensively reflect the detection ability and model complexity.
As shown in
Table 4, our method achieves the best overall detection performance under both rainy and snowy conditions with an average mAP of 17.8 (rain) and 16.8 (snow). Compared to common baselines such as YOLOv8-s and YOLOv12-s, our method improves the mAP by 0.9/0.8 and 1.8/1.7 points (rain/snow), respectively. For the challenging alligator cracks and potholes categories, which are difficult due to their complex shapes and low contrast, our method achieves mAP values of 29.7/29.3 and 19.1/17.0, respectively, demonstrating a more pronounced advantage over other models.
In terms of computational complexity, although the FLOPs (26.9 G) of our method are slightly higher than those of YOLOv8-s and YOLOv12-s, the parameter count is only 8.05 M, which is the lowest among all of the compared models. Considering the significant improvement in detection accuracy achieved by our method, this moderate increase in FLOPs is acceptable. Overall, our method can effectively improve the detection accuracy of pavement cracks under adverse weather and complex pavement conditions while maintaining reasonable model complexity. The model is suitable for road inspection scenarios with high accuracy requirements and also takes into account inference speed and resource consumption, which offers clear advantages for addressing challenges such as environmental interference and hard-to-detect targets in real-world road monitoring. These results further demonstrate the practicality and application potential of our method in the field of intelligent pavement crack detection.
To further verify the generalization capability of the proposed method, we conducted additional experiments on the Norway pavement defect dataset. The comparison results with mainstream detection models are summarized in
Table 5. Our method (CrackNet) achieves an AP
50:95 of 9.22% and an AP
50 of 24.24%, which outperforms YOLOv8-s, YOLOv11-s, YOLOv12-s, and YOLOv13-s, and these results are comparable to or better than the two-stage models Faster R-CNN and Cascade. In particular, CrackNet surpasses YOLOv12-s by 1.83 percentage points in AP
50:95 and 5.42 percentage points in AP
50, demonstrating superior generalization performance on an unseen dataset.
Meanwhile, CrackNet maintains a lower computational complexity with only 26.91 GFLOPs and 8.05 M parameters, which is significantly less than Faster R-CNN (151.31 GFLOPs, 28.30 M) and Cascade (231.00 GFLOPs, 69.16 M). These results indicate that our method achieves a good trade-off between detection accuracy and computational cost, and it possesses strong generalization ability across different datasets and scenarios.
To provide a more comprehensive evaluation, we further examined the performance of CrackNet-Weather through precision–recall (PR) curves under different IoU thresholds. Specifically, we compared the baseline model with our proposed model across two thresholds and two datasets (rainy and snowy conditions) in order to highlight their differences in precision–recall trade-offs.
When the IoU threshold is 0.5, as shown in
Figure 4, CrackNet-Weather shows clear improvements in recall across both rainy and snowy datasets. The PR curves consistently shift toward the top-right region compared with YOLOv12, which indicates that the model can capture a larger proportion of positive instances while sustaining comparable precision. Importantly, the gains are most evident for longitudinal and transverse cracks, which are more likely to appear faint and fragmented in rainy or snowy conditions. This suggests that the proposed model is more capable of suppressing missed detections for low-contrast crack types, which is a factor that directly enhances safety relevance.
When the IoU threshold is 0.5–0.95, as shown in
Figure 5, the evaluation emphasizes not only whether the crack is detected but also whether the crack is accurately localized. Under this stricter criterion, CrackNet-Weather still maintains a larger area under the PR curve compared to YOLOv12 in most categories. This advantage is particularly evident in cracks and potholes, which have irregular shapes and blurred edges in inclement weather, making accurate localization difficult. CrackNet-Weather’s superior stability under these conditions highlights its strong feature representation and localization capabilities, ensuring reliable detection even under challenging alignment requirements.
The dual-level PR analysis confirms that CrackNet-Weather provides benefits beyond higher mAP scores. It improves the recall of fine and low-contrast cracks at lower IoU thresholds, reducing the likelihood of critical missed detections. At the same time, it sustains robust localization accuracy at stricter thresholds, reflecting a stronger generalization ability under diverse weather scenarios.
4.5. Model Robustness Verification
To rigorously evaluate the robustness and generalization ability of CrackNet-Weather, we constructed two balanced test subsets: a Normal Weather Dataset (NWD) and an Extreme Weather Dataset (EWD). Specifically, we randomly selected 2000 images from the original Japan dataset to represent the normal weather scenario, and then we selected another 2000 images from our synthesized snow datasets to represent extreme weather conditions. In both subsets, we ensured that the distribution of defect types was balanced, so that each major defect category is equally represented. This design guarantees an objective and fair comparison of model performance across different environmental conditions.
As illustrated in
Figure 6, in addition to the comparison under normal and extreme weather subsets, we further verified the robustness of the models under parameterized visual degradations. Specifically, three levels of localized Gaussian blur (radius = 1, 1.5, 2) were applied to crack regions to simulate the partial blurring of pavement defects under rainy or snowy conditions, while brightness factors of 0.6, 1.0, and 1.4 were used to emulate dark, normal, and bright illumination scenarios. It is worth noting that both the blur and brightness variations were applied on top of the normal rain subset to simulate additional imaging degradations. Furthermore, light and heavy rain/snow conditions were synthesized using our augmentation algorithms (Algorithms 1 and 2) with explicit parameter settings: light rain (
Nlayers = 1,
Nlines = 400,
kmb = 5,
= 0.15), heavy rain (
Nlayers = 3,
Nlines = 1200,
kmb = 9,
= 0.25), light snow (
Nsnow = 3k,
size = [1, 3],
= 0.15, streak = 20%), and heavy snow (
Nsnow = 10k,
size = [3, 6],
= 0.25, streak = 50%). These controlled perturbations enable a systematic and reproducible evaluation of model robustness across diverse degradation scenarios, ensuring that the observed performance differences are not limited to a single weather condition but generalize across a spectrum of visual interferences.
Table 6 presents the performance of both models across different weather conditions. Under normal weather conditions (NWD), the baseline model YOLOv12 achieved an AP
50–95 value of 18.0% and an AP
50 value of 40.4%. In comparison, the proposed CrackNet achieved AP
50–95 and AP
50 values of 19.2% and 43.2%, respectively, indicating improved detection capability under ideal conditions. However, under extreme weather conditions (EWD), CrackNet demonstrated significant robustness advantages. Specifically, CrackNet achieved an AP
50–95 value of 16.8%, representing a 1.7% improvement over the baseline model. For the AP
50 metric, CrackNet reached 38.3%, outperforming the baseline model’s 34.3%. These results indicate that the proposed method can better handle visual interference and feature degradation caused by adverse weather such as rain and snow, and they demonstrate its potential for deployment in challenging real-world scenarios that require reliable pavement defect detection and automated road inspection.
Table 7 summarizes CrackNet’s performance under a spectrum of weather- and imaging-induced degradations. A consistent trend emerges across conditions: as the severity of the degradation increases, detection performance decreases. In particular, heavier rain/snow introduces denser streaks and occlusions that suppress edge contrast along crack boundaries, leading to larger drops in both
and
. For localized Gaussian blur applied on the Normal (Rain) subset, the metrics degrade monotonically with the blur radius, reflecting the model’s sensitivity to a loss of high-frequency cues that are critical for accurate crack localization. The brightness variation further shows that under-exposure (darker scenes) is more detrimental than over-exposure, which is consistent with the reduction of signal-to-noise ratio and diminished crack–background separability under low illumination.
Notably, exhibits larger relative drops than across all adverse settings, indicating that degradations affect precise localization at higher IoU thresholds more than coarse detection at 0.5 IoU. Despite these challenges, CrackNet maintains stable performance trends across diverse perturbations and preserves competitive values, evidencing the robustness of its feature representations and decision head. The consistent ranking of difficulty (normal > light > heavy; small r > large r; normal/bright > dark) and the controlled, parameterized construction of the subsets further suggest that the model’s behavior generalizes systematically rather than overfitting to a particular weather type. Taken together, the results indicate that CrackNet retains reliable detection capacity across parameterized adverse-weather and imaging degradations, supporting its use in non-ideal, real-world pavement inspection.
To quantitatively evaluate robustness under different weather severities and imaging degradations, we conducted graded tests on
light/normal/heavy rain and snow, and we further tested under
partial Gaussian blur and
brightness variation (under-/over-exposure).
Table 6 reports the AP
50:95 and AP
50 values on the RDD2022-Japan subset.
Algorithm 1 Pseudocode for Procedural Snowfall Augmentation |
Input:
Original image Image directory D (for batch processing) Number of snowflakes Mask transparency Snow particle size range Snow mask output path Output: Step 1: Generate Snow Mask
- 1.
Create an empty RGBA mask M with the same size as . - 2.
For to : - 3.
Apply Gaussian blur (e.g., ) to M for realism. - 4.
Save M as a snow mask to . Step 2: Blend Snow Mask with Original Image
- 1.
For each image in directory D: Resize M to match if needed. For each pixel : - –
Compute blended pixel:
Save to .
- 2.
Output the snow-augmented images.
|
Algorithm 2 Pseudocode for Procedural Rainfall Augmentation |
Input:
Original image Image directory D (for batch processing) Number of rain layers Number of rain lines per layer Rain line length range Rain line angle mean and std Rain line thickness range Rain line brightness range Motion blur kernel size Transparency (per-layer alpha) range Output path Output: Step 1: Generate and Overlay Rain Layers
- 1.
Initialize - 2.
For to : Create empty mask with same size as For to : - –
Randomly sample start point - –
Randomly sample length - –
Randomly sample angle - –
Compute end point by and - –
Randomly sample thickness - –
Randomly sample brightness - –
Draw line from to with width , brightness on
Apply motion blur with kernel size along to Randomly sample transparency Blend onto :
Step 2: Batch Processing and Output
- 1.
Repeat the above for each in directory D - 2.
Save to - 3.
Output the rain-augmented images
|
To further characterize stability, we summarize each factor (rain, snow, blur, brightness) with two severity-aware metrics:
where
,
, and
denote the overall AP
50:95 measured on the Light, Normal, and Heavy subsets of the same factor; “pp” denotes percentage points. For brightness, severities are ordered as
over-exposed (Light) →
normal (Normal) →
under-exposed (Heavy).
Table 8 condenses the severity-wise behavior of our detector using AP
50:95. Across all factors, the
Heavy-level AP remains a large fraction of the
Light-level AP—85.5% for rain, 82.8% for snow, 90.7% for blur, and 87.7% for brightness—placing the worst-case loss within a narrow band (Normalized Drop 9.26–17.23%). The
per-level degradation is also modest: the Severity Slope lies between
and
pp/level with
snow being the hardest (17.23%,
) and
blur the mildest (9.26%,
);
rain sits in between (14.51%,
), while
brightness is moderate (12.29%, which is
), consistent with the noted asymmetry that under-exposure harms more than over-exposure. Taken together—
bounded worst-case loss and
small, nearly constant per-level loss—these numbers show that accuracy degrades
gradually and predictably as conditions worsen; i.e., the model is stable under graded adverse weather and imaging degradations.
6. Discussion
In addition to the overall quantitative results, we further analyze representative error cases to better understand the model’s limitations under adverse weather conditions.
Figure 5 presents three typical examples: class confusion, false positives, and missed detections.
As shown in the left column of
Figure 8, the model sometimes misclassifies potholes (class_3) as alligator cracks (class_2). This confusion is particularly common when potholes are partially filled with water, snow, or debris, which can alter their boundary shapes and visual textures. In such cases, the edge of the pothole becomes fragmented or network-like, resembling the patterns of alligator cracks. Adverse weather further exacerbates this issue by blurring boundaries and reducing feature distinctiveness. This phenomenon suggests that incorporating more discriminative features or additional context cues may help mitigate class confusion in future work.
The middle column of
Figure 8 presents a typical false positive case. In the ground truth, only a single longitudinal crack is annotated. However, the model incorrectly detects an additional defect, identifying a region of water accumulation on the road surface as a pothole (class_3). This type of misclassification frequently occurs under rainy conditions, where surface water can form reflective or irregular patterns that differ in appearance from the surrounding pavement. Factors such as light refraction, surface roughness, and varying water depth can create localized dark or distorted regions that visually resemble the typical signatures of potholes. To mitigate this issue, future work could incorporate multi-frame temporal cues, spectral information, or additional sensor modalities (such as polarization imaging or LiDAR) to more reliably differentiate between water accumulation and real structural defects.
The right column of
Figure 8 shows a missed detection, where an obvious longitudinal crack (class_0) is present in the ground truth but not detected by the model. Missed detections often occur in scenes with extremely low contrast or severe occlusion due to snow cover or wet road surfaces. In these conditions, the intensity difference between cracks and the surrounding pavement is diminished, and crack boundaries are partially hidden, making it challenging for the model to extract reliable features. Future efforts could explore multi-modal sensing (e.g., combining RGB with infrared or LiDAR data) or improved attention mechanisms to enhance the detection of subtle or occluded cracks.
Despite these promising results, CrackNet-Weather still exhibits certain limitations and areas for further improvement, as evidenced by the representative misclassification cases analyzed above. We refer to the following limitations specifically:
1. While the model shows robust performance overall, misclassification and missed detection still occur in challenging scenarios, particularly when environmental artifacts such as water accumulation or snow cover visually resemble true pavement defects, or when cracks are faint and difficult to distinguish under low-contrast or occluded conditions. These issues highlight the need for more advanced feature extraction, context modeling, and possibly the integration of multi-modal sensor information (e.g., infrared or LiDAR) to better distinguish between real defects and environmental noise.
2. The current training dataset is limited in scale and diversity. Future efforts should focus on further enriching the dataset to cover a wider range of regions, pavement types, and weather conditions, thereby improving the generalization and adaptability of the model.
3. Although CrackNet-Weather maintains a balance between detection accuracy and computational efficiency, there is still potential for further optimization in model size and inference speed. Approaches such as network pruning, quantization, and lightweight architecture design could facilitate real-time deployment on edge and mobile devices.
4. Another promising future direction is to incorporate semantic segmentation methods after accurate crack detection in order to isolate crack pixels and estimate crack widths. This would enable the identification of fine and narrow cracks that are difficult to capture with detection-based methods alone. Such an extension could greatly benefit the early detection of pavement defects and provide more detailed structural information for preventive road maintenance.
5. The current study relies solely on RGB imagery, which limits the ability to distinguish visually similar artefacts (e.g., water accumulation versus potholes) and to detect faint cracks under low-contrast conditions. Incorporating additional sensing modalities such as thermal infrared or depth could provide complementary cues, which represents an important direction for future work.
6. This study does not include inference speed or latency tests on embedded/edge hardware platforms (e.g., NVIDIA Jetson). While the proposed method achieves favorable complexity–accuracy trade-offs in simulation, its deployment performance on resource-constrained devices remains to be systematically evaluated in future work.
In summary, CrackNet-Weather offers an innovative and practical solution for pavement crack detection under adverse weather and complex pavement conditions. Nevertheless, ongoing optimization in model generalization, adaptability to extreme environments, and engineering deployment will be necessary to fully realize its potential in real-world applications. With continued advances in dataset diversity, model architecture, and multi-modal fusion, the proposed framework holds significant promise for large-scale intelligent road inspection, infrastructure health monitoring, and smart city development.
7. Conclusions
In this paper, we proposed CrackNet-Weather, which is a robust and efficient pavement crack detection framework specifically designed to address the challenges posed by adverse weather conditions such as rain and snow. By systematically integrating the Haar Wavelet Downsampling Block (HWDB), Strip Pooling Bottleneck Block (SPBB), and Dynamic Sampling Upsampling Block (DSUB) into the YOLOv12 backbone, the proposed method effectively enhances the extraction and fusion of frequency, contextual, and structural information, significantly improving the detection accuracy of low-contrast, fine, and complex cracks.
Extensive experiments conducted on a diverse and challenging dataset—including both clear and weather-augmented images as well as a cross-domain test on the Norway dataset—demonstrate that CrackNet-Weather outperforms mainstream baseline models. The proposed framework not only achieves a higher mean Average Precision for all major defect categories but also maintains a favorable balance between detection accuracy, model size, and computational complexity. These results validate the effectiveness and practicality of CrackNet-Weather for real-world road inspection and large-scale deployment.
Despite these achievements, several issues remain for future research. In particular, expanding the dataset to cover more geographic regions and additional weather scenarios, as well as further optimizing the model for lightweight deployment, will be critical to improving generalization and adaptability. Incorporating multi-modal sensor information and exploring advanced feature enhancement strategies are also promising directions to further address challenging cases of misclassification and missed detection.
Overall, CrackNet-Weather provides an effective and scalable solution for pavement crack detection under adverse weather, holding substantial potential for application in intelligent transportation infrastructure monitoring and automated road maintenance systems.