1. Introduction
With the continuous development of China’s economy, the pace of road infrastructure construction has accelerated significantly [1]. As an essential transportation facility, roads play a vital role in promoting economic development [2]. However, road cracks represent one of the primary issues affecting both the safety and longevity of road usage. Currently, road cracks manifest in various forms, including vertical, transverse, and oblique cracks [3]. These cracks are predominantly caused by a combination of factors, such as construction practices, temperature variations, and changes in vehicle loading. If early cracks are not repaired promptly, they may progress into severe damage, adversely affecting the road’s aesthetic appeal and driving comfort, while also posing a direct threat to traffic safety. In recent years, the prevalence of road cracks has led to vehicle damage and even traffic accidents, particularly in older urban areas and on national highways where cracks are frequent. Consequently, the issue of road cracks has emerged as a significant constraint on the safe operation of pavement infrastructure.
Traditional methods of road crack detection primarily rely on manual inspection techniques, such as visual assessments or handheld equipment. These approaches are characterized by high labor intensity and inefficiency, and they are often influenced by the operator’s experience and subjective judgment [4]. Consequently, the results of these inspections can exhibit significant uncertainty and a risk of oversight. Furthermore, these methods face substantial limitations when applied to large-scale road network inspections, particularly regarding time and labor costs, making it challenging to meet the demand for high-frequency and high-precision monitoring necessary for the maintenance of modern transportation facilities. Although some developed regions have adopted advanced technologies such as infrared radar, three-dimensional laser scanning vehicles, and high-precision inspection vehicles, these systems tend to be expensive, complex to operate, and require a high level of expertise for maintenance. This dependency restricts their adoption in small and medium-sized cities or less developed areas. Therefore, there is an urgent need for an automated road crack detection scheme that is efficient, intelligent, cost-effective, and scalable. Such a system would enable rapid sensing and timely intervention regarding road infrastructure conditions, thereby supporting the sustainable development of intelligent transportation and road health management. With the rapid advancement of deep learning and computer vision technologies, crack detection methods based on visual perception have emerged as a significant research focus. The potential for deployment on flexible platforms, such as unmanned aerial vehicles (UAVs) and mobile terminals, is increasingly prominent, offering new avenues for the intelligent identification and large-scale monitoring of road defects.
In recent years, with the advancement of deep learning, target detection algorithms have increasingly become the dominant methods for crack detection. These algorithms are primarily categorized into two types: two-stage detection methods (such as R-CNN [5], Faster R-CNN [6], and Mask R-CNN [7]) and single-stage detection methods (including YOLO [8], SSD [9], and RetinaNet [10]). Two-stage target detection algorithms first generate candidate boxes (region proposals) from the image and then classify and refine these candidates. Convolutional Neural Networks (CNNs) are widely recognized for their superior feature extraction capabilities. Recent studies have demonstrated various applications of these algorithms: Wang et al. [11] developed a crack recognition model based on Mask R-CNN to achieve pixel-level segmentation; Li et al. [12] introduced the SENet attention mechanism to optimize Faster R-CNN, enhancing its feature expression ability across different levels; Kortmann et al. [13] applied Faster R-CNN to the recognition of multinational, multi-type road damage, verifying its cross-regional adaptability; Xu et al. [14] compared the performance of Faster R-CNN and Mask R-CNN under small-sample conditions, proposing a joint training strategy; Balcı et al. [15] achieved high accuracy in road crack detection by combining data augmentation with Faster R-CNN, attaining a mean average precision (mAP) of 93.2%; Gan et al. fused Faster R-CNN with a BIM model to facilitate 3D visual modeling of cracks at the bottom of bridges; and Lv et al. [16] designed an optimized CNN model capable of accurately identifying vertical, transverse, and reticulation cracks, achieving a recognition accuracy of up to 99%. In contrast, single-stage target detection algorithms directly analyze the entire image, offering the advantage of rapid detection speed, although they may be slightly less accurate. Nevertheless, single-stage methods demonstrate impressive performance in crack detection tasks using visible images, particularly the YOLO series of models, which are extensively utilized in practical engineering due to their end-to-end architecture and high efficiency. The YOLO series of target detection algorithms continues to evolve and is widely applied in road crack detection tasks. Current research primarily focuses on enhancing model accuracy, developing lightweight structures, and improving robustness in complex scenarios.
In crack detection research, the YOLO series has increasingly become the predominant framework due to its efficient end-to-end architecture and real-time performance. For instance, Zhou et al. [17] enhanced accuracy and speed on the RDD2022 dataset by integrating SENet channel attention, K-means anchor optimization, and SimSPPF. Similarly, Zhen et al. [18] improved the detection of complex backgrounds and small cracks through denoising and anchor optimization techniques. Du et al. [4] proposed BV-YOLOv5S, which combines BiFPN and Varifocal Loss to facilitate robust multi-class pavement defect detection. Karimi et al. [19] applied YOLOv5 to identify multi-material cracks in historical buildings, demonstrating its strong adaptability to various materials. For small cracks, Li et al. [20] developed CrackTinyNet, based on YOLOv7, which incorporates BiFormer, NWD loss, and SPD-Conv, achieving high accuracy in both public datasets and real-vehicle tests. Regarding YOLOv8, Wen et al. [21], Zhang et al. [22], and Yang et al. [23] have respectively improved accuracy, inference speed, and deployment through detection-layer optimization, lightweight convolutions, and global attention mechanisms. Manjusha et al. [24] confirmed the overall superiority of YOLOv8 through comparative analysis. Recent YOLO-based variants include Rural-YOLO [25], which enhances disease detection accuracy and enables lightweight deployment using attention and feature enhancement modules, and YOLO11-BD [26], which reduces computation while improving accuracy. Additionally, fusion architectures such as YOLO-MSD [27], PC3D-YOLO [28], ML-YOLO [29], RSG-YOLO [30], and EMG-YOLO [31] have been effectively applied to industrial surface defects, track panel cracks, rail cracks, and multi-type defect detection. Zhou et al. [32] proposed an improved YOLOv5 network with integrated convolutional block attention modules (CBAM) and residual structures to enhance insulator and defect detection in UAV images, achieving notable gains in precision and robustness. Chen et al. [33] introduced DEYOLO, a dual-feature-enhancement framework based on YOLOv8, which incorporates semantic-spatial cross-modality modules (DECA and DEPA) and a bi-directional decoupled focus mechanism, significantly improving RGB-IR object detection under poor illumination. Similarly, Xu et al. [34] presented an improved YOLOv5s that embeds a shallow feature extraction layer and window self-attention modules into the Path Aggregation Network, effectively addressing the challenges of detecting small objects in remote sensing images with high accuracy and real-time performance on the DIOR and RSOD datasets.
In addition to the continuous optimization of the YOLO series, emerging network architectures also provide new solutions for crack detection. The Transformer architecture notably enhances global feature modeling. Chen et al. [35] proposed iSwin-Unet, which integrates hopping attention and residual Swin Transformer blocks with Swin-Unet, achieving significant performance gains across multiple public datasets. Saberironaghi et al. [36] introduced DepthCrackNet, which incorporates spatial and depth enhancement modules to maintain high accuracy and computational efficiency for detecting fine cracks in complex backgrounds. Wang et al. [37] developed CGTr-Net, which fuses CNNs with Gated Axial-Transformers alongside feature fusion and pseudo-labeling strategies, demonstrating robust performance for thin and discontinuous cracks. Zhang et al. [38] applied generative diffusion models to crack detection through the CrackDiff framework, where a multi-task UNet simultaneously predicts masks and noise, enhancing the integrity and continuity of crack boundaries. In remote sensing road extraction, Hui et al. [39] fused global spatial features with Fourier frequency-domain features using the Swin Transformer, effectively separating and enhancing high- and low-frequency information to improve road–background separability and boundary continuity. This approach achieved IoU scores of 72.54% (HF), 55.35% (MS), and 71.87% (DeepGlobe). Guan et al. [40] proposed Swin-FSNet for unpaved roads, employing the Swin Transformer as the encoder backbone along with a discrete-wavelet-based frequency-aware decomposition module (WBFD) for direction-sensitive features. A hybrid dynamic serpentine convolution module (HyDS-B) in the decoder adaptively models curved and bifurcated roads, achieving a road IoU of 81.76% (self-collected UAV dataset) and 71.97% (DeepGlobe) with notable structural preservation.
In dual-modal and integrated approaches, other researchers combine diverse features and sensing data to enhance detection adaptability and utility. Zhang et al. [41] integrated enhanced YOLOv5s with U-Net++, utilizing binocular vision ranging and edge detection to achieve precise crack localization and width quantification, thereby improving measurement accuracy in complex scenes. For pixel-level road crack detection, Wang et al. [42] proposed GGMNet, which incorporates three key components: (1) Global Context Residual Blocks (GC-Resblocks) to suppress background noise, (2) Graph Pyramid Pooling Modules (GPPMs) for multiscale feature aggregation and long-range dependency capture, and (3) Multiscale Feature Fusion (MFF) modules to minimize missed detections. This framework demonstrated high accuracy and robustness on the DeepCrack, CrackTree260, and Aerial Track Detection datasets, providing insights for refining linear feature extraction. Traditional machine learning techniques remain relevant in specialized applications. Shi et al. [43] developed CrackForest, which combines integral channel features with random structured forests to enhance noise immunity and structural information retention. Gavilán et al. [44] significantly reduced false positives through pavement classification and feature screening in their adaptive detection system. Oliveira et al. [45] created CrackIT, which integrates unsupervised clustering with crack characterization for effective thin-crack detection and width grading.
Current research has significantly advanced crack detection through targeted improvements to the YOLO series models, including structural optimization, attention mechanisms, image enhancement, lightweight design, and multimodal fusion. These developments have substantially enhanced both accuracy and practicality. However, model performance degrades markedly in weak- or no-light conditions, as visible light images lose texture information and exhibit blurred boundaries, which limits crack sensing and localization capabilities. Therefore, developing dual-modal detection models with enhanced robustness for such environments remains a critical research direction. To address weak-light degradation, thermal infrared imagery serves as a complementary modality. Unlike visible light, which depends on ambient illumination, thermal imaging relies on temperature differences across the target surface, enabling stable imaging at night or in darkness. Road cracks appear as regions of high thermal contrast in infrared images, supporting robust perception. The two modalities are strongly complementary: visible images provide texture and edge details, while infrared conveys surface thermal characteristics. Consequently, dual-modal fusion has emerged as a key approach for weak-light crack detection, improving feature completeness and model generalization, which is particularly valuable for temperature-variant defects such as road cracks.
Meanwhile, current fusion methods (e.g., feature concatenation, weighted fusion) often fail to capture deep inter-modal semantic relationships, suffering from shallow integration and inadequate spatial dependency modeling [31]. Transformers have recently gained prominence in vision tasks due to their superior long-range dependency modeling. Their self-attention mechanism dynamically focuses on globally relevant features, effectively addressing dual-modal alignment challenges [4]. Motivated by these advances, we propose the Cross-Modality Fusion Transformer (CFT) module. This dual-branch structure concurrently models intra- and inter-modal feature relationships via self-attention, enabling deep fusion and semantic alignment to significantly enhance crack perception in complex scenes. Integrated into YOLOv5 for infrared–visible detection, CFT substantially outperforms conventional fusion methods across multiple metrics, particularly in weak-light scenarios [35].
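To make the idea concrete, the following is a minimal PyTorch sketch of a CFT-style fusion block as described above: RGB and IR feature maps are flattened into one joint token sequence so that a single multi-head self-attention layer models intra- and inter-modal relationships simultaneously. The class name CrossModalityFusionBlock and the hyperparameters (dim, num_heads) are illustrative assumptions, not the exact implementation used in this work.

```python
import torch
import torch.nn as nn


class CrossModalityFusionBlock(nn.Module):
    """Sketch of a CFT-style fusion block: RGB and IR feature maps are flattened
    into one joint token sequence, multi-head self-attention models intra- and
    inter-modal relationships together, and the sequence is split back into the
    two branches with residual connections."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor):
        b, c, h, w = rgb.shape
        rgb_tok = rgb.flatten(2).transpose(1, 2)          # (B, H*W, C)
        ir_tok = ir.flatten(2).transpose(1, 2)            # (B, H*W, C)
        tokens = torch.cat([rgb_tok, ir_tok], dim=1)      # joint RGB+IR sequence
        x = self.norm1(tokens)
        attn_out, _ = self.attn(x, x, x)                  # intra- and inter-modal attention
        tokens = tokens + attn_out                        # residual fusion
        tokens = tokens + self.mlp(self.norm2(tokens))
        rgb_tok, ir_tok = tokens.split(h * w, dim=1)      # back to separate branches

        def restore(t):
            return t.transpose(1, 2).reshape(b, c, h, w)

        return restore(rgb_tok), restore(ir_tok)
```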
Building on this foundation, we present a dual-modal road crack detection method that embeds CFT into the stable YOLOv11 baseline. Our approach constructs infrared–visible dual-branch feature extraction and leverages CFT within the backbone for dynamic cross-modal interaction. To validate performance, we compiled a dual-modal road crack dataset covering weak- and no-light conditions and conducted comparative experiments with visualization analysis.
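As a structural illustration only, the sketch below shows how such a dual-branch backbone with per-scale fusion could be organized, reusing the CrossModalityFusionBlock sketched above; the factory make_stage and the channel widths are placeholders rather than the actual YOLOv11 configuration.

```python
import torch.nn as nn

# Assumes CrossModalityFusionBlock from the previous sketch is in scope.


class DualBranchBackbone(nn.Module):
    """Illustrative two-stream backbone: parallel RGB and IR stages with
    CFT-style fusion after the stages producing the P3, P4, and P5 features."""

    def __init__(self, make_stage, channels=(128, 256, 512)):
        super().__init__()
        # make_stage(c) is a placeholder factory returning one backbone stage
        # whose output has c channels; it is not the actual YOLOv11 definition.
        self.rgb_stages = nn.ModuleList([make_stage(c) for c in channels])
        self.ir_stages = nn.ModuleList([make_stage(c) for c in channels])
        self.fusions = nn.ModuleList([CrossModalityFusionBlock(c) for c in channels])

    def forward(self, rgb, ir):
        pyramid = []
        for rgb_stage, ir_stage, fuse in zip(self.rgb_stages, self.ir_stages, self.fusions):
            rgb, ir = rgb_stage(rgb), ir_stage(ir)   # modality-specific feature extraction
            rgb, ir = fuse(rgb, ir)                  # cross-modal interaction at this scale
            pyramid.append(rgb + ir)                 # fused map handed to the detection neck
        return pyramid                               # [P3, P4, P5]
```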
3. Results
3.1. Uni-Modal Experimental Results with Three Lighting Scenarios
To systematically analyze the effects of lighting conditions on the performance of single-modal visible target detection, this study designs twelve experimental sets that combine four models (YOLOv5-RGB, YOLOv8-RGB, YOLOv11-RGB, and YOLOv13-RGB) with three lighting scenarios (natural light, weak light, and no light). A comprehensive comparison of performance metrics reveals the influence patterns of light variations on crack detection results, thereby validating the necessity of introducing the infrared modality.
Under natural light conditions, characterized by sufficient daytime illumination, all four models demonstrated strong performance in crack detection, achieving optimal results as illustrated in Figure 4. The YOLOv11-RGB model exhibited stable performance, achieving a mean Average Precision (mAP) of 92.2% at IoU 0.5, with precision at 94.8% and recall at 89.2%, alongside a mAP of 56.0% calculated over the range IoU 0.5:0.95. Serving as the baseline for our enhanced model, YOLOv11-RGB ensures reliable performance control for subsequent dual-modal fusion models. In comparison, YOLOv5-RGB and YOLOv8-RGB attained mAP@0.5/mAP@0.5:0.95 scores of 93.7%/66.5% and 94.2%/63.7%, respectively, slightly surpassing YOLOv11-RGB on these metrics. YOLOv13-RGB also showed competitive results, with mAP@0.5 reaching 93.9% and mAP@0.5:0.95 at 63.6%, precision at 94.0%, and recall at 91.0%. Nevertheless, recall and overall robustness exhibited minimal differences across the models. Collectively, mainstream uni-modal models provide high detection accuracy under ideal lighting conditions, particularly when crack textures are clearly defined.
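For readers comparing the two headline metrics: mAP@0.5 scores detections at a single IoU threshold of 0.5, whereas mAP@0.5:0.95 averages the AP over ten thresholds from 0.50 to 0.95 in steps of 0.05. The toy computation below uses made-up AP values purely to illustrate the averaging, not the reported results.

```python
import numpy as np

# Made-up AP values at IoU thresholds 0.50, 0.55, ..., 0.95 (not the reported results)
iou_thresholds = np.arange(0.50, 1.00, 0.05)
ap_per_threshold = np.array([0.92, 0.90, 0.87, 0.82, 0.75, 0.66, 0.55, 0.42, 0.28, 0.12])
assert len(iou_thresholds) == len(ap_per_threshold)

map_50 = ap_per_threshold[0]           # mAP@0.5 uses only the IoU = 0.50 threshold
map_50_95 = ap_per_threshold.mean()    # mAP@0.5:0.95 averages all ten thresholds
print(f"mAP@0.5 = {map_50:.3f}, mAP@0.5:0.95 = {map_50_95:.3f}")
```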
Upon entering weak-light conditions, although some visible light information remains in the image, allowing for a discernible texture of the cracks, the model’s detection performance exhibits a downward trend. This decline is attributed to the overall reduction in light intensity, which adversely affects image contrast and detail clarity. For instance, in the case of YOLOv5-RGB, the mean Average Precision (mAP) at 0.5 decreased from 93.7% to 90.3%, while mAP at 0.5:0.95 fell from 66.5% to 51.4%. Similarly, YOLOv11-RGB experienced a decrease of 14.3 percentage points in mAP at 0.5:0.95. YOLOv13-RGB also showed performance degradation, with mAP@0.5 dropping to 83.5% and mAP@0.5:0.95 to 42.7%, precision at 88.6%, and recall at 76.6%. Although the performance degradation is less severe than in no-light environments, the findings indicate that diminished light intensity significantly impacts the model’s perceptual capabilities. The quality of image information under weak-light conditions is insufficient to support detection performance comparable to that in normal lighting. This phenomenon illustrates that while traditional visible light models exhibit some adaptability in low-light scenarios, their detection efficacy remains constrained by light levels, making it challenging to achieve robust and stable recognition performance.
Under no-light conditions, models that rely solely on visible light exhibit a severe performance collapse. All four models demonstrate significant declines in metrics: YOLOv5-RGB’s mean Average Precision (mAP) at 0.5 drops to 77.6% (from 90.3%), with a mAP at 0.5:0.95 of 40.4%; YOLOv8-RGB achieves 70.4% and 36.1% for mAP at 0.5 and mAP at 0.5:0.95, respectively; while YOLOv11-RGB declines to 67.5% and 33.3%. YOLOv13-RGB performed similarly poorly, with mAP@0.5 of 60.6% and mAP@0.5:0.95 of 27.8%, precision at 60.1%, and recall at 59.0%. This performance collapse confirms that negligible texture and structural information from visible images in darkness leads to critical deficiencies in features, substantially reducing detection accuracy and recall, while simultaneously increasing the rates of missed and false detections. The twelve experiments illustrate the stratified effects of illumination: stable, high-accuracy detection occurs under natural light; moderate degradation is observed with basic recognition in weak light; and near-complete failure arises in no-light conditions, thereby revealing fundamental unimodal limitations.
3.2. Comparative Experiments of Our Dual-Modal with Uni-Modal Model
Building upon a comprehensive performance analysis of twelve uni-modal models under diverse lighting conditions, this study further extends the comparative scope to rigorously evaluate dual-modal fusion methodologies in weak-light and no-light environments. Beyond unimodal baselines, our benchmark incorporates three representative state-of-the-art approaches: (1) the YOLOv11-RGBT framework (Wan et al. [46]), implemented across three lightweight variants (YOLOv8n, YOLOv10n, and YOLOv11n) with consistent P3 mid-level feature fusion for RGB/IR inputs; (2) the YOLOv5-CFT methodology (Qingyun et al. [47]), deployed at multiple capacity scales (s, m, l) with embedded Cross-Modality Fusion Transformer modules to enhance global feature interaction; and (3) the recently proposed C²Former model (Yuan and Wei [48]), which introduces calibrated and complementary cross-attention mechanisms to mitigate modality misalignment and fusion imprecision. This multi-path evaluation framework enables a fair and systematic comparison across heterogeneous strategies, providing insights into the impact of fusion design, architectural configuration, and computational efficiency on dual-modal detection.
Experimental results confirm that the proposed YOLOv11-DCFNet (ours-n) achieves the most favorable balance between accuracy and efficiency while exhibiting strong robustness in degraded-light conditions. As shown in Table 3, YOLOv11-DCFNet attains a Precision of 95.3%, Recall of 90.5%, mAP@0.5 of 92.9%, and mAP@0.5:0.95 of 56.3%, consistently outperforming all comparative baselines. It substantially surpasses the YOLOv5-CFT variants (at most 74.4% mAP@0.5 and 32.0% mAP@0.5:0.95) and outperforms the YOLOv11-RGBT family, where the best-performing YOLOv11(n) only reaches 89.4% mAP@0.5 with lower Recall (82.9%). Moreover, while C²Former achieves competitive Recall (95.6%) and mAP@0.5:0.95 (54.6%), its training cost is prohibitively high (18.6 h, nearly 10× slower than DCFNet). In contrast, YOLOv11-DCFNet provides a superior trade-off with balanced accuracy and efficiency, delivering greater stability and real-world deployability in extreme low-illumination scenarios.
3.3. Ablation Experiments
To systematically evaluate dual-modal fusion strategies for road crack detection in weak- and low-light environments, this study constructs five progressively structured ablation models based on the YOLOv11 backbone, as illustrated in Table 4. The baseline model, YOLOv11-RGB, utilizes only RGB inputs. YOLOv11-EF implements early fusion through element-wise summation at P3, P4, and P5, while YOLOv11-ADD applies weighted fusion at identical positions. YOLOv11-T integrates three CFT modules, which combine multi-head attention and feature updates, at the P2, P3, and P4 scales with minimal intervention in the backbone. In contrast, YOLOv11-DCFNet reconfigures the backbone with deeply integrated CFT modules at P3, P4, and P5, establishing cross-stage propagation. This progression evolves from arithmetic feature fusion to architecturally embedded synergy.
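To clarify how the two simpler fusion baselines differ, the snippet below sketches element-wise summation (as in YOLOv11-EF) and a learnable weighted addition (one plausible reading of YOLOv11-ADD) for a single pyramid level; it is an interpretation of the described strategies, and the scalar-weight formulation is an assumption rather than the exact configuration used here.

```python
import torch
import torch.nn as nn


def early_fusion(rgb_feat: torch.Tensor, ir_feat: torch.Tensor) -> torch.Tensor:
    """EF-style fusion: plain element-wise summation of spatially aligned feature maps."""
    return rgb_feat + ir_feat


class WeightedAddFusion(nn.Module):
    """ADD-style fusion: a learnable scalar balances the two modalities (an assumed
    formulation; the text only states that weighted fusion is applied at P3-P5)."""

    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.0))   # learned mixing logit

    def forward(self, rgb_feat: torch.Tensor, ir_feat: torch.Tensor) -> torch.Tensor:
        w = torch.sigmoid(self.alpha)                  # keep the RGB weight in (0, 1)
        return w * rgb_feat + (1.0 - w) * ir_feat
```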
The experimental results presented in Table 4 indicate that YOLOv11-RGB exhibits a significant performance bottleneck in low-light conditions, achieving Precision and Recall rates of 76.2% and 72.3%, respectively, and a mean Average Precision (mAP) of only 38.0% over the 0.5 to 0.95 IoU range. This limitation underscores the challenges associated with unimodal inputs in effectively detecting cracks under reduced illumination. In contrast, YOLOv11-EF employs early fusion by integrating RGB and infrared (IR) images at the input stage, resulting in substantial improvements in Precision (85.7%) and Recall (85.1%). Nevertheless, its mAP@0.5:0.95 experiences a slight decline to 36.3%, suggesting that while simple channel splicing enhances target detection capabilities, it remains inadequate for achieving precise localization and complex semantic interpretation.
Further, the YOLOv11-ADD model achieves an improved balance between computational overhead and semantic interaction through element-wise additive fusion of mid-level RGB and infrared (IR) features, which effectively enhances the quality of inter-modal representation. Precision increases to 90.1%, while the mean Average Precision (mAP) at 0.5 remains comparable to that of YOLOv11-EF at 84.4%, and the mAP at 0.5:0.95 shows a slight increase to 36.9%, indicating superior fusion capabilities. Additionally, the YOLOv11-T model employs a two-branch input structure that separately extracts RGB and IR features, incorporating three CFT modules at the P2, P3, and P4 scales. This design utilizes multi-attention mechanisms to facilitate deeper semantic interactions between modalities. Although its mAP at 0.5:0.95 is 35.2%, which is slightly lower than that of the ADD model, it still surpasses early fusion techniques, thereby demonstrating the advantages of Transformer-based fusion in enhancing semantic perception.
In this study, YOLOv11-DCFNet serves as the primary improved model, distinct from previous architectures that limit fusion to the input level (YOLOv11-EF), mid-level weighting (YOLOv11-ADD), or shallow attention guidance (YOLOv11-T). Instead, it adopts a dual-branch structure incorporating deeply embedded CFT modules at P3, P4, and P5, while also introducing cross-stage propagation. This design enhances inter-modal interaction and feature complementarity, yielding the best performance among all compared models. Specifically, Precision reaches 95.3%, Recall is 90.5%, mAP@0.5 is 92.9%, and mAP@0.5:0.95 is 56.3%, corresponding to respective improvements of 19.1%, 18.2%, 16.3%, and 20.0% over the baseline. These findings underscore that infrared–visible fusion under weak- or no-light conditions significantly enhances detection accuracy and robustness, thereby validating the effectiveness and advancement of the proposed method.
To further assess the impact of module combinations on complexity and efficiency, we evaluate parameters (Params), computational load (GFLOPs), and inference speed (FPS) across the five models, as shown in Table 5. The single-branch YOLOv11-RGB baseline achieves optimal efficiency metrics, including minimal parameters (2.62 M), the lowest computation (3.31 GFLOPs), and a peak frame rate (114.79 FPS), demonstrating an exceptional lightweight design. However, due to the absence of dual-modal fusion, it exhibits limited feature representation in complex scenes.
YOLOv11-EF implements early fusion through element-wise summation at P3, P4, and P5, resulting in an increase in parameters to 3.99 M and computation to 4.87 GFLOPs (Table 5), while maintaining a frame rate of 82.61 FPS, thereby confirming efficient information enrichment. YOLOv11-ADD employs weighted fusion at the same positions (P3, P4, P5) within a dual-branch structure, preserving nearly identical complexity (3.96 M parameters and 4.78 GFLOPs) while improving the speed to 88.22 FPS, thus demonstrating effective feature integration. Conversely, YOLOv11-T incorporates three CFT modules at the P2, P3, and P4 scales, which significantly enhances inter-modal semantic interaction but increases parameters to 13.95 M and computation to 6.13 GFLOPs, reducing the frame rate to 37.34 FPS, just above the 30 FPS real-time threshold. This outcome reveals scalability constraints for resource-limited deployment.
Finally, YOLOv11-DCFNet reconfigures the backbone by incorporating deeply integrated CFT modules at stages P3, P4, and P5, thereby establishing cross-stage propagation. This architecture reduces parameters to 10.36 M and computation to 5.49 GFLOPs compared with YOLOv11-T, while enhancing inference speed by 21%, achieving 45.30 FPS. This optimization effectively demonstrates fusion-depth control and computational refinement. As a result, YOLOv11-DCFNet strikes an optimal balance between dual-modal feature enhancement, detection precision, and deployable inference performance, thereby validating its strong practical viability for real-world applications.
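For reference, parameter counts and frame rates of the kind reported in Table 5 can be estimated with a few lines of PyTorch, as sketched below; the warm-up and iteration counts and the dual-input convention are assumptions, and GFLOPs would normally come from a separate FLOP-counting profiler rather than this snippet.

```python
import time
import torch


def count_parameters_m(model: torch.nn.Module) -> float:
    """Trainable parameter count in millions (the 'Params' column)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6


@torch.no_grad()
def measure_fps(model: torch.nn.Module, inputs, warmup: int = 20, iters: int = 100) -> float:
    """Rough FPS estimate: mean latency over repeated forward passes after warm-up.
    `inputs` is a tuple of tensors, e.g. (rgb,) for uni-modal or (rgb, ir) for dual-modal."""
    model.eval()
    for _ in range(warmup):
        model(*inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(*inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)


# Hypothetical usage:
# rgb = ir = torch.randn(1, 3, 640, 640)
# print(count_parameters_m(model), measure_fps(model, (rgb, ir)))
```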
3.4. Visual Interpretability Experiment
Deep learning models are often regarded as black boxes, which complicates their interpretability despite their strong performance on various tasks. Understanding model interpretability is particularly crucial in critical domains such as autonomous driving. In our study, we evaluated the interpretability of YOLOv11-RGB (unimodal) and YOLOv11-DCFNet (bimodal, equipped with a CFT module) using confusion matrices (see Figure 5). The unimodal model (Figure 5a) exhibits low diagonal values (e.g., 0.56 confusion between ‘Vertical cracks’ and ‘Background’), indicating significant misclassification, particularly between different types of cracks and the background. In contrast, the bimodal model (Figure 5b) demonstrates stronger diagonal dominance (0.89 for vertical, 0.97 for transverse, and 0.94 for oblique cracks), with substantially reduced cross-category confusion and fewer background false alarms. This suggests that infrared thermography effectively compensates for the loss of RGB detail in low-light or complex environments, while the CFT module enhances cross-modal context modeling, thereby improving the clarity of target representation and overall detection robustness.
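In detection, such a matrix is typically built from IoU-matched predictions and ground truths; the simplified sketch below, with invented labels, only illustrates the row normalization that yields per-class values like those discussed above, where each diagonal entry corresponds to per-class recall.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented class indices for IoU-matched detections:
# 0: vertical crack, 1: transverse crack, 2: oblique crack, 3: background
classes = ["Vertical", "Transverse", "Oblique", "Background"]
y_true = np.array([0, 0, 1, 2, 3, 1, 0, 2, 3, 1])
y_pred = np.array([0, 3, 1, 2, 3, 1, 0, 0, 3, 1])

# Row-normalized matrix: each row sums to 1, so the diagonal reads as per-class recall
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3], normalize="true")
for name, row in zip(classes, cm):
    print(f"{name:<11}", np.round(row, 2))
```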
Furthermore, the analysis of Precision–Recall (PR) curves (see Figure 6) also reveals significant performance differences among the models. The baseline YOLOv11-RGB exhibits a marked degradation in precision at Recall values exceeding 0.6, particularly for vertical cracks. In contrast, YOLOv11-EF and YOLOv11-ADD demonstrate upward-shifted curves, indicating improved precision at medium to high recall levels (0.6–0.9), which supports the effectiveness of early fusion and feature weighting techniques. Specifically, YOLOv11-ADD performs exceptionally well for horizontal and oblique cracks, while YOLOv11-EF enhances recall for vertical cracks. Although YOLOv11-T slightly surpasses the baseline at low to medium recall levels, it shows fluctuations and a sharp decline in performance beyond a Recall of 0.8, suggesting instability due to local module replacement. Conversely, YOLOv11-DCFNet outperforms all other models, maintaining near-perfect precision until a Recall of approximately 0.9, followed by a gradual decline, and demonstrating superior performance across all crack types. Thus, the CFT module facilitates optimal detection through cross-modal and cross-layer fusion. Overall, DCFNet leads in performance, while EF and ADD provide substantial improvements without increasing complexity. YOLOv11-T shows limited advancements, and the baseline model experiences the most significant degradation at high recall levels. These findings confirm the efficacy of YOLOv11-DCFNet for detecting fine cracks in low-light conditions.
3.5. Visualization Comparison
Visual verification using Grad-CAM heatmaps (see Figure 7) demonstrates the detection efficacy of YOLOv11-DCFNet. The heatmaps reveal the model’s attention, with darker red indicating regions of higher focus. The results indicate that the original YOLOv11 struggles in low-light or nighttime conditions, often becoming distracted by road textures or high-contrast non-target areas in complex backgrounds. This distraction leads to inaccurate crack localization, blurred boundaries, and detection failures. In contrast, YOLOv11-DCFNet exhibits a more stable attention mechanism. By fusing RGB (texture details) and IR (thermal differentials), it generates concentrated thermal responses that precisely highlight crack structures while preserving texture information. This cross-modal fusion significantly enhances target localization in low-light and dark environments.
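For completeness, a minimal Grad-CAM routine of the kind used to produce such heatmaps is sketched below; the hooked layer, the score_fn that reduces the detector output to a scalar crack confidence, and the tuple-input convention are all assumptions, not the exact visualization pipeline used in this work.

```python
import torch
import torch.nn.functional as F


def grad_cam(model, target_layer, inputs, score_fn):
    """Minimal Grad-CAM sketch: hook a convolutional layer, backpropagate a scalar
    detection score, and weight the layer's activations by the spatially averaged
    gradients. `inputs` is a tuple (e.g. (rgb, ir)); `score_fn` reduces the model
    output to one scalar, e.g. the highest crack confidence."""
    feats, grads = {}, {}

    def fwd_hook(_module, _inp, output):
        feats["a"] = output.detach()

    def bwd_hook(_module, _grad_in, grad_out):
        grads["g"] = grad_out[0].detach()

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)
    try:
        model.zero_grad()
        score = score_fn(model(*inputs))                       # scalar to explain
        score.backward()
    finally:
        h1.remove()
        h2.remove()

    weights = grads["g"].mean(dim=(2, 3), keepdim=True)        # GAP over spatial dims
    cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=inputs[0].shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1] for overlay
```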
To validate the stability of YOLOv11-DCFNet, we compared models utilizing different fusion modules (ADD, EF, T, DCFNet) under both weak-light and no-light conditions. The results demonstrate that YOLOv11-DCFNet exhibits optimal performance across all environments, particularly showcasing superior robustness and generalization in low- and no-light scenarios (see Figure 8). In weak-light conditions, DCFNet outperforms the ADD and EF models in extracting subtle crack textures. In complete darkness, its CFT module effectively leverages infrared thermal radiation to compensate for the loss of RGB texture, thereby enhancing the integrity of crack identification. Overall, YOLOv11-DCFNet maintains stable detection capabilities in both lighting extremes, significantly surpassing traditional fusion methods and unimodal models. This approach not only enhances crack detection accuracy and completeness but also ensures operational reliability in challenging low-visibility environments.
5. Conclusions
In this study, we propose an infrared–visible dual-modal fusion crack detection method, YOLOv11-DCFNet, to address the performance degradation of traditional RGB-based models in low-light and no-light environments. The core innovation is the Cross-Modality Fusion Transformer (CFT) module, which facilitates deep interactions between infrared (IR) and visible (RGB) features across multiple layers. By establishing robust correlations between global semantics and local details, YOLOv11-DCFNet compensates for the limitations of unimodal inputs in extreme lighting conditions, enhancing the perception of low-contrast crack boundaries and fine structures while achieving high precision and robustness in detection. Through hierarchical feature interaction, the model effectively leverages thermal radiation cues from infrared images and texture details from visible images, while suppressing redundant information, thereby improving accuracy and stability without compromising inference efficiency. Experimental results demonstrate that YOLOv11-DCFNet outperforms unimodal detectors and conventional fusion strategies, maintaining strong detection integrity in scenarios with low light, no light, background clutter, and target degradation.
Future research can further enhance the utility and generalization of YOLOv11-DCFNet. At the data level, introducing a wider range of diverse conditions—such as heavy rain, fog, snow reflection, and various road materials—will improve cross-geographic and cross-seasonal adaptability. Preprocessing strategies, including weak-light enhancement, de-fogging, and heat source suppression, can mitigate infrared degradation caused by high temperatures or strong thermal interference. At the model level, integrating attention optimization and multi-scale context mechanisms may enhance feature expressiveness while reducing computational overhead, thereby facilitating deployment on mobile and embedded devices. Furthermore, the combination of active learning and semi-supervised approaches can adaptively refine model performance while reducing annotation costs. Although YOLOv11-DCFNet has already achieved significant advancements, ongoing innovation in dual-modal detection will further advance the development of more accurate, robust, and efficient crack monitoring systems for complex real-world environments.