Article

Structure-Aware Progressive Multi-Modal Fusion Network for RGB-T Crack Segmentation

1 Hunan Architectural Design Institute Group Co., Ltd., Changsha 410208, China
2 School of Artificial Intelligence and Robotics, Hunan University, Changsha 410012, China
3 College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
* Author to whom correspondence should be addressed.
J. Imaging 2025, 11(11), 384; https://doi.org/10.3390/jimaging11110384
Submission received: 10 September 2025 / Revised: 27 October 2025 / Accepted: 29 October 2025 / Published: 1 November 2025

Abstract

Crack segmentation in images plays a pivotal role in the monitoring of structural surfaces, serving as a fundamental technique for assessing structural integrity. However, existing methods that rely solely on RGB images exhibit high sensitivity to light conditions, which significantly restricts their adaptability in complex environmental scenarios. To address this, we propose a structure-aware progressive multi-modal fusion network (SPMFNet) for RGB-thermal (RGB-T) crack segmentation. The main idea is to integrate complementary information from RGB and thermal images and incorporate structural priors (edge information) to achieve accurate segmentation. Here, to better fuse multi-layer features from different modalities, a progressive multi-modal fusion strategy is designed. In the shallow encoder layers, two gate control attention (GCA) modules are introduced to dynamically regulate the fusion process through a gating mechanism, allowing the network to adaptively integrate modality-specific structural details based on the input. In the deeper layers, two attention feature fusion (AFF) modules are employed to enhance semantic consistency by leveraging both local and global attention, thereby facilitating the effective interaction and complementarity of high-level multi-modal features. In addition, edge prior information is introduced to encourage the predicted crack regions to preserve structural integrity, which is constrained by a joint loss of edge-guided loss, multi-scale focal loss, and adaptive fusion loss. Experimental results on publicly available RGB-T crack detection datasets demonstrate that the proposed method outperforms both classical and advanced approaches, verifying the effectiveness of the progressive fusion strategy and the utilization of the structural prior.

1. Introduction

Human transportation, habitation, and production are inseparable from a large number of man-made structures such as buildings, roads, and bridges. Over time, these structures suffer from structural damage, which is usually indicated by surface cracks. Therefore, to ensure the safety of people’s lives and property, it is necessary to regularly monitor the surface cracks of man-made structures. Compared with on-site manual inspection, automatic crack segmentation methods based on UAV images are low-cost, low-risk, and able to monitor inaccessible areas.
Semantic segmentation is the critical technique for crack segmentation. It is a task of dense pixel-wise classification, where each pixel in an image is assigned a corresponding semantic category label [1]. In the field of infrastructure maintenance, semantic segmentation-based crack extraction draws increasing attention in various civil engineering scenarios such as building inspection [2], bridge monitoring [3], tunnel safety evaluation [4], and road surface analysis [5]. In recent years, deep learning technologies, especially convolutional neural networks (CNNs), have brought revolutionary advancements to the field of semantic segmentation. A series of milestone architectures have emerged, including Fully Convolutional Networks (FCNs) [6], U-Net [7], and DeepLab V3+ [8], all of which have achieved remarkable results in large-scale RGB image semantic segmentation tasks. Inspired by these successes, researchers have gradually introduced these advanced techniques into the field of crack segmentation. Most methods tend to adopt convolutional architectures. For example, DeepCrack [9] utilizes hierarchical multi-scale feature extraction to capture fine structural crack details. APF-Net [10], proposed for pavement crack segmentation, integrates a progressive fusion (PF) module and a hybrid multiple attention (HMA) mechanism.
Transformer-based approaches have also demonstrated strong performance. CrackFormer-II [11] employs local self-attention and attention-guided skip connections to improve both global context modeling and fine detail segmentation. CrackSegNet [12] adopts dilated convolutions and multi-scale pooling within a modular architecture, addressing class imbalance with an optimized loss function. A two-stage method proposed by Liu combines Mask R-CNN for detection and DeepLabV3+ for fine segmentation in bridge crack images [13].
To segment thin and low-contrast cracks more accurately, StripCuts [14] formulates segmentation as an optimization problem, leveraging volumetric dynamic programming and crack linearization. Li et al. [15] introduced a physically informed method that incorporates dynamic snake convolution and cross-correlation constraints to enhance continuity in road crack segmentation. Furthermore, Yoon et al. [16] developed a comprehensive polygon-annotated crack dataset with diverse crack types, serving as a robust benchmark for semantic and instance segmentation.
In addition, several methods have been developed for specific environments. For example, for tunnel environments, Sun et al. proposed a hybrid model combining an EfficientNet-enhanced YOLOX for crack identification and UNETR++ for fine segmentation [17]. Similarly, ISTD-CrackNet [18] uses a hierarchical Transformer with multi-angle strip convolution and dynamic upsampling to maintain crack continuity and sharp edges. In complex mining environments, Wang et al. [19] employed mid-level semantic features and multi-scale attention fusion within a DeepLabV3+-based network to improve segmentation under noisy backgrounds. To address the challenge of detecting fine linear fatigue cracks in metallic aero-engine components, Si et al. [20] developed a VGG-16-based U-shaped FCN model integrating CBAM and advanced loss functions, achieving significant improvements in segmentation performance. Several other end-to-end architectures have also been proposed. Wang et al. [21] introduced a U-Net variant incorporating multi-layer feature fusion, residual blocks with pointwise convolution, and maximum unpooling to enhance thin-crack edge detail. DCNCrack [22] integrates deformable convolution into CNNs to improve adaptive spatial feature aggregation and long-range dependence. MorFormer [23] combines background morphology learning and morphology-aware attention to suppress noise and capture crack topology. Semi-CSN [24] adopts a semi-supervised training strategy with multi-scale attention fusion and dynamic pseudo-labeling to leverage unlabeled data effectively. In the domain of weakly supervised learning, CrackCLIP [25] utilizes vision-language pre-training (CLIP) and language prompts to guide semantic understanding and pseudo-label generation.
However, under challenging conditions such as rainy weather, haze, or low-light environments, crack detection and segmentation methods that rely solely on RGB images exhibit limited performance. Moreover, approaches that depend exclusively on infrared images struggle to distinguish cracks from other surface defects (e.g., stains, dust, or corrosion), thereby leading to false positives or missed detections. To address these issues, RGB-T crack segmentation techniques [26] have been proposed in recent years because thermal images do not depend on light conditions and are complementary to RGB images [27]. RGB-T semantic segmentation networks have already achieved good results in other fields, indicating that multi-modal information can promote the stability and accuracy of semantic segmentation. For instance, Ha et al. [28] pioneered this approach with the introduction of MFNet, which utilizes two independent but structurally identical encoders to extract modality-specific features, followed by a unified fusion strategy in the decoder for cross-modal integration. Since the introduction of MFNet, numerous models have been developed under the symmetric framework. Zhang et al. [29] proposed ABMDRNet, which incorporates a dedicated sub-network to reduce the domain gap between RGB and thermal features during the fusion process, thereby enabling more effective joint representation learning using shared operations. Similarly, Zhou et al. [30] introduced GMNet, which decomposes features into two semantic levels and applies distinct fusion modules accordingly while employing multi-label supervision to guide training. Zhou et al. [31] decoupled the feature fusion process from the encoder and decoder stages, extracting global contextual information from high-level fused features via a parallel convolutional structure to enhance decoding performance. More recently, to further explore cross-modal complementarities and morphological diversity, MMSMCNet [32] was proposed. It integrates modal memory sharing with multi-scale morphological guidance, introducing a novel decoder composed of contour, skeleton, and morphological complementary modules, and achieves superior performance through a multi-unit-based complementary supervision strategy. Beyond fusion strategy innovations, some models have begun focusing on enhancing the decoding process itself. For example, the Shape and Semantic Enhancements Module (SASEM) [33] proposes a dual-branch decoder structure that separately reinforces shape and semantic representations using signed distance maps and channel-level enhancement, which effectively improves feature recovery and object boundary integrity. Furthermore, UTFNet [34] introduces an uncertainty-guided RGB-T fusion strategy by quantifying the modality confidence of each input through evidential theory. It dynamically adjusts the fusion based on illumination evidence and multi-scale reliability, enabling more trustworthy cross-modal integration under varied real-world conditions. In addition, DHFNet [35] proposes a hierarchical fusion approach that decouples global and local features from each modality. Through modules such as lightweight global self-attention (LGSA), cross-modal deformable convolution (CMDC), and long-distance feature fusion (CMLFF), it effectively addresses feature misalignment and redundancy, achieving fine-grained fusion across different levels. Zhao et al. [36] proposed OpenRSS, an open-vocabulary RGB-T semantic segmentation method that efficiently fuses RGB and thermal information, achieving significant improvements on multiple benchmarks. Liu et al. [37] introduced IQSeg, which enhances RGB-T segmentation accuracy by addressing implicit alignment and refining query-based predictions. Liu et al. [38] presented MiLNet, a module-free network that integrates multiplex feature interactions to improve RGB-T semantic segmentation performance.
These studies indicate that the fusion of multi-modal information can promote the performance of semantic segmentation, especially in terms of stability and accuracy. To better leverage the complementary information from both modalities, an effective multi-modal feature fusion mechanism [39] should be carefully designed. Moreover, we note that cracks often exhibit tiny, intermittent, and irregular edges, making it difficult to accurately segment cracks without effective edge features. To this end, we propose a crack detection method based on the structure-adaptive fusion of RGB and thermal images. The main contributions of this work are as follows:
  • An end-to-end edge-guided progressive multi-modal segmentation network is proposed in this work. The developed framework is designed to jointly leverage both RGB and thermal infrared information while incorporating structural priors. By embedding edge-aware supervision throughout the decoding process, the overall accuracy and robustness of crack segmentation are significantly enhanced, effectively addressing the boundary blurring issue commonly encountered in conventional approaches.
  • A progressive multi-modal feature fusion strategy is designed to hierarchically integrate cross-modal features. For shallow-level features characterized by rich spatial details, a Gated Control Attention (GCA) module is introduced to dynamically recalibrate modal contributions through a gating mechanism, thereby enhancing texture perception in local crack regions. For deep-level features carrying high-level semantics, an Attention Feature Fusion (AFF) module is employed to align and complement cross-modal representations via joint local–global attention, effectively strengthening semantic consistency in structural prediction.
  • A structure-prior-guided segmentation prediction strategy is proposed, which utilizes edge prediction consistency constraints to preserve the original structural characteristics in segmentation results. By formulating a joint optimization objective comprising edge-guided loss, multi-scale focal loss, and adaptive fusion loss, the structural integrity and boundary accuracy of the final predictions are significantly improved.
In Section 2, we comprehensively describe the proposed approach, while Section 3 presents empirical evaluations and Section 4 concludes this paper.

2. Proposed Method

In this section, we first introduce the framework of the proposed method. Then, detailed descriptions of the progressive multi-modal feature fusion stage and the structure-guided crack segmentation stage, together with the developed modules, are provided. Finally, the edge-guided loss function is coupled with a classic loss function to exploit the structural prior of cracks.
The proposed SPMFNet (Structure-aware Progressive Multi-modal Fusion Network for RGB-T Crack Segmentation) method mainly consists of two stages, as shown in Figure 1. When RGB and thermal images are input, the dual-branch encoder extracts multi-level semantic features from both modalities using a four-stage hierarchy. To enable stage-specific fusion, the encoder employs a Gated Control Attention (GCA) module in the first two layers for fine-grained cross-modal interaction. Furthermore, an Attentional Feature Fusion (AFF) module in the last two layers is used to enhance semantic alignment at deeper levels. These adaptively fused features are subsequently passed to the decoder, which performs progressive upsampling with skip connections to recover spatial resolution. The decoder further integrates edge-aware information and outputs pixel-level segmentation maps refined by an edge-guided supervision strategy, thereby enabling accurate and robust crack detection under complex environmental conditions.
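To make the overall pipeline concrete, the following PyTorch sketch outlines the dual-branch encoder with stage-wise fusion (GCA-style in stages 1 and 2, AFF-style in stages 3 and 4) and a skip-connected decoder. This is a structural sketch only: the backbone, channel widths, stand-in fusion callables, and the single segmentation head are illustrative assumptions rather than the authors' implementation, and the per-level edge-guided prediction heads of Section 2.2 are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_bn_relu(in_ch, out_ch, k=3, stride=1):
    # Placeholder encoder/decoder block; the paper's backbone choice is not restated here.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride=stride, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class SPMFNetSkeleton(nn.Module):
    """Structural sketch of Figure 1: a dual-branch encoder with stage-wise
    cross-modal fusion and a skip-connected decoder. The `fuse_fns` callables
    stand in for GCA (stages 1-2) and AFF (stages 3-4)."""

    def __init__(self, widths=(64, 128, 256, 512), fuse_fns=None):
        super().__init__()
        chans = [3] + list(widths)
        self.rgb_stages = nn.ModuleList(conv_bn_relu(chans[i], chans[i + 1], stride=2) for i in range(4))
        self.t_stages = nn.ModuleList(conv_bn_relu(chans[i], chans[i + 1], stride=2) for i in range(4))
        # Stand-in fusion: element-wise addition; swap in GCA/AFF modules per stage.
        self.fuse_fns = fuse_fns or [lambda a, b: a + b] * 4
        self.dec3 = conv_bn_relu(widths[3] + widths[2], widths[2])
        self.dec2 = conv_bn_relu(widths[2] + widths[1], widths[1])
        self.dec1 = conv_bn_relu(widths[1] + widths[0], widths[0])
        self.head = nn.Conv2d(widths[0], 1, kernel_size=1)

    def forward(self, rgb, thermal):
        fused, x_r, x_t = [], rgb, thermal
        for stage_r, stage_t, fuse in zip(self.rgb_stages, self.t_stages, self.fuse_fns):
            x_r, x_t = stage_r(x_r), stage_t(x_t)
            fused.append(fuse(x_r, x_t))              # stage-wise cross-modal fusion
        f1, f2, f3, f4 = fused
        d = self.dec3(torch.cat([F.interpolate(f4, size=f3.shape[2:], mode="bilinear", align_corners=False), f3], dim=1))
        d = self.dec2(torch.cat([F.interpolate(d, size=f2.shape[2:], mode="bilinear", align_corners=False), f2], dim=1))
        d = self.dec1(torch.cat([F.interpolate(d, size=f1.shape[2:], mode="bilinear", align_corners=False), f1], dim=1))
        return F.interpolate(self.head(d), size=rgb.shape[2:], mode="bilinear", align_corners=False)


if __name__ == "__main__":
    net = SPMFNetSkeleton()
    out = net(torch.randn(1, 3, 480, 640), torch.randn(1, 3, 480, 640))
    print(out.shape)  # torch.Size([1, 1, 480, 640])
```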

2.1. Progressive Multi-Modal Feature Fusion

The proposed multi-modal crack detection network employs a dual-branch symmetric encoder to independently extract features from RGB and thermal (T) images through parallel paths of four hierarchical convolutional layers. To fully exploit the complementary characteristics of these modalities—namely the structural richness of RGB and the thermal sensitivity of T—the encoder integrates distinct fusion strategies adapted to semantic depth. In the first two shallow encoding stages, a Gated Control Attention (GCA) module dynamically performs spatially adaptive cross-modal fusion via hierarchical gated attention, which selectively modulates modality contributions based on the spatial scene context. This mechanism enhances local crack perception by suppressing redundancy and emphasizing modality-specific structural details. In the deeper third and fourth stages, an Attentional Feature Fusion (AFF) module is employed, combining both channel-wise and spatial attention branches to selectively aggregate high-level semantic features and improve semantic alignment across modalities. This depth-aware, staged fusion strategy enables the effective preservation of fine spatial details in shallow layers while ensuring semantic consistency at deeper layers, thereby enhancing the robustness and accuracy of crack representation under complex imaging conditions. In the progressive multi-modal feature fusion stage, a novel combination of techniques was employed to achieve a high-quality fusion of RGB and thermal modal features.
Overall, the encoder architecture leverages a progressive fusion strategy that aligns well with the hierarchical nature of feature extraction. By combining modality-specific low-level cues with jointly refined high-level semantics, the encoder ensures the preservation of detailed spatial information while enhancing the representation of complex crack structures. By employing GCA in early encoding layers and AFF in deeper stages, the network benefits from both fine-grained spatial adaptivity and high-level semantic selectivity. This hierarchical fusion scheme ensures that the encoder captures crack-related cues with varying spatial and contextual granularity, thereby laying a solid foundation for the subsequent edge-guided decoding and final segmentation prediction.

2.1.1. Gate Control Attention Module

To enhance the interaction and integration of modality-specific features in shallow layers, a Gate Control Attention (GCA) module is designed and deployed in the encoder. GCA is specifically constructed to perform cross-modal interaction and dynamic feature fusion between RGB and thermal image representations.
The GCA is designed to perform cross-modal interaction and dynamic feature fusion. For each layer, the features extracted from the RGB and thermal branches are concatenated along the channel dimension to construct the input for fusion. The structure of the GCA is illustrated in Figure 2. Let $F_{\mathrm{RGB}}^{i}$ and $F_{\mathrm{T}}^{i}$ denote the input features of the RGB and thermal branches at the $i$-th ($i \in \{1, 2\}$) layer, respectively. The inputs $F_{\mathrm{RGB}}^{i}$ and $F_{\mathrm{T}}^{i}$ are respectively processed by a Convblock module to generate enhanced features $F_{\mathrm{RGBC}}^{i}$ and $F_{\mathrm{TC}}^{i}$, which is formulated as follows:

$$F_{\mathrm{RGBC}}^{i} = \mathrm{ReLU}\left(\mathrm{Conv}_{1\times 1}(F_{\mathrm{RGB}}^{i}) + \mathrm{BasicConv}(F_{\mathrm{RGB}}^{i})\right), \quad i = 1, 2,$$

$$\mathrm{BasicConv}(F_{\mathrm{RGB}}^{i}) = \mathrm{BN}\left(\mathrm{Conv}_{3\times 3}\left(\mathrm{ReLU}\left(\mathrm{BN}\left(\mathrm{Conv}_{3\times 3}(F_{\mathrm{RGB}}^{i})\right)\right)\right)\right), \quad i = 1, 2,$$

where BN is batch normalization, $\mathrm{Conv}_{3\times 3}$ denotes a learnable $3 \times 3$ convolution operation, and ReLU is the rectified linear unit activation function. The thermal branch is processed analogously:

$$F_{\mathrm{TC}}^{i} = \mathrm{ReLU}\left(\mathrm{Conv}_{1\times 1}(F_{\mathrm{T}}^{i}) + \mathrm{BasicConv}(F_{\mathrm{T}}^{i})\right), \quad i = 1, 2,$$

$$\mathrm{BasicConv}(F_{\mathrm{T}}^{i}) = \mathrm{BN}\left(\mathrm{Conv}_{3\times 3}\left(\mathrm{ReLU}\left(\mathrm{BN}\left(\mathrm{Conv}_{3\times 3}(F_{\mathrm{T}}^{i})\right)\right)\right)\right), \quad i = 1, 2.$$
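A minimal PyTorch reading of the Convblock defined above (a $1 \times 1$ shortcut added to the BasicConv branch, followed by ReLU) is sketched below; the channel configuration and bias settings are assumptions of this sketch.

```python
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """One reading of the Convblock used in the GCA module:
    Convblock(x) = ReLU(Conv1x1(x) + BasicConv(x)), with
    BasicConv(x) = BN(Conv3x3(ReLU(BN(Conv3x3(x)))))."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.basic = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.shortcut(x) + self.basic(x))


if __name__ == "__main__":
    y = ConvBlock(64, 64)(torch.randn(2, 64, 120, 160))
    print(y.shape)  # torch.Size([2, 64, 120, 160])
```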
By sequentially applying feature concatenation, Convblock, and the squeeze-and-excitation operation proposed by Hu et al. [40], the feature representation $F_{\mathrm{SE}}^{i}$ is obtained. The computation is formulated as shown below:

$$F_{\mathrm{SE}}^{i} = \mathrm{SE}\left(\mathrm{ReLU}\left(\mathrm{Conv}_{1\times 1}\left(\mathrm{Concat}(F_{\mathrm{RGB}}^{i}, F_{\mathrm{T}}^{i})\right) + \mathrm{BasicConv}\left(\mathrm{Concat}(F_{\mathrm{RGB}}^{i}, F_{\mathrm{T}}^{i})\right)\right)\right), \quad i = 1, 2,$$

where SE is the squeeze-and-excitation operation.
The gated fusion between $F_{\mathrm{RGBC}}^{i}$ and $F_{\mathrm{SE}}^{i}$ is formulated as follows:

$$G_{1}^{i} = \sigma\left(\mathrm{Conv}_{1\times 1}\left(\mathrm{Concat}(F_{\mathrm{RGBC}}^{i}, F_{\mathrm{SE}}^{i})\right)\right), \quad i = 1, 2,$$

where $\mathrm{Conv}_{1\times 1}$ denotes the learnable convolution operation with kernel size $1 \times 1$, $\sigma(\cdot)$ is the sigmoid activation function, and $\mathrm{Concat}(\cdot,\cdot)$ represents the concatenation operation along the channel dimension. The output $G_{1}^{i}$ denotes the spatial attention weights.

The first fused feature output $F_{\mathrm{fuse1}}^{i}$ is defined as shown below:

$$F_{\mathrm{fuse1}}^{i} = G_{1}^{i} \cdot F_{\mathrm{RGBC}}^{i} + (1 - G_{1}^{i}) \cdot F_{\mathrm{SE}}^{i}, \quad i = 1, 2.$$

Then, the gated fusion between $F_{\mathrm{TC}}^{i}$ and $F_{\mathrm{SE}}^{i}$ is formulated as follows:

$$G_{2}^{i} = \sigma\left(\mathrm{Conv}_{1\times 1}\left(\mathrm{Concat}(F_{\mathrm{TC}}^{i}, F_{\mathrm{SE}}^{i})\right)\right), \quad i = 1, 2.$$

The second fused feature output $F_{\mathrm{fuse2}}^{i}$ is defined as shown below:

$$F_{\mathrm{fuse2}}^{i} = G_{2}^{i} \cdot F_{\mathrm{TC}}^{i} + (1 - G_{2}^{i}) \cdot F_{\mathrm{SE}}^{i}, \quad i = 1, 2.$$
After that, $F_{\mathrm{fuse1}}^{i}$ and $F_{\mathrm{fuse2}}^{i}$ are respectively passed through a Convblock module to generate the refined features $F_{\mathrm{fuseC1}}^{i}$ and $F_{\mathrm{fuseC2}}^{i}$. The calculation formulas are as follows:

$$F_{\mathrm{fuseC1}}^{i} = \mathrm{ReLU}\left(\mathrm{Conv}_{1\times 1}(F_{\mathrm{fuse1}}^{i}) + \mathrm{BasicConv}(F_{\mathrm{fuse1}}^{i})\right), \quad i = 1, 2,$$

$$\mathrm{BasicConv}(F_{\mathrm{fuse1}}^{i}) = \mathrm{BN}\left(\mathrm{Conv}_{3\times 3}\left(\mathrm{ReLU}\left(\mathrm{BN}\left(\mathrm{Conv}_{3\times 3}(F_{\mathrm{fuse1}}^{i})\right)\right)\right)\right), \quad i = 1, 2,$$

$$F_{\mathrm{fuseC2}}^{i} = \mathrm{ReLU}\left(\mathrm{Conv}_{1\times 1}(F_{\mathrm{fuse2}}^{i}) + \mathrm{BasicConv}(F_{\mathrm{fuse2}}^{i})\right), \quad i = 1, 2,$$

$$\mathrm{BasicConv}(F_{\mathrm{fuse2}}^{i}) = \mathrm{BN}\left(\mathrm{Conv}_{3\times 3}\left(\mathrm{ReLU}\left(\mathrm{BN}\left(\mathrm{Conv}_{3\times 3}(F_{\mathrm{fuse2}}^{i})\right)\right)\right)\right), \quad i = 1, 2.$$

The gated fusion between $F_{\mathrm{fuseC1}}^{i}$ and $F_{\mathrm{fuseC2}}^{i}$ is formulated as follows:

$$G_{3}^{i} = \sigma\left(\mathrm{Conv}_{1\times 1}\left(\mathrm{Concat}(F_{\mathrm{fuseC1}}^{i}, F_{\mathrm{fuseC2}}^{i})\right)\right), \quad i = 1, 2.$$

The final output of the fused feature $F_{\mathrm{fuse}}^{i}$ ($i \in \{1, 2\}$) is computed using a weighted combination of elements:

$$F_{\mathrm{fuse}}^{i} = G_{3}^{i} \cdot F_{\mathrm{fuseC1}}^{i} + (1 - G_{3}^{i}) \cdot F_{\mathrm{fuseC2}}^{i}, \quad i = 1, 2.$$
The key innovation of GCA lies in its gating mechanism, which allows the network to dynamically adjust the fusion ratio of the two modalities based on spatial and semantic contexts. By learning position-specific gating weights, the module can adaptively balance modality contributions, enabling the network to fully exploit complementary information from both RGB and thermal sources. This results in a more expressive and context-aware multi-modal feature representation, which is critical for precise crack segmentation under diverse environmental conditions.
In addition to this core gating mechanism, the GCA module also integrates Squeeze-and-Excitation Attention to refine the coarse fused features. It implements a Squeeze-and-Excitation mechanism that extracts global channel statistics through average pooling and applies adaptive channel weighting via a two-layer bottleneck structure. This allows the model to enhance informative channels and suppress irrelevant ones, improving the quality of the joint representation.
To enable fine-grained interactions, the GCA module integrates a hierarchical gated fusion strategy. This strategy employs three independent GatedFusion blocks, each generating spatially adaptive gates via 1 × 1 convolutions followed by sigmoid activations. These gates modulate the contribution of each modality at the pixel level. The feature fusion process of the GCA module is conducted in three stages to progressively refine feature integration. The first two stages generate fusion maps between each modality and the interactive feature F SE i , while the final stage aggregates the outputs of the preceding two stages to enhance the overall fusion quality. This multi-stage design facilitates precise feature alignment and fusion, allowing the model to capture subtle inter-modal dependencies. The final output is produced by aggregating the multiple gated results. Figure 3 presents the visualization of the GCA module before and after fusion on the asphalt pavement crack detection dataset.
Both SE-Block and CBAM are designed to enhance feature representation through attention mechanisms. SE-Block focuses on channel-wise recalibration via global pooling and channel weighting, while CBAM extends this idea by integrating spatial attention to simultaneously capture informative channels and spatial regions. GCA, on the other hand, not only incorporates the channel weighting concept of SE-Block but also introduces a gating mechanism and multi-feature fusion strategy, enabling dynamic cross-modal feature selection. Therefore, compared with SE-Block and CBAM, GCA shares the common goal of feature refinement but places greater emphasis on cross-modal fusion and adaptive feature integration.
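The sketch below assembles the GCA computation of the equations above in PyTorch, combining the Convblock, a standard squeeze-and-excitation block [40], and the three gated fusion steps. The channel sizes, the SE reduction ratio, and the use of one-channel spatial gates are assumptions made for illustration; the authors' implementation may differ in these details.

```python
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    # Same Convblock reading as in the earlier sketch.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.shortcut = nn.Conv2d(in_ch, out_ch, 1)
        self.basic = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False), nn.BatchNorm2d(out_ch))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.shortcut(x) + self.basic(x))


class SEBlock(nn.Module):
    # Squeeze-and-excitation [40]: global pooling plus a two-layer bottleneck for channel weighting.
    def __init__(self, ch, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(x)


class GatedFusion(nn.Module):
    # Spatially adaptive gate: 1x1 convolution on the concatenation followed by a sigmoid.
    def __init__(self, ch):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * ch, 1, 1), nn.Sigmoid())

    def forward(self, a, b):
        g = self.gate(torch.cat([a, b], dim=1))
        return g * a + (1 - g) * b


class GCA(nn.Module):
    """Sketch of the Gate Control Attention module (Section 2.1.1, Figure 2)."""

    def __init__(self, ch):
        super().__init__()
        self.rgb_block, self.t_block = ConvBlock(ch, ch), ConvBlock(ch, ch)
        self.joint_block, self.se = ConvBlock(2 * ch, ch), SEBlock(ch)
        self.fuse1_block, self.fuse2_block = ConvBlock(ch, ch), ConvBlock(ch, ch)
        self.gate1, self.gate2, self.gate3 = GatedFusion(ch), GatedFusion(ch), GatedFusion(ch)

    def forward(self, f_rgb, f_t):
        f_rgbc, f_tc = self.rgb_block(f_rgb), self.t_block(f_t)
        f_se = self.se(self.joint_block(torch.cat([f_rgb, f_t], dim=1)))   # Concat -> Convblock -> SE
        f_fuse1 = self.fuse1_block(self.gate1(f_rgbc, f_se))               # gate RGB features against F_SE
        f_fuse2 = self.fuse2_block(self.gate2(f_tc, f_se))                 # gate thermal features against F_SE
        return self.gate3(f_fuse1, f_fuse2)                                # final gated aggregation


if __name__ == "__main__":
    out = GCA(64)(torch.randn(1, 64, 120, 160), torch.randn(1, 64, 120, 160))
    print(out.shape)  # torch.Size([1, 64, 120, 160])
```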

2.1.2. Attention Feature Fusion

In the last two stages of the encoder, we adopt the Attentional Feature Fusion (AFF) module proposed by Dai et al. [41], as illustrated in Figure 4. The module begins by performing an element-wise addition of RGB and thermal features to generate an initial fused representation. Then, AFF incorporates two parallel attention branches—local and global—to enhance the feature fusion process.
The local branch employs two successive 1 × 1 convolutions to capture nonlinear relationships between channels, while the global branch uses global average pooling to extract channel-wise contextual information, which is followed by a dimension restoration operation. The outputs from the two branches are summed and passed through a Sigmoid activation to form a fusion attention map, which dynamically evaluates the importance of each channel.
To adaptively fuse the input features from the two sources, the AFF module computes an attention weight $W^{i}$ based on both local and global contexts. Let $F_{\mathrm{RGB}}^{i}$ and $F_{\mathrm{T}}^{i}$ denote the input features of the RGB and T branches at the $i$-th layer ($i \in \{3, 4\}$), respectively. First, the input features $F_{\mathrm{RGB}}^{i}$ and $F_{\mathrm{T}}^{i}$ are added element-wise to obtain $F_{\mathrm{add}}^{i}$:

$$F_{\mathrm{add}}^{i} = F_{\mathrm{RGB}}^{i} + F_{\mathrm{T}}^{i}, \quad i = 3, 4.$$

Then, $F_{\mathrm{add}}^{i}$ passes through a local attention branch and a global attention branch. The local attention path applies two $1 \times 1$ convolutions with batch normalization and ReLU activation:

$$F_{l}^{i} = \mathrm{BN}\left(\mathrm{Conv}_{1\times 1}\left(\mathrm{ReLU}\left(\mathrm{BN}\left(\mathrm{Conv}_{1\times 1}(F_{\mathrm{add}}^{i})\right)\right)\right)\right), \quad i = 3, 4,$$

where BN is batch normalization, $\mathrm{Conv}_{1\times 1}$ denotes a learnable $1 \times 1$ convolution operation, and ReLU is the rectified linear unit activation function. The global attention branch performs global average pooling followed by a similar two-layer bottleneck:

$$F_{g}^{i} = \mathrm{BN}\left(\mathrm{Conv}_{1\times 1}\left(\mathrm{ReLU}\left(\mathrm{BN}\left(\mathrm{Conv}_{1\times 1}\left(\mathrm{GAP}(F_{\mathrm{add}}^{i})\right)\right)\right)\right)\right), \quad i = 3, 4,$$

where $\mathrm{GAP}(\cdot)$ denotes global average pooling.

The outputs of both attention branches are aggregated to produce the fused attention descriptor $F_{lg}^{i}$:

$$F_{lg}^{i} = F_{l}^{i} + F_{g}^{i}, \quad i = 3, 4.$$

Finally, a sigmoid activation is applied to obtain the attention weight $W^{i}$:

$$W^{i} = \sigma(F_{lg}^{i}), \quad i = 3, 4,$$

where $\sigma(\cdot)$ is the sigmoid function.

The final fused output $F_{\mathrm{fuse}}^{i}$ ($i \in \{3, 4\}$) is computed using the following asymmetric weighted formula:

$$F_{\mathrm{fuse}}^{i} = F_{\mathrm{RGB}}^{i} \cdot W^{i} + F_{\mathrm{T}}^{i} \cdot (1 - W^{i}), \quad i = 3, 4.$$
This attention-driven mechanism reflects the theory of selective response in deep learning, effectively boosting the model’s ability to identify crack-related regions across modalities. Figure 5 presents the visualization of the AFF module before and after fusion on the asphalt pavement crack detection dataset.
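A compact PyTorch sketch of this AFF computation is given below. The placement of batch normalization follows the equations above, while the bottleneck reduction ratio r is an assumption of this sketch.

```python
import torch
import torch.nn as nn


class AFF(nn.Module):
    """Sketch of the Attentional Feature Fusion module [41] as used in stages 3-4
    (Section 2.1.2, Figure 4)."""

    def __init__(self, ch, r=4):
        super().__init__()
        inter = max(ch // r, 1)
        # Local branch: two 1x1 convolutions applied at every spatial position.
        self.local_att = nn.Sequential(
            nn.Conv2d(ch, inter, 1), nn.BatchNorm2d(inter), nn.ReLU(inplace=True),
            nn.Conv2d(inter, ch, 1), nn.BatchNorm2d(ch))
        # Global branch: global average pooling followed by the same two-layer bottleneck.
        self.global_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, inter, 1), nn.BatchNorm2d(inter), nn.ReLU(inplace=True),
            nn.Conv2d(inter, ch, 1), nn.BatchNorm2d(ch))
        self.sigmoid = nn.Sigmoid()

    def forward(self, f_rgb, f_t):
        f_add = f_rgb + f_t                                               # initial element-wise fusion
        w = self.sigmoid(self.local_att(f_add) + self.global_att(f_add))  # W = sigma(F_l + F_g)
        return f_rgb * w + f_t * (1 - w)                                  # asymmetric weighted fusion


if __name__ == "__main__":
    out = AFF(256)(torch.randn(2, 256, 30, 40), torch.randn(2, 256, 30, 40))
    print(out.shape)  # torch.Size([2, 256, 30, 40])
```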

2.2. Structure-Guided Crack Segmentation

To reconstruct high-resolution semantic features and recover spatial details lost during the encoding process, a hierarchical decoder with edge-guided fusion is proposed. The decoder adopts a symmetric top–down architecture composed of four upsampling stages. Each stage consists of an upsampling convolution module followed by a nonlinear transformation block and, where applicable, a skip connection that fuses encoder features of the corresponding scale. This progressive decoding process gradually restores low-level textures and high-level semantics, enabling the precise spatial localization of cracks.
At each stage, the decoder outputs an intermediate feature map with a different semantic abstraction level. These decoded features are then passed to a dedicated fusion and prediction head consisting of four parallel branches. Each branch first reduces the channel dimension of the decoded feature via a convolutional projection and then upsamples the feature map to the original resolution. A corresponding edge prediction branch then generates a crack edge probability map $P_{\mathrm{Edge}}^{i}$. The predicted edge map is compared with the ground truth edge map obtained via the Canny operator, and a mean squared error (MSE) loss is computed. The resulting edge losses from multiple layers are normalized to compute layer-wise fusion weights, which are concatenated with the upsampled feature and processed by a $1 \times 1$ convolution to produce the segmentation predictions $P_{\mathrm{seg}}^{i}$. This process is applied independently across all four levels, resulting in candidate segmentation outputs $P_{\mathrm{seg}}^{i}$ and $P_{\mathrm{Edge}}^{i}$ ($i \in \{1, 2, 3, 4\}$).
To guide the segmentation process with fine-grained structural awareness, an edge-aware supervision strategy is employed. The ground-truth edge map is obtained using the Canny operator and transformed into a two-channel binary mask representing edge and non-edge regions. The predicted edge maps are supervised using the mean squared error (MSE) loss. Let $L^{i}$ denote the weighted loss at the $i$-th level, which is computed as follows:

$$L^{i} = 1 - L_{\mathrm{MSE}}^{i}, \quad i = 1, 2, 3, 4,$$

$$L_{\mathrm{MSE}}^{i} = \frac{1}{N} \sum_{k=1}^{N} \left\| P_{\mathrm{Edge}}^{ki} - Y_{\mathrm{Edge}}^{k} \right\|^{2}, \quad i = 1, 2, 3, 4,$$

where $L_{\mathrm{MSE}}^{i}$ denotes the MSE loss of the crack edge prediction result at the $i$-th level, $N$ is the number of training samples, $P_{\mathrm{Edge}}^{ki}$ represents the predicted edge result of the $i$-th level for the $k$-th sample, and $Y_{\mathrm{Edge}}^{k}$ is the ground truth edge label of the $k$-th sample.
The weighted loss values of all levels are summed to obtain the total weighted loss:

$$L_{t} = \sum_{i=1}^{n} L^{i}, \quad n = 4.$$

Then, the weight of the crack prediction result at the $i$-th level is calculated:

$$\omega^{i} = \frac{L^{i}}{L_{t}}, \quad i = 1, 2, 3, 4.$$

After that, for the input RGB image and thermal image of the object to be detected, the final crack prediction result is obtained based on these weights:

$$P = \sum_{i=1}^{n} \omega^{i} \cdot P_{\mathrm{seg}}^{i}, \quad n = 4.$$
The visualization comparison before and after adaptive fusion is illustrated in Figure 6. This edge-guided weighting mechanism ensures that feature maps with more accurate edge predictions contribute more significantly to the final result. The combination of hierarchical decoding, skip connections, edge prediction, and adaptive fusion allows the network to effectively capture both semantic context and spatial precision, which is essential for accurate crack detection in complex environments.
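The sketch below illustrates this edge-guided adaptive fusion: a two-channel edge/non-edge target is generated with the Canny operator (via OpenCV), per-level MSE values are turned into scores $L^{i} = 1 - L_{\mathrm{MSE}}^{i}$, and the four segmentation maps are combined with the normalized weights. The Canny thresholds, tensor shapes, and toy inputs are illustrative assumptions, not values from the paper.

```python
import cv2
import numpy as np
import torch
import torch.nn.functional as F


def canny_edge_target(mask, low=50, high=150):
    """Two-channel edge/non-edge target from a binary crack mask (H, W) with values in {0, 1}."""
    edge = cv2.Canny((mask * 255).astype(np.uint8), low, high) > 0
    target = np.stack([edge, ~edge]).astype(np.float32)       # channel 0: edge, channel 1: non-edge
    return torch.from_numpy(target)


def edge_weighted_fusion(seg_preds, edge_preds, edge_target, eps=1e-6):
    """Weight each level's segmentation map by the quality of its edge prediction
    (L_i = 1 - MSE_i, w_i = L_i / sum_j L_j, P = sum_i w_i * P_seg_i)."""
    mse = torch.stack([F.mse_loss(p, edge_target.expand_as(p)) for p in edge_preds])
    level_scores = 1.0 - mse                                   # L_i
    weights = level_scores / (level_scores.sum() + eps)        # w_i
    fused = sum(w * p for w, p in zip(weights, seg_preds))     # P
    return fused, weights


if __name__ == "__main__":
    gt_mask = np.zeros((480, 640), dtype=np.uint8)
    gt_mask[200:210, 100:500] = 1                              # toy crack region
    edge_t = canny_edge_target(gt_mask).unsqueeze(0)           # (1, 2, H, W)
    seg = [torch.rand(1, 1, 480, 640) for _ in range(4)]       # four per-level segmentation maps
    edge = [torch.rand(1, 2, 480, 640) for _ in range(4)]      # four per-level edge maps
    fused, w = edge_weighted_fusion(seg, edge, edge_t)
    print(fused.shape, w)
```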

2.3. Loss Function

To enhance the accuracy and robustness of crack segmentation under complex surface conditions, a hybrid loss function is designed to guide both semantic prediction and boundary localization. The proposed loss function integrates three complementary components: an edge-guided loss, a multi-scale focal loss, and an adaptive fusion loss. These components jointly supervise the hierarchical outputs of the decoder and enforce consistency between segmentation and edge representations.

2.3.1. Edge-Guided Loss

Each intermediate decoded feature is associated with a corresponding edge prediction map, which is supervised using the mean squared error (MSE) loss, and its formulation is provided in Section 2.2. The ground-truth edge map is generated from the segmentation label using a Canny operator and then split into two binary channels representing edge and non-edge regions.
The total edge loss is the sum over all levels:
$$L_{\mathrm{Edge}} = \sum_{i=1}^{n} L_{\mathrm{MSE}}^{i}, \quad n = 4.$$

2.3.2. Multi-Scale Focal Loss

To address the common issue of class imbalance in crack segmentation, especially when cracks occupy only a small portion of the image, the focal loss is employed at each level of the segmentation outputs. The formula for the focal loss is as follows:
$$L_{\mathrm{Focal}}^{(i)} = -\frac{1}{N} \sum_{k=1}^{N} \left[ \alpha \left(1 - p_{k}^{(i)}\right)^{\gamma} y_{k}^{(i)} \log\left(p_{k}^{(i)}\right) + (1 - \alpha) \left(p_{k}^{(i)}\right)^{\gamma} \left(1 - y_{k}^{(i)}\right) \log\left(1 - p_{k}^{(i)}\right) \right], \quad i = 1, 2, 3, 4,$$

where $N$ is the number of training samples, $\alpha$ is the weighting factor, $\gamma$ is the focusing parameter, $p_{k}^{(i)}$ is the predicted crack probability of the $k$-th sample at the $i$-th level, and $y_{k}^{(i)}$ is the corresponding ground truth label.

The segmentation loss for each level is computed as shown below:

$$L_{\mathrm{seg}}^{i} = L_{\mathrm{Focal}}^{(i)}\left(P_{\mathrm{seg}}^{i}, Y\right), \quad i = 1, 2, 3, 4.$$

The aggregated segmentation loss across all hierarchical outputs is as follows:

$$L_{\mathrm{seg}} = \sum_{i=1}^{n} L_{\mathrm{seg}}^{i}, \quad n = 4.$$
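A minimal sketch of the multi-scale focal loss is shown below, using the $\alpha = 0.25$ and $\gamma = 2.0$ values selected in Section 3.2. The sigmoid on the logits, the pixel-wise averaging, and the optional resizing to the label resolution are assumptions of this sketch rather than details stated in the paper.

```python
import torch


def binary_focal_loss(pred_prob, target, alpha=0.25, gamma=2.0, eps=1e-7):
    """Pixel-wise binary focal loss matching the form in Section 2.3.2."""
    p = pred_prob.clamp(eps, 1.0 - eps)
    pos = alpha * (1 - p) ** gamma * target * torch.log(p)
    neg = (1 - alpha) * p ** gamma * (1 - target) * torch.log(1 - p)
    return -(pos + neg).mean()


def multiscale_focal_loss(seg_logits, target, **kwargs):
    # L_seg: sum over the four hierarchical outputs, each resized to the label resolution.
    total = 0.0
    for logits in seg_logits:
        logits = torch.nn.functional.interpolate(
            logits, size=target.shape[-2:], mode="bilinear", align_corners=False)
        total = total + binary_focal_loss(torch.sigmoid(logits), target, **kwargs)
    return total


if __name__ == "__main__":
    y = (torch.rand(2, 1, 480, 640) > 0.97).float()            # sparse toy crack labels
    preds = [torch.randn(2, 1, 480, 640) for _ in range(4)]    # four per-level logits
    print(multiscale_focal_loss(preds, y).item())
```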

2.3.3. Adaptive Fusion Loss

The fusion output $P$ obtained in Section 2.2 is further supervised using the focal loss:

$$L_{\mathrm{fuse}} = L_{\mathrm{Focal}}(P, Y).$$

2.3.4. Total Loss

The total loss function is defined as shown below:
$$L_{\mathrm{total}} = L_{\mathrm{Edge}} + L_{\mathrm{seg}} + L_{\mathrm{fuse}}.$$
This composite loss ensures that edge structures are preserved, segmentation outputs are robust to class imbalance, and multi-level predictions are effectively fused under structural guidance.

3. Experiment

In this section, extensive experiments are performed on public datasets to verify the performance of the proposed method.

3.1. Dataset

Experimental evaluations were performed on two publicly available datasets: the asphalt pavement crack detection dataset and the Crack900 dataset. The former consists of crack imagery collected from asphalt pavement surfaces, whereas the latter is composed of crack samples acquired from masonry wall structures.
Asphalt pavement crack detection dataset (available at: https://github.com/lfangyu09/IR-Crack-detection (accessed on 1 December 2024)): This dataset is a public benchmark specifically curated for crack segmentation based on infrared thermography [42]. It is an RGB-T asphalt pavement crack segmentation dataset and includes four categories of images: RGB, thermal, RGB-T fused images (created by combining both modalities equally using IR-Fusion™ technology), and manually annotated ground truth masks created using Photoshop. Each category contains 448 images, all with a uniform resolution of 640 × 480 pixels. For experimental purposes, the dataset is divided into two subsets: 358 samples for training and 90 for testing. The dataset was collected across three different time periods, including morning (8:00 a.m.), noon (12:00 p.m.), and dusk (5:00 p.m.). The maximum pavement temperature was nearly identical to the daily maximum temperature, while the temperature at dusk was slightly higher than in the morning. At noon, some crack regions in the images exhibited similar temperatures to the background, whereas the distinction between cracks and background was more pronounced in dusk and morning images. Since shadows from guardrails and trees may compromise image quality, all images were captured on sidewalks without guardrails or trees to minimize environmental interference.
Crack900 dataset (available at: https://data.mendeley.com/datasets/kz84t85z66/1 (accessed on 1 August 2025)): This dataset comprises RGB, thermal, and RGB-T fused images with each category containing 1081 images [43]. The dataset was randomly partitioned into a training set (80%) and a test set (20%), resulting in 731 image groups for training and 183 image groups for validation. All images have a resolution of 288 × 384 pixels. Additionally, data augmentation techniques—including RandomFlip, RandomRotate, and RandomCrop—were applied to expand the dataset. The dataset was collected from masonry scenes containing cracks. Masonry structures are characterized by more complex textures and noise caused by mortar joints, which often resemble cracks and can easily lead to false detections. Training and testing solely on images with highly regular brick shapes, sizes, and colors may result in high accuracy within such constrained scenarios but cause a significant performance drop when applied to other environments. To avoid this limitation, the dataset was collected in scenes with diverse brick patterns, ensuring greater variability and robustness in crack segmentation.
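Because the RGB image, thermal image, and label mask must stay spatially aligned, the augmentations mentioned above (RandomFlip, RandomRotate, RandomCrop) have to share their random parameters across the three inputs. The sketch below shows one way to do this with torchvision; the flip probability, rotation range, and crop size are illustrative assumptions, not values from the paper.

```python
import random

import torchvision.transforms as T
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode


def paired_augment(rgb, thermal, mask, crop_size=(256, 256), max_rotation=15):
    """Apply identical RandomFlip / RandomRotate / RandomCrop parameters to the
    RGB image, thermal image, and label mask (PIL images of equal size)."""
    if random.random() < 0.5:                                   # shared horizontal flip
        rgb, thermal, mask = TF.hflip(rgb), TF.hflip(thermal), TF.hflip(mask)
    angle = random.uniform(-max_rotation, max_rotation)         # shared rotation angle
    rgb = TF.rotate(rgb, angle, interpolation=InterpolationMode.BILINEAR)
    thermal = TF.rotate(thermal, angle, interpolation=InterpolationMode.BILINEAR)
    mask = TF.rotate(mask, angle, interpolation=InterpolationMode.NEAREST)  # keep labels binary
    i, j, h, w = T.RandomCrop.get_params(rgb, output_size=crop_size)        # shared crop window
    return (TF.crop(rgb, i, j, h, w),
            TF.crop(thermal, i, j, h, w),
            TF.crop(mask, i, j, h, w))


if __name__ == "__main__":
    from PIL import Image
    rgb = Image.new("RGB", (384, 288))
    thermal = Image.new("RGB", (384, 288))
    mask = Image.new("L", (384, 288))
    out = paired_augment(rgb, thermal, mask)
    print([im.size for im in out])  # three (256, 256) crops
```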

3.2. Implementation Details

The proposed method is implemented with the PyTorch framework (version 1.12.1), using the Adam optimizer with a weight decay of $1 \times 10^{-4}$. Training is performed with a batch size of 4 over 200 epochs. All experiments are conducted on a server equipped with an NVIDIA GeForce RTX 3090 GPU. The settings of the experimental parameters are presented in Table 1. The total training time of our model is approximately 200 min; however, based on multiple experimental trials, the model achieves optimal performance within 170 min. The inference process for obtaining both visual results and accuracy metrics takes approximately 30 s. The overall efficiency demonstrates strong applicability for practical crack detection tasks.
To investigate the impact of the initial learning rate on model performance, we conducted a set of controlled experiments on the asphalt pavement crack detection dataset by varying the learning rate while keeping other hyperparameters fixed. Specifically, four different values were tested: $1 \times 10^{-3}$, $1 \times 10^{-4}$, $1 \times 10^{-5}$, and $1 \times 10^{-6}$. The corresponding segmentation performance is illustrated in Figure 7.
When the initial learning rate was set to $1 \times 10^{-4}$, the model achieved the best overall performance. In contrast, lower learning rates such as $1 \times 10^{-5}$ and $1 \times 10^{-6}$ led to sub-optimal convergence, which is possibly due to insufficient gradient updates that slow down learning and risk becoming stuck in local minima. On the other hand, a higher learning rate of $1 \times 10^{-3}$ resulted in unstable optimization with a relatively lower accuracy and F1-score, which may be attributed to overshooting the optimal solution during backpropagation.
The learning rate controls the step size of gradient descent: too small a value reduces convergence efficiency, while too large a value tends to cause unstable training. Therefore, choosing $1 \times 10^{-4}$ as the initial learning rate achieves a good balance between stability and convergence speed, ultimately resulting in superior generalization performance.
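For reference, a minimal training-loop sketch with the hyperparameters reported above (Adam, initial learning rate $1 \times 10^{-4}$, weight decay $1 \times 10^{-4}$, batch size 4, 200 epochs) is given below; the model, dataset, and loss are placeholders standing in for SPMFNet, the RGB-T dataset loader, and the loss of Section 2.3.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the RGB-T crack dataset; replace with the real data loader.
rgb = torch.randn(8, 3, 480, 640)
thermal = torch.randn(8, 3, 480, 640)
labels = (torch.rand(8, 1, 480, 640) > 0.97).float()
loader = DataLoader(TensorDataset(rgb, thermal, labels), batch_size=4, shuffle=True)

model = torch.nn.Conv2d(6, 1, 3, padding=1)   # placeholder model standing in for SPMFNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)

for epoch in range(200):
    for x_rgb, x_t, y in loader:
        optimizer.zero_grad()
        pred = model(torch.cat([x_rgb, x_t], dim=1))
        # Stand-in objective; the paper uses L_total = L_Edge + L_seg + L_fuse (Section 2.3).
        loss = torch.nn.functional.binary_cross_entropy_with_logits(pred, y)
        loss.backward()
        optimizer.step()
```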
An experimental analysis is conducted on the parameter settings of α and γ in the focal loss. First, γ is fixed at 2.0, and α is set to 0.05, 0.15, 0.25, 0.35, and 0.45 for five separate experiments, as illustrated in Figure 8. Subsequently, α is fixed at 0.25, and γ is set to 1.0, 1.5, 2.0, 2.5, and 3.0 for another five experiments, as shown in Figure 9.
The weights of the three loss components were systematically evaluated using seven different combinations: (0.5, 1, 1), (1, 0.5, 1), (1, 1, 0.5), (1, 0.5, 0.5), (0.5, 1, 0.5), (0.5, 0.5, 1), and (1, 1, 1). The corresponding results are presented as a bar chart in Figure 10, which shows that the model performs optimally when the weights of the three components of the loss function are set to (1, 1, 1).
As shown in Figure 8 and Figure 9, the model achieves optimal performance when α is set to 0.25 and γ is set to 2.0. Therefore, setting α to 0.25 and γ to 2.0 yields the best segmentation performance.
Ablation experiments are conducted on the three loss components to validate their individual contributions. The results are presented in Table 2.
As shown in Table 2, removing any component of the loss function results in a decline in the segmentation performance of the model, indicating that each loss component is essential.

3.3. Evaluation Metrics

Crack detection and segmentation constitute a binary classification problem; common evaluation metrics include precision, specificity, recall, kappa, overall accuracy (OA), mean intersection over union (mIoU), and F1-score. Precision refers to the ratio of true positive samples to all predicted positive samples, reflecting the false positive rate. Specificity refers to the proportion of true negative samples correctly predicted out of all actual negative samples. Recall refers to the ratio of correctly predicted positive samples to all actual positive samples, reflecting the false negative rate. Kappa is an index used to measure the consistency between model prediction results and actual classification results, reflecting the classification performance. IoU is the intersection over union of the predicted map and the ground truth map, while mIoU is the mean IoU across all categories. F1-score is the harmonic mean of precision and recall, which takes into account both false positive and false negative rates, and it can better reflect the detection capability of the method used. Overall accuracy reflects the proportion of correctly classified pixels to the total pixels. Higher values for these metrics indicate superior segmentation performance. Since OA, F1, and mIoU can better reflect segmentation recognition capability, this paper mainly refers to these three metrics to compare the effectiveness of various methods. The expressions for each metric are as follows:
$$\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}},$$

$$\mathrm{Specificity} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}},$$

$$\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}},$$
where TP represents the number of true positives (positive samples correctly predicted as positive), FN the number of false negatives (positive samples incorrectly predicted as negative), FP the number of false positives (negative samples incorrectly predicted as positive), and TN the number of true negatives (negative samples correctly predicted as negative).
$$\mathrm{Kappa} = \frac{p_{0} - p_{e}}{1 - p_{e}},$$

where $p_{0}$ is the overall accuracy, and $p_{e}$ is the sum over all categories of the product of actual and predicted sample counts divided by the square of the total number of samples.
$$\mathrm{OA} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{FN} + \mathrm{FP} + \mathrm{TN}},$$

$$\mathrm{mIoU} = \frac{1}{n} \sum_{l=1}^{n} \frac{\mathrm{TP}_{l}}{\mathrm{TP}_{l} + \mathrm{FP}_{l} + \mathrm{FN}_{l}},$$
where n represents the number of categories.
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
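The following sketch computes these metrics from binary prediction and ground-truth maps; treating mIoU as the mean over the crack and background classes and using the standard Cohen's kappa chance-agreement term are assumptions consistent with the definitions above.

```python
import numpy as np


def crack_metrics(pred, gt, eps=1e-9):
    """Compute the metrics of Section 3.3 from binary maps with values in {0, 1}."""
    pred, gt = pred.astype(bool).ravel(), gt.astype(bool).ravel()
    tp = np.sum(pred & gt)
    tn = np.sum(~pred & ~gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    n = tp + tn + fp + fn
    precision = tp / (tp + fp + eps)
    specificity = tn / (tn + fp + eps)
    recall = tp / (tp + fn + eps)
    oa = (tp + tn) / n
    f1 = 2 * precision * recall / (precision + recall + eps)
    # mIoU over the two classes (crack and background).
    iou_crack = tp / (tp + fp + fn + eps)
    iou_bg = tn / (tn + fp + fn + eps)
    miou = (iou_crack + iou_bg) / 2
    # Cohen's kappa: p0 is the observed accuracy, pe the chance agreement.
    p0 = oa
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n ** 2)
    kappa = (p0 - pe) / (1 - pe + eps)
    return dict(precision=precision, specificity=specificity, recall=recall,
                kappa=kappa, OA=oa, mIoU=miou, F1=f1)


if __name__ == "__main__":
    pred = np.random.rand(480, 640) > 0.95
    gt = np.random.rand(480, 640) > 0.95
    print(crack_metrics(pred, gt))
```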

3.4. Ablation Study

We conducted ablation studies on the asphalt pavement crack detection dataset to validate the effectiveness of each component, with each fusion module in the model assessed individually. The edge loss function was also evaluated separately; the quantitative results are reported in Table 3, and the visualization results are presented in Figure 11. To show the differences more clearly, local regions of the images are enlarged.

3.5. Comparison Experiments with Different Methods

To validate the effectiveness of the proposed method, we compared SPMFNet with eight advanced approaches on the two datasets. Among these, U-Net [7] and CrackFormer-II [11] use only RGB images or infrared images. CAINet [44], MCNet [45], SFAF-MA [46], SGFNet [47], GMNet [30], and IRFusionFormer [48] employ both infrared and RGB images as inputs.
(1) Comparison on the asphalt pavement crack detection dataset: The results in Table 4 indicate that the RGB-T integrated model outperforms the model that uses only RGB images, and both are superior to the model that uses only infrared images on the asphalt pavement crack detection dataset. As shown in Figure 12, on the test dataset, the proposed SPMFNet achieved the best results across the evaluation metrics, outperforming the second-best model SGFNet in precision, specificity, recall, kappa, OA, mIoU, and F1 by 3.83%, 0.39%, 1.36%, 1.28%, 0.52%, 1.45%, and 1.92%, respectively. To show the differences more clearly, local regions of the images are enlarged. From the figures, it can be observed that SPMFNet still produces accurate segmentation results under severe noise conditions, such as shadows, paint-like interference, and large-area speckles. For fine cracks, SPMFNet also achieves high-precision results.
(2) Comparison on the Crack900 dataset: As presented in Table 5, the proposed SPMFNet achieves the most competitive overall performance on the masonry crack dataset Crack900. As illustrated in Figure 13, SPMFNet demonstrates significantly reduced false positives and missed detections compared to other state-of-the-art network models, highlighting its superior detection capability. From the figures, it can be seen that for fine cracks in masonry scenarios, SPMFNet can accurately segment and restore the true morphology of the cracks.

4. Conclusions

This paper presents an end-to-end edge-guided progressive multi-modal segmentation network for crack detection, which effectively integrates RGB and thermal infrared information through a carefully designed fusion strategy. In the first two encoder layers, we use a GCA module to adaptively fuse shallow features, preserving fine spatial details while reducing noise. In the deeper layers, the AFF module applies local and global attention to weight multi-level features, ensuring a robust fusion of RGB and thermal infrared information. The GCA module enhances shallow feature fusion, while the AFF module refines deeper fusion, enabling the model to capture both spatial details and semantic context for high-precision crack segmentation. The Squeeze-and-Excitation attention and AFF modules used in this paper are existing components; however, by combining them with the other components in a novel manner, we achieve effective feature fusion in the progressive multi-modal feature fusion stage. By incorporating structural priors and edge-aware supervision, the proposed method achieves a high-precision segmentation of fine-grained crack structures and demonstrates strong robustness across diverse scenarios. The proposed method can compensate for the limitations of traditional RGB crack detection methods under adverse weather conditions. Moreover, it demonstrates superior capability in detecting crack edge shapes, which provides significant assistance for subsequent repair work. The experimental results validate its consistent superiority over existing approaches in both accuracy and reliability. In future work, we plan to extend this framework in several directions. For instance, a Transformer network can be incorporated into the model to capture long-range dependencies and contextual relationships between cracks, which is particularly useful for crack detection on large-scale or complex structural surfaces. Self-supervised learning, on the other hand, can improve the model’s adaptability under weak labeling conditions by learning useful representations from unlabeled data, thus reducing the reliance on manually annotated datasets. These advancements are expected to enhance the robustness of the model and improve its generalization ability, especially when dealing with small sample sizes and varying environmental conditions.

Author Contributions

Conceptualization, Z.Y., X.D. and Y.H.; Methodology, Z.Y., X.D. and Y.H.; Software, Z.Y., X.D. and Y.H.; Validation, Z.Y. and X.D.; Formal analysis, Z.Y., X.D. and X.X.; Investigation, Z.Y.; Data curation, Z.Y. and X.D.; Writing—original draft preparation, Z.Y., X.D., Y.H. and W.F.; Writing—review and editing, Z.Y., X.D., Y.H., H.F. and W.F.; Visualization, Z.Y., X.X. and B.Y.; Supervision, Z.Y., X.D. and W.F.; Funding acquisition, X.X., H.F. and B.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Science and Technology Major Project of Changsha (Grant No. KH2401024) and the Research and Development Plan of Key Areas in Hunan Province (Grant No. 2024AQ2017).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study can be downloaded from the URLs in their official websites. The code is accessible from the corresponding author upon reasonable request.

Conflicts of Interest

Authors Zhengrong Yuan, Xinhong Xia, Yibin He, Hui Fang and Bo Yang were employed by the company Hunan Architectural Design Institute Group Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Wang, H.; Wu, G.; Liu, Y. Efficient generative-adversarial U-Net for multi-organ medical image segmentation. J. Imaging 2025, 11, 19. [Google Scholar] [CrossRef]
  2. Dais, D.; Bal, I.E.; Smyrou, E.; Sarhosis, V. Automatic crack classification and segmentation on masonry surfaces using convolutional neural networks and transfer learning. Autom. Constr. 2021, 125, 103606. [Google Scholar] [CrossRef]
  3. Tran, T.S.; Nguyen, S.D.; Lee, H.J.; Tran, V.P. Advanced crack detection and segmentation on bridge decks using deep learning. Constr. Build. Mater. 2023, 400, 132839. [Google Scholar] [CrossRef]
  4. Wang, H.; Li, Y.; Dang, L.M.; Lee, S.; Moon, H. Pixel-level tunnel crack segmentation using a weakly supervised annotation approach. Comput. Ind. 2021, 133, 103545. [Google Scholar] [CrossRef]
  5. Han, C.; Yang, H.; Ma, T.; Wang, S.; Zhao, C.; Yang, Y. Crackdiffusion: A two-stage semantic segmentation framework for pavement crack combining unsupervised and supervised processes. Autom. Constr. 2024, 160, 105332. [Google Scholar] [CrossRef]
  6. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  7. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; Springer: Munich, Germany, 2015; pp. 234–241. [Google Scholar]
  8. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  9. Du, Y.; Zhang, X.; Li, F.; Sun, L. Detection of crack growth in asphalt pavement through use of infrared imaging. Transp. Res. Rec. 2017, 2645, 24–31. [Google Scholar] [CrossRef]
  10. Ma, M.; Lei, Y.; Liu, Y.; Yu, H. An attention-based progressive fusion network for pixelwise pavement crack detection. Measurement 2024, 226, 114159. [Google Scholar] [CrossRef]
  11. Liu, H.; Yang, J.; Miao, X.; Mertz, C. CrackFormer network for pavement crack segmentation. IEEE Trans. Intell. Transp. Syst. 2023, 24, 9240–9252. [Google Scholar] [CrossRef]
  12. Ren, Y.; Huang, J.; Hong, Z.; Lu, W.; Yin, J.; Zou, L.; Shen, X. Image-based concrete crack detection in tunnels using deep fully convolutional networks. Constr. Build. Mater. 2020, 234, 117367. [Google Scholar] [CrossRef]
  13. Liu, Y. DeepLabV3+ Based Mask R-CNN for Crack Detection and Segmentation in Concrete Structures. Int. J. Adv. Comput. Sci. Appl. 2025, 16, 423. [Google Scholar] [CrossRef]
  14. Hou, W.; He, J.; Cui, C.; Zhong, F.; Jiang, X.; Lu, L.; Zhang, J.; Tu, C. Segmentation refinement of thin cracks with minimum strip cuts. Adv. Eng. Inf. 2025, 65, 103249. [Google Scholar] [CrossRef]
  15. Li, S.; Gou, S.; Yao, Y.; Chen, Y.; Wang, X. Physically informed prior and cross-correlation constraint for fine-grained road crack segmentation. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Urumqi, China, 18–20 October 2024; Springer Nature: Singapore, 2024. [Google Scholar]
  16. Yoon, H.; Kim, H.K.; Kim, S. PPDD: Egocentric crack segmentation in the port pavement with deep learning-based methods. Appl. Sci. 2025, 15, 5446. [Google Scholar] [CrossRef]
  17. Sun, W.; Liu, X.; Lei, Z. Research on tunnel crack identification localization and segmentation method based on improved YOLOX and UNETR++. Sensors 2025, 25, 3417. [Google Scholar] [CrossRef]
  18. Zhang, Z.; Zhuang, Y.; Song, W.; Wu, J.; Ye, X.; Zhang, H.; Xu, Y.; Shi, G. ISTD-CrackNet: Hybrid CNN-transformer models focusing on fine-grained segmentation of multi-scale pavement cracks. Measurement 2025, 251, 117215. [Google Scholar] [CrossRef]
  19. Wang, L.; Wu, G.; Tossou, A.I.H.C.F.; Liang, Z.; Xu, J. Segmentation of crack disaster images based on feature extraction enhancement and multi-scale fusion. Earth Sci. Inform. 2025, 18, 55. [Google Scholar] [CrossRef]
  20. Si, J.; Lu, J.; Zhang, Y. An FCN-based segmentation network for fine linear crack detection and measurement in metals. Int. J. Struct. Integr. 2025, 16, 1117–1137. [Google Scholar] [CrossRef]
  21. Wang, Z.; Zeng, Z.; Huang, F.; Sherratt, R.S.; Alfarraj, O.; Tolba, A.; Zhang, J. A U-Net-like full convolutional pavement crack segmentation network based on multi-layer feature fusion. Int. J. Pavement Eng. 2025, 26, 2508919. [Google Scholar] [CrossRef]
  22. Wang, C.; Liu, H.; An, X.; Gong, Z.; Deng, F. DCNCrack: Pavement crack segmentation based on large-scaled deformable convolutional network. J. Comput. Civ. Eng. 2025, 39, 04025009. [Google Scholar] [CrossRef]
  23. Guo, X.; Tang, W.; Wang, H.; Wang, J.; Wang, S.; Qu, X. MorFormer: Morphology-aware transformer for generalized pavement crack segmentation. IEEE Trans. Intell. Transp. Syst. 2025, 26, 8219–8232. [Google Scholar] [CrossRef]
  24. Zeng, L.; Zhang, C.; Cai, S.; Yan, X.; Wang, S. Deep crack segmentation: A semi-supervised approach with coordinate attention and adaptive loss. Meas. Sci. Technol. 2025, 36, 065011. [Google Scholar] [CrossRef]
  25. Liang, F.; Li, Q.; Yu, H.; Wang, W. CrackCLIP: Adapting vision-language models for weakly supervised crack segmentation. Entropy 2025, 27, 127. [Google Scholar] [CrossRef]
  26. Kütük, Z.; Algan, G. Semantic segmentation for thermal images: A comparative survey. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–23 June 2022; pp. 286–295. [Google Scholar]
  27. Wang, Z.; Zhang, H.; Qian, Z.; Chen, L. A complex scene pavement crack semantic segmentation method based on dual-stream framework. Int. J. Pavement Eng. 2023, 24, 2286461. [Google Scholar] [CrossRef]
  28. Ha, Q.; Watanabe, K.; Karasawa, T.; Ushiku, Y.; Harada, T. MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 5108–5115. [Google Scholar]
  29. Zhang, Q.; Zhao, S.; Luo, Y.; Zhang, D.; Huang, N.; Han, J. ABMDRNet: Adaptive-weighted bi-directional modality difference reduction network for RGB-T semantic segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 2633–2642. [Google Scholar]
  30. Zhou, W.; Liu, J.; Lei, J.; Yu, L.; Hwang, J. GMNet: Graded-feature multilabel-learning network for RGB-thermal urban scene semantic segmentation. IEEE Trans. Image Process. 2021, 30, 7790–7802. [Google Scholar] [CrossRef] [PubMed]
  31. Zhou, W.; Dong, S.; Xu, C.; Yaguan, Q. Edge-aware guidance fusion network for RGB-thermal scene parsing. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22–30 January 2022; pp. 3571–3579. [Google Scholar]
  32. Zhou, W.; Zhang, H.; Yan, W.; Lin, W. MMSMCNet: Modal memory sharing and morphological complementary networks for RGB-T urban scene semantic segmentation. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 7096–7108. [Google Scholar] [CrossRef]
  33. Yang, Y.; Shan, C.; Zhao, F.; Liang, W.; Han, J. On exploring shape and semantic enhancements for RGB-X semantic segmentation. IEEE Trans. Intell. Veh. 2024, 9, 2223–2235. [Google Scholar] [CrossRef]
  34. Wang, Q.; Yin, C.; Song, H.; Shen, T.; Gu, Y. UTFNet: Uncertainty-guided trustworthy fusion network for RGB-Thermal semantic segmentation. IEEE Geosci. Remote Sens. Lett. 2023, 20, 7001205. [Google Scholar] [CrossRef]
  35. Chen, H.; Wang, Z.; Qin, H.; Mu, X. DHFNet: Decoupled hierarchical fusion network for RGB-T dense prediction tasks. Neurocomputing 2024, 583, 127594. [Google Scholar] [CrossRef]
  36. Zhao, G.; Huang, J.; Peng, T. Open-vocabulary RGB-Thermal semantic segmentation. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 304–320. [Google Scholar]
  37. Liu, C.; Liu, H.; Ma, H. Implicit alignment and query refinement for RGB-T semantic segmentation. Pattern Recognit. 2026, 169, 111951. [Google Scholar] [CrossRef]
  38. Liu, J.; Liu, H.; Xu, X. MiLNet: Multiplex interactive learning network for RGB-T semantic segmentation. IEEE Trans. Image Process. 2025, 34, 1686–1699. [Google Scholar] [CrossRef]
  39. Tang, L.; Yuan, J.; Ma, J. Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and RGB image fusion network. Inform. Fusion 2022, 82, 28–42. [Google Scholar] [CrossRef]
  40. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 13–18 June 2018; pp. 7132–7141. [Google Scholar]
  41. Dai, Y.; Gieseke, F.; Oehmcke, S.; Wu, Y. Attentional feature fusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021. [Google Scholar]
  42. Liu, F.; Liu, J.; Wang, L. Asphalt pavement crack detection based on convolutional neural network and infrared thermography. IEEE Trans. Intell. Transport. Syst. 2022, 23, 22145–22155. [Google Scholar] [CrossRef]
  43. Huang, H.; Cai, Y.; Zhang, C. Crack detection of masonry structure based on thermal and visible image fusion and semantic segmentation. Autom. Constr. 2024, 158, 105213. [Google Scholar] [CrossRef]
  44. Lv, Y.; Liu, Z.; Li, G. Context-aware interaction network for RGB-T semantic segmentation. IEEE Trans. Multimedia 2024, 26, 6348–6360. [Google Scholar] [CrossRef]
  45. Guo, X.; Liu, T.; Mou, Y.; Chai, S.; Ren, B.; Wang, Y. Transferring prior thermal knowledge for snowy urban scene semantic segmentation. IEEE Trans. Intell. Transp. Syst. 2025, 26, 12474–12487. [Google Scholar] [CrossRef]
  46. He, X.; Wang, M.; Liu, T.; Zhao, L.; Yue, Y. SFAF-MA: Spatial feature aggregation and fusion with modality adaptation for RGB-thermal semantic segmentation. IEEE Trans. Instrum. Meas. 2023, 72, 1–10. [Google Scholar] [CrossRef]
  47. Wang, Y.; Li, G.; Liu, Z. SGFNet: Semantic-guided fusion network for RGB-thermal semantic segmentation. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 7737–7748. [Google Scholar] [CrossRef]
  48. Xiao, R.; Chen, X. IRFusionFormer: Enhancing Pavement Crack Segmentation with RGB-T Fusion and Topological-Based Loss. arXiv 2024, arXiv:2409.20474. [Google Scholar]
Figure 1. The structure of the proposed SPMFNet.
Figure 2. The structure of the proposed GCA.
Figure 3. Comparison of features before and after fusion in the GCA module: (a) original RGB images, (b) original thermal images, (c) label images, (d) RGB features before fusion, (e) thermal features before fusion, and (f) fused features.
Figure 4. The structure of AFF.
Figure 5. Comparison of features before and after fusion in the AFF module: (a) original RGB images, (b) original thermal images, (c) label images, (d) RGB features before fusion, (e) thermal features before fusion, and (f) fused features.
Figure 6. Visualization before and after adaptive fusion. Out1, Out2, Out3, and Out4 represent the corresponding outputs of the four decoder layers.
Figure 7. Initial learning rate analysis.
Figure 8. Focal loss alpha analysis.
Figure 9. Focal loss gamma analysis.
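For readers who wish to reproduce the alpha and gamma analyses shown in Figures 8 and 9, the sketch below gives a standard binary focal loss of the kind commonly used for crack segmentation. It is a generic PyTorch reference implementation, not the authors' code, and the default alpha and gamma values are illustrative placeholders only.

import torch

def binary_focal_loss(logits, targets, alpha=0.75, gamma=2.0, eps=1e-6):
    """Standard binary focal loss: alpha weights the (rare) crack class and
    gamma down-weights easy background pixels. Defaults are illustrative."""
    probs = torch.sigmoid(logits).clamp(eps, 1.0 - eps)
    # p_t is the predicted probability of the true class at each pixel
    p_t = probs * targets + (1.0 - probs) * (1.0 - targets)
    alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
    loss = -alpha_t * (1.0 - p_t).pow(gamma) * torch.log(p_t)
    return loss.mean()

In this formulation, increasing gamma shifts the optimisation toward hard pixels (typically thin crack boundaries), while alpha compensates for the strong foreground/background imbalance, which is what the sweeps in Figures 8 and 9 probe.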
Figure 10. Loss function weight analysis.
Figure 11. Visualization of ablation experiments. The images in rows 2, 4, 6, and 8 present locally magnified views of the respective red-boxed regions in the corresponding images from rows 1, 3, 5, and 7.
Figure 12. Comparison of crack segmentation results from nine models on the asphalt pavement crack detection dataset. The images in rows 2, 4, 6, and 8 present locally magnified views of the respective red-boxed regions in the corresponding images from rows 1, 3, 5, and 7.
Figure 13. Comparison of crack segmentation results from nine models on the Crack900 dataset. The images in rows 2, 4, 6, and 8 present locally magnified views of the respective red-boxed regions in the corresponding images from rows 1, 3, 5, and 7.
Table 1. Experimental parameter settings.
Settings        SPMFNet
Optimizer       Adam
Learning Rate   1 × 10⁻⁴
Epochs          200
Batch Size      4
Image Size      480 × 640 / 288 × 384
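As a convenience for reproduction, the sketch below wires the settings of Table 1 into a minimal PyTorch training loop. The model, dataset, and criterion arguments are placeholders for the components described in the paper, and the per-sample layout (RGB image, thermal image, mask, edge map) is an assumption rather than a documented interface.

import torch
from torch.utils.data import DataLoader

def train(model, dataset, criterion, device="cuda"):
    # Settings from Table 1: Adam optimizer, lr = 1e-4, 200 epochs, batch size 4.
    loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=4)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    model.to(device).train()
    for epoch in range(200):
        for rgb, thermal, mask, edge in loader:   # assumed sample layout
            rgb, thermal = rgb.to(device), thermal.to(device)
            mask, edge = mask.to(device), edge.to(device)
            outputs = model(rgb, thermal)          # multi-scale predictions
            loss = criterion(outputs, mask, edge)  # joint loss (edge + focal + fusion)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()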
Table 2. Contribution analysis of loss function components. The bold entries in the table indicate the best results under each evaluation metric.
Variants     P      Spec   Rec    Kappa  OA     mIoU   F1
w/o loss1    76.15  97.73  92.81  82.25  97.37  84.55  83.66
w/o loss2    55.03  95.32  73.30  59.51  93.72  69.61  62.87
w/o loss3    74.71  97.83  82.15  76.47  96.69  80.38  78.25
SPMFNet      81.45  98.40  89.84  84.23  97.78  86.10  85.44
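Table 2 ablates the three loss terms one at a time (loss1, loss2, and loss3, corresponding to the edge-guided, multi-scale focal, and adaptive fusion components). The sketch below illustrates one plausible way to compose such a weighted joint objective; the helper names, weight symbols, and default weights are assumptions for illustration, not the published formulation.

import torch
import torch.nn.functional as F

def multiscale_focal(outputs, target, focal_fn):
    """Average a focal-type loss over decoder outputs at several scales
    (cf. Out1-Out4 in Figure 6); the per-scale loss is passed in as focal_fn."""
    total = 0.0
    for out in outputs:
        tgt = F.interpolate(target, size=out.shape[-2:], mode="nearest")
        total = total + focal_fn(out, tgt)
    return total / len(outputs)

def joint_loss(outputs, fused_pred, edge_pred, mask, edge,
               edge_term, focal_fn, fusion_term,
               w_edge=1.0, w_focal=1.0, w_fusion=1.0):
    """Weighted sum of the three components ablated in Table 2.
    Weight names and default values are illustrative assumptions."""
    l_edge = edge_term(edge_pred, edge)                  # edge-guided term
    l_focal = multiscale_focal(outputs, mask, focal_fn)  # multi-scale focal term
    l_fusion = fusion_term(fused_pred, mask)             # adaptive fusion term
    return w_edge * l_edge + w_focal * l_focal + w_fusion * l_fusion

The pattern of Table 2 is consistent with such a composition: removing the focal term (w/o loss2) collapses performance on the heavily imbalanced crack class, while removing the edge or fusion term causes smaller but consistent drops.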
Table 3. Quantitative results (%) of the evaluation of GCA, AFF, and edge-guided loss in SPMFNet. The bold entries in the table indicate the best results under each evaluation metric.
Variants       P      Spec   Rec    Kappa  OA     mIoU   F1
baseline       81.01  98.05  84.42  81.29  97.44  83.87  82.68
w/o EdgeLoss   80.71  98.39  86.42  82.13  97.52  84.49  83.47
w/o GCA        78.56  98.09  89.56  82.34  97.47  84.63  83.70
w/o AFF        80.07  98.29  88.07  82.56  97.55  84.81  83.88
SPMFNet        81.45  98.40  89.84  84.24  97.78  86.10  85.44
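For intuition about the gating behaviour ablated in Table 3 (row "w/o GCA"), the sketch below shows a generic gate-controlled fusion block in which a learned per-pixel gate blends RGB and thermal features. It is a simplified stand-in written for illustration, not the authors' GCA design.

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Generic gate-controlled fusion: a gate in [0, 1] predicted from both
    modalities decides, per pixel and channel, how to blend the two feature maps."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, feat_rgb, feat_thermal):
        g = self.gate(torch.cat([feat_rgb, feat_thermal], dim=1))
        return g * feat_rgb + (1.0 - g) * feat_thermal  # adaptive blend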
Table 4. Segmentation performance (%) for different methods on the asphalt pavement crack detection dataset. The bold entries in the table indicate the best results under each evaluation metric.
Type   Model                 P      Spec   Rec    Kappa  OA     mIoU   F1
RGB    U-Net [6]             78.24  98.06  89.07  81.91  97.42  84.31  83.30
RGB    Crackformer-II [11]   80.07  98.28  88.38  82.70  97.56  84.92  84.42
T      U-Net [6]             54.61  94.90  78.53  61.09  91.73  70.42  64.44
T      Crackformer-II [11]   67.82  97.83  84.50  70.98  94.98  70.24  62.76
RGB-T  CAINet [44]           81.06  98.51  81.85  81.00  97.30  82.82  81.46
RGB-T  MCNet [45]            78.95  98.24  84.63  81.06  97.25  83.13  82.08
RGB-T  SFAF-MA [46]          78.85  97.62  85.97  83.61  97.31  85.19  82.54
RGB-T  SGFNet [47]           77.62  98.01  88.48  82.96  97.26  84.65  83.52
RGB-T  GMNet [30]            78.28  98.09  82.39  81.44  97.17  82.44  82.98
RGB-T  IRFusionFormer [48]   79.57  98.27  86.23  81.36  97.40  83.91  82.77
RGB-T  SPMFNet               81.45  98.40  89.84  84.24  97.78  86.10  85.44
Table 5. Segmentation performance (%) for different methods on the Crack900 dataset. The bold entries in the table indicate the best results under each evaluation metric.
Type   Model                 P      Spec   Rec    Kappa  OA     mIoU   F1
RGB    U-Net [6]             56.60  99.77  57.62  56.88  99.55  69.76  57.10
RGB    Crackformer-II [11]   42.77  99.68  45.92  43.99  99.41  63.93  44.29
T      U-Net [6]             63.54  99.81  64.96  64.06  99.63  73.47  64.24
T      Crackformer-II [11]   59.16  99.77  64.37  61.45  99.59  72.07  61.65
RGB-T  CAINet [44]           58.98  99.81  52.31  55.23  99.57  68.96  55.45
RGB-T  MCNet [45]            64.13  99.83  59.41  61.49  99.62  72.11  61.68
RGB-T  SFAF-MA [46]          53.31  99.72  62.01  57.09  99.52  69.85  57.33
RGB-T  SGFNet [47]           47.06  99.64  62.64  53.47  99.45  68.09  53.74
RGB-T  GMNet [30]            59.02  99.78  59.92  59.26  99.58  70.95  59.47
RGB-T  IRFusionFormer [48]   42.11  99.56  61.94  49.83  99.37  66.41  50.14
RGB-T  SPMFNet               66.63  99.83  64.89  65.58  99.65  74.31  65.75
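All metrics reported in Tables 2–5 (precision P, specificity Spec, recall Rec, Cohen's kappa, overall accuracy OA, mIoU, and F1) can be derived from a binary confusion matrix. The sketch below is a generic reference computation, not the authors' evaluation code; taking mIoU as the mean of the crack and background IoUs is a common convention and is assumed here.

import numpy as np

def crack_metrics(pred, gt):
    """Binary segmentation metrics from a confusion matrix.
    pred, gt: arrays of 0/1 labels with the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    n = tp + tn + fp + fn

    precision = tp / (tp + fp + 1e-12)
    recall = tp / (tp + fn + 1e-12)              # sensitivity
    specificity = tn / (tn + fp + 1e-12)
    oa = (tp + tn) / n                           # overall accuracy
    f1 = 2 * precision * recall / (precision + recall + 1e-12)
    iou_crack = tp / (tp + fp + fn + 1e-12)
    iou_bg = tn / (tn + fp + fn + 1e-12)
    miou = (iou_crack + iou_bg) / 2
    # Cohen's kappa: observed agreement corrected for chance agreement
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (oa - pe) / (1 - pe + 1e-12)
    return dict(P=precision, Spec=specificity, Rec=recall,
                Kappa=kappa, OA=oa, mIoU=miou, F1=f1)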