1. Introduction
Cracks are among the most common and critical forms of damage in building and civil infrastructure systems [1]. They frequently occur during the service life of structures as a consequence of material aging, environmental effects, construction defects, or excessive loading conditions. As visible indicators of internal deterioration, cracks may significantly compromise structural safety, durability, and serviceability if left untreated [2,3,4]. Consequently, accurate and timely crack detection plays a vital role in structural health monitoring, condition assessment, and maintenance decision-making [5].
Traditional crack detection methods rely on manual visual inspection, ultrasonic testing, ground-penetrating radar, or infrared thermography [6,7,8]. While manual inspection remains widely adopted due to its simplicity, it is labor-intensive, time-consuming, and highly dependent on inspector expertise, leading to inconsistent results [9,10]. Contact-based techniques such as crack gauges provide precise localized measurements but require prior knowledge of crack locations [11]. Non-contact NDT methods offer improved detection capability but involve specialized equipment and high operational costs, restricting their deployment for routine large-scale inspections [12]. With the advancement of computer vision, vision-based approaches using handcrafted features such as edge detection, morphological operations, and texture analysis have emerged as cost-effective alternatives [13,14,15,16,17,18]. However, their robustness is limited under varying illumination, surface texture, and background interference commonly encountered in real-world environments. Machine learning methods integrating HOG, LBP, and Gabor features with classifiers such as SVM and random forests improve robustness to some extent [19,20,21,22,23,24,25], yet still rely on manually designed representations that may not generalize well across diverse crack patterns and imaging conditions.
Deep learning, particularly convolutional neural networks, has significantly advanced crack detection by automatically learning hierarchical feature representations from raw images through end-to-end training [26]. Existing CNN-based methods can be categorized into three frameworks. Classification-based approaches divide images into patches and assign binary labels [27], but cannot provide pixel-accurate delineation essential for quantitative crack analysis such as width measurement. Object detection frameworks predict bounding boxes for efficient localization [28,29], yet fail to capture the fine-grained, curvilinear structure of cracks that span large areas with irregular shapes. Segmentation-based methods generate dense pixel-wise predictions and have been widely adopted for crack analysis [30,31]. Representative works include Cao et al.’s VGG16-based encoder–decoder network for autonomous crack detection [32], Wu et al.’s enhanced DeepLabV3 framework with Dice loss for fine crack segmentation [33], and Choi and Cha’s SDDNet employing densely connected separable convolutions for real-time segmentation [34]. Recent studies have explored generative AI-based approaches such as diffusion models for crack detection and data augmentation [35]. Although the above segmentation-based methods offer pixel-wise predictions, they still rely on fixed-grid convolutions [36] and region-based losses that are not specifically designed for the curvilinear geometry and topological connectivity requirements of crack structures. In terms of feature extraction, standard convolutions sample features at fixed spatial intervals, which cannot adapt to the curvilinear and tortuous geometry of cracks, leading to broken skeletons at turning points and incomplete recovery of thin branches. Regarding training objectives, cross-entropy and Dice losses penalize pixel-wise or area-wise mismatch but do not explicitly enforce topological constraints, allowing fragmented predictions to achieve acceptable overlap scores while exhibiting severe structural discontinuities at crack junctions. From a representation perspective, existing methods operate exclusively in the spatial domain, treating all frequency components equally, which limits their ability to enhance crack-related high-frequency details while suppressing low-frequency background interference.
These challenges arise from the unique characteristics of crack structures. Cracks typically exhibit distinctive geometric characteristics that pose significant challenges for standard convolution [37]. Their widths range from sub-pixel hairline fractures to multi-pixel structural cracks, and their trajectories are often tortuous with curvature radii varying from sharp corners to gentle bends. Branching patterns frequently form Y-shaped and T-shaped junctions, while discontinuous segments may be interrupted by occlusions or surface wear. Practical imaging conditions further complicate detection. Uneven illumination casts shadows that mimic crack appearance or obscures genuine cracks in dark regions. Surface stains, weathering marks, and repair patches create false edges difficult to distinguish from real cracks. Textured backgrounds including exposed aggregate, mortar joints, and formwork imprints generate high-frequency patterns easily confused with fine cracks [38]. The key problem that remains unsolved is how to simultaneously achieve geometric adaptivity for curvilinear crack structures and topological consistency for connected crack networks. This study aims to bridge this gap by proposing a unified framework that integrates geometry-aware convolution with topology-preserving regularization.
The main contributions of this paper are as follows. First, SDCrackSeg is proposed as a crack-oriented segmentation network that integrates geometry-aware, frequency-aware, and topology-aware components. Second, Dynamic Snake Convolution (DSConv) is designed to adaptively deform convolution kernels along curvilinear crack trajectories. Third, Frequency Spatial Convolution (FSConv) is developed to fuse frequency-domain enhancement with spatial geometric adaptivity. Fourth, a topology-aware loss based on persistent homology is incorporated to regularize structural connectivity and reduce fragmentation. The overall architecture of SDCrackSeg is illustrated in Figure 1, which shows the integration of the proposed components and their relationships within the encoder–decoder framework.
2. Methodology
2.1. Overall Network Architecture
The proposed SDCrackSeg is tailored for crack segmentation with thin, elongated, and topology-sensitive structures. As depicted in Figure 1, the model employs an encoder–decoder framework while incorporating specialized geometry and frequency modules to enhance local detail extraction and global structural consistency.
The encoder progressively learns multi-scale representations through convolution and downsampling. To alleviate the limitations of fixed-grid convolutions, each stage employs a hybrid feature extraction scheme that integrates spatial deformation modeling and frequency-domain enhancement, improving robustness to complex crack morphologies and background interference.
The decoder mirrors the encoder and reconstructs high-resolution predictions through upsampling and feature fusion. Skip connections are used to recover fine details, and hybrid modules further refine boundaries and improve the recovery of thin or low-contrast branches.
In addition, a topology-aware supervision strategy is incorporated to encourage continuous and connected crack masks and to reduce fragmentation. Details of DSConv, FSConv, multi-scale fusion, and the topology-aware loss are presented in the following subsections.
2.2. Dynamic Snake Convolution (DSConv)
Crack patterns on pavement surfaces typically exhibit slender, curved, discontinuous, and highly irregular geometries. Conventional convolutions use fixed grid sampling and cannot align their receptive fields with such anisotropic structures. Even deformable convolutions only apply local pointwise offsets, lacking a mechanism to maintain global continuity along elongated structures.
To overcome these limitations, inspired by Dynamic Snake Convolution (DSConv) [39], a shape-adaptive convolutional operator is introduced, whose sampling locations form a smooth, snake-like trajectory that follows the geometry of line-shaped structures. DSConv was originally introduced for road segmentation in remote sensing imagery. Given the structural resemblance between roads and cracks, both exhibiting elongated, curvilinear, and topology-sensitive patterns, DSConv is well suited for adapting convolutional sampling to better capture crack morphology. DSConv consists of three tightly coupled components:
1. Offset branch: learns displacement values for K sampling points.
2. DSC module: converts offsets into a continuous deformable sampling curve.
3. Directional convolution kernel: a horizontal or vertical kernel that performs convolution along the snake trajectory.
This design enables DSConv to dynamically align the receptive field with crack geometry.
Given an input feature map $X$, DSConv first predicts spatial offsets through a convolutional layer and constrains them using batch normalization and hyperbolic tangent activation:
$$\Delta = \tanh\big(\mathrm{BN}(\mathrm{Conv}(X))\big),$$
where $\Delta$ denotes the predicted offsets and the offset range is constrained to $(-1, 1)$.
Unlike conventional deformable convolutions that apply offsets independently, DSConv employs an iterative offset propagation mechanism that ensures continuity in the sampling path, as illustrated in Figure 2a. The center position remains fixed while offsets propagate iteratively from the center to both ends, forming a continuous snake-like trajectory. The final sampling coordinates are computed as:
where
is a hyperparameter controlling deformation magnitude (typically set to 1).
Features are extracted via bilinear interpolation at the deformed coordinates, as shown in Figure 2b, and a directional convolution (in horizontal or vertical orientation) is applied along the snake trajectory. The complete mathematical derivations, including iterative offset propagation formulas, coordinate grid generation, and bilinear interpolation details, are provided in Appendix A.
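The iterative offset propagation described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the paper: the function name `snake_coordinates`, the cumulative-sum form of the propagation, and the `extend_scope` parameter name (standing in for the deformation-magnitude hyperparameter) are assumptions made for the example, and only the horizontal orientation is shown.

```python
from typing import List, Tuple


def snake_coordinates(
    center: Tuple[float, float],
    offsets: List[float],
    extend_scope: float = 1.0,  # assumed name for the deformation-magnitude hyperparameter
) -> List[Tuple[float, float]]:
    """Return K (x, y) sampling points for a horizontal snake kernel.

    `offsets` holds K values in (-1, 1), e.g. after tanh. The centre
    offset is ignored so that the centre position stays fixed.
    """
    k = len(offsets)
    if k % 2 == 0:
        raise ValueError("K must be odd so that a centre point exists")
    mid = k // 2
    cx, cy = center
    dy = [0.0] * k
    # Propagate from the centre towards both ends: each point inherits the
    # accumulated displacement of its inner neighbour plus its own offset.
    for step in range(1, mid + 1):
        dy[mid + step] = dy[mid + step - 1] + offsets[mid + step]
        dy[mid - step] = dy[mid - step + 1] + offsets[mid - step]
    # x advances on the regular grid; y follows the accumulated displacement.
    return [(cx + (i - mid), cy + extend_scope * dy[i]) for i in range(k)]


points = snake_coordinates((10.0, 20.0), [0.5, -0.5, 0.0, 0.5, 0.5])
```

Because every offset lies in (-1, 1) and displacements accumulate outward, neighbouring sampling points never jump by more than one pixel in the free direction, which is what keeps the sampling path continuous. In a full layer the fractional coordinates would then be sampled with bilinear interpolation.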
2.3. Adaptive Frequency Convolution (AFConv)
Cracks appear as thin, high-frequency structures, while backgrounds typically contain low-frequency components from illumination and texture variations. Standard convolution treats all frequencies equally, limiting its ability to enhance crack details while suppressing background noise. To complement DSConv’s geometry-adaptive modeling, we introduce the AFConv module [40] to enhance crack-sensitive high-frequency cues, as illustrated in Figure 3. Given input features $X$, AFConv operates as follows:
(1) Adaptive routing: The input is projected to an expanded channel space
, and input-dependent routing coefficients are computed via global average pooling and MLP:
where
F is the number of learnable frequency filters
.
(2) Frequency modulation: An adaptive spectral weight
is applied in the frequency domain:
where
denotes FFT and
models inter-frequency dependencies.
(3) Output: The inverse FFT transforms features back to the spatial domain: .
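The transform, reweight, and inverse-transform pattern in steps (2) and (3) can be illustrated on a one-dimensional intensity profile. The sketch below is a deliberately simplified stand-in, not AFConv itself: the spectral weights are fixed by hand rather than produced by adaptive routing, a naive DFT replaces the FFT, and the helper names `dft`, `idft`, and `modulate` are hypothetical.

```python
import cmath
from typing import List


def dft(x: List[complex]) -> List[complex]:
    """Naive discrete Fourier transform (O(n^2)), enough for a small demo."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum: List[complex]) -> List[complex]:
    """Inverse of `dft`."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def modulate(signal: List[float], weights: List[float]) -> List[float]:
    """Scale each frequency bin by a real weight and return to the spatial domain."""
    spectrum = dft([complex(v) for v in signal])
    filtered = [w * s for w, s in zip(weights, spectrum)]
    return [v.real for v in idft(filtered)]


# A flat bright background with one dark, one-pixel-wide "crack" at index 3.
profile = [5.0, 5.0, 5.0, 1.0, 5.0, 5.0, 5.0, 5.0]
# Hand-picked weights (assumption): drop the zero-frequency bin, keep the rest.
weights = [0.0] + [1.0] * 7
enhanced = modulate(profile, weights)
```

Removing only the zero-frequency bin subtracts the mean intensity, so the uniform background collapses towards zero while the narrow dip keeps its full contrast. A learned spectral weight generalizes this idea by choosing per-frequency gains from the input.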
As illustrated in Figure 1b, the proposed FSConv module combines AFConv and DSConv branches through residual fusion. The DSConv branch applies two orthogonal snake convolutions (horizontal and vertical) to capture crack patterns in different orientations. The branch outputs are concatenated and projected through a $1 \times 1$ convolution followed by batch normalization:
$$Y = X + \mathrm{BN}\Big(\mathrm{Conv}_{1\times1}\big(\big[\mathrm{AFConv}(X);\ \mathrm{DSConv}_x(X);\ \mathrm{DSConv}_y(X)\big]\big)\Big),$$
where $[\,\cdot\,;\,\cdot\,]$ denotes concatenation and $\mathrm{Conv}_{1\times1}$ is a $1 \times 1$ convolution. The residual connection ensures stable gradient flow, while batch normalization balances the feature magnitudes from the two heterogeneous branches, preventing either from dominating the fused representation. This design enhances high-frequency crack details while preserving curvilinear geometry.
2.4. Topology-Aware Loss
Pixel-wise and region-based losses, such as cross-entropy and Dice loss, penalize per-pixel or area mismatch but not structural discontinuities, and therefore often produce fragmented masks. To preserve crack connectivity, a topology-aware loss based on persistent homology [41] is introduced, as illustrated in Figure 4.
Given predicted probability map $P$ and ground-truth mask $G$, we first binarize the prediction: $\hat{P} = \mathbb{1}[P > 0.5]$. The topology loss compares persistence diagrams, which capture connected components and loops, using the Hausdorff distance:
$$\mathcal{L}_{topo} = \lambda_0\, d_H\big(D_0(\hat{P}), D_0(G)\big) + \lambda_1\, d_H\big(D_1(\hat{P}), D_1(G)\big),$$
where $D_0$ and $D_1$ denote 0-dim (components) and 1-dim (loops) persistence diagrams, and $\lambda_0$, $\lambda_1$ balance the two terms. In practice, we set $\lambda_0 = \lambda_1$ to weight components and loops equally, and use $\lambda_{topo}$ to control the overall strength of topology regularization. Minimizing $\mathcal{L}_{topo}$ encourages continuous crack masks with reduced fragmentation.
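To make the fragmentation problem concrete, the sketch below counts connected components in a binarized prediction and in the ground truth. It is only a simplified proxy for the 0-dimensional part of the loss: it works at a single threshold, does not build persistence diagrams, does not compute a Hausdorff distance, and is not differentiable. The function names and the choice of 8-connectivity are assumptions made for the example.

```python
from collections import deque
from typing import List


def binarize(prob: List[List[float]], threshold: float = 0.5) -> List[List[int]]:
    """Threshold a probability map into a 0/1 mask."""
    return [[1 if p > threshold else 0 for p in row] for row in prob]


def count_components(mask: List[List[int]]) -> int:
    """Count 8-connected foreground components with breadth-first search."""
    h = len(mask)
    w = len(mask[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    count = 0
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


def component_mismatch(prob: List[List[float]], gt: List[List[int]]) -> int:
    """Absolute difference in component counts between prediction and ground truth."""
    return abs(count_components(binarize(prob)) - count_components(gt))


gt = [[0, 0, 0, 0, 0], [1, 1, 1, 1, 1], [0, 0, 0, 0, 0]]
prob = [[0.1] * 5, [0.9, 0.8, 0.2, 0.7, 0.9], [0.1] * 5]
mismatch = component_mismatch(prob, gt)
```

In this toy case four of the five crack pixels are recovered, so overlap-based scores remain high, yet the single missed pixel splits one crack into two components. This is the kind of error that a topology term is meant to penalize.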
2.5. Overall Training Objective
The total loss combines pixel-wise supervision with topology regularization:
where $\mathcal{L}_{CE}$ is cross-entropy, $\mathcal{L}_{Dice}$ handles class imbalance, $\mathcal{L}_{aux}$ supervises the auxiliary head, and $\mathcal{L}_{topo}$ enforces topological consistency.
3. Experiments and Evaluation Metrics
3.1. Datasets
CHCrack5K is a large-scale benchmark dataset designed for building wall crack detection and segmentation [42]. It was constructed by integrating 11 publicly available crack datasets that mainly capture surface cracks on common building materials, such as concrete and mortar walls, as shown in Figure 5. To reduce the domain shift caused by heterogeneous data sources and to enable a fair comparison across methods, all images and corresponding annotations were standardized to a unified resolution of
pixels through a careful preprocessing pipeline.
Specifically, images with non-square aspect ratios or missing borders were processed using padding so that the original spatial structure and crack geometry were preserved. For datasets that provided higher-resolution patches, cropping and resizing were applied to match the target size while retaining representative crack patterns. For sources with diverse image scales, padding and resizing were combined to achieve consistent spatial dimensions and to minimize distortion of thin crack morphology. Through these procedures, CHCrack5K forms a unified yet diverse and challenging benchmark, supporting robust evaluation of building crack segmentation models under variations in illumination, background texture, and crack width.
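The padding step can be sketched as a small geometry calculation. The snippet below is an illustration under stated assumptions, not the dataset's actual pipeline: it assumes padding is split symmetrically between opposite borders, takes the target side length as a parameter, and uses hypothetical helper names.

```python
from typing import Tuple


def square_padding(width: int, height: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) padding that makes the image square."""
    side = max(width, height)
    pad_w, pad_h = side - width, side - height
    # Split the padding as evenly as possible; any odd pixel goes right/bottom.
    left, top = pad_w // 2, pad_h // 2
    return left, top, pad_w - left, pad_h - top


def resize_scale(width: int, height: int, target: int) -> float:
    """Uniform scale factor mapping the padded square to target x target."""
    return target / max(width, height)


padding = square_padding(300, 200)
scale = resize_scale(300, 200, 600)
```

Padding to a square before resizing keeps the scale factor identical along both axes, so thin cracks are not stretched anisotropically. The same padding and scale would have to be applied to the annotation mask, typically with nearest-neighbour resampling so that labels stay binary.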
3.2. Experimental Setup
All experiments were implemented in PyTorch (version 2.1.0) using a unified training pipeline for semantic segmentation. Model training was performed on a single NVIDIA vGPU equipped with 48 GB memory.
The proposed method was trained for 200 epochs with a batch size of 8. Stochastic gradient descent was used as the optimizer, with an initial learning rate of 0.01, a momentum of 0.9, and a weight decay of
. A polynomial learning-rate decay strategy was adopted throughout training, defined as
where $\mathrm{lr}_0$ denotes the initial learning rate, $e$ is the current epoch index, and $E$ is the total number of training epochs.
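A polynomial decay schedule of this kind can be written as a one-line function. The sketch below assumes the common form in which the initial rate is scaled by (1 - e/E) raised to a fixed power. The exponent is exposed as a `power` parameter, and its default of 0.9 is a conventional choice used here for illustration only, not a value taken from this paper.

```python
def poly_lr(initial_lr: float, epoch: int, total_epochs: int, power: float = 0.9) -> float:
    """Polynomially decayed learning rate for a given epoch.

    `power` is an assumed hyperparameter; 0.9 is a common default, not a
    value stated in the text.
    """
    if total_epochs <= 0 or not 0 <= epoch <= total_epochs:
        raise ValueError("epoch must lie in [0, total_epochs]")
    return initial_lr * (1.0 - epoch / total_epochs) ** power


# Schedule for the stated setup: initial rate 0.01 over 200 epochs.
schedule = [poly_lr(0.01, e, 200) for e in range(201)]
```

The rate starts at the initial value, decreases monotonically, and reaches zero at the final epoch. Exponents below 1 keep the rate higher for longer than a linear ramp would.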
For data preprocessing and augmentation, each input image was first resized using a base size of 520 and then randomly cropped to a resolution of . Random horizontal flipping was further applied with a probability of 0.5 to improve robustness to appearance variations. The dataloader used four worker processes. During validation, the batch size was set to 1 to ensure stable evaluation and to avoid memory-related constraints.
The training objective was composed of a standard pixel-wise cross-entropy loss and additional regularization terms. Specifically, a Dice loss term was included by default to better handle class imbalance between crack pixels and background pixels. In addition, an optional topology continuity constraint loss was incorporated to regularize structural connectivity. We set equal weights $\lambda_0 = \lambda_1$ for the component and loop terms in $\mathcal{L}_{topo}$ and used an overall weight $\lambda_{topo}$ to control its contribution, and these values were finalized after multiple rounds of tuning on the validation set. When this topology term was enabled, prediction maps were binarized using a threshold of 0.5, and the corresponding loss weight was set to 0.1.
3.3. Evaluation Metrics
This study evaluates crack segmentation performance using five commonly adopted metrics: Precision, Recall, F1-score, Dice, and mIoU. Let $TP$, $FP$, and $FN$ denote the numbers of true-positive, false-positive, and false-negative pixels for the crack (foreground) class, respectively. Specifically, $TP$ counts crack pixels correctly predicted as cracks, $FP$ counts background pixels incorrectly predicted as cracks, and $FN$ counts crack pixels missed by the model.
3.3.1. Precision and Recall
Precision measures the proportion of correctly predicted crack pixels among all pixels predicted as cracks, while Recall measures the proportion of correctly predicted crack pixels among all crack pixels in the ground truth. They are defined as
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}.$$
3.3.2. F1-Score
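F1-score provides a balanced summary of Precision and Recall, and is defined as
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$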
3.3.3. Dice Coefficient
Dice evaluates the similarity between the predicted crack region $\hat{P}$ and the ground truth $G$. A higher Dice value (ranging from 0 to 1) indicates better segmentation quality:
$$\mathrm{Dice} = \frac{2\,|\hat{P} \cap G|}{|\hat{P}| + |G|}.$$
3.3.4. Mean Intersection over Union (mIoU)
The Intersection over Union (IoU) measures the overlap between prediction and ground truth relative to their union. For binary crack segmentation, mIoU is equivalent to the IoU of the crack class:
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}.$$
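The count-based metrics above can be computed directly from the three pixel counts. The sketch below is a minimal illustration with a hypothetical function name. It covers Precision, Recall, F1-score, and crack-class IoU, and returns 0.0 for any ratio whose denominator is zero, which is a convention chosen for the example.

```python
from typing import Dict


def crack_metrics(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Precision, Recall, F1 and crack-class IoU from pixel counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


# Toy counts: 90 crack pixels found, 10 false alarms, 30 crack pixels missed.
scores = crack_metrics(tp=90, fp=10, fn=30)
```

For these toy counts Precision is 0.9 and Recall is 0.75, while IoU (90/130) is lower than F1 (180/220) because the union in the denominator counts each error pixel fully.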
4. Results and Discussion
4.1. Quantitative Comparison and Discussion
Table 1 reports the quantitative performance of representative segmentation models on the crack segmentation task, including U-Net [31], FCN variants [30], DeepLabv3 variants [43], SegNet [44], and the proposed SDCrackSeg. Overall, SDCrackSeg achieves the best comprehensive performance across the key metrics, with a Precision of 0.900, an mIoU of 0.816, an F1-score of 0.888, and a Dice coefficient of 0.675. These results suggest that SDCrackSeg not only improves the agreement between predictions and ground-truth crack masks as reflected by overlap-based metrics, but also enhances the reliability of crack identification by effectively suppressing false positives in cluttered building-surface backgrounds, as reflected by the highest Precision.
A direct comparison with the strongest overall baseline in this setting highlights the practical improvements brought by SDCrackSeg. Relative to SegNet, SDCrackSeg increases Precision from 0.856 to 0.900, which indicates a marked reduction in false alarms caused by stains, rough textures, and crack-like background patterns. At the same time, mIoU increases from 0.777 to 0.816 and Dice increases from 0.597 to 0.675. The gains in overlap-based metrics are particularly relevant for crack segmentation because cracks often occupy a small fraction of pixels and exhibit thin, elongated shapes. In such cases, even small local discontinuities, missing branches, or boundary offsets can substantially degrade topology and connectivity, while region-level averages may appear only moderately affected. The consistent improvement across mIoU and Dice therefore suggests that SDCrackSeg recovers crack regions more completely and with higher boundary fidelity, which is aligned with the objective of preserving thin branches and maintaining continuity.
The Precision and Recall patterns across methods provide additional insights into different model behaviors. DeepLabv3 variants achieve the highest Recall values, reaching 0.896 and 0.905, implying that multi-scale context modeling helps reduce missed detections in faint or partially occluded cracks. However, this improved sensitivity is accompanied by lower Precision, with values of 0.805 and 0.795, indicating that context aggregation and strong semantic activation can also amplify crack-like background structures. This trade-off is common in building surface scenes where shadows, mortar lines, texture edges, and stains resemble crack trajectories. In contrast, SDCrackSeg maintains a competitive Recall of 0.878 while achieving substantially higher Precision. This balance indicates that the proposed feature design improves discriminability between true cracks and confusing background patterns, which is crucial in engineering practice because excessive false positives increase manual verification workload and may lead to overestimation of damage severity.
For classical encoder–decoder baselines, the results show stable yet limited performance. U-Net achieves a relatively high Recall of 0.887, but its mIoU and Dice remain lower than those of SDCrackSeg. This suggests that U-Net can detect many crack pixels but may produce masks with imprecise boundaries, local fragmentation, or incomplete branching, which reduces overlap consistency. The FCN variants exhibit comparable Precision and Recall, but their mIoU and Dice are noticeably lower, reflecting the difficulty of accurately recovering thin, curvilinear crack structures using coarse upsampling and standard fixed-grid convolution alone. These observations support the notion that, for topology-sensitive targets, accurate segmentation requires not only semantic recognition but also fine-scale structural modeling to maintain narrow branches and continuous skeletons.
The F1-score trends further confirm the overall superiority of SDCrackSeg. Since F1-score jointly reflects Precision and Recall, the best F1-score of 0.888 indicates that SDCrackSeg achieves a favorable compromise between sensitivity and specificity. In crack inspection tasks, this compromise is especially important because missed crack segments can break connectivity and bias length measurements, whereas false positives can contaminate crack networks and distort derived indicators such as density, branching degree, and orientation statistics. The combination of high Precision and high F1-score therefore implies that SDCrackSeg yields more trustworthy crack masks for downstream structural assessment.
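As a quick consistency check, assuming the standard harmonic-mean definition of the F1-score, the Precision (0.900) and Recall (0.878) reported above for SDCrackSeg can be combined directly:

```latex
F_1 = \frac{2 \times 0.900 \times 0.878}{0.900 + 0.878}
    = \frac{1.5804}{1.778}
    \approx 0.889
```

This agrees with the reported F1-score of 0.888 to within the rounding of the three-decimal inputs.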
From an application-oriented perspective, the improvements achieved by SDCrackSeg can be interpreted as enhanced robustness to three common sources of performance degradation. First, low-contrast and extremely thin cracks are better preserved, which contributes to higher Dice and mIoU by reducing missing branches and local gaps. Second, background interference is suppressed more effectively, improving Precision by avoiding spurious activations along crack-like textures. Third, the overall mask quality is more coherent, which is reflected in the consistent gains across overlap-based and classification-based metrics rather than in a single indicator. This consistency is important because it suggests that the performance benefits are not limited to a specific operating point but generalize across different evaluation criteria.
In summary, the quantitative evaluation demonstrates that SDCrackSeg provides consistent and meaningful improvements over representative baselines. The highest Precision indicates stronger resistance to false positives in cluttered backgrounds, while the best mIoU and Dice indicate more accurate overlap and better preservation of fine-scale crack regions. Together with the strong F1-score, these results suggest that SDCrackSeg is well suited for reliable building crack inspection in challenging real-world conditions, where both detailed structural representation and robust background suppression are required.
4.2. Feature Map Visualization and Qualitative Discussion
To provide an intuitive understanding of how different networks perceive crack structures, we visualize and compare the feature maps produced by SDCrackSeg and several representative baselines, including U-Net, SegNet, FCN with ResNet backbones, and DeepLabv3 variants. The qualitative results are shown in Figure 6. Overall, SDCrackSeg exhibits more crack-aligned, compact, and continuous responses, while the baselines tend to produce broader activations or spurious responses on background textures.
As illustrated in Figure 6, SDCrackSeg produces responses that closely follow the crack centerlines and remain compact near the crack boundaries. For thin cracks and weak contrast segments, the highlighted bands remain visible and continuous, indicating strong sensitivity to fine details. In contrast, U-Net and SegNet often exhibit more diffuse activations around crack neighborhoods. Such diffusion may visually correspond to thicker predicted boundaries and increased ambiguity, especially when cracks are narrow or partially occluded by surface texture.
In addition, SDCrackSeg shows improved stability at crack junctions and branching structures. For Y-shaped and T-shaped intersections, SDCrackSeg maintains coherent responses across multiple directions and preserves the connectivity of branches. Several baselines show blurred or widened responses near junctions, and some responses weaken along minor branches, which can lead to local discontinuities in the final segmentation. Since crack inspection often relies on connected crack paths and complete branch recovery, this qualitative advantage is practically important.
A further observation is the stronger background suppression ability of SDCrackSeg. Under challenging backgrounds such as rough textures, stains, illumination variations, and structural edges, SDCrackSeg demonstrates reduced activation spillover to non-crack regions. The compared FCN and DeepLabv3 variants, which emphasize large receptive fields and contextual aggregation, are more likely to activate on textured patterns or intensity transitions. Although such context modeling benefits large objects, it can be less suitable for cracks because the target is thin and topology-sensitive, and background patterns frequently mimic crack appearance.
These qualitative results align with the design motivation of SDCrackSeg. The frequency enhancement pathway strengthens crack-related high-frequency cues that support boundary delineation and fine branch visibility. The geometry-adaptive sampling pathway better follows tortuous crack trajectories and reduces the mismatch introduced by fixed-grid convolution when modeling curvilinear structures. Together, they contribute to more selective and structurally consistent responses, which helps reduce false alarms and preserve crack connectivity in complex building surface scenes. The visual evidence in Figure 6 therefore provides an intuitive explanation for the superior quantitative performance reported in the benchmark experiments.
4.3. Qualitative Comparison on Challenging Crack Patterns
Figure 7 presents a qualitative comparison of crack segmentation results produced by U-Net, FCN with ResNet50 and ResNet101 backbones, DeepLabv3 with ResNet50 and ResNet101 backbones, SegNet, and the proposed SDCrackSeg. For each example, the first row shows the raw image, followed by the ground truth and the predicted masks. The red and green boxes indicate representative regions in which the compared methods yield clearly different outputs. The selected cases reflect common yet challenging scenarios in building crack inspection, including crack junctions, thin and low-contrast branches, tortuous crack trajectories, and ambiguous background textures.
Across all samples, SDCrackSeg generates predictions that are more consistent with the ground truth, particularly in terms of structural continuity, branch preservation, and boundary reliability. In the junction regions highlighted by the green boxes, several baseline methods exhibit discontinuities around intersections or fail to recover secondary branches, producing fragmented crack structures. These errors are important in practice because junction connectivity influences subsequent morphology analysis, including length estimation and branching characterization. In comparison, SDCrackSeg better preserves connectivity at intersections, maintaining coherent links between the main crack and its branches and producing more structurally plausible crack networks.
In the thin-branch regions highlighted by the red boxes, most baselines show reduced sensitivity to faint or narrow cracks. U-Net and FCN variants frequently yield incomplete responses along thin segments, resulting in small gaps and truncated endpoints, which suggests limitations in representing highly curvilinear patterns using fixed-grid convolutions and standard upsampling. DeepLabv3 variants are often more responsive to crack pixels, but this increased sensitivity can be accompanied by reduced precision. In several cases, they generate locally thicker masks or spurious activations on crack-like textures, leading to over-segmentation in the highlighted areas. SegNet provides comparatively stable predictions in some samples, yet missed detections and discontinuities remain evident when cracks become extremely thin or exhibit weak contrast.
Background interference is another recurring challenge. In the presence of rough surface texture, stains, or low-frequency intensity variations, some baseline methods produce false positives or irregular boundaries, especially when true cracks appear close to edges, shadows, or texture gradients. Compared with these approaches, SDCrackSeg yields cleaner masks with fewer visually implausible artifacts while retaining fine details without excessively widening the predicted crack regions. This behavior indicates improved discrimination between true cracks and crack-like background patterns.
Overall, the qualitative results confirm that SDCrackSeg is more robust on difficult crack patterns. The advantages are most apparent in topology-critical regions, such as junctions and branching structures, and in thin or low-contrast segments, where conventional methods often produce fragmented predictions or incomplete branches. These visual observations are consistent with the quantitative improvements reported in Table 1. Despite these improvements, segmentation errors may still occur when multiple challenging factors coexist, such as when extremely fine cracks traverse textured regions and approach junctions. Future work could incorporate multi-scale topological constraints or uncertainty estimation to further enhance robustness in these complex scenarios.
4.4. Accuracy and Runtime Efficiency Assessment
Figure 8 presents the relationship between segmentation accuracy and inference efficiency by reporting mIoU and FPS for different models. This comparison provides a practical perspective on deployability, where a higher mIoU indicates better pixel-level agreement with the ground truth and a higher FPS indicates stronger real-time capability. Overall, SDCrackSeg occupies a competitive region in the accuracy–efficiency space, indicating a favorable balance between segmentation quality and runtime speed.
In terms of accuracy, SDCrackSeg achieves the highest mIoU among all evaluated methods, reaching approximately 0.816. It surpasses the classical encoder–decoder baseline U-Net, which attains around 0.785, and also outperforms FCN and DeepLabv3 variants, whose mIoU values are approximately within the range of 0.755 to 0.770. The improvement suggests that SDCrackSeg provides more stable overlap with crack masks, especially for thin, curvilinear, and branching patterns that are prone to fragmentation or over-smoothing when using conventional convolutional backbones. Importantly, this accuracy gain is obtained without a clear reduction in inference speed, which is essential for large-scale inspection scenarios.
With respect to efficiency, SDCrackSeg maintains an inference speed close to 200 FPS under the adopted evaluation setting. Although SegNet and FCN_Resnet50 report higher throughput, exceeding 240 FPS, their mIoU values remain below 0.78, indicating that faster decoding or simplified feature processing may reduce the ability to preserve fine crack details. Conversely, DeepLabv3_Resnet101 exhibits the lowest efficiency, close to 160 FPS, while delivering only moderate accuracy. This observation suggests that increasing backbone depth and relying on multi-scale context aggregation does not necessarily lead to improved segmentation of topology-sensitive crack structures, but it does increase computational cost.
A closer inspection reveals clear differences across model families. FCN and SegNet achieve relatively high throughput, yet their accuracy is limited, which can be attributed to coarse upsampling and insufficient recovery of high-frequency boundaries. DeepLabv3 variants provide stronger contextual modeling but remain less efficient, and their accuracy gains are marginal, implying that background interference and false positives still affect final overlap quality. U-Net represents a stronger overall baseline with a more balanced profile, but it remains inferior to SDCrackSeg at similar runtime, reflecting the difficulty of capturing complex crack geometry with fixed-grid convolutions.
In summary, Figure 8 indicates that SDCrackSeg offers a superior balance between accuracy and runtime efficiency. It delivers the highest mIoU while retaining high inference throughput, which makes it suitable for practical applications that require both reliable segmentation and rapid processing, such as facade inspection using unmanned aerial vehicles, mobile imaging platforms, and long-term structural health monitoring.
5. Conclusions
This paper presented SDCrackSeg, a segmentation network designed for building cracks that are thin, elongated, and highly sensitive to structural continuity. The method targets two common failure modes in crack segmentation, namely blurred or over-thickened boundaries caused by background textures and broken connectivity caused by weak contrast, occlusions, and complex crack junctions.
A key contribution is the Frequency Spatial Convolution module, which integrates complementary cues from frequency enhancement and geometry adaptivity. The Adaptive Frequency Convolution branch emphasizes crack-sensitive high-frequency details that are essential for delineating narrow branches and weak-contrast segments. The Dynamic Snake Convolution branch improves the ability of convolutional sampling to follow curved crack paths, reducing the mismatch introduced by fixed-grid convolution when modeling tortuous structures. By combining these branches through feature fusion, SDCrackSeg produces representations that are more crack-aligned and less affected by irrelevant background patterns.
In addition to architectural design, this work incorporated a topology-aware loss based on persistent homology to explicitly regularize the structural consistency of predictions. Unlike region-based objectives that primarily measure overlap, the topology constraint penalizes fragmentation and encourages connectivity preservation, which is crucial for downstream crack morphology analysis and for obtaining reliable crack networks at junctions and turning points.
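The distinction between overlap and topology can be illustrated with a deliberately simplified sketch. Persistent homology tracks how topological features appear and disappear across thresholds; the code below only counts connected components (the zeroth Betti number) of a binary mask at a single threshold, using 8-connectivity. It is a non-differentiable illustration of what fragmentation means, not the topology-aware loss used in SDCrackSeg, and the function name `count_components` is an assumption made for this example.

```python
def count_components(mask):
    """Count 8-connected foreground components in a binary 2D grid."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                count += 1                      # new component found
                seen[r][c] = True
                stack = [(r, c)]
                while stack:                    # flood fill over 8 neighbours
                    y, x = stack.pop()
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < rows and 0 <= nx < cols
                                    and mask[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                stack.append((ny, nx))
    return count


# A continuous crack versus a prediction with a one-pixel gap.
true_crack = [[1, 1, 1, 1, 1]]
pred_crack = [[1, 1, 0, 1, 1]]
component_gap = abs(count_components(pred_crack) - count_components(true_crack))
```

Here the prediction overlaps four of the five crack pixels, so a region-based score remains high, yet the crack has split into two components and the component gap is 1. A topology constraint is intended to penalize exactly this kind of discrepancy.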
Extensive experiments on CHCrack5K validated the effectiveness of the proposed approach. SDCrackSeg achieved a precision of 0.900, an mIoU of 0.816, an F1-score of 0.888, and a Dice coefficient of 0.675, and it maintained an inference speed close to 200 FPS, indicating a favorable balance between accuracy and efficiency for practical inspection workloads. Together with the qualitative evidence from feature response visualizations, the results suggest that the proposed frequency enhancement, geometry-adaptive sampling, and topology regularization jointly improve boundary reliability, suppress false activations on crack-like textures, and reduce discontinuities in challenging regions.
Despite these promising results, several limitations should be acknowledged. First, although CHCrack5K integrates 11 diverse datasets, validation under extreme imaging conditions such as strong shadows, overexposure, or night-time acquisition remains limited and requires further investigation. Second, highly discontinuous cracks with large gaps may still produce fragmented predictions, as the topology-aware loss encourages connectivity but cannot hallucinate missing segments. Third, dense parallel crack patterns may occasionally merge due to the connectivity-favoring regularization. Fourth, the persistent homology computation increases training time by approximately 15%, although inference speed remains unaffected.
Future work will pursue the following specific directions. First, cross-domain generalization will be evaluated by testing SDCrackSeg on pavement crack datasets such as CrackForest and DeepCrack, as well as in industrial inspection scenarios, to determine whether the frequency–geometry fusion approach transfers effectively across different crack types and imaging conditions. Second, a lightweight variant will be developed by replacing standard convolutions in DSConv with depthwise separable convolutions, targeting inference speeds exceeding 300 FPS on edge devices while maintaining an mIoU above 0.80. Third, multi-threshold topology consistency will be investigated by computing the topology-aware loss at multiple binarization thresholds during training, with the hypothesis that this will reduce sensitivity to the choice of post-processing threshold. Fourth, joint crack attribute estimation will be explored by extending the decoder to predict crack width and orientation maps alongside segmentation masks in a multi-task learning framework.
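The motivation for the lightweight variant can be seen from a simple parameter count. A standard k × k convolution with C_in input and C_out output channels has k·k·C_in·C_out weights, whereas a depthwise separable convolution has k·k·C_in (depthwise) plus C_in·C_out (pointwise) weights. The sketch below evaluates both expressions; the kernel size and channel widths are hypothetical values chosen for illustration, biases are ignored, and the actual layer configuration of DSConv is not assumed.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Weights in a depthwise k x k plus pointwise 1 x 1 convolution (bias ignored)."""
    return k * k * c_in + c_in * c_out


# Hypothetical layer: 3 x 3 kernel, 64 input and 64 output channels.
std = standard_conv_params(3, 64, 64)    # 36864
sep = separable_conv_params(3, 64, 64)   # 4672
reduction = std / sep                    # roughly 7.9x fewer weights
```

A smaller parameter count does not translate directly into a proportional speed-up, since measured throughput also depends on memory access patterns and hardware support, so the stated FPS target would still need to be verified empirically.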