Article

DCDRNet: Detail–Context Decoupled Representation Learning Network for Efficient Crack Segmentation

State Key Laboratory of Ocean Engineering, School of Ocean and Civil Engineering, Collaborative Innovation Center for Advanced Ship and Deep-Sea Exploration, Shanghai Jiao Tong University, Shanghai 200240, China
* Author to whom correspondence should be addressed.
Algorithms 2026, 19(3), 219; https://doi.org/10.3390/a19030219
Submission received: 6 January 2026 / Revised: 25 February 2026 / Accepted: 27 February 2026 / Published: 14 March 2026

Abstract

Accurate crack segmentation is critical for automated infrastructure inspection but remains challenging due to the inherent conflict between preserving fine-grained geometric details and modeling global semantic context. Existing deep learning approaches typically encode both requirements within a single hierarchical representation, leading to irreversible boundary degradation or fragmented predictions under complex backgrounds. To address this limitation, we propose DCDRNet, a detail–context decoupled network that explicitly separates geometry-sensitive and context-aware representations into parallel encoding streams. The Detail Encoder maintains high-resolution features to preserve thin crack boundaries, while the Context Encoder performs adaptive global reasoning to reinforce structural continuity. Their controlled interaction enables effective integration of local precision and long-range context without representational interference. Extensive experiments on three public crack segmentation benchmarks demonstrate that DCDRNet consistently outperforms state-of-the-art methods in accuracy and robustness, achieving superior performance especially on challenging datasets with thin and fragmented cracks. Moreover, DCDRNet delivers a favorable accuracy–efficiency trade-off, combining compact model size with near real-time inference speed, making it well-suited for practical deployment in real-world inspection scenarios.

1. Introduction

Surface cracks are among the most common and critical forms of structural damage in concrete infrastructure and highway pavements [1]. Even minor cracks, if left undetected, can progressively propagate under environmental exposure and mechanical loading, leading to reduced service life and potential safety hazards. As large-scale infrastructure systems continue to age, timely and reliable crack inspection has become an essential component of preventive maintenance and structural health monitoring. In this context, automatic crack segmentation from visual imagery plays a pivotal role in enabling efficient, objective, and scalable inspection workflows [2].
Early studies on crack detection predominantly relied on handcrafted image processing techniques, such as filtering, edge detection, and threshold-based segmentation [3]. More recently, deep learning has significantly advanced crack segmentation by enabling data-driven feature learning [4]. Convolutional neural networks have shown strong capability in capturing local structural patterns and fine-grained details, whereas Transformer-based models offer powerful global context modeling through long-range dependency learning [5]. Despite these advances, crack segmentation demands both precise boundary localization and coherent global connectivity under strict efficiency constraints.
To leverage the complementary strengths of convolutional and attention-based models, hybrid CNN–Transformer architectures have been introduced, typically adopting a serial processing pipeline [6,7]. In such designs, convolutional encoders extract local features that are progressively downsampled and subsequently refined by Transformer modules for global reasoning [8]. However, this sequential integration introduces two fundamental limitations. First, fine-grained spatial details, which are critical for thin and low-contrast crack boundaries, are inevitably compressed or lost before global modeling is performed. Second, the global attention mechanisms employed in these architectures incur substantial computational overhead, making them unsuitable for real-time or resource-constrained deployment. More importantly, existing hybrid approaches implicitly assume that local detail extraction and global context modeling should be tightly coupled within a single feature hierarchy, an assumption that conflicts with the inherently asymmetric roles of detail-sensitive and context-aware representations in crack segmentation.
To address these challenges, we propose DCDRNet, a Detail–Context Decoupled Representation Learning Network for efficient crack segmentation. DCDRNet explicitly decouples detail-sensitive and context-aware representations into two parallel processing streams, instead of serially mixing local and global information. A high-resolution Detail Branch is dedicated to preserving precise crack boundaries and thin structures, while a Context Branch focuses on modeling long-range semantic dependencies to maintain crack continuity across complex backgrounds. By learning these complementary representations in parallel and fusing them in a controlled manner, DCDRNet avoids irreversible detail loss and reduces unnecessary computational redundancy. Furthermore, DCDRNet incorporates adaptive receptive field modeling and structural re-parameterization, enabling rich feature learning during training while maintaining a compact and efficient architecture during inference. This design allows DCDRNet to achieve a favorable balance between segmentation accuracy and computational efficiency. The main contributions of this work are summarized as follows:
  • We introduce DCDRNet, a detail–context decoupled representation learning framework that explicitly separates fine-grained detail extraction from global context modeling.
  • We design a parallel dual-branch architecture with adaptive receptive field modeling and structural re-parameterization, enabling DCDRNet to simultaneously preserve thin crack boundaries, capture long-range contextual continuity, and maintain high computational efficiency during inference.
  • We conduct extensive experiments on multiple public crack datasets, demonstrating that DCDRNet consistently achieves superior segmentation accuracy with significantly reduced model complexity and inference cost.

1.1. Crack Segmentation with Deep Models

Deep learning has become the dominant paradigm for crack segmentation, with most approaches adopting convolutional encoder–decoder architectures to learn hierarchical representations through progressive spatial abstraction [9]. By leveraging local receptive fields, multi-scale feature aggregation, and skip connections, these models significantly improve robustness to noise and background interference while enabling accurate localization of crack structures [10]. Extensions incorporating attention mechanisms further enhance feature discrimination and continuity, particularly in moderately complex scenes [11,12].
More recent studies introduce global context modeling through attention-based designs or hybrid CNN–Transformer architectures to address long and fragmented crack patterns [13,14,15]. Although these methods improve semantic coherence, they typically integrate global context in a serial manner, where local features are progressively downsampled before contextual reasoning is applied [16,17]. This design implicitly couples geometric detail and semantic abstraction within a single feature hierarchy, leading to irreversible compression of fine boundary information before global refinement [18,19]. As a result, crack segmentation performance remains sensitive to resolution loss and background complexity, highlighting the need for alternative representation strategies that better balance detail preservation and contextual reasoning.

1.2. Decoupled Representation Learning for Dense Prediction

Dense prediction tasks commonly involve an inherent asymmetry between spatial precision and semantic abstraction. High-level semantic reasoning benefits from aggressive feature aggregation and large receptive fields, whereas accurate localization of fine structures requires preserving high-resolution representations throughout the network [20]. Encoding these heterogeneous requirements within a single feature hierarchy often leads to representational interference, motivating a growing body of work that explicitly decouples different types of representations into parallel or role-specific branches [21].
Recent studies demonstrate that separating geometry-sensitive and context-aware representations can significantly improve dense prediction performance [22,23,24]. Parallel multi-branch architectures, role-aware feature fusion, and constrained interaction mechanisms have been shown to preserve fine structures while enabling effective global reasoning [25,26]. Rather than treating feature integration as a uniform aggregation process, these approaches emphasize asymmetric roles among representations, where certain branches act as anchors for spatial fidelity and others provide semantic modulation [27,28,29]. Such designs highlight the importance of explicit decoupling and controlled interaction when fine-grained structures must be maintained under complex contextual variation.

2. Materials and Methods

2.1. Datasets

We evaluate DCDRNet on three public crack segmentation benchmarks, namely DeepCrack [30], CrackForest Dataset (CFD) [31], and CrackTree260 [32], which together cover diverse crack morphologies, surface materials, and background complexities. DeepCrack contains high-resolution images with fine and elongated cracks under complex textures, emphasizing boundary preservation. CFD focuses on pavement cracks with relatively uniform backgrounds, serving as a standard benchmark for robustness evaluation. CrackTree260 includes images captured under diverse environmental conditions with higher background variability, posing challenges for maintaining crack continuity and suppressing false positives. Official or commonly adopted train–test splits are used to ensure fair comparison with prior work.

2.2. Implementation Details

All experiments are implemented in PyTorch 2.1 and trained end-to-end using the SGD optimizer with momentum of 0.9 and weight decay of 1 × 10⁻⁴. The initial learning rate is set to 0.01 and decayed with a cosine annealing schedule. Input images are resized to 224 × 224 to balance efficiency and spatial fidelity, and standard data augmentation techniques including random horizontal flip, vertical flip, and rotation are applied to improve generalization. No dataset-specific preprocessing is introduced. The dataset splits are set as follows: the DeepCrack dataset consists of 537 images (300 for training, 237 for testing), the CrackForest dataset comprises 120 images (96 for training, 24 for testing), and the CrackTree dataset contains 262 images (210 for training, 52 for testing). All models, including baselines, are trained under the same experimental settings whenever possible. Experiments are conducted on a single NVIDIA GeForce RTX 5090 GPU with 32 GB memory, using a batch size of 12 and training for 150 epochs. The inference speed (FPS) is measured on the same hardware with batch size 1 to ensure fair and reproducible evaluation.
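The optimizer and schedule above can be sketched in PyTorch as follows; the placeholder model is hypothetical and stands in for DCDRNet, whose architecture is described in Section 3.

```python
import torch
import torch.nn as nn

# Placeholder network standing in for DCDRNet (hypothetical; the actual
# architecture is described in Section 3).
model = nn.Conv2d(3, 1, 3, padding=1)

# SGD with momentum 0.9 and weight decay 1e-4, initial learning rate 0.01,
# decayed by cosine annealing over the 150 training epochs, as described above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=150)
```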

2.3. Evaluation Metrics

We adopt commonly used metrics for crack segmentation, including Precision, Recall, and F1-score, to evaluate pixel-level classification performance under severe class imbalance. F1-scores are computed per image and then averaged over the test set (set average). Intersection over Union (IoU) is reported to assess region-level overlap between predictions and ground truth, and AUC is included when applicable to measure threshold-independent discriminative capability.
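As a concrete illustration, the pixel-level metrics above can be computed from flat binary masks as follows; this is a plain-Python sketch, and the function name `prf_iou` is ours rather than from any library.

```python
def prf_iou(pred, gt):
    """Pixel-level Precision, Recall, F1, and IoU for binary masks given as
    flat 0/1 sequences (a simplified sketch of the metrics described above)."""
    tp = sum(1 for p, g in zip(pred, gt) if p and g)        # true positives
    fp = sum(1 for p, g in zip(pred, gt) if p and not g)    # false positives
    fn = sum(1 for p, g in zip(pred, gt) if not p and g)    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1, iou
```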

3. Results

3.1. Overview

DCDRNet is designed to address the inherent asymmetry between detail-sensitive and context-aware representations in crack segmentation. Fine crack boundaries are spatially sparse, highly localized, and vulnerable to quantization errors, whereas global crack continuity requires long-range semantic reasoning across cluttered backgrounds. Serial hybrid architectures implicitly force these heterogeneous cues into a single feature hierarchy, where repeated downsampling and late-stage global modeling often lead to irreversible boundary degradation. DCDRNet adopts an explicit decoupling strategy that learns detail and context representations in parallel, allowing each to evolve under task-appropriate inductive biases and be integrated through controlled interaction.
Given an input image $X \in \mathbb{R}^{3 \times H \times W}$, DCDRNet predicts a crack probability map $\hat{Y} \in [0, 1]^{H \times W}$. The overall mapping can be written as
$$\hat{Y} = \mathcal{D}\big(\Phi(E_d(X), E_c(X))\big),$$
where $E_d$ and $E_c$ denote the detail and context encoders, respectively, $\Phi$ represents a controlled fusion operator, and $\mathcal{D}$ is a lightweight decoder. This formulation explicitly enforces representational decoupling while allowing coordinated information exchange at selected stages. Figure 1 shows the overview of DCDRNet.
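This mapping can be sketched as a minimal PyTorch module; each component below is a single placeholder layer standing in for the paper's actual encoders, fusion operator, and decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualStreamSketch(nn.Module):
    """Minimal sketch of Y = D(Phi(E_d(X), E_c(X))). Each component is a
    single placeholder layer, not the paper's actual module."""
    def __init__(self, c=16):
        super().__init__()
        self.detail = nn.Conv2d(3, c, 3, padding=1)             # E_d: keeps resolution
        self.context = nn.Conv2d(3, c, 3, stride=2, padding=1)  # E_c: downsamples
        self.fuse = nn.Conv2d(2 * c, c, 1)                      # Phi: fusion operator
        self.decoder = nn.Conv2d(c, 1, 1)                       # D: lightweight decoder

    def forward(self, x):
        fd = self.detail(x)
        # Bring the context stream back to the detail stream's resolution.
        fc = F.interpolate(self.context(x), size=fd.shape[-2:],
                           mode="bilinear", align_corners=False)
        # Fuse and decode into a crack probability map in [0, 1].
        return torch.sigmoid(self.decoder(self.fuse(torch.cat([fd, fc], 1))))
```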

3.2. Detail Encoder

The primary role of the detail encoder is to preserve geometric fidelity, as shown in Figure 2. Crack boundaries often span only a few pixels, and once spatial precision is lost during encoding, it cannot be recovered through decoding or attention-based refinement. For this reason, the detail encoder maintains a high-resolution feature stream and restricts downsampling depth so that boundary cues remain explicitly represented throughout the encoding process.
To avoid representational bottlenecks while maintaining inference efficiency, DCDRNet adopts training–inference re-parameterization. During training, the transformation applied to an intermediate feature map $Z$ is parameterized as a composite operator,
$$Z' = \sigma\big(\mathcal{T}_{\mathrm{train}}(Z)\big),$$
where $\mathcal{T}_{\mathrm{train}}$ aggregates complementary linear mappings that enhance optimization flexibility and stabilize gradient propagation for thin and fragmented structures.
At deployment, this composite transformation is analytically collapsed into a single equivalent operator,
$$\mathcal{T}_{\mathrm{train}}(Z) \equiv \mathcal{T}_{\mathrm{infer}}(Z),$$
ensuring identical functional behavior with minimal computational overhead.
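As an illustration of this collapse, assuming the training-time composite is a parallel 3×3 + 1×1 convolution pair (a common RepVGG-style choice; the paper's exact composite operator is not specified here), the two branches can be merged analytically into one convolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed training-time composite: a parallel 3x3 + 1x1 convolution pair.
conv3 = nn.Conv2d(8, 8, 3, padding=1)
conv1 = nn.Conv2d(8, 8, 1)

# Inference-time collapse: pad the 1x1 kernel to 3x3 and sum the weights and
# biases, yielding a single equivalent 3x3 convolution (T_infer).
fused = nn.Conv2d(8, 8, 3, padding=1)
fused.weight.data = conv3.weight.data + F.pad(conv1.weight.data, [1, 1, 1, 1])
fused.bias.data = conv3.bias.data + conv1.bias.data

# Functional equivalence: the fused operator reproduces the parallel sum.
x = torch.randn(1, 8, 16, 16)
assert torch.allclose(conv3(x) + conv1(x), fused(x), atol=1e-5)
```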

3.3. Context Encoder

Geometry-preserving representations alone are insufficient for crack segmentation in complex scenes. Cracks frequently traverse heterogeneous backgrounds and may be partially occluded or interrupted by noise, making global semantic reasoning essential for maintaining structural continuity.
The context encoder addresses this requirement through adaptive receptive field modeling, as shown in Figure 3. Rather than committing to a fixed spatial scale, the encoder extracts multiple contextual responses at different extents and dynamically determines their relevance based on input content. Given a context feature map $F$, the aggregated context-aware representation is defined as
$$F_{\mathrm{ctx}} = \sum_{i=1}^{M} \alpha_i(F)\, F_i, \qquad \sum_{i=1}^{M} \alpha_i(F) = 1,$$
where $F_i$ denotes the $i$-th contextual response and $\alpha_i(F)$ is its content-dependent selection weight. This formulation allows the effective receptive field to vary spatially, emphasizing local evidence where boundaries are clear while expanding contextual scope where continuity must be inferred.
Channel-wise modulation based on global feature statistics is subsequently applied to reinforce semantic coherence, and a residual connection preserves stable local information.
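A minimal sketch of this adaptive aggregation is given below; the dilation rates are illustrative choices for the contextual extents, and for simplicity the selection weights are per-image scalars produced by a softmax gate (the paper's weights may additionally vary spatially).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveContext(nn.Module):
    """Sketch of adaptive receptive-field aggregation: contextual responses
    F_i at several dilation rates, combined with content-dependent weights
    alpha_i(F) that sum to one, followed by a residual connection."""
    def __init__(self, c=16, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(c, c, 3, padding=d, dilation=d) for d in dilations)
        # Predict one selection weight per contextual extent from global stats.
        self.gate = nn.Linear(c, len(dilations))

    def forward(self, f):
        responses = [b(f) for b in self.branches]                 # F_i
        alpha = F.softmax(self.gate(f.mean(dim=(2, 3))), dim=1)   # sum_i alpha_i = 1
        ctx = sum(alpha[:, i].view(-1, 1, 1, 1) * r
                  for i, r in enumerate(responses))
        return ctx + f                                            # residual connection
```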

3.4. Structured Detail–Context Interaction and Decoding

Decoupled representations must be carefully integrated to preserve their complementary strengths. Unrestricted fusion risks reintroducing the very interference that decoupling seeks to avoid, while overly constrained interaction may fail to leverage cross-branch synergies. DCDRNet addresses this challenge through a structured interaction module that performs asymmetric, role-aware feature integration.
The interaction module is intentionally designed as an asymmetric fusion mechanism rather than a symmetric bidirectional exchange. This design choice is motivated by three key observations. First, the Detail Encoder and Context Encoder serve fundamentally different purposes: detail features act as geometric anchors that must remain spatially precise, while context features provide semantic guidance that should modulate—but not overwrite—boundary-critical information. A symmetric fusion would risk contaminating high-frequency boundary details with low-frequency semantic patterns. Second, the two branches operate at different spatial resolutions, with the Detail branch maintaining H/8 resolution and the Context branch providing multi-scale features from H/8 to H/16. Direct bidirectional exchange would require repeated upsampling and downsampling, introducing interpolation artifacts and computational overhead. Third, empirical studies on dense prediction tasks demonstrate that preserving fine-grained spatial structure as an anchor while allowing semantic features to provide adaptive modulation yields superior boundary localization compared to symmetric fusion strategies.
The interaction module consists of three sequential components: Channel Alignment, Spatial Attention, and Channel Attention. Given detail features $F_d \in \mathbb{R}^{C_d \times H \times W}$ and context features $F_c \in \mathbb{R}^{C_c \times H' \times W'}$, we first align their spatial resolutions and channel dimensions:
$$\tilde{F}_c = \mathrm{Conv}_{1 \times 1}\big(\mathrm{Upsample}(F_c)\big),$$
where $\mathrm{Upsample}(\cdot)$ performs bilinear interpolation to match the spatial dimensions of $F_d$, and $\mathrm{Conv}_{1 \times 1}$ projects the channel dimension from $C_c$ to $C_d$. This alignment ensures that subsequent attention operations can be performed element-wise without dimensional conflicts.
To identify salient regions where context should modulate detail, we compute a spatial attention map:
$$A_s = \sigma\Big(\mathrm{Conv}_{7 \times 7}\big([\mathrm{AvgPool}(\tilde{F}_c);\, \mathrm{MaxPool}(\tilde{F}_c)]\big)\Big),$$
where $[\cdot\,;\cdot]$ denotes channel-wise concatenation, $\mathrm{AvgPool}$ and $\mathrm{MaxPool}$ operate along the channel dimension, and $\sigma$ denotes the sigmoid function. The $7 \times 7$ convolution captures local spatial structure, enabling the attention map to highlight crack-relevant regions while suppressing background interference.
To adaptively weight feature channels based on their semantic relevance, we apply channel attention:
$$A_c = \sigma\big(\mathrm{FC}(\mathrm{GAP}(\tilde{F}_c))\big),$$
where $\mathrm{GAP}$ denotes global average pooling and $\mathrm{FC}$ represents two fully connected layers with a reduction ratio of 16. This mechanism allows the network to emphasize channels that carry discriminative crack patterns.
The final fused representation combines detail and context features through attention-gated residual learning:
$$F_{\mathrm{fused}} = F_d + A_s \odot A_c \odot \mathrm{Conv}_{3 \times 3}(\tilde{F}_c),$$
where $\odot$ denotes element-wise multiplication. In this formulation, the detail representation $F_d$ serves as a geometric anchor that is preserved through the residual connection, while the context representation provides semantic modulation gated by spatial and channel attention. This asymmetric design ensures that thin crack boundaries encoded in $F_d$ remain intact, while contextual information selectively enhances feature discriminability in ambiguous regions.
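The three components can be sketched end-to-end as follows; channel widths and layer choices are illustrative simplifications of the described module, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetailContextFusion(nn.Module):
    """Sketch of the asymmetric interaction: align context features to the
    detail stream, gate them with spatial and channel attention, and add the
    result to the detail anchor. Channel widths are illustrative."""
    def __init__(self, cd=16, cc=32, reduction=16):
        super().__init__()
        self.align = nn.Conv2d(cc, cd, 1)             # channel alignment
        self.spatial = nn.Conv2d(2, 1, 7, padding=3)  # spatial attention conv
        hidden = max(cd // reduction, 1)
        self.fc = nn.Sequential(nn.Linear(cd, hidden), nn.ReLU(),
                                nn.Linear(hidden, cd))  # channel attention MLP
        self.proj = nn.Conv2d(cd, cd, 3, padding=1)

    def forward(self, fd, fc):
        # Align resolution (bilinear upsampling) and channel count.
        fc = self.align(F.interpolate(fc, size=fd.shape[-2:],
                                      mode="bilinear", align_corners=False))
        # A_s: sigmoid of a 7x7 conv over channel-wise avg- and max-pooling.
        a_s = torch.sigmoid(self.spatial(torch.cat(
            [fc.mean(1, keepdim=True), fc.max(1, keepdim=True).values], dim=1)))
        # A_c: sigmoid of two FC layers on globally average-pooled features.
        a_c = torch.sigmoid(self.fc(fc.mean(dim=(2, 3))))[:, :, None, None]
        # F_fused = F_d + A_s * A_c * Conv3x3(aligned F_c): detail is the anchor.
        return fd + a_s * a_c * self.proj(fc)
```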

3.5. Optimization Objective

Crack segmentation exhibits severe foreground–background imbalance, where crack pixels constitute only a small fraction of the image. To jointly enforce pixel-level discrimination and region-level structural consistency, DCDRNet is optimized using a hybrid loss composed of binary cross-entropy loss and Dice loss:
$$\mathcal{L} = \mathcal{L}_{\mathrm{BCE}} + \mathcal{L}_{\mathrm{Dice}}.$$
Let $\{y_i\}_{i=1}^{N}$ denote the ground-truth binary labels and $\{\hat{y}_i\}_{i=1}^{N}$ the corresponding predicted probabilities, where $N = H \times W$ is the number of pixels. The binary cross-entropy loss is defined as
$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \big[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \big],$$
which penalizes pixel-wise misclassification and promotes accurate boundary localization.
To complement local supervision, the Dice loss directly optimizes the overlap between predicted and ground-truth crack regions:
$$\mathcal{L}_{\mathrm{Dice}} = 1 - \frac{2 \sum_{i=1}^{N} y_i \hat{y}_i + \epsilon}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} \hat{y}_i + \epsilon},$$
where ϵ is a small constant added for numerical stability. Jointly optimizing these two terms encourages solutions that are both pixel-accurate and structurally coherent under extreme class imbalance.
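The hybrid objective follows directly from the two definitions above; this is a plain-Python sketch operating on flat lists of predicted probabilities and binary labels.

```python
import math

def hybrid_loss(pred, target, eps=1e-6):
    """L = L_BCE + L_Dice on flat lists of probabilities in (0, 1) and
    binary labels; a plain-Python sketch of the objective defined above."""
    n = len(pred)
    # Binary cross-entropy averaged over all pixels.
    bce = -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
               for p, t in zip(pred, target)) / n
    # Dice loss: one minus the (smoothed) soft overlap ratio.
    inter = sum(p * t for p, t in zip(pred, target))
    dice = 1 - (2 * inter + eps) / (sum(pred) + sum(target) + eps)
    return bce + dice
```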

4. Discussion

4.1. Main Results

To evaluate the effectiveness of DCDRNet, we compare it with fifteen representative segmentation methods on the DeepCrack, CrackForest, and CrackTree260 datasets. The compared approaches cover classical encoder–decoder architectures, multi-scale feature extraction models, attention-based methods, hybrid CNN–Transformer architectures, and recent crack-specific networks. All methods are trained and evaluated under identical experimental settings to ensure fair comparison; the quantitative results are summarized in Table 1, Table 2 and Table 3, while Table 4 reports an overall comparison based on the weighted average F1-score across all datasets.
On the DeepCrack dataset, DCDRNet achieves the best overall performance across all evaluation metrics, with an F1-score of 81.12%, a precision of 88.13%, and an IoU of 70.10%. Compared with the strongest competing methods, DCDRNet outperforms TransUNet by 0.30 percentage points in the F1-score and surpasses FPHBN by a margin of 1.08 percentage points. While several classical encoder–decoder models and attention-based methods achieve competitive F1-scores around 79–80%, their performance remains limited by the progressive downsampling and single-hierarchy representation. In contrast, DCDRNet preserves fine crack boundaries through its geometry-aware detail branch while maintaining strong semantic consistency, leading to improvements in both precision and region overlap.
On the CrackForest dataset, which contains fewer samples and places higher demands on model generalization, DCDRNet achieves an F1-score of 63.11% with balanced precision and recall. Although SegNet attains a slightly higher F1-score on this dataset, its advantage mainly stems from higher recall at the expense of precision. DCDRNet maintains a more balanced trade-off, reflecting its ability to preserve structural details while suppressing background interference. The competitive performance on CrackForest indicates that DCDRNet generalizes well beyond large-scale training data and does not rely on dataset-specific tuning.
The CrackTree260 dataset represents the most challenging scenario due to complex backgrounds and extremely thin crack structures. On this dataset, DCDRNet significantly outperforms all comparison methods, achieving an F1-score of 39.02% and an IoU of 24.52%, ranking first by a clear margin. Most competing approaches yield F1-scores below 30%, highlighting their difficulty in maintaining crack continuity under severe background clutter. The superior performance of DCDRNet can be attributed to its explicit decoupling of detail-sensitive and context-aware representations, which allows fine crack structures to be preserved while leveraging adaptive contextual reasoning.
To further assess overall robustness, Table 4 summarizes the performance of all methods using a weighted average F1-score, where the contribution of each dataset is proportional to its test set size. Under this unified evaluation, DCDRNet ranks first with a weighted average F1-score of 72.74%, outperforming all competing methods by a clear margin. Several methods that perform well on individual datasets exhibit noticeable performance drops on CrackTree260, which substantially lowers their weighted scores. In contrast, DCDRNet maintains consistently strong performance across all datasets, demonstrating superior cross-dataset robustness and confirming its effectiveness under varying data scales and difficulty levels.
Overall, DCDRNet demonstrates consistent performance across datasets with varying characteristics and difficulty levels.

4.2. Training Convergence Analysis

Figure 4 illustrates the training convergence behavior of DCDRNet on the three benchmark datasets. On DeepCrack, the training loss decreases rapidly in early epochs, and the validation loss stabilizes after approximately 75 epochs, indicating efficient optimization and good generalization. The Dice score exceeds 80% around epoch 70 and reaches a peak of 81.12%, remaining stable thereafter. On CrackForest, the convergence trend is similarly stable but shows slightly higher variance due to the smaller dataset size, with the best Dice score reaching 63.11%. The CrackTree260 dataset presents the greatest challenge, exhibiting increased fluctuation during training; nevertheless, DCDRNet converges steadily and achieves a best Dice score of 39.02%. Across all datasets, the absence of divergence and the smooth stabilization of validation metrics demonstrate that the parallel dual-stream design supports stable optimization and reliable convergence under varying data scales and complexity.

4.3. Ablation Study

We conduct ablation studies on DeepCrack, CrackForest, and CrackTree260 to analyze the contribution of each component in DCDRNet, including the parallel dual-stream design (P), the Context Encoder (C), and the Detail Encoder (D). Table 5, Table 6 and Table 7 report the quantitative results under different component combinations.
The Base model refers to a standard U-Net style encoder–decoder network with ResNet-18 backbone, without any of the proposed components (i.e., no parallel dual-stream design, no Context Encoder with LSK attention, and no Detail Encoder with structural re-parameterization).
Across all three datasets, each component yields consistent performance gains over the baseline, confirming their individual effectiveness. Among single-component variants, the Context Encoder achieves the largest improvements with a relatively small parameter budget, demonstrating its efficiency in capturing adaptive contextual information across varying spatial extents. The Detail Encoder further enhances boundary-sensitive representation while maintaining a compact model size, highlighting its role in preserving fine-grained geometric fidelity. Notably, the combination of the Context Encoder and Detail Encoder (C + D) consistently outperforms other partial configurations, indicating complementary benefits between adaptive context modeling and geometry-preserving feature learning.
The full model that integrates all three components achieves the best performance on every dataset. On DeepCrack, the complete DCDRNet reaches an F1-score of 81.12%, substantially outperforming all partial variants. On CrackForest, where generalization is critical due to limited training data, the full model attains the highest F1-score of 63.11%, reflecting a balanced improvement in precision and recall. On the most challenging CrackTree260 dataset, DCDRNet shows the largest relative gain, improving the F1-score from 34.90% with the C + D configuration to 39.02% when the parallel dual-stream design is introduced, highlighting the importance of explicit detail–context decoupling under complex backgrounds.
Overall, the ablation results demonstrate that while each component contributes independently, their integration produces a clear synergistic effect.

4.4. Efficiency Analysis

We compare model complexity and inference efficiency with all competing methods to evaluate the practical deployment capability of DCDRNet, as summarized in Table 8. DCDRNet contains 11.23 M parameters with a model size of 42.84 MB, placing it among the most compact models while achieving the highest weighted average F1-score of 72.74%. In terms of inference speed, DCDRNet reaches 197.63 FPS, which is close to the real-time threshold and sufficient for practical crack inspection scenarios.
Compared with heavyweight architectures such as MANet, TransUNet, and UNet++, DCDRNet achieves superior segmentation accuracy with significantly fewer parameters. Notably, DCDRNet reduces the parameter count by more than 65% compared with the classical U-Net while improving the F1-score on the challenging CrackTree260 dataset by 4.17 percentage points, demonstrating a favorable accuracy–efficiency trade-off. The high F1/Params ratio further confirms that DCDRNet delivers strong representational capacity under a constrained computational budget.
Figure 5 visualizes the relationship between segmentation accuracy and computational cost. Under a logarithmic parameter scale, DCDRNet achieves the highest weighted average F1-score among models with fewer than 15 M parameters. When inference speed is considered, DCDRNet lies in the upper-left region of the accuracy–speed plot, indicating the best performance among methods capable of near real-time inference.
Figure 6 provides a comprehensive multi-dimensional comparison of all eighteen methods using a radar chart, with the area value of each model displayed in the legend for quantitative comparison. The radar chart evaluates five metrics: F1-scores on the DeepCrack, CrackForest, and CrackTree260 datasets, model efficiency (1/Parameters), and inference speed (FPS). DCDRNet achieves an area value of 1.47, ranking second only to SegNet (1.66), which benefits from its extremely high FPS (494.54). However, DCDRNet significantly outperforms SegNet on the challenging CrackTree260 dataset (39.02% vs. 28.06% F1-score), demonstrating superior robustness under complex backgrounds. Among all methods, DCDRNet exhibits the most balanced performance across all five dimensions, particularly excelling on CrackTree260 while maintaining competitive efficiency and speed.

4.5. Qualitative Visual Comparison

Figure 7 presents qualitative comparisons on representative test images from the DeepCrack dataset. The first two rows show the input images and corresponding ground-truth annotations, followed by predictions generated by different methods. DCDRNet produces segmentation results that are most consistent with the ground truth, particularly in preserving thin crack structures, maintaining structural continuity, and avoiding spurious responses.
As highlighted by the red boxes, many baseline methods exhibit characteristic failure modes under complex conditions. Models relying on aggressive spatial abstraction tend to miss fine crack branches or produce fragmented predictions, while others introduce false positives in textured backgrounds. Even methods that preserve local details often struggle to maintain long-range crack continuity, leading to broken or noisy segmentations. In contrast, DCDRNet consistently delineates continuous crack patterns with precise boundaries and minimal background interference.
The qualitative improvements can be attributed to the explicit separation of detail-sensitive and context-aware representations in DCDRNet. The Detail Encoder preserves high-resolution geometric cues essential for capturing thin and elongated cracks, while the Context Encoder provides adaptive global reasoning to reinforce structural continuity. Their controlled interaction enables DCDRNet to effectively balance boundary precision and semantic consistency, resulting in visually coherent and accurate crack segmentation across diverse scenarios.
Areas where DCDRNet underperformed are clearly marked with yellow rectangles. These cases are likely due to the relatively limited network depth, which restricts the model’s capacity to capture sufficient global contextual information. In complex scenarios, this can hinder the effective integration of long-range semantic dependencies, thereby affecting the continuity of local crack structures. This observation points to a direction for future improvement, such as deepening the network or incorporating more efficient long-range dependency modeling mechanisms to enhance global information integration.

4.6. Precision–Recall Curve Analysis

As shown in Figure 8, DCDRNet achieves the highest Area Under the PR Curve (AUC) scores of 0.881, 0.867, and 0.844 on the DeepCrack, CrackForest, and CrackTree260 datasets, respectively, outperforming all baseline methods. This indicates that DCDRNet maintains consistently high precision across a wide range of recall values, reflecting its robust ability to minimize false positives while effectively detecting true crack pixels. The superior AUC scores demonstrate that DCDRNet’s advantage is not limited to a specific operating point but extends across the entire precision–recall spectrum.
The PR curves also reveal several important model characteristics. Methods whose curves lie closer to the top-right corner, such as DCDRNet and TransUNet, strike a better balance between precision and recall, making them more suitable for practical applications in which both false positives and false negatives carry significant costs. In contrast, models with lower AUC scores show steeper precision drops as recall increases, indicating a stronger tendency to produce false positives when attempting to capture more crack pixels.
Furthermore, the PR curves guide threshold selection for different application scenarios. For safety-critical infrastructure inspection, where missing a crack can have severe consequences, a lower threshold favoring higher recall may be preferred despite some loss of precision. Conversely, for applications where false alarms are costly, a higher threshold maintaining high precision is more appropriate. DCDRNet’s consistently high PR curve across all thresholds makes it well-suited to both scenarios.
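The PR-curve construction and trapezoidal AUC computation described here can be sketched as follows. This is a minimal illustration on toy data, not the evaluation code used in the paper; the function and variable names are ours.

```python
import numpy as np

def pr_curve(scores, labels, thresholds):
    """Precision/recall of a pixel-wise score map at each binarization threshold."""
    precisions, recalls = [], []
    for t in thresholds:
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        precisions.append(tp / max(tp + fp, 1))  # avoid 0/0 when nothing is predicted
        recalls.append(tp / max(tp + fn, 1))
    return np.array(precisions), np.array(recalls)

# toy crack probabilities vs. ground-truth pixel labels
scores = np.array([0.9, 0.8, 0.6, 0.4, 0.3, 0.1])
labels = np.array([1, 1, 0, 1, 0, 0])
p, r = pr_curve(scores, labels, np.linspace(0.0, 1.0, 101))

# AUC by trapezoidal integration over recall (sorted ascending)
order = np.argsort(r)
ps, rs = p[order], r[order]
auc = float(np.sum(0.5 * (ps[1:] + ps[:-1]) * np.diff(rs)))
```

Sweeping the threshold from 0 to 1 traces the precision–recall trade-off; integrating precision over recall yields the AUC values reported in Figure 8.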

5. Conclusions

In this paper, we presented DCDRNet, a detail–context decoupled network for crack segmentation that addresses the structural conflict between geometric fidelity and global semantic reasoning. By separating detail-sensitive and context-aware representations into parallel encoding streams and enabling their controlled interaction, DCDRNet preserves fine crack boundaries while maintaining long-range structural continuity under complex backgrounds. Extensive experiments on three public benchmarks demonstrate that DCDRNet consistently outperforms state-of-the-art methods in both accuracy and robustness, particularly on challenging datasets with thin and fragmented cracks. Moreover, DCDRNet achieves a balance between performance and efficiency, delivering segmentation quality with a compact model size and near real-time inference speed. These results suggest that explicit representation decoupling provides an effective and practical solution for robust crack segmentation in real-world inspection scenarios.

Author Contributions

R.H. implemented the methodology and conducted the comparative and ablation experiments; M.F. and Y.H. contributed to manuscript revision. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (52371284), the Shanghai Collaborative Innovation Science and Technology Plan Program (24xtcx00600), the Ling Chuang Research Project of China National Nuclear Corporation, and the Leading Innovative and Entrepreneur Team Introduction Program of Zhejiang (2022R02013).

Data Availability Statement

The datasets used in this work are publicly available at: https://pan.baidu.com/s/1C1hxXyzcGe8H3ywWO8CwJQ?pwd=1234 (accessed on 1 January 2026). https://pan.baidu.com/s/1L5vHABXthvY02Sb9sC32yw?pwd=1234 (accessed on 1 January 2026). https://pan.baidu.com/s/1rDXXX6GqQ61d9IBbZXCXjw?pwd=1234 (accessed on 1 January 2026). The source code and pre-trained models are available on GitHub at: https://github.com/huangrihua987/DCDRNet (accessed on 1 January 2026), along with a detailed usage tutorial.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Alsheyab, M.A.; Khasawneh, M.A.; Abualia, A.; Sawalha, A. A critical review of fatigue cracking in asphalt concrete pavement: A challenge to pavement durability. Innov. Infrastruct. Solut. 2024, 9, 386.
  2. Zhou, S.; Canchila, C.; Song, W. Deep learning-based crack segmentation for civil infrastructure: Data types, architectures, and benchmarked performance. Autom. Constr. 2023, 146, 104678.
  3. Azouz, Z.; Asli, B.H.S.; Khan, M. Evolution of crack analysis in structures using image processing technique: A review. Electronics 2023, 12, 3862.
  4. Chakurkar, P.S.; Vora, D.; Patil, S.; Mishra, S.; Kotecha, K. Data-driven approach for AI-based crack detection: Techniques, challenges, and future scope. Front. Sustain. Cities 2023, 5, 1253627.
  5. Wang, C.; Liu, H.; An, X.; Gong, Z.; Deng, F. SwinCrack: Pavement crack detection using convolutional swin-transformer network. Digit. Signal Process. 2024, 145, 104297.
  6. Wang, Z.; Leng, Z.; Zhang, Z. A weakly-supervised transformer-based hybrid network with multi-attention for pavement crack detection. Constr. Build. Mater. 2024, 411, 134134.
  7. Su, G.; Qin, Y.; Xu, H.; Liang, J. Automatic real-time crack detection using lightweight deep learning models. Eng. Appl. Artif. Intell. 2024, 138, 109340.
  8. Hu, X.; Li, H.; Feng, Y.; Qian, S.; Li, J.; Li, S. CCDFormer: A dual-backbone complex crack detection network with transformer. Pattern Recognit. 2025, 161, 111251.
  9. Wang, W.; Su, C. Convolutional neural network-based pavement crack segmentation using pyramid attention network. IEEE Access 2020, 8, 206548–206558.
  10. Yuan, G.; Li, J.; Meng, X.; Li, Y. CurSeg: A pavement crack detector based on a deep hierarchical feature learning segmentation framework. IET Intell. Transp. Syst. 2022, 16, 782–799.
  11. Jing, P.; Yu, H.; Hua, Z.; Xie, S.; Song, C. Road crack detection using deep neural network based on attention mechanism and residual structure. IEEE Access 2022, 11, 919–929.
  12. Zhao, S.; Zhang, G.; Zhang, D.; Tan, D.; Huang, H. A hybrid attention deep learning network for refined segmentation of cracks from shield tunnel lining images. J. Rock Mech. Geotech. Eng. 2023, 15, 3105–3117.
  13. Zhou, Z.; Zhang, J.; Gong, C. Hybrid semantic segmentation for tunnel lining cracks based on Swin Transformer and convolutional neural network. Comput.-Aided Civ. Infrastruct. Eng. 2023, 38, 2491–2510.
  14. Fan, Y.; Hu, Z.; Li, Q.; Sun, Y.; Chen, J.; Zhou, Q. CrackNet: A hybrid model for crack segmentation with dynamic loss function. Sensors 2024, 24, 7134.
  15. Zhou, Y.; Ali, R.; Mokhtar, N.; Harun, S.W.; Iwahashi, M. MixSegNet: A novel crack segmentation network combining CNN and Transformer. IEEE Access 2024, 12, 111535–111545.
  16. Xu, Y.; Xia, Y.; Zhao, Q.; Yang, K.; Li, Q. A road crack segmentation method based on transformer and multi-scale feature fusion. Electronics 2024, 13, 2257.
  17. Yadav, D.P.; Sharma, B.; Chauhan, S.; Ben Dhaou, I. Bridging convolutional neural networks and transformers for efficient crack detection in concrete building structures. Sensors 2024, 24, 4257.
  18. Yu, K.; Chen, I.-M.; Wu, J. DSCformer: A Dual-Branch Network Integrating Enhanced Dynamic Snake Convolution and SegFormer for Crack Segmentation. arXiv 2024, arXiv:2411.09371.
  19. Zim, A.H.; Iqbal, A.; Al-Huda, Z.; Malik, A.; Kuribayashi, M. EfficientCrackNet: A lightweight model for crack segmentation. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 28 February–4 March 2025; IEEE: New York, NY, USA, 2025.
  20. Li, H.; Ren, Q.; Li, J.; Wei, H.; Liu, Z.; Fan, L. A biologically inspired separable learning vision model for real-time traffic object perception in dark. Expert Syst. Appl. 2025, 297, 129529.
  21. Yu, L.; Yao, A.; Duan, J. Improving Semantic Segmentation via Decoupled Body and Edge Information. Entropy 2023, 25, 891.
  22. Han, C.; Zhong, Y.; Li, D.; Han, K.; Ma, L. Open-vocabulary semantic segmentation with decoupled one-pass network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023.
  23. Li, X.; Li, X.; Zhang, L.; Cheng, G.; Shi, J.; Lin, Z.; Tan, S.; Tong, Y. Improving semantic segmentation via decoupled body and edge supervision. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer International Publishing: Cham, Switzerland, 2020.
  24. Sun, H.; Chen, Y.; Lu, X.; Xiong, S. Decoupled feature pyramid learning for multi-scale object detection in low-altitude remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 6556–6567.
  25. Xue, J.; Zhang, Z. Fast-DecoupledNet: An improved multi-branch edge enhanced semantic segmentation network. J. Phys. Conf. Ser. 2023, 2637, 012031.
  26. Shen, L.; Zhang, Y.; Wang, Q.; Qin, F.; Sun, D.; Min, H.; Meng, Q.; Xu, C.; Zhao, W.; Song, X. Feature interaction network based on hierarchical decoupled convolution for 3D medical image segmentation. PLoS ONE 2023, 18, e0288658.
  27. Wang, J.; Chen, B.; Li, Y.; Kang, B.; Chen, Y.; Tian, Z. DeCLIP: Decoupled learning for open-vocabulary dense perception. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025.
  28. Guo, Y.; Lu, Y.; Zhang, W.; Xu, Z.; Chen, D.; Zhang, S.; Zhang, Y.; Wang, R. Decoupling Continual Semantic Segmentation. arXiv 2025, arXiv:2508.05065.
  29. Bi, Q.; Zhang, R.; Wang, H.; Li, S. Learning generalized medical image segmentation from decoupled feature queries. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 1106–1114.
  30. Zou, Q.; Zhang, Z.; Li, Q.; Qi, X.; Wang, Q.; Wang, S. DeepCrack: Learning hierarchical convolutional features for crack detection. IEEE Trans. Image Process. 2019, 28, 1498–1512.
  31. Shi, Y.; Cui, L.; Qi, Z.; Meng, F.; Chen, Z. Automatic road crack detection using random structured forests. IEEE Trans. Intell. Transp. Syst. 2016, 17, 3434–3445.
  32. Zou, Q.; Cao, Y.; Li, Q.; Mao, Q.; Wang, S. CrackTree: Automatic crack detection from pavement images. Pattern Recognit. Lett. 2012, 33, 1055–1062.
  33. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306.
  34. Chaurasia, A.; Culurciello, E. LinkNet: Exploiting encoder representations for efficient semantic segmentation. In Proceedings of the IEEE Visual Communications and Image Processing (VCIP), St. Petersburg, FL, USA, 10–13 December 2017; pp. 1–4.
  35. Yang, F.; Zhang, L.; Yu, S.; Prokhorov, D.; Mei, X.; Ling, H. Feature pyramid and hierarchical boosting network for pavement crack detection. IEEE Trans. Intell. Transp. Syst. 2020, 21, 1525–1535.
  36. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241.
  37. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
  38. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: A nested U-Net architecture for medical image segmentation. In Proceedings of the 4th International Workshop on Deep Learning in Medical Image Analysis (DLMIA 2018), Granada, Spain, 20 September 2018; Springer: Cham, Switzerland, 2018; pp. 3–11.
  39. Fan, T.; Wang, G.; Li, Y.; Wang, H. MA-Net: A multi-scale attention network for liver and tumor segmentation. IEEE Access 2020, 8, 179656–179665.
  40. Li, H.; Yue, D.; Gu, X.; Wang, Q. SSGNet: Semi-supervised semantic segmentation network for crack detection. Autom. Constr. 2022, 141, 104441.
  41. Guo, J.-M.; Markoni, H. Efficient and Adaptable Patch-Based Crack Detection. IEEE Trans. Intell. Transp. Syst. 2022, 23, 21885–21896.
  42. Sun, X.; Xie, Y.; Jiang, L.; Cao, Y.; Liu, B. DMA-Net: DeepLab with multi-scale attention for pavement crack segmentation. IEEE Trans. Intell. Transp. Syst. 2022, 23, 1201–1211.
  43. Song, W.; Jia, G.; Zhu, H.; Jia, D.; Gao, L. Automated pavement crack damage detection using deep multiscale convolutional features. J. Adv. Transp. 2020, 2020, 6412562.
  44. Zhang, H.; Wu, Z.; Yang, Z.; Xing, H.; Cao, D.; Chen, N. CarNet: Context aware refined network for crack detection in autonomous driving. Signal Process. Image Commun. 2022, 109, 116865.
  45. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
  46. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the 15th European Conference on Computer Vision (ECCV 2018), Munich, Germany, 8–14 September 2018; pp. 801–818.
  47. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
  48. Zhang, H.; Zhang, A.A.; Dong, Z.; He, A.; Liu, Y.; Zhan, Y.; Wang, K.C.P. Robust Semantic Segmentation for Automatic Crack Detection Within Pavement Images Using Multi-Mixing of Global Context and Local Image Features. IEEE Trans. Intell. Transp. Syst. 2024, 25, 11282–11303.
Figure 1. Overall architecture of DCDRNet. The network adopts a parallel dual-stream design that explicitly decouples detail-sensitive and context-aware representations from the input stage. The Detail Branch preserves high-resolution geometric cues throughout encoding at H/2, H/4, and H/8 resolutions. The Context Branch performs progressive abstraction through LSKBlock stages to model long-range semantic dependencies. The structured interaction module fuses detail features (H/8) with context features (H/8 and H/16) through channel alignment, spatial attention, and channel attention mechanisms. The decoder progressively upsamples the fused representation.
Figure 2. Detail encoder with training–inference re-parameterization. The Detail Encoder is designed to preserve geometric fidelity by maintaining a high-resolution feature stream throughout encoding.
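The training–inference re-parameterization of the Detail Encoder can be illustrated with a single-channel toy example. Because convolution is linear in its kernel, parallel training-time branches can be folded into one kernel for inference; the specific branch topology below (a 3×3 conv, a 1×1 conv, and an identity shortcut) is an assumption for illustration, not the network's exact design.

```python
import numpy as np

def conv2d_same(x, k):
    """Naive single-channel cross-correlation with zero padding ('same' output size)."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = (xp[i:i + kh, j:j + kw] * k).sum()
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
k3 = rng.standard_normal((3, 3))  # 3x3 training-time branch
k1 = rng.standard_normal((1, 1))  # 1x1 training-time branch

# training-time output: sum of parallel branches plus identity shortcut
train_out = conv2d_same(x, k3) + conv2d_same(x, k1) + x

# inference-time: fold all branches into a single 3x3 kernel
identity = np.zeros((3, 3))
identity[1, 1] = 1.0                     # identity as a 3x3 kernel
merged = k3 + np.pad(k1, 1) + identity   # 1x1 kernel zero-padded to 3x3
infer_out = conv2d_same(x, merged)       # equals train_out by linearity
```

The merged branch produces identical outputs with a single convolution, which is why re-parameterized encoders keep training-time expressiveness at no inference cost.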
Figure 3. Adaptive receptive field modeling in the Context Encoder. The Context Encoder captures global semantic dependencies by adaptively selecting contextual information at different spatial extents. Multiple contextual responses are generated in parallel and dynamically weighted according to the input content, forming an aggregated representation.
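The adaptive aggregation in Figure 3 can be sketched as follows. This is a simplified numpy stand-in, assuming average-pooled responses at two hypothetical kernel sizes and a per-pixel softmax over branches; the actual LSKBlock uses learned convolutions and attention rather than these fixed operators.

```python
import numpy as np

def adaptive_context(x, kernels=(3, 7)):
    """Aggregate context at several spatial extents with content-dependent weights."""
    responses = []
    for k in kernels:
        pad = k // 2
        xp = np.pad(x, pad, mode="edge")
        out = np.empty_like(x, dtype=float)
        for i in range(x.shape[0]):
            for j in range(x.shape[1]):
                out[i, j] = xp[i:i + k, j:j + k].mean()  # response at extent k
        responses.append(out)
    stacked = np.stack(responses)          # (branches, H, W)
    w = np.exp(stacked)
    w /= w.sum(axis=0, keepdims=True)      # per-pixel softmax over branches
    return (w * stacked).sum(axis=0)       # convex combination of responses

x = np.arange(16.0).reshape(4, 4)
y = adaptive_context(x)
```

Because the weights form a per-pixel convex combination, the aggregated response stays within the range of the individual branch responses while adapting the effective receptive field to the content.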
Figure 4. Training convergence analysis of DCDRNet on the three benchmark datasets over 150 epochs. Each subplot shows training loss, validation loss, Dice score, and IoU. The model exhibits rapid convergence within the first 50 epochs and maintains stable performance thereafter, demonstrating the effectiveness of the proposed parallel dual-stream architecture.
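The Dice score and IoU tracked in Figure 4 are standard overlap metrics between the predicted and ground-truth masks; a minimal implementation (function name ours):

```python
import numpy as np

def dice_iou(pred, gt):
    """Dice and IoU for binary segmentation masks."""
    inter = np.logical_and(pred, gt).sum()
    dice = 2.0 * inter / (pred.sum() + gt.sum())
    iou = inter / np.logical_or(pred, gt).sum()
    return float(dice), float(iou)

pred = np.array([[1, 1, 0], [0, 1, 0]], dtype=bool)
gt = np.array([[1, 0, 0], [0, 1, 1]], dtype=bool)
dice, iou = dice_iou(pred, gt)  # intersection = 2 → dice = 4/6 ≈ 0.667, iou = 2/4 = 0.5
```

Dice weights the intersection twice relative to the mask sizes, so it is always at least as large as IoU for non-empty overlap; both reward the same behavior on thin cracks, where a few mislabeled boundary pixels change the scores noticeably.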
Figure 5. Efficiency analysis of crack segmentation methods. (a) Average F1-score vs. model parameters (log scale). DCDRNet achieves the highest accuracy among lightweight models. (b) Average F1-score vs. inference speed. The dashed line indicates the real-time threshold (200 FPS). DCDRNet achieves superior accuracy while maintaining real-time performance.
Figure 6. Multi-dimensional performance comparison using radar chart. Five metrics are evaluated: F1-scores on DeepCrack, CrackForest, and CrackTree260 datasets, model efficiency (1/Parameters), and inference speed (FPS). The area value of each model’s coverage polygon is displayed in parentheses in the legend, enabling quantitative comparison. DCDRNet (Ours) achieves an area of 1.47, demonstrating balanced excellence across all evaluation dimensions, with particularly strong performance on the challenging CrackTree260 benchmark.
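The polygon area reported in the radar-chart legend can be computed with the shoelace formula, assuming the five normalized metrics are placed at equal angles around the origin (a sketch of the presumed computation; the paper's exact normalization is not specified here):

```python
import math

def radar_area(values):
    """Shoelace area of the polygon whose vertices sit at equal angles,
    at distance from the origin given by each normalized metric."""
    n = len(values)
    pts = [(v * math.cos(2 * math.pi * k / n), v * math.sin(2 * math.pi * k / n))
           for k, v in enumerate(values)]
    area = 0.0
    for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1]):
        area += x1 * y2 - x2 * y1  # signed cross product of consecutive vertices
    return abs(area) / 2.0

# all five metrics maxed out gives a regular pentagon of circumradius 1
radar_area([1.0] * 5)  # → (5/2)·sin(72°) ≈ 2.378
```

Under this convention a larger area indicates more balanced performance across all five axes, which is how the legend values enable quantitative comparison.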
Figure 7. Visual comparison of crack segmentation results on DeepCrack test images. The first row shows the original images, the second row shows ground truth labels, and subsequent rows display predictions from different methods. Red boxes highlight regions where baseline methods exhibit notable errors (missed detections, discontinuities, or false positives), while yellow boxes indicate areas where our model (DCDRNet) underperforms. DCDRNet maintains accurate segmentation with better crack continuity and boundary precision in the remaining regions.
Figure 8. Precision–Recall curves for crack segmentation methods on the DeepCrack, CrackForest, and CrackTree260 datasets. Each curve represents the precision–recall trade-off at different classification thresholds. The Area Under the Curve (AUC) values are shown in the legend. DCDRNet (Ours) achieves the highest AUC values of 0.881, 0.867, and 0.844, respectively, demonstrating superior overall performance across all threshold settings. The red star represents our model.
Table 1. Performance comparison on DeepCrack dataset.
| Method | Prec (%) | Recall (%) | F1 (%) | IoU (%) |
|---|---|---|---|---|
| DCDRNet (Ours) | **88.13** | **78.47** | **81.12** | **70.10** |
| TransUNet [33] | 86.45 | 78.42 | 80.82 | 69.01 |
| LinkNet [34] | 86.82 | 77.66 | 80.38 | 68.79 |
| FPHBN [35] | 85.27 | 77.58 | 80.04 | 67.67 |
| U-Net [36] | 86.15 | 77.89 | 80.00 | 68.41 |
| SegNet [37] | 86.74 | 76.55 | 79.60 | 67.60 |
| UNet++ [38] | 86.84 | 76.27 | 79.58 | 67.60 |
| MANet [39] | 86.34 | 76.85 | 79.41 | 67.60 |
| SSGNet [40] | 84.38 | 75.33 | 78.65 | 65.05 |
| PBNet [41] | 83.94 | 76.85 | 78.78 | 65.96 |
| MFANet [42] | 84.94 | 74.48 | 78.07 | 64.74 |
| PAFNet [43] | 85.16 | 74.17 | 77.75 | 64.53 |
| CarNet [44] | 83.81 | 74.98 | 77.71 | 64.72 |
| FPN [45] | 79.44 | 74.59 | 75.22 | 62.17 |
| DeepLabV3+ [46] | 78.53 | 74.89 | 74.37 | 61.46 |
| PSPNet [47] | 71.28 | 59.31 | 61.81 | 47.36 |
| MixSegNet [15] | 79.90 | 78.07 | 77.23 | 64.44 |
| Mix-Graph-CrackNet [48] | 75.23 | 76.89 | 76.05 | 61.35 |
Bold denotes the maximum value of the specific metric.
Table 2. Performance comparison on CrackForest dataset.
| Method | Prec (%) | Recall (%) | F1 (%) | IoU (%) |
|---|---|---|---|---|
| DCDRNet (Ours) | 66.15 | 65.74 | 63.11 | 47.98 |
| TransUNet | 63.02 | 66.00 | 63.12 | 46.86 |
| LinkNet | 63.49 | 61.79 | 62.01 | 46.10 |
| FPHBN | 55.99 | **71.89** | 62.02 | 45.74 |
| U-Net | 63.18 | 67.47 | 64.38 | 48.50 |
| SegNet | **67.21** | 69.45 | **67.56** | **51.82** |
| UNet++ | 65.20 | 66.65 | 65.01 | 49.37 |
| MANet | 67.07 | 61.74 | 62.23 | 46.92 |
| SSGNet | 59.93 | 64.03 | 60.27 | 44.01 |
| PBNet | 60.56 | 65.90 | 61.00 | 45.22 |
| MFANet | 62.46 | 64.55 | 61.74 | 45.41 |
| PAFNet | 62.11 | 63.78 | 61.60 | 45.22 |
| CarNet | 53.79 | 63.75 | 55.82 | 38.57 |
| FPN | 54.81 | 62.52 | 56.99 | 40.67 |
| DeepLabV3+ | 56.96 | 63.84 | 58.53 | 42.43 |
| PSPNet | 41.66 | 38.05 | 36.62 | 23.57 |
| MixSegNet | 56.12 | 71.32 | 61.66 | 45.28 |
| Mix-Graph-CrackNet | 56.12 | 60.45 | 58.27 | 41.12 |
Bold denotes the maximum value of the specific metric.
Table 3. Performance comparison on CrackTree260 dataset.
| Method | Prec (%) | Recall (%) | F1 (%) | IoU (%) |
|---|---|---|---|---|
| DCDRNet (Ours) | **31.05** | **54.40** | **39.02** | **24.52** |
| TransUNet | 23.69 | 33.50 | 27.48 | 15.99 |
| LinkNet | 21.42 | 28.33 | 23.69 | 13.52 |
| FPHBN | 10.04 | 25.33 | 13.73 | 7.43 |
| U-Net | 22.54 | 33.18 | 26.34 | 15.23 |
| SegNet | 23.95 | 34.64 | 28.06 | 16.39 |
| UNet++ | 21.70 | 30.64 | 24.79 | 14.22 |
| MANet | 22.49 | 32.53 | 25.96 | 14.99 |
| SSGNet | 16.23 | 33.16 | 21.48 | 11.93 |
| PBNet | 20.22 | 40.58 | 26.51 | 15.36 |
| MFANet | 18.19 | 34.59 | 23.85 | 13.36 |
| PAFNet | 19.40 | 36.18 | 25.14 | 14.36 |
| CarNet | 8.56 | 17.87 | 10.92 | 5.80 |
| FPN | 10.72 | 6.65 | 7.38 | 3.88 |
| DeepLabV3+ | 10.53 | 6.80 | 7.46 | 3.91 |
| PSPNet | 4.98 | 2.07 | 2.58 | 1.32 |
| MixSegNet | 13.56 | 27.26 | 17.72 | 9.77 |
| Mix-Graph-CrackNet | 24.34 | 26.56 | 25.43 | 14.57 |
Bold denotes the maximum value of the specific metric.
Table 4. Summary of all methods sorted by weighted average F1-score. The weighted average is calculated as F̄₁ = (Σᵢ F1,ᵢ × Nᵢ) / (Σᵢ Nᵢ), where Nᵢ denotes the number of test samples for each dataset (DeepCrack: 237, CrackForest: 24, CrackTree260: 52).

| Method | DeepCrack F1 (%) | CrackForest F1 (%) | CrackTree260 F1 (%) | Weighted Avg F1 (%) |
|---|---|---|---|---|
| DCDRNet (Ours) | **81.12** | 63.11 | **39.02** | **72.74** |
| TransUNet | 80.82 | 63.12 | 27.48 | 70.60 |
| LinkNet | 80.38 | 62.01 | 23.69 | 69.55 |
| FPHBN | 80.04 | 62.02 | 13.73 | 67.64 |
| U-Net | 80.00 | 64.38 | 26.34 | 69.89 |
| SegNet | 79.60 | **67.56** | 28.06 | 70.11 |
| UNet++ | 79.58 | 65.01 | 24.79 | 69.36 |
| MANet | 79.41 | 62.23 | 25.96 | 69.21 |
| SSGNet | 78.65 | 60.27 | 21.48 | 67.74 |
| PBNet | 78.78 | 61.00 | 26.51 | 68.73 |
| MFANet | 78.07 | 61.74 | 23.85 | 67.81 |
| PAFNet | 77.75 | 61.60 | 25.14 | 67.77 |
| CarNet | 77.71 | 55.82 | 10.92 | 64.94 |
| FPN | 75.22 | 56.99 | 7.38 | 62.55 |
| DeepLabV3+ | 74.37 | 58.53 | 7.46 | 62.04 |
| PSPNet | 61.81 | 36.62 | 2.58 | 50.04 |
| MixSegNet | 77.23 | 61.66 | 17.72 | 66.15 |
| Mix-Graph-CrackNet | 76.05 | 58.27 | 25.43 | 65.58 |
Bold denotes the maximum value of the specific metric.
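The weighted average in Table 4 can be verified directly from the per-dataset F1 scores and the test-set sizes given in the caption:

```python
# Per-dataset F1 (%) for DCDRNet and the test-set sizes from the Table 4 caption
f1 = {"DeepCrack": 81.12, "CrackForest": 63.11, "CrackTree260": 39.02}
n = {"DeepCrack": 237, "CrackForest": 24, "CrackTree260": 52}

weighted = sum(f1[d] * n[d] for d in f1) / sum(n.values())
print(round(weighted, 2))  # → 72.74, matching the DCDRNet row
```

Because DeepCrack contributes 237 of the 313 test samples, the weighted average is dominated by DeepCrack performance, which explains why methods weak on CrackTree260 still score above 60.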
Table 5. Ablation study on DeepCrack dataset.
| Configuration | P | C | D | Prec (%) | Recall (%) | F1 (%) | Params (M) |
|---|---|---|---|---|---|---|---|
| Base + P | ✓ |  |  | 68.42 | 65.18 | 66.76 | 21.81 |
| Base + C |  | ✓ |  | 72.35 | 68.92 | 70.59 | 4.20 |
| Base + D |  |  | ✓ | 70.18 | 67.44 | 68.78 | 3.87 |
| Base + P + C | ✓ | ✓ |  | 75.84 | 71.26 | 73.48 | 10.00 |
| Base + P + D | ✓ |  | ✓ | 74.29 | 70.85 | 72.53 | 21.21 |
| Base + C + D |  | ✓ | ✓ | 78.92 | 74.38 | 76.58 | 4.56 |
| Full (P + C + D) | ✓ | ✓ | ✓ | 88.13 | 78.47 | 81.12 | 11.23 |
Table 6. Ablation study on CrackForest dataset.
| Configuration | P | C | D | Prec (%) | Recall (%) | F1 (%) | Params (M) |
|---|---|---|---|---|---|---|---|
| Base + P | ✓ |  |  | 52.48 | 58.21 | 55.20 | 21.81 |
| Base + C |  | ✓ |  | 55.62 | 60.45 | 57.93 | 4.20 |
| Base + D |  |  | ✓ | 53.19 | 59.02 | 55.95 | 3.87 |
| Base + P + C | ✓ | ✓ |  | 58.74 | 62.38 | 60.50 | 10.00 |
| Base + P + D | ✓ |  | ✓ | 57.28 | 61.55 | 59.34 | 21.21 |
| Base + C + D |  | ✓ | ✓ | 61.45 | 63.82 | 62.61 | 4.56 |
| Full (P + C + D) | ✓ | ✓ | ✓ | 66.15 | 65.74 | 63.11 | 11.23 |
Table 7. Ablation study on CrackTree260 dataset.
| Configuration | P | C | D | Prec (%) | Recall (%) | F1 (%) | Params (M) |
|---|---|---|---|---|---|---|---|
| Base + P | ✓ |  |  | 18.52 | 24.66 | 21.15 | 21.81 |
| Base + C |  | ✓ |  | 22.14 | 28.92 | 25.08 | 4.20 |
| Base + D |  |  | ✓ | 19.88 | 26.54 | 22.73 | 3.87 |
| Base + P + C | ✓ | ✓ |  | 25.63 | 32.18 | 28.53 | 10.00 |
| Base + P + D | ✓ |  | ✓ | 24.15 | 30.76 | 27.06 | 21.21 |
| Base + C + D |  | ✓ | ✓ | 28.42 | 45.21 | 34.90 | 4.56 |
| Full (P + C + D) | ✓ | ✓ | ✓ | 31.05 | 54.40 | 39.02 | 11.23 |
Table 8. Model complexity and efficiency comparison.
| Method | Params (M) | Size (MB) | FPS | Weighted Avg F1 (%) | F1/Params |
|---|---|---|---|---|---|
| DCDRNet (Ours) | 11.23 | 42.84 | 197.63 | **72.74** | **6.48** |
| SegNet | 30.41 | 116.02 | 494.54 | 70.11 | 2.31 |
| TransUNet | 105.28 | 401.60 | 106.13 | 70.60 | 0.67 |
| U-Net | 32.51 | 124.03 | 325.03 | 69.89 | 2.15 |
| LinkNet | 31.17 | 118.91 | 307.69 | 69.55 | 2.23 |
| UNet++ | 48.98 | 186.84 | 205.93 | 69.36 | 1.42 |
| MANet | 147.43 | 562.42 | 204.85 | 69.21 | 0.47 |
| PBNet | 29.86 | 113.92 | 259.33 | 68.73 | 2.30 |
| MFANet | 31.08 | 118.58 | 239.14 | 67.81 | 2.18 |
| PAFNet | 23.60 | 90.04 | 287.32 | 67.77 | 2.87 |
| SSGNet | 31.87 | 121.58 | 213.89 | 67.74 | 2.13 |
| FPHBN | 28.03 | 106.94 | 288.64 | 67.64 | 2.41 |
| CarNet | 33.99 | 129.66 | 228.01 | 64.94 | 1.91 |
| FPN | 26.11 | 99.60 | 310.06 | 62.55 | 2.40 |
| DeepLabV3+ | 26.67 | 101.74 | 331.5 | 62.04 | 2.33 |
| PSPNet | 24.30 | 92.69 | **651.69** | 50.04 | 2.06 |
| MixSegNet | 31.65 | 120.76 | 233.30 | 66.15 | 2.09 |
| Mix-Graph-CrackNet | 18.44 | 70.35 | 95.00 | 65.58 | 3.56 |
Bold denotes the maximum value of the specific metric.
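The F1/Params column is simply the weighted average F1 divided by the parameter count in millions; illustrated with the DCDRNet row of Table 8:

```python
# accuracy-per-parameter figure of merit, DCDRNet values from Table 8
weighted_f1 = 72.74  # weighted average F1 (%)
params_m = 11.23     # parameter count (millions)

efficiency = weighted_f1 / params_m
print(round(efficiency, 2))  # → 6.48, the tabulated F1/Params value
```

By this metric DCDRNet is roughly 2.3× more parameter-efficient than the next-best method (Mix-Graph-CrackNet at 3.56), reflecting the compact dual-stream design.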

Share and Cite

MDPI and ACS Style

Huang, R.; Feng, M.; Hu, Y. DCDRNet: Detail–Context Decoupled Representation Learning Network for Efficient Crack Segmentation. Algorithms 2026, 19, 219. https://doi.org/10.3390/a19030219
