CONTI-CrackNet: A Continuity-Aware State-Space Network for Crack Segmentation

1 School of Information Science and Technology, Nantong University, Nantong 226019, China
2 School of Artificial Intelligence and Computer Science, Nantong University, Nantong 226019, China
3 School of Transportation and Civil Engineering, Nantong University, Nantong 226019, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(22), 6865; https://doi.org/10.3390/s25226865
Submission received: 3 October 2025 / Revised: 7 November 2025 / Accepted: 8 November 2025 / Published: 10 November 2025
(This article belongs to the Section Sensor Networks)

Abstract

Crack segmentation in cluttered scenes with slender and irregular patterns remains difficult, and practical systems must balance accuracy and efficiency. We present CONTI-CrackNet, a lightweight visual state-space network that integrates a Multi-Directional Selective Scanning Strategy (MD3S). MD3S performs bidirectional scanning along the horizontal, vertical, and diagonal directions and fuses the complementary paths with a Bidirectional Gated Fusion (BiGF) module to strengthen global continuity. To preserve fine details while completing global texture, we propose a Dual-Branch Pixel-Level Global–Local Fusion (DBPGL) module that incorporates a Pixel-Adaptive Pooling (PAP) mechanism to dynamically weight max-pooled and average-pooled responses. Evaluated on two public benchmarks, the proposed method achieves an F1 score (F1) of 0.8332 and a mean Intersection over Union (mIoU) of 0.8436 on the TUT dataset and an mIoU of 0.7760 on the CRACK500 dataset, surpassing competitive Convolutional Neural Network (CNN), Transformer, and Mamba baselines. With 512 × 512 input, the model requires 24.22 G floating point operations (GFLOPs), has 6.01 M parameters (Params), and runs at 42 frames per second (FPS) on an RTX 3090 GPU, delivering a favorable accuracy–efficiency balance. These results show that CONTI-CrackNet improves continuity and edge recovery for thin cracks while remaining lightweight in both parameter count and computational cost.

1. Introduction

Cracks are among the most common and harmful defects [1]. They weaken the load-bearing capacity and durability of structures, increase maintenance costs, and pose public-safety risks. Cracks manifest in diverse materials and structures, such as pavements, bricks, and tiles [2]. Owing to irregular shapes, large-scale variations, and complex textures, accurate crack recognition remains challenging. Traditional crack detection still relies on manual inspection [3]. This process is slow, costly, and subjective, which often leads to missed cases and inconsistent results and cannot meet large-scale application needs. Early segmentation methods mainly used digital image processing, such as edge detection, clustering, thresholding, and morphological operations [4,5,6,7], but their performance is limited in the presence of noise.
To overcome these limitations, researchers introduced Convolutional Neural Networks (CNNs) for crack segmentation [8]. Although prior studies expanded the effective receptive field through pyramid pooling [9], employed deformable convolutions to adaptively focus on relevant regions [10], and incorporated lightweight attention in SegNeXt [11] to highlight local features while reducing computational overhead, CNNs remain constrained by local convolutions [12], which limits the modeling of long-range dependencies and complex semantic relations. In view of these shortcomings, Transformer-based methods have been applied to segmentation tasks, including hierarchical encoders coupled with Multi-layer Perceptron (MLP) decoders for cross-level feature aggregation [13] and CNN–Transformer dual-path architectures that exploit complementary strengths [14]. However, Transformers typically involve large parameter scales and substantial computational cost [15], which restricts deployment on resource-constrained devices. Moreover, MaskSup [16] performs context modeling for image segmentation via random masking, thereby improving segmentation performance without incurring any additional inference overhead.
To retain both global and local modeling while controlling computational cost, researchers have begun exploring state-space models for vision tasks [17]. Mamba is an efficient model in this class, employing selective state spaces to model long sequences in linear time [18] and thereby delivering strong global-context capture with high computational efficiency. Building on this idea, many studies have extended Mamba to computer vision. For example, VMamba [19] divides an image into a sequence of patches and employs Visual State-Space (VSS) blocks for multi-directional scanning to better capture complex inter-patch dependencies. PlainMamba [20] removes the hierarchical structure and adopts continuous 2D scanning, simplifying the network and improving inference efficiency. MambaIR [21] incorporates local enhancement and channel attention and demonstrates improvements in image restoration. CSMamba [22] combines a CNN encoder with a Mamba decoder for remote-sensing image segmentation.
Overall, these studies indicate that Mamba offers clear advantages for global-information modeling. However, despite progress in image restoration and remote sensing [23,24,25], in-depth investigations of crack segmentation remain limited. A few studies have introduced Mamba into crack segmentation. For example, CrackMamba [26] adopts a Mamba-based encoder and a Snake-scan mechanism to strengthen global modeling; MambaCrackNet [27] incorporates novel visual Mamba blocks, effectively capturing global semantic relations while reducing computational cost. Cracks exhibit a slender, irregular morphology with fuzzy boundaries, which requires models to preserve spatial continuity while balancing pixel-level local details and global texture. At present, crack segmentation primarily relies on architectures based on CNNs or Transformers [28,29,30]. The former (e.g., SFIAN [31]) excels at detailed feature extraction, whereas the latter (e.g., CT-crackseg [32]) achieves gains in boundary modeling. Yet CNNs are limited by the locality of convolutions, whereas Transformers incur high computational cost due to attention [33]. Consequently, achieving a favorable balance between segmentation quality and computational efficiency remains an open problem.
To address the above challenges, we propose CONTI-CrackNet. Our main contributions are outlined below:
  • CONTI-CrackNet architecture: a cascaded and lightweight crack segmentation network that effectively represents complex crack morphology while markedly reducing computation.
  • Multi-Directional Selective Scanning Strategy (MD3S): efficient long-range modeling that characterizes crack continuity and shape from multiple directions, combined with a Bidirectional Gated Fusion (BiGF) module to alleviate directional bias.
  • Dual-Branch Pixel-Level Global–Local Fusion (DBPGL) module: a Pixel-Adaptive Pooling (PAP) mechanism that balances max-pooled and average-pooled features at each pixel, preserving edge fidelity while improving global connectivity.

2. Methods

2.1. Network Architecture

This paper proposes a new crack-segmentation model, CONTI-CrackNet; its overall structure is shown in Figure 1a. The model addresses multiple crack properties: it enhances shape modeling, captures global information about crack continuity, and preserves the local details of fine cracks. Building on this design, the model employs a Multi-Directional Selective Scanning Strategy (MD3S), a Dual-Branch Pixel-Level Global–Local Fusion (DBPGL) module, and a Pixel-Adaptive Pooling (PAP) mechanism to achieve fine-grained segmentation in complex scenes.
The input is an RGB image of shape (3, H, W), where 3 denotes the channel count and H and W denote the image height and width, respectively. We first partition the image into N = (H/h) · (W/w) patches, where h and w are the height and width of each patch, and add positional encodings to each patch to preserve spatial information. The patch sequence is then fed into a Structure-Strengthened Visual State-Space (SSVSS) block for feature modeling. As shown in Figure 1b, the SSVSS block first applies a Gated Bottleneck Convolution (GBC) [34] module to extract detailed features and then splits the stream into two branches. The lower branch expands the channels via a linear layer and applies SiLU. The upper branch applies a linear transform and activation and then introduces MD3S to better capture the crack shape and structure. The outputs of the two branches are fused by element-wise multiplication and projected back to the original channel size via a linear layer.
To improve both local detail and global semantics, we further design DBPGL. The MD3S-processed features and the corresponding original features are fed into this module. This module employs PAP to fuse max-pooling and average-pooling features at the pixel level, thereby enhancing both global and local representations. Residual connections are added in the SSVSS block to preserve original details during fusion and mitigate gradient vanishing. During feature extraction, the network produces four feature maps at different scales: low-level maps primarily encode spatial details, whereas high-level maps carry richer semantic information. These features gradually integrate spatial structure and contextual information. Afterward, an MLP aligns the four scales to a common channel width, and dynamic upsampling [35] restores them to the original image resolution. Finally, all feature maps are concatenated and further processed by a convolutional layer and an MLP to yield the final crack-segmentation map.
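To make the data flow concrete, the following PyTorch-style sketch outlines one simplified SSVSS block under stated assumptions: `gbc` and `md3s` are placeholders for the Gated Bottleneck Convolution and the multi-directional scanning module described above, and the expansion ratio, default layers, and class name are illustrative rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SSVSSBlock(nn.Module):
    """Sketch of an SSVSS block: GBC -> two branches -> gated merge -> residual.

    `gbc` and `md3s` are stand-ins for the Gated Bottleneck Convolution and the
    Multi-Directional Selective Scanning Strategy; hyperparameters are illustrative.
    """

    def __init__(self, dim: int, expand: int = 2,
                 gbc: nn.Module = None, md3s: nn.Module = None):
        super().__init__()
        hidden = dim * expand
        # depthwise convolution as a simple placeholder for the GBC module
        self.gbc = gbc if gbc is not None else nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.norm = nn.LayerNorm(dim)
        # lower branch: channel expansion + SiLU (acts as a gate)
        self.lower = nn.Sequential(nn.Linear(dim, hidden), nn.SiLU())
        # upper branch: linear + activation, then multi-directional scanning
        self.upper_in = nn.Sequential(nn.Linear(dim, hidden), nn.SiLU())
        self.md3s = md3s if md3s is not None else nn.Identity()
        self.proj_out = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        x = self.gbc(x)                                     # local detail extraction
        b, c, h, w = x.shape
        tokens = self.norm(x.flatten(2).transpose(1, 2))    # (B, N, C), N = H * W
        upper = self.md3s(self.upper_in(tokens))            # direction-aware global features
        lower = self.lower(tokens)                          # gating branch
        fused = self.proj_out(upper * lower)                # element-wise fusion + projection
        out = fused.transpose(1, 2).reshape(b, c, h, w)
        return out + x                                      # residual connection
```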

2.2. Multi-Directional Selective Scanning Strategy (MD3S)

In recent years, two-dimensional scanning strategies have demonstrated strong feature-modeling capability in image analysis tasks. However, the common scanning modes in visual Mamba models, namely parallel, diagonal, Z-order, and Hilbert scans [36,37], each follow a single strategy (Figure 2). Parallel scans better capture long-range dependencies along the horizontal and vertical directions; diagonal and Z-order scans are more suitable for modeling global information along diagonal paths; and Hilbert scans preserve spatial continuity to some extent but are weaker in capturing global information. Thus, single-strategy modes are limited when handling complex structures and cannot fully represent slender, irregular cracks across multiple directions [38].
To address this limitation, we propose the Multi-Directional Selective Scanning Strategy (MD3S). MD3S scans along four directions: horizontal, vertical, main-diagonal, and anti-diagonal. It performs forward and backward passes in each direction to obtain four bidirectional feature sequences (Figure 3a). To reduce redundancy and emphasize complementary information between the bidirectional sequences [39], we introduce a Bidirectional Gated Fusion (BiGF) module (Figure 3b), which adaptively assigns weights to the forward and backward paths and fuses them:
$$\sigma = \mathrm{SiLU}\big(\mathrm{Linear}(\mathrm{Norm}(T_{t-1}))\big)$$
$$T_{t} = \mathrm{Linear}\big(\sigma \cdot Y_{\mathrm{forward}} + (1-\sigma) \cdot Y_{\mathrm{backward}}\big) + T_{t-1}$$
where $T_{t-1}$ is the original feature and $T_{t}$ is the output feature; Norm denotes normalization, Linear denotes a linear projection, and SiLU is the activation function; $Y_{\mathrm{forward}}$ and $Y_{\mathrm{backward}}$ denote the features from the forward and backward paths, respectively.
After the gated fusion, the four fused sequences are passed to the S6 block [18]. A subsequent scan–merge operation concatenates the sequences from different directions and maps them back to the original spatial resolution, yielding a representation that preserves multi-directional structural awareness and global-context modeling.
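A minimal sketch of how the four bidirectional scans and the BiGF gate of the equations above could be organized is given below, assuming features arrive as a flattened token sequence of known height and width. The diagonal index orderings, the shared `s6` placeholder for the selective state-space block, and all layer choices are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

def directional_orders(h: int, w: int):
    """Index orders for horizontal, vertical, main-diagonal, and anti-diagonal
    traversals of an h x w grid (reversing each order gives the backward scan)."""
    ii, jj = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    ii, jj = ii.flatten(), jj.flatten()
    horiz = torch.argsort(ii * w + jj)                    # row-major order
    vert = torch.argsort(jj * h + ii)                     # column-major order
    main_diag = torch.argsort((jj - ii) * (h + w) + ii)   # diagonals parallel to the main diagonal
    anti_diag = torch.argsort((ii + jj) * (h + w) + ii)   # anti-diagonals
    return [horiz, vert, main_diag, anti_diag]

class BiGF(nn.Module):
    """Bidirectional Gated Fusion following the two equations above (sketch)."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm, self.gate = nn.LayerNorm(dim), nn.Linear(dim, dim)
        self.proj, self.act = nn.Linear(dim, dim), nn.SiLU()

    def forward(self, t_prev, y_fwd, y_bwd):
        sigma = self.act(self.gate(self.norm(t_prev)))             # sigma = SiLU(Linear(Norm(T_{t-1})))
        return self.proj(sigma * y_fwd + (1 - sigma) * y_bwd) + t_prev

class MD3S(nn.Module):
    """Four bidirectional directional scans, each fused by BiGF, then merged (sketch).
    `s6` stands in for the selective state-space (S6) block and defaults to identity."""
    def __init__(self, dim: int, s6: nn.Module = None):
        super().__init__()
        self.bigf = nn.ModuleList([BiGF(dim) for _ in range(4)])
        self.s6 = s6 if s6 is not None else nn.Identity()
        self.merge = nn.Linear(4 * dim, dim)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (B, N, C) with N = h * w
        outputs = []
        for order, bigf in zip(directional_orders(h, w), self.bigf):
            order = order.to(tokens.device)
            seq = tokens[:, order, :]                              # forward traversal
            y_fwd = self.s6(seq)
            y_bwd = self.s6(seq.flip(dims=[1])).flip(dims=[1])     # backward traversal
            fused = bigf(seq, y_fwd, y_bwd)                        # per-direction gated fusion
            outputs.append(fused[:, torch.argsort(order), :])      # restore original token order
        return self.merge(torch.cat(outputs, dim=-1))              # scan-merge across directions
```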

2.3. Dual-Branch Pixel-Level Global–Local Fusion (DBPGL) Module

To preserve fine-grained details from the original sequence while fusing the crack texture patterns produced by MD3S, we propose the DBPGL module. DBPGL incorporates a Pixel-Adaptive Pooling (PAP) mechanism within a dual-branch design. It performs dynamic weighting at the pixel level, thereby strengthening the SSVSS module’s capacity to model and learn crack features.
As shown in Figure 4a, DBPGL fuses two inputs using a dual-branch structure, focusing on local detail preservation and global texture completion. In the left branch, the input feature passes through a 1 × 1 convolution for channel reduction, followed by ReLU to introduce nonlinearity and enhance representation. A 1 × 1 convolution then restores the original channel number, and batch normalization is applied to improve training stability and generalization. The right branch employs PAP to complement global structural information related to crack continuity. Finally, the outputs of the two branches are fused: a sigmoid function produces element-wise attention weights that adaptively assign importance to the two branches. The fusion can be written as the equations below.
$$X_{l} = \mathrm{BN}\big(\mathrm{Conv}(\mathrm{ReLU}(\mathrm{Conv}(X)))\big)$$
$$Y_{g} = \mathrm{BN}\big(\mathrm{Conv}(\mathrm{ReLU}(\mathrm{Conv}(Y)))\big)$$
$$Z_{1} = \mathrm{Sigmoid}(X_{l} + Y_{g}) \cdot X + \big(1 - \mathrm{Sigmoid}(X_{l} + Y_{g})\big) \cdot Y$$
where $X$ and $Y$ denote input a and input b; Conv denotes the 1 × 1 convolution; ReLU and Sigmoid denote the activation functions; BN denotes batch normalization. $X_{l}$ denotes the output of the left branch of the module, $Y_{g}$ represents the output of the right branch, and $Z_{1}$ is the fused output obtained by weighting and combining the features from both branches.
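The following sketch instantiates the three equations above in PyTorch. The channel-reduction ratio of the first 1 × 1 convolution is an assumed value, and the PAP processing of the right-branch input (Section 2.4) is abstracted away so that only the fusion logic producing $Z_1$ is shown.

```python
import torch
import torch.nn as nn

class DBPGL(nn.Module):
    """Dual-Branch Pixel-level Global-Local fusion following the equations above (sketch).

    Both branches share the Conv-ReLU-Conv-BN pattern; `reduction` controls the
    channel bottleneck of the first 1x1 convolution and is an illustrative choice.
    """
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(channels, channels // reduction, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, kernel_size=1),
                nn.BatchNorm2d(channels),
            )
        self.left = branch()    # local-detail branch (input a)
        self.right = branch()   # global-texture branch (input b)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        x_l = self.left(x)
        y_g = self.right(y)
        gate = torch.sigmoid(x_l + y_g)          # element-wise attention weights
        return gate * x + (1.0 - gate) * y       # Z1: weighted combination of both inputs
```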

2.4. Pixel-Adaptive Pooling (PAP) Mechanism

In feature extraction, max pooling and average pooling are often used together because they are complementary. Max pooling focuses on high-response regions, whereas average pooling integrates global background information and yields a balanced representation [40]. To leverage both, we design a pixel-level adaptive fusion strategy within PAP, as illustrated in Figure 4b. Specifically, the input features X and Y are split into two paths and processed by max pooling and average pooling to obtain complementary spatial features. During two-branch feature fusion, a fixed rule cannot adapt to crack structures at the pixel scale. We introduce the Pixel Attention Mechanism (PAM) [41] to enable adaptive, pixel-level fusion. It assigns weights to the two complementary branches at each spatial location, preserving fine details and maintaining global connectivity. This mechanism applies a 1 × 1 convolution followed by batch normalization to each path, performs element-wise multiplication for interaction, and then applies another 1 × 1 convolution and batch normalization. The output is passed through a sigmoid to produce a gating coefficient σ , which encodes the dynamic preference between the two branches. Finally, the pixel attention weights are applied to the two features. Fusion is completed by weighted multiplication and element-wise addition, which is formulated as shown below:
$$\sigma = \mathrm{Sigmoid}\Big(\mathrm{BNConv}\big(\mathrm{BNConv}(V_{a}) \cdot \mathrm{BNConv}(V_{b})\big)\Big)$$
$$Z_{2} = \sigma \cdot X_{\max} + (1 - \sigma) \cdot Y_{\mathrm{avg}}$$
where $V_{a}$ and $V_{b}$ are pixels from the two-branch feature maps; BNConv denotes a 1 × 1 convolution followed by batch normalization, i.e., $\mathrm{BNConv}(\cdot) = \mathrm{BN}(\mathrm{Conv}(\cdot))$; Sigmoid is the activation function; $X_{\max}$ and $Y_{\mathrm{avg}}$ are the feature maps produced by max pooling and average pooling, respectively; and $Z_{2}$ is the fused output obtained by weighting and combining the features from both branches. With this design, the network fuses important local features with global context and can adapt, at the pixel level, the contribution of different spatial locations, which improves the modeling of complex crack structures.
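A possible realization of PAP is sketched below. The 3 × 3, stride-1 pooling that preserves spatial resolution is an assumption made so that the gate stays pixel-wise; the BNConv-multiply-BNConv-sigmoid gating follows the equations above.

```python
import torch
import torch.nn as nn

class PAP(nn.Module):
    """Pixel-Adaptive Pooling (sketch): fuses max-pooled and average-pooled features
    with a pixel-wise gate, following the two equations above."""
    def __init__(self, channels: int):
        super().__init__()
        def bnconv():
            return nn.Sequential(nn.Conv2d(channels, channels, kernel_size=1),
                                 nn.BatchNorm2d(channels))
        # stride-1 pooling keeps the spatial size so the gate remains per pixel (assumption)
        self.max_pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.avg_pool = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)
        self.bnconv_a = bnconv()
        self.bnconv_b = bnconv()
        self.bnconv_out = bnconv()

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        x_max = self.max_pool(x)                                    # high-response local features
        y_avg = self.avg_pool(y)                                    # smoothed global-context features
        interact = self.bnconv_a(x_max) * self.bnconv_b(y_avg)      # element-wise interaction
        sigma = torch.sigmoid(self.bnconv_out(interact))            # pixel-wise gating coefficient
        return sigma * x_max + (1.0 - sigma) * y_avg                # Z2
```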

3. Results

3.1. Datasets

We validate our approach on two public benchmark datasets for crack segmentation: TUT [42] and CRACK500 [43]. TUT covers diverse scenes with complex backgrounds, which helps assess the model’s robustness and generalization across scenes and under strong noise. CRACK500 is a standard pavement crack benchmark that enables fair comparison with prior methods. Table 1 reports the main characteristics and split schemes of the datasets, while Figure 5 presents representative images and the corresponding ground truth masks.
  • TUT [42]: Unlike datasets with simple backgrounds, TUT contains dense and cluttered scenes with diverse crack shapes. It includes 1408 RGB images from eight representative scenes: bitumen, cement, bricks, runways, tiles, metal, generator blades, and underground pipelines. Of these, 1270 images were collected in-house using mobile phones, and 138 images were sourced from the internet.
  • CRACK500 [43]: Proposed by Yang et al., the original dataset contains 500 bitumen crack images at 2000 × 1500 resolution, which were all captured by mobile phones. Because the dataset is small, images are cropped into 16 nonoverlapping patches, and only samples with more than 1000 crack pixels are retained. After this processing, each patch has a resolution of 640 × 320. Data augmentation increases the total to 3368 images, and each sample is paired with a per-pixel binary mask.

3.2. Evaluation Metrics

To comprehensively evaluate the proposed segmentation model, we use several common metrics: Optimal Dataset Scale (ODS), Optimal Image Scale (OIS), Precision (P), Recall (R), F1 score (F1), and mean Intersection over Union (mIoU). The ODS evaluates the model on the whole dataset under a single fixed threshold m. The OIS evaluates performance when an optimal threshold n is chosen for each image. The F1, as the harmonic mean of P and R, balances both and is widely used to assess overall robustness and reliability in crack segmentation. In addition, mIoU quantifies spatial accuracy by the overlap between the predicted mask and the ground-truth annotation. They are defined as shown below:
$$\mathrm{ODS} = \max_{m}\left\{\frac{2 \cdot P_{m} \cdot R_{m}}{P_{m} + R_{m}}\right\}$$
$$\mathrm{OIS} = \frac{1}{N}\sum_{i=1}^{N}\max_{n}\left\{\frac{2 \cdot P_{n,i} \cdot R_{n,i}}{P_{n,i} + R_{n,i}}\right\}$$
$$\mathrm{F1} = \frac{2 \cdot P \cdot R}{P + R}$$
$$\mathrm{mIoU} = \frac{1}{N+1}\sum_{i=0}^{N}\frac{p_{ii}}{\sum_{j=0}^{N} p_{ij} + \sum_{j=0}^{N} p_{ji} - p_{ii}}$$
where N in the OIS denotes the number of images, whereas N in the mIoU denotes the number of classes; i indexes the ground-truth class and j denotes the predicted class. Let $p_{ij}$ denote the number of pixels of ground-truth class i predicted as class j (thus, $p_{ii}$ are the true positives for class i). In this case, N = 1.
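For concreteness, the sketch below computes F1 and mIoU for a single binary prediction, matching the two-class (N = 1) setting described above; the array names and the epsilon guard are illustrative.

```python
import numpy as np

def binary_f1_miou(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    """F1 and mIoU for a binary crack mask (sketch); pred and gt are {0, 1} arrays.

    mIoU averages the IoU of the background (class 0) and crack (class 1) classes.
    """
    tp = np.logical_and(pred == 1, gt == 1).sum()
    fp = np.logical_and(pred == 1, gt == 0).sum()
    fn = np.logical_and(pred == 0, gt == 1).sum()
    tn = np.logical_and(pred == 0, gt == 0).sum()

    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)

    iou_crack = tp / (tp + fp + fn + eps)
    iou_background = tn / (tn + fp + fn + eps)
    miou = (iou_crack + iou_background) / 2.0
    return f1, miou
```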

3.3. Implementation Details and Training

The proposed network is implemented in PyTorch v1.13.1 and trained on a workstation equipped with an Intel Xeon Platinum 8336C CPU. AdamW is used with an initial learning rate of 5 × 10−4 and the PolyLR scheduler for dynamic adjustment. The weight decay is 0.01, and the random seed is 42. Training runs for 50 epochs. All input images are resized to 512 × 512 pixels, and the batch size is 8. After each epoch, performance on the validation set is measured, and the checkpoint with the best validation score is kept for later testing.
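A compact training-loop sketch matching these settings is shown below; `model`, `train_loader`, and `criterion` are assumed to be supplied by the caller, and the polynomial power of the scheduler and the per-epoch stepping are assumptions the text does not specify.

```python
import torch

def train(model, train_loader, criterion, epochs: int = 50):
    """Training-loop sketch mirroring Section 3.3; power=0.9 and per-epoch
    scheduler stepping are assumptions not stated in the text."""
    torch.manual_seed(42)
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.01)
    scheduler = torch.optim.lr_scheduler.PolynomialLR(optimizer, total_iters=epochs, power=0.9)
    for epoch in range(epochs):
        model.train()
        for images, masks in train_loader:          # 512 x 512 inputs, batch size 8
            optimizer.zero_grad()
            loss = criterion(model(images), masks)  # joint BCE + Dice loss (see below)
            loss.backward()
            optimizer.step()
        scheduler.step()
        # evaluate on the validation set here and keep the best checkpoint
```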
A joint loss that combines Binary Cross-Entropy (BCE) and Dice loss [44] is adopted. BCE measures per-pixel classification accuracy, while Dice focuses on the overlap between the predicted mask and the ground truth, improving coherence and completeness. The combined loss is shown below:
$$L = \alpha \cdot L_{\mathrm{BCE}} + (1 - \alpha) \cdot L_{\mathrm{Dice}}$$
where $L_{\mathrm{BCE}}$ is the BCE loss, $L_{\mathrm{Dice}}$ is the Dice loss, and $\alpha$ controls the weights of the two terms. As shown in Table 2, we conduct a sensitivity study of the loss-weight hyperparameter $\alpha$ on the TUT dataset; the results indicate that $\alpha$ = 0.2 achieves the best performance.
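A minimal implementation of this joint loss, assuming the network outputs logits and using a standard smoothed Dice term, is sketched below; it can serve as the `criterion` in the training-loop sketch above.

```python
import torch
import torch.nn as nn

class BCEDiceLoss(nn.Module):
    """Joint BCE + Dice loss from the equation above; alpha = 0.2 follows Table 2.
    The logits input and the Dice smoothing constant are implementation assumptions."""
    def __init__(self, alpha: float = 0.2, smooth: float = 1.0):
        super().__init__()
        self.alpha = alpha
        self.smooth = smooth
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        bce = self.bce(logits, target)
        prob = torch.sigmoid(logits)
        intersection = (prob * target).sum()
        dice = 1.0 - (2.0 * intersection + self.smooth) / (prob.sum() + target.sum() + self.smooth)
        return self.alpha * bce + (1.0 - self.alpha) * dice
```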

3.4. Experimental Results

We evaluate the proposed model against seven representative baseline networks: SFIAN [31], SegNeXt [11], Crackmer [14], SegFormer [13], CT-crackseg [32], CSMamba [22], and PlainMamba [20]. On two public datasets, we make a systematic comparison between these methods and our model. Specifically, SFIAN and SegNeXt are built on CNNs; Crackmer, SegFormer, and CT-crackseg adopt the Transformer architecture; CSMamba and PlainMamba follow the Mamba framework.

3.4.1. Experimental Results on Dataset TUT

On the TUT dataset, CONTI-CrackNet achieves the best results across all metrics. As shown in Table 3, our model reaches an F1 of 0.8332 and an mIoU of 0.8436, outperforming architectures based on CNNs, Transformers, and Mamba. Compared with SegNeXt, F1 improves by 0.0815, indicating stronger modeling of thin and low-contrast cracks, with fewer breaks and missed segments along long, slender structures. Compared with the best Transformer baseline, CT-crackseg, R increases by 0.0248, confirming an advantage in recovering crack pixels. Compared with strong Mamba-based baselines, the proposed method attains higher accuracy; relative to PlainMamba, mIoU increases by 0.0120, demonstrating clear competitiveness within the same architectural class. In addition, a standard-deviation analysis was conducted on the TUT dataset. The standard deviations of ODS, OIS, F1, P, R, and mIoU are 0.0114, 0.0106, 0.0065, 0.0043, 0.0102, and 0.0071, respectively. These results demonstrate that the proposed model performs well and exhibits strong engineering reliability.
Beyond the numbers, our visual results also show clear gains. As shown in Figure 6, in scenes with complex backgrounds and uneven lighting, SegNeXt and SegFormer tend to produce false cracks and false positives, while CONTI-CrackNet suppresses background noise. For very thin cracks or low contrast, CNN and Transformer methods often show broken or missed segments, but our method restores complete crack shapes. For long and curved cracks, PlainMamba still shows blurred edges and local breaks, while CONTI-CrackNet keeps global continuity and sharp boundaries. These improvements come from the MD3S, which models crack continuity, and the DBPGL module, which fuses global and local features. Together, they enable robust segmentation in complex scenes and demonstrate the effectiveness and practical value of our method.

3.4.2. Experimental Results on Dataset CRACK500

On the CRACK500 dataset, the results further verify the effectiveness of our method. As shown in Table 4, CONTI-CrackNet attains an mIoU of 0.7760, outperforming all compared methods on every metric. Compared with the second-best PlainMamba, our model improves mIoU by 0.0181. Against the CNN and Transformer baselines, the gains are larger, especially on F1 and mIoU, which shows stronger robustness in preserving overall crack structure and regional consistency. On CRACK500, the standard deviations for ODS, OIS, F1, P, R, and mIoU are 0.0057, 0.0102, 0.0091, 0.0146, 0.0052, and 0.0119, respectively. This finding demonstrates the consistency and reliability of the proposed model across diverse evaluation conditions.
The visual results in Figure 7 also show clear advantages. In scenes with complex backgrounds and strong texture noise, CNN and Transformer models such as SegNeXt and SegFormer often produce false positives or fake cracks, as seen in the first and fifth rows of Figure 7. When crack shapes are complex, Mamba methods can still show over-segmentation or blurry boundaries, as illustrated in the second row. In contrast, CONTI-CrackNet accurately extracts the main crack skeleton in most cases and maintains sharp edges and global continuity; even under high noise or low contrast, it avoids most false segmentations. Although CONTI-CrackNet does not perfectly recover every tiny crack, its results are closer to the ground truth and contain much less noise overall.

3.5. Ablation Study

To verify the effectiveness of the proposed modules, we conduct a systematic ablation study on the TUT dataset. Specifically, we replace, enable, or disable the proposed modules to systematically assess each module’s impact on segmentation performance. This design quantifies the contributions of each component and the proposed scanning strategy and further confirms the soundness and effectiveness of the approach.

3.5.1. Ablation Study of Components

We further conduct ablation studies on the modules introduced in SSVSS. As summarized in Table 5, the full configuration integrating MD3S, DBPGL, and PAP attains the best overall metrics, validating its effectiveness for crack segmentation. The baseline excludes all proposed modules and only employs the standard SS2D scanning [17] with element-wise addition (ElemAdd) for fusion. On this basis, introducing MD3S increases the mIoU to 0.8350, indicating that MD3S effectively captures directional continuity and multi-orientation morphology.
In the comparison of fusion strategies, both ElemAdd and SKNet [45] underperform DBPGL. With the addition of PAP, the metrics further improve to 0.8436, suggesting that the combination of DBPGL and PAP enables fine-grained, pixel-adaptive gating between the two branches. Moreover, the global–local dual-branch design better suits crack scenarios, where max-pooled and average-pooled features are complementarily fused at the pixel level, thereby preserving the connectivity of slender and discontinuous cracks while suppressing texture noise.
In summary, MD3S, DBPGL, and PAP exhibit clear divisions of labor and synergy: MD3S provides direction-aware global modeling and continuity completion; DBPGL together with PAP delivers complementary global–local fusion with pixel-level adaptive weighting to refine fusion decisions. Their joint effect leads to substantial performance gains in crack segmentation.

3.5.2. Ablation on Scanning Strategies

As shown in Table 6, we compare four directional scanning strategies under the same settings. In this experiment, we disable both DBPGL and PAP and perform feature fusion using only ElemAdd. Using parallel serpentine scanning (ParaSpn), with forward and backward traversals along the horizontal and vertical directions, the model achieves F1 = 0.8063. Using diagonal serpentine scanning (DiagSpn) [37], with forward and backward traversals along the main and anti-diagonal directions, the model attains F1 = 0.8077, which remains relatively low. Combining the two (ParaSpn + DiagSpn), employing both parallel and diagonal serpentine scans, raises the F1 to 0.8127 and the mIoU to 0.8260, indicating that multi-directional cues benefit crack segmentation. Furthermore, after introducing the proposed MD3S, the model achieves the best results across all metrics. Compared with the combination of ParaSpn and DiagSpn, F1 and P improve by 0.0141 and 0.0252, respectively. These results indicate that the proposed strategy captures multi-directional features and strengthens the modeling of complex crack structures. Compared with a single scanning strategy, our method builds forward–backward (bidirectional) sequences in four directions: horizontal, vertical, main diagonal, and anti-diagonal. It then performs direction-wise fusion, which lets the network learn pixel-level adaptive preferences. Compared with simple multi-direction stacking, the per-direction bidirectional fusion better fits the modeling needs of multi-scale and multi-directional geometric features and yields superior performance on crack segmentation.

3.5.3. Ablation Study of the Attention Mechanism in the PAP Module

To substantiate the necessity of the proposed Pixel Attention Mechanism (PAM), we conducted a controlled comparison under identical experimental settings (Table 7). In this study, MD3S and DBPGL were kept enabled, and only the weight-generation unit in the fusion stage was swapped between the Convolutional Block Attention Module (CBAM) [46] and PAM. The results show that the model with PAM delivers the best performance, indicating that pixel-level, adaptive gating weights better discriminate thin cracks from background textures than CBAM, effectively reducing false positives and breakages and achieving finer pixel-wise segmentation.

3.6. Complexity Analysis

Table 8 reports the comparison under a fixed input size of 512 × 512. Our method requires 24.22 G floating point operations (GFLOPs), has 6.01M parameters (Params), and runs at 42 frames per second (FPS). Compared with the lightweight Crackmer [14], our model is slightly heavier but delivers a clear accuracy gain, achieving a better accuracy–efficiency trade-off. At the same time, compared with larger and more complex networks such as CSMamba [22] and SegFormer [13], our method uses fewer GFLOPs and Params and offers much faster inference. In sum, CONTI-CrackNet provides high segmentation accuracy with low computational cost and fast runtime, making it suitable for research and practical deployment.

3.7. Analysis of Failure Cases and Limitations

To objectively evaluate the performance of CONTI-CrackNet, we analyze the failure cases observed in our experiments. Figure 8 illustrates two representative scenarios: extreme noisy backgrounds and complex crack intersections. Experiments indicate that in complex scenarios with strong noise and frequent crack intersections, the continuity of fine cracks remains disturbed. Fine cracks may not be fully recovered because complex backgrounds cause confusion between crack pixels and background textures, and intricate crack geometries tend to induce discontinuities at intersections. Although CONTI-CrackNet may exhibit under-segmentation in a few extreme cases, the proposed architecture improves segmentation accuracy under challenging conditions. In future work, we will further enhance the model to increase noise robustness and multi-scale handling, thereby improving the overall performance.

4. Discussion and Conclusions

We propose CONTI-CrackNet, a lightweight crack-segmentation network that improves pixel-level fine segmentation at low computational cost. Its performance stems from three aspects: (1) a cascaded lightweight backbone that captures crack morphology while reducing resource usage; (2) MD3S, which aggregates multi-directional information to model the global context of thin and irregular cracks; and (3) DBPGL with a PAP mechanism, which employs dual-branch pixel attention with max and average pooling to enhance local details and overall structural perception. On TUT and CRACK500, the model achieves superior or comparable accuracy with 24.22 GFLOPs and 6.01 M parameters while maintaining high inference speed. Ablation studies show that MD3S strengthens continuity, and DBPGL with PAP improves segmentation by coupling global dependencies with detail enhancement. Limitations remain: validation is conducted on two public datasets and focuses on static images. Future work will expand data diversity, improve small-crack segmentation, investigate crack-depth quantification, and assess pathways toward efficient edge execution.

Author Contributions

Conceptualization, W.S. and M.Z.; methodology, W.S.; software, W.S.; validation, W.S., M.Z. and X.X.; formal analysis, W.S.; investigation, W.S.; resources, W.S.; data curation, W.S.; writing—original draft preparation, W.S.; writing—review and editing, M.Z. and X.X.; visualization, W.S.; supervision, M.Z.; project administration, M.Z.; funding acquisition, M.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Nantong Natural Science Foundation of China (JC2023073).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. The TUT dataset can be found at https://github.com/Karl1109/CrackSCF, accessed on 2 July 2025, and the CRACK500 dataset is available at https://github.com/fyangneil/pavement-crack-detection, accessed on 2 July 2025.

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT (OpenAI, GPT-5, https://www.openai.com/chatgpt, accessed on 22 September 2025) for language editing. The authors reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wu, S.; Withers, P.J.; Beretta, S.; Kang, G. Editorial: Tomography traces the growing cracks and defects. Eng. Fract. Mech. 2023, 292, 109628. [Google Scholar] [CrossRef]
  2. Matarneh, S.; Elghaish, F.; Edwards, D.J.; Rahimian, F.P.; Abdellatef, E.; Ejohwomu, O. Automatic crack classification on asphalt pavement surfaces using convolutional neural networks and transfer learning. J. Inf. Technol. Constr. 2024, 29, 1239–1256. [Google Scholar] [CrossRef]
  3. Singh, V.; Baral, A.; Kumar, R.; Tummala, S.; Noori, M.; Yadav, S.V.; Kang, S.; Zhao, W. A Hybrid Deep Learning Model for Enhanced Structural Damage Detection: Integrating ResNet50, GoogLeNet, and Attention Mechanisms. Sensors 2024, 24, 7249. [Google Scholar] [CrossRef] [PubMed]
  4. Gao, J.; Gui, Y.; Ji, W.; Wen, J.; Zhou, Y.; Huang, X.; Wang, Q.; Wei, C.; Huang, Z.; Wang, C.; et al. EU-Net: A segmentation network based on semantic fusion and edge guidance for road crack images. Appl. Intell. 2024, 54, 12949–12963. [Google Scholar] [CrossRef]
  5. Zhou, Z.; Zhou, S.; Zheng, Y.; Yan, L.; Yang, H. Clustering and diagnosis of crack images of tunnel linings via graph neural networks. Georisk Assess. Manag. Risk Eng. Syst. Geohazards 2024, 18, 825–837. [Google Scholar] [CrossRef]
  6. Lei, Q.; Zhong, J.; Wang, C. Joint optimization of crack segmentation with an adaptive dynamic threshold module. IEEE Trans. Intell. Transp. Syst. 2024, 25, 6902–6916. [Google Scholar] [CrossRef]
  7. Liu, Y.; Yeoh, J.K.W. Robust pixel-wise concrete crack segmentation and properties retrieval using image patches. Autom. Constr. 2021, 123, 103535. [Google Scholar] [CrossRef]
  8. Yamaguchi, T.; Mizutani, T. Road crack detection interpreting background images by convolutional neural networks and a self-organizing map. Comput.-Aided Civ. Infrastruct. Eng. 2024, 39, 1616–1640. [Google Scholar] [CrossRef]
  9. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 22–25 July 2017; pp. 2881–2890. [Google Scholar] [CrossRef]
  10. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets V2: More Deformable, Better Results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9308–9316. [Google Scholar] [CrossRef]
  11. Guo, M.H.; Lu, C.Z.; Hou, Q.; Liu, Z.; Cheng, M.M.; Hu, S.M. SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022; Volume 35, pp. 1140–1156. [Google Scholar]
  12. Li, W.; Xue, L.; Wang, X.; Li, G. ConvTransNet: A CNN-Transformer Network for Change Detection with Multiscale Global-Local Representations. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5610315. [Google Scholar] [CrossRef]
  13. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. arXiv 2021, arXiv:2105.15203. [Google Scholar] [CrossRef]
  14. Wang, J.; Zeng, Z.; Sharma, P.K.; Alfarraj, O.; Tolba, A.; Zhang, J.; Wang, L. Dual-path network combining CNN and transformer for pavement crack segmentation. Autom. Constr. 2024, 158, 105217. [Google Scholar] [CrossRef]
  15. Shao, Z.; Wang, Z.; Yao, X.; Bell, M.G.H.; Gao, J. ST-MambaSync: Complement the power of Mamba and Transformer fusion for less computational cost in spatial–temporal traffic forecasting. Inf. Fusion 2025, 117, 102872. [Google Scholar] [CrossRef]
  16. Zunair, H.; Ben Hamza, A. Masked Supervised Learning for Semantic Segmentation. arXiv 2022, arXiv:2210.00923. [Google Scholar] [CrossRef]
  17. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model. arXiv 2024, arXiv:2401.09417. [Google Scholar] [CrossRef]
  18. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2024, arXiv:2312.00752. [Google Scholar]
  19. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. VMamba: Visual State Space Model. arXiv 2024, arXiv:2401.10166. [Google Scholar]
  20. Yang, C.; Chen, Z.; Espinosa, M.; Ericsson, L.; Wang, Z.; Liu, J.; Crowley, E.J. PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition. arXiv 2024, arXiv:2403.17695. [Google Scholar]
  21. Guo, H.; Li, J.; Dai, T.; Ouyang, Z.; Ren, X.; Xia, S.T. MambaIR: A Simple Baseline for Image Restoration with State-Space Model. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; pp. 222–241. [Google Scholar] [CrossRef]
  22. Liu, M.; Dan, J.; Lu, Z.; Yu, Y.; Li, Y.; Li, X. CM-UNet: Hybrid CNN-Mamba UNet for Remote Sensing Image Semantic Segmentation. arXiv 2024, arXiv:2405.10530. [Google Scholar]
  23. Luan, X.; Fan, H.; Wang, Q.; Yang, N.; Liu, S.; Li, X.; Tang, Y. FMambaIR: A Hybrid State-Space Model and Frequency Domain for Image Restoration. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4201614. [Google Scholar] [CrossRef]
  24. Li, B.; Zhao, H.; Wang, W.; Hu, P.; Gou, Y.; Peng, X. MaIR: A Locality- and Continuity-Preserving Mamba for Image Restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 10–17 June 2025; pp. 7491–7501. [Google Scholar] [CrossRef]
  25. Ding, H.; Xia, B.; Liu, W.; Zhang, Z.; Zhang, J.; Wang, X.; Xu, S. A Novel Mamba Architecture with a Semantic Transformer for Efficient Real-Time Remote Sensing Semantic Segmentation. Remote Sens. 2024, 16, 2620. [Google Scholar] [CrossRef]
  26. Zuo, X.; Sheng, Y.; Shen, J.; Shan, Y. Topology-aware Mamba for Crack Segmentation in Structures. Autom. Constr. 2024, 168, 105845. [Google Scholar] [CrossRef]
  27. Han, C.; Yang, H.; Yang, Y. Enhancing Pixel-Level Crack Segmentation with Visual Mamba and Convolutional Networks. Autom. Constr. 2024, 168, 105770. [Google Scholar] [CrossRef]
  28. Zhang, T.; Wang, D.; Lu, Y. ECSNet: An Accelerated Real-Time Image Segmentation CNN Architecture for Pavement Crack Detection. IEEE Trans. Intell. Transp. Syst. 2023, 24, 15105–15112. [Google Scholar] [CrossRef]
  29. Wang, W.; Su, C. Automatic concrete crack segmentation model based on transformer. Autom. Constr. 2022, 139, 104275. [Google Scholar] [CrossRef]
  30. Fan, Y.; Hu, Z.; Li, Q.; Sun, Y.; Chen, J.; Zhou, Q. CrackNet: A Hybrid Model for Crack Segmentation with Dynamic Loss Function. Sensors 2024, 24, 7134. [Google Scholar] [CrossRef]
  31. He, T.; Shi, F.; Zhao, M.; Yin, Y. A Lightweight Selective Feature Fusion and Irregular-Aware Network for Crack Detection Based on Federated Learning. In Proceedings of the International Conference on High Performance Big Data and Intelligent Systems (HDIS), Tianjin, China, 10–11 December 2022; pp. 294–298. [Google Scholar] [CrossRef]
  32. Tao, H.; Liu, B.; Cui, J.; Zhang, H. A Convolutional-Transformer Network for Crack Segmentation with Boundary Awareness. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Kuala Lumpur, Malaysia, 8–11 October 2023; pp. 86–90. [Google Scholar] [CrossRef]
  33. Rahman, M.M.; Tutul, A.A.; Nath, A.; Laishram, L.; Jung, S.K.; Hammond, T. Mamba in Vision: A Comprehensive Survey of Techniques and Applications. arXiv 2024, arXiv:2410.03105. [Google Scholar] [CrossRef]
  34. Liu, H.; Jia, C.; Shi, F.; Cheng, X.; Chen, S. SCSegamba: Lightweight Structure-Aware Vision Mamba for Crack Segmentation in Structures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 10–17 June 2025; pp. 29406–29416. [Google Scholar] [CrossRef]
  35. You, H.; Li, Z.; Wei, Z.; Zhang, L.; Bi, X.; Bi, C.; Li, X.; Duan, Y. A Blueberry Maturity Detection Method Integrating Attention-Driven Multi-Scale Feature Interaction and Dynamic Upsampling. Horticulturae 2025, 11, 600. [Google Scholar] [CrossRef]
  36. Zhu, Q.; Fang, Y.; Cai, Y.; Chen, C.; Fan, L. Rethinking Scanning Strategies With Vision Mamba in Semantic Segmentation of Remote Sensing Imagery: An Experimental Study. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 18223–18234. [Google Scholar] [CrossRef]
  37. Qu, H.; Ning, L.; An, R.; Fan, W.; Derr, T.; Liu, H.; Xu, X.; Li, Q. A Survey of Mamba. arXiv 2025, arXiv:2408.01129. [Google Scholar] [CrossRef]
  38. Zhao, S.; Chen, H.; Zhang, X.; Xiao, P.; Bai, L.; Ouyang, W. RS-Mamba for Large Remote Sensing Image Dense Prediction. arXiv 2024, arXiv:2404.02668. [Google Scholar] [CrossRef]
  39. Sun, H.; Li, S.; Zheng, X.; Lu, X. Remote Sensing Scene Classification by Gated Bidirectional Network. IEEE Trans. Geosci. Remote Sens. 2020, 58, 82–96. [Google Scholar] [CrossRef]
  40. Nirthika, R.; Manivannan, S.; Ramanan, A.; Wang, R. Pooling in convolutional neural networks for medical image analysis: A survey and an empirical study. Neural Comput. Appl. 2022, 34, 5321–5347. [Google Scholar] [CrossRef]
  41. Xiao, J.; Guo, H.; Yao, Y.; Zhang, S.; Zhou, J.; Jiang, Z. Multi-Scale Object Detection with the Pixel Attention Mechanism in a Complex Background. Remote Sens. 2022, 14, 3969. [Google Scholar] [CrossRef]
  42. Liu, H.; Jia, C.; Shi, F.; Cheng, X.; Wang, M.; Chen, S. CrackSCF: Lightweight Cascaded Fusion Network for Robust and Efficient Structural Crack Segmentation. arXiv 2025, arXiv:2408.12815. [Google Scholar]
  43. Yang, F.; Zhang, L.; Yu, S.; Prokhorov, D.; Mei, X.; Ling, H. Feature Pyramid and Hierarchical Boosting Network for Pavement Crack Detection. IEEE Trans. Intell. Transp. Syst. 2020, 21, 1525–1535. [Google Scholar] [CrossRef]
  44. Zhang, H.; Zhang, A.A.; Dong, Z.; He, A.; Liu, Y.; Zhan, Y.; Wang, K.C.P. Robust Semantic Segmentation for Automatic Crack Detection Within Pavement Images Using Multi-Mixing of Global Context and Local Image Features. IEEE Trans. Intell. Transp. Syst. 2024, 25, 11282–11303. [Google Scholar] [CrossRef]
  45. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective Kernel Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 510–519. [Google Scholar] [CrossRef]
  46. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar] [CrossRef]
Figure 1. Overall framework of CONTI-CrackNet: (a) CONTI-CrackNet, (b) SSVSS block.
Figure 2. Four common single-strategy scans: (a) parallel scans, (b) diagonal scans, (c) Z-order scans, (d) Hilbert scans.
Figure 3. Architecture of MD3S: (a) MD3S, (b) BiGF.
Figure 4. Structures of DBPGL and PAP: (a) DBPGL; (b) PAP.
Figure 5. Sample images and ground-truth masks from the datasets used in this study: (a) TUT, (b) CRACK500.
Figure 6. Typical visual comparison on the TUT dataset using four methods. Red boxes highlight key details; green boxes mark wrongly identified regions.
Figure 7. Typical visual comparison on the CRACK500 dataset using four methods. The red boxes highlight key details; the green boxes mark wrongly identified regions.
Figure 8. Visualization of CONTI-CrackNet failure cases. Yellow boxes indicate missed detections.
Table 1. Description of the experimental datasets.
Dataset | Resolution | Images | Training | Validation | Test
TUT [42] | 640 × 640 | 1408 | 986 | 141 | 281
CRACK500 [43] | 640 × 360 | 3368 | 2358 | 337 | 673
Table 2. Sensitivity analysis of α on the TUT dataset.
α | ODS | OIS | F1 | P | R | mIoU
0 | 0.8021 | 0.8059 | 0.8279 | 0.8176 | 0.8384 | 0.8368
0.1 | 0.8049 | 0.8116 | 0.8292 | 0.8185 | 0.8402 | 0.8371
0.2 | 0.8133 | 0.8165 | 0.8332 | 0.8220 | 0.8447 | 0.8436
0.3 | 0.8110 | 0.8123 | 0.8313 | 0.8196 | 0.8434 | 0.8420
0.5 | 0.8002 | 0.8017 | 0.8283 | 0.8201 | 0.8366 | 0.8359
Table 3. Comparison of segmentation results of different models on the TUT dataset.
Methods | ODS | OIS | F1 | P | R | mIoU
SFIAN (2022) [31] | 0.7290 | 0.7513 | 0.7473 | 0.7715 | 0.7247 | 0.7756
SegNeXt (2022) [11] | 0.7312 | 0.7435 | 0.7517 | 0.7812 | 0.7245 | 0.7785
Crackmer (2024) [14] | 0.7429 | 0.7501 | 0.7578 | 0.7501 | 0.7656 | 0.7966
SegFormer (2021) [13] | 0.7532 | 0.7612 | 0.7670 | 0.7654 | 0.7688 | 0.8078
CT-crackseg (2023) [32] | 0.7940 | 0.7996 | 0.8199 | 0.8202 | 0.8195 | 0.8301
CSMamba (2024) [22] | 0.7879 | 0.7946 | 0.8146 | 0.7947 | 0.8353 | 0.8263
PlainMamba (2024) [20] | 0.7889 | 0.7954 | 0.8154 | 0.7955 | 0.8365 | 0.8316
Ours | 0.8133 | 0.8165 | 0.8332 | 0.8220 | 0.8447 | 0.8436
Table 4. Comparison of segmentation results of different models on the CRACK500 dataset.
Methods | ODS | OIS | F1 | P | R | mIoU
SFIAN (2022) [31] | 0.6473 | 0.6941 | 0.7204 | 0.6983 | 0.7441 | 0.7315
SegNeXt (2022) [11] | 0.6488 | 0.6762 | 0.7334 | 0.7134 | 0.7546 | 0.7345
Crackmer (2024) [14] | 0.6933 | 0.7097 | 0.7267 | 0.6985 | 0.7572 | 0.7421
SegFormer (2021) [13] | 0.6998 | 0.7134 | 0.7245 | 0.7067 | 0.7434 | 0.7456
CT-crackseg (2023) [32] | 0.6941 | 0.7059 | 0.7322 | 0.6940 | 0.7748 | 0.7591
CSMamba (2024) [22] | 0.6931 | 0.7162 | 0.7315 | 0.6858 | 0.7823 | 0.7592
PlainMamba (2024) [20] | 0.6574 | 0.6870 | 0.7422 | 0.7318 | 0.7530 | 0.7579
Ours | 0.7104 | 0.7301 | 0.7587 | 0.7333 | 0.7860 | 0.7760
Table 5. Ablation validating the effectiveness of the proposed modules.
MD3S | ElemAdd | SKNet | DBPGL | PAP | ODS | OIS | F1 | P | R | mIoU
– | ✓ | – | – | – | 0.7845 | 0.7921 | 0.8049 | 0.7945 | 0.8156 | 0.8078
✓ | ✓ | – | – | – | 0.8008 | 0.8074 | 0.8268 | 0.8161 | 0.8377 | 0.8350
✓ | – | ✓ | – | – | 0.7944 | 0.8064 | 0.8200 | 0.8123 | 0.8279 | 0.8268
✓ | – | – | ✓ | – | 0.8094 | 0.8168 | 0.8297 | 0.8142 | 0.8459 | 0.8406
✓ | – | – | ✓ | ✓ | 0.8133 | 0.8165 | 0.8332 | 0.8220 | 0.8447 | 0.8436
Table 6. Ablation validating the effectiveness of the proposed MD3S.
Method | ODS | OIS | F1 | P | R | mIoU
ParaSpn | 0.7880 | 0.7926 | 0.8063 | 0.7898 | 0.8236 | 0.8041
DiagSpn | 0.7897 | 0.7979 | 0.8077 | 0.7891 | 0.8273 | 0.8059
ParaSpn + DiagSpn | 0.7970 | 0.8071 | 0.8127 | 0.7909 | 0.8359 | 0.8260
MD3S | 0.8008 | 0.8074 | 0.8268 | 0.8161 | 0.8377 | 0.8350
Table 7. Ablation validating the effectiveness of the PAM.
Methods | ODS | OIS | F1 | P | R | mIoU
CBAM | 0.8016 | 0.8023 | 0.8250 | 0.8245 | 0.8256 | 0.8277
PAM | 0.8133 | 0.8165 | 0.8332 | 0.8220 | 0.8447 | 0.8436
Table 8. Comparison of CONTI-CrackNet and other methods in Params, GFLOPs, and FPS.
Methods | GFLOPs | Params | FPS
SFIAN (2022) [31] | 84.57 G | 13.63 M | 32
SegNeXt (2022) [11] | 31.80 G | 27.52 M | 22
Crackmer (2024) [14] | 14.94 G | 5.90 M | 33
SegFormer (2021) [13] | 30.80 G | 28.20 M | 21
CT-crackseg (2023) [32] | 39.47 G | 22.88 M | 28
CSMamba (2024) [22] | 145.84 G | 35.95 M | 19
PlainMamba (2024) [20] | 73.36 G | 16.72 M | 33
Ours | 24.22 G | 6.01 M | 42
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
