DSS-MobileNetV3: An Efficient Dynamic-State-Space-Enhanced Network for Concrete Crack Segmentation

Li, Haibo; Cheng, Yong; Zhang, Qian; Chen, Lingkun

doi:10.3390/buildings15111905

¹ School of Information Technology, Jiangsu Open University, Nanjing 210036, China

² College of Architecture Science and Engineering, Yangzhou University, Yangzhou 225127, China

* Author to whom correspondence should be addressed.

Buildings 2025, 15(11), 1905; https://doi.org/10.3390/buildings15111905

Submission received: 13 April 2025 / Revised: 21 May 2025 / Accepted: 30 May 2025 / Published: 1 June 2025

(This article belongs to the Special Issue Structural Health Monitoring and Intelligent Operation Maintenance of Concrete and Steel Structures)

Abstract

Crack segmentation is crucial for health monitoring and preventive maintenance of concrete structures. However, the complex morphologies of cracks and the limited resources of mobile devices pose challenges for accurate and efficient segmentation. To address this, we propose an efficient dynamic-state-space-enhanced network termed DSS-MobileNetV3 for crack segmentation. DSS-MobileNetV3 adopts a U-shaped encoder–decoder architecture, and a dynamic-state-space (DSS) block is embedded in the encoder so that the MobileNetV3 bottleneck module can model global dependencies. By integrating dynamic snake convolution and a state space model, the DSS block improves the structural perception and global dependency modeling of MobileNetV3 for complex crack morphologies. The decoder uses upsampling and depthwise separable convolution to progressively decode features and efficiently restore the spatial resolution. In addition, to suppress complex noise in the image background and highlight crack textures, the strip pooling module is introduced into the skip connections between the encoder and decoder. Extensive experiments are conducted on three public crack datasets, and the proposed DSS-MobileNetV3 achieves state-of-the-art (SOTA) performance in both accuracy and efficiency.

Keywords: crack segmentation; MobileNetV3; dynamic state space; strip pooling

1. Introduction

Cracks are prevalent defects in concrete structures, typically exhibiting complex structural patterns such as irregular linear, mesh-like, or branching forms. These cracks not only compromise the mechanical performance of infrastructure but also precipitate further deterioration [1]. Consequently, efficient crack segmentation holds significant importance for preventive maintenance [2]. However, in real-world scenarios, uneven illumination, obstructions by debris, and background interference impede accurate crack segmentation. Additionally, the limited memory and computational resources of mobile devices pose challenges in efficient crack segmentation.

To address the above challenges, traditional methods utilize crack morphology, contrast with background pixels, and spatial clustering algorithms to extract cracks. For example, Landstrom et al. [3] develop an automatic crack detection system based on crack morphology and logistic regression statistical classification, which successfully extracts 80% of the crack length. A pavement crack segmentation method based on grayscale histograms and the Otsu thresholding method is proposed in [4]. This method searches for cracks based on the ratio between the Otsu threshold of the image and the maximum histogram value and is capable of efficiently extracting various types of cracks from different pavement images. These traditional methods can efficiently extract prominent cracks. However, for crack images with complex backgrounds, traditional methods struggle to accurately identify fine-grained cracks. Furthermore, the repeated parameter tuning process can easily cause the algorithm to become trapped in local optima.

With the emergence of deep learning technologies, CNN-based methods have alleviated the limitations of traditional algorithms in efficient crack segmentation, significantly improving accuracy. For instance, Xu et al. [5] propose an innovative deep learning framework, YOLO-DL, for detecting cracks in concrete. This method builds upon DeepLabv3+ [6] and incorporates attention mechanisms and calibration modules to achieve accurate crack segmentation. Guo et al. [7] incorporate the proposed Instance Normalization Wavelet (INW) layer into a deep model for crack segmentation. This model leverages prior knowledge from wavelets to capture crack features while simultaneously filtering out high-frequency noise. Liu et al. [8] propose an efficient deep learning model based on stereo vision for crack segmentation. By integrating the Semi-Global Block Matching (SGBM) algorithm, this method effectively segments the crack structures, laying a foundation for the subsequent quantitative analysis of crack length, width, and orientation angle. To improve the efficiency of CNNs, MobileNetV3 [9] builds upon MobileNetV1 [10] and MobileNetV2 [11] by introducing the Squeeze-and-Excitation module [12] and neural architecture search, enabling efficient and accurate semantic segmentation. These CNN-based methods have significant advantages in extracting cracks with localized variations. However, the limited receptive field of CNNs cannot accurately model the global dependencies of complex and long-spanned cracks, which are prone to fragmentation.

With the introduction of Vision Transformer (ViT) [13] in the Computer Vision (CV) field, a series of Transformer-based crack segmentation models have been proposed. Liu et al. [14] leverage the merit of Transformers in global relationship modeling and design a Crackformer for accurate crack detection. Shan et al. [15] designed a novel DCUFormer to address the bottleneck of existing Transformer-based methods in boundary delineation. By introducing a dual-cross attention module and an up-sampling attention module, the proposed DCUFormer effectively integrates low-level and high-level features, enabling it to refine boundary pixels. To improve accuracy, a new tunnel crack segmentation network, CGV-Net, has been proposed in [16]. This method integrates the advantages of CNN, Graph Neural Networks (GNNs) and ViT. By exchanging information between local features, CGV-Net effectively models the global structural patterns of cracks and achieves SOTA performance. However, as the core component of the Transformer-based methods, the self-attention mechanism requires significant computational resources, which poses a challenge for efficient segmentation tasks.

Recently, Mamba [17] has emerged as a competitor to Transformers due to its global modeling capability and computational efficiency. The core of Mamba is the state space model, which can model global dependencies with linear complexity. Inspired by this, we propose an improved MobileNetV3 based on the state space model and attention mechanism for efficient crack segmentation. Specifically, the proposed model adopts a U-shaped encoder–decoder architecture. For the encoder, we use the improved MobileNetV3 to extract hierarchical features by replacing the Squeeze-and-Excitation (SE) attention module with the proposed Dynamic State Space (DSS) block. The DSS block overcomes the imprecise crack feature activation of the SE attention module by combining dynamic snake convolution with a state space model, thereby achieving a more accurate representation of elongated and slender cracks. For the decoder, we use upsampling and depthwise separable convolutions to progressively restore spatial resolution. Considering the complex background in images captured by real-time devices, we introduce the strip pooling module to highlight cracks and suppress noise.

In summary, our contributions are as follows:

We propose an improved MobileNetV3 based on the state space model and attention mechanism for efficient crack segmentation, which efficiently integrates both local and global features of cracks.
The DSS block is designed based on dynamic snake convolution and state space model to effectively model the global dependence of crack curves and boost encoding performance.
Depthwise separable convolution and upsampling are utilized to progressively decode local textures, and the strip pooling module is incorporated to effectively fuse fine-grained crack features and filter out background-related noise.
Extensive experiments are conducted and the results show that our proposed model achieves the SOTA performance with fewer parameters and FLOPs.

2. Related Work

2.1. Crack Segmentation Algorithms

We classify existing crack segmentation algorithms into traditional methods and deep learning-based methods. Traditional methods primarily rely on morphological operations, edge detection, and adaptive thresholding for crack segmentation. For example, Abdel-Qader et al. [18] demonstrated the effectiveness and superiority of the fast Haar transform edge detection algorithm in segmenting bridge cracks. While traditional methods efficiently extract prominent cracks, they struggle to capture key textures in cracks with complex backgrounds.

Deep learning-based crack segmentation methods are mainly divided into models based on CNN and Transformer architectures. The CNN-based methods primarily extend variants on the FCN [19] framework. For example, Lee et al. [20] proposed an FCN-based variant integrated with an autoencoder structure for pavement surface detection. Di Benedetto et al. [21] introduced the ResNet-50 encoder [22] into U-Net, achieving impressive performance on the Crack500 dataset. In [23], a hybrid network termed CrackNet has been proposed to accurately segment cracks by harnessing the strengths of both CNNs and Transformers. A CNN-based model termed CrackNet-V [24] is designed for asphalt pavement crack segmentation, showcasing the efficacy of deep learning methods in the realm of automated crack segmentation systems.

These CNN-based methods are more accurate in extracting local crack structures, but they may produce breaks when segmenting long-span and elongated cracks. Liu et al. [14] leverage the strengths of Transformers in modeling global relationships and design a Crackformer for accurate crack detection. However, the computational complexity of Transformers limits their deployment on real-time devices.

2.2. Segmentation Methods Based on MobileNetV3

MobileNetV3 [9] has been widely applied to various segmentation tasks due to its efficient segmentation performance. In the medical imaging field, Alsenan et al. [25] integrated the advantages of the U-Net architecture and MobileNetV3, proposing MobileUNetV3 for the segmentation of spinal cord gray matter. In agricultural engineering, MobileNetV3 has been used to efficiently encode deep features, and combined with an attention mechanism to improve the accuracy of rice disease detection [26]. For lane lines in traffic scenarios, Deng et al. [27] employed MobileNetV3 to enhance the encoding capability and real-time performance of vehicle and lane line features. These MobileNetV3-based variants have achieved SOTA performance across various downstream tasks. However, due to the inherent limitations of CNNs, they struggle to model global features of diverse cracks.

Recently, Mamba [17] has achieved great success in the field of NLP. Its core is the state space model, which is capable of modeling long-range dependencies with linear complexity. Motivated by this, we combine the strengths of both Mamba and MobileNetV3 and propose a novel state space-based MobileNetV3 bottleneck module for efficient crack segmentation.

3. Method

3.1. Overall Architecture

The proposed model adopts a U-shaped encoder–decoder architecture for efficient crack segmentation, as shown in Figure 1. The U-shaped encoder–decoder architecture was proposed by Ronneberger et al. [28]. Its main advantage is combining multi-scale features efficiently, which improves crack segmentation performance, especially for recovering fine-grained crack textures. For the encoder, we use MobileNetV3 [9] to extract features. The core of MobileNetV3 is the bottleneck module, which uses an inverted residual design combined with Squeeze-and-Excitation (SE) attention module [12] to encode features efficiently. However, the SE attention module highlights crack semantics by compressing channels, which cannot effectively solve the mismatch between local receptive fields of CNNs and complex crack global morphologies. Thus, we introduce the dynamic state space (DSS) block into the MobileNetV3 bottleneck module and propose the state space-based MobileNetV3 bottleneck module to extract hierarchical features. For the decoder, we primarily use upsampling operation and depthwise separable convolution to progressively restore and integrate features. Considering the complex background and uneven lighting in images captured by mobile devices, we embed the strip pooling module between the encoder and decoder to highlight cracks and suppress background-related noise. Additionally, we jointly use cross-entropy loss and boundary loss as the objective functions to optimize the classification of crack pixels and the accuracy of boundary pixels.

3.2. State-Space-Based MobileNetV3 Bottleneck Module

The bottleneck module in MobileNetV3 [9] achieves a good balance between accuracy and efficiency. This module is mainly composed of depthwise separable convolution, inverted residual structure, SE attention module, and hard-swish activation function. It efficiently models local dependencies and encodes features. However, it still cannot escape the inherent limitations of convolution, as it is unable to model global dependencies for diverse types of cracks.
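As a rough illustration of why depthwise separable convolution contributes to this efficiency, the following sketch compares weight counts for a standard convolution and its depthwise separable counterpart. The function names and the channel sizes (64 in, 128 out, 3 × 3 kernel) are illustrative assumptions and are not taken from the paper.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Illustrative channel sizes, not taken from the paper.
c_in, c_out, k = 64, 128, 3
std = standard_conv_params(c_in, c_out, k)
dws = depthwise_separable_params(c_in, c_out, k)
print(std, dws, round(std / dws, 2))  # 73728 8768 8.41
```

For these sizes the separable form needs roughly one eighth of the weights; the exact saving depends on the kernel size and the number of output channels.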

To address this, we propose a state-space-based bottleneck module, which improves the performance of the original bottleneck module by incorporating the dynamic state space block. Specifically, given the input feature $x_{in}$, the state-space-based bottleneck module first uses a $1 \times 1$ convolution to expand the number of channels, enhancing the representation capability. Then, it utilizes depthwise separable convolution $\mathrm{DSConv}(\cdot)$ to efficiently extract local features and outputs $x_{dw}$, which can be formulated as

$$x_{dw} = \mathrm{DSConv}(\mathrm{Expand}(x_{in})), \tag{1}$$

where $\mathrm{Expand}(\cdot)$ is the $1 \times 1$ convolution. Different from the bottleneck module in MobileNetV3, we replace the SE attention module with a DSS block to model global dependencies in deep features. This effectively addresses the limitations of convolution operations in terms of receptive field size. Finally, a $1 \times 1$ convolution and a residual connection are used to compress the channels and prevent gradient vanishing, respectively. The output $x_{out}$ can be formulated as

$$x_{out} = \mathrm{Conv}(\mathrm{DSS}(x_{dw})) + x_{in}. \tag{2}$$

3.3. Dynamic State Space Block

Due to the uncertainty in the shape and distribution of crack curves, we design a dynamic state space block to bridge the gap between state space models and 2D crack images, as shown in Figure 2. Inspired by [29], the DSS block consists of a linear layer for feature expansion, a set of dynamic snake convolutions (DSCs) [30] in both horizontal and vertical directions, a SiLU function for local representation enhancement, the SS2D module for modeling contextual relationships for 2D images with linear complexity, and the layer normalization for standardizing deep features.

In contrast to the 1D sequence-based selective scan algorithm in Mamba [17], the SS2D block involves three steps, including crossing scan, selective scan and crossing merge. Specifically, it first scans the slices in four different directions, and then flattens them into sequences. The scanning directions are represented by the following matrix. Next, the S6 block in Mamba is applied to simultaneously process the sequences derived from the four scanning directions. In the final crossing merge step, the processed sequences are integrated and reshaped to generate the output. Based on the above process, the workflow of the DSS block can be reformulated as

$$x_{DSS} = (\mathrm{Linear} * \mathrm{DSC} * \mathrm{SiLU} * \mathrm{SS2D} * \mathrm{LN})(x_{dw}), \tag{3}$$

where $\mathrm{LN}$ represents the layer normalization, ‘*’ is the cascaded operation, $\mathrm{Linear}$ is the linear layer, and $\mathrm{DSC}$ is the dynamic snake convolution. The dynamic snake convolution is a deformable convolution that is more suitable for cracks. We take the accumulation process of DSC along the x-axis as an example, as shown in Figure 3.
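Returning briefly to the SS2D module, the crossing scan, selective scan, and crossing merge steps described above can be sketched on a single-channel map as follows. This is a minimal illustration under stated assumptions: the four scan orders (row-major, column-major, and their reverses) follow the common SS2D convention, and `toy_scan` is a hypothetical stand-in that uses a fixed linear recurrence, whereas the S6 block in Mamba uses input-dependent parameters.

```python
from typing import List

Grid = List[List[float]]


def cross_scan(x: Grid) -> List[List[float]]:
    """Flatten an H x W map into four 1D sequences, one per scan direction."""
    height, width = len(x), len(x[0])
    row_major = [v for row in x for v in row]  # rows top to bottom, left to right
    col_major = [x[r][c] for c in range(width) for r in range(height)]  # columns left to right
    return [row_major, col_major, row_major[::-1], col_major[::-1]]


def toy_scan(seq: List[float], a: float = 0.5, b: float = 1.0) -> List[float]:
    """Stand-in for S6: fixed recurrence h_t = a * h_{t-1} + b * x_t (one pass, linear time)."""
    h, out = 0.0, []
    for v in seq:
        h = a * h + b * v
        out.append(h)
    return out


def cross_merge(seqs: List[List[float]], height: int, width: int) -> Grid:
    """Map each sequence back to pixel positions and sum the four directions."""
    n = height * width
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            i_row = r * width + c   # position of (r, c) in row-major order
            i_col = c * height + r  # position of (r, c) in column-major order
            out[r][c] = (seqs[0][i_row] + seqs[1][i_col]
                         + seqs[2][n - 1 - i_row] + seqs[3][n - 1 - i_col])
    return out


def ss2d_sketch(x: Grid) -> Grid:
    """Crossing scan -> per-direction scan -> crossing merge."""
    seqs = [toy_scan(s) for s in cross_scan(x)]
    return cross_merge(seqs, len(x), len(x[0]))
```

Each of the four sequences is processed in a single pass, so the cost grows linearly with the number of pixels, and after merging every pixel has received information from all four directions.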

The central coordinate of the DSC kernel is defined as $K_{i} = (x_{i}, y_{i})$, and the remaining grid points of the kernel under a $3 \times 3$ receptive field are denoted $K_{i \pm c} = (x_{i \pm c}, y_{i \pm c})$, where $c \in [0, 4]$. The position of $K_{i \pm c}$ is the result of dynamic accumulation: starting from the center point $K_{i}$, $K_{i \pm 1}$ is determined by $K_{i}$ and the offset $\Delta$, where $\Delta = \{\delta \mid \delta \in [-1, 1]\}$. The accumulation process of DSC along the x-axis can be formalized as follows:

$$K_{i \pm c} = \begin{cases} (x_{i + c}, y_{i + c}) = (x_{i} + c,\; y_{i} + \sum_{i}^{i + c} \Delta y), \\ (x_{i - c}, y_{i - c}) = (x_{i} - c,\; y_{i} + \sum_{i}^{i - c} \Delta y). \end{cases} \tag{4}$$

The accumulation process of DSC along the y-axis is analogous, with the fixed step taken along $y$ and the offsets accumulated along $x$, which can be formalized as

$$K_{j \pm c} = \begin{cases} (x_{j + c}, y_{j + c}) = (x_{j} + \sum_{j}^{j + c} \Delta x,\; y_{j} + c), \\ (x_{j - c}, y_{j - c}) = (x_{j} + \sum_{j}^{j - c} \Delta x,\; y_{j} - c), \end{cases} \tag{5}$$

where $\sum$ denotes the dynamic accumulation process.
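The x-axis accumulation in Equation (4) can be illustrated with a small sketch. It assumes that the offsets for each step away from the center are already given (in the network they would be predicted from the features) and that the center itself has zero offset. The function name `snake_kernel_x` and its argument layout are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def snake_kernel_x(center: Point, dy_pos: List[float], dy_neg: List[float]) -> List[Point]:
    """Kernel points along the x-axis with cumulative y-offsets, cf. Equation (4).

    dy_pos[c - 1] and dy_neg[c - 1] are the offsets for the c-th step to the
    right and to the left of the center; each is clamped to [-1, 1].
    """
    x0, y0 = center

    def clamp(d: float) -> float:
        return max(-1.0, min(1.0, d))

    right, acc = [], 0.0
    for c, d in enumerate(dy_pos, start=1):
        acc += clamp(d)  # running sum of offsets, moving outward from the center
        right.append((x0 + c, y0 + acc))

    left, acc = [], 0.0
    for c, d in enumerate(dy_neg, start=1):
        acc += clamp(d)
        left.append((x0 - c, y0 + acc))

    # Order the points from the leftmost to the rightmost kernel position.
    return left[::-1] + [(x0, y0)] + right


points = snake_kernel_x((0.0, 0.0), dy_pos=[0.5, 0.5], dy_neg=[-1.0, 2.0])
print(points)
```

Because each position depends on the sum of all offsets between it and the center, neighboring kernel points can differ by at most one pixel in $y$, which keeps the sampled path continuous along a thin, curved crack.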

3.4. Strip Pooling Module

Directly using skip connections to fuse deep and shallow features may introduce background-related noise and disrupt the structured representation of cracks. Images captured by mobile devices contain various types of objects, and uneven lighting can reduce the distinction between crack and background pixels. To highlight crack features and suppress noise, we embed the strip pooling module [31] in the skip connections between the encoder and decoder.

As shown in Figure 4, the strip pooling module consists of the strip pooling branch and the skip connection branch. Given the features $x_{e}$ from the encoder, horizontal and vertical strip pooling are first applied in parallel to extract band-like features in different directions. Next, 1D convolution and interpolation are used to integrate the strip-like features and restore the spatial resolution. The features from the two directions are then fused through a $1 \times 1$ convolution and the sigmoid function, and the output is $x_{e}^{'}$, as follows:

$$x_{e}^{'} = \mathrm{Sigmoid}\big(\mathrm{Conv}\big(F(\mathrm{Conv}_{1D}(SP_{h}(x_{e}))) + F(\mathrm{Conv}_{1D}(SP_{v}(x_{e})))\big)\big), \tag{6}$$

where $SP_{h}$ and $SP_{v}$ represent horizontal and vertical strip pooling, respectively, and $F(\cdot)$ is the interpolation operation. Finally, by multiplying the output with the original input, cracks are highlighted and background noise is suppressed, which can be represented as

$$x_{out} = x_{e}^{'} \odot x_{e}, \tag{7}$$

where ‘⊙’ is the Hadamard product.
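The gating behavior of the strip pooling module can be sketched for a single-channel feature map. This is a simplified illustration, not the authors' implementation: the 1D and $1 \times 1$ convolutions are treated as identity mappings, interpolation is replaced by broadcasting each strip back to the full map, and the sigmoid is assumed to act on the sum of the two directional features, as in the original strip pooling design [31]. The function name `strip_pool_gate` is hypothetical.

```python
import math
from typing import List

Grid = List[List[float]]


def strip_pool_gate(x: Grid) -> Grid:
    """Single-channel strip-pooling gate with all convolutions omitted (identity)."""
    height, width = len(x), len(x[0])
    # Horizontal strip pooling: average each row, giving an H x 1 strip.
    row_mean = [sum(row) / width for row in x]
    # Vertical strip pooling: average each column, giving a 1 x W strip.
    col_mean = [sum(x[r][c] for r in range(height)) / height for c in range(width)]

    out = []
    for r in range(height):
        out_row = []
        for c in range(width):
            fused = row_mean[r] + col_mean[c]      # strips expanded to H x W and summed
            gate = 1.0 / (1.0 + math.exp(-fused))  # sigmoid attention weight
            out_row.append(gate * x[r][c])         # Hadamard product with the input
        out.append(out_row)
    return out


# A horizontal crack-like response on an otherwise empty map.
feature = [[0.0, 0.0, 0.0],
           [1.0, 1.0, 1.0],
           [0.0, 0.0, 0.0]]
gated = strip_pool_gate(feature)
```

Pixels lying on a row or column with a strong average response receive a larger gate value, which is why long, thin structures such as cracks are emphasized relative to isolated background responses.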

3.5. Loss Function

To accurately segment crack pixels, we utilize the cross-entropy loss and boundary loss as the objective functions. The cross-entropy loss effectively measures the discrepancy between the predictions and the labels, while the boundary loss focuses on the inference of the crack boundaries. Therefore, the total loss can be expressed as

$$\mathrm{Loss}_{total} = \lambda \cdot \mathrm{Loss}_{CE} + (1 - \lambda) \cdot \mathrm{Loss}_{Bo}, \tag{8}$$

$$\mathrm{Loss}_{CE} = -[GT \cdot \log(Pred) + (1 - GT) \cdot \log(1 - Pred)], \tag{9}$$

$$\mathrm{Loss}_{Bo} = |DT(Pred) - DT(GT)|, \tag{10}$$

where $\lambda$ is the hyperparameter that balances the contributions of each loss term, $Pred$ and $GT$ are the prediction and ground truth, respectively, and $DT(\cdot)$ is the distance transformation operation.
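A small numerical sketch of Equations (8)–(10) is given below. Several details are assumptions, since the text does not specify them: losses are averaged over pixels, the prediction is binarized at 0.5 before the distance transform, the distance transform is a brute-force Euclidean distance to the nearest foreground pixel, and an empty mask maps to zeros. A trainable implementation would need a differentiable formulation; this sketch only illustrates how the terms combine.

```python
import math
from typing import List

Grid = List[List[float]]


def bce_loss(pred: Grid, gt: Grid, eps: float = 1e-7) -> float:
    """Binary cross-entropy of Equation (9), averaged over pixels."""
    total, n = 0.0, 0
    for p_row, g_row in zip(pred, gt):
        for p, g in zip(p_row, g_row):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total += -(g * math.log(p) + (1.0 - g) * math.log(1.0 - p))
            n += 1
    return total / n


def distance_transform(mask: Grid) -> Grid:
    """Euclidean distance from every pixel to the nearest foreground pixel (brute force)."""
    fg = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v >= 0.5]
    if not fg:
        # Convention chosen for this sketch: no foreground gives all zeros.
        return [[0.0] * len(row) for row in mask]
    return [[min(math.hypot(r - fr, c - fc) for fr, fc in fg) for c in range(len(row))]
            for r, row in enumerate(mask)]


def boundary_loss(pred: Grid, gt: Grid) -> float:
    """Mean absolute difference between the two distance maps, cf. Equation (10)."""
    dp, dg = distance_transform(pred), distance_transform(gt)
    diffs = [abs(a - b) for ra, rb in zip(dp, dg) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def total_loss(pred: Grid, gt: Grid, lam: float = 0.6) -> float:
    """Weighted sum of Equation (8); lam = 0.6 is the value reported in the paper."""
    return lam * bce_loss(pred, gt) + (1.0 - lam) * boundary_loss(pred, gt)
```

With a prediction whose crack location matches the ground truth, the boundary term vanishes and only the cross-entropy term remains; a prediction shifted by one pixel is penalized by both terms.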

4. Experiments

4.1. Datasets

DeepCrack. The DeepCrack dataset [32] is a publicly available dataset specifically designed for crack segmentation, comprising 537 RGB images along with manually annotated mask images. As shown in the first row of Figure 5, each image and its corresponding mask have a resolution of 544 × 384 pixels. In this paper, we follow the original setup and divide DeepCrack into training and testing sets, containing 300 and 237 image/mask pairs, respectively.

Crack500. The public Crack500 dataset [33] is collected using mobile phones on the main campus of Temple University, as shown in the second row of Figure 5. Each image and its corresponding annotated mask contain pavement cracks, with a resolution of approximately 2000 × 1500 pixels. In the original dataset, the training, validation, and testing sets are divided into 250, 50, and 200 images, respectively. To balance the high resolution of the images and limited computational resources, these high-resolution images are cropped into 16 non-overlapping sub-regions, and only slices containing more than 1000 crack pixels are retained. Based on this approach, the training, validation, and testing data consist of 1896, 348, and 1124 image patches, respectively.
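The cropping and filtering step for Crack500 can be sketched as follows. The paper states only that each image is cut into 16 non-overlapping sub-regions and that slices with more than 1000 crack pixels are kept; the 4 × 4 grid layout and the function names here are assumptions made for illustration.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right)


def crop_grid(height: int, width: int, rows: int = 4, cols: int = 4) -> List[Box]:
    """Boxes of a rows x cols grid of non-overlapping patches covering the image."""
    boxes = []
    for i in range(rows):
        for j in range(cols):
            top, bottom = i * height // rows, (i + 1) * height // rows
            left, right = j * width // cols, (j + 1) * width // cols
            boxes.append((top, left, bottom, right))
    return boxes


def keep_patches(mask: List[List[int]], min_crack_pixels: int = 1000,
                 rows: int = 4, cols: int = 4) -> List[Box]:
    """Return the boxes whose crack-pixel count exceeds the threshold."""
    kept = []
    for top, left, bottom, right in crop_grid(len(mask), len(mask[0]), rows, cols):
        count = sum(1 for r in range(top, bottom)
                    for c in range(left, right) if mask[r][c])
        if count > min_crack_pixels:
            kept.append((top, left, bottom, right))
    return kept


boxes = crop_grid(1500, 2000)
print(len(boxes), boxes[0], boxes[-1])  # 16 (0, 0, 375, 500) (1125, 1500, 1500, 2000)
```

For an image of about 2000 × 1500 pixels, each patch under this layout is roughly 500 × 375 pixels before resizing.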

CFD. The CFD dataset [34] is collected using an iPhone 5 on urban road surfaces in Beijing. As shown in the last row of Figure 5, the challenges of the CFD dataset include complex backgrounds, uneven illumination, and interference from various obstructions, such as water stains, oil spots, and lane lines. The dataset consists of 118 images with a resolution of $480 \times 320$, along with manually annotated masks. In our experiments, the dataset is split into training and testing sets in a 6:4 ratio.

4.2. Performance Metrics

To quantitatively evaluate the segmentation performance of the models, we selected six evaluation metrics: intersection over union ($IoU$), dice similarity coefficient ($Dice$), precision ($Pre$), recall ($Recall$), accuracy ($Acc$), and centerline dice ($clDice$). The first five metrics primarily assess the accuracy performance at the pixel level, while $clDice$ evaluates the discrepancy between the predicted values and the ground truth based on the connectivity of the curve. The calculation principles of these metrics are as follows:

$$IoU = \frac{TP}{TP + FP + FN}, \tag{11}$$

$$Dice = \frac{2TP}{2TP + FP + FN}, \tag{12}$$

$$Pre = \frac{TP}{TP + FP}, \tag{13}$$

$$Acc = \frac{TP + TN}{TP + TN + FP + FN}, \tag{14}$$

$$clDice = \frac{2\,|Pred \cap Ground|}{|Pred| + |Ground|}, \tag{15}$$

where $TP$, $FP$, $FN$ and $TN$ are true positives, false positives, false negatives and true negatives, respectively. $Recall$ follows the standard definition $TP / (TP + FN)$.
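The pixel-level metrics can be computed directly from the confusion counts, as in the following sketch for binary masks. Recall is included with its standard definition $TP / (TP + FN)$ because it is reported in the results; the function names and the convention of returning 0 for an empty denominator are assumptions of this sketch. The centerline-based $clDice$ is not covered here.

```python
from typing import Dict, List, Tuple

Mask = List[List[int]]


def confusion_counts(pred: Mask, gt: Mask) -> Tuple[int, int, int, int]:
    """Count TP, FP, FN and TN over two binary masks of equal shape."""
    tp = fp = fn = tn = 0
    for p_row, g_row in zip(pred, gt):
        for p, g in zip(p_row, g_row):
            if p and g:
                tp += 1
            elif p:
                fp += 1
            elif g:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def _ratio(num: float, den: float) -> float:
    """Safe division; returns 0.0 when the denominator is zero (sketch convention)."""
    return num / den if den else 0.0


def pixel_metrics(pred: Mask, gt: Mask) -> Dict[str, float]:
    """IoU, Dice, Pre, Recall and Acc, following Equations (11)-(14)."""
    tp, fp, fn, tn = confusion_counts(pred, gt)
    return {
        "IoU": _ratio(tp, tp + fp + fn),
        "Dice": _ratio(2 * tp, 2 * tp + fp + fn),
        "Pre": _ratio(tp, tp + fp),
        "Recall": _ratio(tp, tp + fn),
        "Acc": _ratio(tp + tn, tp + tn + fp + fn),
    }


metrics = pixel_metrics([[1, 1, 0, 0]], [[1, 0, 1, 0]])
```

In this toy case each confusion count equals one, so $IoU$ is 1/3 while $Dice$, $Pre$, $Recall$ and $Acc$ are all 0.5, which shows that $IoU$ penalizes errors more strongly than $Dice$ for the same prediction.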

4.3. Implementation Details

All the experiments in this study were conducted on a server equipped with the Ubuntu 18.04 operating system and a GeForce RTX 3070 GPU with 8 GB of memory. For the software environment, we chose the PyTorch 2.0 framework and set up the necessary dependencies using Anaconda version 23.5.2. During model training, the batch size was set to 4, the initial learning rate to 0.0001, and the number of training iterations to 80,000. The AdamW optimizer was used to optimize the deep learning models. In addition, to prevent overfitting, we employed data augmentation techniques, including random flipping, random cropping, and random deformation, as well as rotation, color jittering, and scaling operations. All the images were resized to $256 \times 256$. The hyperparameter $\lambda$ in the loss function was set to 0.6.
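For reference, the reported settings can be collected in a small configuration object. The class and field names below are hypothetical; only the values come from the text. The derived quantities assume that one iteration processes one batch.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TrainConfig:
    """Training settings reported in the paper; field names are illustrative."""
    batch_size: int = 4
    learning_rate: float = 1e-4
    iterations: int = 80_000
    optimizer: str = "AdamW"
    input_size: Tuple[int, int] = (256, 256)
    loss_lambda: float = 0.6

    def samples_seen(self) -> int:
        """Total training samples processed, assuming one batch per iteration."""
        return self.batch_size * self.iterations

    def approx_epochs(self, n_train: int) -> float:
        """Approximate number of passes over a training set of n_train images."""
        return self.samples_seen() / n_train


cfg = TrainConfig()
print(cfg.samples_seen())                # 320000
print(round(cfg.approx_epochs(300), 1))  # 1066.7
```

Under this assumption, the 300 DeepCrack training pairs would be revisited roughly a thousand times, which is consistent with the emphasis on data augmentation to limit overfitting.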

4.4. Comparison Experiments

We compared the proposed method with fifteen different segmentation methods, including FCN [19], U-Net [28], MobileNetV1 [10], MobileNetV2 [11], MobileNetV3 (small version) [9], DeepLabv3+ [6] (ResNet-50 as backbone), Swin Transformer (tiny version) [35], SegFormer (b0 version) [36], CrackSegNet [37], CrackW-Net [38], TEED [39], CrackFormer [40], DECSNet [41] (ResNet-50 as backbone), RHACrackNet [2] and CarNet [42] (MobileNetV3-small as backbone) on three public datasets. Note that the methods for which no backbone network is specified keep the same configuration as the original methods. These methods can be divided into specialized real-time crack segmentation methods and general segmentation models. Moreover, all the methods were initialized with weights pre-trained on the ImageNet-1K dataset [43].

We quantified the performance of all methods on the DeepCrack dataset and recorded the results in Table 1. The proposed model achieved the highest scores across all metrics. Compared to the general-purpose U-Net, our proposed method improved the $IoU$, $Dice$, $Recall$, $Acc$, and $clDice$ scores by 5.18%, 2.53%, 9.97%, 0.47% and 2.28%, respectively. Additionally, the proposed method surpassed the CarNet model, which is specifically designed for crack segmentation, by 0.48% in $IoU$, 1.09% in $Dice$, 1.07% in $Recall$, 0.16% in $Acc$, and 0.1% in $clDice$. For an intuitive comparison, we visualized the top-performing models, including the U-Net, RHACrackNet and CarNet methods, in Figure 6. We observed that the segmentation results from U-Net and RHACrackNet contained more false positive pixels, while CarNet exhibited discontinuities in its segmentation of slender cracks. In contrast, the proposed method demonstrated superior accuracy in detecting cracks from the DeepCrack images, particularly in regions where cracks are discontinuous. However, the proposed method is not sensitive enough to cracks and artifacts in blurred image areas, making the segmentation results there prone to fragmentation.

As shown in Table 2, our proposed model consistently outperformed all others across all evaluation metrics on the Crack500 dataset. Specifically, it achieved improvements of 3.43%, 1.81%, 9.53%, 0.59%, and 1.81% over the U-Net model in terms of $IoU$, $Dice$, $Recall$, $Acc$, and $clDice$, respectively. Furthermore, the proposed method also outperformed the CarNet model, which is specialized for crack segmentation, with gains of 0.52% in $IoU$, 0.33% in $Dice$, 0.39% in $Recall$, 0.54% in $Acc$, and 1.77% in $clDice$. To provide a clearer comparison of the segmentation performance, we visualized the results for U-Net and CarNet in Figure 7. We observed that the proposed model demonstrated more precise segmentation performance on Crack500 images, with significantly fewer false positive pixels and fracture points. For the mesh-like structures formed by interconnected annular cracks, the proposed method demonstrates superior capability in capturing pixel-level correlations. Moreover, it successfully identifies crack textures that were overlooked in the ground truth images. However, the accuracy of the proposed method in segmenting boundary pixels of thick cracks needs further enhancement.

We quantified the performance of all methods on the CFD dataset, and the results are recorded in Table 3. We found that the proposed model achieved the highest scores across all metrics. Compared to the U-Net model, the proposed method outperformed it by 6.84%, 1.52%, 5.97%, 1.82%, and 1.58% in the $IoU$, $Dice$, $Recall$, $Acc$, and $clDice$ metrics, respectively. Moreover, the proposed model surpassed the CarNet model, which is specifically designed for crack segmentation, by 1.14% in $IoU$, 0.27% in $Dice$, 0.48% in $Recall$, 0.23% in $Acc$, and 0.33% in $clDice$. To provide a clear comparison of the segmentation performance, we visualized the U-Net and CarNet methods in Figure 8. We found that U-Net and RHACrackNet exhibited suboptimal performance in segmenting complex annular crack networks, failing to accurately capture pixel-level connectivity relationships. In contrast, the proposed model demonstrated superior accuracy and continuity in segmenting both long-span and intricately meshed crack structures. Additionally, the proposed method achieved more precise identification of crack junctions. However, the proposed method is not precise in capturing local textures of jagged cracks. This suggests that the proposed model should incorporate edge-related features, which will be our future research focus.

We present the training process of our proposed method and recent SOTA models on three public datasets. As shown in the first row of Figure 9, compared to the three SOTA methods (RHACrackNet, DECSNet and CarNet), the proposed method converges faster and achieves a lower loss on the DeepCrack dataset. On the challenging Crack500 dataset, DECSNet exhibits severe oscillations and does not converge stably, whereas our proposed method converges stably. On the CFD dataset, the IoU curve and loss curve of the proposed method also converge quickly and reach higher accuracy. Overall, the proposed method trains stably and converges rapidly across the three public datasets, which supports its superiority.

To evaluate the comprehensive performance of the model, we report the number of parameters, FLOPs and FPS values for all methods. As shown in Table 4, compared to general-purpose methods, the proposed model has significantly fewer parameters. Additionally, compared to real-time segmentation models, our proposed method requires fewer FLOPs and achieves a higher FPS. To visually assess the efficiency of all models, we have visualized their performance in terms of parameter count, FLOPs and FPS. As shown in Figure 10, the proposed method is located in the bottom-right corner and has the smallest radius, indicating a lower parameter count, fewer FLOPs, and a higher FPS. Thus, our method is competitive in efficient crack segmentation tasks.

4.5. Ablation Experiments

To evaluate the contribution of each module in the proposed model, we conducted corresponding ablation experiments. As shown in Table 5, we used MobileNetV3 (small version) as the backbone network, and by incorporating a progressive decoder, we found that the proposed method achieved a 0.48% increase in the IoU metric. Furthermore, we observed that the DSS block contributed the most to the model, boosting the IoU score by 0.75%, which also suggests that the proposed DSS block has great potential for crack segmentation. As shown in the fourth and fifth rows of Table 5, our proposed DSS module surpasses the original SE attention module by 0.41% in the IoU metric on the DeepCrack dataset, which supports the superiority of our proposed DSS module. To validate the impact of embedding the DSS block at different positions within the MobileNetV3 module, we integrated the DSS block into three distinct locations. As observed in Table 5, embedding the DSS block in the middle position of the MobileNetV3 module yields the most significant improvement for the crack segmentation task. Finally, we conducted an ablation of the strip pooling module and found that it improved the IoU score by 0.62%. Therefore, based on the above analysis, we have demonstrated the effectiveness of each module in our method.

To investigate the impact of different input image resolutions on the proposed method, we conducted corresponding ablation experiments. As shown in Table 6, when using input images with resolutions of $128 \times 128$ and $224 \times 224$, the proposed method performed suboptimally. We believe this is because the images were resized too small, resulting in the loss of fine-grained features. When we increased the input image resolution to $448 \times 448$, we found that the proposed method achieved lower scores in the recall, accuracy, and clDice indicators. This suggests that increasing the input image resolution may lead to noise interference and disrupt the structural features of cracks. In contrast, the proposed method achieved SOTA performance by resizing the input images to $256 \times 256$.

Moreover, we evaluated the hyperparameter $λ$, which controls the fusion of the cross-entropy loss and the boundary loss, across three public datasets, as shown in Figure 11. We found that when $λ = 0.6$, the proposed method achieves the highest IoU scores on all three datasets. We also conducted ablation experiments on other hyperparameters, including the learning rate, architecture depth, and batch size. As shown in Table 7, initializing with a smaller learning rate helps the model converge to a better solution. Furthermore, deepening the encoder architecture unexpectedly degrades segmentation accuracy, suggesting that excessive model depth may impair the decoder’s capacity to recover crack semantics. Notably, batch size adjustments have a negligible impact on overall model performance.
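The exact fusion formula is defined in Section 3.5 and is not repeated here. One common way to blend two loss terms with a single weight is a convex combination; the sketch below assumes that form, and `fused_loss`, its arguments, and the example loss values are illustrative placeholders rather than the paper's definition or data.

```python
def fused_loss(ce_loss, boundary_loss, lam=0.6):
    """Blend two scalar loss values with a weight `lam` in [0, 1].

    ASSUMPTION: convex combination lam * CE + (1 - lam) * boundary.
    The paper's own formula (Section 3.5) may differ.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * ce_loss + (1.0 - lam) * boundary_loss


# Sweep lam as a sensitivity study might; 0.30 and 0.10 are made-up values.
for lam in (0.2, 0.4, 0.6, 0.8):
    print(lam, fused_loss(ce_loss=0.30, boundary_loss=0.10, lam=lam))
```

Under this assumed form, a larger weight shifts the objective toward the cross-entropy term and a smaller one toward the boundary term.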

To evaluate the impact of the offset range $Δ$ in dynamic snake convolution on model performance, we conducted targeted ablation experiments. As shown in Table 8, four distinct offset ranges were configured. We observed a moderate decline in performance across all three datasets when expanding the offset range $Δ$, which may be attributed to the mismatch between the receptive fields generated by dynamic snake convolution and the inherent crack morphology. Specifically, a larger offset range compromises the model’s ability to focus on the curvilinear and continuous nature of cracks. Furthermore, when the offset range $Δ$ was constrained to $[-1, 1]$, the model achieved state-of-the-art (SOTA) performance, validating the efficacy of our parameter configuration in aligning with the structural characteristics of cracks.
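To make the role of the offset range concrete, the sketch below builds the sampling positions of a one-dimensional snake kernel along the x-axis, accumulating bounded vertical offsets outward from the kernel center (the accumulation illustrated in Figure 3). It is a simplified illustration: the function name, the use of `tanh` to bound each step to $[-Δ, Δ]$, and the parameter `extent` (standing in for $Δ$) are assumptions of this sketch. The actual layer [30] predicts the offsets with a convolution and samples features by interpolation, both of which are omitted.

```python
import math


def snake_kernel_coords(raw_offsets, center=(0.0, 0.0), extent=1.0):
    """Return (x, y) sampling positions of an x-axis snake kernel.

    raw_offsets: unconstrained per-position vertical offsets (odd length);
                 the entry at the center position is ignored.
    extent:      bound on each step, playing the role of the offset range.
    """
    k = len(raw_offsets)
    if k % 2 == 0:
        raise ValueError("kernel length must be odd")
    half = k // 2
    # Bound every step to [-extent, extent] (assumed tanh squashing).
    bounded = [extent * math.tanh(o) for o in raw_offsets]
    cx, cy = center
    coords = [None] * k
    coords[half] = (cx, cy)
    # Accumulate offsets outward so that neighbours stay connected.
    y = cy
    for c in range(1, half + 1):
        y += bounded[half + c]
        coords[half + c] = (cx + c, y)
    y = cy
    for c in range(1, half + 1):
        y += bounded[half - c]
        coords[half - c] = (cx - c, y)
    return coords


print(snake_kernel_coords([0.3, -0.2, 0.0, 0.5, 0.1], extent=1.0))
```

Because each step is bounded, adjacent sampling points can drift by at most `extent` in y, so a larger range allows the kernel to bend further away from a thin, continuous crack.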

4.6. Cross-Dataset Transfer Experiments

To test the sensitivity and robustness of the proposed method, we conducted cross-dataset transfer experiments. Specifically, we loaded the model weights trained on one dataset and used them to test on the other two datasets. In Table 9, the first column gives the dataset from which the model weights were loaded, while the first row gives the test datasets. We found that the proposed method achieved stable segmentation performance across all cross-dataset transfer experiments. These results indicate that the proposed model is robust and not significantly affected by external factors such as background or environmental changes.
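To make the size of the cross-dataset change explicit, the values in Table 9 can be compared with the in-domain result for each test dataset. The helper below is an illustrative calculation on the reported Dice scores only; `transfer_gap` is a hypothetical name, not a metric defined in the paper.

```python
# Dice (%) from Table 9: outer key = training dataset, inner key = test dataset.
dice = {
    "DeepCrack": {"DeepCrack": 84.36, "Crack500": 52.58, "CFD": 51.94},
    "Crack500": {"DeepCrack": 55.88, "Crack500": 71.72, "CFD": 53.42},
    "CFD": {"DeepCrack": 55.64, "Crack500": 56.26, "CFD": 74.29},
}


def transfer_gap(train, test):
    """Dice lost (percentage points) versus training on the test dataset."""
    return round(dice[test][test] - dice[train][test], 2)


gaps = {
    (tr, te): transfer_gap(tr, te)
    for tr in dice
    for te in dice
    if tr != te
}
for (tr, te), gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"train {tr:9s} -> test {te:9s}: -{gap:.2f}")
```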

4.7. Conclusions

In this paper, we propose an efficient dynamic-state-space-enhanced network to accurately and efficiently segment concrete cracks. We embed a dynamic-state-space (DSS) block into the encoder to improve the global dependency modeling capability of the MobileNetV3 bottleneck modules. Specifically, we replace the SE module in the original MobileNetV3 bottleneck module with our proposed DSS block for more powerful feature representation. In the decoding stage, we use upsampling and depthwise separable convolution to progressively restore the spatial resolution. Moreover, to highlight crack textures and suppress background-related noise, we embed the strip pooling module into the skip connection between the encoder and decoder. Extensive experiments on three public datasets show the superiority of our proposed method in both accuracy and efficiency.

Although our method achieves good results on crack segmentation, it still has some limitations. Firstly, it struggles to detect cracks in blurred image areas, leading to fragmented segmentation results. Secondly, the accuracy of segmenting thick crack boundaries needs improvement, as the model often misses clear edges. Thirdly, the method cannot precisely capture the fine textures of jagged cracks, which suggests that edge-related features should be strengthened in future work.

Author Contributions

Conceptualization, H.L. and Y.C.; methodology, H.L.; software, H.L.; validation, Y.C. and Q.Z.; formal analysis, Y.C. and L.C.; investigation, Q.Z.; data curation, H.L.; writing—original draft preparation, H.L.; writing—review and editing, Y.C.; visualization, Q.Z.; supervision, Y.C. and L.C.; project administration, Q.Z. and L.C.; funding acquisition, Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 62206114.

Data Availability Statement

Three public datasets were used to evaluate our proposed method in this paper, which were termed the CFD dataset, DeepCrack dataset, and Crack500 dataset. They can be downloaded at https://paperswithcode.com/dataset/cfd, https://github.com/yhlleo/DeepCrack, and https://paperswithcode.com/dataset/crack500 (accessed on 25 October 2024), respectively.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Shan, J.; Jiang, W.; Huang, Y.; Yuan, D.; Liu, Y. Unmanned aerial vehicle (UAV)-Based pavement image stitching without occlusion, crack semantic segmentation, and quantification. IEEE Trans. Intell. Transp. Syst. 2024, 25, 17038–17053.
2. Zhu, G.; Liu, J.; Fan, Z.; Yuan, D.; Ma, P.; Wang, M.; Sheng, W.; Wang, K.C. A lightweight encoder–decoder network for automatic pavement crack detection. Comput.-Aided Civ. Infrastruct. Eng. 2024, 39, 1743–1765.
3. Landstrom, A.; Thurley, M.J. Morphology-based crack detection for steel slabs. IEEE J. Sel. Top. Signal Process. 2012, 6, 866–875.
4. Akagic, A.; Buza, E.; Omanovic, S.; Karabegovic, A. Pavement crack detection using Otsu thresholding for image segmentation. In Proceedings of the 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, Croatia, 21–25 May 2018; IEEE: New York, NY, USA, 2018; pp. 1092–1097.
5. Xu, G.; Zhang, Y.; Yue, Q.; Liu, X. A deep learning framework for real-time multi-task recognition and measurement of concrete cracks. Adv. Eng. Inform. 2025, 65, 103127.
6. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
7. Guo, L.; Xiong, F.; Cao, Y.; Xue, H.; Cui, L.; Han, X. Focusing on Cracks with Instance Normalization Wavelet Layer. Sensors 2024, 25, 146.
8. Liu, L.; Shen, B.; Huang, S.; Liu, R.; Liao, W.; Wang, B.; Diao, S. Binocular Video-Based Automatic Pixel-Level Crack Detection and Quantification Using Deep Convolutional Neural Networks for Concrete Structures. Buildings 2025, 15, 258.
9. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324.
10. Howard, A.G. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
11. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520.
12. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
13. Dosovitskiy, A. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
14. Liu, H.; Miao, X.; Mertz, C.; Xu, C.; Kong, H. Crackformer: Transformer network for fine-grained crack detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3783–3792.
15. Shan, J.; Huang, Y.; Jiang, W. DCUFormer: Enhancing pavement crack segmentation in complex environments with dual-cross/upsampling attention. Expert Syst. Appl. 2025, 264, 125891.
16. Liu, K.; Ren, T.; Lan, Z.; Yang, Y.; Liu, R.; Xu, Y. CGV-Net: Tunnel Lining Crack Segmentation Method Based on Graph Convolution Guided Transformer. Buildings 2025, 15, 197.
17. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752.
18. Abdel-Qader, I.; Abudayyeh, O.; Kelly, M.E. Analysis of edge-detection techniques for crack identification in bridges. J. Comput. Civ. Eng. 2003, 17, 255–263.
19. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
20. Lee, T.; Yoon, Y.; Chun, C.; Ryu, S. Cnn-based road-surface crack detection model that responds to brightness changes. Electronics 2021, 10, 1402.
21. Di Benedetto, A.; Fiani, M.; Gujski, L.M. U-Net-based CNN architecture for road crack segmentation. Infrastructures 2023, 8, 90.
22. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
23. Fan, Y.; Hu, Z.; Li, Q.; Sun, Y.; Chen, J.; Zhou, Q. CrackNet: A Hybrid Model for Crack Segmentation with Dynamic Loss Function. Sensors 2024, 24, 7134.
24. Fei, Y.; Wang, K.C.; Zhang, A.; Chen, C.; Li, J.Q.; Liu, Y.; Yang, G.; Li, B. Pixel-level cracking detection on 3D asphalt pavement images through deep-learning-based CrackNet-V. IEEE Trans. Intell. Transp. Syst. 2019, 21, 273–284.
25. Alsenan, A.; Ben Youssef, B.; Alhichri, H. Mobileunetv3—A combined unet and mobilenetv3 architecture for spinal cord gray matter segmentation. Electronics 2022, 11, 2388.
26. Jia, L.; Wang, T.; Chen, Y.; Zang, Y.; Li, X.; Shi, H.; Gao, L. MobileNet-CA-YOLO: An improved YOLOv7 based on the MobileNetV3 and attention mechanism for Rice pests and diseases detection. Agriculture 2023, 13, 1285.
27. Deng, T.; Wu, Y. Simultaneous vehicle and lane detection via MobileNetV3 in car following scene. PLoS ONE 2022, 17, e0264551.
28. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
29. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Liu, Y. VMamba: Visual State Space Model. arXiv 2024, arXiv:2401.10166.
30. Qi, Y.; He, Y.; Qi, X.; Zhang, Y.; Yang, G. Dynamic snake convolution based on topological geometric constraints for tubular structure segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 6070–6079.
31. Hou, Q.; Zhang, L.; Cheng, M.M.; Feng, J. Strip pooling: Rethinking spatial pooling for scene parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4003–4012.
32. Liu, Y.; Yao, J.; Lu, X.; Xie, R.; Li, L. DeepCrack: A deep hierarchical feature learning architecture for crack segmentation. Neurocomputing 2019, 338, 139–153.
33. Yang, F.; Zhang, L.; Yu, S.; Prokhorov, D.; Mei, X.; Ling, H. Feature pyramid and hierarchical boosting network for pavement crack detection. IEEE Trans. Intell. Transp. Syst. 2019, 21, 1525–1535.
34. Shi, Y.; Cui, L.; Qi, Z.; Meng, F.; Chen, Z. Automatic road crack detection using random structured forests. IEEE Trans. Intell. Transp. Syst. 2016, 17, 3434–3445.
35. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
36. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090.
37. Ren, Y.; Huang, J.; Hong, Z.; Lu, W.; Yin, J.; Zou, L.; Shen, X. Image-based concrete crack detection in tunnels using deep fully convolutional networks. Constr. Build. Mater. 2020, 234, 117367.
38. Han, C.; Ma, T.; Huyan, J.; Huang, X.; Zhang, Y. CrackW-Net: A novel pavement crack image segmentation convolutional neural network. IEEE Trans. Intell. Transp. Syst. 2021, 23, 22135–22144.
39. Soria, X.; Li, Y.; Rouhani, M.; Sappa, A.D. Tiny and efficient model for the edge detection generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 1364–1373.
40. Liu, H.; Yang, J.; Miao, X.; Mertz, C.; Kong, H. CrackFormer network for pavement crack segmentation. IEEE Trans. Intell. Transp. Syst. 2023, 24, 9240–9252.
41. Zhang, J.; Zeng, Z.; Sharma, P.K.; Alfarraj, O.; Tolba, A.; Wang, J. A dual encoder crack segmentation network with Haar wavelet-based high–low frequency attention. Expert Syst. Appl. 2024, 256, 124950.
42. Li, K.; Yang, J.; Ma, S.; Wang, B.; Wang, S.; Tian, Y.; Qi, Z. Rethinking lightweight convolutional neural networks for efficient and high-quality pavement crack detection. IEEE Trans. Intell. Transp. Syst. 2024, 25, 237–250.
43. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: New York, NY, USA, 2009; pp. 248–255.

Figure 1. The pipeline of the proposed method.

Figure 2. The illustration of the DSS block.

Figure 3. The illustration of the dynamic accumulation process of dynamic snake convolution in the x-axis direction.

Figure 4. The illustration of the strip pooling module.

Figure 5. Examples of three public datasets. Images in the first, second and third rows belong to the DeepCrack, Crack500 and CFD datasets, respectively.

Figure 6. Visual comparison of the top-performing methods on the DeepCrack dataset. (a) Ground truth, (b) Unet, (c) RHACrackNet, (d) CarNet, and (e) our proposed method. Red rectangles highlight the differences between the competitors and the proposed method.

Figure 7. Visual comparison of the top-performing methods on the Crack500 dataset. (a) Ground truth, (b) Unet, (c) RHACrackNet, (d) CarNet, and (e) our proposed method. Red rectangles highlight the differences between the competitors and the proposed method.

Figure 8. Visual comparison of the top-performing methods on the CFD dataset. (a) Ground truth, (b) Unet, (c) RHACrackNet, (d) CarNet, and (e) our proposed method. Red rectangles highlight the differences between the competitors and the proposed method.

Figure 9. IoU and loss curves of three public datasets. (a,b) Corresponding curves of DeepCrack dataset. (c,d) Corresponding curves of Crack500 dataset. (e,f) Corresponding curves of CFD dataset.

Figure 10. Comparison of parameters and efficiency of each method. The size of the circle indicates the scale of the model parameters. FPS and FLOPs represent the efficiency of the models.

Figure 11. Ablation studies for $λ$. (a–c) Parameter variations for the DeepCrack, Crack500 and CFD datasets, respectively.

Table 1. Comparison performance (%) between our proposed DSS-MobileNetV3 and SOTA methods on the DeepCrack dataset [32].

Method	Year	IoU	Dice	Recall	Acc	clDice
FCN [19]	2015	68.81	81.79	74.55	98.37	86.79
Unet [28]	2015	69.03	81.83	74.96	98.40	86.92
MobilenetV1 [10]	2017	69.28	81.94	75.20	98.49	87.13
MobilenetV2 [11]	2018	70.81	82.11	78.14	98.52	87.36
MobileNetV3 [9]	2019	72.59	82.29	79.62	98.57	88.09
DeepLabv3+ [6]	2018	71.34	82.22	78.87	98.54	87.61
Swin Transformer [35]	2021	73.04	82.58	83.11	98.51	88.42
SegFormer [36]	2021	73.39	83.01	83.49	98.61	88.84
CrackSegNet [37]	2020	73.28	83.07	83.41	98.68	88.61
CrackW-Net [38]	2021	73.17	82.92	83.25	98.62	88.34
TEED [39]	2023	70.50	82.14	78.42	98.53	87.39
CrackFormer [40]	2023	73.37	82.95	83.43	98.58	88.80
DECSNet [41]	2024	71.67	82.18	78.95	98.54	87.60
RHACrackNet [2]	2024	73.43	83.06	83.57	98.67	88.91
CarNet [42]	2024	73.73	83.27	83.86	98.71	89.10
Ours	−	74.21	84.36	84.93	98.87	89.20

Table 2. Comparison performance (%) between our proposed DSS-MobileNetV3 and SOTA methods on the Crack500 dataset [33].

Method	Year	IoU	Dice	Recall	Acc	clDice
FCN [19]	2015	53.11	69.70	67.95	96.68	74.80
Unet [28]	2015	53.25	69.91	68.14	96.70	74.88
MobilenetV1 [10]	2017	53.87	70.01	69.38	96.79	74.94
MobilenetV2 [11]	2018	54.10	70.23	69.75	96.87	75.10
MobileNetV3 [9]	2019	54.88	70.57	72.17	96.95	75.39
DeepLabv3+ [6]	2018	54.57	70.14	70.39	96.88	74.95
Swin Transformer [35]	2021	55.81	71.07	75.27	96.66	75.14
SegFormer [36]	2021	56.02	71.06	75.82	97.07	75.40
CrackSegNet [37]	2020	55.21	71.07	73.84	97.04	75.96
CrackW-Net [38]	2021	54.97	70.65	73.72	97.10	75.71
TEED [39]	2023	53.46	70.04	69.60	97.12	75.78
CrackFormer [40]	2023	56.03	71.12	75.49	96.71	76.20
DECSNet [41]	2024	54.58	70.41	71.75	96.89	75.18
RHACrackNet [2]	2024	56.06	71.19	76.12	96.68	75.23
CarNet [42]	2024	56.16	71.39	77.28	96.75	74.92
Ours	−	56.68	71.72	77.67	97.29	76.69

Table 3. Comparison performance (%) between our proposed DSS-MobileNetV3 and SOTA methods on the CFD dataset [34].

Method	Year	IoU	Dice	Recall	Acc	clDice
FCN [19]	2015	51.44	72.56	75.29	93.99	78.29
Unet [28]	2015	51.53	72.77	75.39	94.01	78.50
MobilenetV1 [10]	2017	51.64	72.90	75.61	94.22	78.71
MobilenetV2 [11]	2018	52.38	73.08	76.31	94.72	79.22
MobileNetV3 [9]	2019	54.92	73.40	78.54	95.05	79.56
DeepLabv3+ [6]	2018	53.63	73.22	77.56	94.89	79.30
Swin Transformer [35]	2021	56.36	73.51	79.67	95.35	79.44
SegFormer [36]	2021	56.96	73.80	79.94	95.66	79.82
CrackSegNet [37]	2020	55.85	73.84	78.89	95.43	79.70
CrackW-Net [38]	2021	55.51	73.27	79.12	95.58	79.51
TEED [39]	2023	52.39	72.95	76.41	94.50	78.53
CrackFormer [40]	2023	56.98	73.77	79.83	95.53	79.78
DECSNet [41]	2024	53.65	73.21	77.80	94.94	79.38
RHACrackNet [2]	2024	57.05	73.94	80.62	95.51	79.35
CarNet [42]	2024	57.23	74.02	80.88	95.60	79.75
Ours	−	58.37	74.29	81.36	95.83	80.08

Table 4. The parameter counts, FLOPs and FPS of the methods.

Method	Year	# Param (M)	FLOPs (G)	FPS
FCN [19]	2015	25.3	5.6	331.4
Unet [28]	2015	29.8	6.8	389.7
MobilenetV1 [10]	2017	4.7	8.7	785.6
MobilenetV2 [11]	2018	1.8	7.4	792.4
MobileNetV3 [9]	2019	3.8	8.3	804.9
DeepLabv3+ [6]	2018	22.7	15.4	352.9
Swin Transformer [35]	2021	37.6	33.7	586.2
SegFormer [36]	2021	8.7	46.3	479.1
CrackSegNet [37]	2020	18.0	64.6	482.3
CrackW-Net [38]	2021	2.1	15.2	798.2
TEED [39]	2023	5.0	1.3	846.8
CrackFormer [40]	2023	4.8	82.7	812.7
DECSNet [41]	2024	47.4	61.3	176.5
RHACrackNet [2]	2024	4.6	10.8	808.4
CarNet [42]	2024	5.1	7.8	851.1
Ours	−	2.4	7.5	960.1

Table 5. Ablation study for the proposed method. Note that all the values are the IoU performance (%) of the corresponding methods.

Method	DeepCrack	Crack500	CFD
Backbone	72.36	54.64	54.89
Backbone + decoder	72.84	55.31	55.74
Backbone + decoder + SE attention module	73.18	55.52	56.83
Backbone + decoder + DSS block (Middle)	73.59	56.20	57.75
Backbone + decoder + DSS block (Head)	73.21	55.78	56.71
Backbone + decoder + DSS block (Tail)	73.07	55.62	56.34
Proposed	74.21	56.68	58.37

Table 6. Ablation study for the resolution of input images on the DeepCrack dataset [32]. Note that ‘⋆’ represents the proposed method.

Resolution	IoU (%)	Dice (%)	Recall (%)	Acc (%)	clDice (%)
$128 \times 128$	73.97	83.85	84.33	98.37	88.62
$224 \times 224$	74.04	83.97	84.46	98.50	88.79
$256 \times 256$ $(⋆)$	74.21	84.36	84.93	98.87	89.20
$448 \times 448$	74.26	84.41	84.83	98.70	88.61

Table 7. Ablation study for the hyperparameters including the learning rate, architecture depth, and batch size on the DeepCrack dataset [32]. Note that ‘⋆’ represents the proposed method.

Hyperparameter	Setting	IoU (%)	Acc (%)	clDice (%)
Learning Rate	0.01	73.95	98.47	88.37
	0.001	74.04	98.51	88.64
	0.0001 (⋆)	74.21	98.87	89.20
Architecture Depth	7	73.98	98.66	89.02
	5 (⋆)	74.21	98.87	89.20
	4	74.05	98.77	89.14
Batch Size	1	74.13	98.78	89.08
	2	74.15	98.81	89.13
	4 (⋆)	74.21	98.87	89.20

Table 8. Ablation study for the offset range $Δ$ of dynamic snake convolution on the DeepCrack dataset.

Offset Range $Δ$	IoU (%)	Dice (%)	Recall (%)	Acc (%)	clDice (%)
$Δ \in [- 5, 5]$	73.78	83.95	84.41	98.30	88.78
$Δ \in [- 3, 3]$	73.87	84.06	84.54	98.42	88.83
$Δ \in [- 2, 2]$	74.05	84.12	84.69	98.59	89.04
$Δ \in [- 1, 1]$	74.21	84.36	84.93	98.87	89.20

Table 9. Cross-dataset transfer experiments. Rows give the dataset on which the model weights were trained, and columns give the test dataset. Note that all the values are the Dice performance (%) of the proposed method.

Transfer Dataset	DeepCrack	Crack500	CFD
DeepCrack	84.36	52.58	51.94
Crack500	55.88	71.72	53.42
CFD	55.64	56.26	74.29

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
