3.1. Model Structure
The proposed SpcNet architecture is illustrated in Figure 1. It consists of four key components: a Sparse Encoding Module, a ConvNeXt V2–based Decoder, a Binary Attention Module (BAM), and Channel/Spatial Attention Bridge (CAB/SAB) modules for multi-scale feature fusion.
First, a random masking strategy is applied to the input image to generate a sparse pixel distribution, simulating incomplete visual information and encouraging the network to infer missing regions from contextual cues. The Sparse Encoding Module, built upon sparse convolution, processes only visible data points to reduce unnecessary computation while maintaining accurate feature extraction. This design effectively balances performance and computational load, making it particularly suitable for lightweight crack detection.
The encoder comprises Patch Mixing and Channel Mixing submodules. The former facilitates spatial feature interaction, while the latter enhances inter-channel communication, enabling rich and discriminative representations of cracks.
The decoder adopts a lightweight ConvNeXt V2 module, an evolution of the original ConvNeXt architecture. ConvNeXt V2 introduces Global Response Normalization (GRN) and GELU activation to improve gradient stability and representation smoothness. Its asymmetric design—where the encoder is deeper and more hierarchical than the decoder—prevents shortcut information copying from masked to visible regions, compelling the model to learn genuine contextual understanding rather than pixel replication. This asymmetric encoder–decoder structure thus enforces semantic reconstruction and enhances generalization to diverse crack patterns.
Finally, the Binary Attention Module (BAM) is integrated to strengthen global dependency modeling. BAM employs Linear Attention (LA) and Attachment Linear Attention (ALA) to dynamically capture long-range spatial relationships between crack regions. The CAB and SAB modules then fuse multi-scale features across channels and spatial dimensions, respectively, ensuring precise localization and complete crack contours. The model concludes with a global average pooling layer and output prediction head for final crack segmentation.
Overall, the SpcNet framework leverages sparse encoding for efficiency, ConvNeXt V2 for stability, and binary attention for enhanced global context understanding—jointly achieving high-accuracy, lightweight crack detection suitable for real-world pavement monitoring.
3.2. Sparse Encoding Module
A random masking strategy is applied to the original input crack image, using a high masking ratio to force the model to predict the masked portions based on the limited remaining information. This encourages the model to generate effective learning signals, despite the loss of content.
Specifically, the random masking strategy is implemented via a uniform sampling mechanism to ensure unbiased context learning. The process follows three key steps:
Patch Partitioning: The input image is divided into non-overlapping patches of size 16×16.
Uniform Random Permutation: We generate a random permutation of the patch indices following a uniform distribution. This ensures global independence among masked regions, distinguishing our approach from path-dependent strategies like random walks.
Masking Execution: A fixed masking ratio of 40% is applied. The first 40% of the permuted indices are discarded (masked), while the remaining patches constitute the sparse input.
We empirically validated masking ratios of 20%, 40%, and 60%, finding that 40% offers the optimal trade-off between reconstruction difficulty and semantic preservation. To ensure reproducibility, the random seed is fixed and the mask pattern is deterministically regenerated at each epoch.
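The three masking steps above can be sketched as follows. This is a minimal numpy illustration with assumed shapes and an illustrative function name; the paper's exact implementation is not shown.

```python
import numpy as np

def random_mask_patches(image, patch=16, ratio=0.4, seed=0):
    """Split `image` (H, W, C) into non-overlapping patches, mask `ratio`
    of them via a uniform random permutation, and return the visible
    patches together with their grid indices."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    # Step 1 - patch partitioning: (S, patch, patch, C), S = (H/patch)*(W/patch)
    patches = image.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch, patch, C)
    S = patches.shape[0]
    # Step 2 - uniform random permutation of patch indices (fixed seed
    # for reproducibility, as in the text)
    rng = np.random.default_rng(seed)
    perm = rng.permutation(S)
    # Step 3 - masking execution: the first `ratio` fraction is discarded
    n_masked = int(S * ratio)
    visible_idx = np.sort(perm[n_masked:])
    return patches[visible_idx], visible_idx

img = np.arange(64 * 64 * 3, dtype=np.float32).reshape(64, 64, 3)
visible, idx = random_mask_patches(img)  # 16 patches, 40% masked, 10 kept
```

With a 64×64 input and 16×16 patches there are 16 patches, of which 6 are masked at the 40% ratio, leaving 10 visible patches as the sparse input.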
The sparse encoding module, as shown in Figure 2, takes a sequence of $S$ non-overlapping image patches as input. Let $X \in \mathbb{R}^{S \times C}$ denote the input feature matrix, where $C$ is the feature dimension. The resolution of the original input image is $H \times W$, the resolution of each image patch is $P \times P$, and the resulting number of patches is $S = HW/P^2$. All image patches are projected using the same projection matrix to ensure consistency of information.
To efficiently extract features, the model performs sparse convolution followed by max pooling operations for downsampling. The operations at each position (referred to as channel mixing in this paper) are separated from the operations across positions (referred to as patch mixing in this paper).
The first component is the patch mixing ConvNeXt V2. It allows communication between different spatial positions and operates independently on each channel, acting on the columns of the input matrix (i.e., the transposed input). It shares parameters across all columns. The second component is the channel mixing ConvNeXt V2. It enables communication between different channels and operates independently on each patch, acting on the rows of the matrix, sharing parameters across all rows.
By applying patch mixing on spatial positions and channel mixing on feature channels, we extract crack features and generate rich feature representations. The process of the sparse encoding module is formally described by Equations (1) and (2):

$$U = T_2\Big(\mathrm{CNX}\Big(\mathrm{LN}\big(T_1\big(\mathrm{Pool}(\mathrm{SparseConv}(X))\big)\big)\Big)\Big) \quad (1)$$

$$Y = \mathrm{CNX}\big(\mathrm{LN}(U)\big) \quad (2)$$

In these equations, $X$ denotes the input feature map, $U$ represents the intermediate features, and $Y$ denotes the output of the sparse encoding module. $\mathrm{CNX}$ represents the ConvNeXt V2 block operation, and LN denotes Layer Normalization. Pool stands for max pooling, and SparseConv represents the sparse convolution operation. Crucially, $T_1$ and $T_2$ represent the first and second transposition operations, respectively, which align the tensor dimensions for patch mixing and channel mixing.
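The patch-mixing/channel-mixing split can be made concrete with a minimal numpy sketch. The ConvNeXt V2 blocks and sparse convolution are replaced here by plain linear maps for illustration; sizes and weight names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
S, C = 8, 4                              # S patches, C channels (assumed sizes)
X = rng.standard_normal((S, C))

W_patch = rng.standard_normal((S, S))    # shared across all C columns
W_chan = rng.standard_normal((C, C))     # shared across all S rows

# Patch mixing: act on the columns of X via the transposed matrix (T1),
# sharing W_patch across channels; transpose back (T2) afterwards.
Xt = X.T                                 # (C, S)
U = (Xt @ W_patch).T                     # mixes across spatial positions

# Channel mixing: act on the rows, sharing W_chan across patches.
Y = U @ W_chan                           # mixes across channels
```

Because `W_patch` is applied identically to every channel column and `W_chan` to every patch row, the sketch mirrors the parameter sharing described above.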
To ensure smooth feature flow between the sparse encoder and the dense decoder, a sparse-to-dense projection layer is introduced. After sparse convolution and pooling, the output tensor is converted into a dense grid by filling unobserved spatial positions with zeros and performing bilinear interpolation to restore spatial continuity. This operation provides the dense ConvNeXt V2 decoder with a complete feature map while preserving the efficiency and sparsity benefits of the encoder.
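The zero-filling step of this projection can be sketched as follows; shapes and the function name are illustrative, and the subsequent bilinear interpolation is omitted for brevity.

```python
import numpy as np

def sparse_to_dense(tokens, indices, grid_hw, dim):
    """Scatter visible-token features back to their grid positions.
    tokens: (N_visible, dim); indices: flat grid positions of the
    visible tokens; returns a dense (H, W, dim) feature map with
    unobserved positions zero-filled."""
    H, W = grid_hw
    dense = np.zeros((H * W, dim), dtype=tokens.dtype)
    dense[indices] = tokens                 # fill observed positions only
    return dense.reshape(H, W, dim)

tokens = np.ones((3, 2), dtype=np.float32)  # 3 visible tokens, dim 2
dense = sparse_to_dense(tokens, np.array([0, 5, 15]), (4, 4), 2)
```

In the full module, bilinear interpolation over this zero-filled grid would then restore spatial continuity before the dense decoder.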
We apply a lightweight ConvNeXt V2 module as a decoder, resulting in an asymmetric encoder–decoder structure, where the encoder is more complex and hierarchical. This asymmetric design prevents the model from copying and pasting mask area information to unknown regions via shortcuts, effectively preventing information leakage. The decoder must make predictions based on an understanding of the global context rather than simple copy-paste operations.
The ConvNeXt V2 module is shown in Figure 3. The computation method is mathematically described by Equation (3):

$$Y = X + W_2\,\mathrm{GRN}\Big(\mathrm{GELU}\big(W_1\,\mathrm{LN}(\mathrm{DWConv}(X))\big)\Big) \quad (3)$$

where $X$ and $Y$ denote the input and output features, $\mathrm{DWConv}$ is a depthwise convolution, $W_1$ and $W_2$ are pointwise ($1 \times 1$) convolutions, LN denotes Layer Normalization, GELU represents the GELU activation function, and GRN stands for Global Response Normalization.
First, global feature aggregation is performed using a global function $\mathcal{G}(\cdot)$. This transforms the spatial feature map $X \in \mathbb{R}^{H \times W \times C}$ into a global feature vector $gx \in \mathbb{R}^{C}$ to capture overall contextual information. As shown in Equation (4), this is achieved by computing the $L_2$-norm for each channel:

$$\mathcal{G}(X) = gx = \big\{\,\|X_1\|_2, \|X_2\|_2, \ldots, \|X_C\|_2\,\big\} \quad (4)$$

where $X_i$ represents the feature map of the $i$-th channel.
Feature normalization is then applied to compute a relative importance score for each channel. We define a normalization function $\mathcal{N}(\cdot)$ which divides the feature norm of the current channel by the global aggregated norm. As shown in Equation (5):

$$\mathcal{N}(\|X_i\|_2) = \frac{\|X_i\|_2}{\frac{1}{C}\sum_{j=1}^{C}\|X_j\|_2 + \epsilon} \quad (5)$$

where $\epsilon$ is a small constant ensuring numerical stability. This normalization creates feature competition, suppressing redundant channels while highlighting discriminative ones.
The final step is feature calibration, which uses the computed normalization scores to adjust the original input responses. This ensures that responses fully exhibit their representational capability while maintaining trainability. The approach enhances the diversity and discriminative power of the learned features, as formulated in Equation (6):

$$\hat{X}_i = X_i \cdot \mathcal{N}(\|X_i\|_2) \quad (6)$$

Here, $\hat{X}_i$ represents the calibrated feature map of the $i$-th channel. To optimize this process and provide greater flexibility, learnable parameters $\gamma$ and $\beta$ (initialized to zero) are introduced via a residual connection. The final computation of the GRN block is shown in Equation (7):

$$X_i' = \gamma \cdot X_i \cdot \mathcal{N}(\|X_i\|_2) + \beta + X_i \quad (7)$$
In the early stages of training, the GRN layer approximates an identity mapping due to the zero initialization. As training progresses, it adapts to optimize the network’s learning requirements. The residual connection ensures stability and convergence, while the normalization enhances feature responses over time.
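The GRN computation described by Equations (4)–(7) can be sketched in a few lines of numpy (shapes assumed, scalars in place of learnable per-channel parameters):

```python
import numpy as np

def grn(x, gamma=0.0, beta=0.0, eps=1e-6):
    """Global Response Normalization sketch. x: (H, W, C) feature map.
    gamma/beta are the zero-initialized learnable parameters, so GRN
    starts out as an identity mapping."""
    gx = np.linalg.norm(x, axis=(0, 1))   # Eq. (4): L2 norm per channel
    nx = gx / (gx.mean() + eps)           # Eq. (5): relative importance
    calibrated = x * nx                   # Eq. (6): feature calibration
    return gamma * calibrated + beta + x  # Eq. (7): residual form

x = np.random.default_rng(0).standard_normal((4, 4, 8))
y = grn(x)          # with gamma = beta = 0, output equals input
y2 = grn(x, gamma=1.0)  # non-zero gamma activates the competition term
```

The zero initialization makes the early-training identity behaviour described above directly visible: `y` equals `x`, while a non-zero `gamma` reweights channels by their relative norms.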
3.3. Channel Attention Bridge Module and Spatial Attention Bridge Module
The Channel Attention Bridge (CAB) and Spatial Attention Bridge (SAB) modules achieve multi-stage, multi-scale feature information fusion. The CAB module consists of global average pooling, concatenation operations, fully connected layers, and the Sigmoid activation function. Meanwhile, the SAB module integrates max pooling, average pooling, and dilated convolution operations to improve the model’s convergence speed and enhance its sensitivity to crack features.
Acquiring and fusing multi-stage information is crucial. As shown in Figure 4, the CAB module integrates features by generating channel attention maps. Let $F_k$ denote the input feature map from the $k$-th stage (where $k = 1, \ldots, K$). The fusion process is mathematically formulated as follows:

$$z_k = \mathrm{Concat}\big(z_{k-1}, \mathrm{GAP}(F_k)\big)$$

$$f_k = \mathrm{FC}\big(\mathrm{Conv1D}(z_k)\big)$$

$$A_k = \sigma(f_k), \qquad \hat{F}_k = A_k \odot F_k$$

where GAP denotes global average pooling, $z_{k-1}$ represents the pooled vector from the previous stage, and $\mathrm{Concat}$ represents the concatenation operation along the channel dimension. $z_k$ is the concatenated multi-scale feature vector. $\mathrm{Conv1D}$ denotes the 1D convolution operation, and $f_k$ corresponds to the fully connected layer output at stage $k$. $\sigma$ is the sigmoid function, $A_k$ denotes the generated channel attention map, and $\odot$ represents element-wise multiplication. Finally, $\hat{F}_k$ represents the fused output feature map. CAB splits the multi-stage fusion into local and global contexts to provide richer attention maps.
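A single CAB stage can be sketched in numpy as follows; the FC weight shape and the omission of the 1D convolution are simplifying assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cab_stage(feat, pooled_prev, W_fc):
    """feat: (H, W, C) stage-k features; pooled_prev: GAP vector carried
    over from earlier stages; W_fc: hypothetical (C, len(pooled_prev)+C)
    fully connected weights standing in for Conv1D + FC."""
    g = feat.mean(axis=(0, 1))            # GAP over spatial dimensions
    z = np.concatenate([pooled_prev, g])  # multi-stage concatenation
    attn = sigmoid(W_fc @ z)              # channel attention map, shape (C,)
    return feat * attn                    # reweight channels, broadcast H, W

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 4, 6))
out = cab_stage(feat, rng.standard_normal(6), rng.standard_normal((6, 12)))
```

Since the sigmoid attention lies in (0, 1), each channel is attenuated in proportion to its learned importance rather than amplified.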
As shown in Figure 5, the SAB module fuses multi-level and multi-scale information along the spatial axis. First, average pooling and max pooling are performed along the channel dimension. The results are concatenated to form a two-channel map. Next, dilated convolution is applied to enhance feature representation. Finally, a spatial attention map is generated via the Sigmoid function and multiplied with the original features, with residual information added for fusion.
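The SAB pipeline can be sketched as follows; the learned dilated convolution is replaced by fixed stand-in weights, which is an assumption made purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sab(feat, w=(0.5, 0.5)):
    """feat: (H, W, C). `w` stands in for the learned weights that fuse
    the two pooled channels (the real module uses dilated convolution)."""
    avg_map = feat.mean(axis=2)             # channel-wise average pooling
    max_map = feat.max(axis=2)              # channel-wise max pooling
    fused = w[0] * avg_map + w[1] * max_map # fuse the two-channel map
    attn = sigmoid(fused)[..., None]        # spatial attention, (H, W, 1)
    return feat * attn + feat               # reweight + residual fusion

feat = np.random.default_rng(1).standard_normal((4, 4, 3))
out = sab(feat)
```

The residual term guarantees that the attended output never suppresses the original response below its input magnitude, which aids convergence as noted above.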
3.4. Binary Attention Module
In this paper, a Binary Attention Module (BAM) is designed to fuse different features, thereby enhancing the crack detection performance. As shown in Figure 6, when the input is $X$, BAM uses Linear Attention (LA) and Attachment Linear Attention (ALA) to account for pixel relationships, enhancing global dependency.

BAM processes the input $X$ through both attention branches. It then employs a convolutional layer with batch normalization (BN) and ReLU activation. The resulting map is added to the original $X$ to obtain the refined features. The mathematical representation is shown in Equation (13):

$$Y = X + \mathrm{ReLU}\Big(\mathrm{BN}\big(\mathrm{Conv}_{1\times 1}\big(\mathrm{Concat}\big(\mathrm{LA}(X), \mathrm{ALA}(X)\big)\big)\big)\Big) \quad (13)$$

where $\mathrm{Conv}_{1\times 1}$ represents a standard $1 \times 1$ convolution.
In the proposed BAM, two complementary branches are employed, as shown in Figure 7: Linear Attention (LA) and Attachment Linear Attention (ALA). The LA branch follows the general formulation of kernel-based linear attention [22], which approximates standard softmax attention with linear complexity:

$$\mathrm{LA}(Q, K, V) = \frac{\phi(Q)\big(\phi(K)^{\top} V\big)}{\phi(Q)\big(\phi(K)^{\top} \mathbf{1}\big)}$$

where $Q$, $K$, and $V$ are the query, key, and value matrices, and $\phi(\cdot)$ denotes a non-negative kernel mapping function.
The ALA branch extends this structure by introducing an adaptive weighting factor to modulate the attention response according to local feature similarity:

$$\mathrm{ALA}(Q, K, V) = \alpha \odot \mathrm{LA}(Q, K, V)$$

where $\alpha$ is a channel-wise gating coefficient generated by a lightweight convolutional operation followed by sigmoid activation, formulated as $\alpha = \sigma(W_{\alpha} * X)$, where $W_{\alpha}$ denotes the learnable weights of the convolution operation. This attachment mechanism allows ALA to adaptively emphasize semantically related regions, improving sensitivity to thin and irregular crack structures.