3.1. Model Structure
The proposed SpcNet architecture is illustrated in Figure 1. It consists of four key components: a Sparse Encoding Module, a ConvNeXt V2–based Decoder, a Binary Attention Module (BAM), and Channel/Spatial Attention Bridge (CAB/SAB) modules for multi-scale feature fusion.
First, a random masking strategy is applied to the input image to generate a sparse pixel distribution, simulating incomplete visual information and encouraging the network to infer missing regions from contextual cues. The Sparse Encoding Module, built upon sparse convolution, processes only visible data points to reduce unnecessary computation while maintaining accurate feature extraction. This design effectively balances performance and computational load, making it particularly suitable for lightweight crack detection.
The encoder comprises Patch Mixing and Channel Mixing submodules. The former facilitates spatial feature interaction, while the latter enhances inter-channel communication, enabling rich and discriminative representations of cracks.
The decoder adopts a lightweight ConvNeXt V2 module, an evolution of the original ConvNeXt architecture. ConvNeXt V2 introduces Global Response Normalization (GRN) and GELU activation to improve gradient stability and representation smoothness. Its asymmetric design—where the encoder is deeper and more hierarchical than the decoder—prevents shortcut information copying from masked to visible regions, compelling the model to learn genuine contextual understanding rather than pixel replication. This asymmetric encoder–decoder structure thus enforces semantic reconstruction and enhances generalization to diverse crack patterns.
Finally, the Binary Attention Module (BAM) is integrated to strengthen global dependency modeling. BAM employs Linear Attention (LA) and Attachment Linear Attention (ALA) to dynamically capture long-range spatial relationships between crack regions. The CAB and SAB modules then fuse multi-scale features across channels and spatial dimensions, respectively, ensuring precise localization and complete crack contours. The model concludes with a global average pooling layer and output prediction head for final crack segmentation.
Overall, the SpcNet framework leverages sparse encoding for efficiency, ConvNeXt V2 for stability, and binary attention for enhanced global context understanding—jointly achieving high-accuracy, lightweight crack detection suitable for real-world pavement monitoring.
3.2. Sparse Encoding Module
A random masking strategy is applied to the original input crack image, using a high masking ratio to force the model to predict the masked portions based on the limited remaining information. This encourages the model to generate effective learning signals, despite the loss of content.
Specifically, the random masking strategy is implemented via a uniform sampling mechanism to ensure unbiased context learning. The process follows three key steps:
Patch Partitioning: The input image is divided into non-overlapping patches of size 16×16.
Uniform Random Permutation: We generate a random permutation of the patch indices following a uniform distribution. This ensures global independence among masked regions, distinguishing our approach from path-dependent strategies like random walks.
Masking Execution: A fixed masking ratio of 40% is applied. The first 40% of the permuted indices are discarded (masked), while the remaining patches constitute the sparse input.
We empirically validated masking ratios of 20%, 40%, and 60%, finding that 40% offers the optimal trade-off between reconstruction difficulty and semantic preservation. To ensure reproducibility, the random seed is fixed and the mask pattern is deterministically regenerated at each epoch.
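The three masking steps above can be sketched as follows. This is a minimal numpy illustration with assumed shapes and an illustrative function name; the paper's exact implementation is not shown.

```python
import numpy as np

def random_mask_patches(image, patch=16, ratio=0.4, seed=0):
    """Split `image` (H, W, C) into non-overlapping patches, mask `ratio`
    of them via a uniform random permutation, and return the visible
    patches together with their grid indices."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    # Step 1 - patch partitioning: (S, patch, patch, C), S = (H/patch)*(W/patch)
    patches = image.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch, patch, C)
    S = patches.shape[0]
    # Step 2 - uniform random permutation of patch indices (fixed seed
    # for reproducibility, as in the text)
    rng = np.random.default_rng(seed)
    perm = rng.permutation(S)
    # Step 3 - masking execution: the first `ratio` fraction is discarded
    n_masked = int(S * ratio)
    visible_idx = np.sort(perm[n_masked:])
    return patches[visible_idx], visible_idx

img = np.arange(64 * 64 * 3, dtype=np.float32).reshape(64, 64, 3)
visible, idx = random_mask_patches(img)  # 16 patches, 40% masked, 10 kept
```

With a 64×64 input and 16×16 patches there are 16 patches, of which 6 are masked at the 40% ratio, leaving 10 visible patches as the sparse input.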
The sparse encoding module, as shown in Figure 2, takes a sequence of $S$ non-overlapping image patches as input. Let $X \in \mathbb{R}^{S \times C}$ denote the input feature matrix, where $C$ is the feature dimension. The resolution of the original input image is $H \times W$, the resolution of each image patch is $P \times P$, and the resulting number of patches is $S = HW/P^2$. All image patches are projected using the same projection matrix to ensure consistency of information.
To efficiently extract features, the model performs sparse convolution followed by max pooling operations for downsampling. The operations at each position (referred to as channel mixing in this paper) are separated from the operations across positions (referred to as patch mixing in this paper).
The first component is the patch mixing ConvNeXt V2. It allows communication between different spatial positions and operates independently on each channel, acting on the columns of the input matrix (i.e., the transposed input). It shares parameters across all columns. The second component is the channel mixing ConvNeXt V2. It enables communication between different channels and operates independently on each patch, acting on the rows of the matrix, sharing parameters across all rows.
By applying patch mixing on spatial positions and channel mixing on feature channels, we extract crack features and generate rich feature representations. The process of the sparse encoding module is formally described by Equations (1) and (2):

$$U = T_2\Big(\mathrm{CNX}\Big(\mathrm{LN}\big(T_1\big(\mathrm{Pool}(\mathrm{SparseConv}(X))\big)\big)\Big)\Big) \quad (1)$$

$$Y = \mathrm{CNX}\big(\mathrm{LN}(U)\big) \quad (2)$$

In these equations, $X$ denotes the input feature map, $U$ represents the intermediate features, and $Y$ denotes the output of the sparse encoding module. $\mathrm{CNX}$ represents the ConvNeXt V2 block operation, and LN denotes Layer Normalization. Pool stands for max pooling, and SparseConv represents the sparse convolution operation. Crucially, $T_1$ and $T_2$ represent the first and second transposition operations, respectively, which align the tensor dimensions for patch mixing and channel mixing.
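The patch-mixing/channel-mixing split can be made concrete with a minimal numpy sketch. The ConvNeXt V2 blocks and sparse convolution are replaced here by plain linear maps for illustration; sizes and weight names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
S, C = 8, 4                              # S patches, C channels (assumed sizes)
X = rng.standard_normal((S, C))

W_patch = rng.standard_normal((S, S))    # shared across all C columns
W_chan = rng.standard_normal((C, C))     # shared across all S rows

# Patch mixing: act on the columns of X via the transposed matrix (T1),
# sharing W_patch across channels; transpose back (T2) afterwards.
Xt = X.T                                 # (C, S)
U = (Xt @ W_patch).T                     # mixes across spatial positions

# Channel mixing: act on the rows, sharing W_chan across patches.
Y = U @ W_chan                           # mixes across channels
```

Because `W_patch` is applied identically to every channel column and `W_chan` to every patch row, the sketch mirrors the parameter sharing described above.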
To ensure smooth feature flow between the sparse encoder and the dense decoder, a sparse-to-dense projection layer is introduced. After sparse convolution and pooling, the output tensor is converted into a dense grid by filling unobserved spatial positions with zeros and performing bilinear interpolation to restore spatial continuity. This operation provides the dense ConvNeXt V2 decoder with a complete feature map while preserving the efficiency and sparsity benefits of the encoder.
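The zero-filling step of this projection can be sketched as follows; shapes and the function name are illustrative, and the subsequent bilinear interpolation is omitted for brevity.

```python
import numpy as np

def sparse_to_dense(tokens, indices, grid_hw, dim):
    """Scatter visible-token features back to their grid positions.
    tokens: (N_visible, dim); indices: flat grid positions of the
    visible tokens; returns a dense (H, W, dim) feature map with
    unobserved positions zero-filled."""
    H, W = grid_hw
    dense = np.zeros((H * W, dim), dtype=tokens.dtype)
    dense[indices] = tokens                 # fill observed positions only
    return dense.reshape(H, W, dim)

tokens = np.ones((3, 2), dtype=np.float32)  # 3 visible tokens, dim 2
dense = sparse_to_dense(tokens, np.array([0, 5, 15]), (4, 4), 2)
```

In the full module, bilinear interpolation over this zero-filled grid would then restore spatial continuity before the dense decoder.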
We apply a lightweight ConvNeXt V2 module as a decoder, resulting in an asymmetric encoder–decoder structure, where the encoder is more complex and hierarchical. This asymmetric design prevents the model from copying and pasting mask area information to unknown regions via shortcuts, effectively preventing information leakage. The decoder must make predictions based on an understanding of the global context rather than simple copy-paste operations.
The ConvNeXt V2 module is shown in Figure 3. The computation method is mathematically described by Equation (3):

$$Y = X + W_2\,\mathrm{GRN}\Big(\mathrm{GELU}\big(W_1\,\mathrm{LN}(\mathrm{DWConv}(X))\big)\Big) \quad (3)$$

where $X$ and $Y$ denote the input and output features, $\mathrm{DWConv}$ is a depthwise convolution, $W_1$ and $W_2$ are pointwise ($1 \times 1$) convolutions, LN denotes Layer Normalization, GELU represents the GELU activation function, and GRN stands for Global Response Normalization.
First, global feature aggregation is performed using a global function $\mathcal{G}(\cdot)$. This transforms the spatial feature map $X \in \mathbb{R}^{H \times W \times C}$ into a global feature vector $gx \in \mathbb{R}^{C}$ to capture overall contextual information. As shown in Equation (4), this is achieved by computing the $L_2$-norm for each channel:

$$\mathcal{G}(X) = gx = \big\{\,\|X_1\|_2, \|X_2\|_2, \ldots, \|X_C\|_2\,\big\} \quad (4)$$

where $X_i$ represents the feature map of the $i$-th channel.
Feature normalization is then applied to compute a relative importance score for each channel. We define a normalization function $\mathcal{N}(\cdot)$ which divides the feature norm of the current channel by the global aggregated norm. As shown in Equation (5):

$$\mathcal{N}(\|X_i\|_2) = \frac{\|X_i\|_2}{\frac{1}{C}\sum_{j=1}^{C}\|X_j\|_2 + \epsilon} \quad (5)$$

where $\epsilon$ is a small constant ensuring numerical stability. This normalization creates feature competition, suppressing redundant channels while highlighting discriminative ones.
The final step is feature calibration, which uses the computed normalization scores to adjust the original input responses. This ensures that responses fully exhibit their representational capability while maintaining trainability. The approach enhances the diversity and discriminative power of the learned features, as formulated in Equation (6):

$$\hat{X}_i = X_i \cdot \mathcal{N}(\|X_i\|_2) \quad (6)$$

Here, $\hat{X}_i$ represents the calibrated feature map of the $i$-th channel. To optimize this process and provide greater flexibility, learnable parameters $\gamma$ and $\beta$ (initialized to zero) are introduced via a residual connection. The final computation of the GRN block is shown in Equation (7):

$$X_i' = \gamma \cdot X_i \cdot \mathcal{N}(\|X_i\|_2) + \beta + X_i \quad (7)$$
In the early stages of training, the GRN layer approximates an identity mapping due to the zero initialization. As training progresses, it adapts to optimize the network’s learning requirements. The residual connection ensures stability and convergence, while the normalization enhances feature responses over time.
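The GRN computation described by Equations (4)–(7) can be sketched in a few lines of numpy (shapes assumed, scalars in place of learnable per-channel parameters):

```python
import numpy as np

def grn(x, gamma=0.0, beta=0.0, eps=1e-6):
    """Global Response Normalization sketch. x: (H, W, C) feature map.
    gamma/beta are the zero-initialized learnable parameters, so GRN
    starts out as an identity mapping."""
    gx = np.linalg.norm(x, axis=(0, 1))   # Eq. (4): L2 norm per channel
    nx = gx / (gx.mean() + eps)           # Eq. (5): relative importance
    calibrated = x * nx                   # Eq. (6): feature calibration
    return gamma * calibrated + beta + x  # Eq. (7): residual form

x = np.random.default_rng(0).standard_normal((4, 4, 8))
y = grn(x)          # with gamma = beta = 0, output equals input
y2 = grn(x, gamma=1.0)  # non-zero gamma activates the competition term
```

The zero initialization makes the early-training identity behaviour described above directly visible: `y` equals `x`, while a non-zero `gamma` reweights channels by their relative norms.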
3.3. Channel Attention Bridge Module and Spatial Attention Bridge Module
The Channel Attention Bridge (CAB) and Spatial Attention Bridge (SAB) modules achieve multi-stage, multi-scale feature information fusion. The CAB module consists of global average pooling, concatenation operations, fully connected layers, and the Sigmoid activation function. Meanwhile, the SAB module integrates max pooling, average pooling, and dilated convolution operations to improve the model’s convergence speed and enhance its sensitivity to crack features.
Acquiring and fusing multi-stage information is crucial. As shown in Figure 4, the CAB module integrates features by generating channel attention maps. Let $F_k$ denote the input feature map from the $k$-th stage (where $k = 1, \ldots, K$). The fusion process is mathematically formulated as follows:

$$z_k = \mathrm{Concat}\big(z_{k-1}, \mathrm{GAP}(F_k)\big)$$

$$f_k = \mathrm{FC}\big(\mathrm{Conv1D}(z_k)\big)$$

$$A_k = \sigma(f_k), \qquad \hat{F}_k = A_k \odot F_k$$

where GAP denotes global average pooling, $z_{k-1}$ represents the pooled vector from the previous stage, and $\mathrm{Concat}$ represents the concatenation operation along the channel dimension. $z_k$ is the concatenated multi-scale feature vector. $\mathrm{Conv1D}$ denotes the 1D convolution operation, and $f_k$ corresponds to the fully connected layer output at stage $k$. $\sigma$ is the sigmoid function, $A_k$ denotes the generated channel attention map, and $\odot$ represents element-wise multiplication. Finally, $\hat{F}_k$ represents the fused output feature map. CAB splits the multi-stage fusion into local and global contexts to provide richer attention maps.
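A single CAB stage can be sketched in numpy as follows; the FC weight shape and the omission of the 1D convolution are simplifying assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cab_stage(feat, pooled_prev, W_fc):
    """feat: (H, W, C) stage-k features; pooled_prev: GAP vector carried
    over from earlier stages; W_fc: hypothetical (C, len(pooled_prev)+C)
    fully connected weights standing in for Conv1D + FC."""
    g = feat.mean(axis=(0, 1))            # GAP over spatial dimensions
    z = np.concatenate([pooled_prev, g])  # multi-stage concatenation
    attn = sigmoid(W_fc @ z)              # channel attention map, shape (C,)
    return feat * attn                    # reweight channels, broadcast H, W

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 4, 6))
out = cab_stage(feat, rng.standard_normal(6), rng.standard_normal((6, 12)))
```

Since the sigmoid attention lies in (0, 1), each channel is attenuated in proportion to its learned importance rather than amplified.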
As shown in Figure 5, the SAB module fuses multi-level and multi-scale information along the spatial axis. First, average pooling and max pooling are performed along the channel dimension. The results are concatenated to form a two-channel map. Next, dilated convolution is applied to enhance feature representation. Finally, a spatial attention map is generated via the Sigmoid function and multiplied with the original features, with residual information added for fusion.
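The SAB pipeline can be sketched as follows; the learned dilated convolution is replaced by fixed stand-in weights, which is an assumption made purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sab(feat, w=(0.5, 0.5)):
    """feat: (H, W, C). `w` stands in for the learned weights that fuse
    the two pooled channels (the real module uses dilated convolution)."""
    avg_map = feat.mean(axis=2)             # channel-wise average pooling
    max_map = feat.max(axis=2)              # channel-wise max pooling
    fused = w[0] * avg_map + w[1] * max_map # fuse the two-channel map
    attn = sigmoid(fused)[..., None]        # spatial attention, (H, W, 1)
    return feat * attn + feat               # reweight + residual fusion

feat = np.random.default_rng(1).standard_normal((4, 4, 3))
out = sab(feat)
```

The residual term guarantees that the attended output never suppresses the original response below its input magnitude, which aids convergence as noted above.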
3.4. Binary Attention Module
In this paper, a Binary Attention Module (BAM) is designed to fuse different features, thereby enhancing the crack detection performance. As shown in Figure 6, when the input is $X$, BAM uses Linear Attention (LA) and Attachment Linear Attention (ALA) to account for pixel relationships, enhancing global dependency.

BAM processes the input $X$ through both attention branches. It then employs a convolutional layer with batch normalization (BN) and ReLU activation. The resulting map is added to the original $X$ to obtain the refined features. The mathematical representation is shown in Equation (13):

$$Y = X + \mathrm{ReLU}\Big(\mathrm{BN}\big(\mathrm{Conv}_{1\times 1}\big(\mathrm{Concat}\big(\mathrm{LA}(X), \mathrm{ALA}(X)\big)\big)\big)\Big) \quad (13)$$

where $\mathrm{Conv}_{1\times 1}$ represents a standard $1 \times 1$ convolution.
In the proposed BAM, two complementary branches are employed, as shown in Figure 7: Linear Attention (LA) and Attachment Linear Attention (ALA). The LA branch follows the general formulation of kernel-based linear attention [22], which approximates standard softmax attention with linear complexity:

$$\mathrm{LA}(Q, K, V) = \frac{\phi(Q)\big(\phi(K)^{\top} V\big)}{\phi(Q)\big(\phi(K)^{\top} \mathbf{1}\big)}$$

where $Q$, $K$, and $V$ are the query, key, and value matrices, and $\phi(\cdot)$ denotes a non-negative kernel mapping function.
The ALA branch extends this structure by introducing an adaptive weighting factor to modulate the attention response according to local feature similarity:

$$\mathrm{ALA}(Q, K, V) = \alpha \odot \mathrm{LA}(Q, K, V)$$

where $\alpha$ is a channel-wise gating coefficient generated by a lightweight convolutional operation followed by sigmoid activation, formulated as $\alpha = \sigma(W_{\alpha} * X)$, where $W_{\alpha}$ denotes the learnable weights of the convolution operation. This attachment mechanism allows ALA to adaptively emphasize semantically related regions, improving sensitivity to thin and irregular crack structures.