Article

A Transformer-Based Pavement Crack Segmentation Model with Local Perception and Auxiliary Convolution Layers

Yi Zhu, Ting Cao and Yiqing Yang
1 Shaanxi Railway and Underground Traffic Engineering Key Laboratory, Xi’an 710043, China
2 School of Computer Science and Engineering, Xi’an University of Technology, Xi’an 710048, China
* Authors to whom correspondence should be addressed.
Electronics 2025, 14(14), 2834; https://doi.org/10.3390/electronics14142834
Submission received: 24 March 2025 / Revised: 9 June 2025 / Accepted: 11 June 2025 / Published: 15 July 2025

Abstract

Crack detection in complex pavement scenarios remains challenging due to the sparse small-target features and computational inefficiency of existing methods. To address these limitations, this study proposes an enhanced architecture based on Mask2Former. The framework integrates two key innovations. A Local Perception Module (LPM) reconstructs geometric topological relationships through a Sequence-Space Dynamic Transformation Mechanism (DS2M), enhancing neighborhood feature extraction via depthwise separable convolutions. Simultaneously, an Auxiliary Convolutional Layer (ACL) combines lightweight residual convolutions with shallow high-resolution features, preserving critical edge details through channel attention weighting. Experimental evaluations demonstrate the model’s superior performance, achieving improvements of 3.2% in mIoU and 2.7% in mAcc compared to baseline methods, while maintaining computational efficiency with only 12.8 GFLOPs. These results validate the effectiveness of geometric relationship modeling and hierarchical feature fusion for pavement crack detection, suggesting practical potential for infrastructure maintenance systems. The proposed approach balances precision and efficiency, offering a viable solution for real-world applications with complex crack patterns and hardware constraints.

1. Introduction

With the aging of global road infrastructure and the surge in demand for smart maintenance, precise pavement crack segmentation has become one of the core challenges in intelligent transportation systems. Although general segmentation models, such as Mask2Former and SegFormer, which are based on Transformer architectures, have made groundbreaking advances in natural scene tasks, they face three unique challenges in the pavement crack domain: (1) morphological imbalance (the average width of cracks is only 2–5 pixels, accounting for less than 2% of the image area), (2) difficulty in coupling local and global features (the linear topology requires simultaneous capture of millimeter-level local fractures and meter-level continuous distributions), and (3) heterogeneous data sources (material differences between asphalt and concrete lead to a feature drift rate of up to 34%).
Semantic segmentation involves pixel-level analysis, where each pixel in the image is assigned to a corresponding semantic category. Compared to image classification [1], semantic segmentation requires pixel-level understanding; representative architectures include Fully Convolutional Networks (FCNs), U-Net, and DeepLab [2]. Since its introduction, the Transformer has demonstrated excellent performance in natural language processing, and after Google published the Vision Transformer (ViT), Transformer-based methods have surpassed traditional Convolutional Neural Network (CNN) methods in computer vision.
However, semantic segmentation of pavement cracks differs from that of general objects. The semantic segmentation of pavement cracks presents several challenges. The number of crack pixels is small relative to the whole image area, leading to a class imbalance between crack and background pixels. During feature extraction, crack pixels are prone to gradually disappear. Cracks span large spatial areas and are not contiguous; instead, they form linear or grid-like patterns.
Furthermore, public semantic segmentation datasets for pavement cracks are scarce. The complex shapes of cracks make data annotation time-consuming and labor-intensive, and producing high-quality datasets is even more difficult [3]. The limited number of publicly available pavement crack segmentation datasets also increases the risk of model overfitting and results in poor generalization [4].
To address these challenges, we propose a hybrid segmentation network with two key innovations: (1) a Local Perception Module (LPM) to enhance local feature sensitivity within the Transformer framework, and (2) an Auxiliary Convolutional Layer (ACL) to preserve high-resolution edge features. These modules directly tackle the limitations of poor small-target representation and weak edge retention in existing models.

2. Related Work

2.1. Computer Vision Methods

The main difficulty in pavement crack detection lies in extracting the features of the cracks. Various algorithms have been proposed to address this challenge. Amhaz et al. [5] first employed Dijkstra’s algorithm to compute the shortest path for extracting candidate crack regions, followed by morphological post-processing techniques to further optimize the crack segmentation results. Cracks exhibit strong edge features, where the grayscale values of the crack regions show significant step-like changes, while the background shows relatively slow changes and small gradients. Therefore, edge detection algorithms are widely used in crack segmentation [6]. Zhao [7] proposed an improved Canny edge detection method for road edge detection. While edge-based algorithms are widely applied in crack segmentation, their main limitation is that they can only be applied to a set of non-intersecting crack segments, making it difficult to achieve ideal segmentation accuracy. Cracks possess both locality and continuity, prompting some researchers to apply region-growing techniques for crack detection.
Zhang [8] initially fused spatial distribution, intensity, and geometric features to extract regions for coarse crack localization, followed by post-processing. A confidence factor was defined to extract regions with sufficient confidence as seeds, and region-growing algorithms were used to merge highly similar regions to ensure the completeness of the detected cracks, while removing regions with low similarity. Due to the similarity in intensity between pavement shadows and crack pixels, shadows often interfere with crack detection, leading to false positives. Zou [9] employed region-growing techniques to identify potential crack pixels and established a crack probability map using tensor voting to extract crack seeds. They then generated a graph model and derived the final crack detection result from the minimum spanning tree of the graph.

2.2. Deep Learning Methods

Researchers have focused on applying various deep learning methods for pavement crack detection. R-CNN is a convolutional neural network-based object detection algorithm, and several improved versions have been proposed, such as Fast Region-based Convolutional Network (Fast R-CNN) [10] and Mask Region-based Convolutional Neural Network (Mask R-CNN). Based on these R-CNN algorithms, methods for pavement crack detection have been developed. Kumar’s team [11] improved the final layer of the Mask R-CNN model and achieved excellent classification and localization performance on a concrete damage dataset, providing performance analysis of the model. Fujita et al. [12] used a pre-trained Mask R-CNN model for pavement detection tasks and adopted a modified confusion matrix model to avoid using the Intersection over Union (IoU) metric. The YOLO object detection algorithm is a real-time object detection algorithm that simultaneously predicts both the position and category of objects in a single forward pass. Many researchers have explored pavement defect detection methods based on YOLO. Du et al. [13] proposed a lightweight pavement crack detection model designed with a denoising autoencoder network to remove background noise. Wang et al. [14] proposed an improved road defect detection model based on YOLOv8s, reconstructing the neck structure of the YOLOv8s model to reduce the number of parameters, computational load, and model size, thus enhancing feature fusion capabilities and optimizing the feature pyramid layers for improved speed. Zhang et al. [15] proposed an intelligent identification algorithm for tunnel lining cracks based on YOLOv11 and constructed a dedicated crack dataset for model training and testing. The model also integrated an attention mechanism to enhance the extraction of critical crack features, with only 6.33 M parameters, while maintaining reasonable evaluation metrics.
The Transformer model [16] replaces the recurrent neural network (RNN) structure commonly used in natural language processing tasks with a self-attention mechanism. Compared to RNNs, the key advantage of the Transformer is its ability to perform parallel computation. Transformer-based vision detection models show superior modeling capabilities over traditional CNN models, effectively capturing long-range dependencies in images and adapting flexibly to features at different scales. Ji et al. [17] proposed a transformer-based TransUnet deep learning model for crack detection, combining convolutional encoders and self-attention mechanisms to improve feature extraction capabilities, and quantitatively analyzed the morphological features of detected cracks, such as length, width, and area, to provide a reference for road condition assessment. Guo et al. [18] developed a Transformer-based semantic segmentation network using the Swin Transformer as the encoder and the UperNet model based on attention mechanisms as the decoder, achieving accurate pixel-level pavement crack detection. Wang et al. [19] proposed a multi-attention weakly supervised hybrid network named CGTr-Net, which effectively integrated the advantages of CNN and Transformer and performed well in pavement crack detection.
Lin et al. [20] utilized crack block grids and position embedding for feature recovery and pixel-level prediction, achieving significant advantages in damage detection and profile extraction. Shamsabadi et al. [21] proposed the TransUNet hybrid model, which combines the advantages of CNNs and ViT to leverage both local and global features, enhancing the model’s ability to recognize crack features; they reported an mIoU of 75.5% on their dataset. Ding et al. [22] proposed a method based on drones and Transformers, which achieved crack detection and quantification without reference labels through full-field-scale calibration and independent boundary refinement with Transformers, capable of detecting cracks as small as 0.2 mm. Current research on Transformer-based pavement detection largely focuses on improving defect detection and segmentation performance but does not address challenges such as slow training convergence and poor performance on small-target detection.
Mask2Former is a universal image segmentation framework that unifies semantic, instance, and panoptic segmentation using masked attention mechanisms [23]. It has demonstrated state-of-the-art performance in various vision tasks by leveraging learnable queries and pixel-level embeddings. However, due to its transformer-based structure, it often struggles with extracting local details and small-object features, such as narrow cracks in pavement images. Therefore, in this study, we adopt Mask2Former as the backbone framework and propose two targeted enhancements—Local Perception Module (LPM) and Auxiliary Convolutional Layer (ACL)—to address its limitations in the context of fine-grained crack segmentation.

3. Methodology

3.1. Model Architecture

In response to the challenges mentioned in Section 1, such as the scarcity of public datasets and the Transformer’s inefficiency for small targets, this study proposes a hybrid architecture that enhances small target feature retention via auxiliary convolution layers and improves training convergence through local perception modules. These modules mitigate information loss from weak crack signals and facilitate better generalization despite limited data availability.
This study builds upon Mask2Former [24] and introduces an auxiliary convolutional layer to enhance the model’s performance. The auxiliary convolutional layer provides high-resolution information from pavement images, refining the predictions output by the Transformer decoder. Additionally, a local perception module is incorporated within the Transformer to augment the model’s ability to perceive neighborhood information. The overall architecture of the model is depicted in Figure 1.
The architecture is primarily based on Mask2Former, but with modifications tailored to the characteristics of crack semantic segmentation. A crucial addition is the auxiliary convolutional layer, which helps preserve low-dimensional semantics from the backbone network, assisting the classifier in making accurate decisions. The model incorporates a Transformer decoder, which effectively captures global information and long-range dependencies between pixels. However, Transformers struggle with extracting features from small target pixels, such as cracks. To address this issue, we have improved the feed-forward network layer within the Transformer by introducing local perception capabilities, which enhance the model’s ability to represent and process local information effectively.
The auxiliary convolutional module can function in multiple ways. If the device has sufficient computational power, it can work in parallel with the Pixel Decoder. In this case, the results of both the auxiliary module and the Pixel Decoder are fused to produce the final output. However, if the model is to be deployed on edge devices with limited computational resources, the module can be implemented as an auxiliary head. The loss function of the module is then incorporated into the main network’s loss function, guiding the backbone network in feature extraction.
To integrate the proposed Local Perception Module (LPM) and Auxiliary Convolutional Layer (ACL) into the Mask2Former framework, we made the following architectural modifications:
(1) The Local Perception Module (LPM) is embedded in the feed-forward network (FFN) block of each Transformer decoder layer. Specifically, we replace the standard MLP in the FFN with a spatially-aware structure that alternates sequence-to-image and image-to-sequence transformations. This allows the model to retain neighborhood information that is typically ignored in vanilla Transformers.
(2) The Auxiliary Convolutional Layer (ACL) operates in parallel with the pixel decoder in the original Mask2Former architecture. It receives low-level features from the early stages of the Swin Transformer backbone (typically stages 1 and 2) and performs shallow, high-resolution convolution operations to preserve fine edge details. The ACL operates on shallow features (e.g., 160 × 160 resolution) from the backbone, using residual 3 × 3 and 1 × 1 convolutions to preserve spatial detail that is often lost during Transformer encoding. These preserved features are later fused with decoder outputs to refine boundary prediction.
These modifications ensure that the LPM enriches long-range dependencies with local context, while the ACL reinforces low-level feature preservation. The overall fusion improves both the convergence speed and segmentation performance on thin, discontinuous crack patterns.

3.2. Local Perception Module

The success of Transformer models in natural language processing and their subsequent adaptation to image processing tasks has demonstrated their excellent ability to capture global information and long-range dependencies. However, locality is of paramount importance for image processing tasks. Surrounding objects provide additional context for the current region, such as positional relationships, edge information, and color details. Images are typically divided into multiple patches, each of which is converted into a vector known as a token. Although each token is embedded with positional encoding, within the self-attention mechanism, each token learns globally without interacting with spatially adjacent tokens. To enable tokens to capture local information, this study modifies the feed-forward network module by introducing a transformation from sequence to spatial representation. The local perception module is shown in Figure 2.
For an image $X \in \mathbb{R}^{C \times H \times W}$, the image is divided into patches and transformed into a set of tokens $\{\hat{X}_i \in \mathbb{R}^{d} \mid i = 1, \dots, N\}$, where $d = C \times p^2$ and $N = HW/p^2$, so that $\hat{X} \in \mathbb{R}^{N \times d}$. In the self-attention mechanism, the output is given by Formula (1), where $Q = XW_Q$ denotes the query vectors, $K = XW_K$ the key vectors, and $V = XW_V$ the value vectors.
$$Z = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V \qquad (1)$$
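For reference, a minimal PyTorch sketch of the self-attention in Formula (1) (PyTorch is the framework used in Section 4.2); the projection-weight names are illustrative:

```python
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor, w_q: torch.Tensor,
                   w_k: torch.Tensor, w_v: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product self-attention over a token matrix.

    x: (N, d) tokens; w_q, w_k, w_v: (d, d) projection weights.
    Implements Z = softmax(Q K^T / sqrt(d)) V from Formula (1).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.size(-1)
    attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ v
```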
The local perception module is primarily composed of three components: a Sequence-to-Image (Seq2Img) block, a depthwise convolution, and an Image-to-Sequence (Img2Seq) block. The Seq2Img block takes the raw image tokens or the outputs of upper-layer modules and spatially reassembles the sequence into feature maps [25]; once the local information of each token has been captured, the Img2Seq block restores the sequence format. Both transformations are illustrated in Figure 3.
The input sequence therefore passes sequentially through Seq2Img, the depthwise convolution, and Img2Seq. The corresponding formula is given by (2).
$$Z_r = \mathrm{Seq2Img}(Z), \quad Z_r \in \mathbb{R}^{h \times w \times d} \qquad (2)$$
where $Z_r$ denotes the result of spatially reordering the input sequence $Z$.
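A minimal sketch of the two transformations, assuming tokens are stored in row-major spatial order with a (batch, sequence, channel) layout:

```python
import torch

def seq2img(z: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Seq2Img: reorder a token sequence (B, N, d) into a spatial map (B, d, h, w)."""
    b, n, d = z.shape
    assert n == h * w, "sequence length must equal h * w"
    return z.transpose(1, 2).reshape(b, d, h, w)

def img2seq(zr: torch.Tensor) -> torch.Tensor:
    """Img2Seq: flatten a spatial map (B, d, h, w) back into a sequence (B, N, d)."""
    b, d, h, w = zr.shape
    return zr.reshape(b, d, h * w).transpose(1, 2)
```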
In the feed-forward neural network, there are two linear transformation layers, which are replaced by two 1 × 1 convolution layers in this section. The 1 × 1 convolution layers can perform token mapping and facilitate mapping between the input sequence and the image, enhancing the network’s expressive power. The corresponding formula is given by (3).
$$Y = f(ZW_1)W_2 \qquad (3)$$
where $Y$ denotes the output of the 1 × 1 convolutions, $f$ denotes the activation function, $Z$ denotes the input, and $W_1$, $W_2$ denote the convolution kernel weights.
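Combining Formulas (2) and (3), one plausible form of the local perception feed-forward block is sketched below, reusing the seq2img/img2seq helpers above; the hidden dimension, the 3 × 3 depthwise kernel, and the GELU activation are assumptions, since the paper does not specify them:

```python
import torch
import torch.nn as nn

class LocalPerceptionFFN(nn.Module):
    """Sketch of the local perception FFN: the two linear layers of a standard
    FFN are replaced by 1x1 convolutions (Formula (3)), with a depthwise
    convolution in between to capture neighborhood context (Formula (2))."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.conv1 = nn.Conv2d(dim, hidden_dim, kernel_size=1)      # token mapping, W1
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3,
                                padding=1, groups=hidden_dim)       # depthwise local perception
        self.conv2 = nn.Conv2d(hidden_dim, dim, kernel_size=1)      # token mapping, W2
        self.act = nn.GELU()

    def forward(self, z: torch.Tensor, h: int, w: int) -> torch.Tensor:
        zr = seq2img(z, h, w)                     # sequence -> spatial map
        y = self.conv2(self.dwconv(self.act(self.conv1(zr))))
        return img2seq(y)                         # spatial map -> sequence
```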

3.3. Auxiliary Convolution Layer

In the field of computer vision, especially in crack semantic segmentation tasks, the network’s hierarchical structure significantly impacts feature extraction. Specifically, shallow networks tend to capture more spatial information, while deeper networks extract more abstract, higher-level features. Due to cracks being small targets, the number of features extracted in deeper layers is relatively few, resulting in suboptimal crack segmentation accuracy. To improve the model’s segmentation accuracy for small targets like cracks, this study introduces an auxiliary convolution layer in the semantic segmentation network. The structural module of the auxiliary convolution layer is shown in Figure 4.
The auxiliary convolution layer has a simple structure. It extracts multi-scale local features from the output features of the backbone network. The backbone network of Mask2Former is the Swin Transformer, and the features from its first six layers are shared with both the Transformer decoder and the auxiliary convolution layer; the pixel decoder and Transformer decoder generate the mask predictions. In this study, the auxiliary convolution layer comprises six layers, each receiving the corresponding input from the backbone network. The Reshape module in layer $l$ adjusts the size of the input feature $F_l$, a 1 × 1 convolution performs up- and down-sampling, and a 3 × 3 convolution is used for feature learning. To preserve the spatial information of the cracks, residual connections are employed, and each layer consists of two such structures concatenated together.
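Under this description, one plausible sketch of a single ACL layer follows; the batch normalization, ReLU, and bilinear resizing inside the Reshape step are assumptions not fixed by the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualConvUnit(nn.Module):
    """One residual unit: a 1x1 convolution for channel up/down-sampling,
    a 3x3 convolution for feature learning, and a skip connection that
    preserves the spatial information of the cracks."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)
        return x + self.conv(x)

class ACLayer(nn.Module):
    """One of the six ACL layers: reshape the backbone feature F_l to a target
    size, then apply two residual units concatenated in sequence."""

    def __init__(self, in_ch: int, out_ch: int, size: tuple[int, int]):
        super().__init__()
        self.size = size  # target spatial size after the Reshape module
        self.blocks = nn.Sequential(
            ResidualConvUnit(in_ch, out_ch),
            ResidualConvUnit(out_ch, out_ch),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        f = F.interpolate(f, size=self.size, mode="bilinear", align_corners=False)
        return self.blocks(f)
```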
The auxiliary convolution layer enhances the segmentation model’s performance by providing insightful information into the main structure through auxiliary loss. Auxiliary loss is defined using a cross-entropy loss function, as shown in Formula (4).
$$L_{\mathrm{aux}} = \sum_{n=1}^{L=6} \mathrm{norm}\big(\mathrm{CE}(y_n, gt)\big) \qquad (4)$$
In the formula, $L_{\mathrm{aux}}$ represents the auxiliary loss function, $\mathrm{norm}$ denotes the regularization function, $\mathrm{CE}$ is the cross-entropy loss function, $y_n$ represents the prediction of the $n$-th auxiliary layer, and $gt$ stands for the ground-truth labels.
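A minimal sketch of Formula (4); since norm(·) is not specified further, averaging over the six auxiliary outputs is used here as an assumed normalization:

```python
import torch
import torch.nn.functional as F

def auxiliary_loss(aux_logits: list[torch.Tensor], gt: torch.Tensor) -> torch.Tensor:
    """Sum a cross-entropy term over the six auxiliary layers (Formula (4)).

    aux_logits: six (B, C, H, W) prediction maps from the ACL layers.
    gt: (B, H, W) integer ground-truth label map.
    """
    total = torch.zeros((), device=gt.device)
    for y in aux_logits:
        # resize each prediction to the label resolution before computing CE
        y = F.interpolate(y, size=gt.shape[-2:], mode="bilinear", align_corners=False)
        total = total + F.cross_entropy(y, gt)
    return total / len(aux_logits)  # assumed norm(.): average over layers
```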
The purpose of the auxiliary convolution layer proposed in this study is to compensate for the limitations of the Transformer in Mask2Former, specifically in learning local information. By introducing the auxiliary convolution layer, the model’s ability to learn dense local features is strengthened, improving its accuracy in segmenting small objects.

4. Experimental Results

4.1. Dataset and Evaluation Metrics

The CFD dataset was introduced by Shi et al. [26] and contains 118 pavement images captured under complex urban road conditions. It presents various background noises, such as shadows, oil stains, and textures, making it suitable for evaluating generalization. The DeepCrack dataset, proposed by Zou et al. [27], includes 175 annotated high-resolution pavement images, each carefully labeled for semantic segmentation; it is widely used as a benchmark for deep learning-based crack detection models.
In addition to the public CFD (CrackForest Dataset) and DeepCrack datasets, we constructed a self-annotated pavement crack dataset to improve model robustness and diversity. The dataset was collected on urban roads in Xi’an, China using a high-resolution industrial-grade CMOS camera (3106 × 4032 pixels) mounted on a vehicle-based inspection system at a height of 1.2 m. Data acquisition was carried out under various weather and lighting conditions, including sunny, cloudy, and shadowed environments, to enhance generalization.
A total of 442 images were collected and manually labeled using the LabelMe tool at the pixel level. Crack regions were annotated as polygons and double-checked by two experienced annotators. Disagreements were resolved through expert review.
For training and evaluation, the self-built dataset was randomly divided into three subsets: 70% for training, 15% for validation, and 15% for testing. All images were resized to 640 × 640 pixels during preprocessing to ensure consistency across datasets.
These two public datasets were combined with our self-constructed dataset to form a comprehensive training and testing dataset in this study. The statistical information of the entire dataset is presented in Table 1. Given that the datasets come from three different sources, and the images have three different sizes, resizing and cropping operations were applied to standardize the dataset. As a result, the final dataset consists of images with a uniform size of 640 × 640 pixels.
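For illustration, a minimal sketch of the 70/15/15 split and the 640 × 640 standardization described above; the bilinear resampling and the fixed seed are assumptions:

```python
import random
from PIL import Image

def split_dataset(paths: list[str], seed: int = 0):
    """Randomly split image paths into 70% train, 15% val, 15% test."""
    rng = random.Random(seed)
    shuffled = paths[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = int(0.70 * n), int(0.15 * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def standardize(img_path: str, size=(640, 640)) -> Image.Image:
    """Resize an image to the uniform 640 x 640 input resolution."""
    return Image.open(img_path).convert("RGB").resize(size, Image.BILINEAR)
```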
The mean Intersection over Union (mIoU) is used to measure the model’s segmentation accuracy at the pixel level. Given the complex shapes of cracks, mIoU reflects the overlap between the ground truth and predicted results in the pavement crack dataset, providing insight into the model’s performance. The mIoU is calculated as shown in Formula (5).
$$mIoU = \frac{1}{N}\sum_{i=1}^{N}\frac{TP_i}{TP_i + FP_i + FN_i} \qquad (5)$$
where $N$ is the number of classes, $TP_i$ is the number of class-$i$ pixels correctly predicted, $FP_i$ is the number of pixels of other classes incorrectly predicted as class $i$, and $FN_i$ is the number of class-$i$ pixels incorrectly predicted as another class.
The accuracy (Acc) can be used to measure the model’s ability to distinguish between background pixels and crack pixels, calculated as shown in Formula (6). The mean accuracy (mAcc) is the average of the accuracy across all categories.
$$Acc = \frac{TP + TN}{TP + TN + FP + FN} \qquad (6)$$
Due to the class imbalance between crack pixels and background pixels, accuracy alone may not fully reflect model performance. Thus, we introduce the F1 score and mean Average Precision (mAP) to provide a more comprehensive evaluation.
The F1 score is calculated as shown in Formula (7).
$$F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \qquad (7)$$
where precision is the ratio of true positives to all predicted positives, and recall is the ratio of true positives to all actual positives. The mAP is calculated by averaging precision across different recall levels.
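All three metrics can be computed from per-class confusion counts; a sketch follows (the small epsilon only guards against division by zero):

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, num_classes: int = 2):
    """Compute mIoU, mAcc, and the crack-class F1 (Formulas (5)-(7)).

    pred, gt: integer label maps of the same shape (0 = background, 1 = crack).
    """
    eps = 1e-10
    ious, accs = [], []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        ious.append(tp / (tp + fp + fn + eps))
        accs.append(tp / (tp + fn + eps))  # per-class accuracy
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt != 1))
    fn = np.sum((pred != 1) & (gt == 1))
    precision, recall = tp / (tp + fp + eps), tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return {"mIoU": float(np.mean(ious)), "mAcc": float(np.mean(accs)), "F1": float(f1)}
```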

4.2. Experimental Environment

All experiments were conducted on a workstation (Dell Technologies Inc., Round Rock, TX, USA) equipped with an NVIDIA RTX 3090 GPU (24 GB), an Intel i9-12900K CPU, and 128 GB RAM, running Ubuntu 20.04 LTS. The model was implemented in PyTorch 1.13.
The input image resolution was set to 640 × 640. We used the AdamW optimizer with a learning rate of 1 × 10−4, a weight decay of 1 × 10−5, and a batch size of 8. The learning rate was decayed using a cosine annealing schedule. The total number of training iterations was 15,000, and early stopping was applied based on validation mIoU.
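A sketch of this training configuration in PyTorch; the placeholder model and the evaluation cadence for early stopping are illustrative:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 2, kernel_size=1)  # placeholder standing in for the full network

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=15_000)

best_miou, bad_evals, patience = 0.0, 0, 10  # early stopping on validation mIoU
for iteration in range(15_000):
    # forward pass, loss computation, loss.backward(), and optimizer.step()
    # would run here on each batch of eight 640 x 640 images
    scheduler.step()
```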

4.3. Ablation Study of the Proposed Modules

To validate the effectiveness of the proposed components, this study used the original Mask2Former framework as a baseline, iterating 15,000 times on the datasets mentioned above. The study tested the impact of individual components on the accuracy of pavement crack semantic segmentation. The components under test were the Local Perception Module and the Auxiliary Convolution Layer.
The results of the ablation study are presented in Table 2. The baseline Mask2Former achieved an mIoU of 80.24%. When the Local Perception Module was added, the mIoU increased to 81.67%. When the Auxiliary Convolution Layer was introduced, the mIoU improved to 81.43%. Adding both the Local Perception Module and the Auxiliary Convolution Layer together resulted in an mIoU of 82.54%, a 2.30% improvement over the baseline model.
In conclusion, the addition of the Local Perception Module and the Auxiliary Convolution Layer enhanced the performance of Mask2Former in the pavement crack semantic segmentation task.

4.4. Comparison and Analysis

In order to comprehensively evaluate the effectiveness of the proposed method, several representative deep learning-based semantic segmentation models were selected for comparison. U-Net [28] is a classical encoder-decoder architecture originally designed for biomedical image segmentation, known for its skip connections that help preserve spatial information. DeepLabV3 [29] is a CNN-based architecture that uses atrous spatial pyramid pooling (ASPP) to capture multi-scale context, effective in capturing fine-grained boundaries. PSPNet [30] introduces a pyramid pooling module to aggregate global and local features across different spatial scales. SegNet [31] employs a symmetric encoder-decoder design with pooling indices for better upsampling, suitable for resource-limited environments. SAN (Self-Attention Network) [32] is a model incorporating self-attention mechanisms to improve long-range dependency modeling. MaskFormer [33] combines pixel-level segmentation with object queries, laying the foundation for the later Mask2Former.
These models span a variety of architectural designs and have been widely used as baselines in semantic segmentation tasks. Including them in the comparative analysis provides a fair and comprehensive evaluation of the proposed approach. Additionally, the impact of different pavement materials on detection results was discussed.
The testing data from the aforementioned datasets was used to make predictions with the different networks. The evaluation metrics are shown in Table 3. The comparison includes mIoU, mAcc, Acc-Crack, and Acc-Background, along with the parameter counts and the inference speed in frames per second (FPS).
The proposed method achieves an mIoU of 82.54%, the highest among the models tested, although it slightly lags behind Mask2Former on the mAcc metric. Built on a self-attention mechanism, the proposed model enhances the ability to capture local information; this is further improved by the auxiliary convolutional layer, which retains low-level information and leads to more accurate recognition of image details. SegNet, based on a relatively simple encoder-decoder structure, performs poorly on complex objects like cracks, achieving an mIoU of 77.42%. PSPNet, whose feature pyramid collects information across different scales, performs better, exceeding SegNet by 3.90 percentage points in mIoU. Mask2Former, built upon MaskFormer, incorporates a masked attention mechanism that helps the model recognize fine details, achieving an mIoU 2.03 percentage points higher than MaskFormer. SAN, which adaptively adjusts features and excels at capturing long-range information, suffers from network complexity that limits its performance when data volume is insufficient, yielding an mIoU of 79.29%.
In the comparison experiments, the results from U-Net and DeepLabv3 stand out, differing markedly from the other models: their crack pixel classification accuracy is 0% and their background accuracy is 100%. This indicates that, with the parameters set in Section 4.2, U-Net and DeepLabv3 fail to recognize cracks at all, while the other models identify them successfully. Among the models that do detect cracks, the mAcc metric exceeds 86.20%, since the dominant pavement background pixels are classified with high accuracy rates of 95–99.1%, while crack recognition remains more challenging. Larger differences emerge in the mIoU metric, where the proposed segmentation model with the local perception module achieves the highest value, 82.54%.
The results of the proposed method and the Mask2Former, SegNet, and SAN models on eight images are shown in Figure 5. The proposed model demonstrates superior detail extraction of cracks. Compared to other models, it preserves cracks more completely, with fewer losses.
As shown in rows F, G, and H of Figure 5, where cracks are fine and their color is similar to the background, other models exhibit significant loss in the upper-left corners, whereas the proposed model provides relatively complete recognition. Mask2Former identifies the cracks as wider, but the results of the proposed method are closer to the ground truth, indicating the significant improvement brought by the local perception module and the auxiliary convolutional layer in the segmentation results.
Furthermore, the poor crack recognition performance of U-Net and DeepLabv3 in the semantic segmentation task was explored. The analysis indicates that the loss function may not have been optimal for the current dataset. A series of experiments was conducted to assess the performance of U-Net and DeepLabv3 with three different loss functions: Cross Entropy Loss, Dice Loss, and Focal Loss. Results showed that using Cross Entropy Loss and Dice Loss resulted in 0% accuracy for crack recognition and 100% for background.
Further experiments with Focal Loss, applying a class weight ratio of 1:100 between background and crack, led to an improvement in crack pixel recognition accuracy, which was no longer zero. After 14,000 iterations (140 epochs) of training, the crack pixel accuracy reached 7.45%. The overall macro-average accuracy variation is shown in Figure 6.
In conclusion, Focal Loss, by balancing easy and hard samples, effectively addresses the class imbalance issue, enabling both U-Net and DeepLabv3 to identify cracks. The staged class weights for the background and crack categories and the final training outcomes with Focal Loss are shown in Table 4 and Table 5, respectively.
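For completeness, a minimal sketch of a class-weighted focal loss; the focusing parameter gamma = 2 is the common default and is an assumption, as the paper does not report it. The staged background:crack weights from Table 4 (e.g., 1:500, then 1:200, then 1:50) would be passed in as `alpha`:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, target: torch.Tensor,
               alpha: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Class-weighted focal loss for imbalanced crack/background pixels.

    logits: (B, C, H, W); target: (B, H, W) integer labels;
    alpha: per-class weights, e.g. torch.tensor([1.0, 500.0]).
    """
    ce = F.cross_entropy(logits, target, reduction="none")  # unweighted CE per pixel
    pt = torch.exp(-ce)                                     # probability of the true class
    w = alpha.to(logits.device)[target]                     # per-pixel class weight
    return (w * (1.0 - pt) ** gamma * ce).mean()
```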
To enable fair and direct comparison with previous work, we conducted experiments using only the DeepCrack dataset, following the original train–test split. Our model achieved an mIoU of 83.12%, outperforming the state-of-the-art result reported by Zou et al. (2019) [27], whose DeepCrackNet achieved 81.44% mIoU. This demonstrates the generalization capability and robustness of our method on public datasets.
Table 4 and Table 5 further investigate the poor performance of U-Net and DeepLabV3 observed in Table 3, where both models failed to detect cracks effectively. The root cause is the severe class imbalance in the pavement crack datasets—crack pixels occupy less than 2% of the image, while background pixels dominate. As a result, these models tend to converge toward background classification.
To mitigate this issue, we introduced Focal Loss with class weighting, progressively adjusting the background-to-crack pixel ratio during training. As shown in Table 5, this strategy significantly improved the crack detection accuracy of both models: U-Net’s mIoU increased from 50.00% to 74.47%, and DeepLabV3 improved to 80.16%, demonstrating that appropriate loss functions can help classical models better cope with imbalanced data.
However, despite the improvements, their performance still lags behind the proposed method (82.54% mIoU). This gap highlights that simply optimizing the loss function cannot fully compensate for architectural limitations in modeling fine crack structures. Our proposed network, by combining global-local feature extraction (via LPM) and shallow edge enhancement (via ACL), outperforms classical models even when they are trained under optimized conditions. Therefore, Table 4 and Table 5 not only validate the fairness of our baseline setup, but also emphasize the structural advantage of the proposed method beyond loss-level adjustments.
Table 6 presents a performance comparison among our proposed method, DeepCrackNet, and U-Net on the DeepCrack dataset using the same official train–test split. The results show that our method achieves the highest mIoU (83.12%) and F1 score (80.53%), outperforming DeepCrackNet (81.44% mIoU) and U-Net (73.82% mIoU).
The performance gap can be attributed to several key architectural improvements. First, integration of the Auxiliary Convolutional Layer (ACL) enables better preservation of high-resolution spatial features and edge contours, which is critical for detecting narrow and discontinuous cracks. Second, the Local Perception Module (LPM) enhances the model’s ability to extract local patterns while maintaining long-range dependencies, helping to capture both fine and extended crack structures more effectively.
Additionally, DeepCrack is a fine-grained, high-resolution dataset with detailed annotations, which benefits from models capable of fusing global semantic context and local detail. Our hybrid design addresses this need better than conventional architectures such as U-Net, even under equal training settings. This demonstrates the robustness and generalization of our model across datasets and highlights its applicability to real-world pavement crack detection tasks.

5. Discussion

5.1. Effectiveness of the Proposed Modules

Integration of the Local Perception Module (LPM) and Auxiliary Convolutional Layer (ACL) into the Mask2Former framework clearly enhances segmentation performance, particularly for thin, fragmented pavement cracks. The LPM improves local feature retention within the Transformer’s global context, while the ACL strengthens spatial edge awareness. The ablation study (Table 2) supports their individual and combined effectiveness, and the superior results in Tables 3 and 6 validate their generalizability across datasets.

5.2. Analysis of Model Performance

Compared to traditional CNN-based models like U-Net and DeepLabV3, the proposed method performs significantly better under both balanced and imbalanced settings. The visual results (Figure 5) and quantitative scores (Table 3) show that our model not only improves overall segmentation accuracy, but also better preserves crack continuity and edge sharpness. These results suggest that fusing global and local information is crucial for real-world crack detection tasks.

5.3. Limitations and Future Work

Despite the improvements, this study has several limitations. First, although the proposed model shows good generalization on two public datasets and a self-constructed dataset, the total sample size remains relatively limited. Second, the method’s inference speed (8.31 FPS) may not fully meet the needs of real-time embedded systems. In future work, we plan to explore lightweight architectures for real-time deployment and investigate semi-supervised or self-supervised methods to reduce annotation costs while maintaining accuracy.

6. Conclusions

This paper presents a crack segmentation model based on a local perception module. While Transformer networks excel at capturing global information, they often overlook local details between image patches when applied to image processing. To address this issue, we introduced a local perception module that restores the spatial relationships of input sequences, enhancing Transformer’s ability to incorporate local information. This module was applied to Mask2Former, and an additional auxiliary convolution layer was added to capture low-dimensional semantic information from the feature extraction network.
In the pavement crack segmentation dataset used, there are 293 publicly available pavement crack images and 442 images labeled by this study. The proposed method achieved an mIoU of 82.54% and an mAcc of 89.29% on this dataset. Ablation experiments showed that the local perception module and auxiliary convolution layer improved the mIoU by 1.43% and 1.19%, respectively. Comparative experiments with mainstream segmentation models showed that the mIoU of 82.54% from the proposed model was the best among all tested models. The superiority of our method is quantitatively supported by Table 3 and Table 6, which show consistent gains in mIoU and F1 score over both classical and state-of-the-art models.
In conclusion, this study presents a novel enhancement to the Mask2Former framework by integrating a Local Perception Module (LPM) and an Auxiliary Convolutional Layer (ACL). These modules significantly improve the model’s ability to capture fine-grained, discontinuous pavement cracks by fusing global semantic context with local edge features. Experiments on multiple datasets, including DeepCrack and our own annotated dataset, demonstrate that the proposed method achieves superior performance compared to both classical models and recent baselines. This confirms the robustness, accuracy, and practical value of our approach for real-world crack detection scenarios.

Author Contributions

Conceptualization, Y.Z.; formal analysis, T.C.; funding acquisition, T.C.; investigation, T.C.; methodology, Y.Z.; project administration, T.C.; resources, T.C.; software, Y.Z.; supervision, T.C.; validation, Y.Z.; visualization, Y.Y.; writing—original draft, Y.Y.; writing—review and editing, Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Industrial Technology Development Project under grant 2021KY41ZD(CYH)-05, in part by the Special Project of Technological Innovation and Guidance in Shaanxi Province under Grant 2022QFY01-03, in part by the Natural Science Foundation in Shaanxi Province under Grant 2022JQ-476, in part by the Science and Technology Program in Xi’an city under Grant 21XJZZ0055, and in part by Key research and development plan of Shaanxi Province under Grant No.2025CY-YBXM-014.

Data Availability Statement

Datasets can be obtained by contacting the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
| Abbreviation | Full Term | Description |
|---|---|---|
| mIoU | Mean Intersection over Union | Average overlap between predicted and ground-truth segments across classes |
| F1 | F1 Score | Harmonic mean of precision and recall for crack pixels |
| Acc-Crack | Accuracy for Crack Pixels | Accuracy calculated only for crack pixels |
| mAcc | Mean Accuracy | Average per-class accuracy |
| GT | Ground Truth | Manually annotated segmentation labels |
| LPM | Local Perception Module | Proposed module to enhance local feature awareness |
| ACL | Auxiliary Convolutional Layer | Proposed module to preserve high-resolution spatial details |
| FPS | Frames Per Second | Model inference speed (higher is faster) |
| CE | Cross-Entropy Loss | Standard pixel-wise classification loss |
| FL | Focal Loss | Loss function to handle class imbalance |
| CRF | Conditional Random Field | Post-processing method to refine segmentation edges |
| CAM | Class Activation Map | Visualization technique to locate important regions for predictions |

References

  1. Li, L.; Zhou, T.; Wang, W.; Li, J.; Yang, Y. Deep hierarchical semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1246–1257. [Google Scholar]
  2. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  3. Zhang, H.; Zhang, A.A.; Dong, Z.; He, A.; Liu, Y.; Zhan, Y.; Wang, K.C.P. Robust semantic segmentation for automatic crack detection within pavement images using multi-mixing of global context and local image features. IEEE Trans. Intell. Transp. Syst. 2024, 25, 11282–11303. [Google Scholar] [CrossRef]
  4. Majidifard, H.; Jin, P.; Adu-Gyamfi, Y.; Buttlar, W.G. Pavement image datasets: A new benchmark dataset to classify and densify pavement distresses. Transp. Res. Rec. 2020, 2674, 328–339. [Google Scholar] [CrossRef]
  5. Amhaz, R.; Chambon, S.; Idier, J.; Baltazart, V. Automatic Crack Detection on 2D Pavement Images: An Algorithm Based on Minimal Path Selection. IEEE Trans. Intell. Transp. Syst. 2016, 17, 2718–2729. [Google Scholar] [CrossRef]
  6. Peng, B.; Jiang, Y.-S.; Pu, Y. A review of automatic pavement crack image recognition algorithms. J. Highw. Transp. Technol. 2014, 31, 7. [Google Scholar] [CrossRef]
  7. Zhao, H.; Qin, G.; Wang, X. Improvement of Canny Algorithm Based on Pavement Edge Detection. In Proceedings of the 2010 3rd International Congress on Image and Signal Processing, Yantai, China, 16–18 October 2010; pp. 964–967. [Google Scholar]
  8. Zhang, Y.; Chen, B.; Wang, J.; Li, J.; Sun, X. APLCNet: Automatic Pixel-Level Crack Detection Network Based on Instance Segmentation. IEEE Access 2020, 8, 199159–199170. [Google Scholar] [CrossRef]
  9. Zou, Q.; Cao, Y.; Li, Q.; Mao, Q.; Wang, S. CrackTree: Automatic Crack Detection from Pavement Images. Pattern Recognit. Lett. 2012, 33, 227–238. [Google Scholar] [CrossRef]
  10. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  11. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  12. Fujita, H.; Itagaki, M.; Ichikawa, K.; Hooi, Y.K.; Kawahara, K.; Sarlan, A. Fine-tuned Surface Object Detection Applying Pre-trained Mask R-CNN Models. In Proceedings of the 2020 International Conference on Computational Intelligence, Las Vegas, NV, USA, 16–18 December 2020; pp. 17–22. [Google Scholar]
  13. Du, Y.; Zhong, S.; Fang, H.; Wang, N.; Liu, C.; Wu, D.; Sun, Y.; Xiang, M. Modeling Automatic Pavement Crack Object Detection and Pixel-level Segmentation. Autom. Constr. 2023, 150, 104840. [Google Scholar] [CrossRef]
  14. Wang, X.; Gao, H.; Jia, Z.; Li, Z. BL-YOLOv8: An Improved Road Defect Detection Model Based on YOLOv8. Sensors 2023, 23, 8361. [Google Scholar] [CrossRef]
  15. Zhang, Y.; Niu, P.; Guo, F.; Yan, W.; Liu, J.; Kou, L. Tunnel Lining Crack Intelligent Recognition Based on YOLOv11 Algorithm. In Proceedings of the 2024 International Conference on Smart Transportation Interdisciplinary Studies, Nanjing, China, 14–15 December 2024. [Google Scholar]
  16. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 1415–1423. [Google Scholar]
  17. Ji, A.; Xue, X.; Zhang, L.; Luo, X.; Man, Q. A transformer-based deep learning method for automatic pixel-level crack detection and feature quantification. Eng. Constr. Archit. Manag. 2025, 32, 2455–2486. [Google Scholar] [CrossRef]
  18. Guo, F.; Liu, J.; Lv, C.; Yu, H. A Novel Transformer-based Network with Attention Mechanism for Automatic Pavement Crack Detection. Constr. Build. Mater. 2023, 391, 131852. [Google Scholar] [CrossRef]
  19. Wang, Z.; Leng, Z.; Zhang, Z. A weakly-supervised transformer-based hybrid network with multi-attention for pavement crack detection. Constr. Build. Mater. 2024, 411, 134134. [Google Scholar] [CrossRef]
  20. Lin, C.; Tian, D.; Duan, X.; Zhou, J. TransCrack: Revisiting Fine-grained Road Crack Detection with A Transformer Design. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2023, 381, 20220172. [Google Scholar] [CrossRef]
  21. Shamsabadi, E.A.; Xu, C.; Dias-Da-Costa, D. Robust Crack Detection in Masonry Structures with Transformers. Measurement 2022, 200, 111590. [Google Scholar] [CrossRef]
  22. Ding, W.; Yang, H.; Yu, K.; Shu, J. Crack Detection and Quantification for Concrete Structures using UAV and Transformer. Autom. Constr. 2023, 152, 104929. [Google Scholar] [CrossRef]
  23. Cheng, B.; Choudhuri, A.; Misra, I.; Kirillov, A.; Girdhar, R.; Schwing, A.G. Mask2Former for Video Instance Segmentation. arXiv 2021, arXiv:2112.10764. [Google Scholar]
  24. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention Mask Transformer for Universal Image Segmentation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1280–1289. [Google Scholar]
  25. Li, Y.; Zhang, K.; Cao, J.; Timofte, R.; Magno, M.; Benini, L.; Van Gool, L. LocalViT: Analyzing Locality in Vision Transformers. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems, Detroit, MI, USA, 1–5 October 2023; pp. 9598–9605. [Google Scholar]
  26. Shi, Y.; Cui, L.; Qi, Z.; Meng, F.; Chen, Z. Automatic road crack detection using random structured forests. IEEE Trans. Intell. Transp. Syst. 2016, 17, 3434–3445. [Google Scholar] [CrossRef]
  27. Zou, Q.; Zhang, Z.; Song, Y.; Wang, Q.; Han, Y. DeepCrack: A deep hierarchical feature learning architecture for crack segmentation. Neurocomputing 2019, 338, 139–153. [Google Scholar]
  28. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  29. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  30. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  31. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
  32. Zhao, H.; Zhang, Y.; Liu, S.; Shi, J.; Loy, C.C.; Lin, D.; Jia, J. Exploring Self-Attention for Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10076–10085. [Google Scholar]
  33. Cheng, B.; Schwing, A.G.; Kirillov, A. Per-Pixel Classification is Not All You Need for Semantic Segmentation. Adv. Neural Inf. Process. Syst. 2021, 34, 17864–17875. [Google Scholar]
Figure 1. Pavement crack segmentation model diagram based on local perception. The model is based on Mask2Former and incorporates two key components: the Local Perception Module (LPM), which enhances local feature extraction within the Transformer, and the Auxiliary Convolutional Layer (ACL), which preserves edge-level details from shallow features to improve segmentation precision.
Figure 2. Local perception module. The LPM enhances local spatial awareness by inserting convolutional operations into the Transformer decoder. It integrates neighborhood features while maintaining the global modeling capacity of the transformer architecture.
Figure 3. Sequence and image transformation. This transformation reshapes 1D token sequences into 2D spatial feature maps for local convolution operations and then restores them to sequence format, enabling spatial locality in Transformer-based decoding.
Figure 4. Auxiliary branch module. The ACL processes shallow features from the backbone using a series of residual convolutions and fuses them with the decoder output. This enhances high-resolution detail and improves crack edge continuity in the final prediction.
Figure 5. Prediction results of different semantic segmentation models. From left to right: (a) Input Image, (b) Ground Truth, (c) Ours, (d) Mask2Former, (e) SegNet, (f) SAN. Our model demonstrates improved crack continuity, fewer false negatives, and better structural completeness, especially on thin and fragmented cracks.
Figure 6. mAcc of DeepLabv3 using focal loss.
Table 1. Semantic segmentation dataset statistics.

| Source | Amount | Size | Percentage of Cracks (%) |
|---|---|---|---|
| CFD | 118 | 480 × 320 | 1.61 |
| DeepCrack | 175 | 544 × 384 | 4.43 |
| Ours | 442 | 3106 × 4032 | 2.12 |
| Total | 735 | — | 1.82 |
Table 2. Semantic segmentation method combination experimental results.

| Local Perception Module | Auxiliary Convolution Layer | mIoU |
|---|---|---|
| – | – | 80.24 |
| ✓ | – | 81.67 |
| – | ✓ | 81.43 |
| ✓ | ✓ | 82.54 |
Table 3. Performance of different semantic segmentation models.

| Methods | mIoU | F1 | Acc-Crack | Acc-Background | mAcc | Params (M) | FPS |
|---|---|---|---|---|---|---|---|
| U-Net | 50.00 | 0.00 | 0.00 | 100.00 | 50.00 | 7.76 | 22.21 |
| SegNet | 77.42 | 0.73 | 74.19 | 98.21 | 86.20 | 14.70 | 12.42 |
| PSPNet | 81.32 | 0.77 | 76.33 | 98.53 | 87.43 | 21.80 | 14.15 |
| DeepLabV3 | 50.00 | 0.00 | 0.00 | 100.00 | 50.00 | 41.31 | 16.56 |
| SAN | 79.29 | 0.75 | 78.89 | 98.57 | 88.73 | 21.82 | 14.20 |
| MaskFormer | 78.21 | 0.74 | 78.16 | 99.12 | 88.64 | 45.01 | 12.66 |
| Mask2Former | 80.24 | 0.76 | 81.27 | 99.15 | 90.21 | 44.52 | 10.20 |
| Ours | 82.54 | 0.79 | 79.55 | 99.03 | 89.29 | 56.2 | 8.31 |
Table 4. Focal loss weights.

| Stage (Iterations) | Weight (Background:Crack) |
|---|---|
| 0–5000 | 1:500 |
| 5001–10,000 | 1:200 |
| Above 10,000 | 1:50 |
Table 5. U-Net and DeepLabv3 experimental results.

| Model | IoU-Crack | IoU-Background | Acc-Crack | Acc-Background | F1 | mIoU | mAcc |
|---|---|---|---|---|---|---|---|
| U-Net | 50.11 | 98.83 | 64.08 | 95.38 | 66.09 | 74.47 | 79.73 |
| DeepLabv3 | 61.43 | 98.89 | 79.24 | 97.58 | 73.27 | 80.16 | 88.41 |
Table 6. Fair comparison with previous work.

| Model | Dataset | mIoU | F1 | mAcc |
|---|---|---|---|---|
| DeepCrackNet | DeepCrack | 81.44 | 78.11 | 87.20 |
| U-Net | DeepCrack | 73.82 | 69.20 | 85.63 |
| Ours | DeepCrack | 83.12 | 80.53 | 88.90 |