Pavement Crack Detection Based on the Improved Swin-Unet Model

Abstract: Accurate pavement surface crack detection is crucial for analyzing pavement survey data and developing maintenance strategies. On the basis of Swin-Unet, this study develops the improved Swin-Unet (iSwin-Unet) model with a skip attention module and a residual Swin Transformer block. Based on the channel attention mechanism, the pavement crack region can be better captured and the crack feature channels can be assigned larger weights. Taking advantage of the residual Swin Transformer block, the encoder architecture can model the pavement crack features globally, and the crack feature information can be exchanged efficiently. To verify the pavement crack detection performance of the proposed model, we compare its training performance and visualization results with three other models: Swin-Unet, Swin Transformer, and Unet. Three public benchmarks (CFD, Crack500, and CrackSC) are adopted for training, validation, and testing. The test results show that iSwin-Unet achieves higher mF1 score, mPrecision, and mRecall than the existing models, demonstrating its efficacy in pavement crack detection.


Introduction
Surface cracks are among the most common pavement distresses [1,2]. The infiltration of moisture into these cracks can compromise the compaction of materials in the deeper layers of the pavement, resulting in evident reductions in the bearing capacity of the entire pavement structure. This can not only impair the functionality of the pavement surface and shorten the overall service life, but also pose risks to the safety of vehicle operation. Therefore, fast and precise detection of surface cracks plays a crucial role in timely maintenance efforts, preventing the worsening of pavement conditions [3,4]. Conventional methods for detecting surface cracks rely heavily on manual inspection, which is time-consuming, resource-intensive, and subjective. Recent technological advancements in image processing, automation, and artificial intelligence (AI) have significantly influenced the assessment and measurement of surface cracks on pavement. Consequently, transportation management agencies can prioritize and strategically plan road network maintenance to extend the pavement's service life [2,5,6].
Digital image processing techniques have been successfully applied to detect distress in transportation infrastructures [7]. This methodology involves two key steps: image acquisition and image detection. Pavement surface images are captured by automatic inspection vehicles developed by various agencies. In these acquired images, cracks are primarily categorized into two types: linear cracks and block cracks [8,9].
Manually designed features, including grayscale, edges, filters, and wavelets, have been successfully employed for their detection. While these methods are easily applicable in the field for linear cracks, detecting complex cracks with various shapes, varying widths, and disruptions such as oil spots poses significant challenges. Notably, the performance of these methods is constrained, and issues like poor contrast around cracked pixels due to unfavorable imaging conditions further complicate crack detection. Thus, manually designed features prove inferior in extracting cracks from real-world inspection images [10].
With the rapid evolution of AI, convolutional neural networks (CNNs) offer an automated approach to discerning features of target objects in pavement surface crack detection. By employing layers such as convolution and pooling, CNNs can learn and classify crack features without hand-designed features. Many CNN-based crack detection algorithms have been developed by leveraging object detection or classification methods, for example to locate pavement surface cracks in the acquired image. However, these methods struggle to achieve pixel-level accuracy, and thus the further evaluation of these cracks cannot be conducted. To achieve pixel-level crack detection, representative models such as fully convolutional networks (FCNs) [11], Unet [12], and DeepLabv3 [13] have been trained and tested on pavement image datasets. For instance, in [14], Dung et al. utilized an FCN to detect pavement surface cracks. Nevertheless, this approach overlooks the fact that cracks of varying widths may necessitate distinct sizes of context information. Unet and DeepLabv3 models have also been utilized for pavement surface crack detection, but they do not consider the relationships between neighboring pixels and fail to model crack features globally.
Recently, motivated by the remarkable success of transformers, researchers have endeavored to extend the application of transformers to different computer vision tasks. The vision transformer (ViT) was introduced in [15] to address image recognition tasks by taking 2D image patches with positional embeddings as inputs and undergoing pretraining on extensive datasets. Additionally, the data-efficient image transformer (DeiT), presented in [16], highlights the potential for training transformers on mid-sized datasets. Thus, considering the above limitations of current CNN models in pavement surface crack detection and the advantages of the transformer architecture in modeling long-range pixel dependencies, we propose an improved Swin-Unet model for pavement surface crack detection. Namely, the skip attention module based on the channel attention mechanism is designed to efficiently focus on the crack region and assign larger weights to crack feature channels. Additionally, the residual Swin Transformer block is employed in the encoder architecture to model the crack features from a global perspective. In this manner, the effective crack feature information can be further exchanged.
This paper is structured as follows: in the Related Works Section, we present a concise overview of crack detection methods, including the digital image processing-based method and the deep learning-based method. In the Proposed Network Section, we provide detailed explanations of the improved Swin-Unet model, the skip attention module, and the residual Swin Transformer block. In the Experiments and Results Section, we outline the performed experiments along with the corresponding results and analysis. In the last section, we offer a summary of the study and future work to be conducted.

Related Works
In this section, we provide a brief overview of the existing literature concerning image processing-based and deep learning-based applications for automatic pavement surface crack detection methods.

Image Processing-Based Methods
In general, image processing-based methods can be categorized into three main groups: intensity threshold-based algorithms [17,18], filter-based methods [19,20], and minimal path selection techniques [21,22]. Intensity threshold-based algorithms are straightforward to implement in the field and can yield satisfactory results in acquired images under specific pavement conditions without much noise and interference. For instance, Cheng et al. [23] developed a method that involves reducing the sample space and employing interpolation to determine the threshold value. In [24], the thresholded pavement surface images were divided into non-overlapping blocks for entropy computation. Subsequently, a second dynamic thresholding process was applied to predict cracks. In addition, a pavement surface crack detection method [25] was proposed based on the neighboring difference histogram method, and an objective function for maximizing the difference between background and crack pixels was constructed. However, these algorithms prove less effective when dealing with weak contrasts between pavement surface cracks and the background. Notably, lighting conditions significantly impact detection accuracy, especially in the case of unevenly distributed lighting. In addition, accuracy is impacted in scenarios where the image background exhibits noise and intricate textures. Thus, intensity-threshold methods have difficulty detecting the whole crack pattern and fail to identify complicated pavement surface cracks.
Pavement crack detection has seen the widespread adoption of edge detectors, such as the Sobel filter and the Canny filter, owing to the inherent similarity between cracks and edges in morphology [26–28]. However, a notable limitation of these algorithms is their inability to accurately identify complete crack profiles, especially when cracks are set against complex textured backgrounds. The methodology described in [29] diverges from employing a single filter and instead utilizes multiple pre-designed filters adept at capturing cracks based on various attributes such as pixel intensity, shape, and orientation. The extraction of cracks from the background occurs through successive filtering processes using these pre-designed filters. This algorithm effectively transforms pavement images into a new space, preserving cracks while eliminating the background. Nevertheless, a drawback of this approach is the sensitivity of the matched filters to crack width, resulting in challenges in accurately identifying pixels at crack borders.
Minimal path selection techniques have also garnered attention in pavement surface crack detection [21,22,30]. From the perspective of computer science, these techniques involve identifying optimal paths between node pairs in a graph. In the context of pavement surface detection, they aid in establishing a threshold of minimal distances to distinguish continuous crack pixels. Specifically, Zou et al. developed CrackTree [31], addressing challenges such as shadows and discontinuity in pavement crack detection. However, these approaches presuppose that crack fragments are connected through a minimum path or minimum spanning tree, ignoring the influence of neighboring pixels. Amhaz et al. proposed a crack detection method [21] based solely on photometry and introduced an algorithm for minimal path searching with fast search speed. Test results have confirmed its capability to outperform the classic Dijkstra's algorithm in terms of search speed.
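As a minimal sketch of the intensity-threshold idea described above (not the specific algorithm of [23], [24], or [25]), the snippet below marks pixels darker than a global threshold as crack candidates. The function name `threshold_cracks`, the parameter `k`, and the toy image are hypothetical and chosen only for illustration.

```python
import numpy as np

def threshold_cracks(gray, k=1.5):
    """Mark pixels darker than (mean - k * std) as crack candidates.

    `k` is an illustrative sensitivity parameter, not a value from the paper.
    """
    gray = np.asarray(gray, dtype=np.float64)
    thresh = gray.mean() - k * gray.std()
    return gray < thresh

# Toy 6 x 6 "pavement" with bright background and a dark vertical crack.
image = np.full((6, 6), 200.0)
image[:, 2] = 40.0

mask = threshold_cracks(image)
print(int(mask.sum()))  # number of pixels flagged as crack
```

Because a single global threshold depends on the overall intensity statistics, a sketch like this is expected to degrade under uneven lighting or textured backgrounds, which is the limitation noted above.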
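To illustrate the graph view of minimal path selection, the following sketch runs the classic Dijkstra's algorithm on a small pixel grid, where darker (lower-cost) pixels are cheaper to traverse. It is a generic baseline under assumed 4-connectivity, not the faster search of [21]; the function name `minimal_path` and the toy grid are hypothetical.

```python
import heapq

def minimal_path(cost, start, goal):
    """Dijkstra on a 4-connected grid; path cost is the sum of visited pixel costs."""
    rows, cols = len(cost), len(cost[0])
    dist = {start: cost[start[0]][start[1]]}
    prev = {}
    heap = [(dist[start], start)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == goal:
            break
        if d > dist[(r, c)]:
            continue  # stale heap entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + cost[nr][nc]
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(heap, (nd, (nr, nc)))
    # Walk back from the goal to recover the path.
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1], dist[goal]

# Low values stand for dark crack pixels, high values for bright background.
grid = [
    [9, 9, 9, 9],
    [1, 1, 9, 9],
    [9, 1, 1, 1],
    [9, 9, 9, 9],
]
path, total = minimal_path(grid, (1, 0), (2, 3))
print(path, total)
```

The recovered path follows the dark pixels between the two endpoints, which is how such methods link crack fragments into a continuous curve.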

Deep Learning-Based Methods
With the rapid development of artificial intelligence, more and more researchers have focused on deep learning-based methods, especially CNN models, to perform pavement surface crack detection. Generally, CNN-based methods include block-wise and pixel-wise approaches. For the block-wise method, object detection models (e.g., YOLO [32], Faster R-CNN [33], and RetinaNet [34]) have been commonly adopted. Du et al. [35] adopted YOLO to predict pavement surface cracks with possible location and category on a large pavement survey dataset. The inference time reached 0.0347 s per image, which is nine times faster than Faster R-CNN, and the detection accuracy reached 73.64%. Liu et al. [36] developed a two-step method to detect pavement surface cracks. In the first step, a modified YOLOv3 is adopted to perform the crack detection accurately. Gou et al. [37] developed a pavement surface crack detection model based on an improved Faster R-CNN with the residual neural network and an anchor size modification strategy. Experimental results validated that the proposed model was less affected by illumination and image quality. Zhai et al. [38] proposed an improved Faster R-CNN method for pavement surface crack detection. The classification and regression network was developed by incorporating the residual network, while the model incorporated a feature ensemble structure. A notable limitation of object detection-based models for pavement surface crack detection is the challenge in accurately evaluating distress regions. Consequently, an increasing number of studies have turned to semantic segmentation models to achieve pixel-level detection of pavement surface cracks for better representation and evaluation.
FCN has been one of the commonly adopted semantic segmentation approaches for road crack detection and measurement [28,30]. For instance, Liu et al. [39] utilized U-Net, an encoder-decoder structure, for concrete crack detection. Specifically, it included the focal loss, showing better robustness and accuracy compared to previous DCNNs. Lau et al. [40] introduced a U-Net-based architecture with the ResNet-34 encoder, incorporating layer group freezing, variable learning rates, and incremental increases in image size to improve the detection performance. Song et al. [41] presented the Multiscale Feature Attention Network for automatic crack identification. With the fused crack features, the detection performance was largely improved. In [42], researchers developed the weighted cross-entropy (CE) loss function and the distribution equalization learning mechanism for pavement crack detection. With the weighted CE loss function, the training process was simple and efficient.
In [43], the authors developed a new strategy which applied a DeepLabv3+ model to train on the original pavement crack images and predict on the CLAHE-enhanced pavement crack images. A sensitivity analysis was performed to indicate the impact of the data volume and the shooting angle of the camera. Wang et al. [44] developed a salient object detection-based method for pavement crack detection. Namely, a hierarchical feature fusion module was developed to integrate crack features, and a boundary refinement module was proposed to refine the crack boundary. Taking advantage of the rapid development of transformers, Liu et al. [45] designed CrackFormer by incorporating novel self-attention modules and efficient positional embedding to enhance the long-range interactions between neighboring crack pixels. Through suppressing non-semantic features and sharpening semantic crack pixels, the detection performance on cracks was largely improved. Guo et al. [3] improved the pixel-level crack detection performance by unifying the Swin Transformer as the encoder and the SegFormer as the decoder. Experimental results on three popular crack datasets show that thin cracks and cracks impacted by environmental noise can also be accurately detected. Lu et al. [46] advanced the field of crack classification by incorporating pretrained Swin Transformer models into U-Net frameworks, a methodology termed Crack_PSTU. This combination enriches the model's capacity to discern intricate details and contextual nuances within crack imagery, thereby improving the accuracy and reliability of crack detection. Although CNN-based models and transformer-based models exhibit distinct characteristics in modeling crack pixels, few studies have explored how to leverage both approaches in a single model for crack pixel detection. The method proposed in this study aims to leverage the strengths of both approaches to refine the pavement surface crack detection performance.

Overview of the Developed Architecture
The enhanced architecture, depicted in Figure 1, is an evolution of the original Swin-Unet [47–49]. It features an improved Swin Transformer arranged in a "U" shape, serving as both the encoder and the decoder. Drawing inspiration from ResNet, the basic unit of Swin-Unet incorporates a residual block flow. During training, input samples are first partitioned into non-overlapping patches with a size of 4 × 4. This process transforms the inputs into sequence embeddings, facilitating the efficient calculation of pavement crack features. Each patch has a dimension of 4 × 4 × 3, resulting in 48 features. A linear embedding layer then projects this feature dimension into an arbitrary dimension denoted as C (in our study, C = 96). The transformed patch tokens traverse residual Swin Transformer blocks and patch merging layers, creating pavement crack feature representations at varying scales. The patch merging layer handles downsampling and increases the feature dimension, while the residual Swin Transformer block focuses on feature representation learning.
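The patch partition, linear embedding, and patch merging steps described above can be sketched with plain array operations. This is an illustrative sketch only: the 224 × 224 input size, the random stand-in weights, and the helper names `patch_partition` and `patch_merging` are assumptions, and the merging step follows the common Swin Transformer convention (concatenate 2 × 2 neighbors, then project 4C to 2C) rather than a layout confirmed by this paper.

```python
import numpy as np

def patch_partition(image, patch=4):
    """Split an (H, W, 3) image into non-overlapping patch tokens of length patch*patch*3."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/4, W/4, 4, 4, 3)
    return x.reshape((h // patch) * (w // patch), patch * patch * c)

def patch_merging(x, h, w, weight):
    """Downsample by 2: concatenate each 2x2 token neighborhood (4C), then project to 2C."""
    c = x.shape[-1]
    x = x.reshape(h, w, c)
    merged = np.concatenate(
        [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]], axis=-1
    )
    return merged.reshape(-1, 4 * c) @ weight

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))              # assumed input size
tokens = patch_partition(image)                 # 56 * 56 tokens, 48 features each

C = 96                                          # embedding dimension used in the paper
W_embed = rng.standard_normal((48, C)) * 0.02   # random stand-in for learned weights
embedded = tokens @ W_embed                     # linear embedding: 48 -> C

W_merge = rng.standard_normal((4 * C, 2 * C)) * 0.02
stage2 = patch_merging(embedded, 56, 56, W_merge)
print(tokens.shape, embedded.shape, stage2.shape)
```

The shapes make the trade-off explicit: each merging step quarters the number of tokens while doubling the feature dimension.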
Capitalizing on the U-Net design, a symmetric decoder is constructed to complete the "U" shape. The designed decoder incorporates residual Swin Transformer blocks and a patch expanding layer. To combat the loss of spatial information in pavement crack features, context features are extracted and fused with multiscale features from the encoder using skip attention layers. The skip attention layer serves a dual purpose: it suppresses redundant information from large background areas while enhancing the efficient utilization of crack features. In contrast to the patch merging layer, the patch expanding layer is dedicated to the up-sampling of pavement crack features. Specifically, this layer reshapes feature maps of adjacent dimensions into larger feature maps with a 2× up-sampling of resolution. A last patch expanding layer performs a 4× up-sampling, restoring the resolution of the feature maps to match the original input resolution. In the last step, a linear projection layer is applied to the up-sampled pavement crack features to generate the refined pavement crack detection results.
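The 2× patch expanding step in the decoder can be sketched as a channel-to-space rearrangement. The sketch below assumes the convention used in the public Swin-Unet design (project C to 2C, then redistribute each token's channels over a 2 × 2 spatial block so that the feature dimension becomes C/2); the helper name `patch_expand`, the toy sizes, and the random weights are hypothetical.

```python
import numpy as np

def patch_expand(x, h, w, weight):
    """Up-sample (h*w, C) tokens to (2h*2w, C/2) tokens.

    `weight` has shape (C, 2C); the doubled channels are split into a
    2x2 spatial block of C/2 channels each.
    """
    c = x.shape[-1]
    x = x @ weight                      # (h*w, 2C)
    x = x.reshape(h, w, 2, 2, c // 2)   # channels -> (row offset, col offset, C/2)
    x = x.transpose(0, 2, 1, 3, 4)      # (h, 2, w, 2, C/2)
    return x.reshape(2 * h * 2 * w, c // 2)

rng = np.random.default_rng(0)
h = w = 7
C = 8                                   # small toy dimension for illustration
x = rng.standard_normal((h * w, C))
W_expand = rng.standard_normal((C, 2 * C))

expanded = patch_expand(x, h, w, W_expand)
print(expanded.shape)                   # four times the tokens, half the channels
```

This mirrors patch merging in reverse: resolution doubles along each axis while the feature dimension is halved.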

Residual Swin Transformer Block
Building upon the multi-head self-attention (MSA) module of the conventional transformer architecture, the original Swin Transformer block serves as the basic unit of the model. This block incorporates both the window-based multi-head self-attention (W-MSA) and the shifted window-based multi-head self-attention (SW-MSA) modules, thus augmenting the model's capacity for contextual understanding and feature extraction within localized regions of input sequences. Figure 2 presents the basic components of the Swin Transformer block. Each block comprises a LayerNorm (LN) layer, a multi-head self-attention module, a residual connection, and a 2-layer MLP with GELU non-linearity. The first block utilizes the W-MSA module while the second block adopts the SW-MSA module. On the basis of this window partitioning mechanism, two consecutive Swin Transformer blocks can be expressed by Equations (1)-(4).
$$\hat{z}^{l} = \text{W-MSA}\left(\text{LN}\left(z^{l-1}\right)\right) + z^{l-1} \tag{1}$$
$$z^{l} = \text{MLP}\left(\text{LN}\left(\hat{z}^{l}\right)\right) + \hat{z}^{l} \tag{2}$$
$$\hat{z}^{l+1} = \text{SW-MSA}\left(\text{LN}\left(z^{l}\right)\right) + z^{l} \tag{3}$$
$$z^{l+1} = \text{MLP}\left(\text{LN}\left(\hat{z}^{l+1}\right)\right) + \hat{z}^{l+1} \tag{4}$$
where $\hat{z}^{l}$ and $z^{l}$ are the outputs of the (S)W-MSA module and the MLP module of the $l$-th block, respectively. Similar to the previous fashion of calculating self-attention, the similarity is computed based on the relative position encoding. The computation is shown in Equation (5).
$$\text{Attention}(Q, K, V) = \text{SoftMax}\left(\frac{QK^{T}}{\sqrt{d}} + B\right)V \tag{5}$$
where $Q, K, V \in \mathbb{R}^{M^{2} \times d}$ indicate the query, key, and value, respectively. $M^{2}$ and $d$ are the number of patches in a window and the query/key dimension, respectively. The values of $B$ come from the bias matrix.
Inspired by the success of ResNet and SwinIR, we develop the residual Swin Transformer block to enhance the crack feature aggregation. Figure 3 presents the architecture of the developed residual Swin Transformer block. This residual design offers two advantages. Firstly, the transformer structure can be conceptualized as a particular implementation of spatially varying convolution, which can promote the feature extraction of the proposed model. Secondly, the inclusion of a residual connection establishes an identity-based link from diverse blocks to the decoder module, facilitating the amalgamation of pavement crack features across multiple levels. Equation (6) gives the computation of the developed block, where $\tilde{z}^{l}$ is the output of the developed residual Swin Transformer block and $z^{l}$ is the input.
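The window attention with a relative position bias described in this section can be sketched for a single window and a single head as follows. The window size M = 7, the head dimension d = 32, and the random inputs are assumptions for illustration, and `residual` shows only the generic identity-skip pattern, not the exact form of Equation (6).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(q, k, v, bias):
    """Scaled dot-product attention with an additive bias inside one window.

    q, k, v have shape (M*M, d); bias has shape (M*M, M*M).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d) + bias
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

def residual(z, block):
    """Generic identity skip connection: block(z) + z (assumed form)."""
    return block(z) + z

M, d = 7, 32                                   # assumed window size and head dimension
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((M * M, d)) for _ in range(3))
bias = rng.standard_normal((M * M, M * M)) * 0.1   # stand-in for learned bias values

out, weights = window_attention(q, k, v, bias)
skipped = residual(q, lambda z: window_attention(z, z, z, bias)[0])
print(out.shape, skipped.shape)
```

Each row of `weights` sums to one, so every output token is a convex combination of the value vectors in its window, shifted by the learned positional bias.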

Skip Attention Module
Semantic segmentation networks such as U-Net, ResU-Net, Attention U-Net, and Dense U-Net have dominated network design, with their shared characteristic being the incorporation of skip layers to preserve spatial details and edge region information during decoding. While these skip connections effectively retain crucial crack features for prediction, they introduce a challenge by including redundant information in the extracted crack features. This redundancy disrupts crack feature extraction, which can result in irrelevant and false predictions during network training.
To address this challenge, we adopt the attention mechanism, which has been successfully implemented in CNN-based networks. The Attention U-Net, introduced by Oktay et al. [50], integrates attention gates to suppress irrelevant regions during training, reducing redundant feature maps and inhibiting activations in irrelevant areas. While effective in mitigating the impact of redundant information, this approach has a distinct drawback, as it decreases the proportion of features in the decoding process for the entire network. Recognizing the distinct design of attention modules in CNNs and transformers, we propose a novel skip attention module to connect the transformer-based encoder and decoder, aiming to enhance the overall performance of the network in pixel-level semantic segmentation tasks.
Figure 4 illustrates the architecture of the skip attention module. This design features a parallel structure that conducts spatial and channel attention operations concurrently. The process commences with spatial normalization, during which the attention weight ($W_{att}$) is computed for each encoder block to accentuate the informative tokens that carry salient crack features. Further, by calculating the attention weight, which includes the bias term $B$, through the mapping from the encoder to the decoder, the area importance can be determined. As illustrated in the previous section, the attention value at the $i$-th scale is calculated by Equation (7), where $Att_i(Q_d, K_d, V_d)$ represents the weight of the $i$-th decoder path. This weight calculation method is adopted in all three decoding paths to update the weight at the corresponding scale during training. To enhance the interaction between the two feature computations, two branches, the spatial branch and the channel branch, are used to compute the generated features. Firstly, a token average operation is conducted to obtain a global representation. Then, the global representation is fused with the original tokens to reconsider the feature representation and perform the channel-wise attention calculation. Equations (8) and (9) reflect this process, where $W_j$, $W_k$, and $W_l$ are trainable parameters, $C$ and $h$ represent the dimension of the embedding space and the head, respectively, and $CA$ represents the hybrid attention operation. This method not only achieves both attention operations, but also offers a distinctive, non-linear way of formulating the interaction that connects the encoder and the decoder.
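The channel branch described above (token averaging followed by channel-wise reweighting) can be illustrated with a generic squeeze-and-excitation style gate. This is only a sketch of the general idea under assumed shapes: it is not the paper's Equations (8) and (9), and the names `channel_gate`, `w1`, `w2`, and the reduction ratio are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_gate(tokens, w1, w2):
    """Reweight feature channels using a global (token-averaged) descriptor.

    tokens: (N, C) skip-connection features; w1: (C, C_r); w2: (C_r, C).
    """
    g = tokens.mean(axis=0)                 # token average -> global representation (C,)
    hidden = np.maximum(g @ w1, 0.0)        # small non-linear bottleneck (ReLU)
    weights = sigmoid(hidden @ w2)          # one weight in (0, 1) per channel
    return tokens * weights, weights        # broadcast over all tokens

rng = np.random.default_rng(0)
N, C, reduction = 196, 96, 4                # assumed token count and reduction ratio
tokens = rng.standard_normal((N, C))
w1 = rng.standard_normal((C, C // reduction)) * 0.1
w2 = rng.standard_normal((C // reduction, C)) * 0.1

gated, channel_weights = channel_gate(tokens, w1, w2)
print(gated.shape, channel_weights.shape)
```

In a trained network, channels that respond to crack structure would receive weights closer to one, while channels dominated by background would be attenuated before fusion with the decoder features.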

Experimental Setting
We conducted our experiments using a deep learning machine equipped with the NVIDIA 3080 Ti GPU, featuring 12 GB of memory. The operating system employed was Ubuntu (version 20.04). For model training, validation, and testing, we utilized the

Experimental Data
The CFD dataset [52,53] is a commonly adopted pavement crack dataset which consists of 118 RGB pavement images captured with an iPhone 5 in Beijing, China, and corresponding binary pixel annotations. The resolution of each image is 480 × 320 pixels. The phone camera has a focal length of 4 mm, an aperture of f/2.4, and an exposure time of 1/135 s. The dataset provides annotated pixel-level labels, indicating the presence and location of cracks in the images. With annotated images, CFD serves as a valuable resource for training and evaluating deep learning models for pavement crack detection. Image and label samples can be found in Figure 5a.
The Crack500 dataset [54] includes 500 RGB images featuring pavement surface cracks, where each image has a resolution of 2000 × 1500 pixels. These images were captured using cell phones on the main campus of Temple University, United States. Notably, the dataset contains pixel-level annotations that precisely delineate the location and boundaries of cracks in the images. With its meticulously curated content and detailed annotations, Crack500 stands as a valuable resource for training and evaluating deep learning models specifically designed for crack detection. Image and label samples can be found in Figure 5b.
The CrackSC dataset [3] consists of 197 RGB images depicting pavement surfaces with noticeable noise. Captured using an iPhone 8 near Enoree Ave, Columbia, SC, these images serve as a benchmark dataset for evaluating pavement surface crack segmentation algorithms. The primary goal is to encourage the development of accurate and robust methods for detecting pavement cracks, thereby advancing the field of computer vision in transportation infrastructure inspection and maintenance. Image and label samples can be found in Figure 5c.
Three widely used evaluation metrics (precision, recall, and F1 score) are employed. They are calculated from the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) counts. Precision measures the accuracy of positive predictions, while recall measures their completeness. Typically, raising precision comes at the cost of recall, and vice versa, so a balanced indicator is needed when evaluating prediction performance. The F1 score is chosen because it balances the trade-off between precision and recall, making it a better indicator of model performance than either of the other two metrics alone. Given that crack shapes are typically long and thin, the conventional indicator of intersection over union (IoU) is deemed inappropriate for performance evaluation. Many studies introduce pixel tolerances (e.g., two-pixel or five-pixel) between real crack pixels and the background. However, such an approach often leads to the inaccurate evaluation of numerous pixels and fails to truly represent a model's proficiency in pixel classification. Therefore, we opt for a zero-pixel tolerance; as a consequence, the experimental results indicate lower performance compared to previous studies. Furthermore, we compute each indicator per class and use the mean values for comparison. The following equations outline the computation process for these indicators.
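A standard form of these indicators is sketched below, assuming that the mean is an unweighted average over classes. The per-class subscript i on TP, FP, and FN is notation introduced here for illustration.

```latex
\begin{align}
\mathrm{Precision}_i &= \frac{TP_i}{TP_i + FP_i}, &
\mathrm{Recall}_i &= \frac{TP_i}{TP_i + FN_i}, &
\mathrm{F1}_i &= \frac{2\,\mathrm{Precision}_i\,\mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i}, \\
\mathrm{mPrecision} &= \frac{1}{N}\sum_{i=1}^{N}\mathrm{Precision}_i, &
\mathrm{mRecall} &= \frac{1}{N}\sum_{i=1}^{N}\mathrm{Recall}_i, &
\mathrm{mF1} &= \frac{1}{N}\sum_{i=1}^{N}\mathrm{F1}_i,
\end{align}
```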
where N is the number of classes. In this study, N equals 2, indicating the background and the pavement crack.
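To make the per-class averaging with zero-pixel tolerance concrete, a minimal sketch follows. It is not the evaluation code used in this study. The function names `confusion_counts` and `mean_metrics` are hypothetical, masks are assumed to be flat sequences of class labels (0 for background, 1 for crack), and returning 0.0 when a denominator is zero is a convention chosen here.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(pred: Sequence[int], label: Sequence[int], cls: int) -> Tuple[int, int, int]:
    """Count TP, FP, FN for one class; a pixel only counts as correct
    if it matches the label exactly (zero-pixel tolerance)."""
    tp = fp = fn = 0
    for p, t in zip(pred, label):
        if p == cls and t == cls:
            tp += 1
        elif p == cls and t != cls:
            fp += 1
        elif p != cls and t == cls:
            fn += 1
    return tp, fp, fn


def mean_metrics(pred: Sequence[int], label: Sequence[int], num_classes: int = 2) -> Dict[str, float]:
    """Average per-class precision, recall, and F1 over all classes."""
    precisions, recalls, f1s = [], [], []
    for cls in range(num_classes):
        tp, fp, fn = confusion_counts(pred, label, cls)
        # Assumed convention: a metric with a zero denominator is set to 0.0.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return {
        "mPrecision": sum(precisions) / num_classes,
        "mRecall": sum(recalls) / num_classes,
        "mF1": sum(f1s) / num_classes,
    }


# Toy example: six pixels, one background pixel wrongly predicted as crack.
toy_pred = [0, 0, 1, 1, 0, 1]
toy_label = [0, 0, 1, 0, 0, 1]
print(mean_metrics(toy_pred, toy_label))
```

Because the background class dominates most pavement images, its per-class scores are usually high, so the mean values are pulled up relative to the crack-only scores.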

Training Performance
Tables 1-3 compare the performance of the proposed method with three other popular models on the CFD, Crack500, and CrackSC datasets. In Table 1, it is evident that our proposed model achieves the highest values of mRecall, mF1, and mPrecision among all the models. Specifically, iSwin-Unet's mF1 value is 1.60%, 2.46%, and 2.52% higher than Swin-Unet, Swin Transformer, and Unet, respectively. Regarding mPrecision, iSwin-Unet's value is 1.26%, 2.52%, and 1.99% higher than Swin-Unet, Swin Transformer, and Unet, respectively. In terms of mRecall, iSwin-Unet outperforms Swin-Unet, Swin Transformer, and Unet by 2.17%, 1.80%, and 0.37%, respectively. This indicates that our adopted method achieves a better recall score, which is crucial for detecting pavement surface cracks.
Table 2 illustrates the training performance of the four models on the Crack500 dataset. With respect to the mF1 score, our proposed model achieves 86.98%, which is 0.75%, 1.66%, and 7.52% higher than Swin-Unet, Swin Transformer, and Unet, respectively. In terms of mPrecision, our proposed method is 0.08%, 0.15%, and 5.50% higher than Swin-Unet, Swin Transformer, and Unet, respectively. Additionally, the mRecall value of the developed method is 1.18%, 2.04%, and 7.04% higher than Swin-Unet, Swin Transformer, and Unet, respectively.
Table 3 presents the training performance of the four models on the CrackSC dataset. Our proposed model achieves an mF1 score of 78.12%, which is 1.16%, 2.20%, and 4.15% higher than Swin-Unet, Swin Transformer, and Unet, respectively. As for mPrecision, our adopted method is 2.05%, 1.03%, and 3.38% higher than Swin-Unet, Swin Transformer, and Unet, respectively. Further, our adopted method achieves the highest mRecall value, which stands at 78.53%.

Visualization Results
Figures 6-8 present the visualized results for the three aforementioned datasets, with four randomly selected crack images from each dataset. In Figure 6, the images in the first and fourth rows exhibit similar crack shapes, while those in the second and third rows show more intricate crack patterns. In the first row, the long and thin cracks with branches, generated by both our adopted method and Unet, closely resemble the ground truth labeling. However, the results segmented by Unet show a noticeable loss of crack information, producing an unsatisfactory pattern. In the second row, where the input image contains a complex crack pattern, the prediction results of Unet and Swin Transformer exhibit a greater loss of detail compared to the ground truth label. Notably, our improved Swin-Unet demonstrates better continuity on branch cracks than Swin-Unet. In the third row, the bottom section shows disconnected cracks in the prediction results of Swin-Unet, Swin Transformer, and Unet, indicating weaker segmentation performance on this particular image. In the fourth row, which features a curve-shaped crack with a relatively simple appearance, all models successfully predict the crack, with the exception of noise pixels predicted by Swin Transformer.
In Figure 7, four randomly selected prediction results of Crack500 images are showcased, generated by our proposed iSwin-Unet, Swin-Unet, Swin Transformer, and Unet. In the first row, a thick crack appears on the right side of the image, and only our proposed iSwin-Unet and Swin-Unet achieve results similar to the ground truth labeling. In contrast, Swin Transformer produces the least accurate result, displaying a considerably different pattern compared to the ground truth labeling. Moving to the second row, the original input features a thick crack with a clear boundary between the crack and the background. Apart from our proposed model, all predictions fail to appropriately recover the full shape of the pavement crack. Notably, the results of Swin Transformer and Unet are much wider than the actual crack. In the third row, all models accurately predict a short and deep crack. However, Swin Transformer falsely predicts an apparent bulge, unlike other models. Turning to the fourth row, where a less obvious crack is present in the input, all models struggle due to the challenging condition in which the crack "hides" in the background texture. Remarkably, the prediction result of the proposed model is still the closest to the ground truth labeling, while Unet predicts no crack pixels in the visualization.
In Figure 8, four randomly selected prediction results of CrackSC images are displayed, generated by our proposed iSwin-Unet, Swin-Unet, Swin Transformer, and Unet. The CrackSC dataset presents a tougher challenge compared to the CFD and Crack500 datasets, as it includes images with leaves or shadows. In the first row, the original image features a branch-shaped crack, adding complexity to the prediction task. All models can predict the rough contour of the crack, and notably, our adopted model and Swin-Unet accurately predict the crack. Moving to the second row, a net-shaped crack with heavy shadows is presented. Our proposed model demonstrates better recoverability on the original shape with intersecting patterns compared to other models. Unet, in contrast, only predicts the thick crack segments. In the third row, a long and thin crack is observed on the right side of the image, with a tree shadow along the left side. With little interference from the shadow on the crack, the prediction results are relatively complete. Our improved model and the original Swin-Unet exhibit greater accuracy compared to Swin Transformer and Unet in this scenario. In the fourth row, the prediction becomes more challenging due to the unclear boundary between the crack and the background, coupled with a complicated pattern. The segmentation results generated by our proposed model showcase superior performance on both the backbone and branches, indicating robustness across different scenarios.

Computational Efficiency
In this section, we compare the computational efficiency of iSwin-Unet, Swin-Unet, Swin Transformer (ST), and Unet using frames per second (FPS), parameter count, and floating point operations (FLOPs), which are commonly used to evaluate inference performance in the test stage. Based on Table 4, the developed iSwin-Unet model achieves the highest inference speed (50.65 FPS) with an image resolution of 320 × 480 on a single NVIDIA 3080Ti GPU. Specifically, it surpasses Swin-Unet, ST, and Unet by 4.6%, 0.6%, and 42.9%, respectively. In terms of model parameters, all models exhibit similar parameter sizes. The parameter count of the developed model is 6.1% smaller than Swin-Unet and 2.7% smaller than Unet, while it is 28.1% larger than the ST model. Regarding FLOPs, the developed model, Swin-Unet, and Unet show similar values, whereas the ST model has significantly fewer FLOPs. In summary, the proposed model achieves the highest inference speed for pavement crack detection, indicating its potential for practical implementation.
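As a rough illustration of how an FPS figure and a percentage comparison can be obtained, a minimal sketch follows. It is not the benchmarking code used in this study. The names `measure_fps`, `relative_gain`, and `dummy_infer`, as well as the warm-up and run counts, are hypothetical, and the sketch assumes the reported percentages are relative differences, which the text does not state explicitly. On a GPU, the timed callable would also need to synchronize the device before the clock is read, since kernels are launched asynchronously.

```python
import time
from typing import Any, Callable


def measure_fps(infer: Callable[[Any], Any], sample: Any, warmup: int = 5, runs: int = 50) -> float:
    """Return frames per second for repeated calls of infer(sample)."""
    # Warm-up calls are excluded from timing (caches, lazy initialization).
    for _ in range(warmup):
        infer(sample)
    start = time.perf_counter()
    for _ in range(runs):
        infer(sample)
    elapsed = time.perf_counter() - start
    return runs / elapsed


def relative_gain(ours: float, baseline: float) -> float:
    """Relative difference of ours over baseline, in percent (assumed convention)."""
    return (ours - baseline) / baseline * 100.0


def dummy_infer(image):
    """Stand-in for a model forward pass: sums every pixel of a toy image."""
    return sum(sum(row) for row in image)


# Toy 320 x 480 single-channel image filled with ones.
toy_image = [[1] * 480 for _ in range(320)]
print(f"{measure_fps(dummy_infer, toy_image):.1f} FPS on the stand-in workload")
```

The same `relative_gain` helper applies to parameter counts, where a negative value means the first model is smaller than the baseline.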

Conclusions
In this paper, a pixel-level semantic segmentation model is improved based on Swin-Unet to achieve better prediction performance for pavement crack detection. Specifically, the skip attention module based on the channel attention mechanism is developed to focus on the crack region and assign larger weights to pavement crack feature channels. The residual Swin Transformer block is developed in the encoder architecture to model the crack features from a global perspective. The proposed module enhances the ability to fuse multi-scale features, resulting in more comprehensive and informative representations of pavement cracks.
To assess the performance of the adopted network, a total of four models were trained and tested on three public datasets (CFD, Crack500, and CrackSC). The evaluation metrics mPrecision, mRecall, and mF1 are used to validate the detection performance of our proposed model, while visualized results are also employed for evaluation purposes. From both the training performance and visualization results, it can be found that our proposed model achieves better accuracy and efficiency in pavement crack detection by better capturing and prioritizing crack feature channels and enabling global modeling of pavement crack features, thereby marking a substantial advancement over existing models in the field.
In the future, our exploration will focus on two key aspects. Firstly, despite the model's effectiveness on workstations, there is a pressing need to optimize the computational efficiency of crack detection models. This involves not only reducing the model size but also enhancing its processing speed without compromising accuracy. Secondly, it is imperative to concentrate on the practical deployment of crack detection models, particularly in terms of scalability and integration into existing infrastructure management systems. Thus, future investigations will be dedicated to formulating standardized protocols for model deployment, along with advancing scalable, cloud-based infrastructures designed to process and analyze pavement condition data over broad road networks.

Figure 2. The architecture of the basic Swin Transformer block.

Figure 4. The structure of the skip attention module.

The operating system employed was Ubuntu (version 20.04). For model training, validation, and testing, we utilized the MMSegmentation library [51], built on the PyTorch framework. The study involved the comparison of two models, namely Unet and Swin Transformer, across visualization and inference tests. Default configurations and training parameters were adhered to for each model, with training set at 12 epochs. The model training utilized three public pavement surface crack datasets.

Figure 6. Prediction results of the CFD dataset.

Table 1. Training performance on the CFD dataset.

Table 2. Training performance on the Crack500 dataset.

Table 3. Training performance on the CrackSC dataset.

Table 4. Indicators of the inference speed.
