Pixel-Level Concrete Crack Segmentation Using Pyramidal Residual Network with Omni-Dimensional Dynamic Convolution

Abstract: Cracks are one of the indicators of performance degradation in concrete structures, and automated crack detection technologies based on deep learning have been used extensively to identify them. However, existing crack segmentation methods have numerous drawbacks because cracks are fine and microscopic. To address this issue, a crack segmentation method is proposed. First, a pyramidal residual network with an encoder-decoder structure and Omni-Dimensional Dynamic Convolution is proposed to explore a network suitable for the task of crack segmentation. Additionally, the proposed method uses the mean intersection over union (mIoU) as the network evaluation index to lessen the impact of background features on the measured performance, and it adopts a multi-loss calculation to offset the negative impact of the imbalance between positive and negative samples. Finally, a dataset of concrete cracks is developed for performance evaluation. On this dataset, the proposed method achieves an accuracy of 99.05% and an mIoU of 87.00%. The experimental results demonstrate that the proposed concrete crack segmentation method is superior to well-known networks such as SegNet, DeeplabV3+, and Swin-unet.


Introduction
Cracks in concrete are considered a significant flaw when inspecting civil engineering projects. From an engineering perspective, cracks affect not only the stability and longevity of engineering structures but also the durability of concrete [1]; small or large cracks can spread slowly and eventually cause the collapse or destruction of the structure. Currently, the primary approaches to detecting crack-like flaws are simple instrumental measurement and visual inspection. The latter, however, is arduous, and it can lead to significant misdetection and omission in problem regions [2]. Manual crack detection is not suited to large-scale inspection since it frequently suffers from heavy workload, complex structures, and inconsistent evaluation standards. Compared with manual inspection, machine vision inspection is efficient, safe, and reliable because it requires no contact with the object. Traditional machine vision methods have been used extensively to solve industrial problems, including object inspection [3], material contour measurement [4], and distance measurement [5]. For example, multi-vision measurement methods can accurately measure the surface deformation and full-field strain values of steel pipe concrete columns [6]. Exponential functional density clustering models can outperform clustering and deep learning (DL) methods in indoor object extraction tasks [7]. Despite these considerable achievements, conventional vision technologies still require expert analysis and fine-tuning, which makes them unsuitable for complex problems.
Owing to the continuous development of digital imaging, combining digital image processing with DL for structural defect detection has become a new research direction in crack inspection in recent years. In this approach, an unmanned aerial vehicle [8] or a wall-climbing robot [9] carrying the relevant equipment captures a large amount of image data from the surface of an engineering structure, and DL algorithms then identify the target defects autonomously rather than relying on human experience. In recent years, DL algorithms have made significant strides in the field of computer vision [10]. According to recent research, convolutional neural networks (CNNs) can be used in crack detection for classification [11], localization [12], and segmentation [13]. For example, digital image data can be augmented with generative adversarial networks (GANs) and combined with improved visual geometry group (VGG) networks to achieve crack classification [14]. For crack width measurement, a method based on backbone dual-scale features can improve detection automation [15]. These studies increasingly concentrate on applying new DL methods, with successful outcomes. In certain studies on new cement repair materials, DL-based crack segmentation algorithms have even been employed to evaluate material performance [16].
Early segmentation algorithms mostly consist of grey-scale segmentation, conditional random fields, and other conventional methods, although it is very challenging to describe complicated classes using only grey-level information. DL gained popularity in semantic segmentation with the introduction of the first DL-based semantic segmentation model, FCN [17], which extends end-to-end convolutional networks to semantic segmentation. To increase training efficiency on small datasets, U-Net [18] was proposed for medical image segmentation. The encoder-decoder concept proposed by SegNet [19] is crucial to modern segmentation algorithms. Two primary optimization techniques, atrous convolution and enlarged convolution kernels, expand the receptive field of feature extraction. The first is to equip the network with atrous (dilated) convolution [20]; DeepLab v1 [21], DeepLab v2 [22], DeepLab v3 [23], DeepLab v3+ [24], and DenseASPP [25] are further algorithms based on this concept. The second is to enlarge the convolution kernel to create a wider effective receptive field [26]. Other technologies [27] also use this concept to optimize feature extraction and employ large-kernel pooling layers to gather information from the entire image. More recently, a self-attention semantic segmentation model [28] that employs a local-to-global approach was proposed for medical image segmentation. The success of the Swin transformer in image recognition shows the potential of transformers in vision. The main goal of these algorithms is to give the network a wider receptive field so that it can gather more global information. However, not all segmentation domains, crack segmentation among them, need global information.
Existing crack segmentation technologies can be divided into two primary groups: those that apply semantic segmentation models from other domains, and those that mix multiple networks to create a dual network for crack detection. For a crack dataset with a small amount of data, Carr et al. [29] proposed a structure with a feature pyramid core and an underlying feedforward ResNet. Yang et al. [30] proposed a pyramid and hierarchical improvement network for pavement crack detection, while Jiang et al. [31] suggested a DL-based hybrid extended convolutional block network for crack detection at the pixel level. Other effective segmentation technologies [32–38] likewise rely mostly on the concepts of SegNet and U-Net to complete the task. Despite the excellent performance of these crack segmentation models, no studies show how the parameters of these networks affect the outcomes: for example, how much local information the network collects when the crack image is subjected to feature extraction, where the skip connection is located in the encoder-decoder structure, and how many channels are used in the feature transfer.
A pixel-level crack segmentation network (CCSN) is proposed as a solution to this issue and as a means of identifying a network structure appropriate for crack segmentation. The network is a residual network built on a feature pyramid, with Omni-Dimensional Dynamic Convolution implemented in its basic block. The loss calculation combines the dice coefficient [39] and focal loss [40] to handle the issue of sample imbalance. Experiments on the created dataset support the claim that CCSN performs better than networks such as SegNet, DeeplabV3+, and Swin-unet. The performance of these networks is shown in Figure 1. The main contributions of this research are summarized as follows:
(1) A CNN structure for crack segmentation is proposed by combining residual networks and Omni-Dimensional Dynamic Convolution. Its performance under various convolution kernels, channel numbers, connection schemes, and loss functions is thoroughly investigated to find a relatively stable and high-quality structure.
(2) mIoU, mPA, and accuracy are used as the primary evaluation metrics, and various loss functions are used to target binary classification and sample imbalance.
(3) A dataset of concrete cracks with distinct environments and different orientations is created and used for training and validation.
The remainder of this study is arranged as follows. Section 2 focuses on the work related to the method. Section 3 describes the proposed crack segmentation method in detail. Section 4 analyzes the performance of the method on different datasets. Finally, Section 5 draws the conclusions of this paper.

Residual Block with ODConv
In the proposed CCSN method, the skip connection idea of the residual network [41] is borrowed, and Omni-Dimensional Dynamic Convolution (ODConv) [42], which incorporates multi-head attention, is used.

Residual Network
Residual networks address the problem of vanishing or exploding gradients that arises as the number of model layers increases. Traditional neural networks frequently employ numerous convolutional layers, pooling layers, etc., especially in image processing. Since each layer takes its features from the one before it, problems such as degradation become more likely as the number of layers rises. To overcome these issues of deep neural networks, the residual network adopts a skip connection strategy. The residual structure can be written in the following form:

$$x_{l+1} = x_l + F(x_l, W_l)$$

where $x_l$ is the input feature, $F(x_l, W_l)$ can be two or three convolution blocks, and $x_{l+1}$ is the output feature. The skip connection adds the unprocessed layer input $x_l$ directly to $F(x_l, W_l)$, i.e., the layer output contains the complete input information.
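The skip connection can be illustrated with a minimal sketch in plain Python. The names `residual_step` and `toy_branch` are hypothetical, and the branch is a toy stand-in for the convolution blocks $F(x_l, W_l)$, not the layers used in CCSN.

```python
from typing import Callable, List

Vector = List[float]


def residual_step(x: Vector, branch: Callable[[Vector], Vector]) -> Vector:
    """Return x + F(x): the untouched input added to the residual branch."""
    fx = branch(x)
    if len(fx) != len(x):
        raise ValueError("the residual branch must preserve the feature size")
    return [xi + fi for xi, fi in zip(x, fx)]


def toy_branch(x: Vector) -> Vector:
    """Hypothetical residual branch F: scales every feature by 0.5."""
    return [0.5 * v for v in x]


x_next = residual_step([1.0, -2.0, 4.0], toy_branch)
print(x_next)  # [1.5, -3.0, 6.0]
```

If the branch outputs zeros, the step reduces to the identity, which is why the layer output always retains the complete input information.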

ODConv
ODConv introduces a multi-dimensional attention mechanism with a parallel strategy to learn diverse attentions for convolutional kernels along all four dimensions of the kernel space.
ODConv uses a Squeeze-and-Excitation [43] style attention module but gives it multiple heads to compute multiple types of attention. The overall structure is shown in Figure 2. Specifically, the input $x$ is first squeezed by global average pooling (GAP) into a feature vector of length $c_{in}$, and a fully connected (FC) layer with four heads is then used to generate the different types of attention values. The four attention dimensions, which focus on spatial location, input channel, filter, and kernel, capture richer contextual information. ODConv leverages this multi-dimensional attention mechanism to compute four types of attention, $\alpha_{si}$, $\alpha_{ci}$, $\alpha_{fi}$, and $\alpha_{wi}$, for each of the $n$ convolutional kernels $W_i$ along all four dimensions of the kernel space in parallel. The formula is as follows:

$$y = (\alpha_{w1} \odot \alpha_{f1} \odot \alpha_{c1} \odot \alpha_{s1} \odot W_1 + \ldots + \alpha_{wn} \odot \alpha_{fn} \odot \alpha_{cn} \odot \alpha_{sn} \odot W_n) * x$$

where $*$ denotes the convolution operation and $\alpha_{wi} \in \mathbb{R}$ denotes the attention scalar for the convolutional kernel $W_i$; $\alpha_{si} \in \mathbb{R}^{k \times k}$, $\alpha_{ci} \in \mathbb{R}^{c_{in}}$, and $\alpha_{fi} \in \mathbb{R}^{c_{out}}$ denote three newly introduced attentions, which are computed along the spatial dimension, the input channel dimension, and the output channel dimension of the kernel space for the convolutional kernel $W_i$, respectively; and $\odot$ denotes multiplication along the different dimensions of the kernel space. Here, $\alpha_{si}$, $\alpha_{ci}$, $\alpha_{fi}$, and $\alpha_{wi}$ are computed with a multi-head attention module $\pi_i(x)$.
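The kernel aggregation step of ODConv can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function name `aggregate_kernels` is hypothetical, the attention values are passed in as given numbers rather than produced by the attention module $\pi_i(x)$, and the final convolution with the input is left out.

```python
def aggregate_kernels(kernels, a_w, a_f, a_c, a_s):
    """Combine n candidate kernels into one attention-weighted kernel.

    kernels[i][o][c][u][v]: weight of kernel i (output o, input c, position u, v)
    a_w[i]                : scalar attention for kernel i
    a_f[i][o]             : output-channel (filter) attention
    a_c[i][c]             : input-channel attention
    a_s[i][u][v]          : spatial attention over the k x k positions
    """
    n = len(kernels)
    c_out = len(kernels[0])
    c_in = len(kernels[0][0])
    k = len(kernels[0][0][0])
    return [
        [
            [
                [
                    sum(
                        a_w[i] * a_f[i][o] * a_c[i][c] * a_s[i][u][v]
                        * kernels[i][o][c][u][v]
                        for i in range(n)
                    )
                    for v in range(k)
                ]
                for u in range(k)
            ]
            for c in range(c_in)
        ]
        for o in range(c_out)
    ]


# Toy case (assumed values): two 1x1 kernels with one input and one output channel.
toy_kernels = [[[[[2.0]]]], [[[[4.0]]]]]
combined = aggregate_kernels(
    toy_kernels,
    a_w=[0.5, 0.5],
    a_f=[[1.0], [1.0]],
    a_c=[[1.0], [0.5]],
    a_s=[[[1.0]], [[1.0]]],
)
print(combined)  # [[[[2.0]]]]
```

The aggregated kernel is then used in an ordinary convolution with the input, so the attention changes the weights for each input rather than the structure of the layer.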

Pyramid Network
The pyramid network in our method has two levels: image pyramids [44] and feature pyramids [45]. Image pyramids are created to address multi-scale variation, where the inherent pixel information of small objects is readily lost during downsampling. Feature pyramids primarily address the weaknesses of object detection in dealing with multi-scale variation, and they do so with only a slight increase in processing effort.

Loss Function
A pixel-level cross entropy (CE) loss, which analyzes each pixel separately and compares the class predictions of each pixel with the encoded label vector, is the most popular loss function for image semantic segmentation tasks. The loss for each pixel is as follows:

$$L_{CE} = -\sum_{c=1}^{M} y_c \log(p_c)$$

where $M$ represents the number of categories, $y_c$ is a one-hot vector whose elements take only the values 0 and 1, and $p_c$ denotes the predicted probability that the sample belongs to class $c$. When there are only two categories, the binary cross entropy (BCE) loss can be written as follows:

$$L_{BCE} = -\left[t_c \log(p_c) + (1 - t_c)\log(1 - p_c)\right]$$

where $p_c$ is the model prediction and $t_c$ is the true label. Binary cross entropy with logits (BCEL) loss combines a sigmoid layer and the BCE loss into a single class.
For the problem of imbalance between positive and negative samples, Shrivastava et al. [46] proposed an algorithm for online hard example mining (Ohem). The OhemCE loss calculates the cross entropy, then selects hard samples according to the loss and applies higher weights to them in the subsequent training process.
Intersection over union (IoU) reflects the ratio of the intersection to the union of the true and predicted values and is commonly used as a loss function in semantic segmentation. With $X$ the set of ground-truth pixels and $Y$ the set of predicted pixels, the IoU loss is as follows:

$$L_{IoU} = 1 - \frac{|X \cap Y|}{|X \cup Y|}$$

The focal loss (FL) is a loss function used to deal with class imbalance among samples. It weights the loss of each sample according to how easily the sample is distinguished: smaller weights are given to samples that are easy to distinguish and larger weights to those that are difficult to differentiate. The loss function can be written as

$$FL = -(1 - p_c)^{\gamma} \log(p_c)$$

that is, FL adds the weighting coefficient $(1 - p_c)^{\gamma}$ before the standard cross entropy.
Dice loss is named after the dice coefficient [47], which is a function used to assess the similarity of two samples. A larger value means greater similarity between the two samples. The mathematical expression of the dice coefficient is as follows:

$$Dice = \frac{2|X \cap Y|}{|X| + |Y|}$$

where $X$ represents the pixel labels of the real segmented image and $Y$ represents the pixel classes of the model-predicted segmented image.
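The loss terms described above can be checked numerically with a small plain-Python sketch. The function names are hypothetical, `gamma=2.0` is an assumed default rather than a value stated in this section, and the set-based dice and IoU functions assume non-empty pixel sets.

```python
import math


def binary_cross_entropy(p: float, t: int) -> float:
    """BCE for one pixel with predicted probability p and true label t (0 or 1)."""
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))


def focal_term(p: float, gamma: float = 2.0) -> float:
    """Cross entropy of the true class scaled by the factor (1 - p)^gamma."""
    return -((1 - p) ** gamma) * math.log(p)


def dice_coefficient(x: set, y: set) -> float:
    """Dice similarity of two non-empty pixel sets."""
    return 2 * len(x & y) / (len(x) + len(y))


def iou_loss(x: set, y: set) -> float:
    """One minus the intersection over union of two non-empty pixel sets."""
    return 1 - len(x & y) / len(x | y)


truth = {1, 2, 3}       # indices of ground-truth crack pixels (toy data)
predicted = {2, 3, 4}   # indices of predicted crack pixels (toy data)
print(dice_coefficient(truth, predicted))    # 2 * 2 / 6
print(iou_loss(truth, predicted))            # 1 - 2 / 4
print(focal_term(0.9) < focal_term(0.9, 0))  # an easy sample is down-weighted
```

With `gamma=0` the focal term reduces to the ordinary cross entropy, which makes the role of the weighting coefficient easy to verify.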

Methodology
The proposed method achieves end-to-end crack detection with a smaller network model and higher segmentation accuracy. To evaluate the effectiveness of the proposed strategy, a concrete crack dataset (CCD) was developed. The process of establishing the CCD is shown in Figure 3. The dataset attempts to cover all orientations and widths, including cracks with simple backgrounds and cracks with complicated backgrounds. The images are captured with a smartphone. The camera sensor is the IMX600, a diagonal 9.2 mm (Type 1/1.7) 40-megapixel CMOS active-pixel stacked image sensor with a square pixel array. The original images are obtained at a resolution of 3648 × 2736 and evenly cropped into 912 × 912 patches, which are then reduced to 256 × 256 to speed up processing. Considering the performance bottleneck of the hardware (GPU), 2000 images are carefully selected, and labelImgPlus is used to make the masks. The network is then trained and validated using the masks and processed images, and the trained network is used to segment the test images.
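The even cropping step can be sketched as follows. This is an illustration only: the function name `tile_boxes` is hypothetical, and the frame height is assumed to be 2736 pixels so that the frame divides exactly into three rows of 912-pixel patches.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def tile_boxes(width: int, height: int, tile: int) -> List[Box]:
    """Return the crop boxes that split an image into equal square patches."""
    if width % tile or height % tile:
        raise ValueError("the image size must be a multiple of the patch size")
    return [
        (left, top, left + tile, top + tile)
        for top in range(0, height, tile)
        for left in range(0, width, tile)
    ]


boxes = tile_boxes(3648, 2736, 912)
print(len(boxes))  # 12 patches: 4 columns x 3 rows
print(boxes[0])    # (0, 0, 912, 912)
print(912 / 256)   # 3.5625, the downscaling factor applied to each patch
```

Each box can be passed to an image library's crop routine before the patch is resized to 256 × 256.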



Network
As shown in Figure 4, a pixel-level concrete crack segmentation network using a pyramidal residual network with Omni-Dimensional Dynamic Convolution is proposed, where the first parameter in each block indicates the block name and the second indicates the number of output filters. The input is a 256 × 256 RGB image with 3 channels, and the output is a 256 × 256 feature map with 2 channels, which correspond to the crack and background classes, to achieve pixel-level segmentation. The network has an encoder-decoder structure. The encoder downsamples through Conv, which consists of three parts: a convolutional operation, batch normalization (BN), and the GELU [48] activation function. The decoder upsamples through TConv, which consists of a transposed convolutional operation, BN, and the GELU activation function. The block is made of two ODConv units, each followed by BN and the GELU activation function. The two units are connected sequentially to deepen the network, and they are also skip connected to prevent gradients from vanishing. Feature fusion between the encoder and decoder is performed by channel concatenation (Concat), and the connection strategy is discussed later. Network details are tabulated in Table 1.
In the proposed method, Conv is mainly used to reduce the spatial size of the feature map and change the number of filters. The kernel size of the convolution in Conv is 5 × 5, the stride is 2 × 2, and the padding is chosen so that the size of the resulting feature map is $2^n$. TConv is used to increase the spatial size and change the number of filters; its kernel size is 4 × 4, and its other parameters are the same as in Conv.
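The effect of these kernel and stride settings on the feature map size can be checked with the standard output-size formulas for convolution and transposed convolution. The padding values used here (2 for Conv, 1 for TConv) and the four downsampling steps are assumptions chosen so that every size stays a power of two; the text does not state them.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output size of a strided convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def tconv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output size of a transposed convolution along one spatial axis."""
    return (size - 1) * stride - 2 * padding + kernel


# Encoder path: 5x5 kernel, stride 2, assumed padding 2 halves the size.
size = 256
encoder_sizes = [size]
for _ in range(4):  # four downsampling steps, assumed for illustration
    size = conv_out(size, kernel=5, stride=2, padding=2)
    encoder_sizes.append(size)
print(encoder_sizes)  # [256, 128, 64, 32, 16]

# Decoder path: 4x4 kernel, stride 2, assumed padding 1 doubles the size.
decoder_sizes = [size]
for _ in range(4):
    size = tconv_out(size, kernel=4, stride=2, padding=1)
    decoder_sizes.append(size)
print(decoder_sizes)  # [16, 32, 64, 128, 256]
```

With these settings the decoder restores exactly the sizes produced by the encoder, which is what allows encoder and decoder features to be concatenated along the channel axis.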

Block
As shown in Figure 5, to solve the degradation problem of deep networks, the block used in this paper adopts the idea of residual blocks. Each block has two convolution operations with different numbers of filters, which extract two different levels of features within the residual block and accelerate the training process. To find a receptive field suitable for crack segmentation, and inspired by ConvNeXt [49], the block increases the kernel size to improve network performance, as is also done in Conv and TConv. However, an overly large kernel entails a huge amount of computation, leading to a trade-off between accuracy and speed. The reasons for choosing convolutional kernel sizes of 3 × 3 and 5 × 5 are given in the subsequent experimental section. In Figure 5, c is the number of filters. If the input to the block has c filters, the number of filters is halved (c/2) inside the block before being restored. This reduces the amount of computation and partly offsets the computational load caused by large convolutional kernels.
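The saving from halving the number of filters inside the block can be estimated by counting convolution weights. The sketch below is a rough illustration under simplifying assumptions: it counts the weights of ordinary convolutions without bias, ignores the extra kernels and attention parameters of ODConv, and uses c = 64 as an arbitrary example width.

```python
def conv_weights(c_in: int, c_out: int, kernel: int) -> int:
    """Number of weights in an ordinary k x k convolution without bias."""
    return kernel * kernel * c_in * c_out


c = 64  # example block width (assumed)

# Bottleneck layout: c -> c/2 with a 3x3 kernel, then c/2 -> c with a 5x5 kernel.
bottleneck = conv_weights(c, c // 2, 3) + conv_weights(c // 2, c, 5)

# Full-width layout for comparison: c -> c with 3x3, then c -> c with 5x5.
full_width = conv_weights(c, c, 3) + conv_weights(c, c, 5)

print(bottleneck)               # 69632
print(full_width)               # 139264
print(bottleneck / full_width)  # 0.5
```

Under these assumptions, the halved intermediate width cuts the weight count of the two convolutions in half.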

Connection Strategy
In this study, the locations of the feature connections in the feature fusion phase are analyzed in depth to further optimize network performance. With the feature dimensionality kept the same, four connection options are validated experimentally, and the detailed results are provided in the experimental section.

Loss
The crack dataset is a typical example of a class imbalance problem: each image contains a disproportionately large number of non-crack pixels and a small number of crack pixels. To solve this problem, this paper combines focal loss and dice loss into a single loss function with two parts. The first part places the weighting coefficient $(1 - p_t)^{\gamma}$ before the standard cross entropy, and the second part is based on the dice coefficient, in which $\varepsilon$ is a manually set smoothing coefficient, $I$ denotes the intersection of positive examples, and $U$ denotes the union of examples.
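One possible way to combine the two terms is sketched below in plain Python. This is a hedged illustration, not the paper's exact formula: the equal weighting of the two parts, the soft definitions $I = \sum p\,t$ and $U = \sum p + \sum t$, the defaults `gamma=2.0` and `eps=1.0`, and the function name `focal_dice_loss` are all assumptions.

```python
import math
from typing import Sequence


def focal_dice_loss(
    probs: Sequence[float],
    labels: Sequence[int],
    gamma: float = 2.0,  # assumed focusing parameter
    eps: float = 1.0,    # assumed smoothing coefficient
) -> float:
    """Mean focal term plus (1 - smoothed dice) over a flat list of pixels."""
    focal = 0.0
    for p, t in zip(probs, labels):
        p_t = p if t == 1 else 1.0 - p    # probability given to the true class
        p_t = min(max(p_t, 1e-7), 1.0)    # clamp to avoid log(0)
        focal += -((1.0 - p_t) ** gamma) * math.log(p_t)
    focal /= len(probs)

    intersection = sum(p * t for p, t in zip(probs, labels))  # I
    union = sum(probs) + sum(labels)                          # U
    dice = (2.0 * intersection + eps) / (union + eps)
    return focal + (1.0 - dice)


labels = [1, 0, 0, 0]  # one crack pixel among background pixels (toy data)
print(focal_dice_loss([1.0, 0.0, 0.0, 0.0], labels))  # perfect prediction
print(focal_dice_loss([0.6, 0.4, 0.1, 0.1], labels))  # uncertain prediction
```

The focal part concentrates the gradient on hard pixels, while the dice part measures overlap on the crack class directly and so is not swamped by the large number of background pixels.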

Experiments and Results
This section presents the experimental results of the proposed method, including its performance under different network parameters, connection strategies, and losses. The network is also compared with other networks of the same type. To further verify the effectiveness of the method, a public crack dataset is used for additional validation.

Training Configuration
The experiments were conducted on an AMD Ryzen 7 5800H processor with 16 GB of RAM and an NVIDIA GeForce RTX 3060 Laptop GPU with 6 GB of memory. The DL framework is PyTorch, and Adam with a momentum of 0.9 was chosen as the optimizer during training. The initial and minimum learning rates were set to $10^{-4}$ and $10^{-6}$, respectively, with a cosine learning rate decay schedule. Due to memory constraints, the number of worker threads and the batch size were both set to 4, and 50 epochs were used for training. The training process was analyzed using different sets of hyperparameters to select the best validated model configuration. The dataset was partitioned into 80% (1600 images) for training, 10% (200 images) for validation, and 10% (200 images) for testing. All images used in the experiment were set to 256 × 256.
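A cosine decay between the stated initial and minimum learning rates can be sketched as follows. The exact formula and update granularity are not given in the text, so the per-epoch cosine annealing below and the function name `cosine_lr` are assumptions.

```python
import math


def cosine_lr(
    epoch: int,
    total_epochs: int = 50,
    lr_max: float = 1e-4,
    lr_min: float = 1e-6,
) -> float:
    """Learning rate for a zero-based epoch under cosine annealing."""
    progress = epoch / (total_epochs - 1)  # 0.0 at the start, 1.0 at the end
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


schedule = [cosine_lr(e) for e in range(50)]
print(schedule[0])   # starts at the initial learning rate
print(schedule[25])  # roughly halfway between the two bounds
print(schedule[-1])  # ends at the minimum learning rate
```

The rate falls slowly at first, fastest in the middle of training, and slowly again near the end, which avoids abrupt changes at either boundary.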

Evaluation Metric
To evaluate this crack segmentation network, it was trained with a variety of different parameters, losses, architectures, and networks. Since the crack dataset is class-imbalanced, if the mean accuracy is simply used as the evaluation metric, the accuracy of the crack pixels is masked by the accuracy of the background pixels and the results cannot be observed well. Therefore, the mIoU is used as the main evaluation metric to assess the performance of the method, and metrics such as precision (P), recall (R), F1-score, accuracy, and mPA (mean pixel accuracy) are also compared. These evaluation metrics can be derived from a confusion matrix, which is shown in Table 2. With TP, FP, FN, and TN denoting the numbers of true positive, false positive, false negative, and true negative pixels, the evaluation metrics for a single class are as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times P \times R}{P + R}$$

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}, \qquad IoU = \frac{TP}{TP + FP + FN}$$

Precision indicates the percentage of correctly predicted pixels out of all pixels predicted by the model as positive examples. Recall indicates the percentage of truly positive pixels that are correctly predicted. mPA indicates the average of the precision over all classes. Accuracy indicates the number of correctly predicted pixels as a percentage of all pixels. Finally, mIoU indicates the average of the IoU over all classes.
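The masking effect of background pixels can be seen by computing the metrics from a toy confusion matrix. The counts below are invented for illustration, the function name `binary_metrics` is hypothetical, and the crack class is treated as the positive class.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Confusion-matrix metrics with crack as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    iou_crack = tp / (tp + fp + fn)
    # For the background class the roles of positive and negative are swapped.
    iou_background = tn / (tn + fn + fp)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "iou_crack": iou_crack,
        "iou_background": iou_background,
        "miou": (iou_crack + iou_background) / 2,
    }


# Toy image: 100 crack pixels and 900 background pixels (invented counts).
metrics = binary_metrics(tp=80, fp=20, fn=20, tn=880)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this example the accuracy is 0.96 even though the crack IoU is only about 0.67, while the mIoU of about 0.81 reflects the weaker crack-class performance more faithfully.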

Block
For crack pixels, it has not been investigated how large a window should be used in the feature extraction process to capture the connection with the surrounding pixels. To find parameters suitable for crack segmentation, this paper follows ConvNeXt and, with the number of parameters kept fixed, explores different kernel sizes for the two ODConvs in the block.
The structure of the block is shown in Figure 6, with the kernel sizes $k_1$ and $k_2$ set in the order of the network connections. To verify the advantages of ODConv, the results are also compared with those of a block using normal convolutional operations, as shown in Table 3. The best performance is obtained when $k_1$ and $k_2$ are set to 3 and 5, respectively.
Processes 2023, 11, x FOR PEER REVIEW

To verify the superiority of the method, the blocks in this paper are also compared with ResBlock, MobileNetV3 [50], and ConvNeXt. The experiments were performed in the same environment, with only the blocks replaced by the corresponding methods. The experimental results are shown in Table 4. The method proposed in this paper is notably superior to the alternatives.

Connection Strategy
The feature fusion at different scales of the network is based on feature pyramid networks. Even for the same network structure, different connection modes may leave the number of parameters unchanged but produce different results. In this paper, four different feature fusion patterns are investigated, as shown in Figure 7. The results in Table 5 show that connection type (a) achieves the best results in this method.
The width (number of channels) of the feature maps in the decoder is shown in Figure 8, where the numbers indicate the number of channels. A total of 128 channels are input to the decoder, and 64, 32, 16, and 3 channels from the encoder are incorporated in the feature fusion phase as the feature dimension is transformed. The whole network constitutes a feature pyramid, and a pyramid suitable for this network is obtained by studying how the feature widths change along it. Four width trends are compared and validated, as shown in Table 6, where width indicates the number of channels of each TConv output in the direction of feature transmission. The results show that widths of 128, 96, 64, and 40, in this order, perform best.

Loss Results
The loss functions are described in detail in the previous sections. This section examines the performance of different loss functions on our network, as shown in Table 7. The results show that the proposed loss, which uses focal loss and takes the dice coefficient into account, outperforms the other loss functions.
To validate the effectiveness of the proposed method, it is compared with other methods, including the classical algorithms FCN, SegNet, Unet, and Deeplabv3+. In addition, the method is compared with Swin-unet, which uses a pure transformer, to further demonstrate its advantages in the crack segmentation task.
The experimental results are shown in Table 8. U-shaped and encoder-decoder networks perform better on the crack segmentation task, while algorithms such as Deeplabv3+ and Swin-unet show disappointing performance. The method proposed in this paper outperforms the other methods in terms of accuracy, F1-score, mIoU, and mPA. Samples of crack detection using different networks are shown in Figure 9. In the second test sample, for example, fine (only one pixel wide) and blurred cracks pose a challenge to detection, but our method detects the whole crack more completely and comes closest to the ground-truth image.

Results on BCL Dataset
To make the results more persuasive, the Bridge Crack Library (BCL) [51] dataset, published on Harvard Dataverse in 2020, was used to validate the proposed network, as shown in Table 9. From a random sample of 2000 images from the BCL dataset, 80% were used for training, 10% for validation, and 10% for testing, so that the experiments are run under the same conditions as with the CCD. The experimental results show that the proposed method achieves an accuracy of 98.89%, an mIoU of 82.9%, and an mPA of 90.47% on the BCL dataset, which are higher than those of the other networks.
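The sampling and 80/10/10 partition can be sketched as follows. The file names, the pool size of 5000, the fixed seed, and the function name `sample_and_split` are placeholders for illustration; the text states only the sample size and the split ratios.

```python
import random
from typing import List, Sequence, Tuple


def sample_and_split(
    items: Sequence[str],
    sample_size: int = 2000,
    train_ratio: float = 0.8,
    val_ratio: float = 0.1,
    seed: int = 0,  # assumed seed, used only for reproducibility
) -> Tuple[List[str], List[str], List[str]]:
    """Draw a random sample and partition it into train, validation, and test."""
    rng = random.Random(seed)
    chosen = rng.sample(list(items), sample_size)  # sampling without replacement
    n_train = round(sample_size * train_ratio)
    n_val = round(sample_size * val_ratio)
    train = chosen[:n_train]
    val = chosen[n_train:n_train + n_val]
    test = chosen[n_train + n_val:]
    return train, val, test


# Placeholder file names standing in for the image library.
pool = [f"image_{i:05d}.jpg" for i in range(5000)]
train_set, val_set, test_set = sample_and_split(pool)
print(len(train_set), len(val_set), len(test_set))  # 1600 200 200
```

Because the sample is drawn without replacement and then sliced, the three subsets are disjoint, so no test image is seen during training or validation.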

Computational Comparison
The computational complexity of the proposed CCSN was evaluated against other networks (FCN, SegNet, DeepLabv3+, Unet, Unet++, and Swin-unet) by comparing the number of parameters, floating point operations (FLOPs), storage size, and frames per second (FPS). The results are shown in Table 10. The training criteria, including the training dataset, test dataset, and hyperparameters, were identical for all networks. The proposed CCSN has 4.28 million learnable parameters, requires only 7.17 GFLOPs, and occupies 9.61 MB of storage, all of which are much lower than for the other networks. In addition, its training time in the experiment is 65.67 min, the shortest among all methods. In terms of FPS, the proposed network outperforms DeepLabv3+ and Unet++ and is slightly slower than SegNet, Unet, and Swin-unet. Nevertheless, it remains clearly ahead on the overall evaluation metrics.

Figure 1. Inference mIoU, FLOPs, speed, and storage performance on our dataset. The bigger the circle, the faster the speed. The redder the color, the larger the storage.

Figure 3. Overview of the proposed segmentation framework.

Figure 4. Schematic illustration of CCSN network architecture.

Figure 6. Schematic illustration of block with different kernels.

Figure 7. Schematic illustration of different connection modes.

Figure 9. Samples of crack detection using different networks.

Table 1. Detailed parameters in CCSN.

Table 2. The confusion matrix for evaluation.

Table 3. Results of different kernel sizes in block.

Table 4. Results of different blocks.

Table 5. Results of different connection modes.

Table 6. Results of different filters.

Table 7. Results of different loss functions.

Table 8. Results of different networks on CCD.

Table 9. Results of different networks on BCL.

Table 10. Computational comparison of CCSN and other networks.