Feature Sparse Choosing VIT Model for Efficient Concrete Crack Segmentation in Portable Crack Measuring Devices

Abstract: Concrete crack measurement is important for concrete buildings. Deep learning-based segmentation methods have achieved state-of-the-art results. However, these models are extremely large, which makes them impractical for portable crack measuring devices. To address this problem, we propose a light-weight concrete crack segmentation model based on the Feature Sparse Choosing VIT (LTNet). In the proposed model, a Feature Sparse Choosing VIT (FSVIT) is used to reduce the computational complexity of VIT and the number of channels for crack features. In addition, we propose a Feature Channel Selecting Module (FCSM) to reduce channel features and suppress the influence of interfering features. Finally, Depthwise Separable Convolutions are substituted for traditional convolutions to further reduce computational complexity. As a result, the model size of our LTNet is extremely small. Experimental results show that our LTNet achieves accuracies of 0.887, 0.817 and 0.693 and recalls of 0.882, 0.805 and 0.681 on three datasets, respectively, which is 3–8% higher than current mainstream algorithms, while the model size of our LTNet is only 2 M.


Introduction
Concrete cracks are a common problem in building structures. Many factors in the real environment can cause cracks. For example, the load of heavy vehicles and high-speed driving accelerates the aging of the road surface. Changing humidity and temperature cause the softening, expansion and contraction of the concrete material of buildings. If not discovered and repaired in time, these cracks might expand further, causing building damage or safety hazards. Therefore, it is necessary to measure cracks in time so that they can be repaired immediately. However, traditional manual measurement of cracks is inefficient and costly, and is not suitable for large-scale measurements. Manual measurement usually uses a thin line or silk thread to divide the crack into several small segments, and a ruler or measuring instrument to measure the length of each segment. These lengths are then added up to obtain the total length of the crack. With the development of computer vision technology, scholars have proposed using image segmentation methods to measure cracks.
Traditional computer image processing technologies for crack measurements usually include the following parts: (1) using self-designed digital filters such as the Gaussian filter, Canny filter [1] and Gabor filter [2] to perform edge detection and extract cracks; (2) using the wavelet transform for image de-noising and crack feature extraction [3,4]; (3) using minimum path selection (MPS) to capture complex curves containing closed loops and multiple branches to complete crack measurements [5]. However, due to the existence of sharp edges and complex crack feature structures, traditional computer image processing technology tends to segment cracks in images incorrectly. In addition, due to differences in …
(1) A crack image segmentation model based on a light-weight VIT module is designed. Due to the strong continuity of cracks, there is a certain correlation between cracks in different positions in the crack images. In our model, VIT is used to capture the relationship between different crack positions in crack images, thereby better processing global information. More importantly, the computational complexity of the VIT module is reduced by the FSVIT. On the one hand, the fully connected layers in VIT are replaced by Depthwise Separable Convolution; on the other hand, the Feature Space Choosing layer is used to select channels for features and reduce the number of channels for crack features.
(2) A Feature Channel Selecting Module (FCSM) is used to select channel features in the decoder. The key operation of our proposed FCSM is the Channel Sparse Choosing operation. In this operation, each channel's feature corresponds to a scaling factor α, and the channels with scaling factors approaching zero are pruned. Therefore, the number of channels of the original features decreases significantly after being processed by the FCSM. In addition, the FCSM could suppress the influence of interfering features.
(3) The current public concrete crack segmentation datasets have too few samples, and the crack samples are very similar. Thus, these datasets may not fully reflect the crack scenarios in the real world. Therefore, this study creates a new dataset named QUCrack, which contains a large number of irregular cracks in a variety of environments.

The Structure of the LTNet
In this article, we propose a light-weight concrete crack segmentation model based on the Feature Sparse Choosing VIT (LTNet). As shown in Figure 1, the LTNet has a symmetrical structure, with the encoder on the left side and the decoder on the right side. In the encoder, the input crack image is first fed into stacked convolution layers for high-level feature extraction. These high-level features are then fed into the stacked Feature Sparse Choosing-based VITs (FSVITs) proposed by us. It is worth noting that the FSVIT is a light-weight VIT module. Like VIT, the FSVIT utilizes the Transformer's self-attention mechanism, which could capture global dependencies when processing features. This enables the FSVIT to better learn the feature relationships between different regions in crack images. Unlike VIT, the computational complexity of the FSVIT is much smaller. In the decoder, stacked convolution layers and up-sampling layers are used for non-linear feature transformation and feature size recovery. Up-sampling is commonly used to raise low-resolution images to a higher resolution in order to obtain more detailed information. Commonly used up-sampling methods include nearest neighbor interpolation, bilinear interpolation and bicubic interpolation. They estimate the value of new pixels based on the values of surrounding pixels, thereby increasing the resolution of the image.
In addition, to reduce the redundant features of the model, a Feature Channel Selecting Module (FCSM) is used to reduce the number of channel features. Note that Depthwise Separable Convolutions (DWConv) are substituted for traditional convolutions to further reduce computational complexity. These modules are described in detail in the following sections.
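The up-sampling step described for the decoder can be illustrated with nearest neighbor interpolation, the simplest of the methods listed above. The following is a minimal sketch in plain Python, not the LTNet decoder itself; the function name `upsample_nearest` and the integer `scale` parameter are assumptions introduced only for this illustration.

```python
def upsample_nearest(feature_map, scale=2):
    """Nearest neighbor up-sampling of a 2-D feature map (a list of rows).

    Hypothetical helper for illustration; `scale` is an integer factor.
    """
    if scale < 1:
        raise ValueError("scale must be a positive integer")
    upsampled = []
    for row in feature_map:
        # Repeat every pixel `scale` times along the width ...
        wide_row = [value for value in row for _ in range(scale)]
        # ... and repeat the widened row `scale` times along the height.
        for _ in range(scale):
            upsampled.append(list(wide_row))
    return upsampled


small = [[1, 2],
         [3, 4]]
large = upsample_nearest(small, scale=2)
for row in large:
    print(row)
```

Each source pixel is copied into a `scale` × `scale` block, so a 2 × 2 map becomes a 4 × 4 map. Bilinear and bicubic interpolation would instead blend neighboring pixel values.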

VIT (Vision Transformer) is a visual model based on the Transformer architecture. VIT treats an image as a sequence and converts it into input for the Transformer model. VIT uses a self-attention mechanism to capture the relationship between different crack positions in crack images, thereby better processing global information. VIT has strong generalization and transfer learning abilities, and can achieve good performance in crack segmentation tasks. However, due to the complexity of the VIT model, it requires a large number of parameters and computational resources for inference. To reduce the computational complexity of VIT, we propose a Feature Sparse Choosing VIT (FSVIT). In the proposed FSVIT, Embedded Patches are first fed into the Multi-Head Attention Block. The Multi-Head Attention Block maps input sequences into multiple different representations using multiple independent attention blocks, and concatenates these outputs to obtain the final attention representation of features. It is worth noting that after the Multi-Head Attention Block, a Depthwise Separable Convolution (DWConv) layer replaces all fully connected layers of the traditional VIT to significantly reduce its computational complexity. The DWConv decomposes the convolution operation into two independent steps: Depthwise Convolution and Pointwise Convolution. In the depthwise step, a single convolution kernel is applied to each feature channel, which greatly reduces the number of parameters and the computational complexity compared to fully connected layers. Then, the output features of the DWConv are fed into our Feature Space Choosing layer. Feature Space Choosing is the other key operation we use to reduce the computational complexity of VIT. In the Feature Space Choosing layer, the global value of each feature channel is calculated through Global Average Pooling (GAP). The layer then sorts these global values in descending order and outputs only the feature maps of the channels corresponding to the top 50% of the global values. In this way, the number of output feature channels gradually decreases as the number of stacked FSVITs increases, which reduces the computational complexity of VIT when stacked FSVITs are used. The FSVIT is shown in Figure 2.
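The channel selection performed by the Feature Space Choosing layer (GAP per channel, descending sort, keep the top 50%) can be sketched as follows. This is a minimal plain-Python illustration rather than the authors' implementation: the function names, the nested-list layout (channels, then rows), and the choice to return the kept channels in their original order are all assumptions.

```python
def global_average_pool(channel):
    """Mean of all values in one H x W channel (a list of rows)."""
    total = sum(sum(row) for row in channel)
    count = sum(len(row) for row in channel)
    return total / count


def feature_space_choosing(feature_maps, keep_ratio=0.5):
    """Keep the channels whose GAP value lies in the top `keep_ratio`.

    feature_maps: list of channels, each a list of rows.
    Returns (kept_channels, kept_indices). Hypothetical helper.
    """
    scores = [global_average_pool(channel) for channel in feature_maps]
    num_keep = max(1, int(len(feature_maps) * keep_ratio))
    # Rank channel indices by their global value, largest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Assumption: surviving channels keep their original relative order.
    kept = sorted(ranked[:num_keep])
    return [feature_maps[i] for i in kept], kept


# Four 2 x 2 channels with global values 1, 9, 5 and 3.
features = [
    [[1, 1], [1, 1]],
    [[9, 9], [9, 9]],
    [[5, 5], [5, 5]],
    [[3, 3], [3, 3]],
]
selected, indices = feature_space_choosing(features)
print(indices)
```

With `keep_ratio=0.5`, each call halves the channel count, which is how stacking several such layers progressively shrinks the feature tensor.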
Electronics 2024, 13, x FOR PEER REVIEW
The calculation of the whole process of the FSVIT is given in Equation (1), where GAP(x) represents the Global Average Pooling operation and DWConv(x) represents the Depthwise Separable Convolution.
To illustrate the effectiveness of our FSVIT, its computational complexity is compared with that of the traditional VIT. M represents the input feature channel of the FSVIT and N represents the output feature channel of the FSVIT, where M is much smaller than N. W_Q, W_K and W_V represent the self-attention weights in Multi-Head Attention.
The parameters in the FSVIT could be calculated as follows:
(3 …
In traditional VIT, there usually exist two fully connected layers, with 128 neurons in each layer. In the FSVIT, a single Depthwise Separable Convolution is substituted for these two fully connected layers. The parameters in traditional VIT could be calculated as follows:
(M * 128 …
It can be seen that the parameter count in (2) is much smaller than that in (3). Therefore, compared with the traditional VIT, the proposed FSVIT removes a large number of parameters.
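To make the parameter saving concrete, the sketch below counts the weights of two fully connected layers with 128 neurons each and of one depthwise separable convolution. It is an illustration under stated assumptions and is not a reconstruction of Equations (2) and (3): the 3 × 3 depthwise kernel, the absence of bias terms, the example values M = 64 and N = 128, and the function names are all hypothetical. The attention weights W_Q, W_K and W_V are left out because they appear in both models.

```python
def fc_pair_params(in_channels, hidden=128):
    """Weights of two fully connected layers with `hidden` neurons each.

    Layer 1 maps in_channels -> hidden, layer 2 maps hidden -> hidden.
    Bias terms are ignored (assumption).
    """
    return in_channels * hidden + hidden * hidden


def dwconv_params(in_channels, out_channels, kernel_size=3):
    """Weights of a depthwise separable convolution (no bias, assumption).

    Depthwise step: one k x k kernel per input channel.
    Pointwise step: a 1 x 1 convolution mixing channels.
    """
    depthwise = kernel_size * kernel_size * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


M, N = 64, 128  # example channel counts, chosen only for illustration
fc = fc_pair_params(M)      # 64*128 + 128*128
dw = dwconv_params(M, N)    # 3*3*64 + 64*128
print(fc, dw, round(dw / fc, 3))
```

Under these assumptions the depthwise separable layer needs roughly a third of the weights of the fully connected pair, and the gap widens as the hidden width grows.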

The Structure of the Feature Channel Selecting Module
Feature channel selection refers to selecting which channels (or feature maps) of a convolutional neural network to use for subsequent processing and analysis. Feature channel selection has the following advantages:
(1) Reducing computational complexity: In some cases, the number of channels of the input feature maps may be very large, resulting in higher computational complexity. By selecting specific channels, the number of channels that need to be processed can be reduced, thereby reducing computational complexity. This helps to improve the efficiency and speed of the model.
(2) Improving the generalization ability of the model: In some cases, certain channels may not have significant discriminative ability for specific tasks. By selecting channels with higher discrimination, the model's generalization ability can be improved so that it adapts better to new data beyond the training data.
(3) Reducing the risk of over-fitting: Excessive feature channels may increase the complexity of the model, which can easily lead to over-fitting. By selecting the most representative feature channels, the complexity of the model and the risk of over-fitting can be reduced, and the generalization ability of the model can be improved.
Based on the principle of feature channel selection, we propose a Feature Channel Selecting Module (FCSM). The key operation of the proposed FCSM is the Channel Sparse Choosing operation. In this operation, each channel's feature corresponds to a scaling factor α, which is multiplied with the channel feature matrix. Next, the modified loss function L_sum is used to jointly train the network weights and these scaling factors α. Then, the channels with scaling factors approaching zero are pruned and the pruned network is fine-tuned. Finally, the features output from the Channel Sparse Choosing operation are shuffled and concatenated together. The L_sum is given in Equation (4), where (x, y) is the training input (crack image) and training label, W is the trainable weight of the network, MSE is the mean square error loss, f(x, W) is the output of the model for input image x and weight parameters W, α is the channel scaling factor and λ is the balance factor. It can be seen from (4) that L1 regularization is used to sparsify the feature channels. L1 regularization can drive certain weights to exactly zero, thus enabling feature weight selection. This means that L1 regularization can automatically identify and remove features unrelated to crack targets, making the model more concise and interpretable. Through L1 regularization, the model becomes sparser, which helps to reduce the model complexity, reduce the risk of over-fitting and improve the model's generalization ability. The process of the FCSM is shown in Figure 3.
The calculation of the whole process of the FCSM is given in Equation (5), where ChannelSplit(x) represents splitting feature x into separate channels, ChannelSparseChoosing(x) represents the Channel Sparse Choosing operation and Concatenate(x) represents feature concatenation, i.e., the fusion of features along the channel dimension. Shuffle(x) represents shuffling feature x randomly, which reorders the features by channel at random. Usually, the Channel Sparse Choosing operation retains only 30% of the features. Therefore, the number of channels of the original features decreases significantly after being processed by the FCSM, and the FCSM could reduce computational complexity while reducing the risk of over-fitting.
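The two ingredients of the Channel Sparse Choosing operation, a loss with an L1 penalty on the scaling factors α and the pruning of channels whose factors are close to zero, can be sketched as below. This is one plausible reading of the description (an MSE data term plus λ times the sum of |α|), not a reproduction of Equation (4). The function names, the flat-list inputs and the rule of keeping the 30% of channels with the largest |α| are assumptions made for illustration.

```python
def l_sum(predictions, targets, alphas, lam=1e-4):
    """MSE data term plus an L1 penalty on the channel scaling factors.

    Assumed form: MSE(f(x, W), y) + lam * sum(|alpha|).
    """
    n = len(predictions)
    mse = sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n
    l1_penalty = sum(abs(a) for a in alphas)
    return mse + lam * l1_penalty


def channel_sparse_choosing(alphas, keep_ratio=0.3):
    """Indices of channels kept after pruning the smallest |alpha| values."""
    num_keep = max(1, round(len(alphas) * keep_ratio))
    # Rank channels by the magnitude of their scaling factor, largest first.
    ranked = sorted(range(len(alphas)), key=lambda i: abs(alphas[i]), reverse=True)
    return sorted(ranked[:num_keep])


# Ten channels; after sparse training most scaling factors are near zero.
alphas = [0.9, 0.0, 0.02, 0.7, 0.01, 0.0, 0.5, 0.03, 0.0, 0.04]
kept = channel_sparse_choosing(alphas)
loss = l_sum([1.0, 0.0], [1.0, 1.0], [1.0, -1.0], lam=0.5)
print(kept, loss)
```

The L1 term penalizes every non-zero α equally per unit of magnitude, which is what pushes unimportant channels toward zero during training and makes the later threshold-style pruning meaningful.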

Our Collected Dataset
The current public concrete crack detection datasets have too few samples, and the crack samples are very similar. These datasets may not fully reflect the crack segmentation scenarios in the real world. For example, they may not capture subtle changes, noise, lighting conditions and other real-world factors. In addition, the existing public crack segmentation datasets lack diversity, meaning that they may not cover the various types, shapes and sizes of cracks. This may lead to poor performance when the model faces unseen crack samples, as it does not have enough diverse data for learning and generalization. In response to this situation, we collected our own concrete crack dataset and named it the QUCrack dataset. Our dataset includes concrete cracks in various environments, such as concrete house walls, concrete molds and concrete roads. Our data also cover multiple environmental interference scenarios, such as rainy days, snowy days, shadows, strong light and nights, in order to get closer to real crack scenarios.
Some examples of our collected crack dataset are shown in Figure 4.

Datasets
Table 1 shows the splitting of the three datasets. The detailed descriptions are as follows:
The QUCrack dataset: This dataset was collected by us in multiple scenarios, including concrete cracks in various environments, such as concrete house walls, concrete molds and concrete roads. Our dataset covers multiple environmental interference scenarios such as rainy days, snowy days, shadows, strong light and nights, in order to get closer to real crack scenarios.

Experimental Setup
In our experiments, all images are normalized and augmented before being fed into the model. In the encoder, the number of DWConv convolution filters is set to 64, 128, 256, 512, 1024, 1024, 2048, 2048. In the decoder, the number of convolution filters is set to 2048, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 512, 512, 512, 512, 256, 256, 256, 256, 128, 128, 128, 64, 64. Stochastic Gradient Descent (SGD) [21] is used to train our model and E-Focal Loss is used as the loss function. In addition, accuracy, recall and F1 measure are used as the evaluation criteria.
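For reference, the evaluation criteria named above can be computed from a predicted and a ground-truth binary mask as in the following sketch. The definitions used here (pixel accuracy, recall over crack pixels, and F1 as the harmonic mean of precision and recall) are the standard ones and are an assumption about how the paper computes them; the function name and the nested-list mask format are hypothetical.

```python
def segmentation_metrics(pred_mask, true_mask):
    """Pixel-wise accuracy, precision, recall and F1 for binary masks.

    Masks are lists of rows; a truthy value marks a crack pixel.
    """
    tp = fp = fn = tn = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for pred, true in zip(pred_row, true_row):
            if pred and true:
                tp += 1          # crack pixel correctly detected
            elif pred and not true:
                fp += 1          # background labelled as crack
            elif not pred and true:
                fn += 1          # crack pixel missed
            else:
                tn += 1          # background correctly rejected
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


predicted = [[1, 1],
             [0, 0]]
ground_truth = [[1, 0],
                [1, 0]]
metrics = segmentation_metrics(predicted, ground_truth)
print(metrics)
```

Because crack pixels are usually a small fraction of an image, recall and F1 tend to be more informative than pixel accuracy alone.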

Comparison with the State-of-the-Art Methods
To evaluate the performance of our LTNet model, we designed a series of comparative experiments. Accuracy, recall and F1 values are used as the basic evaluation metrics. In addition, several mainstream crack segmentation models, such as ConvNet, are adopted as comparative algorithms. From Table 2, it can be seen that our LTNet achieves the best performance in accuracy, recall and F1 measure, about 3–8% higher than the current mainstream algorithms. Moreover, our model size is only 2 M, so the model could be used in portable devices. These experiments demonstrate the effectiveness of our algorithm.
Compared with our previously proposed LCSNet, our algorithm achieves a 3% higher accuracy, recall and F1 measure with the same model size. The reason is that stacked VIT modules are used in our LTNet. The VIT module has the following advantages: (1) Traditional convolutional neural networks capture local features of images through local convolution operations, while the VIT module achieves global perception through the self-attention mechanism, modeling the entire image and better capturing its global features. (2) Due to the strong continuity of cracks, there is a certain correlation between cracks in different positions in crack images. The VIT module uses the Transformer's self-attention mechanism to process image patches. In the self-attention mechanism, each embedded patch interacts with the others, capturing the positional relationships of cracks by calculating attention weights. In this way, the LTNet could learn correlations between cracks in different positions.
Compared with traditional attention mechanism-based models, PSNet and PHCNet achieve better performance because crack images have strong irregularity and rich spatial features. The Pyramid Hierarchical Convolutional Module (PHCM) in PHCNet extracts multi-scale feature information from the perspective of convolutional filters, while the Parallel Convolutional Module (PCM) in PSNet extracts multi-scale feature information from the perspective of convolutional layers. Both of these multi-scale designs work well for learning the spatial features of cracks. In addition, PHCNet includes an edge feature extractor to learn the edge features of cracks. Therefore, PSNet and PHCNet perform better than ordinary attention mechanism models. In our LTNet model, however, the spatial features of cracks are learned through a unified VIT architecture, so our model does not require specially designed multi-scale feature extractors.
Compared with traditional UNet and ConvNet models, models with attention mechanisms, such as two-stage-CNN, FU-Net, ECA-Net, ACAU-Net, DMA-Net and Split-Attention Network, achieve better performance. The reason is that the attention mechanism can dynamically allocate attention weights based on the input contextual information, allowing the model to focus on important information and ignore irrelevant information more accurately. This improves the effective utilization of features and thereby enhances performance.
The prediction results of our LTNet model are shown in Figure 5, with crack images selected from Crack500 as the display images. The red boxes in the figure mark the differences between the prediction results of the different algorithms.
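The interaction between embedded patches discussed in this section, where every patch attends to every other patch through attention weights, can be illustrated with a minimal single-head scaled dot-product self-attention. This is a simplified sketch and not the Multi-Head Attention Block of the LTNet: it uses identity projections (queries, keys and values are the patch embeddings themselves) instead of learned W_Q, W_K and W_V matrices, and the function names and toy embeddings are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(patches):
    """Single-head scaled dot-product self-attention.

    Simplification (assumption): Q = K = V = patches, no learned weights.
    Returns (outputs, weights); weights[i][j] is how much patch i attends
    to patch j.
    """
    dim = len(patches[0])
    outputs, weights = [], []
    for query in patches:
        # Similarity of this patch to every patch, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in patches]
        attn = softmax(scores)
        weights.append(attn)
        # Output is the attention-weighted average of all patch embeddings.
        outputs.append([sum(w * value[d] for w, value in zip(attn, patches))
                        for d in range(dim)])
    return outputs, weights


# Two similar "crack" patches and one dissimilar "background" patch.
patches = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
outputs, weights = self_attention(patches)
print([round(w, 3) for w in weights[0]])
```

In this toy example the first patch places more weight on the patch that resembles it than on the dissimilar one, which is the mechanism by which distant but related crack regions can reinforce each other.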

Effects of Using Different Numbers of FSVIT
To evaluate the effect of using different numbers of FSVITs, an experiment is conducted.
From Figure 6, it can be seen that the number of FSVITs has a significant impact on the final accuracy of the model. As the number of FSVITs increases, the accuracy of the model initially improves rapidly. The reason is as follows: due to the shape of cracks and the continuity of their edges, there may be a dependency relationship between cracks in different positions in the image. The Transformers in the FSVIT use the self-attention mechanism to capture the dependency relationships between different positions in the input sequence, so this mechanism is particularly effective for processing crack image data. However, when there are too many FSVITs, the accuracy of the model begins to decline slowly. The reason is that excessive use of the FSVIT leads to redundant parameters. For example, redundant FSVITs serve only as non-linear transformations rather than as feature extractors, so their parameters are redundant. These redundant parameters may lead to over-fitting, where the model performs well on training data but generalizes poorly to testing data. As a result, the model becomes overly sensitive to noise and subtle differences in the training data, and cannot generalize accurately to new data. Therefore, 12 FSVITs are selected for the final model.

Effects of Using Different Numbers of FCSM
To evaluate the effect of using different numbers of FCSMs, an experiment is conducted.
From Figure 7, it can be seen that the number of FCSMs has a significant impact on the final accuracy of the model. As the number of FCSMs increases, the accuracy of the model initially improves rapidly.
The reason is that by selecting appropriate feature channels, feature information useful for the task can be extracted, thereby improving the performance and generalization ability of the model. Different channels may respond strongly to different features, and selecting the appropriate channels helps the model capture key image features. By selecting channels with strong responses to cracks, the model can locate and identify cracks more accurately. Additionally, selecting appropriate feature channels can reduce the interference of noise. The sensitivity of different channels to noise may vary, and choosing channels that are not sensitive to noise can improve the accuracy of crack detection models.
However, using too many FCSMs decreases the accuracy of the model. The reason is that pruning too many feature channels may lead to the loss of key crack features. For example, pruning some channels that respond to non-crack areas may leave the model unable to distinguish accurately between cracks and non-cracks, thereby reducing the accuracy of the crack segmentation model.
Thus, the number of FCSMs is set to 4.

Comparison of Different Light-Weight Segmentation Models
To evaluate the effectiveness of our light-weight model, several other light-weight segmentation models are compared with our proposed LTNet.
From Table 3, it can be seen that our LTNet achieves the highest accuracy, owing to its use of a large number of light-weight VIT modules. The VIT module extracts the associated features of cracks at different positions in the crack image, and can therefore better learn the detailed feature information of cracks. In contrast, the other light-weight models use only ordinary spatial channel attention mechanisms, which merely suppress interfering features in crack images and therefore offer a limited representation of crack features.

Conclusions
Presently, portable crack measurement devices are being rapidly developed. These devices are usually small, light-weight and easy to operate, and can be used for measurement in complex environments. Portable crack measurement devices can display measurement results in real time, helping users to quickly understand the condition of cracks and take the necessary measures in a timely manner. In addition, these devices are usually equipped with high-precision cameras, which can accurately measure the size, depth and shape of cracks, providing reliable measurement results. However, current portable crack measurement devices must rely on cloud computing because crack image segmentation models are so large. Cloud computing requires a stable network connection between the devices and the cloud servers. If the network is unstable or interrupted, data transmission may be interrupted or delayed. Moreover, cloud computing consumes a significant amount of server resources, resulting in high costs for portable crack measurement devices.
To address the above issue, we designed a light-weight crack image segmentation model, named LTNet, which can be used in portable crack measurement devices. To capture the feature relationships between different crack positions in crack images, stacked VIT modules were adopted in the design of LTNet. To reduce the computational complexity of the VIT module, Depthwise Separable Convolution was used to substitute the fully connected layers. In addition, the Feature Sparse Choosing layer was adopted to select channels for crack features and reduce the number of channels of these features. Finally, a Feature Channel Selecting Module (FCSM) was designed to select channel features in the decoder of LTNet. The FCSM not only reduces the size of the features, but also suppresses interfering features.
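As a rough illustration of why depthwise separable convolutions shrink a model, the sketch below counts the weights of a standard convolution and of its depthwise separable counterpart. The channel counts and kernel size are illustrative assumptions, not values reported for LTNet, and bias terms are ignored.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
c_in, c_out, k = 64, 128, 3
standard = standard_conv_params(c_in, c_out, k)
separable = depthwise_separable_params(c_in, c_out, k)
ratio = separable / standard  # equals 1 / c_out + 1 / k**2
```

For these assumed sizes the standard layer has 73,728 weights and the depthwise separable layer has 8,768, roughly 12% of the original.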
In addition, to address the shortcomings of existing crack image segmentation datasets, we created a new dataset, named QUCrack, which contains a large number of irregular cracks in a variety of environments.
A series of experiments was designed to validate our proposed LTNet. The experimental results show that LTNet achieves an accuracy of 0.887, 0.817 and 0.693, and a recall of 0.882, 0.805 and 0.681 on the three datasets, respectively, which is 3–8% higher than current mainstream algorithms. At the same time, the model size of our LTNet is only 2 M; thus, our work achieves its goal.
Although our LTNet addresses the problem of crack measurement, evaluating and repairing these cracks is an emerging field that is still in the research and development stage. For example, machine learning and deep learning algorithms can be used to analyze crack data and make predictions from it. By learning from a large amount of crack data, artificial intelligence can help predict the propagation and failure rate of cracks, and provide more accurate repair suggestions. Moreover, by combining artificial intelligence and robotics, an autonomous robot system could be developed that automatically detects and repairs cracks. Such robots could detect cracks through vision and other sensors, and repair them using laser, spray, or other repair methods. In addition, artificial intelligence can help design and develop intelligent materials and coatings that automatically perceive and repair cracks. These materials and coatings could release repair agents or fillers based on the location and size of cracks, achieving self-healing functions.

Figure 1. The structure of the LTNet.

2.2. The Feature Sparse Choosing VIT Module
VIT (Vision Transformer) is a visual model based on the Transformer architecture.

Figure 2. The structure of the Feature Sparse Choosing VIT Module (FSVIT).

Figure 3. The structure of the Feature Channel Selecting Module (FCSM).
ChannelSplit(x) represents splitting feature x into separate channels, ChannelSparseChoosing(x) represents the Channel Sparse Choosing operation, and Concatenate(x) represents feature concatenation, that is, the fusion of features along the channel dimension. Shuffle(x) represents randomly shuffling feature x, which reorders its channels at random.
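The four operations can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: a feature is modelled as a list of 2D channels, the selection rule inside `channel_sparse_choosing` (keep the channels with the largest mean absolute response) is a hypothetical stand-in for the Channel Sparse Choosing operation, and the order in which `fcsm_sketch` chains the operations is likewise assumed.

```python
import random


def channel_split(x):
    """Split a feature (list of 2D channels) into single-channel features."""
    return [[channel] for channel in x]


def channel_sparse_choosing(features, keep):
    """Assumed rule: keep the `keep` features with the largest mean |response|."""
    def score(feature):
        values = [abs(v) for row in feature[0] for v in row]
        return sum(values) / len(values)

    ranked = sorted(range(len(features)), key=lambda i: score(features[i]), reverse=True)
    chosen = sorted(ranked[:keep])  # preserve the original channel order
    return [features[i] for i in chosen]


def concatenate(features):
    """Fuse features along the channel dimension."""
    return [channel for feature in features for channel in feature]


def shuffle(x, rng):
    """Return a copy of x with its channels in random order."""
    shuffled = list(x)
    rng.shuffle(shuffled)
    return shuffled


def fcsm_sketch(x, keep, rng):
    """Assumed composition: split, choose, concatenate, then shuffle."""
    chosen = channel_sparse_choosing(channel_split(x), keep)
    return shuffle(concatenate(chosen), rng)


# Toy feature with four 2 x 2 channels; the values are made up for illustration.
x = [
    [[0.1, 0.1], [0.1, 0.1]],
    [[-2.0, -2.0], [-2.0, -2.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.5, 1.5], [1.5, 1.5]],
]
out = fcsm_sketch(x, keep=2, rng=random.Random(0))
```

In this toy input the second and fourth channels have the strongest responses, so they are the two that survive; only their order in `out` depends on the random seed.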


Figure 4. Some examples of images from our QUCrack dataset. (a) Crack on concrete walls. (b) Crack on concrete columns. (c) Crack on concrete roads.
Crack500 dataset [20]: This dataset includes 3020 images collected at Temple University, mainly captured on campus by students. The images come in two sizes: 1440 × 2560 and 2560 × 1440.

: a deep convolution-based segmentation neural network; this is the basic convolution-based segmentation model.
DWTA-U-Net: a U-Net-based network with discrete wavelet transformed image features for concrete crack segmentation.
CrackW-Net: a ResU-Net-based CNN for pavement crack segmentation proposed by Han.
Split-Attention Network: a channel-wise attention-based network.
DMA-Net: DeepLab with multi-scale attention for pavement crack segmentation proposed by Sun.
ACAU-Net: an atrous convolution and attention U-Net model for pavement crack segmentation proposed by Feng.
Cascaded Attention DenseU-Net: an attention-based network with global attention and core attention for road crack detection.
ECA-Net: a light-weight channel attention-based convolutional neural network.
FU-Net: a generative adversarial network-based U-Net for road crack segmentation proposed by Gao.
Two-stage CNN: a two-stage CNN for road crack detection and segmentation proposed by Nhung.
PSNet: a parallel convolution-based U-Net for crack detection with a self-gated attention block proposed by Zhang.
PHCNet: a pyramid hierarchical convolution-based U-Net for crack detection with a mixed global attention module and an edge feature extractor proposed by Zhang.
LCSNet: a light-weight convolution-based segmentation method with a separable multi-directional convolution module for concrete crack segmentation proposed by Zhang.

Figure 5. An example comparing our proposed LTNet with state-of-the-art results; the example crack image is taken from the Crack500 dataset.

Figure 6. Accuracy comparison using different numbers of FSVIT.

Figure 7. Accuracy comparison using different numbers of FCSM.

Table 1. Splitting of the datasets.

Table 3. Accuracy comparison with the state-of-the-art light-weight models.