LCA-YOLOv8-Seg: An Improved Lightweight YOLOv8-Seg for Real-Time Pixel-Level Crack Detection of Dams and Bridges

Abstract: Remotely operated vehicles (ROVs) and unmanned aerial vehicles (UAVs) provide a solution for acquiring structural health information on dams and bridges, but they also raise the problem of extracting damage-related information effectively. Vision-based crack detection methods can replace traditional manual inspection and achieve fast and accurate crack detection. This paper therefore proposes a lightweight, real-time, pixel-level crack detection method based on an improved instance segmentation model. A lightweight backbone and a novel efficient prototype mask branch are proposed to decrease the complexity of the model while maintaining its accuracy. The proposed method attains an accuracy of 0.945 at 129 frames per second (FPS). Moreover, our model has a smaller volume and lower computational requirements, and is suitable for low-performance devices.


Introduction
The rapid development of the water industry has seen more and more bridges and dams being constructed. These structures are susceptible to cracks due to adverse factors such as ageing materials, hydraulic fracturing and chemical corrosion by water, which in turn accelerate the damage [1,2]. Reliable and effective detection of cracks, together with reinforcement and repair, is essential to ensure the proper use of these structures [3]. Manual inspection is the traditional solution for detecting cracks. However, the number of bridges and dams is large, detecting cracks in the main structure of a bridge requires aerial work, and detecting underwater cracks in a dam requires emptying the reservoir, which makes traditional manual inspection time-consuming and a safety risk.
The need for aerial and underwater operations has led to the development of remotely operated vehicles (ROVs) and unmanned aerial vehicles (UAVs) [4-6]. ROVs and UAVs are often equipped with high-resolution visible light cameras, self-contained light sources and some data storage. They are capable of replacing manual inspection for a wide range of underwater and aerial operations in harsh environments. Figure 1 shows a remotely operated vehicle in operation. During a complete ROV or UAV inspection mission, numerous images and videos related to structural damage can be obtained. However, relying solely on manual observation to extract damage-related information from this video data is costly. In addition, the accuracy of manual observation depends on subjective human judgement, and the complex underwater filming environment leads to a high rate of misjudgment. Combining ROVs and UAVs with computer-vision-based crack detection methods can replace traditional manual inspection and achieve fast and accurate crack detection.
Vision-based crack detection methods fall into two routes, one based on image processing techniques and the other on deep learning techniques. Image-processing-based crack detection techniques often binarize image pixels according to specific rules in order to distinguish cracked areas from non-cracked areas. Reference [7] summarises the history and implementation of image-processing-based crack detection methods and divides traditional rule-driven crack detection methods into threshold-based and edge-based methods. Traditional crack detection algorithms usually require pre-processing of images, such as denoising, and it is difficult for a single algorithm to extract crack features accurately. A combination of multiple algorithms is often required, which is computationally expensive, slow and not capable of real-time detection.
The rapid development of deep learning techniques has led to rapid progress in computer vision tasks such as object detection [8], semantic segmentation [9] and instance segmentation [10,11]. Deep-learning-based crack detection methods have also developed rapidly. These algorithms achieve accurate detection of cracks in buildings by learning crack features from a large number of crack images and capturing the features of cracks in different forms and different contexts. This approach has the advantages of high accuracy and good real-time performance. Reference [12] proposed an automatic concrete defect identification method based on convolutional neural network models and interpreted the obtained results in a form that is humanly explainable. Reference [13] proposed a concrete defect identification method based on a one-stage object detection model, which had high accuracy and real-time detection capabilities. References [14,15] provide detailed evaluations of the performance of object detection models and semantic segmentation models for automated detection. Reference [16] proposed a crack detection method based on frequency-domain images and one-dimensional convolutional neural networks. It extracts image patches with a sliding window, classifies cracks in each patch and then stitches the output image together, but it is slow, taking 5-7 s per image. Reference [17] designed a crack detection method based on a semantic segmentation algorithm, which enables pixel-level detection of cracked regions. Their method performs pixel classification of the full image, which loses background information of images, and it does not have real-time detection capabilities. References [18,19] proposed crack detection methods based on you only look once (YOLO). Their methods make some improvements to the algorithms and are able to label cracks using a bounding box in real time. The introduction of a transformer [20] into YOLOv5 can improve the performance of the model; however, the transformer, being based on the self-attention mechanism, is computationally intensive, and introducing it increases the model size and the inference time, which is not cost-effective. Reference [21] proposed crack detection methods based on object detection and semantic segmentation, respectively. Using the dense annotation method for labelling leads to dense target boxes in the predicted result image, which affects the presentation of cracked regions in the image. The semantic-segmentation-based crack detection method is able to detect cracked regions at the pixel level, but classifying the full image pixels leads to the loss of image background information. Since cracks vary in severity and have random shapes, and many cracks are narrow with long trajectories, simply framing them with a bounding box gives a weak detection result that is not enough to display the size and track of the cracks. The instance segmentation algorithm combines the features of both object detection and semantic segmentation: it is able to box objects and classify the object class at the pixel level, which is more suitable for the surface crack detection task. Reference [22] proposed a crack detection method based on Mask R-CNN [23], but the Mask R-CNN model is complex and computationally intensive, and it is not capable of real-time detection on low-performance devices.
To address the above issues, this study proposes a crack detection method based on an improved instance segmentation model: LCA-YOLOv8-seg. With high accuracy, low computation and small size, our model is friendly to low-performance hardware and facilitates further deployment to mobile devices. Our model uses a lighter LCANet backbone, which is based on depthwise separable convolution and the efficient channel attention (ECA) [24] mechanism, with the advantages of light weight and high accuracy. In addition, this study optimises the head structure of YOLOv8-seg by adopting a new module, ProtoC1, which reduces the computational cost of the model without affecting the accuracy of the mask. Our model shows a slight decrease in the mAP$_{0.5}$ metric compared to the baseline model YOLOv8n-seg, while the parameters, weights and GFLOPs of our model all decrease substantially. At the same time, to further improve the robustness of the model and reduce training costs, transfer learning is introduced.
The contributions of this study can be summarised as follows:
• A crack detection method based on an improved one-stage instance segmentation model, LCA-YOLOv8n-seg, is proposed. Our method is able to frame cracks and depict crack regions at the pixel level. Our method is real-time, highly accurate, small in volume and friendly to low-performance devices.
• A new backbone network, LCANet, and a novel ProtoC1 module are proposed, which together reduce the model volume drastically while retaining high detection accuracy.

Method
Our proposed real-time crack detection method is based on an improved YOLOv8n-seg model, LCA-YOLOv8n-seg, which enables real-time crack detection and depicts the crack area with pixel-level precision. The LCA-YOLOv8n-seg model has the advantages of small size, high detection accuracy and low detection delay. Figure 2 shows the overall structure of our method. As shown in Figure 2, the first step is to build a crack dataset, which includes thousands of crack images of bridges and dams, and then divide the dataset into training, validation and test sets. The training and validation sets are passed to a preprocessing stage that marks the cracked regions of each image, and data augmentation is applied only to the training set. The pre-trained weights of the model were obtained through transfer learning; then, the training and validation process of the crack detection model LCA-YOLOv8-seg was carried out, and the performance of the method was finally evaluated on the test set.
The YOLO (you only look once) series algorithms are one-stage object detection algorithms with the advantages of fast detection speed and high accuracy. The latest algorithm of the YOLO series is YOLOv8 [25], which introduces a series of changes: the C3 structure is replaced by the C2f structure with a richer gradient flow; the head is replaced by the current mainstream decoupled head structure and is changed from anchor-based to anchor-free; and the task-aligned assigner and distribution focal loss are introduced in the loss calculation. These changes have greatly improved the detection accuracy of YOLOv8.
YOLOv8-seg is the instance segmentation model of YOLOv8. Compared with the object detection model, the instance segmentation model has a prototype mask branch and mask coefficients in the head structure, which are used to generate an instance mask. This method was proposed by YOLACT [26]. YOLOv8-seg is available in five sizes: YOLOv8n-seg, YOLOv8s-seg, YOLOv8m-seg, YOLOv8l-seg and YOLOv8x-seg. We chose the smallest model, YOLOv8n-seg, as the baseline.
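To make the role of the prototype masks and mask coefficients concrete, the following is a minimal sketch of YOLACT-style mask assembly: each instance mask is a sigmoid of the linear combination of the k prototype masks weighted by that instance's k coefficients. The function name `assemble_mask`, the nested-list representation and the 0.5 threshold are assumptions made for illustration; cropping the mask to the predicted box is omitted, and this is not the implementation used in YOLOv8-seg.

```python
import math


def assemble_mask(prototypes, coeffs, threshold=0.5):
    """Combine k prototype masks with k per-instance coefficients.

    prototypes: list of k maps, each a list of rows of floats (H x W).
    coeffs:     list of k floats predicted for one detected instance.
    Returns (probabilities, binary_mask), both H x W nested lists.
    """
    k = len(prototypes)
    if k != len(coeffs):
        raise ValueError("need one coefficient per prototype")
    height = len(prototypes[0])
    width = len(prototypes[0][0])

    probs = []
    for y in range(height):
        row = []
        for x in range(width):
            # Linear combination of the prototypes at this pixel.
            logit = sum(coeffs[i] * prototypes[i][y][x] for i in range(k))
            # Sigmoid squashes the combination to a per-pixel probability.
            row.append(1.0 / (1.0 + math.exp(-logit)))
        probs.append(row)

    # Hypothetical threshold: pixels above it are assigned to the instance.
    binary = [[1 if p > threshold else 0 for p in row] for row in probs]
    return probs, binary


if __name__ == "__main__":
    protos = [[[1.0, -1.0]], [[1.0, 1.0]]]  # k = 2 prototypes of size 1 x 2
    probs, binary = assemble_mask(protos, [1.0, -1.0])
    print(probs, binary)
```

Because the prototypes are shared by all instances in an image, the per-instance cost is only the k coefficients and one weighted sum, which is why the cost of the prototype branch itself matters for a lightweight model.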

LCA-YOLOv8-Seg
The specific structure of the LCA-YOLOv8n-seg model is shown in Figure 3. LCA-YOLOv8n-seg adopts a new lightweight backbone network, LCANet, combined with a path aggregation feature fusion structure, to achieve effective extraction and fusion of multi-level image features. Meanwhile, we designed a novel efficient prototype mask branch, ProtoC1, which has fewer parameters and requires fewer calculations. By using the new lightweight backbone network and the more efficient ProtoC1 module, the volume and inference time of the model were reduced.

New Backbone: Lightweight Channel Attention Network (LCANet)
In order to reduce the model volume while maintaining high detection accuracy, we designed a new lightweight backbone: LCANet. The structure of LCANet is shown in Figure 3.
A standard convolutional layer takes as input an $H_A \times W_A \times M$ feature map $A$ and produces an $H_B \times W_B \times N$ feature map $B$, where $H_A$, $W_A$ are the spatial height and width of the input feature map, $M$ is the number of input channels, $H_B$, $W_B$ are the spatial height and width of the output feature map and $N$ is the number of output channels. The standard convolutional layer is parameterized by a filter $K$ of size $H_K \times W_K \times M \times N$, where $H_K$, $W_K$ is the size of the convolving kernel, $M$ is the number of input feature map channels and $N$ is the number of filters and output feature map channels. The output feature map of a standard convolution is usually calculated as:

$$B_{k,l,n} = \sum_{i,j,m} K_{i,j,m,n} \cdot A_{k+i-1,\,l+j-1,\,m} \tag{1}$$

The computational cost of a standard convolution is:

$$H_K \cdot W_K \cdot M \cdot N \cdot H_A \cdot W_A \tag{2}$$

where the computational cost depends multiplicatively on the number of input channels $M$, the number of output channels $N$, the kernel size $H_K \times W_K$ and the feature map size $H_A \times W_A$.
Depthwise convolution with one filter per input channel (input depth) can be written as:

$$\hat{B}_{k,l,m} = \sum_{i,j} \hat{K}_{i,j,m} \cdot A_{k+i-1,\,l+j-1,\,m} \tag{3}$$

where $\hat{K}$ is the depthwise convolutional kernel of size $H_K \times W_K \times M$, and the $m$-th filter in $\hat{K}$ is applied to the $m$-th channel in $A$.
Depthwise convolution has a computational cost of:

$$H_K \cdot W_K \cdot M \cdot H_A \cdot W_A \tag{4}$$

The 1 × 1 (pointwise) convolution has a computational cost of:

$$M \cdot N \cdot H_A \cdot W_A \tag{5}$$

The combination of depthwise convolution and 1 × 1 (pointwise) convolution is called depthwise separable convolution, and it has a computational cost of:

$$H_K \cdot W_K \cdot M \cdot H_A \cdot W_A + M \cdot N \cdot H_A \cdot W_A \tag{6}$$

which is the sum of the costs of the depthwise and 1 × 1 pointwise convolutions. By expressing convolution as a two-step process of filtering and combining, we get a reduction in computation of:

$$\frac{H_K \cdot W_K \cdot M \cdot H_A \cdot W_A + M \cdot N \cdot H_A \cdot W_A}{H_K \cdot W_K \cdot M \cdot N \cdot H_A \cdot W_A} = \frac{1}{N} + \frac{1}{H_K \cdot W_K} \tag{7}$$

In DWConv blocks, we use depthwise separable convolutions with kernel sizes 3 and 5, which results in 8-9 times less computation than standard convolutions. The specific structure of LCANet is shown in Table 1. After replacing the backbone with LCANet, the accuracy of the model decreases slightly, but the model volume becomes smaller.
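The cost expressions for standard and depthwise separable convolution can be checked numerically with a short sketch. The function names and the example layer sizes (64 input channels, 128 output channels, a 56 × 56 feature map) are hypothetical and are not taken from LCANet; the formulas are the multiply-accumulate counts described in the text.

```python
def standard_conv_cost(h_k, w_k, m, n, h_a, w_a):
    """Multiply-accumulates of a standard convolution: H_K*W_K*M*N*H_A*W_A."""
    return h_k * w_k * m * n * h_a * w_a


def depthwise_separable_cost(h_k, w_k, m, n, h_a, w_a):
    """Depthwise cost (one filter per input channel) plus 1x1 pointwise cost."""
    depthwise = h_k * w_k * m * h_a * w_a
    pointwise = m * n * h_a * w_a
    return depthwise + pointwise


def reduction_ratio(h_k, w_k, m, n, h_a, w_a):
    """Separable cost divided by standard cost; equals 1/N + 1/(H_K*W_K)."""
    return (depthwise_separable_cost(h_k, w_k, m, n, h_a, w_a)
            / standard_conv_cost(h_k, w_k, m, n, h_a, w_a))


if __name__ == "__main__":
    # Hypothetical layer: 3x3 kernel, 64 -> 128 channels, 56x56 feature map.
    ratio = reduction_ratio(3, 3, 64, 128, 56, 56)
    print(f"cost ratio: {ratio:.4f}, speed-up: {1 / ratio:.2f}x")
```

For this hypothetical 3 × 3 layer the ratio is 1/128 + 1/9, roughly 0.119, so the separable form needs about 8.4 times fewer multiply-accumulates, in line with the 8-9 times figure quoted for kernel size 3. The ratio does not depend on the feature map size or on the number of input channels.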

More Efficient Prototype Mask Branch: ProtoC1
The prototype mask branch and mask coefficients are introduced to turn the one-stage object detection model into a one-stage instance segmentation model. In the baseline model, the prototype mask branch, named Proto, is implemented with a fully convolutional network (FCN) that includes one upsampling module and three Conv modules: two 2D convolutions with kernel size 3 and one 2D convolution with kernel size 1. The input feature map is scaled up to twice its original size, and a feature map with k channels is output.
Although the addition of the prototype mask branch turns the one-stage object detection model into a segmentation model, it also decreases the detection speed and increases the model volume. In the LCA-YOLOv8-seg model, we redesign the prototype mask branch.
According to the implementation of convolution, the larger the kernel size is, the smoother the image will be, but at the same time the amount of calculation increases. The same applies to the generation of an instance mask. In an experimental study, we found that using a 2D convolution with kernel size 1 significantly reduced the parameters and calculations of the prototype mask branch but resulted in a big loss of edge detail and accuracy in the instance mask. Using a 2D convolution with kernel size 3 does not result in a big loss of mask detail, but the parameters and calculations of the branch are larger than with kernel size 1. In order to strike a balance between the calculation cost of the prototype mask branch and the quality of the instance mask, we propose a new prototype mask branch structure, named ProtoC1, which includes only an upsampling module and a Conv module with one 2D convolution with kernel size 3. The new prototype mask branch ProtoC1 keeps the quality of the instance masks the same as the original branch, significantly reduces the parameters and complexity of the prototype mask branch and speeds up its processing. The specific structures of Proto and ProtoC1 are shown in Figure 7.
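As a rough illustration of why dropping two of the three convolutions shrinks the branch, the sketch below counts convolution weights for the two layouts described above. The channel wiring of the baseline (input to an intermediate width, intermediate to intermediate, intermediate to k) and the widths 64, 64 and 32 are assumptions chosen only for the example; normalisation layers and any parameters of the upsampling module are ignored, so the numbers are not the parameter counts of the actual models.

```python
def conv_params(c_in, c_out, kernel):
    """Weights of a kernel x kernel 2D convolution without bias."""
    return c_in * c_out * kernel * kernel


def proto_params(c_in, c_mid, k_out):
    """Baseline Proto: 3x3 conv, 3x3 conv, then 1x1 conv (assumed wiring)."""
    return (conv_params(c_in, c_mid, 3)
            + conv_params(c_mid, c_mid, 3)
            + conv_params(c_mid, k_out, 1))


def protoc1_params(c_in, k_out):
    """ProtoC1: a single 3x3 conv producing the k prototype channels."""
    return conv_params(c_in, k_out, 3)


if __name__ == "__main__":
    # Hypothetical widths: 64 input channels, 64 intermediate, k = 32.
    baseline = proto_params(64, 64, 32)
    slim = protoc1_params(64, 32)
    print(baseline, slim, round(baseline / slim, 2))
```

Under these assumed widths the single 3 × 3 convolution carries about a quarter of the baseline's convolution weights, while still using a 3 × 3 receptive field for the mask prototypes.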

The Transfer Learning Strategy
The introduction of a new backbone network and a novel prototype mask branch requires training a new convolutional network architecture from scratch. This is a process of iterative trial and error, in which the network structure and hyperparameters have to be adjusted repeatedly to find the optimal parameters. In addition, unfavorable factors such as underwater shooting scenes make it difficult to obtain high-quality images of underwater dam cracks. In order to solve these problems, transfer learning (TL) was introduced, which utilizes prior knowledge and feature transfer to reduce the training cost.
As shown in Figure 8, the model is first pre-trained on a public dataset to learn features; the learned features are then transferred, and a second round of training is performed on the crack dataset. Using the cross-domain transfer learning strategy to adjust the model and transfer parameters can reduce the data dependence of the model, improve its robustness and reduce the training cost.

Crack Dataset

Underwater Dam Crack Images
The underwater concrete crack pictures come from video taken by the ROVs during crack inspection, from which individual frames were captured. There are a total of 600 underwater concrete crack pictures, including irregular cracks with different degrees of cracking. Each image is 704 × 480 pixels. Some images from the underwater dam concrete crack dataset are shown in Figure 9.

Concrete Crack Images for Classification [27]
The concrete crack dataset contains a total of 10,000 images of concrete surface cracks with different degrees of cracking. The surface cracks are divided into three categories: more cracked, moderately cracked and less cracked. The dataset images also include various disturbances that may be encountered in realistic scenes, such as cigarette butts and dust. The resolution of the dataset images is 227 × 227. A total of 600 images were selected from this dataset in a balanced manner according to the different degrees of cracking. Some images from the dataset are shown in Figure 10.

Data Pre-Processing and Data Augmentation
The locations of the cracks were manually labelled using the labelling software LabelMe (http://labelme.csail.mit.edu/Release3.0/) to form our training dataset. Figure 11 shows the pixel-level labelling process for the crack images. As observed from Figure 11, the crack area of the picture is marked using LabelMe and saved as a JSON file. Then, the JSON file is converted to a TXT file via a program for model training.
To improve the diversity and richness of the data, several data augmentation strategies are used in the implementation: Mosaic, HSV augmentation, random affine transformation with a scale ratio of 0.5 and a translation ratio of 0.1, and random horizontal flip with 50% probability. Data augmentation is applied only during the model training phase; no data augmentation is used on the validation set. Figure 12 shows the images after data augmentation.
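The JSON-to-TXT conversion step can be sketched as follows. The conversion program itself is not described in the text, so this is only one plausible version: it assumes the annotation file has the keys `imageWidth`, `imageHeight` and `shapes` (each shape with a `label` and a list of `points`), and that the target TXT layout is one line per polygon consisting of a class index followed by polygon coordinates normalised to [0, 1]. The function names and the `class_ids` mapping are hypothetical.

```python
import json


def labelme_to_yolo_seg(doc, class_ids):
    """Convert one parsed annotation dict into segmentation label lines.

    doc:       dict with 'imageWidth', 'imageHeight' and 'shapes' (assumed keys).
    class_ids: mapping from label name to integer class index.
    """
    width = doc["imageWidth"]
    height = doc["imageHeight"]
    lines = []
    for shape in doc["shapes"]:
        fields = [str(class_ids[shape["label"]])]
        for x, y in shape["points"]:
            # Normalise pixel coordinates by the image size.
            fields.append(f"{x / width:.6f}")
            fields.append(f"{y / height:.6f}")
        lines.append(" ".join(fields))
    return lines


def convert_file(json_path, txt_path, class_ids):
    """Read one JSON annotation file and write the matching TXT label file."""
    with open(json_path, encoding="utf-8") as src:
        doc = json.load(src)
    with open(txt_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(labelme_to_yolo_seg(doc, class_ids)) + "\n")
```

A call such as `convert_file("crack_001.json", "crack_001.txt", {"crack": 0})` (file names hypothetical) would produce one label file per image; the single-class mapping reflects that only cracks are annotated here.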

Implementation Details
The model was trained for 800 epochs with pre-trained weights until it converged. The input image size is 448 × 448, and the training batch size is 8. Following the hyperparameter settings of YOLOv8n-seg, we used an SGD optimiser with a momentum of 0.937 and a weight decay of 0.0005. The learning rate was set to 0.001 for the first three warm-up cycles, then reached 0.01 and kept shrinking to 0.0001 until the last epoch.
All tests in this paper were run on an NVIDIA RTX 3090 GPU.
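The learning-rate schedule described above can be sketched as a simple function of the epoch index. Several details are assumptions, because the text gives only the three anchor values: a warm-up cycle is taken to be one epoch, the warm-up is modelled as a linear ramp from 0.001 to 0.01, and the decay from 0.01 to 0.0001 is modelled as linear over the remaining epochs. The function name `lr_at` is hypothetical.

```python
def lr_at(epoch, total_epochs=800, warmup_epochs=3,
          lr_start=0.001, lr_peak=0.01, lr_final=0.0001):
    """Learning rate at a given zero-based epoch index.

    Assumed shape: linear warm-up from lr_start to lr_peak, followed by a
    linear decay that reaches lr_final at the last epoch.
    """
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch out of range")
    if epoch < warmup_epochs:
        # Warm-up phase: ramp up from lr_start towards lr_peak.
        return lr_start + (lr_peak - lr_start) * epoch / warmup_epochs
    # Decay phase: fraction of the post-warm-up schedule already completed.
    progress = (epoch - warmup_epochs) / (total_epochs - 1 - warmup_epochs)
    return lr_peak + (lr_final - lr_peak) * progress


if __name__ == "__main__":
    for e in (0, 1, 3, 400, 799):
        print(e, f"{lr_at(e):.6f}")
```

The three anchor values from the text are reproduced at epochs 0, 3 and 799; everything in between depends on the assumed linear shape and would differ under, for example, a cosine decay.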

Evaluation Metrics
During the experiments in this paper, we used mAP$_{0.5}$ and mAP$_{0.5-0.95}$ to measure the box and mask accuracy of the model, and inference time to measure the inference speed of the model. mAP$_{0.5}$ and mAP$_{0.5-0.95}$ can be described as:

$$\mathrm{mAP}_{0.5} = \frac{1}{n_c} \sum_{c=1}^{n_c} \int_0^1 P_c(R)\,dR \tag{8}$$

$$\mathrm{mAP}_{0.5-0.95} = \mathrm{avg}(\mathrm{mAP}_i), \quad i = 0.5 : 0.05 : 0.95 \tag{9}$$

where $n_c$ denotes the number of classes, $P$ represents the precision ($P_c$ for class $c$) and $R$ represents the recall, and they satisfy:

$$P = \frac{TP}{TP + FP} \tag{10}$$

$$R = \frac{TP}{TP + FN} \tag{11}$$

where TP is true positive, which represents the number of prediction boxes whose IoU > 0.5; FP is false positive, which represents the number of prediction boxes whose IoU ≤ 0.5; and FN is false negative, which represents the number of labels without a prediction.
In addition, weights, parameters and GFLOPs were used to evaluate the complexity of the model.
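The quantities behind these metrics can be illustrated with a small sketch: the IoU of two axis-aligned boxes, precision and recall from TP, FP and FN counts, and the averaging over IoU thresholds from 0.5 to 0.95 in steps of 0.05. The function names and the `(x1, y1, x2, y2)` box layout are assumptions for illustration; the per-threshold average precision values are taken as given inputs rather than computed from a precision-recall curve.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP) and R = TP / (TP + FN), with 0.0 for empty sets."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def iou_thresholds():
    """The ten thresholds 0.5, 0.55, ..., 0.95 used for mAP 0.5-0.95."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def map_50_95(map_at):
    """Average of the mAP values keyed by IoU threshold."""
    thresholds = iou_thresholds()
    return sum(map_at[t] for t in thresholds) / len(thresholds)


if __name__ == "__main__":
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
    print(precision_recall(tp=8, fp=2, fn=2))
```

For mask accuracy the same definitions apply with the box IoU replaced by the IoU of the predicted and labelled pixel masks.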

Experimental Results
The curves of the performance of the model during the experiment are shown in Figure 13.
The final experimental results of our model are shown in Table 2. For comparison, we ran YOLOv8n-seg, YOLOv8s-seg, YOLOv8m-seg, YOLOv7-seg and Mask R-CNN on the same dataset; their results are also shown in Table 2. As Table 2 shows, all models perform well on the crack detection task, and the larger the model, the higher the detection accuracy. Compared with the larger models, our model has clear advantages in size and computation: it achieves a large reduction in volume and computation at the cost of a small drop in performance, which makes it friendly to low-performance devices. Compared with the baseline model YOLOv8n-seg, our model has a 3% drop in detection accuracy, a 39% drop in weight, a 40% drop in parameters and a 51% drop in GFLOPs. Our model is therefore small while maintaining high crack detection accuracy and real-time detection capability.

Crack Detection Results
Four crack images with different crack severities and different environments were input into our model to verify its crack detection performance. The surface crack detection results of our model are shown in Figure 14. Our model accurately identified the locations and shapes of the cracks in all four images and detected surface cracks of varying severity with good results.

Comparative Experiment and Ablation Study

Comparison of Different Crack Detection Methods
Three crack images were input into crack detection methods based on different algorithms, and a comparison of their detection results is shown in Figure 15. The Canny edge detector is able to detect and display the edges of cracked and non-cracked areas, but its detection results are noisy and its detection speed is slow, so it is not suitable for real-time crack detection. The crack detection method based on object detection shows the confidence and label of each crack and uses a target box to frame the entire crack, but does not depict the exact crack area. When the trajectory of the crack is tilted, the target box becomes large and the detection result is less informative, as in image (a). Our method displays the confidence and label of each detected crack, frames the crack with a target box and also depicts the entire crack area at pixel level. It is clear that our method is better suited to the task of crack detection.
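The point about tilted cracks can be illustrated with a small synthetic example. The sketch below is not taken from the paper: the helper names and the idealised straight cracks are assumptions, chosen only to show how little of a bounding box a thin diagonal crack actually fills compared with a horizontal one.

```python
def diagonal_crack_mask(size, width):
    """Binary mask of an idealised crack running along the main diagonal."""
    return [[1 if abs(r - c) < width else 0 for c in range(size)]
            for r in range(size)]


def horizontal_crack_mask(size, width):
    """Binary mask of an idealised horizontal crack of the given thickness."""
    top = size // 2
    return [[1 if top <= r < top + width else 0 for c in range(size)]
            for r in range(size)]


def mask_area(mask):
    """Number of crack pixels in the mask."""
    return sum(sum(row) for row in mask)


def bounding_box_area(mask):
    """Area of the tightest axis-aligned box around all crack pixels."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return (rows[-1] - rows[0] + 1) * (cols[-1] - cols[0] + 1)


def fill_ratio(mask):
    """Fraction of the bounding box that is actually crack."""
    return mask_area(mask) / bounding_box_area(mask)


if __name__ == "__main__":
    print(fill_ratio(horizontal_crack_mask(100, 2)))
    print(fill_ratio(diagonal_crack_mask(100, 2)))
```

In this toy setting the horizontal crack fills its box completely, while the diagonal crack covers only a few percent of a box that spans the whole image, which is why a pixel-level mask conveys far more about a tilted crack than a box does.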

Comparison of Performance and Instance Mask of Different Prototype Branches
We replaced the Proto module of both YOLOv8n-seg and YOLOv8m-seg with ProtoC1 for comparison. This section compares the resulting performance and instance masks.

Comparison of Performance of Different Proto Modules
As shown in Table 3, after the standard Proto module of the YOLOv8m-seg model (with 256 channels in its feature maps) was replaced with our proposed ProtoC1 structure, the GFLOPs of the model decreased by 18.8 (17%), while the mAP 0.5 of the mask decreased by only 0.3% and the mAP 0.5−0.95 of the mask decreased by 1.2%. For the smallest model, YOLOv8n-seg, using the ProtoC1 structure gives almost the same performance while reducing the GFLOPs of the model by 14%. Compared with the Proto module used by YOLOv8n-seg (with 64 channels in the middle feature maps), our proposed ProtoC1 module reduces the depth and width of the network structure, which makes the prototype branch more lightweight.

The ProtoC1 (k = 1) structure further reduces the parameters and computation of the model, but in our experiments we found that using a 2D convolution with a kernel size of 1 reduces the accuracy of the instance mask: the mAP 0.5−0.95 of the mask decreased by 7.5%. We therefore discarded this structure and report it only as a comparison.
The crack detection results of the YOLOv8n-seg model using the default Proto module, the ProtoC1 module and the ProtoC1 (k = 1) module are compared in Figure 16. Using the ProtoC1 (k = 1) structure results in rough edges on the instance mask and a decrease in mask accuracy. The YOLOv8n-seg model using the ProtoC1 module generates an instance mask with smooth edges and the same mask accuracy as the model using the default Proto module, while the ProtoC1 module is more lightweight.

Ablation Study
Our model includes two improvements: replacing the backbone with LCANet and replacing the Proto module with ProtoC1. To verify the effect of each, we conducted an ablation experiment, the results of which are shown in Table 4. Replacing the Proto module of YOLOv8n-seg with ProtoC1 leaves the detection accuracy almost unchanged while reducing the model complexity and computation. Replacing the backbone of YOLOv8n-seg with LCANet, which uses 10 dblocks and reduces the channels of the feature maps, lowers the detection accuracy by 3% and adds 15 layers to the model, but reduces the GFLOPs by 4.6 (37%) and greatly reduces the weights and parameters of the model. Compared with the baseline model YOLOv8n-seg, our model loses 3% in detection accuracy but is much smaller, which makes it friendly to low-performance devices. Furthermore, the introduction of transfer learning (TL) reduces the model training cost and enhances its robustness, resulting in a slight increase in detection accuracy.

Conclusions
Accurate identification and quantification of cracks is important for understanding structural damage in dams and bridges. This study proposes a pixel-level, real-time crack segmentation method based on the LCA-YOLOv8-seg model. A lightweight backbone, LCANet, is proposed to reduce the model complexity, and a new lightweight prototype mask branch, ProtoC1, speeds up the prototype mask branch while maintaining the quality of the instance masks. Our method achieves 0.945 mAP 0.5 and 129 FPS on the concrete surface crack dataset, and our model has significant advantages over YOLOv8n-seg in terms of weights, parameters and GFLOPs. This shows that our model combines good accuracy, real-time detection capability and small volume, making it a practical algorithm for crack detection. In future research, we will continue to optimize the size and accuracy of the model while expanding the dataset and increasing the robustness of the model.

Figure 2. The framework of the proposed method.

2.2. New Backbone: Lightweight Channel Attention Network (LCANet)
In order to reduce the model volume while maintaining high detection accuracy, we designed a new lightweight backbone: LCANet. The structure of LCANet, shown in Figure 3, consists of 1 Conv module and 10 DWConv blocks. The Conv module includes a 2D convolution, batch normalization and a ReLU activation function. The DWConv block consists of a depthwise separable convolution, a residual structure, an efficient channel attention (ECA) module and a ReLU activation function. The specific structure of the DWConv block is shown in Figure 4.

Figure 4. The structure of the DWConv block.

Standard convolution uses filters of size K × K × C. A single filter performs both feature extraction within each channel and feature fusion across channels. Depthwise separable convolution consists of a depthwise convolution, which applies a separate single-channel K × K filter to each channel of the input tensor to extract features within that channel, and a pointwise convolution, which applies a 1 × 1 convolution across channels to combine the feature maps extracted by the depthwise convolution. Using depthwise separable convolution drastically reduces the computation and model size of the network. Figures 5 and 6 illustrate the implementation of standard convolution and depthwise separable convolution, respectively.
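The saving from depthwise separable convolution can be made concrete by counting weights. The sketch below is illustrative only: the function names are hypothetical, bias terms are ignored, and M and N stand for the numbers of input and output channels. A standard K × K convolution needs K·K·M·N weights, whereas the depthwise plus pointwise pair needs K·K·M + M·N, a ratio of 1/N + 1/K².

```python
def standard_conv_params(k, m, n):
    """Weights in a standard convolution: n filters of size k x k x m (no bias)."""
    return k * k * m * n


def depthwise_separable_params(k, m, n):
    """Weights in a depthwise (k x k per input channel) plus pointwise (1 x 1) pair."""
    depthwise = k * k * m   # one single-channel k x k filter per input channel
    pointwise = m * n       # n filters of size 1 x 1 x m
    return depthwise + pointwise


def reduction_ratio(k, m, n):
    """Depthwise separable cost relative to standard cost; equals 1/n + 1/k**2."""
    return depthwise_separable_params(k, m, n) / standard_conv_params(k, m, n)


if __name__ == "__main__":
    k, m, n = 3, 64, 128  # example sizes chosen for illustration only
    print(standard_conv_params(k, m, n))
    print(depthwise_separable_params(k, m, n))
    print(reduction_ratio(k, m, n))
```

Because both layer types are applied at every output position, multiplying each count by the output height and width gives the corresponding multiply-accumulate counts, so the same ratio applies to computation.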


Figure 5. The implementation of standard convolution.
... map B, where $H_A$ and $W_A$ are the spatial height and width of an input feature map, $M$ is the number of input channels, $H_B$ and $W_B$ are the spatial height and width of an output feature map and $N$ is the number of output channels.

Figure 7. The specific structure of standard Proto and ProtoC1. (a) The specific structure of standard Proto; (b) the specific structure of ProtoC1.

Figure 8. The flowchart of the TL method.
... model pre-training is first performed on the public dataset, where feature learning is performed; feature migration is then performed, and model training is performed on the crack dataset. Using the cross-domain transfer learning strategy to adjust the model and transfer parameters can reduce the data dependence of the model, improve its robustness and reduce the training cost.


Figure 9. Examples of underwater dam concrete cracks.

Figure 11. Flowchart of the labelling process.

Figure 12. Example images of data augmentation.


Figure 13. Curves of the precision and mAP 0.5 metrics during training. (a) The curve of box precision and mAP 0.5; (b) the curve of mask precision and mAP 0.5.

Figure 14. Crack detection performance of our model. (a-d) are crack images with different backgrounds and different crack sizes.


Figure 15. Comparison of crack detection results of different algorithms. (a-c) are different crack images.

Figure 16. Comparison of the crack detection masks of the YOLOv8n-seg model using different Proto modules. (a,b) are crack images with different crack shapes.

Table 1. The specification of LCANet. dblock denotes the DWConv block; RE denotes the ReLU function.

Table 2. Experimental results of different models. The best result for each evaluation metric is in bold.

Table 3. Comparison of the YOLOv8n-seg and YOLOv8m-seg models using different Proto modules.

Table 4. Experimental results of the ablation study.