1. Introduction
In the course of long-term service, concrete structures tend to develop cracks due to adverse factors such as environmental corrosion, material aging, and natural disasters. Cracks accelerate material deterioration and significantly reduce the load-bearing capacity of the structure [1,2], increasing the risk of structural failure and even causing accidents [3]. Therefore, it is essential to conduct health monitoring and maintenance of concrete cracks. Conventional crack detection methods depend primarily on manually operated portable equipment, which exhibits low efficiency and limited accessibility, making it difficult to guarantee the accuracy and reliability of the results. With the advancement of computer technology, image processing-based crack detection methods have been increasingly adopted [4,5,6]. However, classical image-processing pipelines require handcrafted preprocessing and sensitive parameter tuning, and their performance often degrades for small cracks and under non-uniform illumination, conditions common in real-world inspections [7].
The rise of deep learning has driven advances in image processing. Convolutional Neural Networks (CNNs) have shown significant advantages in crack detection due to their strong feature extraction capabilities, with methods mainly divided into bounding-box-based localization and pixel-level semantic segmentation [8,9,10,11]. The former is used for identification and classification, while the latter achieves precise pixel-level extraction. In localization research, models can be categorized as single-stage (e.g., the YOLO series, which balances efficiency and accuracy) or two-stage (higher accuracy but slower) [5,12,13]. Among single-stage approaches, the YOLO family has been widely adopted for its favorable balance between efficiency and accuracy. To improve YOLO-based crack detection under limited data or in complex scenarios, researchers have proposed both training strategies and architectural enhancements. For example, Li et al. [14] proposed a transfer learning framework for road crack detection, reducing the training data required for YOLOv4 by 25% while improving detection accuracy by 26.66%. Wang et al. [15] introduced a Bidirectional Feature Pyramid Network (BiFPN) into the YOLOv5 architecture, achieving only a 3.4% error between predicted and actual values in crack detection, with an average processing time of just 14.2 ms per image. With the advancement of research, increasing attention has been directed toward fine or small cracks, which are easily overlooked yet may indicate early-stage deterioration and thus require timely detection and treatment. To address this challenge, Hou et al. [16] incorporated the SimAM attention mechanism into the head network of the YOLOv5s model; the improved model achieved a 29.3% increase in average accuracy for detecting small cracks on aircraft engine blades compared to the baseline. Zhang et al. [17] introduced a multi-scale self-attention module, the Swin Transformer, into the neck structure, significantly enhancing the extraction of small-target crack features. Despite these advances, concrete cracks remain particularly challenging for object detectors because they are typically small, elongated, and highly edge-dependent, often exhibiting low contrast and complex branching under cluttered textures, shadows, or stains. Under such conditions, detectors may miss subtle branches or confuse cracks with background patterns. Improving edge-sensitive multi-scale representations is therefore crucial for robust crack localization.
In practical engineering applications, crack localization and classification alone are insufficient for assessing the load-bearing capacity of concrete structures. Obtaining detailed geometric information about cracks, such as shape, orientation, and width distribution, requires corresponding crack segmentation methods. Although semantic segmentation networks such as U-Net [18] and DeepLabv3+ [19] achieve high accuracy, their training depends on large-scale, high-precision annotated datasets, and re-annotation is still needed when detection conditions (e.g., lighting, shadows) change, limiting practical deployment [20,21,22,23]. The foundational segmentation model, the Segment Anything Model (SAM), trained on over 1 billion masks and 11 million images, demonstrates strong generalization capabilities [24,25,26], offering the potential to reduce annotation costs and to handle complex environments and lighting conditions. Huang et al. [27] validated SAM's performance on 53 medical datasets, showcasing its great potential in image segmentation. Carraro et al. [28] and Shan et al. [29] further demonstrated SAM's applicability to crop feature recognition and underwater rock segmentation, indicating that prompt-guided segmentation can improve efficiency while maintaining high accuracy. Teng et al. [30] proposed a SAM-based concrete crack segmentation method that uses fractal dimension matrix prompts to address environmental variations, achieving high segmentation accuracy. Preliminary studies suggest that SAM's semi-supervised mode outperforms its fully automatic mode in crack identification. However, high-quality crack masks are often obtained in interactive settings that require manual prompts (e.g., clicks or boxes), which limits automation and throughput. In fully automated or weakly supervised modes, segmentation quality can degrade under complex backgrounds and especially under non-uniform illumination (e.g., shadows and overexposure), where ambiguous crack boundaries and texture interference reduce both accuracy and stability.
To achieve real-time and accurate detection of concrete cracks, this paper proposes a two-stage concrete crack segmentation method that combines the strengths of YOLO and SAM. The main contributions of this work are as follows:
- (1) This study proposes an improved structure of the YOLOv11 model, which incorporates the novel Multi-scale Edge Information Enhancement (MSEIE), Efficient-Detection, and P2-level feature integration modules. The resulting model, named MEP-YOLOv11, significantly enhances the accuracy of crack localization.
- (2) A two-stage crack segmentation method that combines the MEP-YOLOv11 model with the SAM is developed. The crack localization results generated by MEP-YOLOv11 are used as prompt information for SAM, enabling accurate crack segmentation without manual input and significantly enhancing the level of automation.
- (3) A mask re-input method is proposed: under non-uniform lighting conditions, the two-stage model can provide partial crack localization information, enabling complete crack segmentation. This eliminates the need to retrain the network on images captured under different lighting conditions, reducing reliance on expensive annotated datasets.
2. Methodology
The two-stage crack segmentation method proposed in this paper is primarily designed for visible open cracks on the surfaces of ordinary concrete bridges (such as bridge decks, beams, and piers). It performs well in segmenting fine and branched cracks and can adapt to strong light, shadows, uneven lighting, and common background texture interference. The implementation process of the proposed two-stage concrete crack segmentation method is illustrated in Figure 1. First, the acquired crack images are fed into the MEP-YOLOv11 model to localize the crack regions. Then, the localized crack information is used as a bounding box prompt for the SAM to perform crack segmentation. To improve segmentation accuracy under non-uniform illumination, the crack mask generated by the SAM from the initial bounding box prompt is re-fed into the model as a secondary input, guiding it to perform a refined segmentation of the crack regions.
2.1. MEP-YOLOv11 Model-Based Crack Localization
YOLOv11 introduces several groundbreaking improvements in its network architecture compared to its predecessors. In the backbone network, the model incorporates the C3K2 module, the C2PSA attention mechanism, and the Spatial Pyramid Pooling-Fast (SPPF) layer. The C3K2 module, built on branch processing and residual connections with convolution kernels of adjustable sizes, effectively extracts multi-scale feature information while keeping the model lightweight. Compared to the earlier C3K module, C3K2 allows variable convolution kernels to be selected according to the feature extraction requirements at different scales, reducing unnecessary computational waste. The C2PSA module also employs branch processing and residual connections, combining the PSA module in the backbone branch to achieve adaptive weighting of key regions, which reduces computational overhead while enhancing small-object detection. In contrast to a single PSA module, which provides only single-level spatial attention weighting, C2PSA achieves collaborative optimization of local details and global context through cross-stage connections, gaining stronger multi-scale feature representation without significantly increasing the number of parameters [31]. The SPPF layer implements hierarchical, progressive receptive-field expansion in depth through repeated concatenated max-pooling operations and integrates feature dimensions in width via channel concatenation, enhancing the model's understanding of small objects and complex scenes. Compared to the classic SPP module, SPPF replaces the multi-branch parallel structure with a serial recursive structure, achieving equivalent multi-scale features with less computational and memory overhead, which significantly increases inference speed and simplifies the graph structure [32].
These modifications effectively reduce computational costs while significantly enhancing multi-scale feature fusion and robustness in complex scenarios. As a result, YOLOv11 achieves high accuracy and fast inference in conventional object detection tasks [33,34]. However, locating cracks in concrete bridges relies heavily on edge information because some cracks occupy only a small pixel area, which causes the YOLOv11 model to misclassify small, complex cracks as background textures and limits its applicability to this task. Therefore, this study improves the YOLOv11 architecture and proposes the MEP-YOLOv11 model, specifically designed for concrete crack detection and localization. The architecture of the model is shown in Figure 2. The improved model employs an enhanced C3K2 module with MSEIE (C3K2-MSEIE) in the backbone to extract and retain richer edge features. Furthermore, the feature fusion layers in the neck are extended with a P2 level, upgrading the original three-scale detection to a four-scale detection structure. Finally, grouped convolution layers are introduced in the Efficient-Detection head to enhance local feature extraction while preserving inter-channel information exchange. With these improvements, the MEP-YOLOv11 model achieves accurate localization of slender and complex cracks in concrete structures.
2.1.1. C3K2-MSEIE Module
Concrete cracks exhibit complex patterns and low contrast with their background. Moreover, owing to variations in shooting distance and angle, crack sizes in captured images can differ significantly. The original C3K2 module in YOLOv11 has limitations in multi-scale object detection and complex edge information extraction, which results in blurry crack edges in the generated feature maps and leads to missed detections and false positives. To address this issue, the MSEIE module is proposed, as shown in Figure 3. The module first divides the input feature map into four sub-feature maps at different scales using adaptive average pooling. Then, edge enhancement is applied to each sub-feature map. Finally, local features are concatenated with the multi-scale outputs to improve the accuracy of crack edge detection. This design effectively enhances the model's recognition capability in complex scenarios.
The core of the C3K2-MSEIE module is the adaptive average pooling operation and the edge enhancement submodule. The adaptive average pooling operation first fixes the output feature map size, and then adaptively determines the pooling window and stride based on the ratio between the input image size and the output image size. The input image is divided into several rectangular regions, and the average value of all elements in each sub-region is computed. After obtaining multi-scale feature maps, a 1 × 1 convolutional layer is used to reduce the number of channels to one-quarter of the original input, thereby improving computational efficiency. Finally, bilinear interpolation is used to restore high-resolution image information, followed by preliminary edge enhancement of the multi-scale feature maps. The formula for adaptive pooling is as follows:
$$Y_{c,i,j}=\frac{1}{\left|R_{i,j}\right|}\sum_{(u,v)\in R_{i,j}} X_{c,u,v}$$
where $Y_{c,i,j}$ is the average-pooling output over the rectangular region $R_{i,j}$ of the input, $\left|R_{i,j}\right|$ is the number of pixels in $R_{i,j}$, and $X_{c,u,v}$ is the pixel of channel $c$ at position $(u,v)$ within $R_{i,j}$. $k_h(i)\times k_w(j)$ denotes the size of the pooling window, while $s_h(i)$ and $s_w(j)$ denote its strides in the vertical and horizontal directions, respectively.
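For illustration, the following PyTorch sketch shows how this multi-scale pooling branch could be realized; the four pooling scales (1, 2, 3, 6) and the channel sizes are assumptions for demonstration, not the authors' exact configuration.

```python
# A minimal sketch of the multi-scale branch: adaptive average pooling at four
# assumed scales, 1x1 channel reduction to C/4, and bilinear restoration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScalePool(nn.Module):
    def __init__(self, channels: int, scales=(1, 2, 3, 6)):
        super().__init__()
        self.scales = scales
        # 1x1 convolutions reduce each branch to a quarter of the input channels
        self.reduce = nn.ModuleList(
            [nn.Conv2d(channels, channels // 4, kernel_size=1) for _ in scales]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        outs = []
        for s, conv in zip(self.scales, self.reduce):
            y = F.adaptive_avg_pool2d(x, output_size=s)  # mean over each R_{i,j}
            y = conv(y)                                  # channel reduction
            y = F.interpolate(y, size=(h, w), mode="bilinear", align_corners=False)
            outs.append(y)
        return torch.cat(outs, dim=1)                    # back to original width

x = torch.randn(1, 64, 80, 80)
print(MultiScalePool(64)(x).shape)  # torch.Size([1, 64, 80, 80])
```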
To further enhance the recognition of concrete crack edge information, the multi-scale feature maps are fed into the edge enhancement submodule, as shown in Figure 4. Crack images contain high-frequency edge information and low-frequency background information. Therefore, the feature map $X$ undergoes a 3 × 3 average pooling operation to smooth high-frequency content while extracting the low-frequency background. A difference operation is then applied to extract the high-frequency crack edge information, emphasizing the edge features. For the binary distinction between crack edge regions and background regions, a convolutional layer combined with a Sigmoid activation function applies a nonlinear transformation to the feature map, pushing the relatively large pixel values of crack edge regions towards 1 and the relatively small values of background regions towards 0. This further enhances the contrast between crack edges and background, improving the saliency of the edge features. Finally, element-wise addition fuses the enhanced edge feature map with the original feature map $X$, strengthening the target region while retaining the original features, so the model focuses more on crack edge details. The formula for the edge enhancement submodule is as follows:
$$X_L=\mathrm{AvgPool}_{3\times 3}(X),\qquad X_H=X-X_L$$
$$P_e=\sigma\left(C_{3\times 3}\left(X_H\right)\right),\qquad Y=X+P_e$$
In the equations, $X$ is the input feature map of the submodule, $X_L$ denotes the low-frequency features, $X_H$ the high-frequency features, $\sigma$ the Sigmoid activation function, and $C_{3\times 3}$ a 3 × 3 convolution; $P_e$ is the result after contrast enhancement of the edge regions, and $Y$ is the result after edge information enhancement.
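A minimal PyTorch sketch of this submodule, following the formulas above; the convolution width is illustrative rather than the authors' exact setting.

```python
# Edge-enhancement submodule: low/high-frequency split, sigmoid-gated contrast
# enhancement, and residual fusion with the original features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeEnhance(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_low = F.avg_pool2d(x, kernel_size=3, stride=1, padding=1)  # X_L: background
        x_high = x - x_low                                           # X_H: edges
        p_e = torch.sigmoid(self.conv(x_high))                       # P_e: enhanced edges
        return x + p_e                                               # Y: residual fusion
```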
2.1.2. The Improved Feature Fusion Layer
In the YOLO model architecture, the backbone network continuously downsamples the image from top to bottom, gradually reducing the scale of the feature maps. Although the final feature maps contain rich semantic information, spatial details are progressively lost during downsampling. To address this, the neck network recovers feature map resolution through upsampling, as shown in Figure 5. The upsampled feature map is concatenated along the channel dimension with shallow backbone feature maps at the same scale, which contain rich spatial detail, enabling precise localization and recognition of concrete cracks. The YOLOv11 base network contains only the P3, P4, and P5 layers, corresponding to feature maps of 80 × 80, 40 × 40, and 20 × 20 for an original 640 × 640 image. To improve the detection of small concrete cracks, this paper adds upsampling and feature concatenation steps in the neck network, introducing a P2 level specifically for small-object detection. The upsampled feature map at this level has a size of 160 × 160 and is concatenated with shallow backbone feature maps containing more detailed spatial information, helping the model capture subtle bridge crack damage more accurately.
The small-object layer P2 performs upsampling using bilinear interpolation, which estimates the value of a target point by performing two one-dimensional linear interpolations, given the values of four adjacent known points. As shown in Figure 6, the target point $R=(x,y)$ lies inside a rectangular region defined by four adjacent known points $Q_{11}=(x_1,y_1)$, $Q_{21}=(x_2,y_1)$, $Q_{12}=(x_1,y_2)$, and $Q_{22}=(x_2,y_2)$. First, linear interpolation is performed horizontally along the top and bottom edges, yielding two intermediate values; a second linear interpolation is then performed vertically on these values to obtain the final estimate, as shown in the following equations:
$$f(x,y_1)=\frac{x_2-x}{x_2-x_1}f(Q_{11})+\frac{x-x_1}{x_2-x_1}f(Q_{21}),\qquad f(x,y_2)=\frac{x_2-x}{x_2-x_1}f(Q_{12})+\frac{x-x_1}{x_2-x_1}f(Q_{22})$$
$$f(R)=\frac{y_2-y}{y_2-y_1}f(x,y_1)+\frac{y-y_1}{y_2-y_1}f(x,y_2)$$
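As a concrete check of the formula, the following sketch evaluates the two-step interpolation for a single point; in PyTorch, the same operation over whole feature maps is provided by `F.interpolate(..., mode="bilinear")`.

```python
# A worked instance of the two-step bilinear interpolation above.
def bilinear(x1, y1, x2, y2, q11, q21, q12, q22, x, y):
    """q11=f(x1,y1), q21=f(x2,y1), q12=f(x1,y2), q22=f(x2,y2)."""
    # horizontal interpolation along the top and bottom edges
    f_y1 = (x2 - x) / (x2 - x1) * q11 + (x - x1) / (x2 - x1) * q21
    f_y2 = (x2 - x) / (x2 - x1) * q12 + (x - x1) / (x2 - x1) * q22
    # vertical interpolation between the two intermediate values
    return (y2 - y) / (y2 - y1) * f_y1 + (y - y1) / (y2 - y1) * f_y2

# midpoint of the unit square -> average of the four corner values
print(bilinear(0, 0, 1, 1, 10.0, 20.0, 30.0, 40.0, 0.5, 0.5))  # 25.0
```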
2.1.3. The Lightweight Efficient-Detection Module
In the original detection head of the YOLOv11 model, both depthwise convolution layers and standard convolution layers are used to adjust the feature channels, converting the number of input channels into the number required for predicting the localization and classification losses. However, standard convolution compresses information across all channels during channel adjustment, which makes the output susceptible to global features and leads to blurred edges and textures. In contrast, depthwise convolution applies a separate kernel to each channel, preserving individual channel features but failing to integrate information across channels, which can lose inter-channel correlations. To overcome the limitations of both convolution types and to meet the real-time requirements of crack localization in concrete structures, a module called Efficient-Detection is proposed, as shown in Figure 7. It adopts a shared-parameter convolutional architecture that reduces the number of parameters compared with traditional designs that rely on parallel processing. By applying group convolution, the input channels, output channels, and convolution kernels are divided into g groups, with the number of parameters calculated as shown in Equation (7). As a result, the total parameters of the detection head are reduced to 1/g of the original, significantly improving detection efficiency and fulfilling the requirement for real-time crack damage detection. Group convolution thus avoids the global feature interference of standard convolution while preventing the lack of inter-channel integration associated with depthwise convolution.
$$\mathrm{Params}=g\times\left(h_1\times w_1\times\frac{C_{in}}{g}\times\frac{C_{out}}{g}\right)=\frac{h_1 w_1 C_{in} C_{out}}{g}\tag{7}$$
where $h_1$ and $w_1$ denote the height and width of the convolution kernel, $C_{in}$ and $C_{out}$ are the numbers of input and output channels, and $g$ is the number of groups.
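The parameter saving of Equation (7) can be verified directly in PyTorch; the channel and kernel sizes below are illustrative:

```python
# Parameter counts confirming Equation (7): grouped convolution uses 1/g of the
# parameters of a standard convolution with the same channel dimensions.
import torch.nn as nn

c_in, c_out, k, g = 64, 64, 3, 4
standard = nn.Conv2d(c_in, c_out, k, bias=False)
grouped = nn.Conv2d(c_in, c_out, k, groups=g, bias=False)

n_std = sum(p.numel() for p in standard.parameters())  # h1*w1*Cin*Cout = 36864
n_grp = sum(p.numel() for p in grouped.parameters())   # n_std / g      = 9216
print(n_std, n_grp, n_std // n_grp)                    # 36864 9216 4
```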
2.2. Crack Segmentation Using SAM
SAM is a general-purpose image segmentation foundation model developed by Meta AI, pretrained on the SA-1B dataset, which includes over one billion masks and 11 million images. Pretraining employs a multi-task learning strategy combining cross-entropy loss and Dice loss to improve segmentation accuracy, while data augmentation techniques such as scaling, rotation, and random cropping enhance robustness to image transformations. This enables SAM to learn rich feature representations, and its zero-shot segmentation performance can rival or even surpass fully supervised methods. SAM consists of three core components: the image encoder, the prompt encoder, and the mask decoder. The image encoder uses a Vision Transformer (ViT) as its backbone; ViT directly models long-range dependencies, capturing global contextual information that improves the understanding of complex scenes. The key components of the ViT backbone are as follows (a PyTorch sketch of steps (1)-(3) follows the list):
- (1) Patch Embedding: The input image is divided into multiple fixed-size patches, and each patch is flattened and linearly embedded into a high-dimensional space. Specifically, the input image $x\in\mathbb{R}^{H\times W\times C}$ is divided into patches of size $P\times P$, and each patch is embedded through a linear transformation. The number of patches $N$ and the patch embedding are given by
$$N=\frac{HW}{P^{2}},\qquad e_i=W_p\,p_i+b_p$$
where $H$ and $W$ represent the height and width of the image, $C$ is the number of channels, $e_i$ is the embedding vector after the linear transformation, $W_p$ is the embedding weight, $b_p$ is the bias term, and $p_i$ denotes the (flattened) $i$-th patch.
- (2) Transformer Encoder: The Transformer encoder in ViT processes the patch embeddings layer by layer using multi-head self-attention (MSA) and a feedforward network (FFN), enabling efficient capture of both global and local image features. The operation on the patch embeddings at the $l$-th layer is
$$z^{\prime}_{l}=\mathrm{MSA}\left(\mathrm{LN}\left(z_{l-1}\right)\right)+z_{l-1},\qquad z_{l}=\mathrm{FFN}\left(\mathrm{LN}\left(z^{\prime}_{l}\right)\right)+z^{\prime}_{l}$$
where $\mathrm{LN}$ denotes layer normalization and $z_l$ is the output of the $l$-th layer.
- (3) Position Embedding: Positional information is added to the patch embeddings to preserve the spatial relationships within the image:
$$z_0=\left[e_1+E_1^{pos};\;e_2+E_2^{pos};\;\ldots;\;e_N+E_N^{pos}\right]$$
where $E_i^{pos}$ represents the position embedding, which encodes the position of each patch in the original image. The combination of patch and position embeddings enhances the model's understanding of the overall image structure, enabling ViT to extract rich visual features and support various segmentation tasks.
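The following PyTorch sketch strings steps (1)-(3) together with illustrative ViT-Base dimensions; SAM's actual image encoder is larger and adds design elements such as windowed attention.

```python
# Patch embedding + position embedding + one pre-norm encoder layer,
# matching the three formulas above; all sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

H = W = 224; C = 3; P = 16; D = 768

class EncoderLayer(nn.Module):
    def __init__(self, dim: int = D, heads: int = 12):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = self.ln1(z)
        z = z + self.msa(h, h, h, need_weights=False)[0]  # z'_l = MSA(LN(z)) + z
        return z + self.ffn(self.ln2(z))                  # z_l = FFN(LN(z')) + z'

x = torch.randn(1, C, H, W)
N = (H * W) // (P * P)                                    # number of patches: 196
patches = F.unfold(x, kernel_size=P, stride=P).transpose(1, 2)  # (1, N, C*P*P): p_i
embed = nn.Linear(C * P * P, D)                           # e_i = W_p p_i + b_p
pos = torch.zeros(1, N, D)                                # E_i^pos (learned in practice)
z = embed(patches) + pos                                  # z_0
z = EncoderLayer()(z)                                     # one of L stacked layers
print(z.shape)                                            # torch.Size([1, 196, 768])
```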
The prompt encoder includes two types of prompts: sparse prompts (points, boxes, text) and dense prompts (masks), which guide the model’s segmentation. Specifically, (1) point prompts are implemented by converting user-specified points into vectors, which not only represent the spatial location of the points but also indicate whether they are positive or negative samples. (2) Box prompts involve user-drawn bounding boxes, which are encoded to provide spatial constraints, helping the model determine the segmentation regions. (3) Text prompts leverage a pre-trained text model to convert the descriptive text provided by the user into feature vectors, enabling the model to understand the background information relevant to the segmentation task. (4) Mask prompts encode the initial segmentation masks provided by the user to refine or improve the existing segmentation results. These mask prompts are embedded into the model via convolution operations and added element-wise to the image embeddings, thereby enhancing the model’s understanding of the segmentation task.
The mask decoder is responsible for integrating feature information from the ViT backbone network and the prompt encoder to generate precise segmentation masks. First, feature fusion is performed, where the image features extracted by the ViT backbone network are combined with the prompt features generated by the prompt encoder using an attention mechanism, ensuring that the final segmentation result accurately responds to the user’s input prompts. Next, upsampling layers are used to gradually increase the resolution of the fused features until it matches the resolution of the original image. This step is crucial for generating high-resolution segmentation masks with rich details. Finally, a sigmoid function is applied to the upsampled output, converting the feature map into a binary mask that clearly defines the regions of the image to be segmented.
Currently, SAM offers two segmentation modes: a fully automatic mode and a semi-supervised mode. Examples of both modes applied to crack segmentation are shown in Figure 8. Both modes have limitations in accuracy and efficiency, and neither fully meets the practical requirements of crack detection in engineering applications. In the fully automatic mode, the model segments all objects in the input image without human intervention, but segmentation accuracy is relatively low. In the semi-supervised mode, the user can guide the model to segment specific regions more accurately by providing four types of input prompts (clicks, bounding boxes, masks, and text); however, all four currently require manual interaction, which limits detection efficiency. Therefore, this study proposes a novel crack segmentation method based on automatic bounding box input.
2.2.1. Crack Segmentation Based on Automatic Bounding Box Input
The MEP-YOLOv11 model accurately localizes crack regions, and the bounding boxes it generates are used as bounding-box prompt inputs for the SAM, enabling fully automatic and high-precision crack segmentation. The detailed implementation process is shown in Figure 9. The method consists of the following four steps: (1) The crack image to be identified is fed into the fully trained MEP-YOLOv11 model to obtain the bounding box image and coordinate information of the crack region. (2) The center-format coordinates $(c_x, c_y, w, h)$ output by the MEP-YOLOv11 model are converted into the two-point format $(x_1, y_1, x_2, y_2)$ required by the SAM, which is then used as bounding-box prompts. The conversion formula is as follows:
$$x_1=\left(c_x-\tfrac{w}{2}\right)w_o,\quad y_1=\left(c_y-\tfrac{h}{2}\right)h_o,\quad x_2=\left(c_x+\tfrac{w}{2}\right)w_o,\quad y_2=\left(c_y+\tfrac{h}{2}\right)h_o$$
where $(x_1, y_1)$ and $(x_2, y_2)$ are the coordinates of the top-left and bottom-right corners of the bounding box, $(c_x, c_y)$ denotes the normalized center coordinates of the box, $w$ and $h$ are its normalized width and height, and $w_o$ and $h_o$ are the width and height of the input image. (3) The original image is fed into the ViT image encoder to extract image embeddings, while the converted two-point coordinates are input into the prompt encoder to generate the corresponding bounding-box prompt embeddings. (4) The mask decoder uses both the image embeddings and the bounding-box prompt embeddings to generate the final crack segmentation mask.
2.2.2. Failure Modes of the Crack Segmentation Method Based on Automatic Bounding Box Input
SAM performs segmentation based on bounding box input, which inherently imposes a spatial constraint, restricting the model to operate only within the specified region. Therefore, if the MEP-YOLOv11 model fails to accurately and fully locate the actual crack positions, the subsequent segmentation by SAM is limited, making it difficult to generate effective crack masks. The improved MEP-YOLOv11 model can accurately locate long and complex cracks in concrete, but when faced with untrained complex backgrounds, overexposure, shadows, and non-uniform illumination, issues such as missed detections and missing bounding boxes still occur, as shown in Figure 10. When the image as a whole suffers from overexposure or shadows, common preprocessing methods such as exposure compensation and shadow removal can effectively improve image quality, meeting the localization requirements of the MEP-YOLOv11 model, as shown in Figure 11. In normal scenarios, duplicate bounding boxes may also appear, but owing to the strong segmentation capability of the SAM, they have minimal impact on crack segmentation, as shown in Figure 12. Additionally, when bounding box information is input into SAM, duplicate boxes with an overlap greater than 90% are removed, ensuring that duplicate boxes do not affect the segmentation results. However, when the model faces untrained complex backgrounds and non-uniform illumination, image preprocessing alone cannot comprehensively improve overall image quality or resolve missed and false detections.
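A simple IoU-based filter implementing the 90% overlap rule might look as follows; this is a sketch, and the authors' exact criterion may differ.

```python
# Drop near-duplicate boxes before prompting SAM (overlap threshold assumed 0.9).
import numpy as np

def box_iou(a: np.ndarray, b: np.ndarray) -> float:
    # intersection-over-union of two xyxy boxes
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def drop_duplicate_boxes(boxes: np.ndarray, thresh: float = 0.9) -> np.ndarray:
    kept = []
    for box in boxes:
        if all(box_iou(box, k) <= thresh for k in kept):
            kept.append(box)
    return np.asarray(kept)
```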
2.2.3. Crack Segmentation Under Non-Uniform Illumination Conditions
Complex backgrounds often contain substantial high-frequency noise, which interferes with the model's ability to locate concrete crack edges. Under non-uniform illumination, crack images often suffer local overexposure and shadow occlusion, resulting in significant differences in brightness and contrast across image regions [35]. This uneven lighting not only severely disrupts feature extraction and understanding but also causes information loss, blurred edges, and shadow misguidance, reducing the accuracy of crack segmentation. Therefore, when the MEP-YOLOv11 model faces untrained complex backgrounds and non-uniform lighting and misses cracks entirely, the training dataset needs to be expanded and the model retrained. However, if the YOLO model can still recognize partial crack regions under such conditions, the crack segmentation task can be completed using the mask re-input method proposed in this study. Because crack segmentation under non-uniform lighting is considerably more challenging than under complex backgrounds alone, this study uses non-uniform lighting as the example to detail the mask re-input process, as shown in Figure 13. First, the partial crack regions detected by the MEP-YOLOv11 model under non-uniform lighting are input into SAM's prompt encoder, while the original image is input into the image encoder, generating bounding-box prompt embeddings and image embeddings, respectively. Next, a lightweight mask decoder generates the initial crack mask, providing a preliminary segmentation of the corresponding crack regions. Then, the initial mask is used as a new prompt input to the prompt encoder, generating a mask prompt embedding; combined with the image embedding, it is input again into SAM to complete the full crack segmentation. This method exploits the strong shape priors embedded in SAM's mask prompts: even when these inputs are incomplete (e.g., covering only part of the target), they still contain structural cues such as continuity, edge direction, and contour features, effectively constraining the model's output and guiding it to complete the overall shape of the target for a more comprehensive segmentation.
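In terms of the segment-anything API, the mask re-input step can be sketched as below, reusing the `predictor` initialized in Section 2.2.1; feeding the first pass's low-resolution logits back through the `mask_input` argument is our reading of the method, and the box coordinates are illustrative.

```python
# Mask re-input sketch: SamPredictor.predict accepts a low-resolution mask
# (1 x 256 x 256) as a dense prompt via mask_input.
import numpy as np

# `predictor` is the SamPredictor from the previous sketch, with set_image() done.
partial_box = np.array([120, 80, 380, 240])      # illustrative partial detection

# First pass: the bounding-box prompt yields an initial, possibly incomplete
# mask together with its low-resolution logits.
init_mask, _, low_res_logits = predictor.predict(box=partial_box,
                                                 multimask_output=False)

# Second pass: the initial mask is re-input as a dense prompt; its continuity
# and contour cues guide SAM to complete the crack beyond the detected region.
full_mask, _, _ = predictor.predict(mask_input=low_res_logits,
                                    multimask_output=False)
```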
2.3. Evaluation Indicator
2.3.1. Evaluation Indicators for Localization Task
Precision ($P_L$), Recall ($R$), F1-score ($F1$), and Average Precision ($AP$) are selected as evaluation metrics to assess the crack localization performance of the improved MEP-YOLOv11 model. The metrics are calculated as
$$P_L=\frac{TP_L}{TP_L+FP_L},\qquad R=\frac{TP_L}{TP_L+FN_L},\qquad F1=\frac{2\times P_L\times R}{P_L+R},\qquad AP=\int_0^1 P_L(R)\,\mathrm{d}R$$
where $TP_L$, $FP_L$, and $FN_L$ are obtained by matching predicted bounding boxes with ground-truth boxes: $TP_L$ is the number of matched predicted boxes, $FP_L$ the number of unmatched predicted boxes, and $FN_L$ the number of missed ground-truth boxes.
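A sketch of this box-matching computation, assuming greedy matching at an IoU threshold of 0.5 (the exact matching rule is not specified in the text):

```python
# Localization metrics from box matching; box_iou() is the helper from
# Section 2.2.2 above.
import numpy as np

def localization_metrics(pred_boxes, gt_boxes, iou_thresh=0.5):
    matched = set()
    tp = 0
    for p in pred_boxes:
        ious = [box_iou(p, g) for g in gt_boxes]
        best = int(np.argmax(ious)) if ious else -1
        if best >= 0 and ious[best] >= iou_thresh and best not in matched:
            matched.add(best)
            tp += 1                                # TP_L: matched predictions
    fp = len(pred_boxes) - tp                      # FP_L: unmatched predictions
    fn = len(gt_boxes) - tp                        # FN_L: missed ground truths
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```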
2.3.2. Evaluation Indicators for Segmentation Task
Pixel-level Precision ($P_S$), IoU, and Accuracy are adopted to comprehensively evaluate the crack segmentation performance of the proposed two-stage model based on MEP-YOLOv11 and SAM. The metrics are defined as
$$P_S=\frac{TP_S}{TP_S+FP_S},\qquad IoU=\frac{\left|A_p\cap A_r\right|}{\left|A_p\cup A_r\right|}=\frac{TP_S}{TP_S+FP_S+FN_S},\qquad Accuracy=\frac{TP_S+TN_S}{TP_S+TN_S+FP_S+FN_S}$$
where $TP_S$, $TN_S$, $FP_S$, and $FN_S$ are computed by pixel-wise comparison between the predicted mask and the ground-truth mask: $TP_S$ is the number of crack pixels correctly predicted as cracks, $TN_S$ the number of non-crack pixels correctly identified as non-cracks, $FP_S$ the number of non-crack pixels incorrectly predicted as cracks, and $FN_S$ the number of crack pixels incorrectly predicted as non-cracks. $A_p$ represents the region predicted by the model, and $A_r$ represents the true crack region.
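These pixel-level metrics can be computed directly from boolean masks:

```python
# Pixel-wise segmentation metrics following the definitions above.
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray):
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()            # crack pixels found
    tn = np.logical_and(~pred, ~gt).sum()          # background kept as background
    fp = np.logical_and(pred, ~gt).sum()           # background marked as crack
    fn = np.logical_and(~pred, gt).sum()           # crack pixels missed
    precision = tp / (tp + fp)
    iou = tp / (tp + fp + fn)                      # |Ap ∩ Ar| / |Ap ∪ Ar|
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, iou, accuracy
```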
4. Conclusions
In this paper, a novel two-stage concrete crack segmentation method combining the improved MEP-YOLOv11 and the Segment Anything Model (SAM) is proposed to address the challenges of precise crack detection in complex engineering environments. The method first uses the improved MEP-YOLOv11 model to achieve more accurate crack localization and then uses the extracted positional information as prompts to guide the SAM in performing pixel-level segmentation of the crack regions. MEP-YOLOv11 automatically generates high-quality prompts, effectively overcoming SAM's heavy reliance on manual input and enabling full automation of the crack segmentation process. Compared with traditional semantic segmentation methods, this approach eliminates the need for large-scale, costly pixel-level annotated datasets, significantly reducing the time and resources required for model development and deployment while maintaining competitive segmentation accuracy. To improve the model's practicality under challenging field conditions, particularly non-uniform illumination, a mask re-input strategy is further introduced; it leverages the complementary strengths of the two-stage model and enhances robustness without additional training cost. Based on the research findings, the following conclusions can be drawn:
- (1) The improved MEP-YOLOv11 model achieves comprehensive performance gains in crack localization compared with the baseline model, with PL, R, F1, and AP50 increasing by 0.03%, 7.9%, 3.8%, and 5.2%, respectively.
- (2) The proposed two-stage model based on MEP-YOLOv11 and SAM performs excellently in crack segmentation under normal lighting conditions, achieving an average Accuracy of 95.98%, an average PS of 92.60%, and an average IoU of 0.77.
- (3) With the mask re-input method, the two-stage model maintains reliable segmentation performance under non-uniform illumination, with average Accuracy, average PS, and average IoU of 92.38%, 85.70%, and 0.64, respectively.
Although the proposed two-stage concrete crack segmentation method performs well on fine and branched cracks and can adapt to uneven lighting and common background texture interference, it still has the following limitations. First, it has not been verified whether the MEP-YOLOv11 model remains effective on surfaces with pronounced curvature, on overly complex textures that overlap and mix with the crack's linear features, or in situations where the background color and grayscale are similar to those of the crack. When the YOLO model cannot locate cracks at all, SAM also struggles to perform effective segmentation; although the mask re-input method can alleviate some false negatives, completely missed detections remain unresolved. Second, although the mask re-input method significantly improves segmentation under uneven lighting, performance metrics still decline notably compared with normal lighting conditions, indicating that lighting interference remains a major challenge in practical engineering applications. Future research will focus on collecting and annotating more diverse crack images that reflect real-world engineering conditions, covering various levels of stain coverage, weather impacts, and combinations of surface defects, to more comprehensively validate the robustness of the two-stage model, and on further improving the model's robustness, real-time performance, and generalization ability.