1. Introduction
The escalating deterioration of global civil infrastructure, particularly concrete structures, presents a critical challenge to public safety and economic stability. National assessments consistently highlight the urgent need for maintenance and repair, with a significant portion of assets rated in poor condition [1]. The structural integrity of this infrastructure is often compromised by surface-level damage such as cracks, efflorescence, and exposed rebar. Traditionally, the identification and evaluation of this damage have relied on manual visual inspections. However, this long-standing practice is inherently problematic; it is not only time-consuming and costly but also prone to significant inconsistencies and errors stemming from the subjective judgment of human inspectors [2]. To overcome these limitations, the field of Structural Health Monitoring (SHM) has explored various advanced techniques, ranging from model-based stiffness identification [3] to the vision-based approaches that are the focus of this work.
This paradigm is being reshaped by the integration of Unmanned Aerial Vehicles (UAVs) and advanced computer vision [4,5]. UAVs have revolutionized data acquisition for structural health monitoring, enabling the safe, rapid, and large-scale collection of high-resolution imagery, even in difficult-to-access locations [6,7]. The availability of high-fidelity visual data has catalysed the development of deep learning models for automated analysis. Initial efforts successfully applied Convolutional Neural Networks (CNNs) for image-level classification, effectively determining the presence of damage such as cracks [8]. The field rapidly progressed to object detection models, such as the You Only Look Once (YOLO) family [9] and Faster R-CNN [10], which not only classify but also localize damage within bounding boxes. These models have proven effective for identifying a wide range of defects on concrete [11], steel [12], and even specialized components such as railway fasteners [13,14]. However, accurate engineering analysis requires quantifying the severity of damage, including its precise length, area, and shape. This requirement necessitates a more granular, pixel-level understanding of the image content.
This need for detailed geometric characterisation has driven the adoption of semantic and instance segmentation techniques. Architectures such as Fully Convolutional Networks (FCN) [15], U-Net [16], and DeepLabv3+ [17] established the foundation for pixel-wise classification. More recent models, including Mask R-CNN [18] and the segmentation-enabled versions of the YOLO series from YOLOv8 onwards, as illustrated in Figure 1, have further advanced this capability [19]. While these supervised models demonstrate high performance, their efficacy is fundamentally constrained by two critical bottlenecks. The first is the need for vast quantities of meticulously annotated data; the process of manually creating pixel-perfect masks for training is exceptionally laborious and requires domain expertise. To mitigate this bottleneck, some research has focused on synthetic data generation to create large, automatically annotated datasets that augment limited real-world data [20,21,22]. The second is the domain shift created by high-resolution UAV imagery; models trained on standard, lower-resolution public datasets often fail to generalize to the unique scale and textural detail present in aerial data [23].
The work by Lemos, Cabral [6] provides a compelling case study in confronting these issues. They developed a methodology combining UAVs and the Mask R-CNN framework to automatically segment corrosion on industrial roofing systems. To address the data bottleneck, they created a large, dedicated database of over 8000 high-resolution images with more than 18,000 annotated instances. More importantly, their work directly addressed the image processing challenge. They found that simply resizing high-resolution images to fit model input dimensions was detrimental, causing small corrosion instances to become “imperceptible to the algorithm, resembling only noise”. This led to a significant drop in performance, with the resized-image model failing to detect small anomalies that the high-resolution cropped-image model successfully identified. By adopting a cropping strategy that preserved resolution and carefully optimizing their architecture, they ultimately achieved a mean Average Precision (mAP50) of 59.2% for mask segmentation, highlighting that both dataset scale and appropriate high-resolution processing are critical for effective damage assessment.
The recent introduction of foundation models, particularly the Segment Anything Model (SAM), offers a promising solution to the data-scarcity problem [24]. Pre-trained on a dataset of over one billion masks, SAM exhibits remarkable zero-shot segmentation capabilities, generating precise masks from simple user prompts like points or bounding boxes without task-specific training. This has opened new frontiers for data-efficient analysis. Initial applications in civil engineering have explored a synergistic approach, coupling an object detector (e.g., YOLO) with SAM, where the detector’s bounding box is used as a spatial prompt to guide the final segmentation [6,25,26].
Building on this, newer approaches have explored a two-stage process that leverages the distinct strengths of different foundation models. Santos and Carvalho [26] demonstrated this by first employing a YOLOv8 object detection model to identify multiple classes of structural damage on bridges, such as cracks, rust stains, and efflorescence, and then using the resulting bounding boxes to guide the powerful Segment Anything Model (SAM) for precise instance segmentation, achieving an mAP50 of 0.951. In a similar manner, Rakshitha, Srinath [25] successfully integrated the Mask R-CNN framework to generate bounding box prompts for SAM, which significantly improved the accuracy of crack segmentation on pavement datasets, achieving a mean Intersection over Union (mIoU) score of up to 0.83. However, direct comparison of these studies remains challenging, as they address different structural contexts and lack a common benchmark dataset for standardized evaluation.
While this two-stage approach is a logical first step, it suffers from a fundamental limitation when applied to the high-resolution imagery characteristic of modern inspections. For geometrically complex and elongated defects like cracks, a bounding box is an inefficient and often ambiguous prompt. A tight bounding box around a thin crack consists largely of background pixels; for a hairline crack running diagonally across a 500 × 500 px box, the crack itself may occupy well under 1% of the enclosed area. This provides a noisy signal to SAM that can result in incomplete or inaccurate segmentation, and it points to the need for a more intelligent, context-aware prompting mechanism that can fully leverage the detail in high-resolution images and transcend the limitations of simple bounding boxes.
This study presents a comprehensive comparative analysis designed to identify the optimal deep learning strategy for segmenting structural damage in high-resolution UAV imagery. The analysis is twofold: first, it quantitatively benchmarks the scaled architectures of the YOLO11 series to establish the most effective foundational detector for the target domain. Second, leveraging the optimized YOLO11 model, it evaluates the efficacy of a novel two-stage framework by systematically comparing three distinct instance segmentation strategies: the model’s native segmentation output, SAM prompted with conventional bounding boxes, and SAM prompted with precise point prompts derived from the detector’s internal probability map. The objective is to determine which approach is most robust and accurate, thereby establishing a clear benchmark for the application of foundation models in automated structural inspection.
3. Results and Discussion
This section presents the results as a progressive ablation study to isolate the contribution of each methodological component. Section 3.1 first identifies the optimal base architecture. Section 3.2 then establishes a performance baseline and quantifies the impact of domain adaptation. Subsequently, Section 3.3 evaluates the effect of a high-resolution processing strategy (SAHI), leading to the final hybrid pipeline, which is optimized and analyzed in Section 3.4.
3.1. Comparative Analysis of YOLO11-Seg Model Variants
To establish the most effective foundation for the initial segmentation stage, a comparative analysis was conducted across the scaled architectures of the YOLO11-seg series. The YOLO11 family offers a spectrum of model sizes, namely Nano (n), Small (s), Medium (m), Large (l), and Extra-Large (x), which present a fundamental trade-off between model complexity and performance. Larger models, such as YOLO11x-seg, contain more parameters, allowing them to learn more intricate features and achieve higher segmentation accuracy. However, this increased complexity comes at the cost of greater computational requirements, leading to longer training times. Conversely, smaller models such as YOLO11n-seg are lightweight and computationally inexpensive but may yield lower accuracy.
For this analysis, five models pre-trained on MS COCO were selected for fine-tuning: yolo11n-seg.pt, yolo11s-seg.pt, yolo11m-seg.pt, yolo11l-seg.pt, and yolo11x-seg.pt. Each model was fine-tuned for 50 epochs on a subset of the public DACL10K bridge damage dataset. To ensure a fair and direct comparison, all training was conducted on an NVIDIA RTX 4090 GPU with a batch size of 4 images. Across all experiments, the models were trained using the Adam optimiser with an initial learning rate of 10⁻², linearly reduced to 10⁻⁴ over the training duration, and a momentum factor of 0.9. To enhance robustness and generalisation, a suite of data augmentation techniques was applied, including mosaic augmentation, random horizontal flips (with a probability of 0.5), random scaling (±0.5), copy-paste instance augmentation, and colour-space adjustments to hue (±0.015), saturation (±0.7), and value (±0.4).
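For context, the sketch below shows how such a fine-tuning run could be expressed with the Ultralytics Python API. It is a minimal illustration, not the exact training script: the dataset YAML path is a hypothetical placeholder, the copy_paste probability is an assumed value, and lrf is the final learning rate expressed as a fraction of lr0 (so lr0=0.01 with lrf=0.01 yields the 10⁻⁴ final rate described above).

```python
from ultralytics import YOLO

# Fine-tune each pre-trained variant under an identical configuration.
# "dacl10k_subset.yaml" is a hypothetical dataset descriptor pointing at
# the DACL10K subset used in this study.
for weights in ["yolo11n-seg.pt", "yolo11s-seg.pt", "yolo11m-seg.pt",
                "yolo11l-seg.pt", "yolo11x-seg.pt"]:
    model = YOLO(weights)
    model.train(
        data="dacl10k_subset.yaml",
        epochs=50,
        batch=4,
        imgsz=640,
        optimizer="Adam",
        lr0=0.01,        # initial learning rate (10^-2)
        lrf=0.01,        # final LR = lr0 * lrf = 10^-4, linear decay
        momentum=0.9,
        mosaic=1.0,      # mosaic augmentation
        fliplr=0.5,      # horizontal flip probability
        scale=0.5,       # random scaling of +/- 0.5
        copy_paste=0.5,  # copy-paste instance augmentation (value assumed)
        hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,  # colour-space jitter
    )
```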
The results of the training reveal a clear relationship between model size, segmentation accuracy, and computational cost. The largest model, YOLO11x-seg, achieved the highest final segmentation accuracy, with a mAP50 of 0.16319. The training graph in Figure 13 shows that the larger models, particularly the large (l) and extra-large (x) variants, reached higher peak performance. However, this superior accuracy comes at a significant computational cost. As shown in Table 1, training time increased with model size, with YOLO11x-seg requiring 99.19 min, compared to just 66.03 min for the smallest YOLO11n-seg model.

Given that the primary objective of this framework is to generate the highest-fidelity probability maps to serve as precise prompts for the Segment Anything Model, achieving maximum initial segmentation accuracy was prioritized over minimizing training time. The higher and more stable performance of the larger models ensures the most reliable input for the second stage of the framework. Therefore, the YOLO11x-seg model was selected as the baseline architecture for all subsequent experiments in this study. This trend, clearly correlated with the model parameter counts presented in Table 1, suggests that for the specific challenge of segmenting subtle and varied damage patterns in high-resolution aerial imagery, a deep architecture with a high parameter count is critical for achieving state-of-the-art accuracy, and that a shallower model would likely be insufficient.
3.2. Analysis of Domain Adaptation and End-to-End Segmentation Strategies
Having established YOLO11x-seg as the most accurate foundational architecture, the subsequent analysis evaluates the most effective end-to-end strategy for generating high-fidelity instance segmentation masks. For this stage of the investigation, all models were trained and evaluated on images resized to a fixed 640 × 640 pixel resolution to establish a performance benchmark under conventional processing constraints. This evaluation follows a progressive methodology, beginning with an assessment of the foundational model’s performance before fine-tuning (trained only on DACL10K, hereafter the baseline) and after fine-tuning on the custom UAV dataset, to quantify the impact of domain adaptation. The output of the optimized, domain-adapted model is then used to compare three distinct segmentation strategies: the model’s own native segmentation output, the conventional SAM framework prompted by bounding boxes, and the proposed framework using SAM prompted by points derived from the probability map.
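To make the two SAM-based strategies concrete, the sketch below shows how both prompt types can be passed to SAM through the segment_anything package’s SamPredictor interface. The checkpoint path is a placeholder, and the simple top-k argmax point selection is an illustrative stand-in for this study’s point-selection algorithm, which is not reproduced here.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load a pre-trained SAM backbone (checkpoint path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

def segment_with_box(image: np.ndarray, box_xyxy: np.ndarray) -> np.ndarray:
    """Strategy 2: prompt SAM with the detector's bounding box (XYXY)."""
    predictor.set_image(image)  # RGB uint8, H x W x 3
    masks, _, _ = predictor.predict(box=box_xyxy, multimask_output=False)
    return masks[0]

def segment_with_points(image: np.ndarray, prob_map: np.ndarray,
                        k: int = 3) -> np.ndarray:
    """Strategy 3 (illustrative): prompt SAM with the k highest-probability
    pixels of the detector's probability map as foreground points."""
    predictor.set_image(image)
    flat = np.argsort(prob_map.ravel())[-k:]       # indices of top-k activations
    ys, xs = np.unravel_index(flat, prob_map.shape)
    points = np.stack([xs, ys], axis=1)            # SAM expects (x, y) order
    labels = np.ones(len(points), dtype=np.int32)  # 1 = foreground point
    masks, _, _ = predictor.predict(point_coords=points,
                                    point_labels=labels,
                                    multimask_output=False)
    return masks[0]
```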
All strategies were benchmarked on the 161-image high-resolution validation set, resized accordingly, with the results presented in Table 2. The initial baseline, representing the YOLO11x-seg model trained only on the public DACL10K dataset, performed poorly when evaluated on the target-domain UAV imagery, achieving an mIoU of just 0.014. This result underscores the significant domain shift between the datasets and confirms that models trained on general public data are not directly applicable to specialized, high-resolution inspection tasks, especially after significant down-scaling. After fine-tuning on the 645 custom UAV images, performance improved dramatically: the domain-adapted model’s native segmentation output achieved an mIoU of 0.118, an over eightfold increase from the baseline. This demonstrates that adapting the model to the visual characteristics of the target dataset is the single most critical factor in achieving a functional foundational model, even when image resolution is compromised.
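For reference, the per-class IoU underlying these mIoU figures can be computed from binary prediction and ground-truth masks as in the minimal sketch below; this is the standard formulation, with the mean taken over classes as usual.

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over Union between two boolean masks of equal shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def mean_iou(per_class_ious: list[float]) -> float:
    """mIoU: the unweighted mean of per-class IoU scores."""
    return sum(per_class_ious) / len(per_class_ious)
```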
With this robust, domain-adapted model as the new baseline, the two SAM-based refinement strategies were evaluated. The results, however, reveal a critical limitation of applying these advanced segmentation frameworks to severely down-scaled imagery. Both SAM strategies underperformed relative to the native output of the domain-adapted YOLO11 model: SAM with bounding box prompts achieved an mIoU of 0.098, while the proposed point-prompting strategy yielded an mIoU of only 0.031. This counterintuitive decrease in performance suggests that the substantial loss of detail from resizing the high-resolution UAV images to 640 × 640 renders the prompts ineffective. The probability maps generated from the low-resolution images are likely too diffuse to provide reliable points, and the object boundaries become too indistinct for a bounding box to serve as a precise guide for SAM. These findings indicate that while domain adaptation is essential, the potential of sophisticated two-stage frameworks like the one proposed is fundamentally constrained by the quality of the input resolution. This motivates the investigation in the following section, which explores a training and inference strategy that preserves the native high resolution of the imagery.
3.3. Performance Enhancement with Slicing Aided Hyper Inference (SAHI)
The preceding analysis established that severely down-scaling high-resolution UAV imagery to a fixed 640 × 640 input resolution acts as a critical performance bottleneck. This process discards essential high-frequency details, fundamentally limiting the efficacy of the foundational detector and rendering advanced prompting strategies for SAM ineffective. To overcome this limitation, the final stage of the investigation evaluates a training and inference strategy that preserves the native resolution of the imagery: Slicing Aided Hyper Inference (SAHI). With this approach, high-resolution images are partitioned into overlapping tiles (20% overlap) that match the model’s native input size, ensuring that no critical visual information is lost to down-scaling. The three end-to-end segmentation strategies (native model output, SAM with bounding box prompts, and SAM with point prompts) were re-evaluated using the foundational YOLO11x-seg model, fine-tuned under the full SAHI protocol.
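As an illustration, sliced inference of this kind can be run with the open-source sahi package roughly as follows. The weights path and confidence threshold are assumed placeholders, and recent sahi releases register Ultralytics models under model_type="ultralytics" (older versions used "yolov8").

```python
from sahi import AutoDetectionModel
from sahi.predict import get_sliced_prediction

# Wrap the fine-tuned detector (weights path is a placeholder).
detection_model = AutoDetectionModel.from_pretrained(
    model_type="ultralytics",
    model_path="yolo11x-seg-finetuned.pt",
    confidence_threshold=0.25,  # assumed value
    device="cuda:0",
)

# Slice the full-resolution image into 640x640 tiles with 20% overlap,
# run inference per tile, and merge predictions back to image coordinates.
result = get_sliced_prediction(
    "uav_inspection_image.jpg",
    detection_model,
    slice_height=640,
    slice_width=640,
    overlap_height_ratio=0.2,
    overlap_width_ratio=0.2,
)
```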
The results, presented in Table 3, demonstrate a substantial performance enhancement directly attributable to the SAHI methodology. By training on high-resolution tiles, the foundational detector learned more discriminative and fine-grained features, with crack-class performance increasing markedly compared to the resized-image experiments. More importantly, this high-fidelity context revealed a more nuanced, class-dependent behaviour among the different segmentation strategies that was previously obscured by the low-resolution data.
The quantitative results highlight that no single strategy is optimal across all damage types. For geometrically complex and elongated defects like cracks, the native, domain-adapted SAHI model demonstrates superior performance (IoU of 0.213) compared to both SAM-based approaches. This is because a bounding box provides an inefficient and ambiguous prompt for such features. A tight bounding box around a thin crack contains a large percentage of background pixels, which provides a noisy signal to SAM’s generalist segmentation engine and can result in incomplete or inaccurate masks. Conversely, the native YOLO11 segmentation head is trained end-to-end specifically on this pathology, allowing it to develop a more effective internal representation for delineating these fine, linear features directly.
Conversely, for damage types that are more compact and have better-defined boundaries, such as efflorescence and exposed rebar, the two-stage framework using SAM with a bounding box prompt shows a clear advantage, achieving the highest IoU scores for both classes (0.140 and 0.092, respectively). This indicates that for less intricate shapes, the bounding box provides a sufficient spatial prior for SAM’s segmentation engine to refine the boundaries of the detected region effectively. The proposed point-prompting strategy, while conceptually sound, did not yield competitive results, suggesting that the generated probability maps, even at high resolution, may not be sufficiently sharp, or that the point-selection algorithm requires further refinement.
Based on this comprehensive analysis, the optimal strategy is not a monolithic approach but a hybrid, class-specific pipeline that leverages the respective strengths of the specialist detector and the foundational model.
3.4. Final Strategy and Qualitative Analysis
It is important to address the exclusion of the point-prompting strategy from this final pipeline, despite its initial focus in the methodology. The initial hypothesis was that precise point prompts derived from the model’s activation map would outperform bounding boxes. However, the experimental results consistently demonstrated the opposite. As detailed in Section 3.2 and Section 3.3, the point-prompting strategy yielded the lowest performance under both the resized and the high-resolution SAHI conditions. This suggests that the generated probability maps, while useful internally to the model, may not be sufficiently sharp for reliable external prompting, or that the point-selection algorithm itself requires more advanced refinement. Therefore, based on this clear empirical evidence, the final hybrid model was constructed using only the strategies that proved most effective in practice: the native detector output and the bounding-box-prompted SAM.
The findings from the comparative analysis culminate in a final, optimized hybrid strategy, illustrated in Figure 14. The proposed pipeline dynamically assigns the segmentation process according to the identified damage class. Specifically, for cracks, the segmentation masks produced directly by the domain-adapted, SAHI-trained YOLO11x-seg model are employed. In contrast, for efflorescence and exposed rebar, the bounding box predictions generated by the same model are used as prompts for the Segment Anything Model (SAM), which subsequently produces the final segmentation output. In practice, this hybrid routing is implemented as conditional logic in the post-processing script, as sketched below. The YOLO11x-seg model first generates predictions, each containing a class label, a native segmentation mask, and a bounding box. The script then iterates through each detection: if the predicted class label is ‘crack’, the native mask is retained; if the label is ‘efflorescence’ or ‘exposed rebar’, the script discards the native mask and uses the detection’s bounding box to prompt SAM, taking the resulting SAM-generated mask as the final output for that instance.
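The following is a minimal sketch of that routing logic, assuming each detection has been unpacked into a (label, mask, box) triple and reusing the segment_with_box helper from the Section 3.2 sketch; all names are illustrative rather than taken from the actual post-processing script.

```python
import numpy as np

# Classes routed to SAM for boundary refinement (per the hybrid strategy).
SAM_CLASSES = {"efflorescence", "exposed rebar"}

def route_instance(image: np.ndarray, label: str,
                   native_mask: np.ndarray, box_xyxy: np.ndarray) -> np.ndarray:
    """Class-conditional mask selection for the hybrid pipeline."""
    if label == "crack":
        # Cracks: keep the detector's native segmentation mask.
        return native_mask
    if label in SAM_CLASSES:
        # Compact damage: discard the native mask and let SAM refine
        # the region inside the detector's bounding box.
        # segment_with_box is the helper defined in the Section 3.2 sketch.
        return segment_with_box(image, box_xyxy)
    return native_mask  # fallback for any other class
```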
To maximize the efficacy of this hybrid approach, a final hyperparameter optimization was performed using the Ray Tune library. This process revealed several key adjustments to the training configuration. It identified Stochastic Gradient Descent (SGD) as the optimal optimizer for the final fine-tuning stage. Furthermore, the optimization indicated that several common data augmentation techniques, including mosaic, random scaling, translation, and horizontal flips, were counterproductive; disabling them for the final fine-tuning on the highly specific UAV dataset prevented the distortion of fine-grained features and led to a more robust model. The final optimized training configuration that produced these results is detailed in Table 4; a sketch of how such a search can be launched is shown below.
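Ultralytics exposes a Ray Tune integration through model.tune(use_ray=True); the sketch below shows how a search of this kind could be launched. The weights path, dataset YAML, and search space are illustrative assumptions, not the exact configuration used in this study.

```python
from ray import tune
from ultralytics import YOLO

model = YOLO("yolo11x-seg-finetuned.pt")  # placeholder weights path

# Launch a Ray Tune search over learning-rate and augmentation settings.
results = model.tune(
    data="uav_damage.yaml",  # hypothetical dataset descriptor
    use_ray=True,
    space={
        "lr0": tune.uniform(1e-4, 1e-2),
        "momentum": tune.uniform(0.85, 0.95),
        "mosaic": tune.choice([0.0, 1.0]),     # allow disabling mosaic
        "scale": tune.uniform(0.0, 0.5),
        "translate": tune.uniform(0.0, 0.2),
        "fliplr": tune.choice([0.0, 0.5]),     # allow disabling flips
    },
    epochs=50,
    gpu_per_trial=1,
)
```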
The implementation of this optimized hybrid strategy resulted in a substantial boost in performance. As detailed in Table 5, the model achieved a final mAP50 (box) of 0.602 and a more stringent mAP50-95 (box) of 0.361 on the same validation set, alongside a mAP50 (mask) of 0.582 and a mAP50-95 (mask) of 0.281, for an average mAP50 of 0.593. The final segmentation IoU for cracks reached 0.495, while the IoUs for efflorescence and exposed rebar were 0.331 and 0.205, respectively.
A qualitative analysis provides further insight into the practical performance of the proposed hybrid pipeline. Figure 15 presents a selection of results from the validation set, comparing the model’s predictions to the ground truth annotations. In the first example (top row), the model accurately delineates a complex network of fine cracks and correctly segments a large area of exposed rebar, demonstrating the successful application of both branches of the hybrid strategy. The subsequent examples demonstrate the model’s ability to handle challenging lighting conditions and diverse damage morphologies, and, in some cases, even to identify damage missed in the human-annotated ground truth.
However, the analysis also reveals limitations. In some cases, very fine hairline cracks are missed, and the boundaries of efflorescence can be ambiguous where the damage fades gradually into the concrete substrate, which helps explain the comparatively lower IoU scores for this damage class. A similar challenge occurs with exposed rebar, where defining the precise boundary between the bars and the surrounding irregular spalling leads to lower IoU scores. These instances highlight areas for future improvement, such as the potential integration of more advanced prompt-generation techniques or further refinement of the model architecture. Nonetheless, the qualitative results confirm that the proposed hybrid strategy provides a robust and effective solution for segmenting multiple classes of structural damage in high-resolution aerial imagery.
4. Conclusions
This study introduced and validated a novel hybrid deep learning pipeline that synergizes a YOLO11x-seg detector with the Segment Anything Model (SAM) for multi-damage segmentation of cracks, efflorescence, and exposed rebar in high-resolution UAV inspection imagery. The findings demonstrate that a class-specific approach is optimal, as no single segmentation strategy proved globally effective for all damage types. Specifically, the final hybrid model achieved a robust mean Average Precision (mAP50) of 0.593, with the native YOLO output yielding a superior Intersection over Union (IoU) of 0.495 for linear cracks, while the SAM-based approach was more effective for efflorescence (0.331 IoU) and exposed rebar (0.205 IoU). Furthermore, the analysis revealed the critical importance of Slicing Aided Hyper Inference (SAHI) in overcoming the performance bottleneck caused by down-scaling high-resolution images, providing new insight into the processing of aerial inspection data.
The primary implication of this work is the potential for more accurate and reliable automated structural health monitoring systems. By providing a more robust and context-aware solution, this research paves the way for intelligent inspection frameworks that dynamically leverage the respective strengths of specialized detectors and powerful foundation models. These findings will be valuable for researchers and practitioners in the field of automated structural inspection and computer vision in civil engineering.
However, the study acknowledges several limitations that offer avenues for future research. The current study was conducted using a custom high-resolution dataset focused primarily on concrete bridge infrastructure. Therefore, the generalizability of the findings to different types of structures or materials under varied environmental conditions needs further investigation. Furthermore, while this study provides a deep comparison of strategies within the YOLO/SAM framework, it does not include benchmarks against other established segmentation architectures like Mask R-CNN or DeepLabv3+. Future work should include such a comparison to more broadly validate the superiority of the proposed hybrid approach. Additionally, the proposed point-prompting strategy for SAM, a key initial component of the investigation, did not outperform bounding box prompts, indicating that the prompt-generation algorithm requires further refinement to effectively translate the detector’s confidence maps into reliable spatial cues for SAM. Finally, the proposed hybrid pipeline, while effective, introduces computational complexity not fully explored in this study. The use of SAHI is resource-intensive, and the class-conditional branching adds a layer of logic to the inference process.
Future work will focus on addressing these limitations. The immediate goal is to investigate more advanced prompt-generation techniques to enhance the performance of point-guided segmentation for all damage classes. Subsequently, it is planned to explore the integration of this hybrid pipeline into a real-time, on-board processing system for UAVs to provide immediate field-level analysis. These efforts will further enhance the robustness and utility of the developed technology, contributing to the advancement of intelligent infrastructure monitoring.