Keyframe-Guided Crack Segmentation and 3D Localization for UAV-Based Monocular Inspection

Tang, Feifei; Gongzhabayier, Wuyuntana; Li, Jing; Zhou, Tao; Qiu, Yue; Zhan, Yong; Song, Qiulin

doi:10.3390/sym18040657


by Feifei Tang^1,2,3,*, Wuyuntana Gongzhabayier^1, Jing Li^1, Tao Zhou^3,4, Yue Qiu^3,4, Yong Zhan^3,4 and Qiulin Song

^1 School of Smart City, Chongqing Jiaotong University, Chongqing 400074, China

^2 Chongqing Key Laboratory of Spatio-Temporal Information of Mountain City, Chongqing 400074, China

^3 Smart City Spatio-Temporal Information and Equipment Engineering Technology Innovation Center, Ministry of Natural Resources, Chongqing 400021, China

^4 Chongqing Academy of Surveying and Mapping, Chongqing 401120, China

^* Author to whom correspondence should be addressed.

Symmetry 2026, 18(4), 657; https://doi.org/10.3390/sym18040657

Submission received: 28 February 2026 / Revised: 18 March 2026 / Accepted: 22 March 2026 / Published: 15 April 2026

(This article belongs to the Special Issue Symmetry/Asymmetry in Intelligent Transportation)


Abstract

In unmanned aerial vehicle (UAV)-based monocular inspection, cracks typically present as geometrically asymmetric, elongated, low-contrast weak targets, making accurate segmentation and spatial localization challenging. Existing methods are susceptible to missed detections and false positives when handling slender cracks, and monocular 3D reconstruction for localization is often burdened by redundant frames, resulting in limited modeling efficiency. To mitigate these issues, we propose a high-precision framework for crack segmentation and spatial localization from UAV imagery. First, Oriented FAST and Rotated BRIEF–Simultaneous Localization and Mapping, version 3 (ORB-SLAM3) is adopted for keyframe selection to suppress data redundancy and improve reconstruction stability. Second, we develop an enhanced YOLOv11-seg model by integrating the Dilation-wise Residual Segmentation (DWRSeg) module, the Weighted IoU (WIoU) loss, and the Lightweight Shared Convolution and Separate Batch Normalization Detection Head (LSCSBD) to strengthen feature discrimination and segmentation robustness for slender cracks, yielding high-quality crack masks. Finally, the predicted masks are projected onto the reconstructed 3D surface to obtain precise spatial localization. Experimental results show that the proposed approach improves the segmentation mAP@50 by 7.2% over the baseline while reducing computational complexity from 10.2 to 9.8 GFLOPs. In addition, keyframe-based processing reduces the 3D modeling time by 59.4% compared with full-frame reconstruction. Overall, the proposed framework jointly enhances crack segmentation accuracy and substantially accelerates 3D modeling and localization, providing an effective solution for efficient UAV-based crack inspection.

Keywords: UAV; crack segmentation; 3D localization; YOLOv11; ORB-SLAM3; SfM-MVS

1. Introduction

Concrete bridges undergo significant degradation under the prolonged influence of complex loading scenarios and environmental impacts. To ensure the structural integrity, serviceability, and durability of these assets, bridge inspection has emerged as a focal point in both academic discourse and engineering applications. Among various structural defects, surface-visible cracks represent a critical diagnostic indicator and a primary factor compromising long-term durability [1,2]. Such fissuring can trigger the spalling of concrete cover, thereby exposing internal reinforcement to an elevated risk of corrosion. Furthermore, rapidly propagating cracks often serve as precursors to catastrophic structural failure. Consequently, the precise detection and characterization of bridge cracks constitute a pivotal component in rigorous safety assessment and durability-oriented maintenance strategies.

Traditional bridge crack inspection predominantly relies on manual visual assessment, often supported by specialized bridge inspection vehicles. However, such practices are typically characterized by low efficiency, substantial labor demands, and non-negligible safety risks. In addition, inspection operations are highly constrained by traffic conditions, hindering rapid and routine deployment at scale. In recent years, Unmanned Aerial Vehicles (UAVs) have emerged as a promising alternative, owing to their high maneuverability, flexible deployment, and ability to acquire close-range, high-resolution visual data. Despite these advantages, UAV-based inspections generate large volumes of imagery and video, making manual interpretation time-consuming and susceptible to subjectivity. Notably, Graybeal et al. [3] reported substantial inter-inspector variability in crack identification for the same bridge structure, underscoring the limitations of purely manual assessments. To improve both objectivity and efficiency, deep-learning-driven computer vision techniques have increasingly been incorporated into automated crack detection and segmentation pipelines.

Deep-learning-based crack recognition methods can generally be divided into two categories: semantic segmentation methods and detection-framework-based methods. Semantic segmentation methods (e.g., U-Net [4] and DeepLab [5]) extract crack regions through pixel-wise classification and therefore offer advantages in boundary recovery and fine-detail representation. Efficient semantic segmentation networks for real-time vision tasks have also continued to develop in recent years. For example, BiSeNetV2, proposed by Yu et al. [6], balances speed and accuracy through the collaborative modeling of a detail branch and a semantic branch, while DDRNet, proposed by Pan et al. [7], improves real-time segmentation performance through dual-resolution parallel processing and repeated feature fusion. In addition, researchers have made various improvements for crack segmentation tasks. For instance, Zou et al. [8] enhanced the segmentation capability for pavement defect images by improving a U-shaped network. RHACrackNet, proposed by Zhu et al. [9], strengthens the representation of crack details and boundary information by introducing hybrid attention and residual structures; OUR-Net, proposed by Li et al. [10], improves the extraction of crack textures at different scales through multi-frequency feature modeling; and AutoCrackNet, proposed by Zhu et al. [11], improves segmentation efficiency and deployment potential through an automated lightweight network design. However, these methods usually require pixel-wise prediction over the entire image. For bridges, especially bridge piers, complex background interference is common, which makes whole-image pixel-level segmentation more prone to false detections. By contrast, detection-framework-based methods are better at focusing on target regions.

Detection-framework-based methods can be further divided into two categories: two-stage detectors based on region proposals and one-stage detectors based on direct regression. Two-stage methods (e.g., Faster R-CNN [12] and Mask R-CNN [13]) typically generate candidate regions first and then perform classification and bounding-box regression, and therefore often achieve relatively high detection accuracy in complex backgrounds and small-target scenarios. Li et al. [14] combined close-range UAV imaging with Faster R-CNN for automatic bridge crack detection, while Huang et al. [15] and Liu et al. [16] improved feature representations within the Mask R-CNN framework to enhance crack detection and segmentation performance. However, two-stage approaches involve relatively long inference pipelines with higher computational cost and latency, which limits their applicability in UAV inspections where resources are constrained or real-time performance is required. In contrast, one-stage detectors (e.g., the You Only Look Once (YOLO) family [17], Single Shot MultiBox Detector (SSD) [18], and RetinaNet [19]) feature simpler architectures and faster inference, making them more suitable for real-time inspection tasks. Among them, the YOLO series provides a favorable trade-off between accuracy and speed. Nevertheless, cracks are typically characterized by elongated shapes, small scales, and low contrast; consequently, even efficient one-stage frameworks such as YOLO remain prone to missed detections and insufficient sensitivity. To address these challenges, existing studies have improved YOLO from multiple perspectives. For instance, Dong et al. [20] introduced attention modules into YOLO11n to improve detection accuracy and incorporated a lightweight detection head to enhance efficiency; however, inference latency and limited sensitivity to fine cracks in complex structures remain issues. Yu et al. [21] integrated Transformer components into YOLOv5 to strengthen the detection of slender cracks, yet global attention mechanisms can be strongly affected by background noise. Xu et al. [22] enhanced feature fusion in YOLOv8n and optimized training with intersection over union (IoU)-based losses to improve robustness; nevertheless, detection accuracy remains low when crack–concrete contrast is weak, and crack predictions may appear fragmented. Therefore, there remains substantial room for improvement in achieving high-precision and continuous detection of subtle, slender cracks in complex scenarios.

However, image-level detection and segmentation results alone lack explicit spatial and geographic attributes and cannot adequately describe the distribution of cracks on a structure, limiting their ability to support bridge condition assessment. With the rapid development of 3D reconstruction techniques, computer-vision downstream tasks have become increasingly integrated with 3D modeling, and the demand in defect inspection has gradually shifted from recognition and segmentation toward spatial localization. Existing defect localization approaches can be broadly classified into two categories.

The first category relies on inter-image feature matching to stitch multi-view images into a panorama, thereby enabling defect localization. For example, Jiang et al. [23] collected structural images using a wall-climbing robot and, under the assumption of approximately parallel image planes, stitched crack detection results into a panoramic image for localization. Won et al. [24] proposed a deep matching-based stitching strategy to improve image correspondence quality, generating panoramic views of bridge piers and achieving crack localization. Such methods typically depend on planar or near-planar assumptions and are therefore difficult to extend to complex 3D scenarios with intricate geometries, such as bridges. The second category reconstructs the scene using 3D reconstruction algorithms and then projects 2D detection/segmentation outputs onto the reconstructed 3D model to localize defects. Liu et al. [25] generated a mesh model from UAV imagery using the structure-from-motion and multi-view stereo (SfM–MVS) pipeline and projected crack results onto the 3D model for localization. Deng et al. [26] produced dense point clouds via SfM–MVS and mapped segmented cracks onto the point cloud. Although these approaches have demonstrated effectiveness for 3D visualization of cracks, conventional SfM typically processes all images, resulting in substantial data redundancy and long reconstruction cycles. In comparison, simultaneous localization and mapping (SLAM)-based frameworks enable real-time modeling and pose estimation. Charron et al. [27] attempted to enhance SLAM stability by fusing LiDAR, cameras, and inertial measurements; however, variations in illumination and platform vibrations can lead to inconsistent map accuracy, undermining the reliability of crack localization. McLaughlin et al. [28] reported that, during repeated inspections, maps acquired in different sessions are difficult to align automatically and often require labor-intensive manual calibration, which can introduce significant accumulated drift and geometric distortion, leaving global accuracy insufficient. Overall, relying solely on real-time SLAM reconstructions makes it challenging to achieve the precision required for practical applications.

To address the above challenges in crack detection, segmentation, and localization, the main contributions of this study are as follows:

(1): A YOLO-DWL model for bridge crack segmentation is proposed. Built upon the YOLOv11 framework, the dilation-wise residual (DWR)-C3k2 module is introduced to enhance multi-scale feature extraction under different receptive fields, thereby improving the representation of slender cracks. In addition, the Weighted IoU (WIoU) loss is incorporated to improve the stability of bounding-box regression and reduce the adverse effects of low-quality samples during training. Meanwhile, the Lightweight Shared Convolution and Separate Batch Normalization Detection Head (LSCSBD) is adopted to enhance the sensitivity to small-scale crack targets while maintaining a lightweight architecture, thereby enabling robust extraction of subtle defects.
(2): An ORB-SLAM3 keyframe-constrained SfM–MVS 3D reconstruction method is developed. By employing ORB-SLAM3 for keyframe selection and pose estimation on UAV video sequences, the adverse effects of redundant images in high-frame-rate videos on feature matching and global optimization are reduced. Combined with SfM–MVS, this method enables the efficient dense reconstruction of bridge surfaces, thereby significantly improving 3D reconstruction efficiency while preserving model completeness and geometric accuracy.
(3): A spatial mapping method from crack detection results to the 3D model surface is established. By using a pixel back-projection and triangle-mesh intersection strategy, the 2D crack segmentation results are projected onto the reconstructed 3D model surface, thereby enabling the intuitive visualization and accurate spatial localization of cracks in 3D space.

2. Methodology for Crack Identification, Segmentation, and Localization

The overall workflow of the proposed methodology is illustrated in Figure 1. Initially, targeting high-frame-rate UAV bridge inspection videos, ORB-SLAM3 is employed to achieve real-time camera pose estimation and automatic keyframe selection. This process eliminates a vast number of redundant consecutive frames, retaining only keyframes with effective spatial constraints for subsequent modeling. On this basis, a dense 3D reconstruction using SfM-MVS is executed on the selected keyframes to rapidly construct the 3D bridge model, significantly enhancing modeling efficiency while maintaining reconstruction accuracy. Subsequently, the enhanced YOLO-DWL crack segmentation model is utilized to perform high-precision crack detection and pixel-level segmentation on the keyframe images. Finally, based on the camera’s intrinsic and extrinsic parameters and pose constraints, the 2D segmentation masks are back-projected onto the surface of the 3D reconstructed model. This achieves precise spatial localization of the cracks, forming a complete technical pipeline of “near-real-time modeling, precision segmentation, and spatial localization” for bridge cracks.

2.1. ORB-SLAM3 Keyframe-Constrained SfM-MVS 3D Modeling Method

In the 3D reconstruction of bridge structures, UAVs typically capture continuous video sequences at high frame rates. Directly inputting the entire image set into traditional SfM-MVS frameworks for modeling often leads to significant image redundancy, prohibitive computational overhead, and excessively long modeling cycles in engineering applications. Furthermore, the presence of a vast number of images with near-identical viewpoints and highly redundant information within high-frame-rate sequences increases the complexity of feature matching and optimization. This frequently triggers issues such as matching instability, geometric degradation, and the accumulation of mismatches, which compromise the robustness of the reconstruction process and the overall consistency of the model. Consequently, these challenges further constrain the high-precision spatial localization and 3D visualization of defects such as cracks.

To address the aforementioned challenges, this study adopts a hybrid 3D modeling approach based on SfM-MVS with ORB-SLAM3 keyframe constraints, as shown in Figure 2. Initially, monocular ORB-SLAM3 is utilized for real-time pose estimation and keyframe selection from the high-frame-rate UAV video sequences, retaining only a subset of keyframes that possess effective viewpoint variations and geometric constraints. On this basis, the standard SfM-MVS reconstruction is executed exclusively on this keyframe subset. This strategy significantly reduces the reconstruction scale and computational overhead while ensuring both geometric accuracy and model completeness.

Before ORB-SLAM3-based pose estimation, camera calibration was performed using Zhang’s calibration method [29]. During calibration, a $9 \times 12$ checkerboard was used, and calibration images were captured at different viewing angles and distances, with the checkerboard covering both the central and peripheral regions of the image plane as far as possible. Some example calibration images are shown in Figure 3. Through corner detection, subpixel refinement, and nonlinear optimization, the camera intrinsic parameters $(f_{x}, f_{y}, c_{x}, c_{y})$ and the radial–tangential distortion coefficients $(k_{1}, k_{2}, k_{3}, p_{1}, p_{2})$ were estimated. The calibrated parameters were then written into the ORB-SLAM3 YAML configuration file and kept consistent with the image settings used in both ORB-SLAM3 and COLMAP 3.8 (Windows CUDA build). This ensured that distortion correction and reprojection calculations were performed under a unified camera model, thereby reducing systematic projection errors caused by image scaling or inconsistent pixel coordinate settings.
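To make the role of the calibrated parameters concrete, the following minimal sketch applies the standard radial–tangential (Brown–Conrady) distortion model to a point in the camera frame and projects it to pixel coordinates. The function names `distort` and `project` and all numeric values are illustrative assumptions, not taken from the paper. The coefficient order follows the text, $(k_{1}, k_{2}, k_{3}, p_{1}, p_{2})$, which differs from the order some libraries use.

```python
def distort(x, y, k1, k2, k3, p1, p2):
    """Apply radial-tangential distortion to normalized image coordinates (x, y)."""
    r2 = x * x + y * y
    # Radial term: polynomial in r^2 with coefficients k1, k2, k3
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    # Tangential terms with coefficients p1, p2
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return x_d, y_d


def project(point_cam, fx, fy, cx, cy, dist):
    """Project a 3D point given in the camera frame to pixel coordinates.

    `dist` is (k1, k2, k3, p1, p2), in the order used in the text.
    """
    X, Y, Z = point_cam
    x_d, y_d = distort(X / Z, Y / Z, *dist)
    return fx * x_d + cx, fy * y_d + cy


# Hypothetical intrinsics for a 1920 x 1080 image; not the calibrated values.
u, v = project((0.2, -0.1, 2.0), 1000.0, 1000.0, 960.0, 540.0,
               (0.05, 0.0, 0.0, 0.0, 0.0))
```

Because the same model must be used by both the tracking stage and the reconstruction stage, any mismatch in coefficient order or image scale between the two configuration files would show up directly as a reprojection offset.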

During the pose estimation stage, the calibrated camera parameters were loaded into the monocular ORB-SLAM3 pipeline, allowing lens distortion to be explicitly modeled during tracking and reprojection calculations and ensuring consistency between the geometric imaging model and the physical camera. Subsequently, the high-frame-rate video sequences captured by the UAV were fed into ORB-SLAM3 for real-time pose estimation and mapping. In this study, the original UAV video lasted 2 min 10 s at 30 fps. To reduce temporal redundancy, frames were extracted from the video at approximately 15 fps using FFmpeg [30], resulting in 1930 images for subsequent ORB-SLAM3 processing. The extracted images were retained at the original resolution to preserve sufficient feature detail for stable tracking, feature matching, and subsequent SfM-MVS reconstruction, while maintaining consistency with the calibrated camera parameters. During operation, ORB-SLAM3 adaptively inserted keyframes based on the quantity and distribution of feature points, as well as the platform’s motion state. Following the default monocular keyframe insertion strategy of ORB-SLAM3, the decision logic for keyframe insertion can be summarized in Equation (1):

$$\mathrm{InsertKF}(t) = \left[ (\Delta f \geq m_{\max}) \cup (\Delta f \geq m_{\min} \cap I_{LM} = 1) \right] \cap \left( \frac{N_{t}^{\mathrm{tracked}}}{N^{\mathrm{ref}}} < 0.9 \cap N_{t}^{\mathrm{map}} > 15 \right) \tag{1}$$

where $\mathrm{InsertKF}(t)$ indicates whether a keyframe is inserted at frame $t$; $\Delta f$ denotes the frame interval; $m_{\max}$ and $m_{\min}$ are the maximum and minimum keyframe interval thresholds, respectively; $I_{LM}$ is the state indicator of the local mapping thread (1/0 corresponds to idle/busy); $N_{t}^{\mathrm{tracked}}$ is the number of map points successfully tracked in the current frame $t$; $N^{\mathrm{ref}}$ is a reference value; and $N_{t}^{\mathrm{map}}$ is the number of available map points.
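The decision rule in Equation (1) can be read as a plain Boolean predicate. The sketch below is an illustration of that reading only, not the ORB-SLAM3 source code; the function name `insert_keyframe`, the argument names, and the example threshold values are assumptions introduced here.

```python
def insert_keyframe(delta_f, m_min, m_max, local_mapping_idle,
                    n_tracked, n_ref, n_map,
                    ratio_thresh=0.9, min_map_points=15):
    """Evaluate the keyframe-insertion condition summarized in Equation (1).

    delta_f            -- frame interval (Delta f)
    m_min, m_max       -- minimum / maximum keyframe interval thresholds
    local_mapping_idle -- True when the local mapping thread is idle (I_LM = 1)
    n_tracked          -- map points tracked in the current frame
    n_ref              -- reference value N^ref (must be positive)
    n_map              -- number of available map points
    """
    # Timing condition: enough frames have passed, or the minimum interval
    # has passed and the local mapping thread is free.
    timing_ok = (delta_f >= m_max) or (delta_f >= m_min and local_mapping_idle)
    # Tracking condition: the tracked ratio has dropped below the threshold
    # while enough map points remain available.
    tracking_ok = (n_tracked / n_ref < ratio_thresh) and (n_map > min_map_points)
    return timing_ok and tracking_ok


# Illustrative values only: interval thresholds of 0 and 15 frames.
decision = insert_keyframe(delta_f=15, m_min=0, m_max=15,
                           local_mapping_idle=False,
                           n_tracked=80, n_ref=100, n_map=50)
```

Read this way, a frame becomes a keyframe only when both the timing condition and the tracking condition hold, which is why long runs of near-identical frames with a high tracked ratio are skipped.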

Based on this strategy, ORB-SLAM3 automatically filters out a large number of consecutive frames with near-identical viewpoints and high informational redundancy, while retaining frames with effective viewpoint changes and sufficient geometric support. Finally, 185 keyframes were retained without manual screening. The selected keyframes, together with the estimated camera trajectories, were exported in temporal order and directly organized as the image input for subsequent COLMAP reconstruction.

In Figure 4, the red and black dots denote reconstructed 3D map points during the ORB-SLAM3 process, the blue camera wireframes distributed along the lower trajectory represent the estimated keyframe poses, the green wireframe indicates the current camera pose, and the red wireframe indicates the initial camera pose.

In COLMAP, sparse reconstruction was first carried out on the retained keyframes through feature extraction, feature matching, triangulation, and bundle adjustment, followed by dense reconstruction via image undistortion, PatchMatch-based depth estimation, depth fusion, and surface generation. This workflow effectively reduces the reconstruction scale and computational burden while maintaining geometric completeness and reconstruction stability, thereby providing a reliable 3D carrier for subsequent crack projection and localization.

2.2. Crack Segmentation Model: YOLO-DWL

The slender, low-contrast, and fragmented morphology of bridge cracks presents significant challenges for standard detection and segmentation networks, particularly in boundary localization, continuity preservation of thin lines, and perception of small-scale targets. To enhance the detection rate of continuous cracks and the precision of boundary localization, we propose a joint improvement strategy for slender cracks based on the YOLOv11 segmentation framework (Figure 5). First, specific C3k2 modules in the backbone network are replaced with DWR-C3k2 to strengthen the directional thin-line representation. Second, WIoU is adopted as the bounding box regression loss to improve regression stability under hard-sample conditions. Finally, the LSCSBD lightweight detection head is introduced to enhance small-scale crack perception and feature alignment capabilities while simultaneously reducing computational overhead.

2.2.1. DWRSeg Segmented Module

In contrast to the common assumption in real-time segmentation frameworks that “larger receptive fields are inherently better” [31], DWRSeg focuses on receptive field efficiency by configuring matched receptive field scales at different semantic stages [32]. Low-level features predominantly comprise information such as edges and textures; an excessively large receptive field often introduces redundant context and weakens the representation of slender targets. Conversely, high-level features possess stronger semantics and require larger receptive fields to aggregate structural information. Bridge cracks are characterized by slender, low-contrast, and fragmented morphologies. Consequently, the segmentation process relies both on the sensitivity of low-level features to thin-line edges and continuous textures (to avoid the “smoothing or fracturing” caused by downsampling) and on the cross-regional contextual aggregation capability of high-level features to enhance background suppression and overall coherence. A uniform receptive field strategy typically leads to insufficient feature extraction efficiency.

As illustrated in Figure 6, the input feature is first mapped by a 3 × 3 convolution and then split into three parallel branches along the channel dimension. Each branch further applies a 3 × 3 depth-wise dilated convolution with dilation rates of 1, 3, and 5, respectively, so as to capture local, mid-range, and global contextual information. The outputs of the three branches are then concatenated along the channel dimension and fused by a 1 × 1 convolution. Finally, the fused feature is combined with the shortcut branch through element-wise addition to generate the final output, thereby enhancing multi-scale representation while preserving the original feature information. In the figure, “C” denotes concatenation, and “+” denotes element-wise addition.
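As a small worked example of why the three dilation rates cover different context ranges, the effective spatial extent of a single $k \times k$ convolution with dilation $d$ is $d(k - 1) + 1$. The helper name `effective_kernel_size` below is introduced here for illustration and does not appear in the paper.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by one dilated convolution kernel."""
    return dilation * (kernel_size - 1) + 1


# Dilation rates of the three parallel 3 x 3 depth-wise branches
sizes = {d: effective_kernel_size(3, d) for d in (1, 3, 5)}
print(sizes)  # {1: 3, 3: 7, 5: 11}
```

Each branch therefore keeps the nine weights of a 3 × 3 kernel while sampling windows of 3, 7, and 11 pixels per side, which is how the block widens its context without a matching growth in parameters.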

Based on the aforementioned insights, DWRSeg was embedded into the C3k2 modules of the YOLOv11 backbone to build the DWR-C3k2 module. The original external connections of C3k2 were preserved, while the internal feature extraction part was strengthened by DWRSeg. This allows the network to better balance fine crack detail preservation and contextual information extraction. However, the added multi-branch dilated operations increase the complexity of feature fusion and bring additional computational overhead. The influence of different replacement positions was further evaluated in the subsequent experiments, where the configuration using positions 1, 2, and 4 showed the best overall performance.

2.2.2. Introducing WIoU to Improve Regression Stability and Small-Target Localization Accuracy

Because bridge cracks have slender shapes and irregular boundaries, a large number of low-quality predicted boxes often emerge during the initial training phase. This not only exacerbates the imbalance between positive and negative samples but also introduces unstable gradients, making it difficult for the loss to converge. To address this, the bounding box regression loss is replaced by WIoU, which is jointly optimized with Distribution Focal Loss (DFL). This strategy reduces the contribution of samples at both quality extremes (very high or very low) to the total loss, thereby focusing the training on higher-value samples. The aggregate loss function is formulated as

$$L = \lambda_{cls} L_{BCE} + \lambda_{box} L_{WIoU} + \lambda_{dfl} L_{DFL} \tag{2}$$

The core concept of WIoU involves introducing a weighting mechanism based on sample outlierness (dispersion) alongside the standard IoU regression term. This approach prevents the model from being excessively steered by extreme, low-quality samples during training, thereby achieving smoother and more stable bounding box convergence. The formulation is expressed as

$$L_{WIoU} = R_{WIoU} \cdot (1 - IoU) \tag{3}$$

Specifically, it actively suppresses the influence of “outlier bounding boxes”—those with significant localization deviations or that potentially represent background noise—on the loss function. Simultaneously, DFL is incorporated as a distribution regression term to refine recognition precision. This integration enhances the detection capabilities for slender, small-scale targets without incurring a significant increase in computational overhead.
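The text gives only the general form of Equation (3) and does not state how $R_{WIoU}$ is computed. As a hedged illustration, the sketch below assumes the distance-based factor commonly used in the first WIoU variant: the squared distance between box centers divided by the squared diagonal of the smallest enclosing box, passed through an exponential. The outlierness-based weighting described in the text is not included, and the function names and box values are hypothetical.

```python
import math


def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def wiou_loss(pred, target, eps=1e-9):
    """L_WIoU = R_WIoU * (1 - IoU), Equation (3).

    ASSUMPTION: R_WIoU is the distance-based factor
    exp(center_distance^2 / enclosing_box_diagonal^2); the source does not
    specify its exact form.
    """
    cx_p, cy_p = (pred[0] + pred[2]) / 2.0, (pred[1] + pred[3]) / 2.0
    cx_t, cy_t = (target[0] + target[2]) / 2.0, (target[1] + target[3]) / 2.0
    # Width and height of the smallest box enclosing both boxes
    w_g = max(pred[2], target[2]) - min(pred[0], target[0])
    h_g = max(pred[3], target[3]) - min(pred[1], target[1])
    dist2 = (cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2
    r_wiou = math.exp(dist2 / (w_g ** 2 + h_g ** 2 + eps))
    return r_wiou * (1.0 - iou(pred, target))


# Hypothetical boxes: the prediction is shifted one unit to the left.
loss = wiou_loss((0.0, 0.0, 2.0, 2.0), (1.0, 0.0, 3.0, 2.0))
```

Under this assumption the factor is at least 1, so a box whose center is far from the target is penalized more strongly than the plain $1 - IoU$ term alone would suggest.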

2.2.3. LSCSBD-Based Lightweight Detection Head for Small-Scale Crack Perception

Traditional segmentation heads typically employ convolutional branches with identical structures but independent parameters across the P3, P4, and P5 scales. This often results in the repetitive stacking of convolutions, leading to significant structural redundancy in terms of both parameter count and computational overhead. Furthermore, statistical distribution disparities among multi-scale features can amplify training fluctuations, which is detrimental to the segmentation of small-scale cracks. To achieve a lightweight architecture and stable predictions, this study replaces the default segmentation head of YOLOv11-seg with LSCSBD. This modified head directly receives features from the P3, P4, and P5 scales to perform joint prediction.

As illustrated in Figure 7, the LSCSBD head first performs channel alignment on the multi-scale features (P3–P5) via 1 × 1 convolutions. Subsequently, a cross-scale shared 3 × 3 convolution is employed for unified feature refinement, thereby mitigating the parameter redundancy and computational overhead caused by repetitive convolutions in traditional multi-scale heads. Simultaneously, scale-independent normalization (SN) is introduced following the shared convolution to accommodate the statistical distribution disparities across different hierarchical levels, enhancing training stability and the consistency of cross-scale fusion. On this basis, the original features from P3–P5 are directly mapped to the output stage and fused with the mask score maps generated by the shared segmentation branch. This strategy helps preserve mask continuity and improves the boundary adherence and localization precision of the crack masks.
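A rough parameter count shows where the saving from weight sharing comes from. The sketch below considers only the 3 × 3 refinement stage, with one convolution and one batch-normalization layer per scale, and compares independent per-scale convolutions against a single shared convolution with separate normalization. The channel width of 64 and the helper names are assumptions for illustration, not values from the paper.

```python
def conv_params(c_in, c_out, k):
    """Weights of a k x k convolution without bias."""
    return c_in * c_out * k * k


def bn_params(channels):
    """Learnable scale and shift of one batch-normalization layer."""
    return 2 * channels


def independent_head_params(channels, num_scales=3):
    """Each scale (P3, P4, P5) owns its own 3 x 3 convolution and BN layer."""
    return num_scales * (conv_params(channels, channels, 3) + bn_params(channels))


def shared_head_params(channels, num_scales=3):
    """One 3 x 3 convolution shared across scales, separate BN per scale."""
    return conv_params(channels, channels, 3) + num_scales * bn_params(channels)


channels = 64  # hypothetical channel width after 1 x 1 alignment
independent = independent_head_params(channels)  # 110976
shared = shared_head_params(channels)            # 37248
```

With these assumed sizes the shared design keeps about one third of the refinement-stage parameters, while the per-scale normalization layers, which are cheap, still absorb the differences in feature statistics between P3, P4, and P5.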

2.3. 3D Crack Projection-Based Localization Method

To achieve accurate mapping of 2D crack segmentation results onto the 3D bridge surface, this study adopts a camera-geometry-constrained localization framework based on pixel back-projection and ray–triangle mesh intersection. Specifically, the intrinsic matrix $K$ is obtained through camera calibration. The SfM reconstruction then provides the camera extrinsic parameters for each keyframe, i.e., the camera pose $(R_{cw}, t_{cw})$, together with the triangular mesh $M$ generated from dense reconstruction and Poisson surface reconstruction. To control the number of rays, the set of crack pixels in the keyframe crack mask is uniformly sampled with a fixed stride $s_{p}$, which is set to $s_{p} = 4$ in this paper. For any sampled pixel $(u, v)$, its viewing direction is back-projected into a unit ray direction in the camera coordinate system using the intrinsic matrix. The ray is then transformed into the world coordinate system using the extrinsic parameters and intersected with the mesh to obtain the spatial coordinates of the crack on the triangular surface. The intrinsic matrix and normalization are given as follows:

$$K = \begin{bmatrix} f_{x} & 0 & c_{x} \\ 0 & f_{y} & c_{y} \\ 0 & 0 & 1 \end{bmatrix} \tag{4}$$

$$\hat{d_{c}} = \frac{K^{-1} [u, v, 1]^{T}}{\| K^{-1} [u, v, 1]^{T} \|} \tag{5}$$

The extrinsic parameters output by COLMAP adopt the “world-to-camera” transformation. Therefore, the camera optical center and the rotation are transformed as follows:

$$C = -R_{cw}^{T} t_{cw} \tag{6}$$

$$R_{wc} = R_{cw}^{T} \tag{7}$$

Based on this, the unit ray direction in the world coordinate system and the corresponding ray equation are derived as

$$\hat{d_{w}} = R_{wc} \hat{d_{c}} \tag{8}$$

$$r(s) = C + s \hat{d_{w}}, \quad s > 0 \tag{9}$$

Subsequently, the triangular mesh $M$ is used as the geometric constraint carrier. For each ray, a ray–triangle intersection test is performed, and the depth parameter $s^{*}$ corresponding to the nearest visible intersection point is selected. The resulting 3D crack point is then given by:

$$P = C + s^{*} \hat{d_{w}} \tag{10}$$

By traversing all keyframes and their corresponding crack masks, a 3D point set of cracks can be obtained. Because each 3D coordinate is computed from the nearest intersection between a back-projected ray and the triangular mesh, the localized points satisfy P ∈ M, which geometrically guarantees that crack points lie on the reconstructed surface rather than at arbitrary locations in space.
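The chain from Equation (5) to Equation (10) can be traced in a few lines of code. The following is a minimal pure-Python sketch under stated assumptions: $K$ has zero skew as in Equation (4), the pose is world-to-camera, and the intersection test uses the Möller–Trumbore algorithm as a stand-in for the mesh raycasting utility used in the experiments. All function names and the toy mesh are hypothetical.

```python
import math


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def transpose_times(R, v):
    """Compute R^T v for a 3 x 3 matrix R given as nested row tuples."""
    return tuple(sum(R[r][c] * v[r] for r in range(3)) for c in range(3))


def pixel_ray(u, v, fx, fy, cx, cy, R_cw, t_cw):
    """Return the camera center C and unit world-frame direction for pixel (u, v)."""
    # Eq. (5): K^{-1} [u, v, 1]^T for a zero-skew K, then normalize
    d = ((u - cx) / fx, (v - cy) / fy, 1.0)
    norm = math.sqrt(dot(d, d))
    d_c = (d[0] / norm, d[1] / norm, d[2] / norm)
    # Eq. (6): C = -R_cw^T t_cw
    rt = transpose_times(R_cw, t_cw)
    C = (-rt[0], -rt[1], -rt[2])
    # Eqs. (7)-(8): d_w = R_cw^T d_c
    return C, transpose_times(R_cw, d_c)


def ray_triangle(C, d, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore test; returns the ray parameter s > 0, or None on a miss."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(d, e2)
    det = dot(e1, p)
    if abs(det) < eps:  # ray is parallel to the triangle plane
        return None
    inv = 1.0 / det
    tvec = sub(C, v0)
    a = dot(tvec, p) * inv
    if a < 0.0 or a > 1.0:
        return None
    q = cross(tvec, e1)
    b = dot(d, q) * inv
    if b < 0.0 or a + b > 1.0:
        return None
    s = dot(e2, q) * inv
    return s if s > eps else None


def localize(C, d, triangles):
    """Eq. (10): nearest intersection P = C + s* d over all triangles, or None."""
    hits = []
    for tri in triangles:
        s = ray_triangle(C, d, *tri)
        if s is not None:
            hits.append(s)
    if not hits:
        return None
    s_star = min(hits)
    return tuple(C[i] + s_star * d[i] for i in range(3))


# Toy example: identity pose and two parallel triangles at depths 5 and 8.
I3 = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
C, d_w = pixel_ray(960.0, 540.0, 1000.0, 1000.0, 960.0, 540.0, I3, (0.0, 0.0, 0.0))
mesh = [((-1.0, -1.0, 8.0), (1.0, -1.0, 8.0), (0.0, 1.0, 8.0)),
        ((-1.0, -1.0, 5.0), (1.0, -1.0, 5.0), (0.0, 1.0, 5.0))]
P = localize(C, d_w, mesh)
```

Taking the minimum over all hits is what enforces visibility: a ray that passes through the nearer surface never deposits a crack point on a surface behind it. A production pipeline would replace the linear scan over triangles with an accelerated raycasting structure.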

3. Experimental Setup and Data Acquisition

This study employed a cross-platform environment to implement crack segmentation, 3D reconstruction, and spatial localization. The training and inference of YOLO-DWL were conducted on a Windows 11 Professional platform, with the deep learning framework implemented in PyTorch 2.8.0 and CUDA version 11.8. To ensure experimental fairness, no pretrained weights were used in either the comparative experiments or the ablation experiments, and all models were trained from random initialization. The training hyperparameters were set as follows: the number of epochs was 500, the batch size was 16, the initial learning rate (lr₀) was 0.1, and the optimizer was SGD. Under these fixed parameter settings, the training and validation loss curves, together with the validation mAP@50 curve, indicate that YOLO-DWL converges reliably during training (see Figure 8). On the same platform, COLMAP was used to perform SfM-MVS reconstruction on the selected keyframe subset to generate the 3D models. The localization stage employed Open3D’s ray–mesh intersection utility to back-project crack mask pixels into spatial rays and calculate their intersections with the mesh, thereby obtaining the 3D point set of the cracks. Keyframe selection and pose estimation were conducted on the Ubuntu 20.04 platform using ORB-SLAM3 to acquire the keyframe sequence and camera trajectories, which were subsequently exported for COLMAP reconstruction. The hardware configuration consisted of an Intel Core i9-14900KF CPU (Intel Corporation, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 4090 GPU with 24 GB VRAM (NVIDIA Corporation, Santa Clara, CA, USA) to satisfy the computational and memory demands of the integrated workflow. All experiments were conducted under consistent conditions to ensure fair performance comparisons across the different settings.

The dataset for this study was collected by capturing videos of cracks on bridge facades using a DJI Mavic 3 UAV (SZ DJI Technology Co., Ltd., Shenzhen, China). The platform is equipped with a monocular camera sensor, capturing video at a resolution of 1920 × 1080 and a frame rate of 30 fps. The UAV was operated manually, with a total video acquisition duration of 2 min and 10 s.

4. Results and Analysis

4.1. YOLO-DWL Segmentation Model

4.1.1. Crack Defect Dataset

This study focuses on inspection tasks from a UAV perspective. Due to safety clearance constraints during flight and complex environmental factors, existing public datasets are often ill-suited for practical inspection scenarios. To address this, 1930 frames were first extracted from the UAV video data. After removing irrelevant images and content-redundant images with high similarity, 620 crack images were finally retained and annotated at the pixel level using the Roboflow [33] online platform. The images were then divided into training and validation sets, with no independent test set used in the current study. Flip and rotation augmentations were applied to the training set. The final dataset used in this study contained 1052 images in total, including 912 training images and 140 validation images. During training, Mosaic data augmentation was further enabled. This technique randomly selects, scales, and stitches together four training images, effectively enriching the training distribution while enhancing the model’s robustness to variations in spatial scale.

4.1.2. Evaluation Metrics

In this study, an evaluation system for the segmentation model is established from three aspects: segmentation accuracy, overall performance, and runtime efficiency. Precision (P) and recall (R) are adopted to assess the basic segmentation performance. Here, TP denotes the number of crack pixels correctly predicted as cracks, FP denotes the number of background pixels incorrectly predicted as cracks, and FN denotes the number of crack pixels incorrectly predicted as background. Thus, the metrics are defined as follows:

P = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}

(11)

R = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}

(12)

In the ablation studies, Average Precision (AP) and mean Average Precision (mAP) are employed as key metrics to evaluate the comprehensive performance and structural integrity of the model.

\mathrm{AP} = \int_{0}^{1} P(R) \, dR

(13)

\mathrm{mAP} = \frac{1}{K} \sum_{i = 1}^{K} \mathrm{AP}_{i}

(14)
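In practice the integral in Equation (13) is evaluated numerically from a finite set of precision–recall points. The sketch below uses all-point interpolation over a monotone precision envelope, which is one common discretization; the exact procedure used by the evaluation toolchain in this study is not stated, so this is an assumption, and the function names are hypothetical.

```python
from typing import Sequence


def average_precision(recalls: Sequence[float], precisions: Sequence[float]) -> float:
    """Area under the precision-recall curve (points sorted by increasing recall)."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall,
    # which makes the curve monotonically non-increasing.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class: Sequence[float]) -> float:
    """Equation (14): arithmetic mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


if __name__ == "__main__":
    ap = average_precision([0.5, 1.0], [1.0, 0.5])
    print(ap)                                # 0.75
    print(mean_average_precision([ap]))      # single class, so mAP equals AP
```

With a single crack class, as in this dataset, K = 1 and mAP reduces to the AP of that class.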

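In the comparative experiments, the mean Intersection over Union (mIoU) is incorporated to quantify the spatial overlap between the predicted segmentation masks and the ground truth labels. In Equation (15), the counts TP, FP, and FN are evaluated separately for each class i, and the average is taken over K + 1 classes, which under the usual convention are the K foreground classes plus the background (i = 0). Additionally, the F1-score is employed to provide a comprehensive evaluation of the segmentation performance, balancing precision and recall.

\mathrm{mIoU} = \frac{1}{K + 1} \sum_{i = 0}^{K} \frac{\mathrm{TP}_{i}}{\mathrm{TP}_{i} + \mathrm{FP}_{i} + \mathrm{FN}_{i}}

(15)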

F_{1} = \frac{2 P R}{P + R}

(16)
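The pixel-level metrics in Equations (11), (12), (15), and (16) can be sketched for the binary case as follows. The masks are plain nested lists of 0/1 values, background is treated as class 0 and crack as class 1 (so K = 1), and the function names are hypothetical.

```python
from typing import Dict, Sequence, Tuple

Mask = Sequence[Sequence[int]]


def confusion(pred: Mask, truth: Mask) -> Tuple[int, int, int, int]:
    """Count TP, FP, FN, TN for the crack class over all pixels."""
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def segmentation_metrics(pred: Mask, truth: Mask) -> Dict[str, float]:
    tp, fp, fn, tn = confusion(pred, truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    iou_crack = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    # For the background class the roles swap: its true positives are TN,
    # its false positives are FN, and its false negatives are FP.
    iou_background = tn / (tn + fn + fp) if tn + fn + fp else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "miou": (iou_crack + iou_background) / 2,
    }


if __name__ == "__main__":
    pred = [[1, 1, 0, 0],
            [0, 1, 0, 0]]
    truth = [[1, 0, 0, 0],
             [0, 1, 1, 0]]
    print(segmentation_metrics(pred, truth))
```

For thin cracks the background IoU is usually close to 1, so mIoU differences between models are driven mainly by the crack-class IoU.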

Furthermore, considering the real-time requirements of UAV inspection tasks, the number of giga floating-point operations (GFLOPs) required for one forward pass is used to measure the computational complexity of the model; a lower value indicates reduced hardware resource consumption and a more lightweight architecture. Note that GFLOPs here is an operation count, not a rate per second. Additionally, frames per second (FPS) is employed as the primary metric to evaluate the inference speed and real-time performance of the model.
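As a rough illustration of how such an operation count arises, the sketch below counts the operations of a single bias-free convolution layer and converts a throughput in FPS into per-frame latency. It assumes the convention that one multiply-add equals two floating-point operations (some profilers report multiply-adds directly), and the layer dimensions are hypothetical rather than taken from YOLO-DWL.

```python
def conv2d_flops(c_in: int, c_out: int, kernel: int, h_out: int, w_out: int) -> int:
    """Operations of one bias-free k x k convolution, counting a multiply-add as 2."""
    macs = c_in * kernel * kernel * c_out * h_out * w_out
    return 2 * macs


def latency_ms(fps: float) -> float:
    """Average per-frame latency in milliseconds implied by a throughput in FPS."""
    return 1000.0 / fps


if __name__ == "__main__":
    # Hypothetical stem layer: 3 -> 16 channels, 3 x 3 kernel, 320 x 320 output.
    flops = conv2d_flops(c_in=3, c_out=16, kernel=3, h_out=320, w_out=320)
    print(f"{flops / 1e9:.4f} GFLOPs")
    print(f"{latency_ms(185):.2f} ms per frame at 185 FPS")
```

Summing such per-layer counts over the whole network gives the model-level GFLOPs figure; the count depends on the input resolution, so values are comparable only at the same input size.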

4.1.3. Comparative Experiments

To verify the effectiveness of the YOLO-DWL model, a variety of representative segmentation models were selected for comparison, including the two-stage model Mask R-CNN and the one-stage models YOLOv5-seg, YOLOv8-seg, and YOLOv11-seg. Given that crack targets typically exhibit slender geometric shapes and pronounced multi-scale characteristics, U-Net, DeepLabv3+, and RHACrackNet were further introduced as supplementary benchmarks to strengthen the comparison of thin-structure segmentation capability. BiSeNetV2 and DDRNet were also included as representative recent lightweight real-time segmentation networks. The comparative results of all models are summarized in Table 1. All model outputs were converted into pixel-level binary masks to ensure consistent evaluation, and all models were trained and evaluated under unified experimental settings and identical hardware and software conditions.

As shown in Table 1, YOLO-DWL achieves the best trade-off between segmentation accuracy and inference speed among the compared methods, with the highest Precision (0.880), mIoU (0.762), and F1-score (0.865). Regarding real-time performance, the proposed method reaches an inference speed of 185 FPS while requiring fewer GFLOPs than YOLOv11-seg (9.8 versus 10.2). This suggests that YOLO-DWL does not rely on parameter stacking to obtain higher accuracy; instead, it improves the segmentation quality of crack regions through structural optimization, reducing computational redundancy while improving accuracy.

While the YOLO series has demonstrated progressive improvements in crack segmentation across successive iterations, challenges such as fragmented detection and missed extractions persist when dealing with low-contrast or fine-scale cracks (see Figure 9). This is primarily due to the sparse pixel representation of micro-cracks and significant interference from complex backgrounds. In contrast, YOLO-DWL effectively preserves fine branches and reduces the number of breakpoints; numerically, this is reflected by a 0.100 improvement in mIoU compared with YOLOv11-seg (0.762 versus 0.662), indicating enhanced geometric fidelity. Although Mask R-CNN and DeepLabv3+ exhibit robust multi-scale fusion capabilities, with good continuity and noise suppression in the visualizations, their high inference costs (25 and 43 FPS, respectively) make them poorly suited to real-time UAV inspection tasks; these models are better suited to offline precision analysis where computational resources are abundant. The traditional U-Net maintains a relatively high recall but, as shown in the visualization, only identifies a few high-contrast primary crack regions; it fails to recognize cracks in weak-textured or micro-scale areas and struggles to extract effective semantic features. RHACrackNet achieves the highest recall among all compared methods, indicating strong sensitivity to crack responses; however, its relatively low precision suggests that false detections remain under complex background conditions. BiSeNetV2 exhibits relatively balanced performance as a lightweight segmentation network, but still shows inferior overall accuracy compared with YOLO-DWL. DDRNet demonstrates the highest inference speed, highlighting its advantage in real-time processing, whereas its limited precision and mIoU indicate insufficient capability for fine-scale crack segmentation.

The comparative visualization results (Figure 9) show that YOLO-DWL achieves higher overall segmentation precision and noticeably fewer false positives than the existing methods when detecting facade cracks.

4.1.4. Module Ablation Study

Using YOLOv11n-seg as the baseline model, the results of the ablation studies are summarized in Table 2. Models A, B, and C are the variants obtained by independently integrating the DWRSeg, WIoU, and LSCSBD modules into the baseline architecture, respectively; Models D, E, and F combine two modules each, and Model G integrates all three.

As indicated in Table 2, integrating the DWRSeg module into the baseline model (Model A) increased mAP@50 from 0.802 to 0.844 and mAP@50–95 from 0.681 to 0.715 (3.4 percentage points). This suggests that the module enhances feature extraction and improves segmentation precision. Introducing the WIoU loss function (Model B) improved R from 0.777 to 0.850 and mAP@50 from 0.802 to 0.851, although P decreased from 0.817 to 0.784; this is consistent with more effective bounding box regression. For Model C, the LSCSBD module reduced the computational load from 10.2 to 9.8 GFLOPs while improving P, R, and mAP@50 over the baseline, thereby achieving a more lightweight architecture. Among the two-module variants, Models E and F outperform the baseline on most metrics, whereas Model D (DWRSeg with WIoU) remains close to the baseline in mAP@50 (0.803) and is lower in P, R, and mAP@50–95. Notably, Model F, which combines the DWRSeg module with the LSCSBD detection head, achieves a good balance between precision and computational efficiency, reaching an mAP@50 of 0.871 at 9.8 GFLOPs. When all three enhancement modules are integrated (Model G), the model achieves the highest mAP@50 of all variants, 0.874, which is 7.2 percentage points above the baseline; Precision (P) increases by 6.3 percentage points (0.817 to 0.880), while GFLOPs simultaneously drop from 10.2 to 9.8.
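The gains quoted for Model G can be read either as absolute differences (percentage points) or as relative changes. The short sketch below computes both from the Table 2 values; the helper name `gains` is hypothetical.

```python
from typing import Tuple


def gains(baseline: float, value: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_pp = (value - baseline) * 100.0
    relative_pct = (value - baseline) / baseline * 100.0
    return absolute_pp, relative_pct


if __name__ == "__main__":
    # Baseline YOLOv11n-seg versus Model G, values taken from Table 2.
    for name, base, model_g in [("mAP@50", 0.802, 0.874), ("P", 0.817, 0.880)]:
        pp, rel = gains(base, model_g)
        print(f"{name}: +{pp:.1f} percentage points, +{rel:.1f}% relative")
```

The 7.2 and 6.3 figures correspond to the absolute differences; the relative improvements are somewhat larger because the baseline values are below 1.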

The results of the ablation studies demonstrate that DWRSeg significantly enhances detection precision, while WIoU optimizes the regression performance. Simultaneously, LSCSBD successfully reduces computational overhead while boosting overall performance. The integrated Model G, which combines these three components, achieves a substantial improvement in segmentation accuracy while ensuring real-time operation. These findings validate the effectiveness and synergy of the proposed enhancement strategies in this study.

4.1.5. Ablation Study of the DWRSeg Stage-Wise Network

To evaluate the effectiveness of C3k2_DWR, an improved module based on the DWRSeg segmentation network, within the backbone, eleven sets of ablation experiments were designed. Building upon the integration of the LSCSBD detection head and the WIoU loss function, various combinations were selected to partially or fully substitute the four consecutive C3k2 modules in the backbone. The experimental results are shown in Figure 10, where each label on the x-axis lists the backbone positions at which C3k2 is replaced with C3k2_DWR (for example, 1-2-4 denotes replacement at positions 1, 2, and 4). Notably, the 1-2-4 combination exhibits a significant performance advantage without increasing the computational load.

This indicates that replacing the early-stage modules helps preserve crack edges and local texture details, while replacing the deeper module enhances contextual aggregation and structural continuity. In contrast, more concentrated or full-stage replacement introduces redundant multi-scale feature fusion, thereby weakening the overall gain.

4.2. Impact of Keyframe Filtering on 3D Reconstruction Efficiency

The data for this study consist of concrete surface video sequences captured by a DJI Mavic 3E (monocular), with facade concrete cracks selected as the experimental scenario. To quantify the impact of keyframe filtering on 3D modeling efficiency, three comparative experimental groups were established: a full-sequence reconstruction group, a keyframe-based reconstruction group, and a uniform sampling control group. In the full-sequence group, 1930 images were extracted from the original video using FFmpeg based on standard rules. For the keyframe group, ORB-SLAM3 was executed on the video sequence to extract 185 keyframes. To keep the input scale consistent with the keyframe group, the uniform sampling group extracted approximately 185 images at equal intervals from the same video. All groups followed an identical reconstruction pipeline and parameter configuration in COLMAP, with the input image set being the only variable. The results are summarized in Table 3.
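The uniform sampling control can be sketched as selecting evenly spaced frame indices from the extracted sequence. The rounding rule and the function name below are assumptions for illustration; the exact sampling rule used in the experiment is not specified beyond "equal intervals".

```python
from typing import List


def uniform_indices(n_frames: int, n_samples: int) -> List[int]:
    """Pick n_samples frame indices spread evenly over 0 .. n_frames - 1."""
    if n_frames <= 0 or n_samples <= 0:
        raise ValueError("n_frames and n_samples must be positive")
    if n_samples >= n_frames:
        return list(range(n_frames))
    if n_samples == 1:
        return [0]
    step = (n_frames - 1) / (n_samples - 1)   # fractional spacing between picks
    return [round(i * step) for i in range(n_samples)]


if __name__ == "__main__":
    idx = uniform_indices(1930, 185)
    print(len(idx), idx[:5], idx[-1])
```

With 1930 frames and 185 samples the spacing is roughly one frame in every ten, which matches the input scale of the keyframe group while ignoring image content entirely.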

Table 3 and the reconstruction results shown in Figure 11 indicate that different sampling strategies substantially affect the stability, efficiency, and model quality of COLMAP reconstruction. Although conventional full-frame sampling provides the largest input scale, it tends to trigger unstable feature matching and geometric constraint degradation in scenarios with weak textures, repetitive structures, and highly redundant viewpoints. As a result, the image registration rate is only 1.71%, yielding an insufficient number of effective views for reconstruction. Consequently, the dense point cloud contains only 7.52 × 10⁵ points, while the SfM stage requires 3713.157 min, leading to extremely low overall efficiency. The reconstructed model also exhibits structural discontinuities and blurred edges, suggesting that the massive input set does not translate into effective geometric constraints. In contrast, both uniform frame sampling and the ORB-SLAM3 keyframe set reduce the input size to 185 images, achieving a 100% registration rate and a more stable reconstruction process that produces complete and usable dense models. With uniform sampling, the overall structure is continuous and the model is largely complete, yet the top and boundary regions remain relatively sparse. The ORB-SLAM3 keyframe set is markedly superior in efficiency: SfM takes only 5.249 min, dense reconstruction takes 40.876 min, and the total time is 46.125 min, a 59.4% reduction compared with uniform sampling (113.608 min). Meanwhile, the dense point cloud reaches 6.23 × 10⁶ points, which is 1.68× that of uniform sampling. These results indicate that, under the same input scale, the keyframe strategy preserves more informative viewpoints and effectively suppresses the negative impact of redundant frames.
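The efficiency figures above follow directly from the Table 3 entries. The sketch below recomputes them; the dictionary layout and helper name are hypothetical, while the numbers are those reported in Table 3.

```python
from typing import Dict

# Total time (min) and dense point count per sampling strategy, from Table 3.
TABLE3: Dict[str, Dict[str, float]] = {
    "full_frame": {"total_min": 3735.039, "dense_points": 752_385},
    "uniform": {"total_min": 113.608, "dense_points": 3_712_601},
    "keyframes": {"total_min": 46.125, "dense_points": 6_231_036},
}


def reduction_pct(reference: float, value: float) -> float:
    """Percentage by which value is smaller than reference."""
    return (reference - value) / reference * 100.0


if __name__ == "__main__":
    key = TABLE3["keyframes"]
    uni = TABLE3["uniform"]
    full = TABLE3["full_frame"]
    print(f"time vs uniform:    -{reduction_pct(uni['total_min'], key['total_min']):.1f}%")
    print(f"time vs full-frame: -{reduction_pct(full['total_min'], key['total_min']):.1f}%")
    print(f"density vs uniform: {key['dense_points'] / uni['dense_points']:.2f}x")
```

The 59.4% reduction is measured against uniform sampling at the same input size; measured against full-frame extraction, the same total time corresponds to a much larger reduction.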

4.3. Crack Localization Method

To evaluate the localization accuracy of the proposed method, five crack endpoints or vertices were selected from the study area. Their coordinates in the CGCS2000 coordinate system were measured with a total station and used as the ground truth. The 3D model was subsequently transformed into the same coordinate system to obtain the algorithm-projected coordinates. Finally, the 3D Euclidean distance between each projected point and its ground truth point was calculated within this unified coordinate system and used as the localization accuracy metric.
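The error metric described above is the ordinary 3D Euclidean distance. As a worked example, the sketch below evaluates it for Point 1 using the coordinates listed in Table 4; the function name is hypothetical. Because the tabulated coordinates are rounded to millimeters, recomputed values may differ slightly from the errors printed in the table.

```python
import math
from typing import Tuple

Point = Tuple[float, float, float]  # (E, N, Z) in meters


def euclid_error_mm(measured: Point, projected: Point) -> float:
    """3D Euclidean distance between two points, converted from meters to millimeters."""
    return math.dist(measured, projected) * 1000.0


if __name__ == "__main__":
    # Point 1 from Table 4: total station coordinates versus projected coordinates.
    measured = (627355.220, 3257270.031, 381.761)
    projected = (627355.314, 3257270.054, 381.750)
    print(f"{euclid_error_mm(measured, projected):.1f} mm")
```

Here the offsets are 0.094 m in E, 0.023 m in N, and 0.011 m in Z, so the error is dominated by the easting component.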

As indicated in Table 4, the proposed method localizes the selected points with 3D errors ranging from 54.9 to 213.1 mm (roughly 5 to 21 cm) and an average error of 107.9 mm. The largest deviation is observed at Point 5, which can be attributed to the amplification of small camera pose inaccuracies during projection, as well as discrepancies in the corresponding positions of feature points within the 2D segmentation masks. The errors for the four remaining points fall within the 54.9–117.8 mm range, satisfying the spatial localization requirements for large-scale structural defects. The visualization results shown in Figure 12 demonstrate that the crack points lie on the 3D surface, adhering to the geometric constraint P ∈ M. This avoids common issues such as point cloud drifting or surface penetration (clipping).

5. Conclusions

This study proposes an integrated framework for crack segmentation and 3D localization in UAV-based monocular bridge inspection, aiming to address the difficulties of recognizing slender and faint crack targets, missed detections under low-contrast conditions, and excessive computational redundancy. The main conclusions are as follows:

(1): An improved crack segmentation model was developed based on YOLOv11-seg. By integrating the DWRSeg, WIoU, and LSCSBD modules, the proposed model enhanced the feature representation of slender cracks while reducing unnecessary computational cost. Experimental results verified the effectiveness of the proposed segmentation strategy. The final model (Model G) achieved an mAP@50 of 0.874 (7.2 percentage points above the baseline), improved Precision by 6.3 percentage points, and reduced GFLOPs from 10.2 to 9.8, indicating a favorable balance between detection accuracy and computational efficiency.
(2): An ORB-SLAM3-based keyframe filtering strategy effectively improved the reconstruction efficiency. Compared with full-frame extraction, the proposed strategy increased the image registration rate from 1.71% to 100%, thereby alleviating the degradation of geometric constraints caused by redundant inputs. Compared with uniform frame sampling at the same input size, it reduced the SfM processing time from 22.721 to 5.249 min and shortened the total execution time by 59.4%.
(3): A reliable 2D-to-3D crack localization method was established, and the overall framework was validated in practical inspection scenarios. By employing a pixel back-projection and ray–triangular mesh intersection strategy, crack pixels segmented in 2D images were accurately projected onto the reconstructed 3D surface. The average localization error of crack feature points was 107.9 mm. Visualization results showed that the projected points were tightly attached to the model surface without obvious drifting or penetration artifacts.

To further improve the applicability of the proposed framework in real-world inspection, future research will mainly focus on the following aspects. First, based on the established 3D localization results, automated quantitative measurement of crack geometric attributes, such as length, width, and spatial distribution, will be developed to enhance the engineering value of the method. Second, to better balance accuracy and efficiency, lightweight optimization strategies will be explored, including channel compression, lightweight feature extraction, and multi-scale/multi-frequency feature modeling, to reduce computational redundancy while preserving the representation ability for slender cracks. Ultimately, the goal is to better support UAV deployment and achieve synchronized crack perception and 3D scene reconstruction during the inspection process.

Author Contributions

Conceptualization, F.T.; Methodology, F.T. and W.G.; Software, W.G.; Validation, W.G. and Y.Q.; Formal analysis, T.Z. and Y.Q.; Investigation, T.Z.; Data curation, W.G., J.L., Y.Z. and Q.S.; Writing—original draft preparation, W.G.; Writing—review and editing, W.G. and F.T.; Visualization, W.G.; Project administration, F.T. and Y.Z.; Funding acquisition, F.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Special Key Projects for Chongqing Technological Innovation and Application Development, grant number CSTB2022TIAD-KPX0098; the Chongqing Jiaotong University Joint Postgraduate Training Base, grant number XJLHPYJD202511; and the 2025 Chongqing Municipal Research Institution Performance Incentive Guidance Special Program (Research and Development Application), grant number CSTB2025JXJL-YFX0008.

Data Availability Statement

The data presented in this study are available on reasonable request from the corresponding author. The data are not publicly available because the dataset is part of ongoing research and has restricted access.

Acknowledgments

The authors would like to thank the Smart City Institute, Chongqing Jiaotong University, and the Chongqing Key Laboratory of Spatio-temporal Information of Mountain City for providing equipment and experimental support.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Chaiyasarn, K.; Buatik, A.; Mohamad, H.; Zhou, M.; Kongsilp, S.; Poovarodom, N. Integrated pixel-level CNN-FCN crack detection via photogrammetric 3D texture mapping of concrete structures. Autom. Constr. 2022, 140, 104388.
2. Kong, S.-Y.; Fan, J.-S.; Liu, Y.-F.; Wei, X.-C.; Ma, X.-W. Automated crack assessment and quantitative growth monitoring. Comput.-Aided Civ. Infrastruct. Eng. 2021, 36, 656–674.
3. Graybeal, B.A.; Phares, B.M.; Rolander, D.D.; Moore, M.; Washer, G. Visual inspection of highway bridges. J. Nondestruct. Eval. 2002, 21, 67–83.
4. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI); Springer: Cham, Switzerland, 2015; pp. 234–241.
5. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
6. Yu, C.; Gao, C.; Wang, J.; Yu, G.; Shen, C.; Sang, N. BiSeNet V2: Bilateral network with guided aggregation for real-time semantic segmentation. Int. J. Comput. Vis. 2021, 129, 3051–3068.
7. Pan, H.; Hong, Y.; Sun, W.; Jia, Y. Deep dual-resolution networks for real-time and accurate semantic segmentation of traffic scenes. IEEE Trans. Intell. Transp. Syst. 2023, 24, 3448–3460.
8. Zou, K.; Zhang, Z.; Sun, W.; Fu, J. Improved U-shaped network-based image segmentation algorithm for pavement defects. J. Electron. Meas. Instrum. 2024, 38, 15–25.
9. Zhu, G.; Liu, J.; Fan, Z.; Yuan, D.; Ma, P.; Wang, M.; Sheng, W.; Wang, K.C.P. A lightweight encoder–decoder network for automatic pavement crack detection. Comput.-Aided Civ. Infrastruct. Eng. 2024, 39, 1743–1765.
10. Li, P.; Wang, M.; Fan, Z.; Huang, H.; Zhu, G.; Zhuang, J. OUR-Net: A multi-frequency network with octave max unpooling and octave convolution residual block for pavement crack segmentation. IEEE Trans. Intell. Transp. Syst. 2024, 25, 13833–13848.
11. Zhu, G.; Shen, S.-L.; Yao, J.; Wang, M.; Zhuang, J.; Fan, Z. Automatic lightweight networks for real-time road crack detection with DPSO. Adv. Eng. Inform. 2025, 68, 103610.
12. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99.
13. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969.
14. Li, R.; Yu, J.; Li, F.; Yang, R.; Wang, Y.; Peng, Z. Automatic bridge crack detection using unmanned aerial vehicle and Faster R-CNN. Constr. Build. Mater. 2023, 362, 129659.
15. Huang, C.; Zhou, Y.; Xie, X. Intelligent diagnosis of concrete defects based on improved Mask R-CNN. Appl. Sci. 2024, 14, 4148.
16. Liu, Y. DeepLabV3+ Based Mask R-CNN for Crack Detection and Segmentation in Concrete Structures. Int. J. Adv. Comput. Sci. Appl. 2025, 16, 423–431.
17. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
18. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision—ECCV 2016; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2016; Volume 9905, pp. 21–37.
19. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
20. Dong, X.; Yuan, J.; Dai, J. Study on Lightweight Bridge Crack Detection Algorithm Based on YOLO11. Sensors 2025, 25, 3276.
21. Yu, G.; Zhou, X. An Improved YOLOv5 Crack Detection Method Combined with a Bottleneck Transformer. Mathematics 2023, 11, 2377.
22. Xu, W.; Li, H.; Li, G.; Ji, Y.; Xu, J.; Zang, Z. Improved YOLOv8n-based bridge crack detection algorithm under complex background conditions. Sci. Rep. 2025, 15, 13074.
23. Jiang, S.; Zhang, J. Real-time crack assessment using deep neural networks with wall-climbing unmanned aerial system. Comput.-Aided Civ. Infrastruct. Eng. 2020, 35, 549–564.
24. Won, J.; Park, J.-W.; Shim, C.; Park, M.-W. Bridge-surface panoramic-image generation for automated bridge-inspection using deepmatching. Struct. Health Monit. 2021, 20, 1689–1703.
25. Liu, Y.-F.; Nie, X.; Fan, J.-S.; Liu, X.-G. Image-based crack assessment of bridge piers using unmanned aerial vehicles and three-dimensional scene reconstruction. Comput.-Aided Civ. Infrastruct. Eng. 2020, 35, 511–529.
26. Deng, L.; Sun, T.; Yang, L.; Cao, R. Binocular video-based 3D reconstruction and length quantification of cracks in concrete structures. Autom. Constr. 2023, 148, 104743.
27. Charron, N.; McLaughlin, J.; Narasimhan, S. SLAM-centric visual inspection of civil infrastructure. Autom. Constr. 2026, 181, 106682.
28. McLaughlin, J.; Charron, N.; Narasimhan, S. Visual-Lidar Map Alignment for Infrastructure Inspections. arXiv 2025, arXiv:2501.14486.
29. Zhang, Z. A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1330–1334.
30. FFmpeg Developers. FFmpeg. Available online: https://ffmpeg.org/ (accessed on 16 March 2026).
31. Liao, W.; Xu, C.; Liu, H.; Li, X. Research on real-time semantic segmentation of road scene based on multi-branch network. Appl. Res. Comput. 2023, 40, 2526–2530.
32. Wei, H.; Liu, X.; Xu, S.; Dai, Z.; Dai, Y.; Xu, X. DWRSeg: Rethinking Efficient Acquisition of Multi-scale Contextual Information for Real-time Semantic Segmentation. arXiv 2022, arXiv:2212.01173.
33. Roboflow. Available online: https://roboflow.com/ (accessed on 16 March 2026).

Figure 1. Methodological framework.

Figure 2. Schematic diagram of the ORB-SLAM3 keyframe selection-driven SfM–MVS 3D reconstruction workflow.

Figure 3. Example calibration images used for camera calibration.

Figure 4. ORB-SLAM3 sparse mapping and camera trajectory.

Figure 5. Network architecture of YOLO-DWL.

Figure 6. Network architecture of DWRSeg.

Figure 7. Network architecture of LSCSBD.

Figure 8. Training and validation segmentation loss and validation mAP@50.

Figure 9. Comparison of visualization results from different models.

Figure 10. Ablation results of the DWRSeg network.

Figure 11. (a) Conventional frame sampling; (b) uniform frame sampling; (c) ORB-SLAM3 keyframes.

Figure 12. Localization visualization results.

Table 1. Model comparison results.

Model	Precision	Recall	mIoU	F1	GFLOPs	FPS
Mask R-CNN	0.862	0.811	0.718	0.836	62.3	25
YOLOv5-seg	0.734	0.706	0.562	0.720	6.7	153
YOLOv8-seg	0.834	0.752	0.654	0.791	12.0	183
YOLOv11-seg	0.817	0.777	0.662	0.797	10.2	174
U-Net	0.623	0.863	0.567	0.724	48	46
DeepLabv3+	0.865	0.811	0.721	0.837	55	43
RHACrackNet	0.641	0.936	0.614	0.761	4.6	75
BiSeNetV2	0.821	0.808	0.687	0.814	8.9	109
DDRNet	0.695	0.745	0.561	0.719	9.0	291
YOLO-DWL	0.880	0.850	0.762	0.865	9.8	185

Table 2. Comparison of results from different models.

Model	DWRSeg	WIoU	LSCSBD	P	R	mAP@50	mAP@50-95	GFLOPs
YOLOv11n-seg				0.817	0.777	0.802	0.681	10.2
A	√			0.871	0.848	0.844	0.715	10.3
B		√		0.784	0.850	0.851	0.680	10.2
C			√	0.823	0.826	0.840	0.674	9.8
D	√	√		0.811	0.749	0.803	0.668	10.3
E		√	√	0.915	0.769	0.841	0.694	9.8
F	√		√	0.878	0.832	0.871	0.691	9.8
G	√	√	√	0.880	0.850	0.874	0.703	9.8

Table 3. Efficiency of different frame sampling strategies.

Sampling Strategy	Image Count	Image Registration Rate	Dense Point Count	SfM Time (min)	Dense Reconstruction Time (min)	Total Time (min)
Full-Frame Extraction	1930	1.71%	752,385	3713.157	21.882	3735.039
Uniform Sampling	185	100%	3,712,601	22.721	90.887	113.608
ORB-SLAM3 Keyframes	185	100%	6,231,036	5.249	40.876	46.125

Table 4. Comparison of localization accuracy.

ID	Total Station Measured Coordinates			Algorithm-Projected Coordinates			Error (mm)
ID	E/m	N/m	Z/m	Ep/m	Np/m	Zp/m	Error (mm)
1	627,355.220	3,257,270.031	381.761	627,355.314	3,257,270.054	381.750	97.4
2	627,356.450	3,257,270.820	381.720	627,356.485	3,257,270.852	381.705	54.9
3	627,359.249	3,257,272.442	381.642	627,359.354	3,257,272.490	381.666	117.8
4	627,357.180	3,257,271.310	381.685	627,357.212	3,257,271.345	381.670	56.5
5	627,358.129	3,257,271.766	381.765	627,357.957	3,257,271.654	381.702	213.1
Mean error	107.9

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Tang, F.; Gongzhabayier, W.; Li, J.; Zhou, T.; Qiu, Y.; Zhan, Y.; Song, Q. Keyframe-Guided Crack Segmentation and 3D Localization for UAV-Based Monocular Inspection. Symmetry 2026, 18, 657. https://doi.org/10.3390/sym18040657

