1. Introduction
Concrete bridges undergo significant degradation under the prolonged influence of complex loading scenarios and environmental impacts. To ensure the structural integrity, serviceability, and durability of these assets, bridge inspection has emerged as a focal point in both academic discourse and engineering applications. Among various structural defects, surface-visible cracks represent a critical diagnostic indicator and a primary factor compromising long-term durability [1,2]. Such cracking can trigger the spalling of concrete cover, thereby exposing internal reinforcement to an elevated risk of corrosion. Furthermore, rapidly propagating cracks often serve as precursors to catastrophic structural failure. Consequently, the precise detection and characterization of bridge cracks constitute a pivotal component in rigorous safety assessment and durability-oriented maintenance strategies.
Traditional bridge crack inspection predominantly relies on manual visual assessment, often supported by specialized bridge inspection vehicles. However, such practices are typically characterized by low efficiency, substantial labor demands, and non-negligible safety risks. In addition, inspection operations are highly constrained by traffic conditions, hindering rapid and routine deployment at scale. In recent years, Unmanned Aerial Vehicles (UAVs) have emerged as a promising alternative, owing to their high maneuverability, flexible deployment, and ability to acquire close-range, high-resolution visual data. Despite these advantages, UAV-based inspections generate large volumes of imagery and video, making manual interpretation time-consuming and susceptible to subjectivity. Notably, Graybeal et al. [3] reported substantial inter-inspector variability in crack identification for the same bridge structure, underscoring the limitations of purely manual assessments. To improve both objectivity and efficiency, deep-learning-driven computer vision techniques have increasingly been incorporated into automated crack detection and segmentation pipelines.
Deep-learning-based crack recognition methods can generally be divided into two categories: semantic segmentation methods and detection-framework-based methods. Semantic segmentation methods (e.g., U-Net [4] and DeepLab [5]) extract crack regions through pixel-wise classification and therefore exhibit certain advantages in boundary recovery and fine-detail representation. In recent years, efficient semantic segmentation networks for real-time vision tasks have also continued to develop. For example, BiSeNetV2, proposed by Yu et al. [6], achieves a balance between speed and accuracy through the collaborative modeling of a detail branch and a semantic branch, while DDRNet, proposed by Pan et al. [7], improves real-time segmentation performance through dual-resolution parallel processing and repeated feature fusion. In addition, researchers have carried out various improvements for crack segmentation tasks. For instance, Zou et al. [8] enhanced the segmentation capability for pavement defect images by improving a U-shaped network. RHACrackNet, proposed by Zhu et al. [9], strengthens the representation of crack details and boundary information by introducing hybrid attention and residual structures; OUR-Net, proposed by Li et al. [10], improves the extraction of crack textures at different scales through multi-frequency feature modeling; and AutoCrackNet, proposed by Zhu et al. [11], improves segmentation efficiency and deployment potential through an automated lightweight network design. However, these methods usually require pixel-wise prediction over the entire image. For bridges, especially bridge piers, complex background interference is common, which makes whole-image pixel-level segmentation more prone to false detections. By contrast, detection-framework-based methods are more advantageous in focusing on target regions.
Detection-framework-based methods can be further divided into two categories: two-stage detectors based on region proposals and one-stage detectors based on direct regression. Two-stage methods (e.g., Faster R-CNN [12] and Mask R-CNN [13]) typically generate candidate regions first and then perform classification and bounding-box regression, and therefore often achieve relatively high detection accuracy in complex backgrounds and small-target scenarios. Li et al. [14] combined close-range UAV imaging with Faster R-CNN for automatic bridge crack detection, while Huang et al. [15] and Liu et al. [16] improved feature representations within the Mask R-CNN framework to enhance crack detection and segmentation performance. However, two-stage approaches involve relatively long inference pipelines with higher computational cost and latency, which limits their applicability in UAV inspections where resources are constrained or real-time performance is required. In contrast, one-stage detectors (e.g., the You Only Look Once (YOLO) family [17], Single Shot MultiBox Detector (SSD) [18], and RetinaNet [19]) feature simpler architectures and faster inference, making them more suitable for real-time inspection tasks. Among them, the YOLO series provides a favorable trade-off between accuracy and speed. Nevertheless, cracks are typically characterized by elongated shapes, small scales, and low contrast; consequently, even efficient one-stage frameworks such as YOLO remain prone to missed detections and insufficient sensitivity. To address these challenges, existing studies have improved YOLO from multiple perspectives. For instance, Dong et al. [20] introduced attention modules into YOLO11n to improve detection accuracy and incorporated a lightweight detection head to enhance efficiency; however, inference latency and limited sensitivity to fine cracks in complex structures remain issues. Yu et al. [21] integrated Transformer components into YOLOv5 to strengthen the detection of slender cracks, yet global attention mechanisms can be strongly affected by background noise. Xu et al. [22] enhanced feature fusion in YOLOv8n and optimized training with intersection over union (IoU)-based losses to improve robustness; nevertheless, detection accuracy remains low when crack–concrete contrast is weak, and crack predictions may appear fragmented. Therefore, there remains substantial room for improvement in achieving high-precision and continuous detection of subtle, slender cracks in complex scenarios.
However, image-level detection and segmentation results alone lack explicit spatial and geographic attributes and cannot adequately describe the distribution of cracks on a structure, limiting their ability to support bridge condition assessment. With the rapid development of 3D reconstruction techniques, computer-vision downstream tasks have become increasingly integrated with 3D modeling, and the demand in defect inspection has gradually shifted from recognition and segmentation toward spatial localization. Existing defect localization approaches can be broadly classified into two categories.
The first category relies on inter-image feature matching to stitch multi-view images into a panorama, thereby enabling defect localization. For example, Jiang et al. [23] collected structural images using a wall-climbing robot and, under the assumption of approximately parallel image planes, stitched crack detection results into a panoramic image for localization. Won et al. [24] proposed a deep matching-based stitching strategy to improve image correspondence quality, generating panoramic views of bridge piers and achieving crack localization. Such methods typically depend on planar or near-planar assumptions and are therefore difficult to extend to complex 3D scenarios with intricate geometries, such as bridges. The second category reconstructs the scene using 3D reconstruction algorithms and then projects 2D detection/segmentation outputs onto the reconstructed 3D model to localize defects. Liu et al. [25] generated a mesh model from UAV imagery using the structure-from-motion and multi-view stereo (SfM–MVS) pipeline and projected crack results onto the 3D model for localization. Deng et al. [26] produced dense point clouds via SfM–MVS and mapped segmented cracks onto the point cloud. Although these approaches have demonstrated effectiveness for 3D visualization of cracks, conventional SfM typically processes all images, resulting in substantial data redundancy and long reconstruction cycles. In comparison, simultaneous localization and mapping (SLAM)-based frameworks enable real-time modeling and pose estimation. Charron et al. [27] attempted to enhance SLAM stability by fusing LiDAR, cameras, and inertial measurements; however, variations in illumination and platform vibrations can lead to inconsistent map accuracy, undermining the reliability of crack localization. McLaughlin et al. [28] reported that, during repeated inspections, maps acquired in different sessions are difficult to align automatically and often require labor-intensive manual calibration, which can introduce significant accumulated drift and geometric distortion, leaving global accuracy insufficient. Overall, relying solely on real-time SLAM reconstructions makes it challenging to achieve the precision required for practical applications.
To address the above challenges in crack detection, segmentation, and localization, the main contributions of this study are as follows:
(1) A YOLO-DWL model for bridge crack segmentation is proposed. Built upon the YOLOv11 framework, the dilation-wise residual (DWR)-C3k2 module is introduced to enhance multi-scale feature extraction under different receptive fields, thereby improving the representation of slender cracks. In addition, the Wise-IoU (WIoU) loss is incorporated to improve the stability of bounding-box regression and reduce the adverse effects of low-quality samples during training. Meanwhile, the Lightweight Shared Convolution and Separate Batch Normalization Detection Head (LSCSBD) is adopted to enhance the sensitivity to small-scale crack targets while maintaining a lightweight architecture, thereby enabling robust extraction of subtle defects.
(2) An ORB-SLAM3 keyframe-constrained SfM–MVS 3D reconstruction method is developed. By employing ORB-SLAM3 for keyframe selection and pose estimation on UAV video sequences, the adverse effects of redundant images in high-frame-rate videos on feature matching and global optimization are reduced. Combined with SfM–MVS, this method enables the efficient dense reconstruction of bridge surfaces, thereby significantly improving 3D reconstruction efficiency while preserving model completeness and geometric accuracy.
(3) A spatial mapping method from crack detection results to the 3D model surface is established. By using a pixel back-projection and triangle-mesh intersection strategy, the 2D crack segmentation results are projected onto the reconstructed 3D model surface, thereby enabling the intuitive visualization and accurate spatial localization of cracks in 3D space.
2. Methodology for Crack Identification, Segmentation, and Localization
The overall workflow of the proposed methodology is illustrated in Figure 1. First, for high-frame-rate UAV bridge inspection videos, ORB-SLAM3 is employed to achieve real-time camera pose estimation and automatic keyframe selection. This process eliminates a vast number of redundant consecutive frames, retaining only keyframes with effective spatial constraints for subsequent modeling. On this basis, dense 3D reconstruction using SfM-MVS is executed on the selected keyframes to rapidly construct the 3D bridge model, significantly enhancing modeling efficiency while maintaining reconstruction accuracy. Subsequently, the enhanced YOLO-DWL crack segmentation model is utilized to perform high-precision crack detection and pixel-level segmentation on the keyframe images. Finally, based on the camera’s intrinsic and extrinsic parameters and pose constraints, the 2D segmentation masks are back-projected onto the surface of the 3D reconstructed model. This achieves precise spatial localization of the cracks, forming a complete technical pipeline of “near-real-time modeling, precision segmentation, and spatial localization” for bridge cracks.
2.1. ORB-SLAM3 Keyframe-Constrained SfM-MVS 3D Modeling Method
In the 3D reconstruction of bridge structures, UAVs typically capture continuous video sequences at high frame rates. Directly inputting the entire image set into traditional SfM-MVS frameworks for modeling often leads to significant image redundancy, prohibitive computational overhead, and excessively long modeling cycles in engineering applications. Furthermore, the presence of a vast number of images with near-identical viewpoints and highly redundant information within high-frame-rate sequences increases the complexity of feature matching and optimization. This frequently triggers issues such as matching instability, geometric degradation, and the accumulation of mismatches, which compromise the robustness of the reconstruction process and the overall consistency of the model. Consequently, these challenges further constrain the high-precision spatial localization and 3D visualization of defects such as cracks.
To address the aforementioned challenges, this study adopts a hybrid 3D modeling approach based on SfM-MVS with ORB-SLAM3 keyframe constraints, as shown in Figure 2. First, monocular ORB-SLAM3 is utilized for real-time pose estimation and keyframe selection from the high-frame-rate UAV video sequences, retaining only a subset of keyframes that possess effective viewpoint variations and geometric constraints. On this basis, the standard SfM-MVS reconstruction is executed exclusively on this keyframe subset. This strategy significantly reduces the reconstruction scale and computational overhead while ensuring both geometric accuracy and model completeness.
Before ORB-SLAM3-based pose estimation, camera calibration was performed using Zhang’s calibration method [29]. During calibration, a checkerboard was used, and calibration images were captured under different viewing angles and distances, with the checkerboard distributed as much as possible across both the central and peripheral regions of the image plane. Some example calibration images are shown in Figure 3. Through corner detection, subpixel refinement, and nonlinear optimization, the camera intrinsic parameters and the radial–tangential distortion coefficients were estimated. The calibrated parameters were then written into the ORB-SLAM3 YAML configuration file and kept consistent with the image settings used in both ORB-SLAM3 and COLMAP 3.8 (Windows CUDA build). This ensured that distortion correction and reprojection calculations were performed under a unified camera model, thereby reducing systematic projection errors caused by image scaling or inconsistent pixel coordinate settings.
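As a minimal sketch of the radial–tangential (Brown–Conrady) camera model that such a calibration typically estimates, the function below projects a 3D point given in camera coordinates to distorted pixel coordinates. The function name, the parameter names (`fx`, `fy`, `cx`, `cy`, `k1`, `k2`, `p1`, `p2`), and all numeric values are illustrative assumptions; they are not the calibrated values of this study.

```python
import numpy as np

def project_point(X_cam, fx, fy, cx, cy, k1=0.0, k2=0.0, p1=0.0, p2=0.0):
    """Project a camera-frame 3D point to pixels with radial-tangential distortion."""
    X, Y, Z = X_cam
    x, y = X / Z, Y / Z                      # normalized image coordinates
    r2 = x * x + y * y                       # squared radius from the optical axis
    radial = 1.0 + k1 * r2 + k2 * r2 * r2    # radial distortion factor
    # Tangential terms model slight lens/sensor misalignment.
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return np.array([fx * x_d + cx, fy * y_d + cy])

# Hypothetical intrinsics for a 1920 x 1080 image.
print(project_point((0.1, 0.0, 1.0), 1000.0, 1000.0, 960.0, 540.0, k1=-0.1))
```

Using the same model and the same pixel coordinate convention in every stage is what keeps the reprojection calculations consistent; if images are rescaled, the focal lengths and principal point must be rescaled with them.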
During the pose estimation stage, the calibrated camera parameters were loaded into the monocular ORB-SLAM3 pipeline, allowing lens distortion to be explicitly modeled during tracking and reprojection calculations and ensuring consistency between the geometric imaging model and the physical camera. Subsequently, the high-frame-rate video sequences captured by the UAV were fed into ORB-SLAM3 for real-time pose estimation and mapping. In this study, the original UAV video lasted 2 min 10 s at 30 fps. To reduce temporal redundancy, frames were extracted from the video at approximately 15 fps using FFmpeg [30], resulting in 1930 images for subsequent ORB-SLAM3 processing. The extracted images were retained at the original resolution to preserve sufficient feature detail for stable tracking, feature matching, and subsequent SfM-MVS reconstruction, while maintaining consistency with the calibrated camera parameters. During operation, ORB-SLAM3 adaptively inserted keyframes based on the quantity and distribution of feature points, as well as the platform’s motion state. Following the default monocular keyframe insertion strategy of ORB-SLAM3, the decision logic for keyframe insertion can be summarized in Equation (1):
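$$
K_t =
\begin{cases}
1, & \left[ (\Delta t \ge T_{\max}) \lor (\Delta t \ge T_{\min} \land S_{LM} = 1) \right] \land \left( N_t < \tau \, N_{\mathrm{ref}} \right) \\
0, & \text{otherwise}
\end{cases}
\tag{1}
$$

where $K_t$ indicates whether a keyframe is inserted at frame $t$; $\Delta t$ denotes the frame interval since the last keyframe; $T_{\max}$ and $T_{\min}$ are the maximum and minimum keyframe interval thresholds, respectively; $S_{LM}$ is the state indicator of the local mapping thread (1/0 corresponds to idle/busy); $N_t$ is the number of map points successfully tracked in the current frame $t$; $\tau$ is a reference value; and $N_{\mathrm{ref}}$ is the number of available map points.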
Based on this strategy, ORB-SLAM3 automatically filters out a large number of consecutive frames with near-identical viewpoints and high informational redundancy, while retaining frames with effective viewpoint changes and sufficient geometric support. Finally, 185 keyframes were retained without manual screening. The selected keyframes, together with the estimated camera trajectories, were exported in temporal order and directly organized as the image input for subsequent COLMAP reconstruction.
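The keyframe insertion logic described above can be sketched as a small predicate. This is a simplified illustration, not the ORB-SLAM3 source: the function name, the argument names, and the ratio value of 0.9 are assumptions, and the actual implementation applies further checks (for example, a minimum number of tracked points) that are omitted here.

```python
def insert_keyframe(frames_since_last_kf, max_interval, min_interval,
                    local_mapping_idle, tracked_points, ref_points, ratio=0.9):
    """Return True when a new keyframe should be inserted (simplified sketch)."""
    # Temporal condition: too long since the last keyframe, or long enough
    # while the local mapping thread is idle and can accept a new keyframe.
    long_gap = frames_since_last_kf >= max_interval
    idle_gap = frames_since_last_kf >= min_interval and local_mapping_idle
    # Tracking condition: the current frame tracks noticeably fewer map
    # points than the reference, i.e. the viewpoint has changed enough.
    weak_tracking = tracked_points < ratio * ref_points
    return bool((long_gap or idle_gap) and weak_tracking)

# Hypothetical values: 12 frames since the last keyframe, idle mapper,
# 60 of 100 reference points still tracked.
print(insert_keyframe(12, 30, 0, True, 60, 100))
```

Frames with nearly identical viewpoints keep most reference points in view, so the tracking condition fails and they are skipped, which is how the redundant consecutive frames are filtered out.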
In Figure 4, the red and black dots denote reconstructed 3D map points during the ORB-SLAM3 process, the blue camera wireframes distributed along the lower trajectory represent the estimated keyframe poses, the green wireframe indicates the current camera pose, and the red wireframe indicates the initial camera pose.
In COLMAP, sparse reconstruction was first carried out on the retained keyframes through feature extraction, feature matching, triangulation, and bundle adjustment, followed by dense reconstruction via image undistortion, PatchMatch-based depth estimation, depth fusion, and surface generation. This workflow effectively reduces the reconstruction scale and computational burden while maintaining geometric completeness and reconstruction stability, thereby providing a reliable 3D carrier for subsequent crack projection and localization.
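For orientation, the sparse and dense stages listed above correspond roughly to the following sequence of COLMAP command-line calls, assembled here as argument lists. This is a sketch under assumptions: the helper name, the directory layout, and the choice of exhaustive matching are illustrative and are not taken from this study, and no per-command tuning options are shown.

```python
from pathlib import Path

def colmap_commands(workspace, image_dir):
    """Build the COLMAP CLI calls for a sparse + dense reconstruction (sketch)."""
    ws = Path(workspace)
    db, sparse, dense = ws / "database.db", ws / "sparse", ws / "dense"
    fused = dense / "fused.ply"
    img = str(image_dir)
    return [
        # Sparse stage: features, matches, then incremental SfM with bundle adjustment.
        ["colmap", "feature_extractor", "--database_path", str(db), "--image_path", img],
        ["colmap", "exhaustive_matcher", "--database_path", str(db)],
        ["colmap", "mapper", "--database_path", str(db), "--image_path", img,
         "--output_path", str(sparse)],
        # Dense stage: undistortion, PatchMatch depth maps, fusion, Poisson meshing.
        ["colmap", "image_undistorter", "--image_path", img,
         "--input_path", str(sparse / "0"), "--output_path", str(dense),
         "--output_type", "COLMAP"],
        ["colmap", "patch_match_stereo", "--workspace_path", str(dense),
         "--workspace_format", "COLMAP", "--PatchMatchStereo.geom_consistency", "true"],
        ["colmap", "stereo_fusion", "--workspace_path", str(dense),
         "--workspace_format", "COLMAP", "--input_type", "geometric",
         "--output_path", str(fused)],
        ["colmap", "poisson_mesher", "--input_path", str(fused),
         "--output_path", str(dense / "meshed-poisson.ply")],
    ]

for cmd in colmap_commands("workspace", "workspace/keyframes"):
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the `sparse` and `dense` directories exist; only the keyframe images need to be placed in the image directory.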
2.2. Crack Segmentation Model: YOLO-DWL
The slender, low-contrast, and fragmented morphology of bridge cracks presents significant challenges for standard detection and segmentation networks, particularly in boundary localization, continuity preservation of thin lines, and perception of small-scale targets. To enhance the detection rate of continuous cracks and the precision of boundary localization, we propose a joint improvement strategy for slender cracks based on the YOLOv11 segmentation framework (Figure 5). First, specific C3k2 modules in the backbone network are replaced with DWR-C3k2 to strengthen the directional thin-line representation. Second, WIoU is adopted as the bounding box regression loss to improve regression stability under hard-sample conditions. Finally, the LSCSBD lightweight detection head is introduced to enhance small-scale crack perception and feature alignment capabilities while simultaneously reducing computational overhead.
2.2.1. DWRSeg Segmentation Module
In contrast to the common assumption in real-time segmentation frameworks that “larger receptive fields are inherently better” [31], DWRSeg focuses on receptive field efficiency by configuring matched receptive field scales at different semantic stages [32]. Low-level features predominantly comprise information such as edges and textures; an excessively large receptive field often introduces redundant context and weakens the representation of slender targets. Conversely, high-level features possess stronger semantics, requiring larger receptive fields to aggregate structural information. Bridge cracks are characterized by slender, low-contrast, and fragmented morphologies. Consequently, the segmentation process relies both on the sensitivity of low-level features to thin-line edges and continuous textures, which avoids the “smoothing or fracturing” caused by downsampling, and on the cross-regional contextual aggregation capability of high-level features, which enhances background suppression and overall coherence. A uniform receptive field strategy typically leads to insufficient feature extraction efficiency.
As illustrated in Figure 6, the input feature is first mapped by a 3 × 3 convolution and then split into three parallel branches along the channel dimension. Each branch further applies a 3 × 3 depth-wise dilated convolution with dilation rates of 1, 3, and 5, respectively, so as to capture local, mid-range, and global contextual information. The outputs of the three branches are then concatenated along the channel dimension and fused by a 1 × 1 convolution. Finally, the fused feature is combined with the shortcut branch through element-wise addition to generate the final output, thereby enhancing multi-scale representation while preserving the original feature information. In the figure, “C” denotes concatenation, and “+” denotes element-wise addition.
Based on the aforementioned insights, DWRSeg was embedded into the C3k2 modules of the YOLOv11 backbone to build the DWR-C3k2 module. The original external connections of C3k2 were preserved, while the internal feature extraction part was strengthened by DWRSeg. This allows the network to better balance fine crack detail preservation and contextual information extraction. However, the added multi-branch dilated operations increase the complexity of feature fusion and bring additional computational overhead. The influence of different replacement positions was further evaluated in the subsequent experiments, where the configuration using positions 1, 2, and 4 showed the best overall performance.
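The three-branch structure described for Figure 6 can be sketched in plain NumPy to make the data flow concrete. This is an illustrative forward pass only: the function names, the weight shapes, the ReLU after the first convolution, and the equal channel split are assumptions, and normalization layers and the exact channel arrangement of DWRSeg are omitted.

```python
import numpy as np

def conv3x3(x, w):
    """Standard 3x3 convolution, 'same' zero padding. x: (Cin,H,W), w: (Cout,Cin,3,3)."""
    _, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], H, W))
    for i in range(3):
        for j in range(3):
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], xp[:, i:i + H, j:j + W])
    return out

def dw_dilated_conv3x3(x, w, d):
    """Depth-wise 3x3 convolution with dilation d. x: (C,H,W), w: (C,3,3)."""
    _, H, W = x.shape
    xp = np.pad(x, ((0, 0), (d, d), (d, d)))   # padding d keeps the spatial size
    out = np.zeros(x.shape, dtype=float)
    for i in range(3):
        for j in range(3):
            out += w[:, i, j, None, None] * xp[:, i * d:i * d + H, j * d:j * d + W]
    return out

def dwr_block(x, w_in, w_dw, w_fuse, dilations=(1, 3, 5)):
    """3x3 mapping -> channel split -> dilated depth-wise branches -> 1x1 fuse -> shortcut."""
    y = np.maximum(conv3x3(x, w_in), 0.0)                    # activation is an assumption
    groups = np.array_split(y, len(dilations), axis=0)       # split along channels
    kernels = np.array_split(w_dw, len(dilations), axis=0)
    branches = [dw_dilated_conv3x3(g, k, d) for g, k, d in zip(groups, kernels, dilations)]
    cat = np.concatenate(branches, axis=0)                   # "C" in the figure
    fused = np.einsum("oc,chw->ohw", w_fuse, cat)            # 1x1 convolution
    return x + fused                                         # "+" in the figure

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16, 16))
out = dwr_block(x, rng.normal(size=(6, 4, 3, 3)), rng.normal(size=(6, 3, 3)),
                rng.normal(size=(4, 6)))
print(out.shape)
```

A 3 × 3 kernel with dilation d spans 2d + 1 pixels per side, so the three branches cover 3, 7, and 11 pixels without adding parameters, which is why small dilations suit thin crack edges while larger ones gather context.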
2.2.2. Introducing WIoU to Improve Regression Stability and Small-Target Localization Accuracy
Given that bridge cracks are characterized by slender shapes and irregular boundaries, a large number of low-quality predicted boxes often emerge during the initial training phase. This not only exacerbates the imbalance between positive and negative samples but also introduces unstable gradients that hinder convergence of the loss. To address this, the bounding box regression loss is replaced by WIoU, which is jointly optimized with Distribution Focal Loss (DFL). This strategy reduces the contribution of outlier samples at both extremes (very high or very low quality) to the total loss, thereby focusing the training on higher-value samples. The aggregate loss function is formulated as
The core concept of WIoU involves introducing a weighting mechanism based on sample outlierness (dispersion) alongside the standard IoU regression term. This approach prevents the model from being excessively steered by extreme, low-quality samples during training, thereby achieving smoother and more stable bounding box convergence. The formulation is expressed as
Specifically, it actively suppresses the influence of “outlier bounding boxes”—those with significant localization deviations or that potentially represent background noise—on the loss function. Simultaneously, DFL is incorporated as a distribution regression term to refine recognition precision. This integration enhances the detection capabilities for slender, small-scale targets without incurring a significant increase in computational overhead.
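To illustrate the outlierness-based weighting, the sketch below evaluates a WIoU-style box loss in NumPy. It assumes the v3 form from the original Wise-IoU paper, with a distance attention term and a non-monotonic focusing coefficient using alpha = 1.9 and delta = 3; the text above does not state which WIoU variant or hyperparameters were used, so these choices, like the function and argument names, are assumptions. In a training framework the enclosing-box term and the outlierness would be detached from the gradient, which plain NumPy does not model.

```python
import numpy as np

def wiou_v3(pred, target, mean_iou_loss, alpha=1.9, delta=3.0, eps=1e-9):
    """WIoU-style loss for boxes given as (x1, y1, x2, y2); returns one value per box."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    # Plain IoU loss.
    iw = np.clip(np.minimum(pred[:, 2], target[:, 2]) - np.maximum(pred[:, 0], target[:, 0]), 0, None)
    ih = np.clip(np.minimum(pred[:, 3], target[:, 3]) - np.maximum(pred[:, 1], target[:, 1]), 0, None)
    inter = iw * ih
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    l_iou = 1.0 - inter / (area_p + area_t - inter + eps)
    # Distance attention: centre offset relative to the smallest enclosing box.
    wg = np.maximum(pred[:, 2], target[:, 2]) - np.minimum(pred[:, 0], target[:, 0])
    hg = np.maximum(pred[:, 3], target[:, 3]) - np.minimum(pred[:, 1], target[:, 1])
    dx = (pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) / 2.0
    dy = (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) / 2.0
    r_wiou = np.exp((dx ** 2 + dy ** 2) / (wg ** 2 + hg ** 2 + eps))
    # Outlierness: this box's IoU loss relative to the running mean IoU loss.
    beta = l_iou / (mean_iou_loss + eps)
    # Non-monotonic focusing: small gain for very low and very high outlierness.
    focus = beta / (delta * alpha ** (beta - delta))
    return focus * r_wiou * l_iou

pred = np.array([[0.0, 0.0, 10.0, 10.0], [2.0, 2.0, 12.0, 12.0]])
gt = np.array([[0.0, 0.0, 10.0, 10.0], [0.0, 0.0, 10.0, 10.0]])
print(wiou_v3(pred, gt, mean_iou_loss=0.5))
```

The focusing coefficient is what down-weights boxes whose outlierness is far from typical, so neither near-perfect nor badly misplaced boxes dominate the gradient.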
2.2.3. LSCSBD-Based Lightweight Detection Head for Small-Scale Crack Perception
Traditional segmentation heads typically employ convolutional branches with identical structures but independent parameters across the P3, P4, and P5 scales. This often results in the repetitive stacking of convolutions, leading to significant structural redundancy in terms of both parameter count and computational overhead. Furthermore, statistical distribution disparities among multi-scale features can amplify training fluctuations, which is detrimental to the segmentation of small-scale cracks. To achieve a lightweight architecture and stable predictions, this study replaces the default segmentation head of YOLOv11-seg with LSCSBD. This modified head directly receives features from the P3, P4, and P5 scales to perform joint prediction.
As illustrated in Figure 7, the LSCSBD head first performs channel alignment on the multi-scale features (P3–P5) via 1 × 1 convolutions. Subsequently, a cross-scale shared 3 × 3 convolution is employed for unified feature refinement, thereby mitigating parameter redundancy and computational overhead caused by repetitive convolutions in traditional multi-scale heads. Simultaneously, scale-independent normalization (SN) is introduced following the shared convolution to accommodate the statistical distribution disparities across different hierarchical levels, enhancing training stability and the consistency of cross-scale fusion. On this basis, the original features from P3–P5 are directly mapped to the output stage and fused with the mask score maps generated by the shared segmentation branch. This strategy ensures mask continuity, significantly improving the boundary adherence and localization precision of the crack masks.
2.3. 3D Crack Projection-Based Localization Method
To achieve accurate mapping of 2D crack segmentation results onto the 3D bridge surface, this study adopts a camera-geometry-constrained localization framework based on pixel back-projection and ray–triangle mesh intersection. Specifically, the intrinsic matrix $K$ is obtained through camera calibration. The SfM reconstruction then provides the camera extrinsic parameters for each keyframe, i.e., the camera pose $(R, \mathbf{t})$, together with the triangular mesh $M$ generated from dense reconstruction and Poisson surface reconstruction. To control the number of rays, the set of crack pixels in the keyframe crack mask is uniformly sampled with a fixed stride $k$, where $k$ denotes the sampling step and is set to $k = 4$ in this paper. For any sampled pixel $(u, v)$, its viewing direction is back-projected into a unit ray direction $\mathbf{d}_c$ in the camera coordinate system using the intrinsic matrix. The ray is then transformed into the world coordinate system using the extrinsic parameters and intersected with the mesh to obtain the spatial coordinates of the crack on the triangular surface. The intrinsic matrix and normalization are given as follows:

$$
K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix},
\qquad
\mathbf{d}_c = \frac{K^{-1}\,[u,\ v,\ 1]^{\top}}{\left\lVert K^{-1}\,[u,\ v,\ 1]^{\top} \right\rVert}
$$
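The extrinsic parameters output by COLMAP adopt the “world-to-camera” transformation, i.e., $\mathbf{X}_c = R\,\mathbf{X}_w + \mathbf{t}$. Therefore, the camera optical center $\mathbf{C}$ and the camera-to-world rotation $R_{wc}$ are obtained as follows:

$$
\mathbf{C} = -R^{\top}\mathbf{t}, \qquad R_{wc} = R^{\top}
$$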
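Based on this, the unit ray direction in the world coordinate system and the corresponding ray equation are derived as

$$
\mathbf{d}_w = R^{\top}\mathbf{d}_c, \qquad \mathbf{r}(s) = \mathbf{C} + s\,\mathbf{d}_w, \quad s > 0,
$$

where $\mathbf{d}_c$ is the unit ray direction in the camera coordinate system, $R$ is the world-to-camera rotation, and $\mathbf{C}$ is the camera optical center.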
Subsequently, the triangular mesh M is used as the geometric constraint carrier. For each ray, a ray–triangle intersection test is performed, and the depth parameter s* corresponding to the nearest visible intersection point is selected. The resulting 3D crack point is then given by:

$$
\mathbf{P} = \mathbf{C} + s^{*}\,\mathbf{d}_w
$$
By traversing all keyframes and their corresponding crack masks, a 3D point set of cracks can be obtained. Because each 3D coordinate is computed from the nearest intersection between a back-projected ray and the triangular mesh, the localized points satisfy P ∈ M, which geometrically guarantees that crack points lie on the reconstructed surface rather than at arbitrary locations in space.
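The back-projection and intersection steps of this section can be sketched end to end as follows. For self-containment, the sketch swaps in a brute-force Möller–Trumbore ray–triangle test in NumPy in place of a library ray-casting utility, and it handles one pixel at a time; the function names, the toy mesh, and the intrinsic values are illustrative assumptions.

```python
import numpy as np

def pixel_ray_world(u, v, K, R, t):
    """Back-project pixel (u, v) to a world-frame ray; R, t map world to camera."""
    d_c = np.linalg.solve(K, np.array([u, v, 1.0]))   # K^-1 [u, v, 1]^T
    d_c /= np.linalg.norm(d_c)                        # unit direction, camera frame
    C = -R.T @ t                                      # camera optical center in world
    return C, R.T @ d_c                               # origin and unit direction in world

def ray_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore test; returns the ray parameter s of the hit, or None."""
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(direction, e2)
    det = e1 @ p
    if abs(det) < eps:                                # ray parallel to the triangle
        return None
    tv = origin - v0
    a = (tv @ p) / det                                # first barycentric coordinate
    if a < 0.0 or a > 1.0:
        return None
    q = np.cross(tv, e1)
    b = (direction @ q) / det                         # second barycentric coordinate
    if b < 0.0 or a + b > 1.0:
        return None
    s = (e2 @ q) / det
    return s if s > eps else None                     # keep hits in front of the camera

def localize_pixel(u, v, K, R, t, vertices, faces):
    """Return the nearest mesh intersection P = C + s* d_w for one crack pixel."""
    C, d_w = pixel_ray_world(u, v, K, R, t)
    hits = []
    for f in faces:
        s = ray_triangle(C, d_w, vertices[f[0]], vertices[f[1]], vertices[f[2]])
        if s is not None:
            hits.append(s)
    return C + min(hits) * d_w if hits else None

# Toy scene: two parallel triangles at depths 5 and 8 in front of the camera.
K = np.array([[1000.0, 0.0, 960.0], [0.0, 1000.0, 540.0], [0.0, 0.0, 1.0]])
verts = np.array([[-10, -10, 5], [10, -10, 5], [0, 10, 5],
                  [-10, -10, 8], [10, -10, 8], [0, 10, 8]], dtype=float)
faces = np.array([[3, 4, 5], [0, 1, 2]])
print(localize_pixel(960.0, 540.0, K, np.eye(3), np.zeros(3), verts, faces))
```

Taking the minimum ray parameter is what enforces visibility: surfaces behind the first hit are ignored, so a crack pixel is never attached to an occluded face. For a full mesh, an accelerated ray-casting structure would replace the loop over faces.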
3. Experimental Setup and Data Acquisition
This study employed a cross-platform environment to implement crack segmentation, 3D reconstruction, and spatial localization. The training and inference of YOLO-DWL were conducted on a Windows 11 Professional platform, with the deep learning framework implemented in PyTorch 2.8.0 and CUDA version 11.8. To ensure experimental fairness, no pretrained weights were used in either the comparative experiments or the ablation experiments, and all models were trained from random initialization. The training hyperparameters were set as follows: the number of epochs was 500, the batch size was 16, the initial learning rate (lr0) was 0.1, and the optimizer was SGD. Under these fixed parameter settings, the training and validation loss curves, together with the validation mAP@50 curve, indicate that YOLO-DWL achieves reliable convergence during training (see Figure 8). On the same platform, COLMAP was used to perform SfM-MVS reconstruction on the selected keyframe subset to generate the 3D models. The localization stage employed Open3D’s ray–mesh intersection utility to back-project crack mask pixels into spatial rays and calculate their intersections with the mesh, thereby obtaining the 3D point set of the cracks. Keyframe selection and pose estimation were conducted on the Ubuntu 20.04 platform using ORB-SLAM3 to acquire the keyframe sequence and camera trajectories, which were subsequently exported for COLMAP reconstruction. The hardware configuration consisted of an Intel Core i9-14900KF CPU (Intel Corporation, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 4090 GPU with 24 GB VRAM (NVIDIA Corporation, Santa Clara, CA, USA) to satisfy the computational and memory demands of the integrated workflow. All experiments were conducted under consistent conditions to ensure fair performance comparisons across different experimental settings.
The dataset for this study was collected by capturing videos of cracks on bridge facades using a DJI Mavic 3 UAV (SZ DJI Technology Co., Ltd., Shenzhen, China). The platform is equipped with a monocular camera sensor, capturing video at a resolution of 1920 × 1080 and a frame rate of 30 fps. The UAV was operated manually, with a total video acquisition duration of 2 min and 10 s.
5. Conclusions
This study proposes an integrated framework for crack segmentation and 3D localization in UAV-based monocular bridge inspection, aiming to address the difficulties of recognizing slender and faint crack targets, missed detections under low-contrast conditions, and excessive computational redundancy. The main conclusions are as follows:
(1) An improved crack segmentation model was developed based on YOLOv11-seg. By integrating the DWRSeg, WIoU, and LSCSBD modules, the proposed model enhanced the feature representation of slender cracks while reducing unnecessary computational cost. Experimental results verified the effectiveness of the proposed segmentation strategy. The final model (Model G) achieved an mAP@50 of 0.874, improved the Precision by 6.3%, and reduced GFLOPs from 10.2 to 9.8, indicating a favorable balance between detection accuracy and computational efficiency.
(2) An ORB-SLAM3-based keyframe filtering strategy effectively improved the reconstruction efficiency. Compared with uniform frame sampling, the proposed strategy increased the image registration rate from 1.71% to 100%, reduced the SfM processing time to 5.249 min, and shortened the total execution time by 59.4%, thereby effectively alleviating the degradation of geometric constraints caused by redundant inputs.
(3) A reliable 2D-to-3D crack localization method was established, and the overall framework was validated in practical inspection scenarios. By employing a pixel back-projection and ray–triangular mesh intersection strategy, crack pixels segmented in 2D images were accurately projected onto the reconstructed 3D surface. The average localization error of crack feature points was 107.9 mm. Visualization results showed that the projected points were tightly attached to the model surface without obvious drifting or penetration artifacts.
To strengthen the applicability of the proposed framework in real-world inspection, future research will mainly focus on the following aspects. First, based on the established 3D localization results, automated quantitative measurement of crack geometric attributes, such as length, width, and spatial distribution, will be further developed to enhance the engineering value of the method. Second, to better balance accuracy and efficiency, lightweight optimization strategies will be further explored, including channel compression, lightweight feature extraction, and multi-scale/multi-frequency feature modeling, to reduce computational redundancy while preserving the representation ability for slender cracks. Ultimately, the goal is to better support onboard UAV deployment and achieve synchronized crack perception and 3D scene reconstruction during the inspection process.