Article

Intelligent Detection and 3D Localization of Bolt Loosening in Steel Structures Using Improved YOLOv9 and Multi-View Fusion

1 School of Mechanical Engineering, Henan Institute of Technology, Xinxiang 453002, China
2 School of Mechanical and Precision Instrument Engineering, Xi’an University of Technology, Xi’an 710048, China
3 Faculty of Humanities and Social Sciences, Macao Polytechnic University, Macao 999078, China
* Author to whom correspondence should be addressed.
Buildings 2026, 16(3), 619; https://doi.org/10.3390/buildings16030619
Submission received: 1 December 2025 / Revised: 16 January 2026 / Accepted: 22 January 2026 / Published: 2 February 2026
(This article belongs to the Section Building Structures)

Abstract

Structural health monitoring of steel buildings requires accurate detection and localization of bolt loosening, a critical yet challenging task due to complex joint geometries and varying environmental conditions. We propose an intelligent framework that integrates an improved YOLOv9 model with multi-view image fusion to address this problem. The method constructs a comprehensive dataset with multi-angle images under diverse lighting, occlusion, and loosening conditions, annotated with multi-task labels for precise training. The YOLOv9 backbone is enhanced with attention mechanisms to focus on key bolt features, while an angle-aware detection head regresses both bounding boxes and rotation angles, enabling loosening state determination through a threshold-based criterion. Furthermore, the framework unifies camera coordinate systems and employs epipolar geometry to fuse 2D detections from multiple views, reconstructing 3D bolt positions and orientations for precise localization. The proposed method achieves robust performance in detecting loosening angles and spatially localizing bolts, offering a practical solution for real-world structural inspections. Its significance lies in the integration of advanced deep learning with multi-view geometry, providing a scalable and automated approach to enhance safety and maintenance efficiency in steel structures.

1. Introduction

Steel structures form the backbone of modern infrastructure, with bolted joints serving as critical load-bearing components. The integrity of these joints directly impacts structural safety, making bolt loosening detection a paramount concern in structural health monitoring [1]. Traditional inspection methods rely on manual techniques such as torque wrenches or ultrasonic testing, which are labor-intensive, time-consuming, and often impractical for large-scale structures [2]. Recent advances in computer vision and deep learning have opened new avenues for automated bolt inspection, yet existing approaches face significant challenges in handling the 3D nature of steel joints and the precise determination of loosening states.
Current vision-based methods primarily employ object detection frameworks like the YOLO series (YOLOv3 to YOLOv8) to identify bolts in 2D images [3]. While these models achieve high detection rates, they typically use axis-aligned bounding boxes that cannot accurately represent the orientation of bolts in 3D space. This limitation becomes particularly problematic when assessing loosening, as the rotational state of a bolt is a key indicator of its tightness. Some studies have attempted to address this through rotation-aware detection heads [4], but these approaches often struggle with the small size and high occlusion typical of bolted joints in complex steel structures.
The evolution of YOLO models reflects continuous improvements in object detection architectures. YOLOv1 pioneered real-time detection by framing it as a regression problem, while YOLOv2 introduced anchor boxes and batch normalization for better localization. YOLOv3 incorporated multi-scale predictions through feature pyramid networks, significantly enhancing small object detection—a crucial capability for bolt inspection. YOLOv4 optimized training strategies with advanced data augmentation and modified feature aggregation. YOLOv5 introduced a more modular design with improved training efficiency, and YOLOv6-YOLOv8 further refined the backbone-neck-head architecture with enhanced feature representation. Most recently, YOLOv9 introduced the Programmable Gradient Information (PGI) mechanism to mitigate information loss in deep networks, theoretically making it better suited for detecting small, occluded objects like bolts in complex environments.
Justification for selecting YOLOv9. Although newer versions such as YOLOv10 have been released since this study began, the selection of YOLOv9 for this specific application is based on several technical and practical considerations. First, YOLOv9’s Programmable Gradient Information (PGI) mechanism is particularly advantageous for detecting small, occluded objects like bolts in complex steel structures, as it effectively mitigates the information loss problem in deep networks—a critical challenge for bolt feature preservation. Second, YOLOv9 provides a well-balanced architecture that maintains high detection accuracy while offering reasonable computational efficiency, which is essential for potential real-time field deployment. Third, it is important to emphasize that no single YOLO variant consistently provides the most reliable accuracy across all tasks and domains; model selection must be tailored to specific application requirements, dataset characteristics, and deployment constraints.
Recent studies employing YOLOv10, such as safety helmet monitoring on construction sites [5] and rebar counting using UAV imagery [6], demonstrate its effectiveness in different construction-related applications. However, these applications differ significantly from bolt loosening detection in several key aspects: (1) target scale (helmets and rebars are generally larger and more distinct than bolt heads), (2) environmental complexity (bolt joints often involve higher occlusion and more challenging lighting conditions), and (3) precision requirements (bolt loosening assessment demands sub-degree angle estimation, not just binary detection). The PGI mechanism in YOLOv9 is specifically designed to preserve gradient information flow for small object detection, making it theoretically better suited for our task. Furthermore, our extensive ablation studies (Section 5.3) demonstrate that the proposed angle-aware modifications to YOLOv9 achieve state-of-the-art performance for bolt loosening detection, with 93.6% detection accuracy and 5.2° angle estimation error, validating our architectural choice.
However, despite these architectural advances, existing YOLO-based approaches for structural inspection remain limited in three key aspects: (1) they typically output axis-aligned bounding boxes that cannot accurately represent bolt orientation in 3D space; (2) they lack dedicated mechanisms for precise rotation angle estimation, which is essential for quantifying bolt loosening severity; and (3) they operate primarily in 2D image space without integrating multi-view geometric constraints for accurate 3D localization. These limitations motivate our development of an angle-aware YOLOv9 variant integrated with multi-view fusion techniques specifically tailored for bolt loosening detection and 3D localization in steel structures.
Compared with existing bolt-loosening detection methods and YOLOv8/YOLOv9-based approaches, the novelty of our framework lies in the systematic integration of three key components: (1) an angle-aware detection head that regresses both bounding boxes and rotation angles for quantitative loosening assessment, addressing the limitation of axis-aligned bounding boxes in standard YOLO models; (2) a multi-view fusion pipeline that leverages epipolar geometry to reconstruct accurate 3D bolt positions and orientations without requiring pre-measured, fixed camera poses in a global coordinate system; and (3) a unified 3D loosening metric that combines positional displacement and angular deviation for comprehensive bolt condition evaluation. While individual elements such as rotated object detection or multi-view reconstruction exist in isolation, our work is the first to integrate them into a cohesive framework specifically designed for structural health monitoring of bolted joints in steel structures.
Multi-view imaging has shown promise in overcoming single-view limitations by combining information from multiple perspectives [7,8]. When integrated with 3D reconstruction techniques like Structure from Motion (SfM) [9], these methods can provide spatial context that is essential for accurate bolt localization. However, existing multi-view systems for structural inspection either lack the precision needed for loosening detection or require impractical setup conditions that limit their field applicability.
This paper presents a novel framework that addresses these challenges through three key innovations. First, we develop an angle-aware improved YOLOv9 model that simultaneously detects bolts and estimates their rotation angles, enabling direct assessment of loosening states. The model incorporates attention mechanisms and enhanced feature aggregation to improve detection accuracy under challenging conditions. Second, we introduce a multi-view fusion algorithm that combines 2D detections from multiple cameras with epipolar geometry constraints, achieving accurate 3D localization without requiring pre-measured, fixed camera poses in the world coordinate system. Third, we establish a comprehensive evaluation protocol that considers both detection accuracy and spatial localization precision, providing meaningful metrics for real-world deployment.
The proposed method offers significant advantages over existing approaches. Unlike traditional manual inspections, it provides an automated, quantitative assessment of bolt conditions with minimal human intervention. Compared to single-view computer vision methods, it achieves higher accuracy in both detection and loosening assessment by leveraging multi-view information. The integration of deep learning with 3D reconstruction creates a robust system that can handle the complex geometries typical of steel structures while maintaining computational efficiency suitable for field deployment.
The remainder of this paper is organized as follows: Section 2 reviews related work in bolt detection, multi-view fusion, and structural health monitoring. Section 3 presents the theoretical foundations of our approach, including multi-view geometry and angle-aware detection. Section 4 details our improved YOLOv9 architecture and the complete 3D localization pipeline. Section 5 presents experimental results on both synthetic and real-world datasets, followed by discussion and future work directions in Section 6. The paper concludes with a summary of contributions and potential applications.

2. Related Work

The detection and localization of bolt loosening in steel structures has been approached from various perspectives in computer vision and structural health monitoring. Existing methods can be broadly categorized into three groups: traditional image processing techniques, deep learning-based detection approaches, and multi-view 3D reconstruction methods.

2.1. Traditional Image Processing Approaches

Early attempts at automated bolt inspection relied on handcrafted features and classical image processing techniques. Methods based on edge detection [10] and template matching [11] were commonly used to identify bolt positions in 2D images. These approaches demonstrated reasonable performance in controlled environments with simple backgrounds, but struggled with varying lighting conditions and occlusions typical of real-world steel structures. The introduction of Hough transform-based circle detection [12] improved bolt head localization, yet these methods lacked the capability to assess bolt tightness or orientation.

2.2. Deep Learning-Based Detection

The advent of deep learning revolutionized bolt detection through convolutional neural networks. The YOLO series, particularly YOLOv8 [13], has been widely adopted for its balance of speed and accuracy in industrial inspection tasks. Recent improvements in YOLOv9 [14] introduced advanced feature extraction capabilities through its Programmable Gradient Information (PGI) mechanism, making it particularly suitable for small object detection like bolts. It is worth noting that while YOLOv10 has recently been introduced and shows promising results in certain applications such as safety helmet detection [15], the choice of YOLOv9 for our specific task is based on its proven effectiveness for small object detection and its well-established architecture that facilitates the integration of our proposed angle-aware modifications. The PGI mechanism in YOLOv9 addresses the information bottleneck problem that is particularly critical for detecting small bolts in occluded environments, and its modular design allows for effective incorporation of coordinate attention mechanisms and multi-scale feature aggregation—key components of our improved architecture.
Several studies have built upon these foundations, such as the CSEANet [16], which enhanced feature aggregation for bolt defect detection in railway bridges. However, these approaches primarily focus on binary classification (tight/loose) without quantifying the degree of loosening through rotation angle estimation [17,18,19]. More recently, researchers have extended deep learning models to explicitly estimate bolt orientation angles, which is crucial for quantifying loosening severity. Transformer-based architectures, such as DETR variants with rotational query mechanisms, have shown promising results in predicting oriented bounding boxes for small mechanical parts [20]. Additionally, refined angle encoding schemes, including Gaussian angle representation and circular smooth loss, have been proposed to mitigate boundary discontinuity issues in rotation regression [21,22]. These advances provide a more robust foundation for angle-aware bolt detection, though their integration with multi-view 3D localization in structural health monitoring remains underexplored.

2.3. Multi-View 3D Reconstruction

Multi-view systems have emerged as a promising solution for overcoming the limitations of single-view detection. Structure from Motion (SfM) techniques [23] have been adapted for steel structure inspection, enabling 3D point cloud generation from multiple images. The synchronized identification system developed by [24] demonstrated the potential of combining visual perception with panoramic reconstruction for defect localization. Epipolar geometry-based methods [25] have shown particular promise in feature matching across views, though their application to bolt loosening detection remains limited by the challenges of small object registration and orientation estimation [26,27].

2.4. Hybrid Approaches

Recent works have begun exploring combinations of these techniques. The ERBW-bolt model [28] improved upon YOLOv8 for bolt detection in complex backgrounds, while [29] integrated deep learning with traditional image matching for loosening assessment. These hybrid methods represent important steps toward comprehensive bolt inspection systems, yet they still lack the integrated angle estimation and precise 3D localization capabilities needed for practical structural health monitoring applications [30,31].
To provide a clear overview and comparison of the four methodological categories discussed above, Table 1 summarizes their key characteristics, advantages, and limitations specifically in the context of bolt loosening detection and localization.
The systematic comparison in Table 1 reveals three critical gaps in existing bolt loosening detection methodologies that directly motivate our proposed approach: (1) Lack of integrated angle-aware detection: While deep learning methods (particularly YOLO variants) achieve high bolt detection rates, they predominantly output axis-aligned bounding boxes that cannot capture the rotational state essential for loosening assessment. Although rotation-aware detectors exist, they are not specifically optimized for bolts in structural health monitoring contexts. (2) Disconnection between 2D detection and 3D localization: Multi-view reconstruction provides accurate 3D positions but typically operates as a separate stage after detection, lacking tight integration with the detection process and not incorporating bolt-specific rotation estimation. (3) Absence of a unified framework for quantitative loosening assessment: Existing hybrid approaches combine elements from different categories but maintain fragmented pipelines where detection, angle estimation, and 3D reconstruction remain separate modules, preventing end-to-end optimization and comprehensive metric development.
These identified gaps necessitate a new approach that integrates rather than concatenates the key capabilities required for practical bolt loosening inspection. Our proposed framework directly addresses these limitations by developing: (i) an improved YOLOv9 with an angle-aware detection head for simultaneous bolt identification and rotation estimation, (ii) a multi-view fusion pipeline that geometrically integrates 2D detections from multiple views to reconstruct accurate 3D bolt poses, and (iii) a unified loosening metric that combines both spatial displacement and angular deviation for comprehensive assessment. This integrated design overcomes the fragmentation of existing methods and provides a cohesive solution specifically tailored for structural health monitoring applications.
To clearly delineate the novelty of our approach against the state-of-the-art, we emphasize that while YOLOv8 and YOLOv9 provide efficient bolt detection, they are limited to axis-aligned bounding boxes that cannot capture rotational states essential for loosening assessment. Similarly, existing rotated object detectors operate primarily in 2D image space without 3D spatial context. Multi-view 3D reconstruction methods, while providing spatial information, typically require separate detection stages and do not incorporate continuous angle regression for loosening quantification. Our framework uniquely integrates an angle-aware YOLOv9 detector, epipolar geometry-based multi-view fusion, and a 3D-aware loosening metric into a single, end-to-end system. This integration enables simultaneous 2D detection, rotation angle estimation, 3D localization, and quantitative loosening assessment—a capability not achieved by any existing method in the literature.
The proposed method advances beyond these existing approaches by simultaneously addressing three critical limitations: (1) the inability of current detection models to estimate precise rotation angles for loosening quantification, (2) the lack of robust multi-view fusion techniques for 3D bolt localization in complex joint configurations, and (3) the absence of comprehensive evaluation protocols that consider both detection accuracy and spatial localization precision. Our angle-aware YOLOv9 improvement combined with epipolar geometry-based fusion creates a unified framework that provides quantitative loosening assessment while maintaining the efficiency required for field deployment.

3. Preliminaries on Multi-View 3D Reconstruction and Angle-Aware Object Detection

Understanding the fundamental concepts behind multi-view 3D reconstruction and angle-aware detection is essential for developing robust bolt loosening localization systems. This section establishes the theoretical foundations that underpin our proposed approach.

3.1. Multi-View Geometry Principles

The process of reconstructing 3D information from multiple 2D images relies on principles from projective geometry. The pinhole camera model serves as the basis for understanding image formation, where a 3D point X = (X, Y, Z)^T in world coordinates projects to a 2D point x = (x, y)^T in image space through the perspective projection equation:
$$x = K\,[R \mid t]\,X \tag{1}$$
where K represents the camera intrinsic matrix and [R | t] denotes the rotation and translation components of the camera’s extrinsic parameters. When multiple cameras observe the same scene, their geometric relationship can be described using the fundamental matrix F, which satisfies the epipolar constraint:
$$x_2^{T} F\, x_1 = 0 \tag{2}$$
for corresponding points x_1 and x_2 in two views. This constraint forms the foundation for establishing correspondences between views and enables 3D point triangulation.
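To make the epipolar constraint in Equation (2) concrete, the following minimal Python sketch (an illustration assuming OpenCV and NumPy; the function name and inputs are ours, not part of the proposed system) estimates a fundamental matrix from matched keypoints with RANSAC and evaluates the algebraic residual x_2^T F x_1 for each correspondence; residuals close to zero indicate geometrically consistent matches.

import cv2
import numpy as np

def epipolar_residuals(pts1, pts2):
    """Estimate F from matched points and return |x2^T F x1| per match.

    pts1, pts2: (N, 2) arrays of corresponding pixel coordinates in views 1 and 2.
    """
    F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
    # Convert to homogeneous coordinates by appending a column of ones.
    x1 = np.hstack([pts1, np.ones((len(pts1), 1))])
    x2 = np.hstack([pts2, np.ones((len(pts2), 1))])
    # Algebraic epipolar error x2^T F x1 for every correspondence.
    residuals = np.abs(np.einsum('ij,jk,ik->i', x2, F, x1))
    return F, residuals, inlier_mask.ravel().astype(bool)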

3.2. Angle-Aware Object Representation

Traditional object detection methods typically represent objects using axis-aligned bounding boxes, which are insufficient for capturing the orientation of bolts in 3D space. The rotated bounding box representation addresses this limitation by incorporating an angle parameter θ:
$$B = (x_c, y_c, w, h, \theta) \tag{3}$$
where (x_c, y_c) denotes the box center, w and h represent width and height, and θ indicates the rotation angle relative to the horizontal axis. This representation becomes particularly important for bolt loosening detection, as the rotation angle directly correlates with the bolt’s tightness state. To intuitively illustrate the limitation of axis-aligned bounding boxes in representing bolt orientation, Figure 1 compares the bounding box representations for a bolt in a rotated state (simulating loosening).
Figure 1a Axis-aligned bounding box: The box is constrained to be parallel to the image’s horizontal and vertical axes, regardless of the bolt’s actual rotation. It encloses the entire bolt but includes significant background pixels (gray shaded area) and fails to encode the bolt’s rotational angle. Figure 1b Rotated bounding box: The box is adjusted to fit the bolt’s actual shape and orientation, with the angle θ explicitly indicating the rotation relative to the horizontal axis. This representation eliminates redundant background and directly captures the pose information critical for loosening assessment.
As shown in Figure 1a, AABBs are inherently limited by their axis-aligned constraint: even when a bolt rotates (e.g., 45° as in the figure) due to loosening, the AABB only expands to cover the minimum and maximum x/y coordinates of the bolt, without retaining any information about the rotation angle. For bolt loosening detection, this means AABBs cannot distinguish between a “tight bolt (0° rotation)” and a “loose bolt (e.g., 90°/180° rotation)” if both bolts occupy similar 2D spatial ranges. Furthermore, in 3D space, bolts are often oriented at arbitrary angles relative to the camera plane; AABBs cannot reflect the true 3D pose of the bolt, as they collapse the depth and rotation information into a 2D axis-aligned region. In contrast, the RBB in Figure 1b directly models the bolt’s rotation angle θ and fits the bolt’s contour tightly, preserving the orientation information that is the core indicator of loosening state. This visual comparison confirms that RBBs are indispensable for accurate bolt orientation estimation in 3D structural inspections.
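As a concrete illustration of the rotated-box parameterization in Equation (3), the short helper below (illustrative only; not code from the proposed system) converts (x_c, y_c, w, h, θ) into the four corner points, the operation typically needed for visualization and rotated-IoU computation.

import numpy as np

def rbox_to_corners(xc, yc, w, h, theta_deg):
    """Return the 4 corner points (shape (4, 2)) of a rotated box (x_c, y_c, w, h, θ)."""
    t = np.deg2rad(theta_deg)
    rot = np.array([[np.cos(t), -np.sin(t)],
                    [np.sin(t),  np.cos(t)]])
    # Corners of the axis-aligned box centered at the origin, in pixel units.
    half = np.array([[ w / 2,  h / 2],
                     [-w / 2,  h / 2],
                     [-w / 2, -h / 2],
                     [ w / 2, -h / 2]])
    return half @ rot.T + np.array([xc, yc])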

3.3. Feature Matching Across Views

Establishing accurate correspondences between multiple views is critical for successful 3D reconstruction. Scale-Invariant Feature Transform (SIFT) descriptors [32] have traditionally been used for this purpose, computing local feature vectors that remain stable across viewpoint changes. For bolt detection, we consider both appearance-based features and geometric constraints, where the latter helps resolve ambiguities in texture-less regions common in steel structures.
The combination of these principles enables our system to overcome the limitations of single-view detection while providing precise angle estimation for loosening assessment. The next section will detail how we integrate these concepts into an improved YOLOv9 framework with multi-view fusion capabilities.

4. Angle-Aware YOLOv9 with Multi-View Fusion for 3D Bolt-Loosening Localization

The proposed framework integrates angle-aware object detection with multi-view geometry to achieve precise 3D localization of bolt loosening in steel structures. This section details the technical implementation of our improved YOLOv9 model and the multi-view fusion pipeline.

4.1. Angle-Aware YOLOv9 Architecture

The baseline YOLOv9 architecture is enhanced with three key modifications to enable accurate bolt detection and angle estimation. First, we replace the standard detection head with an angle-aware variant that predicts rotated bounding boxes. The output tensor extends from the conventional 5 + 1 + C dimensions (x, y, w, h, conf, class) to 6 + 1 + C by adding the rotation angle θ:
$$T_{out} = (x_c, y_c, w, h, \theta, conf, class) \tag{4}$$
where θ ∈ [0°, 360°) represents the bolt’s rotation relative to the image plane. The angle prediction employs a hybrid loss function combining Smooth L1 regression for precise angle estimation with the existing CIoU loss for box localization:
$$\mathcal{L}_{\theta} = \lambda_1 \mathcal{L}_{CIoU} + \lambda_2 \mathcal{L}_{SmoothL1}(\theta) \tag{5}$$
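A minimal PyTorch sketch of the hybrid objective in Equation (5) is given below. It is a simplified illustration: the CIoU term uses torchvision’s complete-IoU loss on the axis-aligned part of the box (the rotated-box handling in our implementation is more involved), the angular difference is wrapped to avoid the 0°/360° boundary, and the weights lambda_box and lambda_angle are placeholders rather than tuned values.

import torch
import torch.nn.functional as F
from torchvision.ops import complete_box_iou_loss

def angle_aware_loss(pred_boxes, gt_boxes, pred_theta, gt_theta,
                     lambda_box=1.0, lambda_angle=1.0):
    """Hybrid loss: CIoU on (x1, y1, x2, y2) boxes plus Smooth-L1 on angles (radians)."""
    box_loss = complete_box_iou_loss(pred_boxes, gt_boxes, reduction='mean')
    # Wrap the angular difference so that, e.g., 359 degrees and 1 degree are treated as close.
    diff = torch.atan2(torch.sin(pred_theta - gt_theta),
                       torch.cos(pred_theta - gt_theta))
    angle_loss = F.smooth_l1_loss(diff, torch.zeros_like(diff))
    return lambda_box * box_loss + lambda_angle * angle_loss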
Second, we incorporate Coordinate Attention (CA) modules into the backbone network to enhance feature representation for small bolts. The CA mechanism generates attention weights along both spatial dimensions:
$$F_{att} = \sigma\big(f_h(F) \otimes f_w(F)\big) \otimes F \tag{6}$$
where f_h and f_w denote 1D convolutional operations along the height and width axes respectively, ⊗ denotes element-wise multiplication with broadcasting, and σ represents the sigmoid activation. This allows the network to focus on bolt-specific features while suppressing background noise [33,34].
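For readers unfamiliar with Coordinate Attention, the following PyTorch sketch shows a simplified version of the published CA block corresponding to Equation (6): features are pooled along the height and width directions, passed through a shared 1 × 1 convolution, split back into two directional attention maps, and used to re-weight the input. The channel-reduction ratio and layer choices are illustrative, not the exact configuration of our network.

import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Simplified Coordinate Attention: pool along H and W, then re-weight features."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.shape
        # Directional pooling: average along width -> (N, C, H, 1), along height -> (N, C, W, 1).
        x_h = x.mean(dim=3, keepdim=True)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (N, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (N, C, 1, W)
        return x * a_h * a_w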
Third, we introduce a multi-scale feature aggregation module that combines shallow and deep features through cross-stage connections. The enhanced feature pyramid preserves both high-resolution spatial information and rich semantic context:
$$P_i = \mathrm{Conv}\big(\mathrm{Concat}(C_i,\ \mathrm{UpSample}(P_{i+1}))\big) \tag{7}$$
where C_i denotes features from the i-th backbone stage and P_i represents the corresponding pyramid level.
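Equation (7) corresponds to a standard top-down pyramid fusion step; a minimal PyTorch sketch of one such step is shown below (channel counts, normalization, and activation are illustrative assumptions rather than the exact modules used in our network).

import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidFusion(nn.Module):
    """One top-down step: P_i = Conv(Concat(C_i, UpSample(P_{i+1})))."""
    def __init__(self, c_channels, p_channels, out_channels):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(c_channels + p_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.SiLU(),
        )

    def forward(self, c_i, p_next):
        # Upsample the coarser pyramid level to the resolution of the backbone stage C_i.
        p_up = F.interpolate(p_next, size=c_i.shape[-2:], mode='nearest')
        return self.fuse(torch.cat([c_i, p_up], dim=1))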

4.2. Multi-View Fusion Pipeline

The 2D detections from multiple views are fused through a geometry-constrained pipeline to reconstruct 3D bolt positions and orientations. For each detected bolt, we first establish correspondences across views using both appearance and geometric cues. The matching process considers:
  • Descriptor Similarity: SIFT-like features computed from the detected bolt regions. It should be clarified that these “SIFT-like features” are not standard SIFT descriptors (e.g., those implemented in OpenCV [28]), but rather task-specific local feature vectors extracted from the backbone of our improved YOLOv9 detector. Specifically, for each detected bolt region (defined by the predicted rotated bounding box), we crop the corresponding feature map from the neck layer of the YOLOv9 architecture (the C3k2 module before feature pyramid construction), which contains rich semantic and spatial information learned for bolt detection. The cropped feature map is then downsampled to a fixed size of 16 × 16 via average pooling, flattened into a 256-dimensional vector, and normalized by L2 normalization to form the final “SIFT-like” local descriptor. This design leverages the detector’s pre-trained ability to distinguish bolt-specific features (e.g., bolt head edges, cross-slot/groove patterns) while maintaining the scale and rotation invariance characteristics of traditional SIFT descriptors—overcoming the limitation of standard SIFT in handling small, low-texture bolt regions under complex environmental conditions. (A minimal extraction sketch is given after this list.)
  • Epipolar Consistency: Projection errors relative to the fundamental matrix.
  • Angle Consistency: Agreement between predicted rotation angles after accounting for viewpoint differences.
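The descriptor construction described in the first item above can be sketched as follows (illustrative PyTorch code, not the exact implementation: the crop uses the axis-aligned extent of the predicted rotated box, and collapsing the feature channels by averaging is our assumption so that the pooled 16 × 16 map flattens to the stated 256-dimensional vector).

import torch
import torch.nn.functional as F

def bolt_descriptor(neck_feat, box_xyxy, stride=8, out_size=16):
    """Extract an L2-normalized 256-D descriptor for one detected bolt.

    neck_feat: (C, H, W) feature map from the detector's neck.
    box_xyxy:  axis-aligned extent (x1, y1, x2, y2) of the rotated box, in pixels.
    stride:    downsampling factor between the input image and the neck feature map.
    """
    x1, y1, x2, y2 = [int(round(v / stride)) for v in box_xyxy]
    crop = neck_feat[:, max(y1, 0):y2 + 1, max(x1, 0):x2 + 1]          # (C, h, w)
    crop = crop.mean(dim=0, keepdim=True).unsqueeze(0)                 # collapse channels
    pooled = F.adaptive_avg_pool2d(crop, out_size)                     # (1, 1, 16, 16)
    desc = pooled.flatten()                                            # 256-D vector
    return F.normalize(desc, p=2, dim=0)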
The 3D position X of each bolt is estimated through triangulation from at least two views:
$$X^{*} = \arg\min_{X} \sum_i \left\| \pi\!\left(K_i [R_i \mid t_i]\, X\right) - x_i \right\|^2 \tag{8}$$
where π denotes the perspective projection operation and x_i represents the 2D detection in the i-th view. The optimization is solved via the Levenberg-Marquardt algorithm with RANSAC-based outlier rejection.
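The triangulation in Equation (8) is typically initialized with a linear (DLT) solution before the Levenberg-Marquardt refinement mentioned above; the two-view sketch below (OpenCV-based, with illustrative names, covering only the linear step) shows that initialization.

import cv2
import numpy as np

def triangulate_bolt(K1, K2, R, t, x1, x2):
    """Linear two-view triangulation of a bolt center.

    K1, K2: 3x3 intrinsics; R, t: pose of camera 2 relative to camera 1;
    x1, x2: (2,) pixel coordinates of the matched detection in each view.
    Returns the 3D point in camera-1 coordinates (to be refined by LM with RANSAC).
    """
    P1 = K1 @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K2 @ np.hstack([R, t.reshape(3, 1)])
    X_h = cv2.triangulatePoints(P1, P2,
                                x1.reshape(2, 1).astype(float),
                                x2.reshape(2, 1).astype(float))
    return (X_h[:3] / X_h[3]).ravel()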
It is important to note that the camera parameters K_i, R_i, and t_i used in Equation (8) are not obtained from a prior, fixed calibration in a global coordinate system—which would be impractical for field inspections of large or inaccessible structures. Instead, they are estimated directly from the captured multi-view image set. The intrinsic matrix K_i for each camera is determined using Zhang’s method [35], while the relative rotation R and translation t between views are derived from epipolar geometry (Equation (2)) via feature matching and essential matrix decomposition. During this calibration process, we also estimate and correct for lens distortion, namely the radial coefficients k_1, k_2, k_3 and the tangential coefficients p_1, p_2, using a standard chessboard pattern. All input images are undistorted using these parameters before feature extraction and detection, ensuring that perspective-induced geometric distortions are minimized at the preprocessing stage. This self-contained calibration procedure allows the system to operate flexibly in varying field setups without relying on precisely surveyed camera stations, thereby enhancing its practical applicability.
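A minimal calibration sketch corresponding to the procedure above (the standard OpenCV chessboard workflow implementing Zhang’s method; board size and square size are illustrative assumptions) is given below; every inspection image is subsequently undistorted with cv2.undistort before detection and feature matching.

import cv2
import numpy as np

def calibrate_camera(chessboard_images, board_size=(9, 6), square=0.025):
    """Estimate intrinsics K and distortion coefficients (k1, k2, p1, p2, k3)."""
    objp = np.zeros((board_size[0] * board_size[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:board_size[0], 0:board_size[1]].T.reshape(-1, 2) * square
    obj_pts, img_pts, image_size = [], [], None
    for img in chessboard_images:
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        image_size = gray.shape[::-1]
        found, corners = cv2.findChessboardCorners(gray, board_size)
        if found:
            obj_pts.append(objp)
            img_pts.append(corners)
    _, K, dist, _, _ = cv2.calibrateCamera(obj_pts, img_pts, image_size, None, None)
    return K, dist

# Usage: undistorted = cv2.undistort(raw_image, K, dist)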
For orientation estimation, we derive the 3D rotation axis a by solving the following system for each view pair:
$$a = n_1 \times n_2 \tag{9}$$
$$n_i = R_i^{T} K_i^{-1} \left(\cos\theta_i,\ \sin\theta_i,\ 0\right)^{T} \tag{10}$$
where n_i represents the normal vector of the bolt’s rotation plane in the i-th view. The final 3D orientation is obtained by averaging the solutions from all valid view pairs.
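Equations (9) and (10) reduce to a few lines of linear algebra per view pair; a brief two-view sketch (illustrative helper, angles in radians) is shown below.

import numpy as np

def bolt_rotation_axis(K1, R1, theta1, K2, R2, theta2):
    """3D rotation axis a = n1 x n2 from two views, following Equations (9) and (10)."""
    def normal(K, R, theta):
        n = R.T @ np.linalg.inv(K) @ np.array([np.cos(theta), np.sin(theta), 0.0])
        return n / np.linalg.norm(n)
    a = np.cross(normal(K1, R1, theta1), normal(K2, R2, theta2))
    return a / np.linalg.norm(a)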

4.3. Loosening State Determination

The bolt loosening state is determined by analyzing both the 3D position stability and rotation angle deviation. We define a loosening metric L that combines these factors:
$$L = \alpha\, \|\Delta X\| + \beta\, |\Delta \theta| \tag{11}$$
where ‖ΔX‖ measures positional displacement from the nominal position, |Δθ| quantifies angular deviation from the tightened state, and α, β are weighting parameters learned from training data. To learn α and β, we formulate a supervised optimization problem using the labeled training data. Each bolt sample in the training set is annotated with a ground-truth loosening label (tight = 0, loose = 1) based on structural safety standards. We adopt logistic regression as the optimization objective, where the input is the linear combination α‖ΔX‖ + β|Δθ| and the output is the predicted loosening probability. The loss function is the binary cross-entropy between the predicted probability and the ground-truth label: L_weight = −Σ_i [ y_i log σ(L_i) + (1 − y_i) log(1 − σ(L_i)) ], where y_i is the ground-truth label of the i-th bolt, σ(·) is the sigmoid function, and L_i = α‖ΔX_i‖ + β|Δθ_i|. We optimize α and β using the Adam optimizer with a learning rate of 0.001, minimizing L_weight over 100 epochs. This process ensures that α and β are adaptively adjusted to balance the contributions of positional displacement and angular deviation, so that the loosening metric L best discriminates between tight and loose bolts. The optimized values of α and β in our experiments are 0.32 and 0.68, respectively, indicating that angular deviation carries a slightly higher weight in loosening assessment, which is consistent with the mechanical characteristics of bolted joints, where rotation is the direct indicator of loosening. A bolt is classified as loose when L exceeds a threshold T_L determined through structural safety analysis.
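The weight-learning procedure described above amounts to a two-parameter logistic regression; a minimal PyTorch sketch (matching the stated Adam optimizer, learning rate of 0.001, and 100 epochs; tensor names and initial values are illustrative) is given below.

import torch

def learn_loosening_weights(dx, dtheta, labels, epochs=100, lr=1e-3):
    """Fit α and β in L = α·||ΔX|| + β·|Δθ| with binary cross-entropy on tight/loose labels.

    dx, dtheta, labels: 1-D tensors of positional displacements, angular deviations,
    and ground-truth loosening labels (0 = tight, 1 = loose).
    """
    alpha = torch.tensor(0.5, requires_grad=True)
    beta = torch.tensor(0.5, requires_grad=True)
    opt = torch.optim.Adam([alpha, beta], lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        logits = alpha * dx + beta * dtheta.abs()
        loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels.float())
        loss.backward()
        opt.step()
    return alpha.item(), beta.item()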
The complete system workflow, illustrated in Figure 2, processes multi-view images through the angle-aware YOLOv9 detector, performs geometric fusion of 2D detections, and outputs 3D bolt positions with loosening assessments. The arrows in Figure 2 delineate the specific data flow between functional layers: (1) Arrows from the Data Acquisition Layer (Multi-view Camera Array) to the Detection Algorithm Layer represent the input of raw synchronized images into the improved YOLOv9 network. (2) Arrows from the Detection Algorithm Layer (Angle-aware Detection Head) to the 3D Reconstruction Layer (Point Cloud Generator & Bolt Pose Estimator) transfer the 2D detection results—including rotated bounding boxes and estimated rotation angles—for 3D fusion. (3) Arrows from the 3D Reconstruction Layer to the Decision Layer (Alerting Module) deliver the reconstructed 3D bolt poses and computed loosening metrics, enabling final structural health assessment and alert generation. The implementation achieves real-time performance by parallelizing detection and reconstruction stages, with typical processing times of 50 ms per view on an NVIDIA RTX 3090 GPU.

5. Experiments

5.1. Experimental Setup

Dataset Construction: We collected a comprehensive dataset of steel structure bolt joints under varying conditions, including different lighting (indoor, outdoor, low-light), occlusion levels (0–70%), and bolt loosening angles (0°–360° in 5° increments). The dataset comprises 12,500 multi-view image sets from 25 distinct joint configurations, with each set containing 3–5 synchronized images from different viewpoints. To ensure comprehensive coverage of viewing angles, the camera array was positioned at varying azimuthal (0°–360°) and elevation (−30° to +60°) angles relative to each bolt joint, simulating realistic inspection scenarios. This design guarantees that the model is exposed to a wide spectrum of perspectives, including challenging cases where bolts are partially occluded or viewed from extreme angles, thereby enhancing its generalization capability for real-world deployments. All images were annotated with rotated bounding boxes and precise angle measurements using a custom annotation tool that incorporates photogrammetric verification [36].
The 25 distinct joint configurations cover typical bolted connections in steel structures, including flange joints, truss node joints, and beam-column connections, with bolt sizes ranging from M16 to M36 (common in civil infrastructure). Notably, many of these joint configurations contain multiple bolts arranged in patterns (e.g., circular arrays in flange joints, linear rows in beam-column connections). This design ensures that our model is exposed to and evaluated on scenarios where multiple bolts appear simultaneously, including cases where some bolts are loose while others remain tight. The consistent performance metrics reported in Section 5.2 confirm that simultaneous loosening of multiple bolts does not significantly degrade individual bolt looseness recognition in our framework. Each joint was intentionally adjusted to simulate 7 loosening levels (0°: fully tightened, 45°/90°/135°/180°/270°/360°: progressively loosened), ensuring uniform coverage of angular states.
Image acquisition was performed using a calibrated camera array (5 Sony α7R IV cameras, 61 MP resolution) with focal lengths fixed at 50 mm to avoid scale variations. For environmental variability, indoor images were captured under LED panel lighting (5000 K, 800 lux), outdoor images under natural sunlight (9:00–16:00, 1000–12,000 lux), and low-light images under dimmed lighting (100–200 lux) with exposure compensation. Occlusion was simulated by attaching steel plates, paint residues, or dust to bolt heads, with occlusion ratios precisely measured using ImageJ 1.53s.
The annotation process employed a custom tool integrated with photogrammetric verification (based on [31]): (1) Each bolt was first labeled with a rotated bounding box (x_c, y_c, w, h, θ) by two trained annotators; (2) The rotation angle θ was verified using a high-precision digital protractor (accuracy ±0.1°) to ensure ground-truth reliability; (3) Discrepancies between annotators (≤3° angular difference or ≤5% bounding box overlap) were resolved via cross-validation with a third senior engineer; (4) Finally, all annotations were checked for consistency with 3D point cloud data generated by a LiDAR scanner (Velodyne VLP-16, Velodyne Lidar Inc., San Jose, CA, USA), ensuring alignment between 2D labels and 3D bolt positions.
The dataset was split into training (8750 image sets, 70%), validation (1250 image sets, 10%), and test (2500 image sets, 20%) sets, with no overlap of joint configurations between splits to avoid data leakage. In conclusion, the constructed dataset comprehensively addresses the diversity of perspectives, lighting conditions, occlusion levels, bolt sizes, joint types, and loosening states. This multi-faceted variability ensures that the trained model can be well generalized to unknown real-world scenarios, as verified by the robust performance reported in the article.
Figure 3 is a collage showing samples, annotation information, and model prediction results from the bolt loosening detection dataset, comprehensively presenting the dataset’s coverage of different scenarios and the model’s prediction performance.
The first row (environmental condition dimension) shows M24 bolt samples under 5000 K LED lighting (800 lux), daytime natural light (1000–2000 lux), and low light (100–200 lux). The red dashed boxes are the ground-truth rotated bounding boxes, reflecting the dataset’s coverage of different lighting environments.
The second row (occlusion dimension) presents M16 bolts with no occlusion, 30–50% occlusion, and 50–70% occlusion. The yellow dashed boxes are the corresponding ground-truth bounding boxes, visually demonstrating the dataset’s coverage of different occlusion levels.
The third row (loosening angle dimension) shows the 0° (fully tightened), 180° (partially loose), and 360° (fully loose) states of an M30 bolt. The red dashed boxes are the ground-truth bounding boxes, with the corresponding loosening angles labeled as text, covering the full range of loosening states considered in this study.
The fourth row (prediction versus ground truth) shows ground-truth bounding boxes (red dashed, left) alongside model-predicted bounding boxes (green dashed, middle/right), with text labels indicating bolt specifications or loosening angles, reflecting the close agreement between the model’s predictions and the ground truth.
This figure verifies the diversity and soundness of the dataset in terms of environment, occlusion, and loosening state, and visually demonstrates the model’s precise detection capability under complex conditions.
Evaluation Metrics: We employ four primary metrics to assess performance (a brief computation sketch follows the list):
  • Detection Accuracy (DA) [37]: DA = TP / (TP + FP + FN), where TP, FP, and FN denote true positives, false positives, and false negatives, respectively.
  • Angle Estimation Error (AEE) [38]: Mean absolute error in degrees between predicted and ground-truth rotation angles.
  • 3D Localization Error (LE) [39]: Euclidean distance in mm between reconstructed and actual 3D bolt positions.
  • Loosening Classification F1-score [40]: Harmonic mean of precision and recall for loose/tight classification.
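A minimal sketch of how these four metrics can be computed from matched detections (illustrative helper; it assumes predictions have already been matched to ground-truth bolts so that per-bolt angle and localization errors are available):

import numpy as np

def evaluation_metrics(tp, fp, fn, angle_err_deg, loc_err_mm, y_true, y_pred):
    """Return DA, AEE, LE, and the loosening-classification F1-score."""
    da = tp / (tp + fp + fn)
    aee = float(np.mean(np.abs(angle_err_deg)))   # mean absolute angle error, degrees
    le = float(np.mean(loc_err_mm))               # mean 3D localization error, mm
    tp_c = np.sum((y_pred == 1) & (y_true == 1))
    precision = tp_c / max(np.sum(y_pred == 1), 1)
    recall = tp_c / max(np.sum(y_true == 1), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return da, aee, le, f1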
Implementation Details: The angle-aware YOLOv9 model was trained on 4 NVIDIA A100 GPUs with batch size 64, initial learning rate 0.01 (cosine decay), and input resolution 640 × 640. Data augmentation included random rotation (±15°), brightness adjustment (±30%), and occlusion simulation. The multi-view fusion module was implemented using OpenCV 4.8.0 and PyTorch3D 0.7.5, with camera calibration performed via Zhang’s method [35]. The real-world image data, especially from low-light and outdoor environments, contained typical sensor noise and grain. Instead of applying explicit denoising filters that might compromise fine bolt details, our framework incorporates robustness through training and architectural strategies. The extensive data augmentation (e.g., brightness variations) promotes learning of noise-invariant features. The Coordinate Attention mechanism enhances focus on discriminative regions while attenuating noisy backgrounds. Furthermore, the multi-view fusion stage provides inherent robustness by geometrically cross-validating detections across views, effectively marginalizing the impact of noise in any single image. The maintained performance under low-light conditions (DA = 92.1%, AEE = 6.3°) validates this integrated approach to handling image noise. For the SIFT-like feature extraction described in Section 4.2, the neck layer feature map (C3k2 output, channel dimension = 512) of the YOLOv9 detector is used, with average pooling kernel size = 4 × 4 and L2 normalization applied to the final descriptor vector.

5.2. Comparative Results

Table 2 compares our method against three state-of-the-art approaches on the test set (2500 image sets):
The comparative analysis presented in Table 2 unequivocally demonstrates the superior performance of the proposed framework against several state-of-the-art methods. The results, evaluated across four critical metrics, validate the effectiveness of integrating angle-aware detection with multi-view geometry. Our method achieves a leading Detection Accuracy (DA) of 93.6%, outperforming YOLOv8 (86.2%), Rotated Faster R-CNN (88.7%), and the recent CSEANet (90.1%). This improvement in DA can be attributed to the enhanced feature representation facilitated by the Coordinate Attention module and the multi-scale feature aggregation within our improved YOLOv9 backbone, which collectively enable more reliable bolt identification under challenging conditions.
More significantly, the most substantial gains are observed in the metrics most directly relevant to the core task of bolt loosening assessment: angle estimation and 3D localization. Our framework reduces the Angle Estimation Error (AEE) to 5.2°, which represents a 36% reduction compared to the best-performing baseline, CSEANet (8.3°). This leap in performance stems directly from the dedicated angle-aware detection head and its hybrid loss function, which regresses the rotation angle explicitly alongside the bounding box, a capability lacking in the conventional detectors used for comparison. Similarly, the 3D Localization Error (LE) is minimized to 3.8 mm, a 42% improvement over CSEANet’s 6.5 mm. This precise 3D localization performance also validates the effectiveness of our perspective distortion handling. The consistent reconstruction accuracy across varying viewing angles demonstrates that our camera calibration and projective geometry modeling successfully mitigate perspective distortion effects, enabling reliable metric measurements from 2D images. This decisive advantage is a direct consequence of our multi-view fusion pipeline, which leverages epipolar geometry to triangulate precise 3D positions from multiple 2D detections, thereby overcoming the inherent depth ambiguity of single-view approaches.
Consequently, the synergistic effect of these advancements culminates in the highest F1-score of 0.91 for the ultimate task of loosening classification. This holistic superiority confirms that while existing methods offer competent 2D detection, they fall short in providing the precise quantitative pose and location data required for rigorous structural health monitoring. To evaluate the method’s capability in identifying subtle loosening, the test set was stratified based on angular deviation severity. For bolts with minor loosening (|Δθ| ≤ 15° from tightened state), the average AEE was 6.8°, while for general loosening (|Δθ| > 15°) it was 4.9°. The marginally higher error for minor angles reflects the inherent challenge of regressing very small rotational differences. Nevertheless, the loosening classification F1-score for the minor looseness subset remained high at 0.89, confirming that the integrated angle regression and threshold-based decision metric (Equation (11)) provide sufficient sensitivity for practical identification of early-stage loosening, which is crucial for preventive maintenance strategies. The proposed framework successfully bridges this gap, establishing a new state-of-the-art by simultaneously advancing detection, angular regression, and spatial localization within a unified, end-to-end system.
Our method achieves superior performance across all metrics, with particular improvements in angle estimation (36% reduction in AEE) and 3D localization (42% reduction in LE) compared to the best baseline. The enhanced detection accuracy stems from the angle-aware attention mechanism, while the multi-view fusion accounts for the precise localization gains.
Figure 4 shows the training progression of our model, demonstrating stable convergence after 150 epochs. The angle regression branch converges slightly slower than the detection branch due to the finer precision required, but both reach satisfactory performance levels.

5.3. Ablation Study

We conduct an ablation study (Table 3) to validate key components of our framework:
Each component contributes significantly to overall performance. The angle-aware detection head provides the most substantial improvement for loosening assessment (22% AEE reduction), while the multi-view fusion yields the largest localization accuracy boost (34% LE improvement).
The performance of all ablation variants is summarized as follows: the baseline model achieves a DA of 87.3%, an AEE of 11.2°, and an LE of 7.9 mm; with the angle-aware detection head added, the metrics improve to 89.5%, 8.7°, and 7.1 mm; further integrating the CA module, the performance reaches 91.2%, 7.3°, and 5.8 mm; finally, the full model with the multi-view fusion module achieves the best results of 93.6%, 5.2°, and 3.8 mm.
To quantify the contribution of each key component to individual metrics, Table 4 calculates the improvement amplitude and contribution ratio based on the above ablation results.
As shown in Table 4, the multi-view fusion module contributes the most to 3D localization accuracy (44.3% of total improvement), which is attributed to its ability to fuse cross-view geometric information and eliminate ambiguity in single-view depth estimation. Specifically, this module leverages epipolar geometry constraints to align feature maps from different viewing angles, effectively integrating complementary spatial information and reducing the impact of partial occlusion or texture scarcity on 3D positioning. For the angle estimation task, the angle-aware detection head achieves a 2.5° reduction in AEE, accounting for a significant portion of the total error reduction, as it introduces quantitative angle regression into the detection branch instead of relying on traditional rotated bounding boxes, enabling more precise pose estimation of bolts. Meanwhile, the coordinate attention (CA) module enhances the model’s sensitivity to key local features by adaptively adjusting the feature weight distribution based on spatial coordinates, which not only improves detection accuracy by 2.2% but also assists in refining angle and localization results through enhanced feature representation. Notably, the sum of individual component contributions approximates 100%, indicating no obvious redundancy among the proposed modules—each component targets a distinct performance bottleneck (detection, angle estimation, 3D localization) and forms a synergistic effect in the full model. This quantitative contribution analysis confirms the rationality of the proposed framework design, providing a clear theoretical basis for the modular optimization of bolt 3D detection systems in complex steel structures.

5.4. Real-World Validation

We deployed the system on an active construction site to evaluate practical performance. Figure 5 shows example detections under challenging conditions. Specifically, Figure 5 illustrates four representative challenging scenarios: (1) High Occlusion (approximately 50–70% of the bolt head obscured by adjacent structural members), (2) Low-Light Environment (illumination below 100 lux, simulating dusk or indoor shadows), (3) Specular Reflection (direct sunlight or artificial light causing glare on the bolt surface), and (4) Complex Background Clutter (bolts surrounded by welding seams, rust, or paint irregularities). To quantify performance under these specific challenges, Table 5 reports the detection accuracy (DA) and angle estimation error (AEE) measured on a dedicated test subset containing 200 image sets per condition. The results confirm that while the system maintains robust performance, occlusion and extreme lighting remain the most difficult conditions, consistent with the analysis in Section 6.1.
Table 5 quantitatively presents the performance of the proposed method under the four specific challenging conditions. The results show that high occlusion has the most significant impact on system performance: the detection accuracy (DA) drops to 86.8% and the angle estimation error (AEE) increases to 10.2°, mainly because occluding objects obscure the complete contour features of the bolt head and weaken the focusing ability of the attention mechanism. In contrast, performance under low-light conditions is comparatively better (DA = 92.1%, AEE = 6.3°), indicating that the model retains a degree of robustness to illumination changes during feature extraction. The degradation caused by specular reflection (DA = 88.5%, AEE = 9.1°) results mainly from strong reflective regions masking local texture features, consistent with the limitations discussed in Section 6.1. Performance under complex background conditions (DA = 90.7%, AEE = 7.5°) verifies the effectiveness of the coordinate attention module in suppressing background interference. Notably, in a comprehensive environment with mixed challenges, the system still maintains a detection accuracy of 91.2% and an angle estimation error of 4.8°, indicating that the multi-view fusion mechanism compensates to some extent for the performance loss under single-view conditions and provides reliable technical support for practical engineering applications.
The scatter plot demonstrates a strong correlation (R² = 0.94) between predicted and actual loosening angles, with most errors below 6°. The heatmap visualization in Figure 6 effectively identifies high-risk zones:
The heat map uses color gradients to visualize the three-dimensional spatial distribution of bolt loosening probability across the steel structure. Red areas (probability > 0.8) are concentrated at structural connection nodes and stress concentration zones, consistent with the expectation from structural mechanics that stress relaxation tends to occur in these locations. Yellow areas (probability 0.4–0.8) are mainly distributed at cantilever ends and in zones subject to dynamic loads, indicating a moderate loosening risk for bolts exposed to alternating loads. Blue areas (probability < 0.4) correspond to rigidly supported parts of the structure, confirming the relative stability of bolt states in statically loaded regions. Notably, the loosening distribution revealed by the heat map shows clear spatial clustering rather than a random pattern, which provides an important basis for targeted maintenance strategies: maintenance resources should be concentrated first on the red high-probability areas. This visualization not only verifies the proposed method’s ability to detect loosening states but, more importantly, provides a spatial risk assessment tool for structural health monitoring, achieving a leap from “single-point detection” to “overall risk assessment”.
Field tests achieved 91.2% DA and 4.8° AEE, confirming the method’s robustness to real-world variability. The average processing time per multi-view set was 210 ms, meeting real-time inspection requirements.
To further verify the environmental robustness of the proposed method, Table 6 quantifies the performance under different lighting and occlusion conditions, derived from the dataset characteristics (Section 5.1) and real-world validation results (Section 5.4).
Table 6 provides a systematic evaluation of the proposed method’s robustness under a spectrum of environmental conditions, reflecting the practical challenges encountered in real-world structural inspections. The performance metrics—Detection Accuracy (DA), Angle Estimation Error (AEE), and 3D Localization Error (LE)—collectively paint a comprehensive picture of the system’s capabilities and limitations. Under ideal indoor standard lighting with no occlusion, the system achieves its peak performance, with a DA of 95.3%, an AEE of 4.1°, and an LE of 3.2 mm. This serves as a performance baseline, demonstrating the model’s upper limit in a controlled setting. When transitioning to outdoor natural lighting, also without occlusion, a slight degradation in performance is observed, yielding a DA of 94.7%, an AEE of 4.5°, and an LE of 3.5 mm. This indicates a degree of sensitivity to the broader variability and potential glare associated with outdoor illumination, though the model maintains strong overall accuracy.
The system’s performance is more noticeably affected in low-light environments, where the DA drops to 92.1%, the AEE increases to 6.3°, and the LE rises to 4.2 mm. This decline underscores the challenge of extracting discriminative visual features from bolts under insufficient lighting, a common scenario in the interiors of large structures or during inspections conducted at dusk. The impact of partial occlusion is another critical factor: under moderate occlusion (30–50%), the DA falls to 90.5% while the AEE rises to 7.8° and the LE to 5.1 mm. Under severe occlusion (50–70%), the challenges are compounded, leading to a further drop to 86.8% DA, 10.2° AEE, and 6.7 mm LE. This progressive decline aligns with the limitations discussed in Section 6.1, confirming that while the attention mechanisms and multi-view fusion provide resilience, they are not immune to significant visual obstruction.
Ultimately, the most telling result comes from the comprehensive environment of a real construction site, which amalgamates variable lighting, slight occlusions, and other unstructured elements. Here, the method achieves a DA of 91.2%, an AEE of 4.8°, and an LE of 4.3 mm. This robust performance, closely mirroring the controlled experimental results and surpassing the baselines in Table 2, validates the practical viability of the proposed framework. The analysis confirms that the integration of the angle-aware YOLOv9 with multi-view fusion delivers a system that not only excels in laboratory conditions but also maintains high operational reliability in the complex and unpredictable setting of an active construction site.

6. Discussion and Future Work

6.1. Limitations of the Proposed Method

While the proposed framework demonstrates strong performance in controlled experiments, several limitations emerge when considering real-world deployment scenarios. The system’s accuracy degrades significantly when bolt heads are heavily occluded (>70% coverage) or under extreme lighting conditions (e.g., direct sunlight causing specular reflections). This occurs because the current attention mechanisms struggle to distinguish bolt features from intense glare patterns. Furthermore, the multi-view fusion assumes minimal structural deformation between camera positions, an assumption that may not hold for large-span structures experiencing thermal expansion or dynamic loads. The angle estimation also shows increased error rates for bolts oriented nearly parallel to the image plane (θ ≈ 90° or 270°), where small detection errors lead to large angular miscalculations.
To visually illustrate these limitations, Figure 7 presents two typical failure cases: (a) an almost fully hidden bolt (occlusion rate > 80%) where adjacent steel plates block most of the bolt head, resulting in failed detection (FN) due to insufficient feature extraction; (b) a bolt under strong specular reflection, where glare masks the cross-slot texture of the bolt head, leading to an angle estimation error of 18.7° (ground truth: 45°, prediction: 63.7°). For these scenarios, several potential remedies could be explored: (1) Feature completion techniques based on generative adversarial networks (GANs) to reconstruct occluded bolt regions using context-aware prior knowledge of bolt geometries; (2) Integration of additional sensors such as thermal imagers or LiDAR, which can provide complementary depth or thermal information unaffected by visual occlusion or reflection; (3) Adoption of active lighting systems with polarized filters to mitigate specular reflections and enhance bolt head texture contrast. These strategies aim to address the current limitations and further improve the system’s robustness in extreme practical scenarios.
Figure 7a depicts a flange connection in a steel structure, where adjacent thick steel plates are staggered and create severe occlusion. More than 80% of the target M24 bolt head is blocked, with only 1–2 mm of its edge exposed. The red dashed box marks the ground-truth position of the bolt, annotated "GT: M24, Occlusion > 80%" based on the actual contour. Because too few features remain visible, the model fails to recognize the bolt; the yellow dashed box incorrectly marks only a small area at the occlusion edge, annotated "Pred: No valid detection (FN)". This scene was captured under standard indoor lighting (5000 K LED, 800 lux) with no additional interference factors, isolating the system's limitation in extreme occlusion scenarios.
Figure 7b shows a beam-column connection node, where the surface of the M30 bolt is smooth and free of rust. Under outdoor noon conditions (12,000 lux), direct sunlight strikes the bolt surface and generates strong glare covering 60% of the bolt head, masking the cross-slot texture crucial for angle estimation. The red dashed box indicates the ground truth, annotated "GT: M30, θ = 45°" to specify both the true position and rotation angle. Although the model detects the bolt, the severe reflection causes a large angle estimation error; the yellow dashed box shows the prediction, annotated "Pred: θ = 63.7°, AEE = 18.7°", confirming the system's performance degradation under strong specular reflection.
In the specular-reflection case, the glare not only saturates the image sensor locally but also creates false edges that mislead both the detection network and the feature-matching stage of the multi-view fusion, compounding the angle estimation error.
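To make the angle-error figures above concrete, the following minimal Python sketch computes a wrap-around-aware angular difference. It is illustrative only: the function name is ours, and it assumes that per-bolt angle errors are measured as the shortest circular difference before being averaged into the reported AEE.

```python
def circular_angle_error(pred_deg: float, gt_deg: float) -> float:
    """Smallest absolute difference between two angles on a 0-360 degree circle."""
    diff = (pred_deg - gt_deg) % 360.0   # fold the raw difference into [0, 360)
    return min(diff, 360.0 - diff)       # take the shorter of the two arcs

# The Figure 7b failure case: ground truth 45 deg, prediction 63.7 deg.
print(circular_angle_error(63.7, 45.0))   # ~18.7, matching the reported error

# Near the wrap-around point a naive subtraction overstates the error:
print(abs(2.0 - 358.0))                   # 356.0 (naive difference)
print(circular_angle_error(2.0, 358.0))   # 4.0   (correct circular error)
```

Handling the wrap-around explicitly prevents the angle-discontinuity issue noted in Table 1 from inflating the evaluation metric, although it cannot remove the underlying estimation error caused by glare or near-parallel orientations.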
Newer YOLO versions such as YOLOv10 have emerged since this study began. While these models may offer architectural improvements, model selection in practical applications involves trade-offs between novelty, stability, and task-specific optimization. YOLOv9 was selected for its demonstrated effectiveness on small objects (bolts typically span 20–50 pixels in our imagery) and its programmable gradient information (PGI) mechanism, which preserves gradient flow, a property critical to our angle-aware modifications. The performance achieved by the improved YOLOv9 (93.6% DA, 5.2° AEE) supports this choice. Future work could integrate the angle-aware detection head and multi-view fusion pipeline with YOLOv10 or subsequent versions, leveraging their architectural advancements while retaining our task-specific innovations.

6.2. Potential Application Scenarios

Beyond structural health monitoring, the developed techniques could benefit several industrial domains requiring precise fastener inspection. In wind turbine maintenance, the system could automate bolt checks on tower flange connections, where manual inspections currently require risky climbs. The aviation industry might apply this for rapid airframe joint assessments during routine maintenance checks. The method’s real-time capabilities also make it suitable for quality control in prefabricated steel construction, where it could verify bolt tightness before component shipment. An interesting extension would involve integrating the visual detection with robotic tightening systems, creating closed-loop correction of identified loose bolts.
Furthermore, an important future direction is group loosening state analysis. By leveraging the spatial distribution of detected bolts and their individual loosening metrics, clustering algorithms can identify loosening patterns and correlations. This would enable not only individual bolt assessment but also structural-level risk evaluation based on the spatial arrangement and collective behavior of multiple loosened bolts, providing more comprehensive insight for preventive maintenance strategies.
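As an illustration of how such a group analysis could be prototyped, the sketch below clusters the 3D positions of bolts flagged as loosened using DBSCAN from scikit-learn. The coordinates, the 10° loosening threshold, and the 10 cm clustering radius are illustrative assumptions standing in for the outputs of the multi-view fusion stage; they are not parameters of the implemented framework.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic placeholders: fused 3D bolt positions (metres) and estimated loosening angles (deg).
bolts_xyz = np.array([[0.00, 0.00, 0.0], [0.05, 0.02, 0.0], [0.03, 0.07, 0.0],
                      [2.10, 0.00, 0.0], [2.14, 0.05, 0.0], [5.00, 1.00, 0.0]])
loosening_deg = np.array([22.0, 18.5, 25.0, 3.0, 30.0, 28.0])

loose_xyz = bolts_xyz[loosening_deg > 10.0]          # keep only bolts judged loosened

# Group loosened bolts lying within 10 cm of one another; clusters of two or more
# indicate a joint-level loosening pattern rather than an isolated fastener.
labels = DBSCAN(eps=0.10, min_samples=2).fit_predict(loose_xyz)

for cluster_id in sorted(set(labels)):
    members = loose_xyz[labels == cluster_id]
    tag = "isolated bolt(s)" if cluster_id == -1 else f"cluster {cluster_id}"
    print(f"{tag}: {len(members)} loosened bolt(s), centroid {members.mean(axis=0)}")
```

Spatial clusters of loosened bolts could then be weighted more heavily in a joint-level risk score, since several neighbouring fasteners losing preload together is a stronger indicator of connection-level degradation than isolated loosening.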

6.3. Computational Efficiency and Real-Time Application

The current implementation achieves practical processing speeds (210 ms per multi-view set) but faces challenges in scaling to very large structures. The computational load increases quadratically with the number of bolts due to the pairwise matching requirements in multi-view fusion. For structures with thousands of bolts, we observe a 40% slowdown when processing full-resolution images. Two promising directions could address this: implementing spatial hashing to reduce matching complexity, and developing a lightweight version of the angle-aware head for deployment on edge devices. The system’s memory footprint (currently 4.2 GB for the full model) also presents challenges for embedded deployment, suggesting potential benefits from knowledge distillation techniques or hybrid quantization approaches [41,42]. Future work should investigate these optimizations while maintaining the current accuracy levels.
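A minimal sketch of the spatial-hashing idea is shown below: detections back-projected to rough 3D coordinates are bucketed into grid cells so that cross-view matching only compares detections in identical or adjacent cells, rather than all pairs. The 20 cm cell size, data layout, and function names are illustrative assumptions, not the deployed implementation.

```python
from collections import defaultdict
from itertools import product

def build_spatial_hash(points, cell=0.20):
    """Bucket 3D points (metres) into a dict keyed by integer grid-cell coordinates."""
    grid = defaultdict(list)
    for idx, (x, y, z) in enumerate(points):
        grid[(int(x // cell), int(y // cell), int(z // cell))].append(idx)
    return grid

def candidate_pairs(points_a, points_b, cell=0.20):
    """Yield index pairs (i, j) whose cells are identical or adjacent, avoiding
    the full len(points_a) * len(points_b) comparison."""
    grid_b = build_spatial_hash(points_b, cell)
    for i, (x, y, z) in enumerate(points_a):
        key = (int(x // cell), int(y // cell), int(z // cell))
        for offset in product((-1, 0, 1), repeat=3):           # the 27 neighbouring cells
            for j in grid_b.get(tuple(k + o for k, o in zip(key, offset)), []):
                yield i, j

# Example: rough 3D back-projections of bolt detections from two views.
view_a = [(0.00, 0.00, 0.0), (2.10, 0.00, 0.0)]
view_b = [(0.04, 0.01, 0.0), (5.00, 1.00, 0.0)]
print(list(candidate_pairs(view_a, view_b)))   # only the nearby pair (0, 0) survives
```

Because each detection is checked only against bolts in its own and neighbouring cells, the expected matching cost grows roughly linearly with the number of bolts for a fixed bolt density, which is the behaviour needed to scale to structures with thousands of fasteners.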

7. Conclusions

The proposed framework demonstrates significant advancements in automated bolt loosening detection and 3D localization for steel structures. By integrating an improved angle-aware YOLOv9 model with multi-view fusion techniques, the system achieves robust performance across diverse environmental conditions while maintaining computational efficiency suitable for field deployment. The attention-enhanced detection architecture enables precise rotation angle estimation, directly addressing the need for quantitative loosening assessment in structural health monitoring. The geometric fusion pipeline effectively combines information from multiple viewpoints to reconstruct accurate 3D bolt positions, overcoming the limitations of single-view approaches. Experimental results validate the method’s superiority over existing techniques, particularly in handling complex joint configurations and varying lighting conditions. The system’s real-time processing capability and practical deployment performance suggest strong potential for transforming current inspection practices in civil infrastructure maintenance. Future research directions include addressing occlusion challenges through advanced feature completion techniques and optimizing the framework for edge-device implementation to further enhance field applicability. We will also investigate group loosening state analysis to complement individual bolt assessment, developing algorithms that detect loosening patterns and correlations among multiple bolts and thereby enable more holistic structural risk assessment and prioritized maintenance planning based on both individual and collective bolt behavior. The developed methodology provides a comprehensive solution that bridges the gap between academic research and industrial needs in structural safety monitoring.

Author Contributions

Conceptualization, F.C., X.C. and L.L.; methodology, X.C. and F.C.; software, F.C.; validation, F.C., X.C. and L.L.; formal analysis, F.C., X.C. and L.L.; investigation, X.C., F.C. and L.L.; data curation, F.C.; writing—original draft, X.C., F.C. and L.L.; writing—review and editing, F.C., X.C. and L.L.; visualization, F.C., X.C. and L.L.; supervision, F.C., X.C. and L.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by Doctoral Research Start-up Fund of Henan Institute of Technology: “Research on Contact Mechanism and Dynamic Characteristics of Ultrasonic Processing System” (KQ2009).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Fylladitakis, E.D. Fault Identification and Predictive Maintenance Techniques for High-Voltage Equipment: A Review and Recent Advances. J. Power Energy Eng. 2025, 13, 1–39. [Google Scholar] [CrossRef]
  2. Wang, T.; Chen, Y.; Qiao, M.; Snoussi, H. A Fast and Robust Convolutional Neural Network-Based Defect Detection Model in Product Quality Control. Int. J. Adv. Manuf. Technol. 2018, 94, 3465–3471. [Google Scholar] [CrossRef]
  3. Zhao, L.; Li, S. Object Detection Algorithm Based on Improved YOLOv3. Electronics 2020, 9, 537. [Google Scholar] [CrossRef]
  4. Shi, Y.; Jia, Y.; Zhang, X. FocusDet: An Efficient Object Detector for Small Object. Sci. Rep. 2024, 14, 10697. [Google Scholar] [CrossRef]
  5. Wang, S.; Park, S.; Kim, J.; Kim, J. Safety Helmet Monitoring on Construction Sites Using YOLOv10 and Advanced Transformer Architectures with Surveillance and Body-Worn Cameras. J. Constr. Eng. Manag. 2025, 151, 04025186. [Google Scholar] [CrossRef]
  6. Wang, S. Effectiveness of Traditional Augmentation Methods for Rebar Counting Using UAV Imagery with Faster R-CNN and YOLOv10-Based Transformer Architectures. Sci. Rep. 2025, 15, 33702. [Google Scholar] [CrossRef] [PubMed]
  7. Moujahid, A.; Dornaika, F. Advanced Unsupervised Learning: A Comprehensive Overview of Multi-View Clustering Techniques. Artif. Intell. Rev. 2025, 58, 234. [Google Scholar] [CrossRef]
  8. Chen, X.; Yang, H.; Zhang, H.; Wong, C.U.I. Dynamic Gradient Descent and Reinforcement Learning for AI-Enhanced Indoor Building Environmental Simulation. Buildings 2025, 15, 2044. [Google Scholar] [CrossRef]
  9. Gao, L.; Zhao, Y.; Han, J.; Liu, H. Research on Multi-View 3D Reconstruction Technology Based on SFM. Sensors 2022, 22, 4366. [Google Scholar] [CrossRef]
  10. Canny, J. A Computational Approach to Edge Detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, PAMI-8, 679–698. [Google Scholar] [CrossRef]
  11. Tsai, D.-M.; Lin, C.-T. Fast Normalized Cross Correlation for Defect Detection. Pattern Recognit. Lett. 2003, 24, 2625–2631. [Google Scholar] [CrossRef]
  12. Mukhopadhyay, P.; Chaudhuri, B.B. A Survey of Hough Transform. Pattern Recognit. 2015, 48, 993–1010. [Google Scholar] [CrossRef]
  13. Talaat, F.M.; ZainEldin, H. An Improved Fire Detection Approach Based on YOLO-v8 for Smart Cities. Neural Comput. Appl. 2023, 35, 20939–20954. [Google Scholar] [CrossRef]
  14. Gui, H.; Su, T.; Jiang, X.; Li, L.; Xiong, L.; Zhou, J.; Pang, Z. FS-YOLOv9: A Frequency and Spatial Feature-Based YOLOv9 for Real-Time Breast Cancer Detection. Acad. Radiol. 2025, 32, 1228–1240. [Google Scholar] [CrossRef] [PubMed]
  15. Du, Q.; Zhang, S.; Yang, S. BLP-YOLOv10: Efficient Safety Helmet Detection for Low-Light Mining. J. Real-Time Image Process. 2025, 22, 10. [Google Scholar] [CrossRef]
  16. Chen, Y.; Sun, Y.; Qin, Z.; Wang, Z.; Geng, Y. CSEANet: Cross-Stage Enhanced Aggregation Network for Detecting Surface Bolt Defects in Railway Steel Truss Bridges. Sensors 2025, 25, 3500. [Google Scholar] [CrossRef]
  17. Lee, S.-Y.; Huynh, T.-C.; Park, J.-H.; Kim, J.-T. Bolt-Loosening Detection Using Vision-Based Deep Learning Algorithm and Image Processing Method. J. Comput. Struct. Eng. Inst. Korea 2019, 32, 265–272. [Google Scholar] [CrossRef]
  18. Jia, J.; Li, Y. Deep Learning for Structural Health Monitoring: Data, Algorithms, Applications, Challenges, and Trends. Sensors 2023, 23, 8824. [Google Scholar] [CrossRef]
  19. Jiang, S.; Zhang, J.; Wang, W.; Wang, Y. Automatic Inspection of Bridge Bolts Using Unmanned Aerial Vision and Adaptive Scale Unification-Based Deep Learning. Remote Sens. 2023, 15, 328. [Google Scholar] [CrossRef]
  20. Wang, H.; Li, C.; Wu, Q.; Wang, J. An Improved DETR Based on Angle Denoising and Oriented Boxes Refinement for Remote Sensing Object Detection. Remote Sens. 2024, 16, 4420. [Google Scholar] [CrossRef]
  21. Yang, X.; Zhang, G.; Yang, X.; Zhou, Y.; Wang, W.; Tang, J.; He, T.; Yan, J. Detecting Rotated Objects as Gaussian Distributions and Its 3-d Generalization. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 4335–4354. [Google Scholar] [CrossRef]
  22. Li, P.; Zhu, C. Ro-YOLOv5: One New Detector for Impurity in Wheat Based on Circular Smooth Label. Crop Prot. 2024, 184, 106806. [Google Scholar] [CrossRef]
  23. Westoby, M.J.; Brasington, J.; Glasser, N.F.; Hambrey, M.J.; Reynolds, J.M. ‘Structure-from-Motion’ Photogrammetry: A Low-Cost, Effective Tool for Geoscience Applications. Geomorphology 2012, 179, 300–314. [Google Scholar] [CrossRef]
  24. Chen, W.; Yuan, B.; Chen, D.; Hu, Y.; Wang, F.; Zhang, J. Synchronized Identification and Localization of Defect on the Bottom of Steel Box Girders Based on a Dynamic Visual Perception System. Comput. Ind. 2025, 169, 104291. [Google Scholar] [CrossRef]
  25. Furukawa, Y.; Ponce, J. Accurate, Dense, and Robust Multiview Stereopsis. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 32, 1362–1376. [Google Scholar] [CrossRef]
  26. Wang, W.; Cai, Y.; Wang, T. Multi-View Dual Attention Network for 3D Object Recognition. Neural Comput. Appl. 2022, 34, 3201–3212. [Google Scholar] [CrossRef]
  27. Zhu, J.; Peng, B.; Li, W.; Shen, H.; Huang, Q.; Lei, J. Modeling Long-Range Dependencies and Epipolar Geometry for Multi-View Stereo. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1–17. [Google Scholar] [CrossRef]
  28. Yang, Y.; Zong, J.; Dong, H.; Zhang, X. ERBW-Bolt: A High-Accuracy Model for Train Underbody Bolts Detection in Complex Background. Trans. Inst. Meas. Control 2025, 01423312251364330. [Google Scholar] [CrossRef]
  29. Pan, X.; Yang, T. Image-Based Monitoring of Bolt Loosening through Deep-Learning-Based Integrated Detection and Tracking. Comput.-Aided Civ. Infrastruct. Eng. 2022, 37, 1207–1222. [Google Scholar] [CrossRef]
  30. Gao, J.; Liu, J.; Ji, S. A General Deep Learning Based Framework for 3D Reconstruction from Multi-View Stereo Satellite Images. ISPRS J. Photogramm. Remote Sens. 2023, 195, 446–461. [Google Scholar] [CrossRef]
  31. Luo, H.; Zhang, J.; Liu, X.; Zhang, L.; Liu, J. Large-Scale 3d Reconstruction from Multi-View Imagery: A Comprehensive Review. Remote Sens. 2024, 16, 773. [Google Scholar] [CrossRef]
  32. Cruz-Mota, J.; Bogdanova, I.; Paquier, B.; Bierlaire, M.; Thiran, J.-P. Scale Invariant Feature Transform on the Sphere: Theory and Applications. Int. J. Comput. Vis. 2012, 98, 217–241. [Google Scholar] [CrossRef]
  33. Huang, H.; Zhu, K. Automotive Parts Defect Detection Based on YOLOv7. Electronics 2024, 13, 1817. [Google Scholar] [CrossRef]
  34. Lu, J.; Yu, M.; Liu, J. Lightweight Strip Steel Defect Detection Algorithm Based on Improved YOLOv7. Sci. Rep. 2024, 14, 13267. [Google Scholar] [CrossRef]
  35. Zhang, Z. A Flexible New Technique for Camera Calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1330–1334. [Google Scholar] [CrossRef]
  36. Marelli, D.; Morelli, L.; Farella, E.M.; Bianco, S.; Ciocca, G.; Remondino, F. ENRICH: Multi-purposE Dataset for beNchmaRking In Computer Vision and pHotogrammetry. ISPRS J. Photogramm. Remote Sens. 2023, 198, 84–98. [Google Scholar] [CrossRef]
  37. Almutairi, A.; Warner, T.A. Change Detection Accuracy and Image Properties: A Study Using Simulated Data. Remote Sens. 2010, 2, 1508–1529. [Google Scholar] [CrossRef]
  38. Florio, A.; Avitabile, G.; Talarico, C.; Coviello, G. A Reconfigurable Full-Digital Architecture for Angle of Arrival Estimation. IEEE Trans. Circuits Syst. I Regul. Pap. 2023, 71, 1443–1455. [Google Scholar] [CrossRef]
  39. Sah, D.K.; Nguyen, T.N.; Kandulna, M.; Cengiz, K.; Amgoth, T. 3D Localization and Error Minimization in Underwater Sensor Networks. ACM Trans. Sens. Netw. (TOSN) 2022, 18, 1–25. [Google Scholar] [CrossRef]
  40. Rai, T.; Morisi, A.; Bacci, B.; Bacon, N.J.; Dark, M.J.; Aboellail, T.; Thomas, S.A.; La Ragione, R.M.; Wells, K. Keeping Pathologists in the Loop and an Adaptive F1-Score Threshold Method for Mitosis Detection in Canine Perivascular Wall Tumours. Cancers 2024, 16, 644. [Google Scholar] [CrossRef]
  41. Chen, Z.; Zhang, R.; Chen, Z.; Yuan, B.; Zheng, Y.; Li, K. Expansive Detector via Hybrid Temporal and Transposed Convolutional Mechanism for Weld Proximity Defects. Soft Comput. 2025, 29, 3021–3034. [Google Scholar] [CrossRef]
  42. Zhang, R.; Xie, C.; Deng, L. A Fine-Grained Object Detection Model for Aerial Images Based on Yolov5 Deep Neural Network. Chin. J. Electron. 2023, 32, 51–63. [Google Scholar] [CrossRef]
Figure 1. Comparison between axis-aligned bounding boxes (AABBs) and rotated bounding boxes (RBBs) for bolt orientation representation.
Figure 2. Improved YOLOv9-based Bolt Loosening Detection System.
Figure 3. Collage of dataset samples, annotations, and model predictions illustrating bolt loosening detection under diverse environmental conditions, occlusion levels, and loosening angles.
Figure 4. Model accuracy change during training epochs.
Figure 5. Comparison of predicted and ground-truth bolt loosening angles.
Figure 6. Bolt loosening probability distribution in 3D space.
Figure 7. Typical failure cases of the proposed method under severe occlusion and strong specular reflection: (a) an almost fully hidden bolt (occlusion rate > 80%), where adjacent steel plates block most of the bolt head, resulting in failed detection (FN) due to insufficient feature extraction; (b) a bolt under strong specular reflection, where glare masks the cross-slot texture of the bolt head, leading to an angle estimation error of 18.7° (ground truth: 45°, prediction: 63.7°).
Table 1. Comparative summary of methodological approaches for bolt loosening detection and localization.
Category | Core Techniques | Advantages | Limitations for Bolt Loosening Tasks
--- | --- | --- | ---
Traditional Image Processing (Section 2.1) | Edge detection, template matching, Hough transform-based circle detection | Simple implementation; no training data required; computationally light | Poor robustness to lighting/occlusion variations; no rotation angle estimation capability; limited to 2D image space; cannot assess bolt tightness quantitatively
Deep Learning-Based Detection (Section 2.2) | YOLO series (v3–v9), rotation-aware detection heads, attention mechanisms | High detection accuracy; robust to environmental variations; real-time capability; recent YOLOv9 with PGI for small objects | Standard YOLO outputs axis-aligned boxes (no orientation); limited 3D spatial context; separate modules needed for loosening assessment; angle discontinuity issues in rotation regression
Multi-View 3D Reconstruction (Section 2.3) | Structure from Motion (SfM), epipolar geometry, feature matching (SIFT) | Provides 3D spatial information; accurate 3D localization; handles occlusion via multiple perspectives; metric measurements possible | Computationally intensive; requires feature matching across views; lacks dedicated bolt rotation estimation; small object (bolt) registration challenging
Hybrid Approaches (Section 2.4) | Combination of DL detection + traditional CV, multi-stage pipelines | Leverages strengths of multiple methods; improved overall system performance; more robust to complex conditions | Lack of tight integration between components; detection, angle estimation, and 3D localization often separate; no unified loosening assessment metric; error propagation across stages
Table 2. Performance comparison with state-of-the-art methods.
Method | DA (%) | AEE (°) | LE (mm) | F1-Score
--- | --- | --- | --- | ---
YOLOv8 | 86.2 | 12.5 | 8.7 | 0.78
Rotated Faster R-CNN | 88.7 | 9.8 | 7.2 | 0.82
CSEANet | 90.1 | 8.3 | 6.5 | 0.85
Ours | 93.6 | 5.2 | 3.8 | 0.91
Table 3. Ablation study of proposed components.
Configuration | DA (%) | AEE (°) | LE (mm)
--- | --- | --- | ---
Baseline YOLOv9 | 87.3 | 11.2 | 7.9
+Angle Head | 89.5 | 8.7 | 7.1
+CA Module | 91.2 | 7.3 | 5.8
+Multi-view Fusion | 93.6 | 5.2 | 3.8
Table 4. Quantitative Performance Contribution of Key Components.
Added Component | Detection Accuracy Improvement (ΔDA, %) | Angle Estimation Error Reduction (ΔAEE, °) | 3D Localization Error Reduction (ΔLE, mm) | Contribution Ratio (% of Total Improvement)
--- | --- | --- | --- | ---
Angle-Aware Detection Head | 2.2 (87.3 → 89.5) | 2.5 (11.2 → 8.7) | 0.8 (7.9 → 7.1) | 31.4%
Coordinate Attention (CA) Module | 1.7 (89.5 → 91.2) | 1.4 (8.7 → 7.3) | 1.3 (7.1 → 5.8) | 24.3%
Multi-View Fusion Module | 2.4 (91.2 → 93.6) | 2.1 (7.3 → 5.2) | 2.0 (5.8 → 3.8) | 44.3%
Total Improvement (Baseline → Ours) | 6.3 (87.3 → 93.6) | 6.0 (11.2 → 5.2) | 4.1 (7.9 → 3.8) | 100%
Table 5. Performance Under Specific Challenging Conditions.
Challenging Condition | Detection Accuracy (DA, %) | Angle Estimation Error (AEE, °) | Sample Image Count
--- | --- | --- | ---
High Occlusion (50–70%) | 86.8 | 10.2 | 200
Low-Light (<100 lux) | 92.1 | 6.3 | 200
Specular Reflection (Glare) | 88.5 | 9.1 | 200
Complex Background Clutter | 90.7 | 7.5 | 200
Overall (Mixed Challenges) | 91.2 | 4.8 | 800
Note: The “Overall” row corresponds to the comprehensive real-world validation results reported in Section 5.4.
Table 6. Performance under Different Environmental Conditions.
Environmental Condition | Detection Accuracy (DA, %) | Angle Estimation Error (AEE, °) | 3D Localization Error (LE, mm) | Description
--- | --- | --- | --- | ---
Indoor Standard Lighting (No Occlusion) | 95.3 | 4.1 | 3.2 | Ideal experimental environment, corresponding to indoor non-occluded samples in the dataset
Outdoor Natural Lighting (No Occlusion) | 94.7 | 4.5 | 3.5 | Natural daylight without occlusion, corresponding to outdoor samples in the dataset
Low-Light Environment (No Occlusion) | 92.1 | 6.3 | 4.2 | Low-light scenario, based on the low-light samples in the dataset
Moderate Occlusion (30–50%) | 90.5 | 7.8 | 5.1 | Partial occlusion scenario, based on the occluded samples in the dataset
Severe Occlusion (50–70%) | 86.8 | 10.2 | 6.7 | High occlusion scenario; the performance degradation under this condition is discussed in Section 6.1
Real Construction Site (Comprehensive Environment) | 91.2 | 4.8 | 4.3 | Real-world validation results from Section 5.4, including complex lighting and slight occlusion
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
