1. Introduction
Bolt-nut fasteners are critical components in mechanical connections and are widely used in power systems, rail transit, aerospace, and civil engineering. They play a fundamental role in maintaining the structural integrity of equipment components and engineering systems. However, under prolonged mechanical loads, environmental corrosion, and complex operating conditions, bolt-nut fasteners may experience loosening or failure, posing potential structural safety risks [
1]. For instance, post-incident investigation reports on the 2021 Mexico City subway viaduct collapse indicated that missing bolts at steel beam connections, along with welding defects, were among the contributing factors to the structural failure [
2]. This case highlights that the integrity of bolt-nut fasteners is closely related to the safety and reliability of large-scale engineering structures. Within power systems, substations act as critical hubs where operational stability directly impacts grid security. Consequently, automated detection of bolt-nut fasteners is essential, serving not only as a prerequisite for identifying loosening and defects but also as a key enabler for intelligent substation inspection and grid safety assurance [
3].
Traditionally, substation maintenance has relied heavily on manual inspection. However, this approach is inherently subjective and often suffers from inefficiencies, high labor intensity, and a substantial risk of missed detections [
4]. The inspection challenge is compounded by the physical characteristics of the targets: bolts and nuts are typically minute in size and situated in concealed or inaccessible locations. For instance, similar to the difficulties encountered in rail [
5] and steel bridge inspections [
6], where fasteners are distributed across elevated or structurally cluttered locations, bolt-nut fasteners in substations are often mounted on high-altitude pylons or concealed bases where comprehensive manual coverage is nearly impossible. Moreover, real-world interferences—such as visual obstructions, variable viewing angles, and uneven lighting—further increase inspection uncertainty [
7]. These limitations underscore the urgent need for automated computer vision–based detection methods for bolt-nut fasteners to replace or augment traditional manual methods.
Convolutional neural networks (CNNs) have demonstrated strong capability in object detection tasks by learning hierarchical feature representations. With the continuous improvement of GPU computing resources, the performance of CNN-based detection models has been significantly enhanced [
8]. Existing deep learning detection frameworks are commonly categorized into two types: two-stage detectors and single-stage detectors [
9]. Two-stage algorithms, exemplified by the R-CNN series, rely on region proposal mechanisms. For example, Lee et al. [
10] combined R-CNN detection with geometric image processing to quantify bolt loosening angles. Pham et al. [
11] proposed a framework integrating Faster R-CNN with graphical models, using Canny edge detection and the Hough transform to track bolt angles and loosening trends, while employing synthetic data to address scarcity. Zhang et al. [
12] developed a 1D Deep Convolutional Neural Network (1D-DCNN) to process raw vibration signals directly, demonstrating the noise resistance of non-visual methods. VijayaNirmala et al. [
13] utilized CNNs for nut-bolt dimension detection and classification, facilitating automated industrial sorting. While these algorithms offer high accuracy, they are computationally intensive and slow. Conversely, single-stage algorithms, such as SSD and the YOLO (You Only Look Once) series, prioritize speed. Zou et al. [
14] enhanced YOLOv5 with BiFPN and coordinate attention to improve small bolt detection in complex transmission line backgrounds. Hua et al. [
15] proposed an improved YOLOv8 algorithm integrating Self-Calibrating Convolutions (SCConv) and Bilayer Routing Attention (BRA) to address occlusion and small target detection. Yang et al. [
16] developed a quantized YOLOv5s-based method for missing bolt detection, optimizing for edge devices like the Jetson Nano. Li et al. [
17] introduced the YOLO-FDD network with an Attention Fusion Feature Pyramid Network (AF-FPN) and Swin Transformer to detect minute defects in aircraft skin fasteners. The primary advantage of single-stage algorithms is their real-time capability, though they historically trail two-stage models in detection accuracy.
Despite these advancements, practical application in substations remains challenging. First, long-distance or wide-field-of-view imaging causes bolt-nut fasteners to occupy only a limited number of pixels in the image. Critical structural features often degrade during subsampling, increasing the risk of missed detections. Second, industrial environments introduce complex lighting, strong background interference, and noise from oil and dust, all of which significantly degrade image quality. While YOLOv8 excels in balancing speed and accuracy, its performance requires optimization to effectively handle the specific challenges of small targets in noisy environments [
18]. In response to these challenges, several studies have explored improved solutions for small-object detection. For instance, Lou et al. [
19] improved YOLOv8’s downsampling and feature fusion for small object detection, while Yao et al. [
20] developed HP-YOLOv8 for remote sensing, utilizing a C2f-D-Mixer and dual-layer routing attention.
In response to the above challenges, an improved lightweight detection framework termed YOLOv8n-ALC is developed based on the YOLOv8n architecture. The proposed model is tailored for accurate identification of bolt–nut fasteners in substation inspection scenarios. The primary contributions of this study are summarized as follows:
Development of the C2f-AC Feature Extraction Module: We reconstructed the C2f unit in the backbone using the AdditiveBlock from CAS-ViT and integrated a CGLU context gating mechanism. This design maintains computational efficiency while enhancing fine-grained feature modeling through additive attention and gating strategies, effectively mitigating complex background noise interference.
Design of the SPPF-LSKA Spatial Enhancement Module: Inspired by large kernel attention mechanisms, we introduced an improved Large Separable Kernel Attention (LSKA) unit after the SPPF module. By implementing spatially adaptive weighting, the model prioritizes discriminative regions and suppresses redundant background responses, improving high-level feature representation in complex scenes.
Proposal of the CGRFPN Neck Network: We designed a Context-Guided Reconstruction Feature Pyramid Network (CGRFPN) to replace the original neck architecture. By introducing cross-layer attention and global context guidance, this network enhances collaborative feature expression across scales. This allows high-level semantic information to more effectively guide low-level detail features, improving the stability of small target localization while maintaining real-time performance.
3. Method
To enhance small object detection of bolt-nut fasteners in complex substation environments, we propose YOLOv8n-ALC, an improved model based on YOLOv8n. The network architecture is illustrated in
Figure 1. The proposed method integrates three core components: the C2f with AdditiveBlock and Convolutional Gated Linear Unit (C2f-AC) module, the Spatial Pyramid Pooling-Fast integrated with Large Separable Kernel Attention (SPPF-LSKA) module, and the Context-Guided Reconstruction Feature Pyramid Network (CGRFPN) neck network.
3.1. C2f-AC
To improve the backbone’s ability to represent small bolt–nut fasteners, the CSP Bottleneck with 2 Convolutions (C2f) module in YOLOv8 was redesigned. Inspired by the CAS-ViT architecture [
37], the Bottleneck units in the original C2f structure were replaced with Additive Blocks to construct an enhanced feature extraction module. This modification strengthens local detail modeling, helps preserve important spatial information during hierarchical feature learning, and promotes cross-region feature interaction while maintaining low computational overhead. To further improve feature selectivity and background noise suppression, we incorporate the Convolutional Gated Linear Unit (CGLU) [
38]. Specifically, the MLP branch within the Additive Block is replaced by a gated convolutional hybrid unit, forming the C2f with the AdditiveBlock and Convolutional Gated Linear Unit (C2f-AC) module. The core component of this module is Additive Block–CGLU (Add-CGLU). This design employs a three-stage residual path—Local Perception (LP), Additive Token Mixing, and Gated Channel/Spatial Modulation—to adaptively modulate feature responses. It strengthens key region representations while suppressing interference from complex background textures, thereby providing a stable foundation for subsequent neck fusion and detection head prediction. The overall Add-CGLU architecture is illustrated in
Figure 2.
3.1.1. LP Mechanism
The LP mechanism constructs an efficient channel-aware unit using sequential pointwise convolutions. It employs three cascaded convolutional layers, Batch Normalization (
BN), and the Gaussian Error Linear Unit (GELU) activation function to perform high-dimensional channel remapping and nonlinear transformation. This design promotes deep channel interaction and fusion while incurring extremely low computational cost. Given an input feature x, the feature transformation process of the LP mechanism can be expressed as

$$x_{\mathrm{LP}} = \mathrm{GELU}\big(\mathrm{BN}\big(W(x)\big)\big)$$

In the formula, $x_{\mathrm{LP}}$ represents the LP output, $W(\cdot)$ denotes the weight function of the cascaded pointwise convolutions, BN stands for Batch Normalization, and GELU denotes the Gaussian Error Linear Unit activation function.
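To make the LP mechanism concrete, the following PyTorch sketch implements it as three cascaded pointwise (1 × 1) convolutions with BN and GELU; the hidden expansion ratio is an illustrative assumption, not a value taken from this paper.

```python
import torch
import torch.nn as nn

class LocalPerception(nn.Module):
    """Sketch of the LP mechanism: three cascaded pointwise convolutions
    with BatchNorm and GELU for channel remapping. The hidden width
    (hidden_ratio) is an illustrative assumption."""
    def __init__(self, channels: int, hidden_ratio: int = 2):
        super().__init__()
        hidden = channels * hidden_ratio
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, hidden, kernel_size=1),
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # channel remapping and nonlinear transformation, shape-preserving
        return self.block(x)
```

Because all convolutions are 1 × 1, the spatial resolution is untouched and the extra cost stays low, matching the design goal stated above.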
3.1.2. Convolutional Additive Token Mixer
Following preliminary perception and transformation via the LP mechanism, input features are fed into the Convolutional Additive Token Mixer (CATM) for context modeling. The input features are first mapped into three representations—Query (
Q), Key (
K), and Value (
V)—via a
BN layer and independent linear transformations. For simplicity, the input features are denoted as x in the following discussion:

$$Q = W_{Q}(\mathrm{BN}(x)), \quad K = W_{K}(\mathrm{BN}(x)), \quad V = W_{V}(\mathrm{BN}(x))$$

$$\Phi(Q, K) = M\big(\Pi_{s}(Q), \Pi_{c}(K)\big)$$

where $\Phi(\cdot)$ denotes the context mapping function, $\Pi_{s}(\cdot)$ represents the spatial attention mechanism, and $\Pi_{c}(\cdot)$ indicates the channel attention operation. The fusion function M performs element-wise additive integration of the context-enhanced Query and Key representations, enabling efficient aggregation of spatial–channel contextual information.
The structural diagram of the spatial attention branch and channel attention branch is shown in
Figure 3.
CATM utilizes an Additive Attention mechanism to efficiently capture contextual dependencies across spatial and channel dimensions. By combining Token Mixer outputs through element-wise superposition, the architecture jointly models feature responses in both dimensions, enabling comprehensive attention modulation. Compared to traditional dot-product self-attention, additive attention significantly reduces computational complexity by avoiding the quadratic overhead of large-scale matrix multiplication while still effectively modeling global context. The final output of CATM can be expressed as

$$x_{\mathrm{CATM}} = W\big(\Phi(Q, K) \odot V\big)$$

Here, $W(\cdot)$ denotes a linear mapping function that integrates and remaps the contextual features obtained from the additive attention modeling, thereby promoting information fusion and collaborative expression among different feature branches.
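A minimal PyTorch sketch of CATM follows. The concrete forms of the two attention branches (a depthwise-convolution spatial gate and a squeeze-and-excitation-style channel gate) are assumptions chosen for illustration; only the overall structure (Q/K/V projections, additive fusion, modulation of V, and a final linear remapping) follows the description above.

```python
import torch
import torch.nn as nn

class SpatialAttn(nn.Module):
    # Depthwise conv followed by a sigmoid spatial gate (assumed form).
    def __init__(self, c: int):
        super().__init__()
        self.dw = nn.Conv2d(c, c, 3, padding=1, groups=c)
        self.gate = nn.Conv2d(c, 1, 1)

    def forward(self, x):
        return x * torch.sigmoid(self.gate(self.dw(x)))

class ChannelAttn(nn.Module):
    # Squeeze-and-excitation style channel gate (assumed form).
    def __init__(self, c: int, r: int = 4):
        super().__init__()
        h = max(c // r, 1)
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c, h, 1), nn.GELU(),
            nn.Conv2d(h, c, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)

class CATM(nn.Module):
    """Additive token mixer: spatial attention on Q, channel attention on K,
    element-wise additive fusion, modulation of V, final 1x1 remapping."""
    def __init__(self, c: int):
        super().__init__()
        self.norm = nn.BatchNorm2d(c)
        self.q = nn.Conv2d(c, c, 1)
        self.k = nn.Conv2d(c, c, 1)
        self.v = nn.Conv2d(c, c, 1)
        self.sa = SpatialAttn(c)
        self.ca = ChannelAttn(c)
        self.proj = nn.Conv2d(c, c, 1)

    def forward(self, x):
        xn = self.norm(x)
        q, k, v = self.q(xn), self.k(xn), self.v(xn)
        ctx = self.sa(q) + self.ca(k)   # additive fusion of the two branches
        return self.proj(ctx * v)       # modulate V, then linear remapping
```

Note that the additive fusion costs only an element-wise sum, avoiding the quadratic token-by-token similarity matrix of dot-product attention.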
3.1.3. Convolutional Gated Linear Unit
Features enhanced by CATM are then input into the Convolutional Gated Linear Unit (CGLU) for modulation. CGLU combines depthwise separable convolutions with a gating mechanism to adaptively modulate and reweight feature responses. Specifically, CGLU employs a dual-branch architecture comprising BN, a
depthwise separable convolution, and GELU activation to perform parallel transformations on the input features: one branch generates the gating signal, while the other produces the feature representation to be modulated. Finally, the outputs from both branches are fused through element-wise multiplication, with the computational process expressed as

$$x_{\mathrm{CGLU}} = \mathrm{GELU}\big(\mathrm{DWConv}\big(W_{1}(\mathrm{BN}(x))\big)\big) \odot W_{2}(\mathrm{BN}(x))$$

Specifically, $W_{1}$ and $W_{2}$ correspond to the weight parameters on the two branches, and $\odot$ denotes element-wise multiplication. CGLU employs a gating mechanism to adaptively modulate features, enhancing feature selectivity with minimal additional computational overhead, thereby improving the model's robustness.
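The dual-branch gating can be sketched in PyTorch as below; the hidden width and the 3 × 3 depthwise kernel size are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CGLU(nn.Module):
    """Convolutional Gated Linear Unit sketch: a value branch and a gate
    branch (with depthwise conv + GELU), fused by element-wise product.
    Hidden width is an illustrative assumption."""
    def __init__(self, c: int, hidden_ratio: int = 2):
        super().__init__()
        h = c * hidden_ratio
        self.norm = nn.BatchNorm2d(c)
        self.value = nn.Conv2d(c, h, 1)               # features to modulate
        self.gate = nn.Sequential(                    # gating signal branch
            nn.Conv2d(c, h, 1),
            nn.Conv2d(h, h, 3, padding=1, groups=h),  # depthwise conv
            nn.GELU(),
        )
        self.proj = nn.Conv2d(h, c, 1)

    def forward(self, x):
        xn = self.norm(x)
        # element-wise multiplication fuses the two parallel branches
        return self.proj(self.value(xn) * self.gate(xn))
```

The gate adaptively reweights each spatial position and channel, which is what suppresses background-texture responses at negligible extra cost.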
3.2. SPPF-LSKA
In substation inspection scenarios, bolt-nut fasteners are often tiny and visually similar to complex background structures. As network depth increases, fine-grained shallow features tend to degrade during consecutive downsampling and feature fusion, thereby limiting localization accuracy for small bolt-nut fasteners. YOLOv8 employs a Spatial Pyramid Pooling–Fast (SPPF) module at the end of the backbone. Fundamentally, max pooling functions as a subsampling strategy based on local maxima; it prioritizes the most prominent features within the receptive field while suppressing texture and edge information in non-maximal regions. Under strong background interference, this high-frequency information filtering may weaken responses to critical features such as thread edges and nut contours, compromising the stability of small bolt-nut fasteners.
To address this limitation, we introduce the Large Separable Kernel Attention (LSKA) mechanism to optimize the original architecture [
39]. This approach strikes a balance between large receptive field modeling capability and computational cost. While traditional Large Kernel Attention (LKA) effectively captures long-range dependencies, its reliance on large two-dimensional convolutions incurs excessive parameters and memory overhead, making it ill-suited for lightweight networks [
40]. LSKA decomposes the two-dimensional convolutional kernel into cascaded horizontal and vertical one-dimensional depthwise separable convolutions. By incorporating dilated convolutions, this approach maintains effective long-range context modeling while significantly reducing computational overhead. As a result, spatial perception capability is enhanced while overall model efficiency is preserved.
Let the input feature map be denoted as $F \in \mathbb{R}^{C \times H \times W}$. LSKA models features in both horizontal and vertical directions via cascaded one-dimensional separable convolutions, with dilated convolutions employed to further expand the effective receptive field. For channel C, the separable convolution process in the first stage can be expressed as:

$$Z^{C} = W_{(2d-1) \times 1} * \big(W_{1 \times (2d-1)} * F^{C}\big)$$

Here, $W_{1 \times (2d-1)}$ and $W_{(2d-1) \times 1}$ represent the depthwise convolution kernels along the horizontal and vertical directions, respectively, and d denotes the dilation rate.
The receptive field is further expanded with minimal additional parameters by introducing a second-stage dilated separable convolution:

$$\bar{Z}^{C} = W_{\lfloor k/d \rfloor \times 1} * \big(W_{1 \times \lfloor k/d \rfloor} * Z^{C}\big)$$

Here, $k$ denotes the size of the convolution kernel, and $\lfloor \cdot \rfloor$ represents the floor operation.
Subsequently, a $1 \times 1$ convolution is applied to fuse multi-scale contextual information and generate a spatial attention weight map:

$$A^{C} = W_{1 \times 1} * \bar{Z}^{C}$$
Finally, the attention weights are applied to the original features through element-wise multiplication:

$$\bar{F}^{C} = A^{C} \odot F^{C}$$
This structural design reduces the computational burden of large-kernel convolutions while preserving their long-range dependency modeling capability, making it well suited for detecting densely distributed small objects.
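The kernel decomposition maps directly onto code. In the hedged PyTorch sketch below, k = 23 and d = 3 are illustrative choices; the padding terms keep the spatial size unchanged for both the plain and the dilated 1D depthwise convolutions.

```python
import torch
import torch.nn as nn

class LSKA(nn.Module):
    """Large Separable Kernel Attention sketch for kernel size k and
    dilation d: stage 1 uses 1D depthwise convolutions of length 2d - 1;
    stage 2 uses dilated 1D depthwise convolutions of length floor(k / d);
    a 1x1 convolution then produces the attention map applied to the input."""
    def __init__(self, c: int, k: int = 23, d: int = 3):
        super().__init__()
        k1 = 2 * d - 1          # stage-1 kernel length
        k2 = k // d             # stage-2 (dilated) kernel length
        self.h1 = nn.Conv2d(c, c, (1, k1), padding=(0, k1 // 2), groups=c)
        self.v1 = nn.Conv2d(c, c, (k1, 1), padding=(k1 // 2, 0), groups=c)
        self.h2 = nn.Conv2d(c, c, (1, k2), padding=(0, (k2 // 2) * d),
                            dilation=d, groups=c)
        self.v2 = nn.Conv2d(c, c, (k2, 1), padding=((k2 // 2) * d, 0),
                            dilation=d, groups=c)
        self.pw = nn.Conv2d(c, c, 1)  # 1x1 fusion producing attention weights

    def forward(self, x):
        a = self.v1(self.h1(x))   # stage 1: horizontal then vertical
        a = self.v2(self.h2(a))   # stage 2: dilated horizontal then vertical
        return self.pw(a) * x     # element-wise re-weighting of the input
```

With k = 23 and d = 3, the two stages use 5-tap and 7-tap 1D kernels, so the parameter count grows linearly in k rather than quadratically as in a dense 23 × 23 convolution.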
To mitigate spatial information loss caused by max pooling in SPPF, we integrate LSKA into the SPPF architecture, forming the SPPF-LSKA module, as illustrated in
Figure 4. After concatenating multi-scale pooled features, LSKA performs spatial re-weighting on the fused features, followed by feature integration and channel compression via subsequent convolutional layers. This design preserves the lightweight and multi-scale aggregation characteristics of SPPF while mitigating spatial detail loss introduced by max pooling. By leveraging separable large-kernel modeling, LSKA expands spatial context coverage. Compared to direct large-kernel two-dimensional convolutions, this approach effectively controls additional computational overhead and enhances spatial discrimination and localization stability for small bolt-nut fasteners.
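A self-contained sketch of this integration follows, combining a YOLOv8-style SPPF (three cascaded 5 × 5 max-pooling operations) with LSKA re-weighting of the concatenated features; kernel sizes and channel widths here are assumptions, not the exact published configuration.

```python
import torch
import torch.nn as nn

class LSKA(nn.Module):
    # Separable large-kernel attention (compact sketch; k, d illustrative).
    def __init__(self, c: int, k: int = 23, d: int = 3):
        super().__init__()
        k1, k2 = 2 * d - 1, k // d
        self.h1 = nn.Conv2d(c, c, (1, k1), padding=(0, k1 // 2), groups=c)
        self.v1 = nn.Conv2d(c, c, (k1, 1), padding=(k1 // 2, 0), groups=c)
        self.h2 = nn.Conv2d(c, c, (1, k2), padding=(0, (k2 // 2) * d),
                            dilation=d, groups=c)
        self.v2 = nn.Conv2d(c, c, (k2, 1), padding=((k2 // 2) * d, 0),
                            dilation=d, groups=c)
        self.pw = nn.Conv2d(c, c, 1)

    def forward(self, x):
        return self.pw(self.v2(self.h2(self.v1(self.h1(x))))) * x

class SPPF_LSKA(nn.Module):
    """SPPF with LSKA re-weighting after concatenation, then 1x1
    fusion/compression, following the structure described above."""
    def __init__(self, c_in: int, c_out: int, pool_k: int = 5):
        super().__init__()
        c_mid = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_mid, 1)
        self.pool = nn.MaxPool2d(pool_k, stride=1, padding=pool_k // 2)
        self.lska = LSKA(4 * c_mid)
        self.cv2 = nn.Conv2d(4 * c_mid, c_out, 1)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        # spatially re-weight the multi-scale concatenation before fusing
        return self.cv2(self.lska(torch.cat([x, y1, y2, y3], dim=1)))
```

Placing LSKA between concatenation and channel compression is what restores spatial selectivity that the max-pooling stages discard.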
3.3. CGRFPN
In substation inspection scenarios, bolt-nut fasteners are typically small in scale and embedded in complex mechanical structures. Due to illumination variations, occlusion, corrosion, and background clutter, their discriminative visual features are easily overwhelmed. In lightweight detectors such as YOLOv8n, high-level features provide strong semantic information but suffer from spatial detail degradation, whereas low-level features preserve fine-grained textures but lack sufficient global semantic constraints. When these features are fused using conventional PAN-FPN with linear up/down sampling and concatenation, cross-scale semantic misalignment and cumulative background interference are likely to occur, limiting localization accuracy and detection robustness for bolt-nut fasteners.
To address these limitations, we propose a Context-Guided Reconstruction Feature Pyramid Network (CGRFPN) as the neck architecture. CGRFPN is designed to enhance cross-scale feature consistency and contextual awareness while maintaining computational efficiency. It consists of four key components: Pyramid Context Extraction (PCE), Rectangular Self-Calibration Module (RCM), Multi-Scale Feature Fusion Block (FBM), and Dynamic Interpolation Fusion (DIF), as illustrated in
Figure 5.
Specifically, PCE aggregates multi-level backbone features to construct a unified pyramid context representation, providing global semantic guidance for feature reconstruction. RCM incorporates coordinate attention and ConvMLP to calibrate orientation- and position-sensitive features. FBM performs context-guided, scale-weighted fusion to enhance responses in discriminative regions while suppressing background noise. DIF further mitigates spatial misalignment during cross-scale fusion through adaptive interpolation, ensuring spatial consistency across feature maps. Together, these modules enable CGRFPN to deliver more stable and informative feature representations for detecting bolt-nut fasteners in complex environments.
3.3.1. Pyramid Context Extraction
CGRFPN first employs the Pyramid Context Extraction (PCE) module to perform scale alignment and channel fusion on features from different backbone layers. This constructs multi-scale contextual representations while maintaining computational efficiency, providing global semantic guidance for subsequent feature reconstruction. Subsequently, the Rectangular Self-Calibration Module (RCM) integrates a Coordinate Attention mechanism and ConvMLP. Coordinate Attention enhances the representation of target geometry and positional information by decoupling horizontal and vertical spatial information. ConvMLP strengthens cross-channel dependency modeling, facilitating multi-scale feature interactions to improve representation stability and consistency.
3.3.2. Context-Guided Feature Reconstruction
Following context extraction, CGRFPN performs adaptive cross-scale fusion and reconstruction via the Multi-Scale Feature Fusion Block (FBM) and the Dynamic Interpolation Fusion (DIF) module. The FBM module facilitates context-guided feature reconstruction by utilizing global context from PCE to perform weighted fusion on current-level features using spatial weights, thereby amplifying key region responses while suppressing redundant or distracting information. This context-aware gated fusion helps establish robust semantic associations between multi-scale features. The DIF module mitigates spatial misalignment during cross-scale feature fusion. By introducing an adaptive interpolation mechanism, DIF aligns upsampled features with current-scale features, reducing inconsistencies caused by scale differences and enhancing the stability and continuity of the multi-scale feature fusion process. Overall, CGRFPN enhances the neck network’s fusion capability with minimal additional computational overhead, providing stable and information-rich feature representations for the detection head.
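As an illustration of the alignment step, the sketch below implements a dynamic interpolation fusion in PyTorch; the learned sigmoid gate used to blend the two scales is an assumption about the exact fusion rule, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DIF(nn.Module):
    """Dynamic Interpolation Fusion sketch: upsample the high-level map to
    the low-level resolution, then blend the two maps with a learned
    per-position gate (the gate form is an assumption)."""
    def __init__(self, c: int):
        super().__init__()
        self.gate = nn.Conv2d(2 * c, 1, 1)

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # align spatial sizes before fusion to avoid cross-scale misalignment
        up = F.interpolate(high, size=low.shape[-2:], mode="bilinear",
                           align_corners=False)
        w = torch.sigmoid(self.gate(torch.cat([low, up], dim=1)))
        return w * low + (1.0 - w) * up
```

The gate lets the network decide, position by position, whether fine-grained low-level detail or upsampled semantic context should dominate.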
4. Experiments
4.1. Evaluation Metrics
To rigorously evaluate the detection accuracy and computational efficiency of the proposed algorithm in complex substation scenarios, this study employs a comprehensive set of evaluation metrics, including Precision (P), Recall (R), Average Precision (AP), Mean Average Precision (mAP), parameter count (Params), and floating-point operations (GFLOPs).
Precision measures the correctness of positive predictions, defined as the ratio of true positive samples to all predicted positives. It reflects the model's ability to minimize false positives. Recall measures the coverage of actual positive samples, indicating the model's capability to reduce missed detections (false negatives). The calculation formulas are as follows:

$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}$$

Here, TP, FP, and FN represent the number of True Positives, False Positives, and False Negatives, respectively, under a predefined Intersection over Union (IoU) threshold.
Typically, there exists a trade-off between Precision and Recall. To evaluate performance across varying confidence thresholds, Average Precision (AP) is introduced. Defined as the area under the Precision-Recall (P-R) curve, AP characterizes detection performance for a specific category. The formula is as follows:

$$AP = \int_{0}^{1} P(R) \, dR$$
Based on this, Mean Average Precision (mAP) is calculated as the arithmetic mean of AP values across all categories, providing a unified metric for overall performance. For a detection task with n classes, the mAP formula is

$$\mathrm{mAP} = \frac{1}{n} \sum_{i=1}^{n} AP_{i}$$

where $AP_{i}$ denotes the Average Precision of the $i$-th category, and n represents the total number of classes. Specifically, mAP@0.5 represents the mAP calculated at an IoU threshold of 0.5, which is widely adopted in industrial object detection tasks. Considering the practical requirements of detection reliability and real-time performance in substation inspection scenarios, multiple evaluation metrics—including Precision, Recall, F1-score, and mAP@0.5—are adopted to comprehensively assess detection performance. Among these metrics, Recall reflects the model's ability to reduce missed detections, Precision measures the correctness of predicted targets, and mAP@0.5 provides an overall evaluation of detection accuracy.
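The per-image counting metrics reduce to a few lines of Python. In the usage line, TP = 87, FP = 6, FN = 13 are hypothetical counts chosen for illustration (they happen to yield precision near 0.935 and recall 0.870, in the range reported later); they are not taken from the paper's tables.

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, Recall, and F1 from matched detection counts at a fixed
    IoU threshold, following the definitions above."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Hypothetical counts for illustration:
p, r, f1 = detection_metrics(tp=87, fp=6, fn=13)
# p ≈ 0.9355, r = 0.8700, f1 ≈ 0.9016
```

mAP additionally requires sweeping the confidence threshold to trace the P-R curve per class, which is omitted here for brevity.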
In practical inspection systems, the confidence score associated with each predicted bounding box can be interpreted as the posterior confidence that the predicted region corresponds to a true bolt-nut fastener. Although standard metrics such as Precision, Recall, F1-score, and mAP quantitatively evaluate detection performance, confidence scores also play an important role in practical maintenance decision-making. In engineering applications, predefined confidence thresholds are commonly used to balance detection reliability and inspection workload. High-confidence detections can be directly accepted by automated inspection systems, whereas low-confidence predictions may be flagged for manual verification to reduce the risk of erroneous decisions. This mechanism enables automated inspection systems to maintain reliable detection performance while ensuring practical usability in real-world industrial environments.
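The thresholding policy described here can be sketched as a simple triage function; the 0.7 and 0.3 thresholds are illustrative assumptions, not values prescribed by the paper.

```python
def triage_detections(detections, accept_thr=0.7, review_thr=0.3):
    """Split predicted boxes into auto-accepted, manual-review, and
    discarded groups by confidence score (thresholds are assumptions)."""
    accepted, review, discarded = [], [], []
    for det in detections:
        if det["conf"] >= accept_thr:
            accepted.append(det)       # trusted by the automated system
        elif det["conf"] >= review_thr:
            review.append(det)         # flagged for manual verification
        else:
            discarded.append(det)      # treated as background noise
    return accepted, review, discarded
```

Raising `accept_thr` trades inspection workload for reliability: fewer false alarms reach the automated pipeline, but more predictions fall back to manual review.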
Furthermore, from the perspective of engineering maintenance, the implications of detection metrics extend beyond purely numerical evaluation. A missed detection (False Negative), which directly reduces Recall, indicates that an existing bolt–nut fastener is not identified by the model. In critical infrastructures such as substations or nuclear power facilities, missing a potentially defective fastener may result in undetected structural risks and compromise operational safety. In contrast, a false positive detection (False Positive), which lowers Precision, incorrectly identifies background structures as targets. Although such errors do not directly threaten structural safety, they may trigger unnecessary manual inspections, thereby increasing maintenance workload and operational costs. Therefore, maintaining high and balanced Precision and Recall is essential for reliable automated inspection in real-world industrial environments, while metrics such as the F1-score and mAP provide comprehensive indicators for evaluating overall detection performance.
In addition to accuracy metrics, model complexity is evaluated using Params and GFLOPs. Params reflects the model’s memory footprint, while GFLOPs quantify the computational cost during inference. These efficiency-related metrics are critical for assessing the practical deployability of detection models on resource-constrained edge devices commonly used in substation inspection systems.
4.2. Experimental Setup
4.2.1. Experimental Platform and Parameter Configuration
Experiments were performed on a standardized computing platform. The hardware configuration included an Intel Xeon Platinum 8474C CPU and a single NVIDIA GeForce RTX 4090D GPU (24 GB VRAM) for accelerated training. The software environment was based on Linux, utilizing Python 3.8 and the PyTorch 2.0.0 framework with CUDA 11.8.
For training, input images were resized to a fixed input resolution. The batch size was set to 64 to balance training efficiency and convergence stability. The model was trained for 300 epochs using the Stochastic Gradient Descent (SGD) optimizer. Initial hyperparameters were set as follows: learning rate of 0.01, momentum of 0.937, and weight decay of 0.0005, as detailed in
Table 1.
4.2.2. Dataset Preparation
To validate the robustness of the proposed algorithm for bolt-nut fastener detection under unstructured industrial environments, we employed the public NPU-BOLT dataset for training and evaluation. The original NPU-BOLT dataset contains 337 images with 1275 annotated bolt-nut fastener instances. The dataset was randomly divided into training, validation, and test sets with an approximate ratio of 7:1.5:1.5. In contrast to datasets captured under ideal laboratory conditions, NPU-BOLT reflects the visual complexities inherent in real-world engineering applications caused by uncontrollable environmental factors. The dataset includes key categories such as bolt heads and nuts. Its samples closely mirror the challenges of outdoor industrial inspection, featuring uncontrolled lighting (e.g., specular reflections and shadows), complex background textures, target occlusion, and edge blurring. These conditions are representative of typical inspection scenarios in which bolt-nut fasteners are often obscured by dense mechanical structures or affected by motion blur during image acquisition. Furthermore, the diverse shooting distances and viewing angles facilitate the evaluation of the model’s scale invariance and robustness to viewpoint variations.
While NPU-BOLT provides diverse samples, relying solely on raw data may still limit generalization under more extreme or unseen environmental conditions, as real-world inspection scenarios often involve severe disturbances. Therefore, we implemented a scenario-driven hybrid data augmentation strategy to expand the feature space. These augmentation operations were applied exclusively to the training set, increasing the number of training samples to 1416 images in order to mitigate potential overfitting and improve robustness to environmental variations. The augmentation pipeline includes both photometric and geometric transformations designed to simulate real-world inspection conditions. First, to simulate outdoor lighting variations (e.g., direct sunlight), random photometric distortion was applied by adjusting brightness, contrast, and saturation. Second, noise-based perturbations, including Gaussian noise, salt-and-pepper noise, and Gaussian blur, were introduced to mimic low signal-to-noise ratios and motion blur commonly encountered during industrial image acquisition. Third, geometric transformations such as random rotation were used to improve viewpoint invariance. Finally, Mosaic and Mixup augmentation techniques were employed to enhance robustness against densely cluttered backgrounds and overlapping visual patterns. Through random image stitching and pixel-level fusion, these methods simulate complex occlusion patterns and background interactions, constructing an augmented dataset that more closely approximates the visual complexity and variability of real-world industrial inspection environments.
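For illustration, the photometric and noise portions of this pipeline can be sketched with NumPy as below; all jitter ranges and noise levels are assumptions, and the geometric, Mosaic, and Mixup steps are omitted for brevity.

```python
import random
import numpy as np

def photometric_noise_augment(img: np.ndarray, rng: random.Random) -> np.ndarray:
    """Apply random brightness/contrast jitter, additive Gaussian noise,
    and salt-and-pepper noise to an HxWx3 uint8 image (hedged sketch;
    all ranges are illustrative assumptions)."""
    out = img.astype(np.float32)
    # contrast (alpha) and brightness (beta) jitter to mimic lighting changes
    alpha = rng.uniform(0.7, 1.3)
    beta = rng.uniform(-30.0, 30.0)
    out = alpha * out + beta
    # additive Gaussian noise with a random standard deviation
    np_rng = np.random.default_rng(rng.randrange(2**32))
    out += np_rng.normal(0.0, rng.uniform(0.0, 10.0), out.shape)
    out = np.clip(out, 0.0, 255.0).astype(np.uint8)
    # salt-and-pepper noise on a small random fraction of pixels
    h, w = out.shape[:2]
    for _ in range(int(rng.uniform(0.0, 0.01) * h * w)):
        y, x = rng.randrange(h), rng.randrange(w)
        out[y, x] = 255 if rng.random() < 0.5 else 0
    return out
```

In practice, such transforms are applied only to training images, with bounding-box labels left unchanged since the operations are purely photometric.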
4.3. Analysis of YOLOv8n-ALC Experimental Results
Figure 6 illustrates the Precision and Recall curves for both the baseline and the improved model during training. As shown in
Figure 6a, YOLOv8n-ALC exhibits consistently higher Precision, faster convergence, and reduced fluctuation compared to the baseline. This indicates superior stability in suppressing false positives under identical evaluation settings. In
Figure 6b, YOLOv8n-ALC surpasses the baseline in Recall during the mid-to-late training phases, maintaining a sustained advantage. This demonstrates enhanced target coverage and a reduction in false negatives. Collectively, these curves confirm that the proposed structural improvements effectively strengthen feature discrimination and detection stability in complex backgrounds.
To further evaluate performance, we analyzed the mAP evolution, as depicted in
Figure 7. YOLOv8n-ALC consistently achieves higher mAP levels throughout the process, indicating superior overall detection accuracy. Additionally, the model stabilizes within fewer epochs, demonstrating faster convergence and robust performance in the later stages. These results suggest that the proposed architectural enhancements significantly improve detection stability and overall performance under complex environmental conditions.
4.4. Performance Evaluation
4.4.1. Ablation Experiment
To systematically assess the specific contributions and interaction effects of the C2f-AC module, SPPF-LSKA module, and CGRFPN, a series of ablation experiments were conducted based on the YOLOv8n baseline. As summarized in
Table 2, the evaluation followed a three-stage protocol, including single-module integration, pairwise stacking, and full integration.
Single-module analysis reveals that each component contributes in a manner consistent with its design objective. The C2f-AC module (Model 1) improves parameter efficiency by increasing Precision from 88.3% to 89.7% while simultaneously reducing the parameter count to 2.7 M and the computational cost to 7.6 GFLOPs. This indicates that replacing standard bottleneck structures with additive attention effectively prunes redundant feature representations while preserving discriminative capability. The SPPF-LSKA module (Model 2) yields the most pronounced improvement in Recall, which rises from 79.5% to 85.0%, suggesting that the large-kernel decomposition strategy expands the effective receptive field and enhances sensitivity to small and easily missed targets with only marginal computational overhead (7.9 GFLOPs). The CGRFPN (Model 3) provides a balanced performance gain, achieving an mAP@0.5 of 89.1%. Although the context-guided cross-scale fusion slightly increases the parameter count to 3.7 M, the inference cost remains comparable to the baseline (8.6 GFLOPs), supporting its effectiveness in multi-scale feature reconstruction.
To evaluate the statistical stability of the proposed approach, the baseline YOLOv8n and the final YOLOv8n-ALC models were each trained three times with different random seeds, and mAP@0.5 is reported as the mean together with its standard deviation. Synergistic integration of the proposed modules (Models 4–6 and YOLOv8n-ALC) further demonstrates their complementary nature: pairwise combinations consistently outperform single-module configurations, indicating that the modules address different performance bottlenecks without functional conflict. The fully integrated YOLOv8n-ALC achieves the best overall performance, reaching an mAP@0.5 of 92.1 ± 0.2%, a Precision of 93.5%, and a Recall of 87.1%. Compared with the baseline YOLOv8n (87.8 ± 0.1%), this corresponds to an improvement of 4.3 percentage points in mAP@0.5, and the small standard deviations indicate that both models train stably across repeated runs. Notably, these gains are achieved while reducing the total parameter count to 2.9 M and lowering the computational cost to 8.2 GFLOPs, suggesting that the performance improvements arise from architectural efficiency and operator optimization rather than increased model capacity. As shown in Figure 8, YOLOv8n-ALC also converges faster and exhibits more stable validation performance during training.
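The seed-stability protocol above (mean ± standard deviation over three runs) can be sketched as follows. The per-run mAP@0.5 values used here are hypothetical placeholders chosen only for illustration; the text reports just the aggregate 92.1 ± 0.2%, not the individual runs:

```python
import statistics

# Hypothetical per-seed mAP@0.5 values for three training runs
# (placeholders; the actual per-run values are not reported in the text).
map50_runs = [92.0, 92.0, 92.3]

mean = statistics.mean(map50_runs)   # arithmetic mean over the three seeds
std = statistics.stdev(map50_runs)   # sample standard deviation (n - 1 denominator)

print(f"mAP@0.5 = {mean:.1f} \u00b1 {std:.1f}%")  # → mAP@0.5 = 92.1 ± 0.2%
```

Note that `statistics.stdev` uses the sample (n − 1) estimator; with only three runs, the choice between sample and population standard deviation noticeably affects the reported spread, so the convention should be stated when reporting such figures.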
4.4.2. Comparison Experiment
To further validate the effectiveness of the proposed approach, YOLOv8n-ALC was benchmarked against representative lightweight object detection models, including YOLOv3-tiny, YOLOv5n, YOLOv7-tiny, and the baseline YOLOv8n. All models were evaluated under identical training environments and datasets to ensure a fair comparison. The quantitative results are reported in Table 3.
As shown in Table 3, YOLOv8n-ALC achieves the best overall detection performance among the compared methods. It attains an mAP@0.5 of 92.1%, clearly outperforming the baseline YOLOv8n (87.8%) as well as earlier lightweight variants such as YOLOv7-tiny (86.9%). In addition, YOLOv8n-ALC records the highest Precision (93.5%) and Recall (87.1%), yielding a peak F1-score of 90.2%. These results indicate that the proposed model effectively suppresses false positives while maintaining robust target coverage.
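The F1-score quoted above follows directly from the stated Precision and Recall as their harmonic mean; a quick check using the percentages reported in the text:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)

# Precision and Recall of YOLOv8n-ALC as reported in the text.
f1 = f1_score(precision=93.5, recall=87.1)
print(f"{f1:.1f}")  # → 90.2
```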
Figure 9 illustrates the training dynamics on the validation set. YOLOv8n-ALC demonstrates faster convergence and consistently higher accuracy throughout training, reflecting improved optimization stability compared with the baseline model.
Beyond detection accuracy, YOLOv8n-ALC also exhibits favorable computational efficiency. As detailed in Table 3, the proposed model requires only 2.9 M parameters and 8.3 GFLOPs, making it lighter and faster than YOLOv8n (3.2 M/8.7 GFLOPs) and substantially more efficient than YOLOv7-tiny (6.2 M/13.8 GFLOPs). Although YOLOv5n employs fewer parameters (1.9 M), its detection accuracy is notably lower (87.2% mAP@0.5). Overall, these results demonstrate that YOLOv8n-ALC achieves a favorable accuracy–efficiency trade-off, making it well suited for deployment in resource-constrained substation inspection scenarios.
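The accuracy–efficiency trade-off described above can be made concrete as a Pareto-dominance check over the figures quoted in the text. This is an illustrative sketch, not part of the original evaluation; YOLOv5n and YOLOv3-tiny are omitted because their GFLOPs are not stated in the surrounding prose:

```python
# Accuracy-efficiency figures as quoted in the text (mAP@0.5 in %,
# parameters in millions, compute in GFLOPs).
models = {
    "YOLOv8n-ALC": {"map50": 92.1, "params_m": 2.9, "gflops": 8.3},
    "YOLOv8n":     {"map50": 87.8, "params_m": 3.2, "gflops": 8.7},
    "YOLOv7-tiny": {"map50": 86.9, "params_m": 6.2, "gflops": 13.8},
}

def dominates(a: dict, b: dict) -> bool:
    """a dominates b: no worse on every axis, strictly better on at least one."""
    no_worse = (a["map50"] >= b["map50"]
                and a["params_m"] <= b["params_m"]
                and a["gflops"] <= b["gflops"])
    strictly_better = (a["map50"] > b["map50"]
                       or a["params_m"] < b["params_m"]
                       or a["gflops"] < b["gflops"])
    return no_worse and strictly_better

dominated = {name for name, m in models.items()
             if any(dominates(o, m)
                    for other, o in models.items() if other != name)}
print(sorted(dominated))  # → ['YOLOv7-tiny', 'YOLOv8n']
```

On these three operating points, YOLOv8n-ALC dominates both baselines outright (higher mAP@0.5 with fewer parameters and lower compute), which is the strongest form the trade-off claim can take.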
Beyond quantitative accuracy and computational efficiency, it is essential to examine the qualitative detection behavior of the proposed model in real-world substation scenarios.
Figure 10 presents representative visual detection examples that illustrate the practical performance of YOLOv8n-ALC under diverse industrial conditions.
Specifically, Figure 10a shows representative detection results on the test set, covering several challenging scenarios such as complex structural backgrounds, severe surface degradation, and adverse illumination conditions. Despite these difficulties, the proposed model accurately localizes bolt–nut fasteners with stable confidence scores. These qualitative results provide intuitive visual evidence that complements the quantitative comparisons reported in Table 3, further validating the practical reliability of the proposed method.
To further evaluate the cross-domain generalization capability of the proposed model, an external verification dataset consisting of 50 bolt–nut images was constructed using samples collected from publicly available industrial inspection platforms. All images were manually annotated following the same labeling protocol used for the NPU-BOLT dataset. These out-of-distribution samples contain complex industrial backgrounds, varying illumination conditions, partial occlusions, and surface corrosion, which differ significantly from those present in the original dataset and therefore provide a more challenging evaluation scenario for assessing real-world robustness.
As illustrated in Figure 10b, YOLOv8n-ALC maintains reliable detection performance on these unseen images, accurately localizing bolt–nut fasteners across diverse scales and complex visual conditions. These qualitative observations demonstrate that the proposed model not only performs well on the original test set but also exhibits strong cross-domain generalization capability, highlighting its practical applicability for real-world substation inspection tasks.
The qualitative results demonstrate that YOLOv8n-ALC maintains robust and consistent detection performance under highly complex substation conditions. As illustrated in Figure 10, the test scenarios encompass adverse illumination effects caused by specular reflections and hard shadows, severe surface degradation resulting from heavy oil contamination and rust-induced camouflage, as well as strong background texture interference. Despite the extremely small physical scale of bolt–nut fasteners and frequent partial occlusions, YOLOv8n-ALC accurately localizes most targets with stable and relatively high confidence scores, typically ranging from 85% to 95%. Notably, the proposed model preserves reliable detections in heavily oil-stained scenes and severely corroded structures, where fastener appearances are highly degraded and visually blended with the surrounding background. Moreover, it effectively adapts to complex geometric configurations and discriminates fasteners from repetitive structural textures, such as diamond-patterned plates.
Nevertheless, several challenging cases remain where the detection performance deteriorates. As shown in Figure 10a, when bolt–nut fasteners exhibit strong visual similarity to the surrounding background or are affected by extreme specular reflections, the confidence scores of some detections decrease noticeably, with a few predictions dropping to approximately 35%. In addition, as illustrated in Figure 10b, under severe occlusion or motion-induced blur, a small number of bolt–nut fasteners may be partially missed by the detector. These cases typically occur when the visual boundaries of the fasteners become indistinguishable from the surrounding structures or when the discriminative features are heavily degraded.
Despite these limitations, YOLOv8n-ALC still maintains reliable detection performance in the majority of practical inspection scenarios. The observed failure cases also reveal potential directions for future improvements, such as enhancing occlusion-aware feature modeling and improving robustness to motion blur and background camouflage. Overall, the qualitative results further confirm the strong generalization capability and environmental robustness of YOLOv8n-ALC in real-world substation inspection environments characterized by cluttered backgrounds and adverse visual conditions.
5. Conclusions
This study addressed the challenge of detecting bolt-nut fasteners in complex substation environments by enhancing the backbone architecture and optimizing the multi-scale feature fusion pathway. Based on comprehensive experiments and ablation analyses, the main findings are summarized from the perspectives of methodological design, detection performance, engineering applicability, and remaining limitations.
Experimental results demonstrate that the proposed YOLOv8n-ALC consistently outperforms lightweight baseline detectors across key evaluation metrics. By integrating the C2f-AC module into the backbone, the network improves fine-grained feature discrimination for bolt-nut fasteners while simultaneously reducing parameter redundancy through additive attention and gated feature modulation. The SPPF-LSKA module effectively expands the receptive field using separable large-kernel attention, leading to a notable improvement in target recall for small and easily missed fasteners. In addition, the CGRFPN neck network enhances cross-scale feature consistency by incorporating context-guided reconstruction, thereby improving localization stability in complex backgrounds. Ablation studies confirm that each component contributes in a complementary manner, and their joint integration yields cumulative performance gains.
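The efficiency argument behind the separable large-kernel design in SPPF-LSKA can be illustrated with a simple per-channel parameter count: decomposing a k × k depthwise kernel into a 1 × k followed by a k × 1 pair reduces the weights from k² to 2k. This is an illustrative calculation of the decomposition principle, not the module's exact kernel configuration:

```python
def depthwise_kernel_params(k: int, separable: bool) -> int:
    """Per-channel weight count of a k x k depthwise convolution:
    a full 2-D kernel costs k*k weights, while a separable
    1 x k followed by k x 1 pair costs only 2*k."""
    return 2 * k if separable else k * k

# Illustrative kernel sizes (not necessarily those used by SPPF-LSKA).
for k in (7, 23, 35):
    full = depthwise_kernel_params(k, separable=False)
    sep = depthwise_kernel_params(k, separable=True)
    print(f"k={k}: full={full}, separable={sep}, reduction={full / sep:.1f}x")
```

The saving grows linearly with k (k² / 2k = k / 2), which is why the decomposition makes very large effective receptive fields affordable in a lightweight detector.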
Under complex substation inspection scenarios, YOLOv8n-ALC achieves an mAP@0.5 of 92.1%, an improvement of 4.3 percentage points over the YOLOv8n baseline; Precision and Recall improve by 5.2 and 7.6 percentage points, respectively. Importantly, these accuracy gains are achieved alongside reduced model complexity, with the final network requiring only 2.9 M parameters and 8.2 GFLOPs. This demonstrates that the proposed architectural modifications enhance detection performance through operator efficiency and effective feature modeling rather than increased model capacity, supporting practical deployment under resource constraints.
Despite these promising results, several limitations remain. The current evaluation does not include extreme weather conditions, where severe illumination variation or environmental noise may affect detection robustness. Moreover, although large-kernel attention improves contextual modeling, it may introduce local aliasing effects in scenarios with densely clustered fasteners or highly repetitive textures. Future work will explore the integration of multimodal sensing information and more efficient sparse attention mechanisms to further improve robustness. Combined with model compression and quantization techniques, these extensions are expected to enhance deployability on edge devices without compromising detection accuracy.
In summary, YOLOv8n-ALC provides an effective and efficient solution for bolt-nut fastener detection in complex industrial environments. Future research will further investigate its cross-scenario generalization capability and validate its performance in real-world edge-side deployment scenarios.