1. Introduction
Bolt-nut fasteners are critical components in mechanical connections and are widely used in power systems, rail transit, aerospace, and civil engineering. They play a fundamental role in maintaining the structural integrity of equipment components and engineering systems. However, under prolonged mechanical loads, environmental corrosion, and complex operating conditions, bolt-nut fasteners may experience loosening or failure, posing potential structural safety risks [
1]. For instance, post-incident investigation reports on the 2021 Mexico City subway viaduct collapse indicated that missing bolts at steel beam connections, along with welding defects, were among the contributing factors to the structural failure [
2]. This case highlights that the integrity of bolt-nut fasteners is closely related to the safety and reliability of large-scale engineering structures. Within power systems, substations act as critical hubs where operational stability directly impacts grid security. Consequently, automated detection of bolt-nut fasteners is essential, serving not only as a prerequisite for identifying loosening and defects but also as a key enabler for intelligent substation inspection and grid safety assurance [
3].
Traditionally, substation maintenance has relied heavily on manual inspection. However, this approach is inherently subjective and often suffers from inefficiencies, high labor intensity, and a substantial risk of missed detections [
4]. The inspection challenge is compounded by the physical characteristics of the targets: bolts and nuts are typically minute in size and situated in concealed or inaccessible locations. For instance, similar to the difficulties encountered in rail [
5] and steel bridge inspections [
6], where fasteners are distributed across elevated or structurally cluttered locations, bolt-nut fasteners in substations are often mounted on high-altitude pylons or concealed bases where comprehensive manual coverage is nearly impossible. Moreover, real-world interferences—such as visual obstructions, variable viewing angles, and uneven lighting—further increase inspection uncertainty [
7]. These limitations underscore the urgent need for automated computer vision–based detection methods for bolt-nut fasteners to replace or augment traditional manual methods.
Convolutional neural networks (CNNs) have demonstrated strong capability in object detection tasks by learning hierarchical feature representations. With the continuous improvement of GPU computing resources, the performance of CNN-based detection models has been significantly enhanced [
8]. Existing deep learning detection frameworks are commonly categorized into two types: two-stage detectors and single-stage detectors [
9]. Two-stage algorithms, exemplified by the R-CNN series, rely on region proposal mechanisms. For example, Lee et al. [
10] combined R-CNN detection with geometric image processing to quantify bolt loosening angles. Pham et al. [
11] proposed a framework integrating Faster R-CNN with graphical models, using Canny edge detection and the Hough transform to track bolt angles and loosening trends, while employing synthetic data to address scarcity. Zhang et al. [
12] developed a 1D Deep Convolutional Neural Network (1D-DCNN) to process raw vibration signals directly, demonstrating the noise resistance of non-visual methods. VijayaNirmala et al. [
13] utilized CNNs for nut-bolt dimension detection and classification, facilitating automated industrial sorting. While these algorithms offer high accuracy, they are computationally intensive and slow. Conversely, single-stage algorithms, such as SSD and the YOLO (You Only Look Once) series, prioritize speed. Zou et al. [
14] enhanced YOLOv5 with BiFPN and coordinate attention to improve small bolt detection in complex transmission line backgrounds. Hua et al. [
15] proposed an improved YOLOv8 algorithm integrating Self-Calibrating Convolutions (SCConv) and Bilayer Routing Attention (BRA) to address occlusion and small target detection. Yang et al. [
16] developed a quantized YOLOv5s-based method for missing bolt detection, optimizing for edge devices like the Jetson Nano. Li et al. [
17] introduced the YOLO-FDD network with an Attention Fusion Feature Pyramid Network (AF-FPN) and Swin Transformer to detect minute defects in aircraft skin fasteners. The primary advantage of single-stage algorithms is their real-time capability, though they historically trail two-stage models in detection accuracy.
Despite these advancements, practical application in substations remains challenging. First, long-distance or wide-field-of-view imaging causes bolt-nut fasteners to occupy only a limited number of pixels in the image. Critical structural features often degrade during subsampling, increasing the risk of missed detections. Second, industrial environments introduce complex lighting, strong background interference, and noise from oil and dust, all of which significantly degrade image quality. While YOLOv8 excels in balancing speed and accuracy, its performance requires optimization to effectively handle the specific challenges of small targets in noisy environments [
18]. In response to these challenges, several studies have explored improved solutions for small-object detection. For instance, Lou et al. [
19] improved YOLOv8’s downsampling and feature fusion for small object detection, while Yao et al. [
20] developed HP-YOLOv8 for remote sensing, utilizing a C2f-D-Mixer and dual-layer routing attention.
In response to the above challenges, an improved lightweight detection framework termed YOLOv8n-ALC is developed based on the YOLOv8n architecture. The proposed model is tailored for accurate identification of bolt–nut fasteners in substation inspection scenarios. The primary contributions of this study are summarized as follows:
Development of the C2f-AC Feature Extraction Module: We reconstructed the C2f unit in the backbone using the AdditiveBlock from CAS-ViT and integrated a CGLU context gating mechanism. This design maintains computational efficiency while enhancing fine-grained feature modeling through additive attention and gating strategies, effectively mitigating complex background noise interference.
Design of the SPPF-LSKA Spatial Enhancement Module: Inspired by large kernel attention mechanisms, we introduced an improved Large Separable Kernel Attention (LSKA) unit after the SPPF module. By implementing spatially adaptive weighting, the model prioritizes discriminative regions and suppresses redundant background responses, improving high-level feature representation in complex scenes.
Proposal of the CGRFPN Neck Network: We designed a Context-Guided Reconstruction Feature Pyramid Network (CGRFPN) to replace the original neck architecture. By introducing cross-layer attention and global context guidance, this network enhances collaborative feature expression across scales. This allows high-level semantic information to more effectively guide low-level detail features, improving the stability of small target localization while maintaining real-time performance.
3. Method
To enhance small object detection of bolt-nut fasteners in complex substation environments, we propose YOLOv8n-ALC, an improved model based on YOLOv8n. The network architecture is illustrated in
Figure 1. The proposed method integrates three core components: the C2f with AdditiveBlock and Convolutional Gated Linear Unit (C2f-AC) module, the Spatial Pyramid Pooling-Fast integrated with Large Separable Kernel Attention (SPPF-LSKA) module, and the Context-Guided Reconstruction Feature Pyramid Network (CGRFPN) neck network.
3.1. C2f-AC
To improve the backbone’s ability to represent small bolt–nut fasteners, the CSP Bottleneck with 2 Convolutions (C2f) module in YOLOv8 was redesigned. Inspired by the CAS-ViT architecture [
37], the Bottleneck units in the original C2f structure were replaced with Additive Blocks to construct an enhanced feature extraction module. This modification strengthens local detail modeling, helps preserve important spatial information during hierarchical feature learning, and promotes cross-region feature interaction while maintaining low computational overhead. To further improve feature selectivity and background noise suppression, we incorporate the Convolutional Gated Linear Unit (CGLU) [
38]. Specifically, the MLP branch within the Additive Block is replaced by a gated convolutional hybrid unit, forming the C2f with the AdditiveBlock and Convolutional Gated Linear Unit (C2f-AC) module. The core component of this module is Additive Block–CGLU (Add-CGLU). This design employs a three-stage residual path—Local Perception (LP), Additive Token Mixing, and Gated Channel/Spatial Modulation—to adaptively modulate feature responses. It strengthens key region representations while suppressing interference from complex background textures, thereby providing a stable foundation for subsequent neck fusion and detection head prediction. The overall Add-CGLU architecture is illustrated in
Figure 2.
3.1.1. LP Mechanism
The LP mechanism constructs an efficient channel-aware unit using sequential pointwise convolutions. It employs three cascaded convolutional layers, Batch Normalization (
BN), and the Gaussian Error Linear Unit (GELU) activation function to perform high-dimensional channel remapping and nonlinear transformation. This design promotes deep channel interaction and fusion while incurring extremely low computational cost. Given an input feature x, the feature transformation process of the LP mechanism can be expressed as

$$x_{\mathrm{LP}} = \mathrm{GELU}\big(\mathrm{BN}\big(W(x)\big)\big)$$

In the formula, $x_{\mathrm{LP}}$ represents the LP output, $W(\cdot)$ denotes the weight function of the cascaded pointwise convolutions, BN stands for Batch Normalization, and GELU denotes the Gaussian Error Linear Unit activation function.
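To make the LP mechanism concrete, the following PyTorch sketch implements it as three cascaded pointwise (1 × 1) convolutions with BN and GELU; the hidden expansion ratio is an illustrative assumption, not a value taken from this paper.

```python
import torch
import torch.nn as nn

class LocalPerception(nn.Module):
    """Sketch of the LP mechanism: three cascaded pointwise convolutions
    with BatchNorm and GELU for channel remapping. The hidden width
    (hidden_ratio) is an illustrative assumption."""
    def __init__(self, channels: int, hidden_ratio: int = 2):
        super().__init__()
        hidden = channels * hidden_ratio
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, hidden, kernel_size=1),
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # channel remapping and nonlinear transformation, shape-preserving
        return self.block(x)
```

Because all convolutions are 1 × 1, the spatial resolution is untouched and the extra cost stays low, matching the design goal stated above.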
3.1.2. Convolutional Additive Token Mixer
Following preliminary perception and transformation via the LP mechanism, input features are fed into the Convolutional Additive Token Mixer (CATM) for context modeling. The input features are first mapped into three representations—Query (
Q), Key (
K), and Value (
V)—via a
BN layer and independent linear transformations. For simplicity, the input features are denoted as x in the following discussion:

$$Q = W_{Q}(\mathrm{BN}(x)), \quad K = W_{K}(\mathrm{BN}(x)), \quad V = W_{V}(\mathrm{BN}(x))$$

$$\Phi(Q, K) = M\big(\Pi_{s}(Q), \Pi_{c}(K)\big)$$

where $\Phi(\cdot)$ denotes the context mapping function, $\Pi_{s}(\cdot)$ represents the spatial attention mechanism, and $\Pi_{c}(\cdot)$ indicates the channel attention operation. The fusion function M performs element-wise additive integration of the context-enhanced Query and Key representations, enabling efficient aggregation of spatial–channel contextual information.
The structural diagram of the spatial attention branch and channel attention branch is shown in
Figure 3.
CATM utilizes an Additive Attention mechanism to efficiently capture contextual dependencies across spatial and channel dimensions. By combining Token Mixer outputs through element-wise superposition, the architecture jointly models feature responses in both dimensions, enabling comprehensive attention modulation. Compared to traditional dot-product self-attention, additive attention significantly reduces computational complexity by avoiding the quadratic overhead of large-scale matrix multiplication while still effectively modeling global context. The final output of CATM can be expressed as

$$x_{\mathrm{CATM}} = W\big(\Phi(Q, K) \odot V\big)$$

Here, $W(\cdot)$ denotes a linear mapping function that integrates and remaps the contextual features obtained from the additive attention modeling, thereby promoting information fusion and collaborative expression among different feature branches.
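A minimal PyTorch sketch of CATM follows. The concrete forms of the two attention branches (a depthwise-convolution spatial gate and a squeeze-and-excitation-style channel gate) are assumptions chosen for illustration; only the overall structure (Q/K/V projections, additive fusion, modulation of V, and a final linear remapping) follows the description above.

```python
import torch
import torch.nn as nn

class SpatialAttn(nn.Module):
    # Depthwise conv followed by a sigmoid spatial gate (assumed form).
    def __init__(self, c: int):
        super().__init__()
        self.dw = nn.Conv2d(c, c, 3, padding=1, groups=c)
        self.gate = nn.Conv2d(c, 1, 1)

    def forward(self, x):
        return x * torch.sigmoid(self.gate(self.dw(x)))

class ChannelAttn(nn.Module):
    # Squeeze-and-excitation style channel gate (assumed form).
    def __init__(self, c: int, r: int = 4):
        super().__init__()
        h = max(c // r, 1)
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c, h, 1), nn.GELU(),
            nn.Conv2d(h, c, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)

class CATM(nn.Module):
    """Additive token mixer: spatial attention on Q, channel attention on K,
    element-wise additive fusion, modulation of V, final 1x1 remapping."""
    def __init__(self, c: int):
        super().__init__()
        self.norm = nn.BatchNorm2d(c)
        self.q = nn.Conv2d(c, c, 1)
        self.k = nn.Conv2d(c, c, 1)
        self.v = nn.Conv2d(c, c, 1)
        self.sa = SpatialAttn(c)
        self.ca = ChannelAttn(c)
        self.proj = nn.Conv2d(c, c, 1)

    def forward(self, x):
        xn = self.norm(x)
        q, k, v = self.q(xn), self.k(xn), self.v(xn)
        ctx = self.sa(q) + self.ca(k)   # additive fusion of the two branches
        return self.proj(ctx * v)       # modulate V, then linear remapping
```

Note that the additive fusion costs only an element-wise sum, avoiding the quadratic token-by-token similarity matrix of dot-product attention.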
3.1.3. Convolutional Gated Linear Unit
Features enhanced by CATM are then input into the Convolutional Gated Linear Unit (CGLU) for modulation. CGLU combines depthwise separable convolutions with a gating mechanism to adaptively modulate and reweight feature responses. Specifically, CGLU employs a dual-branch architecture comprising BN, a
depthwise separable convolution, and GELU activation to perform parallel transformations on the input features: one branch generates the gating signal, while the other produces the feature representation to be modulated. Finally, the outputs from both branches are fused through element-wise multiplication, with the computational process expressed as

$$x_{\mathrm{CGLU}} = \mathrm{GELU}\big(\mathrm{DWConv}\big(W_{1}(\mathrm{BN}(x))\big)\big) \odot W_{2}(\mathrm{BN}(x))$$

Specifically, $W_{1}$ and $W_{2}$ correspond to the weight parameters on the two branches, and $\odot$ denotes element-wise multiplication. CGLU employs a gating mechanism to adaptively modulate features, enhancing feature selectivity with minimal additional computational overhead, thereby improving the model's robustness.
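The dual-branch gating can be sketched in PyTorch as below; the hidden width and the 3 × 3 depthwise kernel size are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CGLU(nn.Module):
    """Convolutional Gated Linear Unit sketch: a value branch and a gate
    branch (with depthwise conv + GELU), fused by element-wise product.
    Hidden width is an illustrative assumption."""
    def __init__(self, c: int, hidden_ratio: int = 2):
        super().__init__()
        h = c * hidden_ratio
        self.norm = nn.BatchNorm2d(c)
        self.value = nn.Conv2d(c, h, 1)               # features to modulate
        self.gate = nn.Sequential(                    # gating signal branch
            nn.Conv2d(c, h, 1),
            nn.Conv2d(h, h, 3, padding=1, groups=h),  # depthwise conv
            nn.GELU(),
        )
        self.proj = nn.Conv2d(h, c, 1)

    def forward(self, x):
        xn = self.norm(x)
        # element-wise multiplication fuses the two parallel branches
        return self.proj(self.value(xn) * self.gate(xn))
```

The gate adaptively reweights each spatial position and channel, which is what suppresses background-texture responses at negligible extra cost.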
3.2. SPPF-LSKA
In substation inspection scenarios, bolt-nut fasteners are often tiny and visually similar to complex background structures. As network depth increases, fine-grained shallow features tend to degrade during consecutive downsampling and feature fusion, thereby limiting localization accuracy for small bolt-nut fasteners. YOLOv8 employs a Spatial Pyramid Pooling–Fast (SPPF) module at the end of the backbone. Fundamentally, max pooling functions as a subsampling strategy based on local maxima; it prioritizes the most prominent features within the receptive field while suppressing texture and edge information in non-maximal regions. Under strong background interference, this high-frequency information filtering may weaken responses to critical features such as thread edges and nut contours, compromising the stability of small bolt-nut fasteners.
To address this limitation, we introduce the Large Separable Kernel Attention (LSKA) mechanism to optimize the original architecture [
39]. This approach strikes a balance between large receptive field modeling capability and computational cost. While traditional Large Kernel Attention (LKA) effectively captures long-range dependencies, its reliance on large two-dimensional convolutions incurs excessive parameters and memory overhead, making it ill-suited for lightweight networks [
40]. LSKA decomposes the two-dimensional convolutional kernel into cascaded horizontal and vertical one-dimensional depthwise separable convolutions. By incorporating dilated convolutions, this approach maintains effective long-range context modeling while significantly reducing computational overhead. As a result, spatial perception capability is enhanced while overall model efficiency is preserved.
Let the input feature map be denoted as $F \in \mathbb{R}^{C \times H \times W}$. LSKA models features in both horizontal and vertical directions via cascaded one-dimensional separable convolutions, with dilated convolutions employed to further expand the effective receptive field. For channel C, the separable convolution process in the first stage can be expressed as:

$$Z^{C} = W_{(2d-1) \times 1} * \big(W_{1 \times (2d-1)} * F^{C}\big)$$

Here, $W_{1 \times (2d-1)}$ and $W_{(2d-1) \times 1}$ represent the depthwise convolution kernels along the horizontal and vertical directions, respectively, and d denotes the dilation rate.
The receptive field is further expanded with minimal additional parameters by introducing a second-stage dilated separable convolution:

$$\bar{Z}^{C} = W_{\lfloor k/d \rfloor \times 1} * \big(W_{1 \times \lfloor k/d \rfloor} * Z^{C}\big)$$

Here, $k$ denotes the size of the convolution kernel, and $\lfloor \cdot \rfloor$ represents the floor operation.
Subsequently, a $1 \times 1$ convolution is applied to fuse multi-scale contextual information and generate a spatial attention weight map:

$$A^{C} = W_{1 \times 1} * \bar{Z}^{C}$$
Finally, the attention weights are applied to the original features through element-wise multiplication:

$$\bar{F}^{C} = A^{C} \odot F^{C}$$
This structural design reduces the computational burden of large-kernel convolutions while preserving their long-range dependency modeling capability, making it well suited for detecting densely distributed small objects.
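The kernel decomposition maps directly onto code. In the hedged PyTorch sketch below, k = 23 and d = 3 are illustrative choices; the padding terms keep the spatial size unchanged for both the plain and the dilated 1D depthwise convolutions.

```python
import torch
import torch.nn as nn

class LSKA(nn.Module):
    """Large Separable Kernel Attention sketch for kernel size k and
    dilation d: stage 1 uses 1D depthwise convolutions of length 2d - 1;
    stage 2 uses dilated 1D depthwise convolutions of length floor(k / d);
    a 1x1 convolution then produces the attention map applied to the input."""
    def __init__(self, c: int, k: int = 23, d: int = 3):
        super().__init__()
        k1 = 2 * d - 1          # stage-1 kernel length
        k2 = k // d             # stage-2 (dilated) kernel length
        self.h1 = nn.Conv2d(c, c, (1, k1), padding=(0, k1 // 2), groups=c)
        self.v1 = nn.Conv2d(c, c, (k1, 1), padding=(k1 // 2, 0), groups=c)
        self.h2 = nn.Conv2d(c, c, (1, k2), padding=(0, (k2 // 2) * d),
                            dilation=d, groups=c)
        self.v2 = nn.Conv2d(c, c, (k2, 1), padding=((k2 // 2) * d, 0),
                            dilation=d, groups=c)
        self.pw = nn.Conv2d(c, c, 1)  # 1x1 fusion producing attention weights

    def forward(self, x):
        a = self.v1(self.h1(x))   # stage 1: horizontal then vertical
        a = self.v2(self.h2(a))   # stage 2: dilated horizontal then vertical
        return self.pw(a) * x     # element-wise re-weighting of the input
```

With k = 23 and d = 3, the two stages use 5-tap and 7-tap 1D kernels, so the parameter count grows linearly in k rather than quadratically as in a dense 23 × 23 convolution.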
To mitigate spatial information loss caused by max pooling in SPPF, we integrate LSKA into the SPPF architecture, forming the SPPF-LSKA module, as illustrated in
Figure 4. After concatenating multi-scale pooled features, LSKA performs spatial re-weighting on the fused features, followed by feature integration and channel compression via subsequent convolutional layers. This design preserves the lightweight and multi-scale aggregation characteristics of SPPF while mitigating spatial detail loss introduced by max pooling. By leveraging separable large-kernel modeling, LSKA expands spatial context coverage. Compared to direct large-kernel two-dimensional convolutions, this approach effectively controls additional computational overhead and enhances spatial discrimination and localization stability for small bolt-nut fasteners.
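A self-contained sketch of this integration follows, combining a YOLOv8-style SPPF (three cascaded 5 × 5 max-pooling operations) with LSKA re-weighting of the concatenated features; kernel sizes and channel widths here are assumptions, not the exact published configuration.

```python
import torch
import torch.nn as nn

class LSKA(nn.Module):
    # Separable large-kernel attention (compact sketch; k, d illustrative).
    def __init__(self, c: int, k: int = 23, d: int = 3):
        super().__init__()
        k1, k2 = 2 * d - 1, k // d
        self.h1 = nn.Conv2d(c, c, (1, k1), padding=(0, k1 // 2), groups=c)
        self.v1 = nn.Conv2d(c, c, (k1, 1), padding=(k1 // 2, 0), groups=c)
        self.h2 = nn.Conv2d(c, c, (1, k2), padding=(0, (k2 // 2) * d),
                            dilation=d, groups=c)
        self.v2 = nn.Conv2d(c, c, (k2, 1), padding=((k2 // 2) * d, 0),
                            dilation=d, groups=c)
        self.pw = nn.Conv2d(c, c, 1)

    def forward(self, x):
        return self.pw(self.v2(self.h2(self.v1(self.h1(x))))) * x

class SPPF_LSKA(nn.Module):
    """SPPF with LSKA re-weighting after concatenation, then 1x1
    fusion/compression, following the structure described above."""
    def __init__(self, c_in: int, c_out: int, pool_k: int = 5):
        super().__init__()
        c_mid = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_mid, 1)
        self.pool = nn.MaxPool2d(pool_k, stride=1, padding=pool_k // 2)
        self.lska = LSKA(4 * c_mid)
        self.cv2 = nn.Conv2d(4 * c_mid, c_out, 1)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        # spatially re-weight the multi-scale concatenation before fusing
        return self.cv2(self.lska(torch.cat([x, y1, y2, y3], dim=1)))
```

Placing LSKA between concatenation and channel compression is what restores spatial selectivity that the max-pooling stages discard.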
3.3. CGRFPN
In substation inspection scenarios, bolt-nut fasteners are typically small in scale and embedded in complex mechanical structures. Due to illumination variations, occlusion, corrosion, and background clutter, their discriminative visual features are easily overwhelmed. In lightweight detectors such as YOLOv8n, high-level features provide strong semantic information but suffer from spatial detail degradation, whereas low-level features preserve fine-grained textures but lack sufficient global semantic constraints. When these features are fused using conventional PAN-FPN with linear up/down sampling and concatenation, cross-scale semantic misalignment and cumulative background interference are likely to occur, limiting localization accuracy and detection robustness for bolt-nut fasteners.
To address these limitations, we propose a Context-Guided Reconstruction Feature Pyramid Network (CGRFPN) as the neck architecture. CGRFPN is designed to enhance cross-scale feature consistency and contextual awareness while maintaining computational efficiency. It consists of four key components: Pyramid Context Extraction (PCE), Rectangular Self-Calibration Module (RCM), Multi-Scale Feature Fusion Block (FBM), and Dynamic Interpolation Fusion (DIF), as illustrated in
Figure 5.
Specifically, PCE aggregates multi-level backbone features to construct a unified pyramid context representation, providing global semantic guidance for feature reconstruction. RCM incorporates coordinate attention and ConvMLP to calibrate orientation- and position-sensitive features. FBM performs context-guided, scale-weighted fusion to enhance responses in discriminative regions while suppressing background noise. DIF further mitigates spatial misalignment during cross-scale fusion through adaptive interpolation, ensuring spatial consistency across feature maps. Together, these modules enable CGRFPN to deliver more stable and informative feature representations for detecting bolt-nut fasteners in complex environments.
3.3.1. Pyramid Context Extraction
CGRFPN first employs the Pyramid Context Extraction (PCE) module to perform scale alignment and channel fusion on features from different backbone layers. This constructs multi-scale contextual representations while maintaining computational efficiency, providing global semantic guidance for subsequent feature reconstruction. Subsequently, the Rectangular Self-Calibration Module (RCM) integrates a Coordinate Attention mechanism and ConvMLP. Coordinate Attention enhances the representation of target geometry and positional information by decoupling horizontal and vertical spatial information. ConvMLP strengthens cross-channel dependency modeling, facilitating multi-scale feature interactions to improve representation stability and consistency.
3.3.2. Context-Guided Feature Reconstruction
Following context extraction, CGRFPN performs adaptive cross-scale fusion and reconstruction via the Multi-Scale Feature Fusion Block (FBM) and the Dynamic Interpolation Fusion (DIF) module. The FBM module facilitates context-guided feature reconstruction by utilizing global context from PCE to perform weighted fusion on current-level features using spatial weights, thereby amplifying key region responses while suppressing redundant or distracting information. This context-aware gated fusion helps establish robust semantic associations between multi-scale features. The DIF module mitigates spatial misalignment during cross-scale feature fusion. By introducing an adaptive interpolation mechanism, DIF aligns upsampled features with current-scale features, reducing inconsistencies caused by scale differences and enhancing the stability and continuity of the multi-scale feature fusion process. Overall, CGRFPN enhances the neck network’s fusion capability with minimal additional computational overhead, providing stable and information-rich feature representations for the detection head.
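As an illustration of the alignment step, the sketch below implements a dynamic interpolation fusion in PyTorch; the learned sigmoid gate used to blend the two scales is an assumption about the exact fusion rule, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DIF(nn.Module):
    """Dynamic Interpolation Fusion sketch: upsample the high-level map to
    the low-level resolution, then blend the two maps with a learned
    per-position gate (the gate form is an assumption)."""
    def __init__(self, c: int):
        super().__init__()
        self.gate = nn.Conv2d(2 * c, 1, 1)

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # align spatial sizes before fusion to avoid cross-scale misalignment
        up = F.interpolate(high, size=low.shape[-2:], mode="bilinear",
                           align_corners=False)
        w = torch.sigmoid(self.gate(torch.cat([low, up], dim=1)))
        return w * low + (1.0 - w) * up
```

The gate lets the network decide, position by position, whether fine-grained low-level detail or upsampled semantic context should dominate.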
4. Experiments
4.1. Evaluation Metrics
To rigorously evaluate the detection accuracy and computational efficiency of the proposed algorithm in complex substation scenarios, this study employs a comprehensive set of evaluation metrics, including Precision (P), Recall (R), Average Precision (AP), Mean Average Precision (mAP), parameter count (Params), and floating-point operations (GFLOPs).
Precision measures the correctness of positive predictions, defined as the ratio of true positive samples to all predicted positives. It reflects the model's ability to minimize false positives. Recall measures the coverage of actual positive samples, indicating the model's capability to reduce missed detections (false negatives). The calculation formulas are as follows:

$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}$$

Here, TP, FP, and FN represent the number of True Positives, False Positives, and False Negatives, respectively, under a predefined Intersection over Union (IoU) threshold.
Typically, there exists a trade-off between Precision and Recall. To evaluate performance across varying confidence thresholds, Average Precision (AP) is introduced. Defined as the area under the Precision-Recall (P-R) curve, AP characterizes detection performance for a specific category. The formula is as follows:

$$AP = \int_{0}^{1} P(R) \, dR$$
Based on this, Mean Average Precision (mAP) is calculated as the arithmetic mean of AP values across all categories, providing a unified metric for overall performance. For a detection task with n classes, the mAP formula is

$$\mathrm{mAP} = \frac{1}{n} \sum_{i=1}^{n} AP_{i}$$

where $AP_{i}$ denotes the Average Precision of the $i$-th category, and n represents the total number of classes. Specifically, mAP@0.5 represents the mAP calculated at an IoU threshold of 0.5, which is widely adopted in industrial object detection tasks. Considering the practical requirements of detection reliability and real-time performance in substation inspection scenarios, multiple evaluation metrics—including Precision, Recall, F1-score, and mAP@0.5—are adopted to comprehensively assess detection performance. Among these metrics, Recall reflects the model's ability to reduce missed detections, Precision measures the correctness of predicted targets, and mAP@0.5 provides an overall evaluation of detection accuracy.
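The per-image counting metrics reduce to a few lines of Python. In the usage line, TP = 87, FP = 6, FN = 13 are hypothetical counts chosen for illustration (they happen to yield precision near 0.935 and recall 0.870, in the range reported later); they are not taken from the paper's tables.

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, Recall, and F1 from matched detection counts at a fixed
    IoU threshold, following the definitions above."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Hypothetical counts for illustration:
p, r, f1 = detection_metrics(tp=87, fp=6, fn=13)
# p ≈ 0.9355, r = 0.8700, f1 ≈ 0.9016
```

mAP additionally requires sweeping the confidence threshold to trace the P-R curve per class, which is omitted here for brevity.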
In practical inspection systems, the confidence score associated with each predicted bounding box can be interpreted as the posterior confidence that the predicted region corresponds to a true bolt-nut fastener. Although standard metrics such as Precision, Recall, F1-score, and mAP quantitatively evaluate detection performance, confidence scores also play an important role in practical maintenance decision-making. In engineering applications, predefined confidence thresholds are commonly used to balance detection reliability and inspection workload. High-confidence detections can be directly accepted by automated inspection systems, whereas low-confidence predictions may be flagged for manual verification to reduce the risk of erroneous decisions. This mechanism enables automated inspection systems to maintain reliable detection performance while ensuring practical usability in real-world industrial environments.
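The thresholding policy described here can be sketched as a simple triage function; the 0.7 and 0.3 thresholds are illustrative assumptions, not values prescribed by the paper.

```python
def triage_detections(detections, accept_thr=0.7, review_thr=0.3):
    """Split predicted boxes into auto-accepted, manual-review, and
    discarded groups by confidence score (thresholds are assumptions)."""
    accepted, review, discarded = [], [], []
    for det in detections:
        if det["conf"] >= accept_thr:
            accepted.append(det)       # trusted by the automated system
        elif det["conf"] >= review_thr:
            review.append(det)         # flagged for manual verification
        else:
            discarded.append(det)      # treated as background noise
    return accepted, review, discarded
```

Raising `accept_thr` trades inspection workload for reliability: fewer false alarms reach the automated pipeline, but more predictions fall back to manual review.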
Furthermore, from the perspective of engineering maintenance, the implications of detection metrics extend beyond purely numerical evaluation. A missed detection (False Negative), which directly reduces Recall, indicates that an existing bolt–nut fastener is not identified by the model. In critical infrastructures such as substations or nuclear power facilities, missing a potentially defective fastener may result in undetected structural risks and compromise operational safety. In contrast, a false positive detection (False Positive), which lowers Precision, incorrectly identifies background structures as targets. Although such errors do not directly threaten structural safety, they may trigger unnecessary manual inspections, thereby increasing maintenance workload and operational costs. Therefore, maintaining high and balanced Precision and Recall is essential for reliable automated inspection in real-world industrial environments, while metrics such as the F1-score and mAP provide comprehensive indicators for evaluating overall detection performance.
In addition to accuracy metrics, model complexity is evaluated using Params and GFLOPs. Params reflects the model’s memory footprint, while GFLOPs quantify the computational cost during inference. These efficiency-related metrics are critical for assessing the practical deployability of detection models on resource-constrained edge devices commonly used in substation inspection systems.
4.2. Experimental Setup
4.2.1. Experimental Platform and Parameter Configuration
Experiments were performed on a standardized computing platform. The hardware configuration included an Intel Xeon Platinum 8474C CPU and a single NVIDIA GeForce RTX 4090D GPU (24 GB VRAM) for accelerated training. The software environment was based on Linux, utilizing Python 3.8 and the PyTorch 2.0.0 framework with CUDA 11.8.
For training, input images were resized to a fixed input resolution. The batch size was set to 64 to balance training efficiency and convergence stability. The model was trained for 300 epochs using the Stochastic Gradient Descent (SGD) optimizer. Initial hyperparameters were set as follows: learning rate of 0.01, momentum of 0.937, and weight decay of 0.0005, as detailed in
Table 1.
4.2.2. Dataset Preparation
To validate the robustness of the proposed algorithm for bolt-nut fastener detection under unstructured industrial environments, we employed the public NPU-BOLT dataset for training and evaluation. The original NPU-BOLT dataset contains 337 images with 1275 annotated bolt-nut fastener instances. The dataset was randomly divided into training, validation, and test sets with an approximate ratio of 7:1.5:1.5. In contrast to datasets captured under ideal laboratory conditions, NPU-BOLT reflects the visual complexities inherent in real-world engineering applications caused by uncontrollable environmental factors. The dataset includes key categories such as bolt heads and nuts. Its samples closely mirror the challenges of outdoor industrial inspection, featuring uncontrolled lighting (e.g., specular reflections and shadows), complex background textures, target occlusion, and edge blurring. These conditions are representative of typical inspection scenarios in which bolt-nut fasteners are often obscured by dense mechanical structures or affected by motion blur during image acquisition. Furthermore, the diverse shooting distances and viewing angles facilitate the evaluation of the model’s scale invariance and robustness to viewpoint variations.
While NPU-BOLT provides diverse samples, relying solely on raw data may still limit generalization under more extreme or unseen environmental conditions, as real-world inspection scenarios often involve severe disturbances. Therefore, we implemented a scenario-driven hybrid data augmentation strategy to expand the feature space. These augmentation operations were applied exclusively to the training set, increasing the number of training samples to 1416 images in order to mitigate potential overfitting and improve robustness to environmental variations. The augmentation pipeline includes both photometric and geometric transformations designed to simulate real-world inspection conditions. First, to simulate outdoor lighting variations (e.g., direct sunlight), random photometric distortion was applied by adjusting brightness, contrast, and saturation. Second, noise-based perturbations, including Gaussian noise, salt-and-pepper noise, and Gaussian blur, were introduced to mimic low signal-to-noise ratios and motion blur commonly encountered during industrial image acquisition. Third, geometric transformations such as random rotation were used to improve viewpoint invariance. Finally, Mosaic and Mixup augmentation techniques were employed to enhance robustness against densely cluttered backgrounds and overlapping visual patterns. Through random image stitching and pixel-level fusion, these methods simulate complex occlusion patterns and background interactions, constructing an augmented dataset that more closely approximates the visual complexity and variability of real-world industrial inspection environments.
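For illustration, the photometric and noise portions of this pipeline can be sketched with NumPy as below; all jitter ranges and noise levels are assumptions, and the geometric, Mosaic, and Mixup steps are omitted for brevity.

```python
import random
import numpy as np

def photometric_noise_augment(img: np.ndarray, rng: random.Random) -> np.ndarray:
    """Apply random brightness/contrast jitter, additive Gaussian noise,
    and salt-and-pepper noise to an HxWx3 uint8 image (hedged sketch;
    all ranges are illustrative assumptions)."""
    out = img.astype(np.float32)
    # contrast (alpha) and brightness (beta) jitter to mimic lighting changes
    alpha = rng.uniform(0.7, 1.3)
    beta = rng.uniform(-30.0, 30.0)
    out = alpha * out + beta
    # additive Gaussian noise with a random standard deviation
    np_rng = np.random.default_rng(rng.randrange(2**32))
    out += np_rng.normal(0.0, rng.uniform(0.0, 10.0), out.shape)
    out = np.clip(out, 0.0, 255.0).astype(np.uint8)
    # salt-and-pepper noise on a small random fraction of pixels
    h, w = out.shape[:2]
    for _ in range(int(rng.uniform(0.0, 0.01) * h * w)):
        y, x = rng.randrange(h), rng.randrange(w)
        out[y, x] = 255 if rng.random() < 0.5 else 0
    return out
```

In practice, such transforms are applied only to training images, with bounding-box labels left unchanged since the operations are purely photometric.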
4.3. Analysis of YOLOv8n-ALC Experimental Results
Figure 6 illustrates the Precision and Recall curves for both the baseline and the improved model during training. As shown in
Figure 6a, YOLOv8n-ALC exhibits consistently higher Precision, faster convergence, and reduced fluctuation compared to the baseline. This indicates superior stability in suppressing false positives under identical evaluation settings. In
Figure 6b, YOLOv8n-ALC surpasses the baseline in Recall during the mid-to-late training phases, maintaining a sustained advantage. This demonstrates enhanced target coverage and a reduction in false negatives. Collectively, these curves confirm that the proposed structural improvements effectively strengthen feature discrimination and detection stability in complex backgrounds.
To further evaluate performance, we analyzed the mAP evolution, as depicted in
Figure 7. YOLOv8n-ALC consistently achieves higher mAP levels throughout the process, indicating superior overall detection accuracy. Additionally, the model stabilizes within fewer epochs, demonstrating faster convergence and robust performance in the later stages. These results suggest that the proposed architectural enhancements significantly improve detection stability and overall performance under complex environmental conditions.
4.4. Performance Evaluation
4.4.1. Ablation Experiment
To systematically assess the specific contributions and interaction effects of the C2f-AC module, SPPF-LSKA module, and CGRFPN, a series of ablation experiments were conducted based on the YOLOv8n baseline. As summarized in
Table 2, the evaluation followed a three-stage protocol, including single-module integration, pairwise stacking, and full integration.
Single-module analysis reveals that each component contributes in a manner consistent with its design objective. The C2f-AC module (Model 1) improves parameter efficiency by increasing Precision from 88.3% to 89.7% while simultaneously reducing the parameter count to 2.7 M and the computational cost to 7.6 GFLOPs. This indicates that replacing standard bottleneck structures with additive attention effectively prunes redundant feature representations while preserving discriminative capability. The SPPF-LSKA module (Model 2) yields the most pronounced improvement in Recall, which rises from 79.5% to 85.0%, suggesting that the large-kernel decomposition strategy expands the effective receptive field and enhances sensitivity to small and easily missed targets with only marginal computational overhead (7.9 GFLOPs). The CGRFPN (Model 3) provides a balanced performance gain, achieving an mAP@0.5 of 89.1%. Although the context-guided cross-scale fusion slightly increases the parameter count to 3.7 M, the inference cost remains comparable to the baseline (8.6 GFLOPs), supporting its effectiveness in multi-scale feature reconstruction.
To evaluate the statistical stability of the proposed approach, the baseline YOLOv8n and the final YOLOv8n-ALC models were each trained three times with different random seeds, and mAP@0.5 is reported as the mean together with its standard deviation. Synergistic integration of the proposed modules (Models 4–6 and YOLOv8n-ALC) further demonstrates their complementary nature: pairwise combinations consistently outperform single-module configurations, indicating that the modules address different performance bottlenecks without functional conflict. The fully integrated YOLOv8n-ALC achieves the best overall performance, reaching an mAP@0.5 of 92.1 ± 0.2%, a Precision of 93.5%, and a Recall of 87.1%. Compared with the baseline YOLOv8n (87.8 ± 0.1%), this corresponds to an improvement of 4.3 percentage points in mAP@0.5, and the small standard deviations indicate that both models train stably across repeated runs. Notably, these gains are achieved while reducing the total parameter count to 2.9 M and lowering the computational cost to 8.2 GFLOPs, suggesting that the performance improvements arise from architectural efficiency and operator optimization rather than increased model capacity. As shown in Figure 8, YOLOv8n-ALC also converges faster and exhibits more stable validation performance during training.
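The seed-stability protocol above (mean ± standard deviation over three runs) can be sketched as follows. The per-run mAP@0.5 values used here are hypothetical placeholders chosen only for illustration; the text reports just the aggregate 92.1 ± 0.2%, not the individual runs:

```python
import statistics

# Hypothetical per-seed mAP@0.5 values for three training runs
# (placeholders; the actual per-run values are not reported in the text).
map50_runs = [92.0, 92.0, 92.3]

mean = statistics.mean(map50_runs)   # arithmetic mean over the three seeds
std = statistics.stdev(map50_runs)   # sample standard deviation (n - 1 denominator)

print(f"mAP@0.5 = {mean:.1f} \u00b1 {std:.1f}%")  # → mAP@0.5 = 92.1 ± 0.2%
```

Note that `statistics.stdev` uses the sample (n − 1) estimator; with only three runs, the choice between sample and population standard deviation noticeably affects the reported spread, so the convention should be stated when reporting such figures.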
4.4.2. Comparison Experiment
To further validate the effectiveness of the proposed approach, YOLOv8n-ALC was benchmarked against representative lightweight object detection models, including YOLOv3-tiny, YOLOv5n, YOLOv7-tiny, and the baseline YOLOv8n. All models were evaluated under identical training environments and datasets to ensure a fair comparison. The quantitative results are reported in Table 3.
As shown in Table 3, YOLOv8n-ALC achieves the best overall detection performance among the compared methods. It attains an mAP@0.5 of 92.1%, clearly outperforming the baseline YOLOv8n (87.8%) as well as earlier lightweight variants such as YOLOv7-tiny (86.9%). In addition, YOLOv8n-ALC records the highest Precision (93.5%) and Recall (87.1%), yielding a peak F1-score of 90.2%. These results indicate that the proposed model effectively suppresses false positives while maintaining robust target coverage.
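The F1-score quoted above follows directly from the stated Precision and Recall as their harmonic mean; a quick check using the percentages reported in the text:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)

# Precision and Recall of YOLOv8n-ALC as reported in the text.
f1 = f1_score(precision=93.5, recall=87.1)
print(f"{f1:.1f}")  # → 90.2
```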
Figure 9 illustrates the training dynamics on the validation set. YOLOv8n-ALC demonstrates faster convergence and consistently higher accuracy throughout training, reflecting improved optimization stability compared with the baseline model.
Beyond detection accuracy, YOLOv8n-ALC also exhibits favorable computational efficiency. As detailed in Table 3, the proposed model requires only 2.9 M parameters and 8.3 GFLOPs, making it lighter and faster than YOLOv8n (3.2 M/8.7 GFLOPs) and substantially more efficient than YOLOv7-tiny (6.2 M/13.8 GFLOPs). Although YOLOv5n employs fewer parameters (1.9 M), its detection accuracy is notably lower (87.2% mAP@0.5). Overall, these results demonstrate that YOLOv8n-ALC achieves a favorable accuracy–efficiency trade-off, making it well suited for deployment in resource-constrained substation inspection scenarios.
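The accuracy–efficiency trade-off described above can be made concrete as a Pareto-dominance check over the figures quoted in the text. This is an illustrative sketch, not part of the original evaluation; YOLOv5n and YOLOv3-tiny are omitted because their GFLOPs are not stated in the surrounding prose:

```python
# Accuracy-efficiency figures as quoted in the text (mAP@0.5 in %,
# parameters in millions, compute in GFLOPs).
models = {
    "YOLOv8n-ALC": {"map50": 92.1, "params_m": 2.9, "gflops": 8.3},
    "YOLOv8n":     {"map50": 87.8, "params_m": 3.2, "gflops": 8.7},
    "YOLOv7-tiny": {"map50": 86.9, "params_m": 6.2, "gflops": 13.8},
}

def dominates(a: dict, b: dict) -> bool:
    """a dominates b: no worse on every axis, strictly better on at least one."""
    no_worse = (a["map50"] >= b["map50"]
                and a["params_m"] <= b["params_m"]
                and a["gflops"] <= b["gflops"])
    strictly_better = (a["map50"] > b["map50"]
                       or a["params_m"] < b["params_m"]
                       or a["gflops"] < b["gflops"])
    return no_worse and strictly_better

dominated = {name for name, m in models.items()
             if any(dominates(o, m)
                    for other, o in models.items() if other != name)}
print(sorted(dominated))  # → ['YOLOv7-tiny', 'YOLOv8n']
```

On these three operating points, YOLOv8n-ALC dominates both baselines outright (higher mAP@0.5 with fewer parameters and lower compute), which is the strongest form the trade-off claim can take.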
Beyond quantitative accuracy and computational efficiency, it is essential to examine the qualitative detection behavior of the proposed model in real-world substation scenarios.
Figure 10 presents representative visual detection examples that illustrate the practical performance of YOLOv8n-ALC under diverse industrial conditions.
Specifically, Figure 10a shows representative detection results on the test set, covering several challenging scenarios such as complex structural backgrounds, severe surface degradation, and adverse illumination conditions. Despite these difficulties, the proposed model accurately localizes bolt–nut fasteners with stable confidence scores. These qualitative results provide intuitive visual evidence that complements the quantitative comparisons reported in Table 3, further validating the practical reliability of the proposed method.
To further evaluate the cross-domain generalization capability of the proposed model, an external verification dataset consisting of 50 bolt–nut images was constructed using samples collected from publicly available industrial inspection platforms. All images were manually annotated following the same labeling protocol used for the NPU-BOLT dataset. These out-of-distribution samples contain complex industrial backgrounds, varying illumination conditions, partial occlusions, and surface corrosion, which differ significantly from those present in the original dataset and therefore provide a more challenging evaluation scenario for assessing real-world robustness.
As illustrated in Figure 10b, YOLOv8n-ALC maintains reliable detection performance on these unseen images, accurately localizing bolt–nut fasteners across diverse scales and complex visual conditions. These qualitative observations demonstrate that the proposed model not only performs well on the original test set but also exhibits strong cross-domain generalization capability, highlighting its practical applicability for real-world substation inspection tasks.
The qualitative results demonstrate that YOLOv8n-ALC maintains robust and consistent detection performance under highly complex substation conditions. As illustrated in Figure 10, the test scenarios encompass adverse illumination effects caused by specular reflections and hard shadows, severe surface degradation resulting from heavy oil contamination and rust-induced camouflage, as well as strong background texture interference. Despite the extremely small physical scale of bolt–nut fasteners and frequent partial occlusions, YOLOv8n-ALC accurately localizes most targets with stable and relatively high confidence scores, typically ranging from 85% to 95%. Notably, the proposed model preserves reliable detections in heavily oil-stained scenes and severely corroded structures, where fastener appearances are highly degraded and visually blended with the surrounding background. Moreover, it effectively adapts to complex geometric configurations and discriminates fasteners from repetitive structural textures, such as diamond-patterned plates.
Nevertheless, several challenging cases remain where the detection performance deteriorates. As shown in Figure 10a, when bolt–nut fasteners exhibit strong visual similarity to the surrounding background or are affected by extreme specular reflections, the confidence scores of some detections decrease noticeably, with a few predictions dropping to approximately 35%. In addition, as illustrated in Figure 10b, under severe occlusion or motion-induced blur, a small number of bolt–nut fasteners may be partially missed by the detector. These cases typically occur when the visual boundaries of the fasteners become indistinguishable from the surrounding structures or when the discriminative features are heavily degraded.
Despite these limitations, YOLOv8n-ALC still maintains reliable detection performance in the majority of practical inspection scenarios. The observed failure cases also reveal potential directions for future improvements, such as enhancing occlusion-aware feature modeling and improving robustness to motion blur and background camouflage. Overall, the qualitative results further confirm the strong generalization capability and environmental robustness of YOLOv8n-ALC in real-world substation inspection environments characterized by cluttered backgrounds and adverse visual conditions.
5. Conclusions
This study addressed the challenge of detecting bolt-nut fasteners in complex substation environments by enhancing the backbone architecture and optimizing the multi-scale feature fusion pathway. Based on comprehensive experiments and ablation analyses, the main findings are summarized from the perspectives of methodological design, detection performance, engineering applicability, and remaining limitations.
Experimental results demonstrate that the proposed YOLOv8n-ALC consistently outperforms lightweight baseline detectors across key evaluation metrics. By integrating the C2f-AC module into the backbone, the network improves fine-grained feature discrimination for bolt-nut fasteners while simultaneously reducing parameter redundancy through additive attention and gated feature modulation. The SPPF-LSKA module effectively expands the receptive field using separable large-kernel attention, leading to a notable improvement in target recall for small and easily missed fasteners. In addition, the CGRFPN neck network enhances cross-scale feature consistency by incorporating context-guided reconstruction, thereby improving localization stability in complex backgrounds. Ablation studies confirm that each component contributes in a complementary manner, and their joint integration yields cumulative performance gains.
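The efficiency argument behind the separable large-kernel design in SPPF-LSKA can be illustrated with a simple per-channel parameter count: decomposing a k × k depthwise kernel into a 1 × k followed by a k × 1 pair reduces the weights from k² to 2k. This is an illustrative calculation of the decomposition principle, not the module's exact kernel configuration:

```python
def depthwise_kernel_params(k: int, separable: bool) -> int:
    """Per-channel weight count of a k x k depthwise convolution:
    a full 2-D kernel costs k*k weights, while a separable
    1 x k followed by k x 1 pair costs only 2*k."""
    return 2 * k if separable else k * k

# Illustrative kernel sizes (not necessarily those used by SPPF-LSKA).
for k in (7, 23, 35):
    full = depthwise_kernel_params(k, separable=False)
    sep = depthwise_kernel_params(k, separable=True)
    print(f"k={k}: full={full}, separable={sep}, reduction={full / sep:.1f}x")
```

The saving grows linearly with k (k² / 2k = k / 2), which is why the decomposition makes very large effective receptive fields affordable in a lightweight detector.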
Under complex substation inspection scenarios, YOLOv8n-ALC achieves an mAP@0.5 of 92.1%, an improvement of 4.3 percentage points over the YOLOv8n baseline; Precision and Recall improve by 5.2 and 7.6 percentage points, respectively. Importantly, these accuracy gains are achieved alongside reduced model complexity, with the final network requiring only 2.9 M parameters and 8.2 GFLOPs. This demonstrates that the proposed architectural modifications enhance detection performance through operator efficiency and effective feature modeling rather than increased model capacity, supporting practical deployment under resource constraints.
Despite these promising results, several limitations remain. The current evaluation does not include extreme weather conditions, where severe illumination variation or environmental noise may affect detection robustness. Moreover, although large-kernel attention improves contextual modeling, it may introduce local aliasing effects in scenarios with densely clustered fasteners or highly repetitive textures. Future work will explore the integration of multimodal sensing information and more efficient sparse attention mechanisms to further improve robustness. Combined with model compression and quantization techniques, these extensions are expected to enhance deployability on edge devices without compromising detection accuracy.
In summary, YOLOv8n-ALC provides an effective and efficient solution for bolt-nut fastener detection in complex industrial environments. Future research will further investigate its cross-scenario generalization capability and validate its performance in real-world edge-side deployment scenarios.