Article

An Attention-Based Bidirectional Feature Fusion Algorithm for Insulator Detection

1 College of Information Engineering, Inner Mongolia University of Technology, Hohhot 010080, China
2 Inner Mongolia Key Laboratory of Intelligent Perception and System Engineering, Hohhot 010080, China
* Author to whom correspondence should be addressed.
Sensors 2026, 26(2), 584; https://doi.org/10.3390/s26020584
Submission received: 5 December 2025 / Revised: 7 January 2026 / Accepted: 13 January 2026 / Published: 15 January 2026
(This article belongs to the Section Industrial Sensors)

Abstract

To maintain reliability, safety, and sustainability in power transmission, insulator defect detection has become a critical task in power line inspection. Due to the complex backgrounds and small defect sizes encountered in insulator defect images, issues such as false detections and missed detections often occur. The existing You Only Look Once (YOLO) object detection algorithm is currently the mainstream method for image-based insulator defect detection in power lines. However, existing models suffer from low detection accuracy. To address this issue, this paper presents an improved YOLOv5-based MC-YOLO insulator detection algorithm. To effectively extract multi-scale information and enhance the model’s ability to represent feature information, a multi-scale attention convolutional fusion (MACF) module incorporating an attention mechanism is proposed. This module utilises parallel convolutions with different kernel sizes to effectively extract features at various scales and highlights the feature representation of key targets through the attention mechanism, thereby improving the detection accuracy. Additionally, a cross-context feature fusion module (CCFM) is designed, where shallow features gain partial deep semantic supplementation and deep features absorb shallow spatial information, achieving bidirectional information flow. Furthermore, the Spatial-Channel Dual Attention Module (SCDAM) is introduced into CCFM. By incorporating a dynamic attention-guided bidirectional cross-fusion mechanism, it effectively resolves the feature deviation between shallow details and deep semantics during multi-scale feature fusion. The experimental results show that the MC-YOLO algorithm achieves an mAP@0.5 of 67.4% on the dataset used in this study, which is a 4.1% improvement over the original YOLOv5. Although the FPS is slightly reduced compared to the original model, it remains practical and capable of rapidly and accurately detecting insulator defects.

1. Introduction

With the rapid development of high-speed rail technology in China, the area in which contact network power transmission lines are installed has been expanding, making it especially important to ensure the safe and stable operation of the contact network. As a critical component of the overhead contact system, insulators provide support for conductors and facilitate electrical insulation within power transmission circuits. However, prolonged exposure to harsh external environments can cause insulators to become susceptible to damage, flashover, and other defects (see Figure 1). These issues compromise the insulation performance, posing significant threats to the safety and reliability of transmission lines. Consequently, regularly conducting inspections is essential in the expeditious identification of insulator defects and the implementation of appropriate measures. Conventional manual field inspections are characterised by inefficiency and an inability to adequately address the demands of insulator defect detection. The integration of machine learning and artificial intelligence technologies has effectively reduced the complexity of insulator inspection tasks, promoted personnel safety, and significantly enhanced inspection efficiency [1,2,3,4].
In recent years, deep learning algorithms have achieved significant advances in image recognition and object detection. Object detection methods based on deep learning offer advantages such as high detection accuracy and strong generalisation capabilities, leading to their widespread application in insulator defect detection [5]. The development of these algorithms began with two-stage detectors, including the Fast Region-based Convolutional Neural Network (Fast R-CNN) proposed by Girshick [6], the Faster R-CNN proposed by Ren et al. [7], and the Cascade R-CNN proposed by Cai et al. [8]. These algorithms first select candidate regions and then perform object localisation and classification; the time-consuming candidate region selection step results in comparatively high computational complexity. Subsequent single-stage detectors perform object bounding box localisation and recognition concurrently, offering faster detection speeds and greater simplicity. Notable examples include YOLOv1 by Redmon et al. [9], YOLOX by Ge et al. [10], PP-YOLO by Long et al. [11], YOLOv7 by Wang et al. [12], and SSD by Liu et al. [13].
At present, research in this area focuses on two primary directions: enhancing detection accuracy and developing lightweight models. Hu et al. [14] proposed a methodology combining Faster R-CNN and a U-shaped network, in which the former locates glass insulator strings whilst the latter performs precise pixel classification on cropped images of differing scales; this approach has been shown to outperform certain classical algorithms in insulator localisation and crack location determination. Guo et al. [15] addressed the challenges of indistinct target features and low detection accuracy for small objects in drone inspections by proposing an improved YOLOv5-based insulator defect detection algorithm, which incorporates the ConvNeXt architecture into the backbone network to enhance feature extraction capabilities and integrates a coordinate attention mechanism to boost small object detection performance. Wang et al. [16] proposed a method for detecting glass insulator defects based on a dynamic difference algorithm, using a rotating control platform and an inverted dam-type LED light source to suppress complex backgrounds, combined with a support vector machine classifier for defect identification. Despite the intricacy of glass insulator shapes and the heterogeneity of their defects, test results confirm that this method achieves a detection accuracy of 2 mm and a processing speed of 10 samples per minute, effectively meeting industrial automation inspection requirements.
In the area of lightweight and real-time detection, Ma et al. [17] proposed a lightweight YOLOv4 detection model to address the large backbone architecture and high parameter count of YOLOv4. This approach incorporates GhostNet as the feature extraction network, significantly reducing the parameter count and accelerating inference while maintaining detection accuracy; furthermore, the K-means++ clustering algorithm is employed to optimise anchor box sizes, and Quality Focal Loss is integrated into the loss function to enhance model performance. Jia et al. [18] proposed the lightweight detection model MDD-YOLOv3, in which standard convolutions in the YOLOv3 backbone were replaced with depthwise separable convolutions to construct the D-Darknet53 backbone. The experimental results showed that the model achieved a slight improvement in detection accuracy while significantly increasing detection speed, demonstrating its ability to rapidly and accurately identify and locate insulators against complex backgrounds. The methods discussed above have achieved varying degrees of improvement in insulator and defect detection performance. However, they primarily address a limited range of defect types. Furthermore, inspection images are characterised by diverse defect types and small defect scales; existing algorithms struggle to extract features for multiple defect targets, which can lead to missed or erroneous detections of insulator defects. Moreover, contemporary models are characterised by elevated computational complexity.
The efficacy of these algorithms in enhancing the detection of insulators has been demonstrated to a certain extent. However, their utilisation of image features remains somewhat inadequate, particularly in terms of leveraging global contextual information within images. The following modules are proposed in this paper to enhance the baseline YOLOv5 model:
  • An effective multi-scale attention-based convolutional fusion module is proposed. It employs a parallel multi-scale convolutional branch structure to extract features at different scales. Through an attention mechanism, it adaptively assigns weights to features across scales, performs weighted summation across the entire branch, and applies residual projection. This approach enhances more meaningful features while suppressing irrelevant ones, thereby improving feature effectiveness.
  • A novel cross-context feature fusion module is proposed, designed to guide and adaptively adjust contextual information during multi-scale feature fusion. Through the SCDAM attention mechanism, the module captures and leverages crucial contextual information during feature fusion, thereby enhancing the effectiveness of feature representations. This effectively guides the model to learn information about detection targets, improving detection accuracy. Simultaneously, through weighted feature re-organisation operations, the module enhances the discriminative capability of feature maps.

2. YOLOv5 Architecture

YOLOv5 is a single-stage object detection algorithm characterised by its compact model size, fast processing speed, and high accuracy. As illustrated in Figure 2, the architecture employs a backbone network for feature extraction, generating feature maps containing semantic information at multiple levels. The Neck component fuses multi-scale features to construct a feature pyramid, employing an FPN + PAN structure: the Feature Pyramid Network (FPN) propagates high-level semantic feature maps top-down, while the Path Aggregation Network (PAN) supplements low-level object information bottom-up, enhancing localisation capability. The Head (output stage) serves as the final prediction component, determining the category and location of objects of varying sizes based on feature maps of different dimensions.

3. MC-YOLO Model Architecture

YOLOv5s was selected as the base model for refinement owing to its high accuracy and real-time performance, leading to the development of the enhanced MC-YOLO model. During feature fusion, the YOLOv5 algorithm requires continuous feature map downsampling; however, a single sampling method inevitably results in feature loss. To preserve feature information, MC-YOLO introduces a multi-scale attention convolution fusion module, providing the network with a rich source of information. Concurrently, a cross-context feature fusion approach is introduced to mitigate feature loss during feature concatenation in the original YOLOv5 model, endowing the model with more advanced and comprehensive feature fusion capabilities. The block diagram of the MC-YOLO algorithm is shown in Figure 3. In the feature extraction component, the original C3 feature extraction module is replaced with the multi-scale attention convolutional fusion module (MACF) to enhance the model's feature extraction capability. The CCFM captures and utilises crucial contextual information during feature fusion, enhancing the effectiveness of feature representations, and SCDAM is introduced to amplify important features and enhance the discriminative power of feature maps.

3.1. Multi-Scale Attention Convolution Fusion (MACF) Module

The feature extraction module extracts feature information from input images and converts the image data into high-level representations rich in semantic information, facilitating the efficient execution of subsequent detection tasks. Compared with single-branch structures, multi-branch architectures are more effective at capturing features. This paper presents a multi-scale attention convolution fusion (MACF) module to enhance the model's representational capability for objects at different scales while maintaining lightweight and portable properties. This module employs a parallel multi-scale convolutional branch structure, utilising an attention mechanism to adaptively assign weights to features at different scales, with residual connections introduced at the output to ensure gradient stability. Given an input feature map, convolutions with distinct kernel sizes extract features across three parallel branches; each branch then undergoes feature weighting to retain effective information, and convolutional operations fuse the output feature maps into a unified output feature. The model architecture is illustrated in Figure 4.
The input feature X is passed through three parallel convolutional branches with kernel sizes of 3 × 3, 5 × 5, and 7 × 7, respectively, generating the multi-scale features feat3, feat5, and feat7. The attention weight generation module jointly models these multi-scale features to obtain the corresponding weight coefficients w1, w2, and w3, and adaptively fuses the features at each scale through weighted summation to produce the fused feature Fuse. Finally, the fused feature is combined with the input via a residual connection and passed through an activation function to output the final feature Fout, which serves as the output of the MACF module in the overall architecture (Figure 3). At the same time, the module keeps the dimensions of the input and output feature maps identical, minimising feature loss in images of small objects and contributing to enhanced model accuracy in object detection tasks, particularly in complex scenes.
Subsequent to multi-scale feature extraction, the attention mechanism assigns a weighting to features from each branch, thereby facilitating feature selection and enhancement. This process serves to amplify more meaningful features and thereby improve the effectiveness of feature representation.
As shown in Figure 5, the attention weights within the Attention block are generated from the global information of the input feature $x \in \mathbb{R}^{B \times C \times H \times W}$, where $B$, $C$, $H$, and $W$ denote the batch size, number of channels, height, and width, respectively. First, Global Average Pooling (GAP) is applied to $x$ to obtain a feature representation of shape $(B, C, 1, 1)$, compressing the spatial dimensions while preserving the global response of each channel, as shown in Formula (1), where $s_{b,c}$ represents the average activation of channel $c$. Subsequently, a bottleneck structure composed of two 1 × 1 convolutional layers and a ReLU activation function maps the channel features from $C$ dimensions to a reduced dimension and then to three dimensions, representing the weight scores of the three convolutional branches (3 × 3, 5 × 5, 7 × 7). A Softmax along this dimension performs normalisation, enabling the three weights to compete numerically while satisfying the constraint that their sum equals 1, as shown in Formula (2).

$$s_{b,c} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} x_{b,c,i,j} \tag{1}$$

$$\sum_{k=1}^{3} w_{b,k} = 1, \quad w_{b,k} > 0 \tag{2}$$

This process yields the attention weights $w_3, w_5, w_7$, as illustrated in Equation (3), where $\mathrm{GAP}(\cdot)$ denotes the global average pooling operation. The attention weights are multiplied with their corresponding features ($feat_3$, $feat_5$, and $feat_7$) to obtain the final multi-scale fused feature (see Equation (4)). This adaptive weighted fusion integrates the branches with different receptive fields, enabling the model to dynamically adjust the contribution of multi-scale information based on the input features.

$$[w_3, w_5, w_7] = \mathrm{Softmax}(\mathrm{Conv}(\mathrm{ReLU}(\mathrm{Conv}(\mathrm{GAP}(x))))) \tag{3}$$

$$f_{fuse} = w_3 \cdot feat_3 + w_5 \cdot feat_5 + w_7 \cdot feat_7 \tag{4}$$
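The branch-weighting scheme described above can be sketched in PyTorch. This is a minimal illustration under stated assumptions, not the authors' released implementation: the bottleneck reduction ratio, the hidden-width floor, and the use of SiLU for the output activation are assumptions.

```python
import torch
import torch.nn as nn


class MACF(nn.Module):
    """Sketch of the multi-scale attention convolution fusion module.

    Three parallel convolutions (3x3, 5x5, 7x7) extract multi-scale
    features; a GAP -> bottleneck -> softmax head produces one weight
    per branch (Eqs. (1)-(3)); the weighted sum (Eq. (4)) is added to
    the input via a residual connection.
    """

    def __init__(self, channels, reduction=4):
        super().__init__()
        # same-padding convolutions keep the spatial size unchanged
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, k, padding=k // 2) for k in (3, 5, 7)]
        )
        hidden = max(channels // reduction, 4)  # assumed bottleneck width
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),       # GAP: (B, C, 1, 1), Eq. (1)
            nn.Conv2d(channels, hidden, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 3, 1),       # one score per branch
            nn.Softmax(dim=1),             # weights sum to 1, Eq. (2)
        )
        self.act = nn.SiLU()               # assumed output activation

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]   # feat3, feat5, feat7
        w = self.attn(x)                                  # (B, 3, 1, 1), Eq. (3)
        fuse = sum(w[:, i:i + 1] * f for i, f in enumerate(feats))  # Eq. (4)
        return self.act(fuse + x)                         # residual projection
```

Because every branch preserves the input shape, the module is a drop-in replacement for a C3 block of matching channel width.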

3.2. Cross-Context Fusion Module (CCFM)

In the context of object detection tasks, the accurate identification of targets of varying scales frequently necessitates the integration of feature maps from disparate levels to concurrently capture high-level semantic information and low-level spatial details. However, during the process of feature extraction, as convolutional neural networks become more complex, the spatial resolution of feature maps undergoes a continuous decrease. This phenomenon results in the partial loss of an object’s local structural information and overall contextual relationships, leading to significant semantic differences between shallow and deep features. Specifically, shallow features have been shown to retain more detail and positional information, exhibiting strong equivariant properties that aid in precise object localisation. In contrast, deep features exhibit a propensity to encompass abstract semantic information, characterised by high levels of invariance that facilitate semantic classification. The classification and localisation capabilities of detection models are collectively underpinned by these complementary features.
In the YOLOv5 network, an FPN + PAN architecture is employed to fuse multi-scale features. The Feature Pyramid Network (FPN) primarily conveys high-level semantic information through a top–down approach, while the Path Aggregation Network (PAN) further enhances the expressive power of low-level features. However, prior to the execution of feature fusion between the PAN and FPN structures, it is imperative that YOLOv5 reduces the dimensionality of features from disparate levels through multiple 1 × 1 convolutions. This ensures that the channel counts remain consistent, thus enabling the element-wise addition of the features. This fixed channel alignment approach simplifies the structure to a certain extent but also presents certain issues. Channel dimension inconsistencies lead to imbalanced energy distribution. Forced dimensionality reduction has been shown to disrupt the channel expression structure of original features, causing uneven energy distribution across features of different scales. Fusion conflicts arise from semantic distribution discrepancies. Significant differences in semantic levels between shallow and deep features can cause information interference when directly combined, thereby weakening the discriminative power of multi-scale features. The integration method is limited, and information exchange is constrained. Conventional techniques for straightforward addition or concatenation prove inadequate in facilitating adequate interaction between the shallow and deep layers, consequently leading to suboptimal feature utilisation efficiency.
In order to address the aforementioned issues, the present paper presents a Cross Context Fusion Module (CCFM), as illustrated in Figure 6. By introducing a bidirectional cross-fusion mechanism guided by dynamic attention, it effectively resolves the feature deviation issue between shallow-level details and deep-level semantics during multi-scale feature fusion. Structurally, the module achieves channel alignment; in fusion strategy, enables bidirectional information flow; in weight allocation, implements adaptive adjustment; and through a residual architecture, ensures training stability. In comparison with conventional fusion modules, CCFM has been shown to significantly enhance semantic consistency and detail fidelity in feature representation without a substantial increase in parameter count, thereby providing more discriminative features for downstream tasks such as detection and segmentation.
The overall structure of CCFM is illustrated in Figure 6, comprising four principal stages: feature alignment, context concatenation and compression, attention weight generation, and bidirectional fusion with residual enhancement. In the overall architecture (Figure 3), the CCFM module receives two inputs: the upper input, which carries deeper semantic features, is defined as $x_1 \in \mathbb{R}^{B \times C_1 \times H \times W}$, and the lower input, which carries shallower spatial features, is defined as $x_0 \in \mathbb{R}^{B \times C_0 \times H \times W}$. Since features from different levels typically have inconsistent channel numbers, the module first aligns the channels of the shallow features through a 1 × 1 convolution:

$$x_0' = \begin{cases} \mathrm{Conv}_{1 \times 1}(x_0), & \text{if } C_0 \neq C_1 \\ x_0, & \text{otherwise} \end{cases} \tag{5}$$

where $C_0$ and $C_1$ denote the channel numbers of the two inputs. This operation realigns the channels without compromising spatial resolution, establishing the foundation for subsequent feature fusion. After channel alignment, the shallow and deep features are concatenated along the channel dimension:

$$x_{cat} = \mathrm{Concat}(x_0', x_1) \tag{6}$$

The number of feature channels obtained is twice that of the original. To reduce computational complexity and facilitate interaction between information at different levels, CCFM employs a 1 × 1 convolution for compression:

$$x_{red} = \mathrm{Conv}_{1 \times 1}(x_{cat}) \tag{7}$$

Through the above operations, the module ensures sufficient feature fusion while effectively controlling the model's parameter count and computational overhead. The compressed fused features are then fed into SCDAM to generate a dynamic weight map $w \in [0, 1]^{B \times C \times 1 \times 1}$:

$$w = \sigma(\mathrm{Attention}(x_{red})) \tag{8}$$

where $\sigma(\cdot)$ denotes the sigmoid function, which maps the weights to the [0, 1] range, enabling the network to adaptively enhance key information.
This weight map dynamically adjusts the response intensity of different channels based on input feature content, thereby achieving adaptive information distribution between shallow and deep features. This enables the network to adaptively select information sources across different semantic levels. Subsequently, employing the generated weights, the fusion outputs for both directions are computed separately:
$$y_0 = w \cdot x_0' + (1 - w) \cdot x_1, \qquad y_1 = (1 - w) \cdot x_0' + w \cdot x_1 \tag{9}$$
Shallow features gain partial semantic supplementation from deep features, while deep features also absorb shallow spatial information, enabling bidirectional information flow. Ultimately, the two sets of fusion results are concatenated along the channel dimension, incorporating residual connections and activation functions.
$$F_{out} = \mathrm{SiLU}(\mathrm{Concat}(y_0, y_1, x_0', x_1)) \tag{10}$$
where SiLU refers to the Sigmoid Linear Unit activation function. The choice of SiLU over ReLU at this point is mainly based on its smooth and non-monotonic response characteristics, which help to alleviate potential information loss caused by the hard truncation of ReLU during feature fusion.
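The fusion pipeline above can be sketched as follows. This is a hedged illustration rather than the paper's code: the SCDAM gate is approximated here by a simple channel bottleneck so the block stays self-contained, and the final 1 × 1 convolution that restores the channel count after the four-way concatenation is an assumption.

```python
import torch
import torch.nn as nn


class CCFM(nn.Module):
    """Sketch of the cross-context fusion module.

    x1 carries deep semantic features, x0 carries shallow spatial
    features. Channels are aligned, concatenated, and compressed; a
    sigmoid gate w drives the bidirectional weighted fusion, and the
    two fusion results are concatenated with the residual inputs.
    """

    def __init__(self, c0, c1):
        super().__init__()
        # channel alignment for the shallow input (identity if already equal)
        self.align = nn.Conv2d(c0, c1, 1) if c0 != c1 else nn.Identity()
        self.reduce = nn.Conv2d(2 * c1, c1, 1)   # compress the concatenation
        hidden = max(c1 // 4, 4)
        # stand-in for SCDAM: channel bottleneck + sigmoid gate (assumption)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c1, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, c1, 1), nn.Sigmoid(),
        )
        self.out = nn.Conv2d(4 * c1, c1, 1)      # assumed channel restoration
        self.act = nn.SiLU()

    def forward(self, x0, x1):
        x0 = self.align(x0)                               # channel alignment
        x_red = self.reduce(torch.cat([x0, x1], dim=1))   # concat + compress
        w = self.gate(x_red)                              # (B, C, 1, 1) gate
        y0 = w * x0 + (1 - w) * x1                        # shallow-led path
        y1 = (1 - w) * x0 + w * x1                        # deep-led path
        # concatenate fusion results with the residual inputs, then activate
        return self.act(self.out(torch.cat([y0, y1, x0, x1], dim=1)))
```

Note how the complementary gates `w` and `1 - w` realise the bidirectional information flow: when the gate favours the shallow features in one path, the deep features dominate the other.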

Spatial-Channel Dual Attention Module (SCDAM)

In order to enhance the expressive power of feature maps and increase the network’s focus on critical regions, in this study, we introduced a module called SCDAM. Although the classic Convolutional Block Attention Module (CBAM) [19] also adopts a combined channel and spatial attention design, SCDAM has been optimised in terms of structural design, parameter efficiency, and synergy with CCFM to meet its specific requirements. SCDAM integrates dual attention mechanisms for both channels and spatial dimensions and was developed to enable the adaptive adjustment of the response intensity of feature maps across these dimensions, thereby facilitating more precise feature recalibration. The module is composed of two sub-modules: the channel attention unit and the spatial attention unit. The overall structure of the system is illustrated in Figure 7. The initial step in the process is the weighting of the input features using the channel attention mechanism. This is carried out so that the channel information that is most critical to overall target discrimination can be highlighted. Consequently, the spatial attention mechanism directs the model to concentrate on the regions within the feature map that are most responsive to space.
As shown in Figure 7, the channel attention mechanism is employed to adaptively adjust the importance of different channels across the entire image. Given input features X r e d , they pass through a 1 × 1 convolutional layer that reduces the number of channels from C to C / r (where r is the reduction parameter used to decrease the computational load). Subsequently, the ReLU activation function enhances the nonlinear expressive capability of the features. The module forms a bottleneck structure through two convolutional layers and the ReLU activation function:
$$CA(X) = \sigma(W_2 \, \delta(W_1 X_{red})) \tag{11}$$

where $W_1$ and $W_2$ represent the weights of the two convolutional layers, $\delta(\cdot)$ denotes the ReLU activation function, $\sigma(\cdot)$ denotes the Sigmoid function, and $r$ denotes the channel reduction ratio. The weighting coefficients $CA \in [0, 1]^{C}$ for each channel are generated by modelling inter-channel information, resulting in the final output:

$$X' = X \otimes CA \tag{12}$$

where $\otimes$ denotes the element-wise multiplication operation at the channel level. This suppresses redundant channel features while enhancing key semantic channels.
The spatial attention mechanism processes the channel-weighted feature $X'$, which is input into the spatial attention module to capture the importance of different spatial locations. As shown in Figure 7, spatial attention extracts spatial saliency information through average pooling and max pooling operations along the channel dimension: average pooling yields the mean response and max pooling yields the maximum response at each spatial location.

$$M_{avg}(i, j) = \frac{1}{C} \sum_{k=1}^{C} X'_{k,i,j} \tag{13}$$

$$M_{max}(i, j) = \max_{k \in [1, C]} X'_{k,i,j} \tag{14}$$

Both outputs have the shape $(B, 1, H, W)$. They are concatenated along the channel dimension and fed into a convolutional layer with a kernel size of 7 × 7, followed by a Sigmoid activation, producing a spatial weight map:

$$SA(X') = \sigma(f^{7 \times 7}([M_{avg}; M_{max}])) \tag{15}$$
The module’s final output is as follows:
$$Y_{SCDAM} = X' \otimes SA(X') \tag{16}$$
Through this operation, the model can focus on salient target regions in the spatial dimension, thereby enhancing the discriminative power and localisation capabilities of features.
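Putting the channel and spatial attention formulas above together, a minimal PyTorch sketch of SCDAM might look as follows. The global average pooling inside the channel branch and the reduction ratio r = 16 are assumptions inferred from the bottleneck description; the 7 × 7 spatial kernel follows the text.

```python
import torch
import torch.nn as nn


class SCDAM(nn.Module):
    """Sketch of the spatial-channel dual attention module.

    Channel attention: a two-layer 1x1 conv bottleneck (C -> C/r -> C)
    with ReLU and Sigmoid rescales channels. Spatial attention: a 7x7
    conv over the concatenated channel-wise average and max maps
    produces a per-location weight. Channel attention is applied first,
    as in the text.
    """

    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        hidden = max(channels // reduction, 4)
        # channel branch; the GAP is an assumption (per-channel gate)
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, hidden, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1),
            nn.Sigmoid(),
        )
        # spatial branch over [avg-pool; max-pool] channel maps
        self.spatial = nn.Sequential(
            nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = x * self.channel(x)                    # X' = X (x) CA(X)
        m_avg = x.mean(dim=1, keepdim=True)        # (B, 1, H, W) average map
        m_max = x.max(dim=1, keepdim=True).values  # (B, 1, H, W) max map
        sa = self.spatial(torch.cat([m_avg, m_max], dim=1))
        return x * sa                              # Y = X' (x) SA(X')
```

Sequential channel-then-spatial gating mirrors CBAM's ordering while keeping the parameter count small (one bottleneck plus one 7 × 7 convolution).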

4. Experimental Results and Analysis

4.1. Experimental Environment

The hardware platform and software environment utilised in this experiment are delineated in Table 1.
The YOLOv5 code is version 7.0. The input image resolution for the network is 640 × 640. The batch size has been set to 16, and the training is conducted for 200 epochs. The SGD (stochastic gradient descent) optimiser is utilised, with an initial learning rate of 0.01 and momentum of 0.937. Mosaic data augmentation is employed during the training phase, with all other settings left at their default values.

4.2. Experimental Dataset

The experiment utilised a dataset comprising 2912 images of transmission line insulators, annotated using the LabelImg tool. The dataset was divided into a training set of 2004 images and a test set of 908 images. The images encompassed both normal and defective insulator instances, ensuring a balanced class distribution. The annotations covered three categories: normal insulators, damaged insulators, and flashover insulators, with images captured under a variety of environmental conditions to ensure robustness.

4.3. Evaluation Metrics

mAP@0.5 is defined as the mean of all category AP values at an IoU threshold of 0.5, while mAP@0.5:0.95 denotes the mean average precision across IoU thresholds from 0.5 to 0.95. The AP value can be obtained via Formula (17) and represents the area under the precision–recall curve. Precision is defined as the proportion of correctly predicted positives among all predicted positives; its calculation formula is presented in Equation (18). Recall is defined as the probability that the model correctly predicts all positive instances; recall is inversely related to the false-negative rate, that is, a higher recall indicates a lower false-negative rate. Its calculation formula is presented in Equation (19).

$$AP = \int_{0}^{1} P(r) \, dr \tag{17}$$

$$P = \frac{TP}{TP + FP} \tag{18}$$

$$R = \frac{TP}{TP + FN} \tag{19}$$
The true positive (TP) is defined as the number of positive samples that have been correctly predicted by the model. The false positive (FP) is the number of negative samples that have been incorrectly predicted as positive. The false negative (FN) is the number of positive samples that have been incorrectly predicted as negative by the model. In this study, a predicted box is considered a TP only if it simultaneously meets the following conditions: (1) IoU ≥ 0.5; (2) the predicted class matches the class of the corresponding ground truth box; (3) the prediction confidence is ≥0.25. If a predicted box has an IoU < 0.5 with all ground truth boxes, or if the IoU meets the threshold but the predicted class is incorrect, it is considered an FP. If a ground truth box is not matched by any predicted box that meets the above criteria, it is considered an FN.
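The matching rules above (IoU ≥ 0.5, class agreement, confidence ≥ 0.25, one-to-one assignment of ground truths) can be made concrete with a short sketch. This is an illustrative greedy matcher under the stated thresholds, not the evaluation code used in the paper; the `(x1, y1, x2, y2)` box format is an assumption.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def count_tp_fp_fn(preds, gts, iou_thr=0.5, conf_thr=0.25):
    """Count TP/FP/FN for one image under the rules stated above.

    preds: list of (box, cls, conf); gts: list of (box, cls).
    A prediction is a TP only if its IoU with an unmatched ground truth
    of the same class reaches iou_thr and its confidence reaches
    conf_thr; every other prediction is an FP, and unmatched ground
    truths are FNs.
    """
    # process confident predictions first, as standard mAP tooling does
    preds = sorted((p for p in preds if p[2] >= conf_thr),
                   key=lambda p: p[2], reverse=True)
    matched, tp, fp = set(), 0, 0
    for box, cls, _ in preds:
        best, best_iou = None, iou_thr
        for i, (gbox, gcls) in enumerate(gts):
            if i not in matched and gcls == cls:
                v = iou(box, gbox)
                if v >= best_iou:
                    best, best_iou = i, v
        if best is None:
            fp += 1                 # no eligible ground truth
        else:
            matched.add(best)       # each GT matches at most one box
            tp += 1
    fn = len(gts) - len(matched)
    return tp, fp, fn
```

Precision and recall then follow directly from Equations (18) and (19), e.g. `P = tp / (tp + fp)` and `R = tp / (tp + fn)`.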
In this paper, we evaluate object detection accuracy using the mAP@0.5 and mAP@0.5:0.95 metrics; FPS is used to assess the number of image frames processed per second. The assessment of model complexity employs two distinct metrics: parameters and GFLOPs.

4.4. Analysis of Experimental Results

In this study, we utilised YOLOv5s 7.0 as the baseline model and aimed to enhance its capabilities. Ablation experiments were conducted by sequentially adding the MACF module and the CCFM, with the results displayed in Table 2. In Table 3, a performance analysis is presented for different insulator types under identical conditions.
As demonstrated in Table 2, following the integration of the multi-scale attention convolution fusion (MACF) module, the model is capable of extracting feature information within disparate receptive fields through multiple parallel convolutional branches, achieving the simultaneous modelling of local details and global contextual information. This module incorporates an attention mechanism while preserving spatial structural features, enabling the model to focus more on feature responses in key regions and thereby enhancing feature representation capabilities and information utilisation. Experimental results demonstrate that incorporating the MACF module significantly enhances both the detection accuracy and recall of the model: the mAP@0.5 improves to 0.665, while the mAP@0.5:0.95 increases to 0.432. This improvement validates that the designed three-branch parallel structure effectively fuses features from different receptive fields, thereby alleviating the feature loss problem associated with small insulator defects.
After introducing the cross-context feature fusion module (CCFM), the model adapts better during the feature fusion stage. By adding a cross-scale information interaction mechanism, the CCFM exploits the complementarity between features at different levels, allowing the model to integrate high-level semantics while preserving spatial details. The module automatically selects feature maps with higher scores and richer semantic information for object detection, which improves detection accuracy. Compared with the baseline YOLOv5 model, the model incorporating CCFM improves mAP@0.5 by approximately 2.6%, validating the module's significant role in feature optimisation and information filtering.
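As a rough PyTorch illustration of the bidirectional flow described above (not the authors' exact CCFM, whose structure is shown in Figure 6), a shallow and a deep feature map can exchange information like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiFusionSketch(nn.Module):
    """Minimal sketch of bidirectional cross-scale fusion: the shallow map
    receives upsampled deep semantics, the deep map receives downsampled
    shallow spatial detail. Channel projections are our assumption."""
    def __init__(self, c_shallow, c_deep):
        super().__init__()
        self.deep_to_shallow = nn.Conv2d(c_deep, c_shallow, 1)
        self.shallow_to_deep = nn.Conv2d(c_shallow, c_deep, 3, stride=2, padding=1)

    def forward(self, shallow, deep):
        up = F.interpolate(self.deep_to_shallow(deep), size=shallow.shape[2:],
                           mode="nearest")        # deep semantics to shallow scale
        down = self.shallow_to_deep(shallow)       # shallow detail to deep scale
        return shallow + up, deep + down           # bidirectional information flow
```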
The experimental results in Table 3 further show that the modules improve detection performance across defect types. For the more challenging defect types "damage" and "flashover", MC-YOLO reaches mAP@0.5 values of 0.600 and 0.449, respectively, a substantial improvement over the original YOLOv5 model (0.558 and 0.365). This indicates that the added modules strengthen the model's robustness and generalisation through multi-scale feature fusion and interaction between pieces of contextual information.
Overall, the MACF module improves the diversity and effectiveness of feature extraction, while the CCFM optimises information flow during the feature fusion stage, giving the model higher detection accuracy and stability in complex defect identification tasks. Although the improved model adds some parameters and computation (GFLOPs rising from 16.0 to 19.0), it retains good real-time performance (about 121 FPS), indicating that the proposed structural optimisation strikes a favourable balance between detection capability and computational cost.
To validate the effectiveness of the SCDAM attention mechanism, different attention mechanisms were incorporated into the CCFM under identical experimental conditions, and the results are presented in Table 4.
The analysis shows that introducing the various attention modules did not significantly increase the model's parameter count, but their effects on detection accuracy varied considerably. Compared with the baseline model, attention mechanisms such as SE, ECA, and CA yielded only marginal improvements. Although CBAM performs well in general tasks, its performance here falls short of the purpose-designed SCDAM. The EMA module achieved slightly better results, though its accuracy still did not meet expectations. The largest gain came from integrating the SCDAM attention mechanism, indicating that SCDAM is better matched to the cross-context feature fusion module (CCFM) than the other attention mechanisms: it allocates weights more precisely during feature fusion, effectively improving object detection accuracy.
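For orientation, a spatial-channel dual attention of the kind SCDAM represents can be sketched in a CBAM-like form. This is our reconstruction of the general idea only; the reduction ratio and the 7×7 spatial kernel are assumptions, not the paper's exact module (Figure 7).

```python
import torch
import torch.nn as nn

class DualAttentionSketch(nn.Module):
    """CBAM-like spatial-channel dual attention, offered only to illustrate
    the idea behind SCDAM; the paper's exact design may differ."""
    def __init__(self, c, reduction=16):
        super().__init__()
        self.channel = nn.Sequential(              # channel attention branch
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c, c // reduction, 1), nn.ReLU(),
            nn.Conv2d(c // reduction, c, 1), nn.Sigmoid(),
        )
        self.spatial = nn.Sequential(              # spatial attention branch
            nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x):
        x = x * self.channel(x)                    # reweight channels
        avg = x.mean(dim=1, keepdim=True)          # per-pixel channel statistics
        mx, _ = x.max(dim=1, keepdim=True)
        return x * self.spatial(torch.cat([avg, mx], dim=1))
```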
To comprehensively evaluate the overall performance of the proposed MC-YOLO model, we compared it with representative YOLO series models from recent years. Considering the strict requirements for real-time performance and ease of deployment in high-speed railway catenary inspection tasks, this comparative experiment specifically selected lightweight and fast versions from each series, which aligns well with the application scenarios and improvement objectives of our work.
Based on the comparative experimental results shown in Table 5, the proposed model demonstrates clear advantages in balancing accuracy and efficiency. Among the lightweight models compared (YOLOv5s, YOLOv8n, YOLOv11, and YOLOv12), YOLOv8n achieved the fastest inference speed at 270 FPS, while YOLOv11 and YOLOv12 reached mAP@0.5 values of 0.667 and 0.651, respectively. MC-YOLO achieved the highest mAP@0.5 of these models at 0.674 while keeping a relatively low parameter count and a practical inference speed, a 4.1% improvement over the original YOLOv5s baseline model.
Despite achieving an accuracy of 0.711, YOLOv9c has a parameter count of 25.3 M and an inference speed of only 57 FPS. Models such as YOLOv6s and YOLOv8n, while offering advantages in terms of speed, exhibit detection accuracies that are lower than those of the proposed method. The experimental results demonstrate that the proposed model achieves superior detection accuracy while maintaining a moderate parameter size and high inference efficiency, reflecting excellent overall performance.

4.5. Visual Analysis

As shown in Figure 8, MC-YOLO improves mAP@0.5 by 4.1% over the baseline model. The curve of the MC-YOLO model, which integrates all enhancement strategies, covers the largest area along both the horizontal and vertical axes. This indicates that MC-YOLO achieves the highest precision and recall across all detection categories and that the proposed enhancements effectively improve insulator detection for railway overhead contact systems.
To compare the detection performance of MC-YOLO and YOLOv5 more intuitively, images were randomly selected from the dataset for detection, as illustrated in Figure 9.
As shown in Figure 9, under real-world conditions and natural lighting, MC-YOLO detects insulators substantially better than YOLOv5. MC-YOLO is notably robust in densely populated target areas, and its predicted bounding boxes generally carry higher confidence scores. For densely clustered targets, YOLOv5 produces significant false negatives and false positives, whereas MC-YOLO detects them completely and accurately. For minute objects, YOLOv5's perception is deficient while MC-YOLO recognises them precisely. YOLOv5 is also prone to false detections against complex backgrounds, such as misclassifying damaged structures. Overall, MC-YOLO achieves a good balance of accuracy and speed in insulator defect detection, significantly reducing both false-negative and false-positive rates and better matching practical detection requirements.
Finally, to evaluate the model's detection behaviour more precisely, a visualisation analysis was conducted using the Gradient-weighted Class Activation Mapping (Grad-CAM) technique. Grad-CAM highlights the key image regions behind a model's decision as heatmaps, clearly showing the location and extent of the target the model is predicting [25]. It thus aids in understanding the model during object detection and in assessing its detection capability, as shown in Figure 10.
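A minimal hook-based Grad-CAM sketch, independent of MC-YOLO and shown only to clarify how such heatmaps are produced, is:

```python
import torch
import torch.nn as nn

def grad_cam(model, layer, x, class_score_fn):
    """Minimal Grad-CAM sketch: weight the chosen layer's activations by the
    spatially pooled gradients of the target score. `class_score_fn` maps the
    model output to a scalar (e.g. the logit of the predicted class)."""
    acts, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    try:
        score = class_score_fn(model(x))
        model.zero_grad()
        score.backward()
        w = grads["g"].mean(dim=(2, 3), keepdim=True)   # GAP over gradients
        cam = torch.relu((w * acts["a"]).sum(dim=1))    # weighted activation sum
        cam = cam / (cam.max() + 1e-8)                  # normalise to [0, 1]
    finally:
        h1.remove(); h2.remove()
    return cam
```

For a detector, `class_score_fn` would typically select the class or objectness score of a chosen prediction; the resulting map is then upsampled and overlaid on the input image as a heatmap.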
Figure 10 presents a visual comparison between the YOLOv5 model and the MC-YOLO model. The class activation maps show that MC-YOLO produces stronger responses, more precise localisation, and better resistance to background interference. In the figure, warm colours such as red and yellow mark important regions, while cool colours such as blue mark less important ones. This suggests that MC-YOLO generates more accurate predictions and has superior recognition capability, enhancing its application prospects across a range of domains.

5. Conclusions

The mainstream object detection algorithms applied to insulator detection still lack the precision required for practical applications. To improve the accuracy of railway catenary insulator detection while keeping model complexity low and real-time performance practical, this paper proposed the MC-YOLO algorithm. A multi-scale attention convolutional fusion (MACF) module was designed to extract multi-scale information effectively and strengthen the model's feature representation. A cross-context feature fusion module (CCFM) was then developed, in which shallow features receive partial semantic supplementation from deep features while deep features absorb shallow spatial information, realising bidirectional information flow. The Spatial-Channel Dual Attention Module (SCDAM) was integrated into the CCFM; its dynamic attention-guided bidirectional cross-fusion mechanism effectively addresses the discrepancy between shallow details and deep semantics during multi-scale feature fusion. The MC-YOLO algorithm achieved an mAP@0.5 of 67.4% on the dataset presented in this paper at a frame rate of 121 FPS, enabling rapid and accurate detection of insulator defects. To move this research from laboratory validation toward practical engineering applications, we further envision an industrial implementation pathway: MC-YOLO can be embedded as the core detection engine within drones or intelligent inspection platforms, performing real-time detection and preliminary alerts on edge devices while complex sample verification and continuous model optimisation run in the cloud, forming a comprehensive intelligent inspection system.
To address practical challenges in field inspections—such as variable lighting and complex weather conditions—enhancing the model’s environmental robustness requires constructing incremental datasets and incorporating domain adaptation techniques [26]. Furthermore, the small target and complex background detection problems addressed by this method are broadly applicable, and its application scenarios can naturally extend to fields such as high-voltage power line inspection, substation equipment monitoring, and surface defect detection of infrastructure, including wind turbine blades and bridge cables [27,28,29]. Notably, the current model has increased parameter volume due to the introduction of attention mechanisms. Future work will employ lightweight techniques like model pruning and quantisation compression to further optimise computational efficiency while maintaining accuracy, thereby meeting practical deployment requirements for embedded devices [30,31,32]. Through continuous refinement along this technical trajectory, MC-YOLO holds promise to evolve into a stable and reliable automated inspection tool, delivering a practical technical solution for intelligent operation and maintenance of power systems.
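As a generic illustration of the quantisation step mentioned above (not an evaluation of MC-YOLO itself), PyTorch's post-training dynamic quantisation converts the weights of selected layer types to int8 in a few lines; detectors dominated by convolutions usually require static quantisation or pruning instead.

```python
import torch
import torch.nn as nn

# Toy model standing in for a detection head; dynamic quantisation currently
# targets nn.Linear (and RNN) layers, so this only illustrates the workflow.
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 4))
quantised = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8  # int8 weights, dynamic activations
)
out = quantised(torch.randn(1, 256))       # inference API is unchanged
```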

Author Contributions

Conceptualisation, B.G. and D.L.; methodology, B.G. and D.L.; software, B.G. and J.G.; validation, B.G. and Y.W.; investigation, B.G. and D.L.; data curation, J.G. and Y.W.; writing—original draft preparation, D.L. and B.G.; writing—review and editing, B.G. and J.G.; visualisation, B.G. and X.J.; project administration, B.G.; funding acquisition, D.L. and X.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Inner Mongolia Autonomous Region College Students’ Innovation and Entrepreneurship Training Program, grant number S202510128018; the Inner Mongolia Autonomous Region Key Scientific and Technological R&D and Achievement Transformation Program, grant number 2025YFHH0035; and the Basic Scientific Research Business Fee Project for Universities Directly under the Inner Mongolia Autonomous Region, grant number JY20250087.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

The authors express their gratitude to the editor and reviewers for their valuable contributions to this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, B.; Zhang, J.; Bu, X. Insulator Detection from Arbitrary Angles in Aerial Photography Based on Neighborhood Information Interaction. Electron. Meas. Technol. 2023, 46, 117–122. [Google Scholar]
  2. Long, Y.; Wei, W.; Shu, Y.; Zhang, Z.; Wang, D.; Li, F. Defect Detection Method for Rotating Insulators Based on Adaptive Keypoints. Comput. Eng. 2023, 49, 272–278. [Google Scholar]
  3. Yin, L.; Hu, J.; Wang, W.; He, L.; Zhou, K.; Su, Z.; Yan, L.; Liu, R.; Wang, B.; Tu, Y. Study on the Effect of Environmental Conditions on the Zero-Value Insulator Detection Criterion Based on UAV Infrared Inspection. Electr. Porcelain Arresters 2023, 171–177. [Google Scholar] [CrossRef]
  4. Li, Y.; Zhang, Z.; Huang, Y. Welding Fixture and Welding Method for High-Voltage Power Transmission and Transformation Insulators: CN202310870871.8. Patent CN116586881A, 27 February 2024. [Google Scholar]
  5. Jiang, W.; Yang, Q. Transformative Impact of Industrial Intelligence Based on Deep Learning on Economic Development. J. Yulin Univ. 2023, 33, 62–68. [Google Scholar]
  6. Ross, G. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar] [CrossRef]
  7. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  8. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
  9. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar] [CrossRef]
  10. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
  11. Long, X.; Deng, K.; Wang, G.; Zhang, Y.; Dang, Q.; Gao, Y.; Shen, H.; Ren, J.; Han, S.; Ding, E.; et al. PP-YOLO: An Effective and Efficient Implementation of Object Detector. arXiv 2020. [Google Scholar] [CrossRef]
  12. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022. [Google Scholar] [CrossRef]
  13. Liu, W.; Dragomir, A.; Erhn, D.; Szegedy, C.; Reed, S.; Fu, C.-Y. SSD: Single Shot MultiBox Detector. In Proceedings of the 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Volume 9905, pp. 21–37. [Google Scholar]
  14. Hu, X.; Li, Y. Hybrid Insulator Fault Detection Based on Improved Faster R-CNN and U-net. Telev. Technol. 2021, 45, 6. [Google Scholar] [CrossRef]
  15. Guo, Y.; Ma, M.; Li, D. Lightweight Insulator Surface Defect Detection Based on Improved YOLOv5. Laser Optoelectron. Prog. 2023, 60, 2412007. [Google Scholar]
  16. Wang, X.; Wu, H.; Zhao, J. Study on Dynamic Differential Defect Detection Method for Glass Insulators. J. China Jiliang Univ. 2014, 25, 6. [Google Scholar] [CrossRef]
  17. Ma, J.; Bai, Y. Lightweight YOLOv4 for Insulator Defect Detection. Electron. Meas. Technol. 2022, 45, 123–130. [Google Scholar]
  18. Jia, X.; Yu, Y.; Guo, Y.; Huang, Y.; Zhao, B. Lightweight Detection Method for Self-Explosion Defects of Insulators in Aerial Photography. High Volt. Eng. 2023, 49, 294–300. [Google Scholar]
  19. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the 15th European Conference, Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018. [Google Scholar] [CrossRef]
  20. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. arXiv 2023, arXiv:2305.13563. [Google Scholar]
  21. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  22. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. arXiv 2021, arXiv:2103.02907. [Google Scholar] [CrossRef]
  23. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  24. Liu, C.; Wu, Y.; Liu, J.; Sun, Z.; Xu, H. Insulator faults detection in aerial images from high-voltage transmission lines based on deep learning model. Appl. Sci. 2021, 11, 4647. [Google Scholar] [CrossRef]
  25. Akdoğan, C.; Özer, T.; Oğuz, Y. PP-YOLO: Deep learning based detection model to detect apple and cherry trees in orchard based on Histogram and Wavelet preprocessing techniques. Comput. Electron. Agric. 2025, 232, 110052. [Google Scholar] [CrossRef]
  26. Kang, S.; Hu, Z.; Liu, L.; Zhang, K.; Cao, Z. Object detection YOLO algorithms and their industrial applications: Overview and comparative analysis. Electronics 2025, 14, 1104. [Google Scholar] [CrossRef]
  27. Tang, Q.; Su, C.; Tian, Y.; Zhao, S.; Yang, K.; Hao, W.; Feng, X.; Xie, M. YOLO-SS: Optimizing YOLO for enhanced small object detection in remote sensing imagery. J. Supercomput. 2025, 81, 303. [Google Scholar] [CrossRef]
  28. Li, X.; Chen, L.; Huang, T.; Yang, A.; Liu, W. YOLO-edge: Real-time vehicle detection for edge devices. Clust. Comput. 2025, 28, 289. [Google Scholar] [CrossRef]
  29. Priadana, A.; Nguyen, D.-L.; Vo, X.-T.; Choi, J.; Ashraf, R.; Jo, K. HFD-YOLO: Improved YOLO Network Using Efficient Attention Modules for Real-Time One-Stage Human Fall Detection. IEEE Access 2025, 13, 41248–41258. [Google Scholar] [CrossRef]
  30. Hu, Y.; Jiang, X.; Guo, S.; Xian, R.; Zong, C.; Yang, Z.; Han, X. Influence of snow accretion on arc flashover gradient for various types of insulators. IET Gener. Transm. Distrib. 2020, 14, 2361–2367. [Google Scholar] [CrossRef]
  31. Liu, C.; Wu, Y.; Liu, J.; Han, J. MTI-YOLO: A light-weight and real-time deep neural network for insulator detection in complex aerial images. Energies 2021, 14, 1426. [Google Scholar] [CrossRef]
  32. Lu, Y.; Li, D.; Li, D.; Li, X.; Gao, Q.; Yu, X. A lightweight insulator defect detection model based on drone images. Drones 2024, 8, 431. [Google Scholar] [CrossRef]
Figure 1. Typical Defect Images of Insulators.
Figure 2. YOLOv5 network structure diagram.
Figure 3. MC-YOLO network structure diagram.
Figure 4. Diagram of the MACF module.
Figure 5. Diagram of how weights are obtained.
Figure 6. CCFM feature fusion module.
Figure 7. Spatial-Channel Dual Attention Module.
Figure 8. Comparison chart of mAP before and after improvement.
Figure 9. Detection performance of YOLOv5 and MC-YOLO.
Figure 10. Grad-CAM visualisation.
Table 1. Experimental configuration information.
Name | Parameters
GPU | RTX 3060
CPU | 6 × E5-2680 v4
Operating System | Linux Ubuntu 22.04
CUDA | 11.1
Programming Language | Python 3.10
PyTorch | 2.0.1
Table 2. Ablation experiments (√ indicates that the corresponding module in the column is enabled).
MACF | CCFM | Precision | Recall | Parameters | GFLOPs | mAP@0.5 | mAP@0.5:0.95 | FPS
– | – | 0.719 | 0.622 | 7,018,216 | 16.0 | 0.633 | 0.411 | 169
√ | – | 0.711 | 0.671 | 7,804,776 | 17.6 | 0.665 | 0.432 | 125
√ | √ | 0.722 | 0.660 | 8,215,536 | 19.0 | 0.674 | 0.436 | 121
Table 3. Ablation experiments for a single module.
Model | Defect | GFLOPs | Params | Precision | Recall | mAP@0.5 | mAP@0.5:0.95
YOLOv5 | Normal | 16.0 | 7,018,216 | 0.970 | 0.954 | 0.975 | 0.745
YOLOv5 | Damage | 16.0 | 7,018,216 | 0.645 | 0.552 | 0.558 | 0.275
YOLOv5 | Flashover | 16.0 | 7,018,216 | 0.543 | 0.389 | 0.365 | 0.223
YOLOv5 + MACF | Normal | 17.6 | 7,804,776 | 0.953 | 0.950 | 0.978 | 0.734
YOLOv5 + MACF | Damage | 17.6 | 7,804,776 | 0.660 | 0.587 | 0.595 | 0.299
YOLOv5 + MACF | Flashover | 17.6 | 7,804,776 | 0.519 | 0.475 | 0.423 | 0.243
MC-YOLO | Normal | 19.0 | 8,215,536 | 0.963 | 0.956 | 0.980 | 0.746
MC-YOLO | Damage | 19.0 | 8,215,536 | 0.662 | 0.571 | 0.600 | 0.302
MC-YOLO | Flashover | 19.0 | 8,215,536 | 0.542 | 0.456 | 0.449 | 0.260
Table 4. Attention comparison experiments.
Model | Params (M) | GFLOPs | mAP@0.5 | mAP@0.5:0.95
YOLOv5 | 7.02 | 16.0 | 0.633 | 0.411
SCDAM | 8.19 | 19.0 | 0.674 | 0.436
EMA [20] | 8.17 | 19.4 | 0.671 | 0.435
SE [21] | 8.16 | 18.8 | 0.668 | 0.432
CBAM [19] | 8.16 | 18.8 | 0.659 | 0.427
CA [22] | 8.16 | 19.0 | 0.656 | 0.430
Table 5. Contrast experiment.
Model | Params (M) | mAP@0.5 | FPS
YOLOv5s | 7.1 | 0.633 | 169
YOLOv6s [24] | 4.2 | 0.597 | 276
YOLOv7 | 37.4 | 0.551 | 50
YOLOv7-tiny | 6.1 | 0.476 | 126
YOLOv8n | 3.2 | 0.664 | 270
YOLOv9c | 25.3 | 0.711 | 57
YOLOv10 | 8.1 | 0.617 | 65
YOLOv11 | 9.4 | 0.667 | 182
YOLOv12 | 9.2 | 0.651 | 133
Ours | 8.2 | 0.674 | 121
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
