Article

CAG-Net: A Novel Change Attention Guided Network for Substation Defect Detection

by
Dao Xiang
1,
Xiaofei Du
2 and
Zhaoyang Liu
1,*
1
School of Information Engineering, Xuzhou University of Technology, Xuzhou 221018, China
2
School of Mechanical Engineering, Southeast University, Nanjing 211189, China
*
Author to whom correspondence should be addressed.
Mathematics 2026, 14(1), 178; https://doi.org/10.3390/math14010178
Submission received: 24 November 2025 / Revised: 25 December 2025 / Accepted: 28 December 2025 / Published: 2 January 2026

Abstract

Timely detection and handling of substation defects play a foundational role in ensuring the stable operation of power systems. Existing substation defect detection methods fail to make full use of the temporal information contained in substation inspection samples, resulting in problems such as weak generalization ability and susceptibility to background interference. To address these issues, a change attention guided substation defect detection algorithm (CAG-Net) based on a dual-temporal encoder–decoder framework is proposed. The encoder employs a Siamese backbone network composed of efficient local-global context aggregation modules to extract multi-scale features, balancing local details and global semantics, and introduces a change attention guided module that takes feature differences as attention weights to dynamically enhance the saliency of defect regions and suppress background interference. The decoder adopts an improved FPN structure to fuse high-level and low-level features, supplement defect details, and improve the model’s ability to detect small targets and multi-scale defects. Experimental results on the self-built substation multi-phase defect dataset (SMDD) show that the proposed method achieves 81.76% mAP, which is 3.79% higher than Faster R-CNN and outperforms mainstream detection models such as Gold-YOLO and YOLOv10. Ablation experiments and visualization analysis demonstrate that the method effectively focuses on defect regions in complex environments, improving the localization accuracy of multi-scale targets.

1. Introduction

Power transformer substations form the core hub of the “generation, transmission, transformation, distribution, and consumption” chain in the power system, and their stable operation is crucial to power grid security, the national economy, and people’s livelihood. Due to the impacts of natural environments, high-voltage loads, and human factors, substations are prone to equipment state change defects such as meter damage and cabinet door opening, as well as safety risks including vehicle/personnel intrusion and foreign object invasion. These issues not only affect the normal operation of equipment but may even lead to power outages [1]. Traditional manual inspection mechanisms suffer from limitations such as low work efficiency and high rates of missed and false detections, posing challenges for adapting to the rapid expansion of substation scales [2]. To improve inspection efficiency and quality, State Grid Corporation of China has promoted the application of intelligent inspection robots and surveillance cameras as the core equipment for the smart grid, and has successively issued a number of industry standards and guiding documents, aiming to construct a new intelligent operation and maintenance mode for substations featuring “centralized control station + unattended operation” and to promote the replacement of manual inspections with remote intelligent inspections [3].
While intelligent inspection equipment has initially replaced manual work to realize remote inspection, the lack of high-precision defect detection algorithms means that defect analysis still relies heavily on manual verification at this stage. This has become a bottleneck restricting the implementation of whole-process intelligent operation and maintenance [4]. When releasing the term “power vision”, the China Computer Federation (CCF) pointed out that existing research on substation defect detection is mainly designed and optimized based on general object detection frameworks, which are tailored for conventional object detection problems in natural scenes. These frameworks fail to fully consider the characteristics of substation scenarios, leading to issues such as weak generalization ability, poor interpretability, and susceptibility to overfitting [5]. Inspection samples collected by substation inspection equipment at the same location but different times contain abundant temporal information [6], as shown in Figure 1. Defects correspond to changes in regional features in bi-temporal samples. Utilizing these feature changes to reveal the existence of defects is conducive to suppressing background interference and reducing the complexity of model design. To fully exploit the temporal information in inspection samples, we propose a novel substation defect detection network, CAG-Net, which embeds a change attention mechanism into the general object detection framework. Taking the feature differences of input images as attention weights, the mechanism guides the feature maps of inspection images—specifically, enhancing regions with feature changes and suppressing backgrounds with minimal changes—thereby achieving the performance improvement of bi-temporal prior-driven defect detection.

2. Related Work

2.1. Substation Defect Detection

As a research subfield of industrial defect detection, substation defect detection methods can be categorized into traditional methods and deep learning-based methods. Traditional methods mainly rely on low-level image processing or conventional machine learning for defect extraction, which are only applicable to scenarios with stable imaging conditions and simple backgrounds. Deep learning-based methods can directly learn the feature representations of defects from raw images, outperforming traditional methods in performance. They have become the mainstream approach in current substation defect detection research and can be further classified into two-stage models and one-stage models based on their architectural structures [7].
Two-stage models divide object detection into two sequential steps: first generating region proposals, followed by bounding box regression and classification for each proposal. Faster R-CNN [8] is a classic algorithm adopting this framework and has been widely applied in substation defect detection. Yu et al. [9] utilized the Faster R-CNN algorithm to accurately locate the targets of six types of substation equipment, and then identified abnormal temperature defects in the equipment region based on a temperature-threshold discriminative algorithm. Ying et al. [10] introduced a series of improvement strategies such as inter-class sampling, center-guided non-maximum suppression, and category-adaptive thresholds into Faster R-CNN, which alleviated the missed detections and false alarms caused by long-tailed sample distribution and bounding box redundancy. Zhang et al. [11] added a spatial pyramid pooling structure to the Faster R-CNN network to optimize feature fusion, enhancing the detection performance for small targets. The advantage of two-stage models lies in their high accuracy, but they suffer from high computational complexity and slow detection speed, making them unsuitable for scenarios with strict real-time requirements.
One-stage models do not rely on region proposals and can directly predict the location and category of defects from features extracted by the backbone network. The YOLO series models [12] and the SSD model [13] are representative algorithms of this framework. With their simple structure and high speed, they have become commonly used detection methods for industrial embedded devices. Dong et al. [14] embedded a parallel hybrid attention mechanism into YOLOv5, which enhanced the differences between various defect features by integrating local and non-local correlation information, guiding the model to focus on key equipment regions in complex backgrounds. Yuan et al. [15] proposed a novel multi-branch backbone and neck network fused with a feature pyramid network based on YOLOv7, which effectively combines spatial location features with rich semantic information in images and improves the detection performance for multi-scale and small targets. Meng et al. [16] designed a novel attention mechanism module that encourages the model to focus on target objects under adverse weather conditions, and optimized the SPPF and C2f-OD modules of YOLOv10n to strengthen the backbone’s feature extraction capability for small targets. He et al. [17] combined the Swin Transformer with a convolutional neural network to enhance the model’s understanding of multi-scale global semantic information via cross-layer interactions, and employed bidirectional cross-scale connections and weighted feature fusion to enhance the capability of tiny defect detection. Huang et al. [18] improved the feature extraction network, upsampling operator, and Dyhead detection head of YOLOv7-tiny, and adopted a comprehensive pruning strategy, which enhanced the overall prediction performance of the algorithm while reducing model complexity.
One-stage models offer high detection efficiency, but their detection performance for multi-scale targets in complex and noisy scenarios still needs further improvement.
The aforementioned one-stage and two-stage models are mainly based on general object detection frameworks. By integrating attention mechanisms, optimizing loss functions, and other strategies, these models reduce background interference and enhance the capability to extract multi-scale defect features, achieving certain progress in detecting specific types of substation defects. However, with the diversification of substation defect types and the increasing complexity of inspection scenarios, existing methods still have the following limitations: (1) The backbone’s feature extraction capability under complex backgrounds is insufficient, which limits its capability to distinguish similar targets, and features from the surrounding environment can easily interfere with defect recognition; (2) When dealing with targets that vary significantly in scale and shape, the feature representation capability from the neck to the detection head is notably inadequate, ultimately leading to poor model detection performance [19]. In general, current research only focuses on intuitive perception tasks and fails to deeply explore and utilize substation domain-specific knowledge. Consequently, the model performance cannot effectively meet practical application requirements, and there is an urgent need for “specialized” customization.

2.2. Change Detection

Change detection refers to the technology of analyzing images acquired at the same location at different times to identify inter-period state differences of entities, and it has emerged as a research hotspot in the field of photogrammetry and remote sensing [20]. Early change detection methods mainly rely on comparison and analysis of manually selected features, which perform well for segmenting regions with obvious changes. However, they suffer from issues including poor noise resistance, blurred and adherent boundaries, and numerous pseudo-changes. With the rise of deep learning technology, Siamese network-based change detection methods have become the mainstream of research.
Siamese networks extract features from multi-temporal inputs separately via weight-sharing backbone networks, and then perform change detection using feature differences, which has advantages in reducing the mutual interference of high-dimensional features. Fang et al. [21] embedded a series of feature interaction layers into the Siamese feature extraction module, achieving competitive performance without the need for dense prediction. Chen et al. [22] proposed a spatiotemporal relation modeling mechanism, which is integrated with the visual Mamba decoder architecture to achieve spatiotemporal interaction of multi-temporal features and improve the accuracy of change information extraction. Feng et al. [23] enhanced the information coupling between features via self-attention and cross-attention mechanisms, boosting the change detection performance on datasets with foreground-background class imbalance.
Samples collected by substation inspection equipment at different times also have a multi-temporal relationship, and defects correspond to regional changes in these multi-temporal samples. Obviously, detecting regional changes in multi-temporal inspection samples can assist in defect detection. In recent years, some studies have begun to focus on the application of multi-temporal information in the field of substation defect detection. For example, Wang et al. [24] extracted multi-scale feature differences based on a triplet Siamese network and enhanced the model’s anti-interference capability using spatial consistency loss. Zhao et al. [25] identified discriminative features among various appearance defects through a comparative sample strategy, improving the detection accuracy of instrument appearance defects. However, overall, research related to substation multi-temporal information remains scarce and urgently needs in-depth exploration and utilization. In addition, the interference in remote sensing images mainly comes from noise, while substation inspection images are easily affected by factors such as weather, lighting, and seasons, resulting in irrelevant changes including vegetation shifts, leaf movement, and shadow displacement, which imposes higher requirements on the backbone network for feature extraction and interference suppression.

2.3. Differences from Existing Representative Approaches

To clarify the innovations of CAG-Net, we briefly distinguish it from the two representative approaches mentioned above, i.e., Siamese-based change detection and attention-guided defect detection, as summarized below:
Siamese-based change detection methods [21,22,23] mainly rely on weight-sharing backbones to extract bi-temporal features and adopt simple feature subtraction or concatenation for change identification. However, they lack targeted modules to enhance defect-related change signals and suppress background interference, leading to insufficient discriminability for subtle substation defects. In contrast, CAG-Net integrates the Change Attention Guided Module (CAGM) to explicitly model bi-temporal feature differences, and supplements the Detail-enhanced Feature Fusion Module to compensate for detail loss, addressing the core limitation of existing Siamese methods in balancing change semantics and spatial details.
For attention-guided defect detection methods [14,16], attention mechanisms are primarily designed for single-frame image saliency mining, focusing on spatial attention within a single image rather than temporal change cues. CAG-Net differs by constructing a change-driven attention paradigm, which takes bi-temporal feature differences as attention weights, dynamically enhancing defect regions while suppressing irrelevant background variations that are prevalent in substation scenarios. Such design makes the attention mechanism more tailored to the characteristics of substation defects, which are inherently associated with temporal changes.

3. Methods

This section elaborates on the proposed change attention-guided substation defect detection algorithm, CAG-Net. As illustrated in Figure 2, CAG-Net consists of a change encoder, a change decoder, and a defect detection head. Firstly, the bi-temporal inspection images collected by substation inspection equipment at the same location but different times—namely, the defect-free reference image $T_1 \in \mathbb{R}^{H \times W \times 3}$ and the potentially defect-containing inspection image $T_2 \in \mathbb{R}^{H \times W \times 3}$—are input into the change encoder after pixel-level registration [26]. The change encoder extracts multi-scale global contextual feature information and highlights defect-related changed regions with temporal characteristics. Subsequently, the change decoder adopts an FPN-like structure [27], adaptively supplementing defect detail information and fusing cross-layer information to obtain the predicted feature map. Finally, the predicted feature map is fed into the defect detection head for the localization and classification of defect targets. Compared with traditional single-frame approaches that only exploit spatial features of individual images, the dual-temporal encoder–decoder framework realizes the transformation from spatial-only perception to spatiotemporal joint reasoning via explicit temporal information modeling and can effectively suppress interference from irrelevant background variations.

3.1. Change Encoder

The change encoder consists of a Siamese backbone network with shared weights and a Change Attention Guided Module (CAGM). The Siamese backbone network is composed of L layers of downsampling and feature extraction blocks, responsible for extracting multi-scale defect features from the bi-temporal images. The CAGM undertakes the fusion and mapping of features from each layer extracted by the Siamese backbone network to strengthen defect information and suppress irrelevant background.

3.1.1. Backbone Network

Considering the characteristics of substation defect images, such as complex background, diverse features, and variable scales, we adopt the efficient local-global context aggregation module proposed by Noman et al. [28] to construct the backbone network. The goal is to enhance the feature representation capability and nonlinear fitting ability of the backbone network, thereby better capturing the details and overall structure of defects. In the backbone network, the downsampling block is implemented by patch embed composed of a single convolutional layer, while the feature extraction block is realized by the Encoder Block consisting of multiple convolutional layers and self-attention [29]. The structure of the Encoder Block is illustrated in Figure 3.
ELGCA is the core module of the Encoder Block, which can effectively integrate the advantages of convolution and self-attention mechanisms to enhance feature representation capability. It mainly aggregates local and global contextual information through a dual-branch structure, namely the local branch and the global branch. The local branch facilitates the capture of subtle visual cues closely related to defect recognition, while the global branch enables the transmission of feature information between long-distance positions, improving the model’s feature modeling capability for large-scale regions. The main processing steps are as follows:
  • Given the input feature $X^i$, perform channel-wise splitting to obtain the local subset $X_{lo}^i$ and the global subset $X_{gl}^i$;
  • For the local branch, apply a $3 \times 3$ depth-wise convolution to $X_{lo}^i$ to generate the local spatial contextual feature $\bar{X}_{lo}^i$;
  • For the global branch, perform a $1 \times 1$ convolution on $X_{gl}^i$ to obtain the $Q^i$, $K^i$, $V^i$, and $Z^i$ features, respectively. $Q^i$, $K^i$, and $V^i$ are then used as the query, key, and value of pooled-transpose (PT) Attention to capture the global contextual information $A_{att}^i$, whereas $Z^i$ is used for multi-channel feature aggregation;
  • Finally, concatenate $\bar{X}_{lo}^i$, $A_{att}^i$, and $Z^i$ along the channel dimension to obtain the fused feature $\bar{X}^i$.
The balance between local details and global semantics in ELGCA is achieved through the dual-branch design, i.e., the local branch uses a $3 \times 3$ depth-wise convolution to preserve subtle defect cues, while the global branch leverages PT-Attention to model long-range dependencies. The 1:1 channel-splitting strategy allocates equal computational resources to both branches, avoiding mutual interference and ensuring comprehensive feature representation. According to the scale prior of substation defects, we adopt a four-layer backbone network to extract multi-scale defect features, i.e., $L = 4$. The downsampling ratios of the feature maps of each layer relative to the input image are $1/4$, $1/8$, $1/16$, and $1/32$, respectively, with the corresponding numbers of channels being 64, 96, 128, and 256. The multi-scale feature maps extracted by the Siamese backbone network are then fed into the CAGM for feature fusion and mapping.
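The four ELGCA steps above can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' implementation: the 1:1 channel split, depth-wise local branch, and channel concatenation follow the description, while PT-Attention is stood in by plain cross-attention against spatially pooled keys and values, and all class and variable names here are our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ELGCASketch(nn.Module):
    """Dual-branch local-global context aggregation (simplified sketch)."""
    def __init__(self, channels, pool=7):
        super().__init__()
        assert channels % 4 == 0
        half, g = channels // 2, channels // 4
        self.g, self.pool = g, pool
        # Local branch: 3x3 depth-wise conv preserves subtle defect cues.
        self.local_dw = nn.Conv2d(half, half, 3, padding=1, groups=half)
        # Global branch: 1x1 convs produce Q, K, V and the aggregation path Z.
        self.to_qkv = nn.Conv2d(half, 3 * g, 1)
        self.to_z = nn.Conv2d(half, g, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        x_lo, x_gl = torch.chunk(x, 2, dim=1)        # 1:1 channel split
        local = self.local_dw(x_lo)                  # local spatial context
        q, k, v = torch.chunk(self.to_qkv(x_gl), 3, dim=1)
        # Pool keys/values to a small token grid (stand-in for PT-Attention).
        k = F.adaptive_avg_pool2d(k, self.pool).flatten(2)   # (b, g, p*p)
        v = F.adaptive_avg_pool2d(v, self.pool).flatten(2)
        q = q.flatten(2)                                     # (b, g, h*w)
        attn = torch.softmax(q.transpose(1, 2) @ k / self.g ** 0.5, dim=-1)
        att = (attn @ v.transpose(1, 2)).transpose(1, 2).reshape(b, self.g, h, w)
        z = self.to_z(x_gl)
        # Concatenate local, attention, and aggregation features back to `channels`.
        return torch.cat([local, att, z], dim=1)
```

Note that the output channel count equals the input count (half + quarter + quarter), so the block can be stacked layer by layer.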

3.1.2. Change Attention Guided Module (CAGM)

In substation scenarios, background interference is a key factor leading to missed detections or false detections of defects [30]. Defects correspond to regional changes in the bi-temporal input images, i.e., among the multi-scale features extracted by the Siamese backbone network, the features of background regions are highly similar, while those of change regions are less similar, and the defects to be detected are exactly in these change regions with low similarity. To suppress background interference, we design the Change Attention Guided Module (CAGM). Its core function lies in calculating the differences between the feature maps extracted by the Siamese backbone network to perceive regional changes, thereby automatically enhancing defect features and highlighting the saliency of change regions, which provides guidance for the final defect target detection. To distinguish actual defect-related changes from background interference, the CAGM first captures initial change cues via absolute feature difference. Then, the MAFE module expands the receptive field to avoid missing small defects, and SimAM attention dynamically enhances the pixel-level importance of defect regions. Finally, the attention weights generated by the Sigmoid function adaptively amplify defect features while suppressing spurious background changes, ensuring accurate discrimination.
As shown in Figure 4, the input of the CAGM consists of the reference image feature map $f_1^i$ and the inspection image feature map $f_2^i$ in the $i$-th layer extracted by the Siamese backbone network. Firstly, an absolute difference operation is performed to highlight the differences between adjacent regions and emphasize the change regions. This process can be expressed by the following formula:
$f_{diff}^i = \left| f_1^i - f_2^i \right|$
where $|\cdot|$ denotes the absolute difference operation, and $f_{diff}^i$ represents the initial difference features. Subsequently, $f_{diff}^i$ is sequentially processed through a Multiscale Atrous Feature Extraction Module (MAFE), a 3D attention module SimAM [31], and a convolution block to further enhance the difference features. Although the backbone network already has a certain multi-scale feature extraction capability, substation defects exhibit significant scale differences, i.e., small defects are easily overlooked due to insufficient receptive fields, while large defects may suffer from detail loss caused by reduced feature resolution. To enhance the model’s ability to capture a global receptive field and reduce the impact of pseudo-changes, we design the MAFE module. It enables the network to fully learn feature information at different scales through multi-branch convolutions with different dilation rates, thereby more comprehensively capturing the features of substation defects and improving the detection accuracy of defect targets.
As shown in Figure 5, the MAFE module uses a series of atrous convolutions and residual connections to perform multi-scale feature extraction and fusion. According to the scale prior of substation defects, four parallel atrous convolution branches $\text{Branch}_k, k \in \{1, 2, 3, 4\}$ are designed in the multi-branch structure to expand the receptive field, with receptive field sizes of 15, 11, 7, and 3, respectively. Each branch is composed of a mixture of atrous convolutions with different dilation rates, which helps to alleviate the gridding effect and local information loss [32]. The features extracted by each branch are fused with the next branch by element-wise addition, which achieves complementary enhancement of multi-scale features while expanding the global receptive field. The processing procedure of MAFE can be expressed by the formula:
$F_{MAFE}^i = \mathrm{Conv}_1(f_{diff}^i) + \mathrm{Fuse}(f_{diff}^i)$
where $\mathrm{Conv}_1$ denotes a $1 \times 1$ convolution, which can be treated as a residual connection, and $\mathrm{Fuse}(\cdot)$ represents the progressive fusion operation of multi-receptive-field features. In this manner, different branches are interrelated and benefit from each other, and the temporal changes are enhanced across diverse receptive fields.
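A minimal PyTorch sketch of the progressive multi-branch fusion described above, assuming one atrous convolution per branch (for a single 3 × 3 kernel, dilations 7, 5, 3, and 1 yield the stated receptive fields 15, 11, 7, and 3); the paper mixes several dilation rates within each branch, which is omitted here for brevity.

```python
import torch
import torch.nn as nn

class MAFESketch(nn.Module):
    """Multiscale atrous feature extraction (illustrative simplification)."""
    def __init__(self, c):
        super().__init__()
        # One 3x3 atrous conv per branch; dilation d gives receptive field 2d+1.
        self.branches = nn.ModuleList(
            nn.Conv2d(c, c, 3, padding=d, dilation=d) for d in (7, 5, 3, 1))
        # 1x1 convolution acting as the residual path Conv_1.
        self.residual = nn.Conv2d(c, c, 1)

    def forward(self, f_diff):
        fused = None
        for branch in self.branches:
            y = branch(f_diff)
            # Progressive fusion: each branch is added element-wise to the
            # accumulated output of the previous branches (Fuse operation).
            fused = y if fused is None else fused + y
        return self.residual(f_diff) + fused
```

Because each branch uses `padding=dilation`, the spatial resolution of the difference features is preserved through the module.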
After MAFE expands the receptive field, the difference features are fed into SimAM to capture the pixel-level importance of local neighborhoods. SimAM [31] is a lightweight, parameter-free 3D attention module that can dynamically model the spatial correlation and contrast between features and adaptively adjust feature responses, thereby identifying the features most critical to the task and improving the detection accuracy. Finally, the difference information is fed into a Sigmoid function to generate the attention weight $\omega$, which is then applied to the original feature map $f_2^i$. Such a process realizes the enhancement of change regions in the inspection image and the suppression of unimportant background regions, enabling the model to focus more accurately on effective information and thus improving the efficiency and accuracy of feature representation. This process can be expressed as:
$D^i = \sigma(\mathrm{Convb}_3(\mathrm{SimAM}(F_{MAFE}^i))) \otimes f_2^i$
where $\mathrm{Convb}_3$ denotes a convolution block composed of a $3 \times 3$ convolution, Batch Normalization (BN), and a ReLU activation function; $\sigma$ represents the Sigmoid function; and $\otimes$ denotes the element-wise multiplication operation. Finally, the multi-scale defect features $D^i, i \in \{1, 2, 3, 4\}$ are obtained and serve as the input to the change decoder module. Overall, the CAGM enhances defect semantic features and suppresses irrelevant background interference, thereby improving the discriminability of multi-scale defect features and laying a solid foundation for the change decoder to efficiently capture and fuse cross-layer information for accurate defect detection.

3.2. Change Decoder

The core task of the change decoder is to fuse defect information from different levels to achieve information complementation and enhancement. It adopts an FPN-like pyramid structure and fuses high-level semantic information with low-level contextual information in a bottom-up manner, facilitating feature decoding and providing more abundant feature support for subsequent defect detection.
Take the feature fusion of the $i$-th and $(i+1)$-th layers as an example. Firstly, $D^i$ and $D^{i+1}$ are fed, along with the original inspection feature maps, into the feature fusion module DFFM to obtain the detail-enhanced features $C^i$ and $C^{i+1}$, respectively. Here, $C^{i+1}$ serves as a critical semantic bridge for cross-scale feature fusion: it inherits rich global semantic cues derived from the $(i+1)$-th encoder layer while supplementing fine-grained details via DFFM, addressing the detail loss caused by repeated downsampling in high-level feature extraction. Then, $C^{i+1}$ is upsampled and added element-wise to $C^i$ to realize the fusion of features from different levels. This semantic-dominant characteristic enables $C^{i+1}$ to provide directional guidance for the fusion with $C^i$, which is rich in local details but weak in global discriminability. Finally, a $3 \times 3$ convolution is applied to enhance the feature representation. It should be noted that the lowest-level feature $D^4$, after going through the feature fusion module and a $3 \times 3$ convolution, is directly output as the enhanced feature $P^4$. This process can be described by the following formulas:
$C^i = \mathrm{DFFM}(D^i, f_2^i)$
$C^{i+1} = \mathrm{DFFM}(D^{i+1}, f_2^{i+1})$
$P^i = \mathrm{Conv}_3(C^i + \mathrm{Up}(C^{i+1}))$
where $\mathrm{Up}(\cdot)$ denotes the upsampling operation, which is mainly composed of bilinear interpolation and convolutions to achieve spatial scale alignment between high-level and low-level features. The change decoder ultimately outputs the multi-scale enhanced features $P^i, i \in \{1, 2, 3, 4\}$, which serve as the input for the defect detection head.
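The cross-level fusion just described can be sketched as follows, with DFFM stood in by channel concatenation plus a 1 × 1 projection, bilinear interpolation for the upsampling, and the fused map propagated across levels in the usual FPN manner; the channel counts (64, 96, 128, 256) follow the backbone configuration, while the common output width of 64 is our own choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderSketch(nn.Module):
    """FPN-like fusion P_i = Conv3(C_i + Up(C_{i+1})) (illustrative)."""
    def __init__(self, chans=(64, 96, 128, 256), out_c=64):
        super().__init__()
        # Stand-in for DFFM: concat defect + inspection features, project.
        self.dffm = nn.ModuleList(nn.Conv2d(2 * c, out_c, 1) for c in chans)
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_c, out_c, 3, padding=1) for _ in chans)

    def forward(self, d_feats, f2_feats):
        # d_feats / f2_feats: 4 maps each, finest (1/4) to coarsest (1/32).
        c = [m(torch.cat([d, f], dim=1))
             for m, d, f in zip(self.dffm, d_feats, f2_feats)]
        p = [None] * 4
        p[3] = self.smooth[3](c[3])          # deepest level yields P4 directly
        for i in (2, 1, 0):                  # fuse toward the finest level
            up = F.interpolate(c[i + 1], size=c[i].shape[2:],
                               mode="bilinear", align_corners=False)
            c[i] = c[i] + up                 # C_i + Up(C_{i+1})
            p[i] = self.smooth[i](c[i])      # 3x3 conv -> P_i
        return p
```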
The defect feature $D^i$ contains rich semantic cues but may suffer from the loss of defect details during the bi-temporal feature interaction of the CAGM [33], while the original inspection image feature $f_2^i$ contains more detailed information but fewer semantic cues. It is natural to enhance the feature representation ability by combining the two kinds of feature maps. Inspired by the lateral skip connection design of U-Net [34], we design a Detail-enhanced Feature Fusion Module (DFFM) to fuse the defect features and the inspection image features, aiming to supplement defect details and improve the detection performance for minor defects.
Considering the semantic differences between defect features and inspection image features, simply combining them through a concatenation or summation operation can easily lead to feature confusion. To address this issue, we incorporate spatial attention and cross-attention mechanisms into the traditional feature concatenation fusion method, enabling more precise capture of key information during feature fusion and thereby effectively enhancing feature representation capability. As shown in Figure 6, the DFFM consists of four branches: two direct feature concatenation branches and two cross-attention branches. In the cross-attention branches, the input features $D^i$ and $f_2^i$ are first processed through the spatial attention module (SAM) proposed in [35], generating the corresponding spatial feature responses $S_D$ and $S_f$, respectively. Then, after processing through a $3 \times 3$ convolution, the input features are multiplied by the spatial feature responses in a cross manner. The whole fusion process can be expressed as:
$S_{Df} = \mathrm{Conv}_3(D^i) \otimes \mathrm{SAM}(f_2^i)$
$S_{fD} = \mathrm{Conv}_3(f_2^i) \otimes \mathrm{SAM}(D^i)$
$C^i = \mathrm{Conv}_1(\mathrm{Cat}(f_2^i, S_{Df}, S_{fD}, D^i))$
where $\mathrm{Cat}(\cdot)$ refers to the channel concatenation operation. The $3 \times 3$ convolution $\mathrm{Conv}_3$ is applied to enhance the fusion of different semantic features, and the SAM module helps the model focus on more critical spatial areas.
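The four-branch fusion can be sketched in PyTorch as below, assuming the SAM of [35] takes the common CBAM spatial-attention form (channel-wise average and max pooling followed by a 7 × 7 convolution and a Sigmoid); layer widths and names beyond the formulas are our own guesses.

```python
import torch
import torch.nn as nn

class SAM(nn.Module):
    """CBAM-style spatial attention (assumed form of the SAM in [35])."""
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)          # channel-wise average
        mx = x.max(dim=1, keepdim=True).values     # channel-wise max
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class DFFMSketch(nn.Module):
    """Two concat branches plus two cross-attention branches S_Df, S_fD,
    merged by channel concatenation and a 1x1 projection."""
    def __init__(self, c, out_c):
        super().__init__()
        self.conv_d = nn.Conv2d(c, c, 3, padding=1)
        self.conv_f = nn.Conv2d(c, c, 3, padding=1)
        self.sam_d, self.sam_f = SAM(), SAM()
        self.proj = nn.Conv2d(4 * c, out_c, 1)     # Conv_1 over Cat(...)

    def forward(self, d, f2):
        s_df = self.conv_d(d) * self.sam_f(f2)     # cross branch 1
        s_fd = self.conv_f(f2) * self.sam_d(d)     # cross branch 2
        return self.proj(torch.cat([f2, s_df, s_fd, d], dim=1))
```

The crossed multiplication lets each stream be spatially gated by the other's attention map, which is what prevents the naive concatenation from mixing unrelated semantics.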

3.3. Defect Detection Head

The defect detection head takes the enhanced multi-scale defect features output by the change decoder as input, aiming to accurately identify defect categories and localize their positions. It adopts the design of Faster R-CNN [8], mainly consisting of a Region Proposal Network (RPN), a classification sub-network, and a regression sub-network. In the original RPN, the generation ratios of anchor boxes are 2:1, 1:1, and 1:2 with a minimum size of 128 × 128 pixels, enabling the generation of 9 types of candidate regions with different sizes and ratios. Such a setup is suitable for object detection tasks on general datasets such as ImageNet and COCO [36]. Considering the particularities of substation defects in features such as shape and scale, it is necessary to make targeted adjustments to the anchor box generation parameters.
By analyzing the substation defect samples, we set the anchor box aspect ratios to 1:4, 1:2, 1:1, 2:1, and 4:1, and the anchor box sizes to 32, 64, 128, and 256 pixels. The improved RPN can generate 20 types of candidate regions with different sizes [37], thereby further improving the detection accuracy of defects with diverse scales.
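The adjusted anchor configuration can be reproduced with a small helper (our own illustration, not the paper's code) that enumerates the 5 × 4 = 20 anchor shapes while keeping each anchor's area equal to size², as in standard RPN anchor generation.

```python
def anchor_shapes(sizes=(32, 64, 128, 256),
                  ratios=(1/4, 1/2, 1, 2, 4)):
    """Enumerate (width, height) anchor shapes.

    Each ratio r = height / width; width and height are scaled so that
    width * height == size ** 2, the usual equal-area convention.
    """
    shapes = []
    for s in sizes:
        for r in ratios:
            w = s / r ** 0.5
            h = s * r ** 0.5
            shapes.append((w, h))
    return shapes

shapes = anchor_shapes()   # 4 sizes x 5 ratios = 20 anchor shapes
```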

4. Experimental Setup

4.1. Dataset

Due to the lack of publicly available datasets for substation defect detection, a Substation Multi-phase Defect Dataset (SMDD) was constructed to verify the effectiveness of the proposed method. Specifically, a large number of inspection images were collected by substation inspection robots and surveillance cameras, and samples with poor imaging quality induced by environmental factors such as lighting and weather were filtered out based on manual empirical knowledge. After image registration [26] to compensate for the positioning errors of imaging equipment, we finally obtained 1915 groups of inspection images. As shown in Figure 7, each group consists of 1 normal image and 3 defect images collected at the same location but different times, with defect regions annotated using VOC-style rectangular boxes. Since 1 normal image and 1 defect image can form a pair of bi-temporal inputs, a total of 5745 image pairs can be obtained. The original data are divided into training and test sets at a ratio of 9:1, with the proportion of each category kept consistent during the split. The defects in the dataset cover common types such as abnormal box door closure, foreign object intrusion, and insulator breakage, and the number of each defect type is shown in Table 1.
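The pairing arithmetic above can be checked with a short sketch (hypothetical data layout, not the actual dataset loader): each of the 1915 groups pairs its one normal image with each of its three defect images, giving 5745 bi-temporal pairs.

```python
def build_pairs(groups):
    """groups: list of (normal_img, [defect_img, ...]) tuples.
    Pair the defect-free reference with every defect image in its group."""
    return [(normal, defect)
            for normal, defects in groups
            for defect in defects]

# 1915 groups x 3 defect images each -> 5745 bi-temporal input pairs
toy_groups = [(f"normal_{i}", [f"d{i}_1", f"d{i}_2", f"d{i}_3"])
              for i in range(1915)]
pairs = build_pairs(toy_groups)
```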

4.2. Implementation Details

The experiments were conducted on a server equipped with two NVIDIA RTX 4090 GPUs and an Intel Xeon W5-2455X CPU. The operating system is Ubuntu 22.04 LTS 64-bit with Python 3.9, and the code is implemented on the PyTorch 2.2.1 framework.
The training parameters of the network are as follows: the bi-temporal input images are resized to 640 × 640 × 3; the learning rate is initialized to 0.0001 and decays linearly to 0 over 300 epochs; the optimizer is AdamW with a weight decay of 0.0005 and beta values of ( 0.5 , 0.999 ); and the batch size is set to 8. The backbone of the change encoder is initialized with pre-trained weights [28], whereas the remaining sub-networks adopt Kaiming initialization [38]. During training, we applied standard data augmentation to the input images, including random flipping, rotation, cropping, color jittering, and Gaussian blur.
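The linear learning-rate decay can be expressed as a simple schedule function. This is a minimal sketch with a name of our choosing; in a PyTorch training loop it would typically be wrapped in a `LambdaLR` scheduler.

```python
def linear_decay_lr(epoch, base_lr=1e-4, total_epochs=300):
    """Learning rate decayed linearly from base_lr at epoch 0 to 0
    at total_epochs, clamped at 0 thereafter."""
    return base_lr * max(0.0, 1.0 - epoch / total_epochs)

print(linear_decay_lr(0))    # 0.0001
print(linear_decay_lr(150))  # 5e-05
print(linear_decay_lr(300))  # 0.0
```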

4.3. Performance Metrics

To compare the performance of our model with state-of-the-art (SOTA) methods, we use Average Precision (AP) and mean Average Precision (mAP) as evaluation metrics. AP is calculated from Precision (P) and Recall (R). Precision, also known as positive predictive value, is the proportion of samples correctly predicted as positive among all samples predicted as positive. Recall, also known as sensitivity, is the proportion of samples correctly predicted as positive among all actually positive samples. mAP is one of the most commonly used evaluation metrics in object detection tasks. We adopt the statistical results when the IoU (Intersection over Union) threshold is set to 0.5, i.e., a predicted bounding box and a ground-truth box are considered matched when their IoU ≥ 0.5. The calculation formulas for each metric can be found in [39].
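For reference, the IoU matching criterion and the precision/recall values from which AP is built can be computed as in this minimal sketch (helper names are ours; boxes are in `[x1, y1, x2, y2]` format):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# A prediction is a match when IoU >= 0.5
print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / 175, about 0.143
```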

5. Results and Discussion

5.1. Comparison with SOTA Methods

We compare the proposed model with six SOTA object detection algorithms on the SMDD, including the one-stage models YOLOX [40], YOLOv8m [41], YOLOv10m [42], and Gold-YOLO [43], the two-stage model Faster RCNN [8], and the transformer-based model RT-DETR [44]. To adapt to the substation defect detection task, two improvements are made to the above algorithms: (1) Single-input mode: the bi-temporal input images are concatenated along the channel dimension as the model input, denoted by “-c”; (2) Dual-input Mode: the Siamese backbone network is used to calculate absolute difference features defined in Formula (1), which serve as the input for the subsequent defect detection network, denoted by “-d”. The experimental environment and hyperparameter settings for model training are consistent with those of the proposed method.
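The two adapted input modes can be illustrated with array shapes. This is a minimal NumPy sketch with function names of our choosing; for simplicity the "-d" difference is shown on raw image tensors, whereas in the actual models it is computed on Siamese backbone features as in Formula (1).

```python
import numpy as np

def single_input(ref, insp):
    """'-c' mode: concatenate the bi-temporal images along channels."""
    return np.concatenate([ref, insp], axis=0)   # (6, H, W)

def dual_input(feat_ref, feat_insp):
    """'-d' mode: absolute difference of Siamese branch outputs."""
    return np.abs(feat_ref - feat_insp)          # same shape as input

ref = np.zeros((3, 640, 640), dtype=np.float32)   # reference (normal) image
insp = np.ones((3, 640, 640), dtype=np.float32)   # inspection image
print(single_input(ref, insp).shape)  # (6, 640, 640)
```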
Table 2 presents the experimental results of the proposed CAG-Net in this study and the comparative methods. It is evident that the Siamese network-based methods are generally superior to the channel concatenation-based methods, indicating that Siamese networks extract more expressive defect features and verifying the effectiveness of leveraging bi-temporal information to suppress background interference and enhance detection performance. Among the six comparative algorithms, Faster RCNN-d achieves the highest mAP, leading RT-DETR-d by up to 5.33%. In addition, the Faster RCNN algorithm obtains the largest performance improvement (up to 1.63%) when switching from the single-input mode to the dual-input mode, demonstrating that two-stage models have greater application potential in bi-temporal defect detection tasks: they can more effectively cope with substation inspection scenarios characterized by complex backgrounds and diverse features and scales, which is also an important reason why we adopt a two-stage framework to construct the network model.
The proposed method achieves the highest detection accuracy across all categories, with an overall mAP of 81.76%, an improvement of 3.79% over Faster RCNN-d, indicating that simply stacking bi-temporal inputs cannot fully exploit the underlying change information. By explicitly modeling feature differences and guiding feature enhancement, the proposed change attention and detail-enhanced feature fusion mechanisms allow our method to significantly outperform the other dual-input algorithms. The effectiveness of the proposed method is also reflected in specific defect categories: (1) For categories with high detection difficulty and relatively weak change features, such as insulator breakage and silica gel cartridge broken, the proposed method achieves significant improvements. Such defects are prone to interference from complex backgrounds or exhibit inconspicuous feature changes, making them challenging for single-temporal methods or simple bi-temporal inputs to identify stably. The proposed change attention mechanism effectively focuses on the difference features of defect regions, overcoming background interference and enhancing defect features. (2) For device position change and switch status change, the proposed method achieves the highest accuracy. These two defect types are essentially changes in structural position and state, for which the change attention mechanism is inherently well suited. (3) For categories that are relatively easy to identify with traditional methods, such as abnormal box door closure and foreign object intrusion, the proposed method also reaches the highest level, verifying its generalization ability.

5.2. Ablation Studies

To verify the effect of different improvements in the proposed method, ablation experiments are designed. Specific improvements are gradually introduced to the baseline model of Faster RCNN-d (i.e., the decoder adopts a standard FPN design, with the feature fusion module replaced by a 1 × 1 convolution) to analyze the independent contribution of each improvement and the interactions between different improvements. The experimental results are shown in Table 3.
Improvement1 replaces the ResNet50 backbone of the Baseline with ELGCNet, resulting in a 0.57 % increase in mAP. This indicates that compared with the traditional ResNet50, ELGCNet can more effectively extract discriminative global contextual features, laying a solid foundation for subsequent change detection.
Improvement2 replaces the simple absolute difference features with the change attention mechanism (CAGM) on the basis of Improvement1, achieving a significant mAP improvement ( + 1.92 % ). Its gain significantly exceeds that of the backbone network improvement, demonstrating that the change attention mechanism is the key for performance improvement. It can effectively amplify the feature differences in defect regions (such as texture changes caused by insulator cracks) while suppressing background interference (such as lighting changes and vegetation growth).
Improvement3 introduces the detail-enhanced feature fusion module (DFFM) on the basis of Improvement1, realizing a 0.83 % mAP increase. This shows that compared with the standard FPN, the feature fusion module designed in this paper can extract richer defect detail features, thereby enhancing the model’s defect localization accuracy and robustness in complex scenarios.
Improvement4 restores the backbone network to ResNet50 on the basis of Improvement3 and adopts the change attention mechanism for difference features, achieving a substantial mAP improvement ( + 1.44 % ). This further confirms the irreplaceability of the attention mechanism in the framework.
The complete method based on ELGCNet + CAGM + DFFM achieves the highest mAP of 81.76%. Interestingly, replacing the ResNet50 backbone of Improvement4 with ELGCNet (i.e., the complete method) yields a 0.95% mAP improvement, higher than the gain of Improvement1 over the Baseline (0.57%). This indicates that CAGM and DFFM can more fully exploit the features extracted by ELGCNet. The contributions of each module are shown in Figure 8. The three modules act synergistically, and the final result is superior to any single-module improvement, verifying the overall rationality of the architecture design.

5.3. Visualization of Detection Results

To further verify the superiority of the proposed method in substation defect detection tasks, representative samples are selected from the test set for a visual comparison of the defect detection results and activation heatmaps [45] between the proposed method and the comparative models. By intuitively presenting the response characteristics and localization accuracy of different models for defect targets, the performance differences of the various methods in complex scenarios are further analyzed. The results are shown in Figure 9.
In large-scale defect target detection scenarios (as shown in the first group of samples), both the proposed method and the comparative models can effectively identify defect targets. However, comparative models such as RT-DETR and YOLOX also exhibit high response values in background regions, which indicates that these models have a weaker ability to suppress background interference. In contrast, the heatmap of the proposed method is concentrated in defect target regions, with significantly lower responses in background regions than the comparative models, and its bounding boxes are closer to the ground truth, reflecting superior target localization accuracy.
In small-scale defect target detection scenarios (the second group of samples), affected by factors such as lighting changes, models including RT-DETR, YOLOX, and YOLOv10m all suffer from missed detections. Analysis of the activation heatmaps shows that the response regions of these comparative models concentrate on areas with severe brightness changes, failing to focus accurately on tiny defects. In contrast, via the change attention-guided strategy and the multi-scale feature fusion mechanism, the heatmap of the proposed method exhibits clear peak responses in small defect regions, demonstrating its robustness in small-target defect detection tasks.
Based on the above visualization results, it can be seen that the method proposed in this article can not only effectively suppress background interference and improve the localization accuracy of large-scale defect targets in complex substation scenarios, but also significantly enhance the detection performance of small targets by improving the feature perception of minor defects, demonstrating a more comprehensive defect detection capability and stronger environmental adaptability.

5.4. Computational Complexity Analysis

To comprehensively evaluate the proposed method, a computational complexity analysis is conducted on the substation multi-phase defect dataset, including floating-point operations (FLOPs), parameter size (Params), and frame rate (FPS). The results are shown in Table 4.
The YOLO-based models exhibit remarkable advantages in metrics such as FLOPs and FPS owing to their end-to-end design, which endows them with stronger applicability for deployment on edge devices with limited computing resources and strict real-time requirements.
By introducing the ELGCA backbone network and the change attention mechanism into a two-stage detection framework, our method has higher computational complexity than the other models, with a frame rate of only 20.1 FPS. It should be noted that although the one-stage YOLO models excel in computational efficiency, a considerable gap remains between their detection performance and that of our method. In practical substation remote intelligent inspection systems, inspection devices such as intelligent robots and surveillance cameras are mainly responsible for collecting inspection images, which are then transmitted to an intelligent analysis server with adequate computing resources for defect detection. In this setting, detection accuracy matters more than real-time performance for ensuring that substation defects are reliably identified. Therefore, despite its lower efficiency, our method is well suited for deployment on intelligent analysis servers from the perspective of practical application demands.

6. Conclusions

This study addresses key challenges in substation defect detection, including strong background interference, high missing detection rates of small targets, and insufficient utilization of multi-temporal information. It proposes a change attention-guided defect detection algorithm, which effectively improves the accuracy and robustness of substation defect detection. The algorithm can fully leverage the multi-temporal information contained in substation inspection images to accurately identify defects, meeting the requirements of substation defect detection tasks. The algorithm has the following advantages:
  • It employs a change attention mechanism to explicitly model bi-temporal feature differences, effectively amplifying the salient responses of defect regions and suppressing complex background interference. It performs exceptionally well in detecting weakly changing defects such as insulator damage;
  • Through improvements to the feature extraction and fusion network, it achieves long-range dependency modeling while retaining detailed defect features, enhances the complementary fusion of high-level and low-level defect features, and achieves higher detection accuracy for small-scale defects.
This study provides a specialized technical pathway for substation defect detection, facilitates the paradigm shift from single-image perception to multi-temporal change reasoning, and offers crucial technical support for the implementation of the unmanned intelligent operation and maintenance mode in substations. In the future, research will focus on lightweight model compression based on knowledge distillation and defect feature enhancement via multi-modal data fusion, to drive the technological iteration of intelligent operation and maintenance in power systems. For lightweight model compression, the key challenge lies in balancing compression ratio, detection accuracy, and real-time performance, avoiding the loss of fine-grained defect feature extraction capability for small-scale and weakly changing defects. For multi-modal fusion, addressing the heterogeneity of cross-domain data (e.g., visual, thermal, and partial discharge signals) and achieving effective alignment while suppressing redundant interference will be the core focus. Corresponding validation strategies should be designed around quantitative metrics and scenario-specific robustness to ensure practical applicability.

Author Contributions

D.X. conducted the research and wrote the paper, and X.D. and Z.L. contributed to the writing and editing of the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Some of the code and data generated during the study can be found at: https://github.com/pb07210028/CAG-Net/tree/main (accessed on 23 November 2025). The complete code and data are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, N.; Yang, G.; Wang, D.; Hu, F.; Yu, H.; Fan, J. A defect detection method for substation equipment based on image data generation and deep learning. IEEE Access 2024, 12, 105042–105054.
  2. Bai, Y.; Wang, L.; Gao, W.; Ma, Y. Multi-modal hierarchical classification for power equipment defect detection. J. Image Graph. 2024, 29, 2011–2023.
  3. Li, J.; Su, H.; Zhang, Y.; Yan, Q.; Liu, L.; Wang, T. Smart Digital Inspection and Maintenance of Substations. In Proceedings of the 2024 5th International Conference on Power Engineering (ICPE), Shanghai, China, 13–15 December 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 388–393.
  4. Qi, D.; Han, Y.; Zhou, Z.; Yan, Y. Review of Defect Detection Technology of Power Equipment Based on Video Images. J. Electron. Inf. Technol. 2022, 44, 3709–3720.
  5. Zhao, Z.; Feng, S.; Xi, Y.; Zhang, J.; Zhai, Y.; Zhao, W. The Era of Large Models: A New Starting Point for Electric Power Vision Technology. High Volt. Eng. 2024, 50, 1813–1825.
  6. Chen, P. Research on Image Change Detection Technology in Substation Scene. Master's Thesis, Zhejiang University, Hangzhou, China, 2021.
  7. Jha, S.B.; Babiceanu, R.F. Deep CNN-based visual defect detection: Survey of current literature. Comput. Ind. 2023, 148, 103911.
  8. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
  9. Yu, H.; Gong, Z.; Zhang, H.; Zhou, S.; Yu, Z. Research on substation equipment identification and defect detection technology based on Faster R-CNN algorithm. Electr. Meas. Instrum. 2024, 61, 153–159.
  10. Ying, Y.; Wang, Y.; Yan, Y.; Dong, Z.; Qi, D.; Li, C. An improved defect detection method for substation equipment. In Proceedings of the 2020 39th Chinese Control Conference (CCC), Shenyang, China, 27–29 July 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 6318–6323.
  11. Zhang, M.; Xing, F.; Liu, D. External defect detection of transformer substation equipment based on improved Faster R-CNN. CAAI Trans. Intell. Syst. 2024, 19, 290–298.
  12. Flores-Calero, M.; Astudillo, C.A.; Guevara, D.; Maza, J.; Lita, B.S.; Defaz, B.; Ante, J.S.; Zabala-Blanco, D.; Armingol Moreno, J.M. Traffic sign detection and recognition using YOLO object detection algorithm: A systematic review. Mathematics 2024, 12, 297.
  13. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Computer Vision—ECCV 2016; Springer: Cham, Switzerland, 2016; pp. 21–37.
  14. Dong, H.; Yuan, M.; Wang, S.; Zhang, L.; Bao, W.; Liu, Y.; Hu, Q. PHAM-YOLO: A parallel hybrid attention mechanism network for defect detection of meter in substation. Sensors 2023, 23, 6052.
  15. Yuan, J.; Wang, F.; Hu, Q.; Wang, X.; Wang, L.; Zhang, Y.; Wang, L.; Xi, S.; Lin, C.; Sun, P. The Substation Defect Detection Method Based on FEFPN-YOLOv7. In Proceedings of the 2024 International Conference on Image Processing, Computer Vision and Machine Learning (ICICML), Shenzhen, China, 22–24 November 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 419–423.
  16. Meng, Q.; Fu, T.; Li, K.; Huang, L.; Chen, S. Defect Detection Algorithm for Electrical Substation Equipment Based on Improved YOLOv10n. IEEE Access 2025, 13, 91409–91422.
  17. He, Z.; Yang, W.; Liu, Y.; Zheng, A.; Liu, J.; Lou, T.; Zhang, J. Insulator defect detection based on YOLOv8s-SwinT. Information 2024, 15, 206.
  18. Huang, T.; Li, Y.; Wang, L.; Liu, A. Defect Detection Method for Substations with Multi-Class Classification based on Improved YOLOv7-tiny. Control Eng. China 2024, 1–9.
  19. Hui, M.; Yao, J.; Fu, Z.; Hai, T.; Zhang, M.; Pan, T. YOLO-CSS: A lightweight defect detection model for complex substation scenarios. Meas. Sci. Technol. 2025, 36, 086003.
  20. Zhang, H.; Chen, K.; Liu, C.; Chen, H.; Zou, Z.; Shi, Z. CDMamba: Incorporating local clues into Mamba for remote sensing image binary change detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4405016.
  21. Fang, S.; Li, K.; Li, Z. Changer: Feature interaction is what you need for change detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5610111.
  22. Chen, H.; Song, J.; Han, C.; Xia, J.; Yokoya, N. ChangeMamba: Remote sensing change detection with spatiotemporal state space model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4409720.
  23. Feng, Y.; Jiang, J.; Xu, H.; Zheng, J. Change detection on remote sensing images using dual-branch multilevel intertemporal network. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4401015.
  24. Wang, P.; Zhao, C.; Zhou, H.; Su, Z.; Wang, J.; Wang, W. Anti-interference defect detection of substation equipment based on multitemporal inspection images. Control Decis. 2024, 39, 885–892.
  25. Zhao, Z.; Ma, D.; Shi, Y.; Li, G. Appearance defect detection algorithm of substation instrument based on improved YOLOX. J. Graph. 2023, 44, 937–946.
  26. Lindenberger, P.; Sarlin, P.E.; Pollefeys, M. LightGlue: Local feature matching at light speed. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 17581–17592.
  27. Han, L.; Li, N.; Li, J.; Gao, B.; Niu, D. SA-FPN: Scale-aware attention-guided feature pyramid network for small object detection on surface defect detection of steel strips. Measurement 2025, 249, 117019.
  28. Noman, M.; Fiaz, M.; Cholakkal, H.; Khan, S.; Khan, F.S. ELGC-Net: Efficient local–global context aggregation for remote sensing change detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4701611.
  29. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In NIPS'17: Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; Volume 30.
  30. Wang, P. Research on Anomaly Detection Method of Substation Equipment Multi-Temporal Inspection Images Based on Deep Learning. Master's Thesis, Zhejiang University, Hangzhou, China, 2023.
  31. Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. SimAM: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the 38th International Conference on Machine Learning, PMLR, 18–24 July 2021; pp. 11863–11874.
  32. Li, Z.; Long, J.; Qian, Y.; Wang, Y.; Fan, Y.; Bai, L. A Difference-Aware Enhanced Network for Cropland Change Detection in Remote Sensing Images. Comput. Eng. Appl. 2025, 1–18. Available online: https://link.cnki.net/urlid/11.2127.tp.20250320.1332.012 (accessed on 23 November 2025).
  33. Li, J.; Zhang, G.; Zhu, S.; Xu, Y.; Li, X. Change detection for high-resolution remote sensing images with multi-scale feature transformer. Natl. Remote Sens. Bull. 2025, 29, 266–278.
  34. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE Trans. Med. Imaging 2019, 39, 1856–1867.
  35. Ren, H.; Xia, M.; Weng, L.; Hu, K.; Lin, H. Dual-attention-guided multiscale feature aggregation network for remote sensing image change detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 4899–4916.
  36. Durusoy, O. Open-Source Datasets for Image Processing and Artificial Intelligence Research: A Comparison of ImageNet and MS COCO Datasets. Int. J. Sci. Innov. Eng. 2025, 2, 639–653.
  37. Yin, Z.; Meng, R.; Fan, X.; Li, B.; Zhao, Z. Typical visual defect detection system of substation equipment based on edge computing and improved Faster R-CNN. China Sci. 2021, 16, 343–348.
  38. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 1026–1034.
  39. Chen, W.; Meng, S.; Wang, X. Local and global context-enhanced lightweight CenterNet for PCB surface defect detection. Sensors 2024, 24, 4729.
  40. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430.
  41. Sohan, M.; Sai Ram, T.; Rami Reddy, C.V. A review on YOLOv8 and its advancements. In Data Intelligence and Cognitive Informatics; Springer: Singapore, 2024; pp. 529–545.
  42. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. In NIPS'24: Proceedings of the 38th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024; Curran Associates Inc.: Red Hook, NY, USA, 2024; Volume 37, pp. 107984–108011.
  43. Wang, C.; He, W.; Nie, Y.; Guo, J.; Liu, C.; Wang, Y.; Han, K. Gold-YOLO: Efficient object detector via gather-and-distribute mechanism. In NIPS'23: Proceedings of the 37th International Conference on Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 51094–51112.
  44. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 16965–16974.
  45. Leem, S.; Seo, H. Attention guided CAM: Visual explanations of vision transformer guided by self-attention. Proc. AAAI Conf. Artif. Intell. 2024, 38, 2956–2964.
Figure 1. Dual-temporal defect samples: (a) distribution box door open; (b) insulator breakage; (c) structure ladder unlocked. The red boxes in the second row represent bounding boxes of substation defects.
Figure 2. Flowchart of the proposed CAG-Net.
Figure 3. Illustration of the Encoder Block.
Figure 4. Illustration of the CAGM module.
Figure 5. Illustration of the MAFE module.
Figure 6. Illustration of the DFFM.
Figure 7. Examples of substation multi-phase defect dataset. The first column is normal images, and the remaining three columns are defect images.
Figure 8. The contribution of each module to the improvement of mAP.
Figure 9. Test results of different models. Subfigures (1) and (2) depict the detection result visualizations of abnormal box door closure defects with large-scale and small-scale, respectively. (a) reference image; (b) inspection image; (c) RT-DETR-d; (d) YOLOX-d; (e) YOLOv8m-d; (f) YOLOv10m-d; (g) Gold-YOLOm-d; (h) Faster RCNN-d; (i) ours; (j) GroundTruth.
Table 1. Dataset details.

| Defect Category | Number of Images | Proportion (%) | Label |
|---|---|---|---|
| Abnormal box door closure | 1284 | 22.35 | xmbhyc |
| Foreign object intrusion | 912 | 15.87 | ywrq |
| Insulator breakage | 480 | 8.36 | jyzps |
| Meter damaged | 639 | 11.12 | bjps |
| Silica gel cartridge broken | 546 | 9.50 | gjtps |
| Switch status change | 1056 | 18.38 | kgfhbh |
| Device position change | 828 | 14.41 | sbwzbh |
Table 2. Performance comparison with advanced detection methods. AP (%) is reported per defect category (labels as in Table 1); ΔmAP (%) is the gain of the dual-input ("-d") variant over the single-input ("-c") variant.

| Methods | Backbone | xmbhyc | ywrq | jyzps | bjps | gjtps | kgfhbh | sbwzbh | mAP (%) | ΔmAP (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| YOLOX-c | DarkNet53 | 76.57 | 73.81 | 74.04 | 64.82 | 66.39 | 76.15 | 80.72 | 73.21 | |
| YOLOX-d | DarkNet53 | 78.12 | 75.20 | 74.83 | 65.75 | 67.81 | 76.87 | 81.14 | 74.25 | 1.04 |
| RT-DETR-c | HGNetv2 | 75.98 | 71.52 | 72.66 | 62.13 | 64.12 | 75.64 | 80.37 | 71.77 | |
| RT-DETR-d | HGNetv2 | 76.56 | 72.32 | 73.15 | 63.53 | 64.55 | 77.34 | 81.03 | 72.64 | 0.87 |
| YOLOv8m-c | CSPDarkNet | 77.65 | 75.24 | 70.51 | 68.98 | 66.32 | 81.24 | 83.05 | 74.86 | |
| YOLOv8m-d | CSPDarkNet | 79.80 | 76.12 | 71.27 | 71.63 | 69.25 | 81.61 | 83.48 | 76.17 | 1.31 |
| YOLOv10m-c | CSPDarkNet | 80.32 | 74.61 | 77.75 | 71.45 | 72.13 | 78.62 | 82.37 | 76.75 | |
| YOLOv10m-d | CSPDarkNet | 81.76 | 74.83 | 78.05 | 72.19 | 73.04 | 79.22 | 83.16 | 77.46 | 0.71 |
| Gold-YOLOm-c | EfficientRep | 81.14 | 78.92 | 76.31 | 67.25 | 72.24 | 79.83 | 81.49 | 76.74 | |
| Gold-YOLOm-d | EfficientRep | 81.51 | 79.54 | 77.12 | 70.30 | 72.84 | 80.72 | 82.58 | 77.80 | 1.06 |
| Faster RCNN-c | ResNet50 | 80.06 | 74.21 | 76.14 | 72.24 | 70.35 | 80.12 | 81.27 | 76.34 | |
| Faster RCNN-d | ResNet50 | 81.32 | 78.64 | 76.56 | 73.17 | 71.48 | 81.93 | 82.71 | 77.97 | 1.63 |
| Ours | ELGCNet | 85.93 | 83.16 | 79.71 | 74.12 | 77.37 | 85.19 | 86.84 | 81.76 | / |
Table 3. Results of the ablation experiments. Each row lists the backbone, change encoder, and change decoder variant used (component assignments follow the descriptions in the text).

| Methods | Backbone | Change Encoder | Change Decoder | mAP (%) |
|---|---|---|---|---|
| Baseline | ResNet50 | Absolute Feature Difference | 1 × 1 Conv | 77.97 |
| Improvement1 | ELGCNet | Absolute Feature Difference | 1 × 1 Conv | 78.54 |
| Improvement2 | ELGCNet | CAGM | 1 × 1 Conv | 80.46 |
| Improvement3 | ELGCNet | Absolute Feature Difference | DFFM | 79.37 |
| Improvement4 | ResNet50 | CAGM | DFFM | 80.81 |
| Ours | ELGCNet | CAGM | DFFM | 81.76 |
Table 4. Comparison results of computational complexity.

| Methods | FLOPs (G) | Params (M) | FPS |
|---|---|---|---|
| RT-DETR-d | 186.7 | 65.4 | 56.2 |
| YOLOX-d | 114.6 | 25.3 | 96.8 |
| YOLOv8m-d | 104.2 | 25.8 | 89.3 |
| YOLOv10m-d | 84.5 | 15.8 | 95.9 |
| Gold-YOLOm-d | 115.8 | 41.2 | 94.9 |
| Faster RCNN-d | 218.8 | 41.3 | 23.8 |
| Ours | 241.3 | 40.5 | 20.1 |

