Substation Instrument Defect Detection Based on Multi-Domain Collaborative Attention Fusion

Liu, Kequan; Li, Yandong; Wang, Shiwei; Yang, Zhaoguang; Li, Zhixin; Zhao, Zhenbing

doi:10.3390/electronics14234690

by Kequan Liu ¹, Yandong Li ¹, Shiwei Wang ¹, Zhaoguang Yang ²,*, Zhixin Li ³ and Zhenbing Zhao ⁴

¹ State Grid Gansu Electric Power Company, Lanzhou 730070, China
² State Grid Gansu Comprehensive Energy Service Co., Ltd., Lanzhou 730070, China
³ Electric Power Research Institute of State Grid Gansu Electric Power Company, Lanzhou 730070, China
⁴ School of Electrical and Electronic Engineering, North China Electric Power University, Baoding 071003, China
* Author to whom correspondence should be addressed.

Electronics 2025, 14(23), 4690; https://doi.org/10.3390/electronics14234690

Submission received: 26 October 2025 / Revised: 20 November 2025 / Accepted: 25 November 2025 / Published: 28 November 2025

(This article belongs to the Special Issue Applications of Artificial Intelligence in Electric Power Systems)

Abstract

The detection of defects in substation instruments, such as surge arrester counters, is hindered by subtle characteristic changes and severe class imbalance. To address these challenges, this study proposes an enhanced detection algorithm based on multi-domain collaborative attention fusion (MDCAF). The method integrates three key contributions: Mixup-based hybrid augmentation to alleviate boundary blurring in transition samples; the MDCAF module, which collaboratively captures features across the channel, spatial, and axial domains; and a class-weight balancing strategy to optimize learning for rare defects. Experimental results show that the mean average precision ($mAP$) reaches 90.1%, which is 2.8 percentage points higher than the baseline, with fewer missed and false detections.

Keywords:

substation meter defect detection; multi-domain attention; data augmentation; class weight balancing; YOLOv8

1. Introduction

Power grid infrastructure is fundamental to socioeconomic development [1]. With the rapid evolution of smart grids, automated monitoring of substations—the network’s core nodes—has become essential. Substation metering instruments, such as surge arrester counters, act as critical sensors within the grid’s perception layer. Since their operational status directly reflects the health of high-voltage equipment, ensuring their reliability via automated inspection is paramount [2,3]. In recent years, researchers have proposed various deep learning-based approaches for substation meter and defect detection. Compared with traditional methods, deep learning enables automatic feature extraction and offers stronger generalization capabilities [4,5]. However, mainstream generic object detection frameworks, such as the YOLO series, while achieving remarkable success on public datasets, often exhibit limited performance when directly applied to meter defect detection. Their backbone networks, designed primarily to enhance receptive fields and computational efficiency, are less effective for capturing subtle defect features [6,7]. Meanwhile, Transformer-based models capture the global context but still struggle to localize low-texture anomalies on small surfaces [8,9]. To overcome these limitations, several targeted enhancements have been implemented. For example, Zhao et al. utilized SIoU loss in YOLOX to focus attention on defect regions [10,11]. In scenarios where defect samples are minimal, unsupervised anomaly detection methods have also gained attention [12,13]. Despite their independence from labeled data, they generally output only coarse heatmaps, which fall short of the requirement for explicit classification and precise localization in substation instrument defect detection.

While attention mechanisms such as SE-Net and CBAM have advanced meter localization, they often lack the discriminative power needed to identify subtle defects in real-world scenarios [14,15,16,17,18]. Notably, Li et al. and Xu et al. demonstrated that mitigating feature distribution shifts via progressive alignment and explicit spatial guidance is critical for maintaining robustness [19,20].

Motivated by these insights, this paper introduces a defect-detection method tailored for power metering instruments in substations. Our contributions are threefold:

Data-level Enhancement: We introduce a Mixup augmentation strategy to generate transitional defective samples, reducing data scarcity and enhancing the model’s robustness to defects with ambiguous boundaries [21].
Architectural Innovation: We develop a Multi-Domain Collaborative Attention Fusion (MDCAF) module to improve the representation of fine-grained defects by synergistically capturing features across channel, spatial, and axial-relational domains.
Training Optimization: We use a class-weight balancing strategy to help the model focus on low-frequency defect classes, reducing the impact of data imbalance.

Experimental results show that the proposed method significantly enhances detection accuracy and robustness against common defects, such as dirt, dial breakage, and condensation, on meter surfaces in substation environments, thereby confirming its effectiveness for visual defect detection in substation metering devices.

2. An Improved Detection Algorithm Based on Multi-Domain Collaborative Attention Fusion

Recently, deep learning-based object detection has been increasingly deployed to identify defects in substation meters. Notably, YOLOv8 is widely favored in power equipment inspection due to its superior efficiency and accuracy. However, simply applying YOLOv8 to substation meter defect detection still faces several adaptability challenges.

First, complex substation backgrounds and the minute, indistinct nature of defect features often lead to missed detections and reduced accuracy. Second, the scarcity of samples for rare defect types limits the model’s ability to generalize, as practical training requires diverse data. Third, despite its high inference speed, the baseline architecture struggles to discriminate between visually similar defect categories, resulting in misclassifications.

To address these issues, this paper presents targeted improvements from three complementary perspectives: data augmentation, model optimization, and training mechanism design. Specifically, we incorporate the Mixup data augmentation strategy, the MDCAF module, and a class weight balancing strategy to enhance robustness, improve feature discrimination, and reduce class imbalance, respectively.

2.1. Enhancing Data Sample Diversity

To enhance sample diversity and mitigate ambiguity in fuzzy defect regions, we apply Mixup to generate transitional samples. The mixing ratio $r \in [0, 1]$ follows a Beta distribution with the density

$f(x; α, β) = \frac{x^{α - 1} (1 - x)^{β - 1}}{B(α, β)}$

(1)

where $B(α, β)$ is the Beta function, which acts as the normalization constant. Setting $α = β = 32.0$ centers the distribution peak at 0.5, so that $r$ typically falls between 0.4 and 0.6. This avoids extreme values near 0 or 1, which would produce images nearly identical to one of the originals and defeat the purpose of mixing.

Given two original defect images $img_{1}$ and $img_{2}$, the composite image $img$ is generated via a weighted sum:

$img = r \cdot img_{1} + (1 - r) \cdot img_{2}$

(2)

This combination blends features such as defect edges and shapes. For instance, when $r = 0.6$, $img_{1}$ contributes 60% and $img_{2}$ contributes 40%, effectively simulating real-world defects like indistinct cracks or low-contrast stains. This process improves the dataset’s representation of complex defect morphologies.

Similarly, the class labels $cls_{1}$ and $cls_{2}$ are fused proportionally to create a new label $cls$:

$cls = r \cdot cls_{1} + (1 - r) \cdot cls_{2}$

(3)

This label fusion encourages the model to learn smoother decision boundaries for transitional defects.

Overall, Mixup enhances both dataset diversity and model robustness. By blending samples across dimensions such as defect size and ambiguity, the model encounters a broader range of patterns. Furthermore, mixing underrepresented classes with other samples generates diverse variants, mitigating class imbalance and promoting balanced feature learning. Ultimately, the semi-blurred characteristics of these synthetic samples improve the model’s generalization to real-world scenarios, reducing missed detections and enhancing stability.
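The blending in Equations (1)–(3) can be illustrated with a minimal sketch. The function names `sample_mix_ratio` and `mixup`, the toy 2 × 2 images, and the one-hot label layout are assumptions made for illustration only; the handling of bounding boxes in a detection pipeline is not covered here.

```python
import random

def sample_mix_ratio(alpha=32.0, beta=32.0, rng=random):
    """Draw the mixing ratio r from Beta(alpha, beta), as in Equation (1)."""
    return rng.betavariate(alpha, beta)

def mixup(img1, img2, cls1, cls2, r):
    """Blend two equally sized images and their label vectors (Equations (2) and (3))."""
    img = [[r * a + (1.0 - r) * b for a, b in zip(row1, row2)]
           for row1, row2 in zip(img1, img2)]
    cls = [r * a + (1.0 - r) * b for a, b in zip(cls1, cls2)]
    return img, cls

# Toy 2x2 grayscale "images" and one-hot labels over four classes (hypothetical data).
img1 = [[1.0, 1.0], [0.0, 0.0]]
img2 = [[0.0, 1.0], [0.0, 1.0]]
cls1 = [1.0, 0.0, 0.0, 0.0]
cls2 = [0.0, 0.0, 1.0, 0.0]

# Fixed ratio from the worked example in the text: img1 contributes 60%, img2 40%.
mixed_img, mixed_cls = mixup(img1, img2, cls1, cls2, r=0.6)

# In training, r would instead be drawn at random for each pair.
random.seed(0)
r_random = sample_mix_ratio()
```

With $r = 0.6$ the fused label carries 0.6 of the first class and 0.4 of the third, which is the soft target the text describes for transitional defects.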

2.2. Multi-Domain Collaborative Attention Fusion Module

In substation meter defect detection, defect features exhibit complex distributions across multiple dimensions and scales. Since single-dimensional attention mechanisms struggle to capture these feature correlations fully, we introduce the MDCAF module. This module captures multi-dimensional defect features through synergistic correlations across the channel, spatial, and axial-relation domains. It also integrates a Residual Feature Enhancer (RFE) and efficiency optimization strategies, as illustrated in Figure 1.

The MDCAF module extracts defect features from channel-, spatial-, and axial-perspective views. By combining residual enhancement with output alignment, it effectively improves the detection of small targets, multiple simultaneous defects, and defects in complex backgrounds, providing a robust feature foundation for subsequent detection tasks.

2.2.1. Channel-Domain Computation

Different defect types exhibit distinct activation patterns across feature map channels. Using all channel features directly adds unnecessary background noise and weakens the focus on essential defects. Therefore, channel-domain attention computes channel dependencies to emphasize the most relevant channels for defect detection selectively.

The process, illustrated in Figure 2, is as follows:

Spatial dimensions are compressed via global average pooling. This condenses spatial information into a channel-level global statistic, as defined in Equation (4):

$AvgPool(X) = z_{c} = \frac{1}{H \times W} \sum_{i = 1}^{H} \sum_{j = 1}^{W} x_{c}(i, j)$

(4)

where $x_{c}(i, j)$ is the value of the input feature map $X$ at the $c$-th channel and spatial position $(i, j)$, and $z_{c}$ is the resulting global statistic of channel $c$.
A 1 × 1 convolution with $δ / 2$ ($δ = C \times 0.25$) channels reduces dimensionality to capture key inter-channel correlations efficiently. A GELU activation function follows to model complex channel dependencies. Subsequently, a 1 × 1 convolution with $δ$ channels expands the channel dimension again, further strengthening the feature representation.
A Sigmoid function maps the output to the [0, 1] interval to generate the channel weights $M_{c}(X)$.

$M_{c}(X) = σ(Conv_{1 \times 1}^{δ}(GELU(Conv_{1 \times 1}^{δ / 2}(AvgPool(X)))))$

(5)

where $X \in ℝ^{B \times C \times H \times W}$ and $M_{c}(X) \in ℝ^{B \times C \times 1 \times 1}$, with $B$ denoting the batch size.
Finally, the input features are multiplied by the generated weights $M_{c} (X)$ . This enhances channels containing defect signals while suppressing irrelevant background noise, resulting in the weighted features $X_{c}$ .

$X_{c} = X \otimes M_{c} (X) + X$

(6)

where $\otimes$ denotes element-wise multiplication, with the channel weights broadcast over the spatial dimensions.

This process enables the model to highlight channels with necessary defect signals and reduce unnecessary background noise.
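The channel-domain steps in Equations (4)–(6) can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' implementation: the weights are random, the function name `channel_attention` is hypothetical, and the second projection is assumed to map back to $C$ channels so that the weights match the stated shape of $M_{c}(X)$, whereas Equation (5) labels that projection with $δ$ channels. On a 1 × 1 feature map, a 1 × 1 convolution reduces to a matrix product, which is what the code uses.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w_reduce, w_expand):
    """x: (B, C, H, W); w_reduce: (C_mid, C); w_expand: (C, C_mid)."""
    z = x.mean(axis=(2, 3))              # Eq. (4): global average pooling -> (B, C)
    hidden = gelu(z @ w_reduce.T)        # reduction projection followed by GELU
    m_c = sigmoid(hidden @ w_expand.T)   # Eq. (5): channel weights in (0, 1)
    m_c = m_c[:, :, None, None]          # reshape to (B, C, 1, 1) for broadcasting
    return x * m_c + x, m_c              # Eq. (6): weighted features plus residual

rng = np.random.default_rng(0)
B, C, H, W = 2, 16, 8, 8
c_mid = max(1, int(C * 0.25) // 2)       # delta / 2 with delta = C * 0.25
x = rng.standard_normal((B, C, H, W))
w_reduce = rng.standard_normal((c_mid, C)) * 0.1   # random stand-in weights
w_expand = rng.standard_normal((C, c_mid)) * 0.1   # assumed to return to C channels
x_c, m_c = channel_attention(x, w_reduce, w_expand)
```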

2.2.2. Spatial-Domain Computation

Following the channel-domain computation, the feature map focuses on key channels; however, precise localization of defect regions within the spatial domain remains necessary. Spatial-domain attention distinguishes defect areas from background interference by capturing salient responses at specific spatial locations.

As illustrated in Figure 3, the process involves:

The maximum $M_{\max}$ and average $M_{avg}$ of the channel-weighted features $X_{c}$ are computed along the channel dimension to capture both key regions and overall distribution traits, as shown in Equation (7):

$M_{\max} = \max(X_{c}), \quad M_{avg} = mean(X_{c})$

(7)

where $\max(X_{c})$ and $mean(X_{c})$ are the highest and average values across all channels at each spatial position.
Subsequently, $M_{\max}$ and $M_{avg}$ are concatenated and processed by a 7 × 7 convolutional layer. This operation leverages a large receptive field to integrate multi-scale spatial information and fuse features from different spatial statistics. A Sigmoid function then normalizes the output to [0, 1], generating spatial attention weights:

$M_{s}(X) = σ(Conv_{7 \times 7}(concat(M_{\max}, M_{avg})))$

(8)
Finally, the spatial weights $M_{s} (X)$ are applied to the input features $X_{c}$ via element-wise multiplication. This highlights defect-relevant regions while suppressing secondary background noise, yielding the spatially weighted features $X_{s}$ :

$X_{s} = X_{c} \otimes M_{s} (X)$

(9)

This mechanism enables the model to adaptively highlight defect regions, thereby enhancing the representation of subtle features and improving its ability to locate and identify meter defects.
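A minimal NumPy sketch of Equations (7)–(9) is given below, assuming zero padding, stride 1, no bias, and a random 7 × 7 kernel. The helper names `conv2d_same` and `spatial_attention` are hypothetical, and the convolution is written as the cross-correlation commonly used in deep learning frameworks.

```python
import numpy as np

def conv2d_same(x, kernel):
    """x: (B, C_in, H, W); kernel: (C_in, k, k) -> (B, H, W), zero padding, stride 1."""
    k = kernel.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)))
    b, _, h, w = x.shape
    out = np.zeros((b, h, w))
    for di in range(k):
        for dj in range(k):
            # Add the contribution of kernel offset (di, dj) summed over input channels.
            out += np.einsum("bchw,c->bhw", xp[:, :, di:di + h, dj:dj + w], kernel[:, di, dj])
    return out

def spatial_attention(x_c, kernel):
    """x_c: (B, C, H, W); kernel: (2, 7, 7)."""
    m_max = x_c.max(axis=1, keepdims=True)               # Eq. (7): max over channels
    m_avg = x_c.mean(axis=1, keepdims=True)              # Eq. (7): mean over channels
    stacked = np.concatenate([m_max, m_avg], axis=1)     # (B, 2, H, W)
    m_s = 1.0 / (1.0 + np.exp(-conv2d_same(stacked, kernel)))   # Eq. (8)
    m_s = m_s[:, None, :, :]                             # (B, 1, H, W)
    return x_c * m_s, m_s                                # Eq. (9)

rng = np.random.default_rng(0)
x_c = rng.standard_normal((2, 16, 8, 8))
kernel = rng.standard_normal((2, 7, 7)) * 0.05           # random stand-in weights
x_s, m_s = spatial_attention(x_c, kernel)
```

The single-channel weight map `m_s` is broadcast over all channels, so every channel at a given position is scaled by the same factor.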

2.2.3. Axial-Relation-Domain Computation

Substation instruments typically exhibit strong horizontal or vertical geometric structures, such as rectangular dial edges. Consequently, defects often manifest with axial continuity, such as horizontally extending cracks. Since traditional spatial attention struggles to fully capture these long-range structural relationships, we introduce Axial Attention. Building upon channel and spatial attention, this mechanism captures long-range dependencies along height and width by converting spatial features into sequences and employing multi-head attention [22]. The process proceeds as follows:

To adapt the spatial-weighted features $X_{s}$ for the attention mechanism, $X_{s}$ is flattened into a sequential representation $X_{flat}$. The spatial dimensions are rearranged into a sequence-length dimension, ensuring that each spatial position corresponds to a sequence element. This preserves positional information while enabling the capture of long-range dependencies:

$X_{flat} = reshape(X_{s}) \in ℝ^{(H \times W) \times B \times C}$

(10)
The sequence is processed using multi-head attention with $h$ heads, where each head has a dimensionality of $d_{k}$ . This parallel design allows the model to learn diverse feature correlations across different subspaces jointly. The dimensionality $d_{k}$ is calculated as follows:

$d_{k} = \frac{C}{h}$

(11)
For each head, three independent linear transformation matrices $W^{Q}$, $W^{K}$, and $W^{V}$ are used to generate the Query ($Q$), Key ($K$), and Value ($V$) matrices, respectively. These are further divided into submatrices $Q_{i}$, $K_{i}$, $V_{i}$ ($i = 1, 2, \dots, h$) for each attention head. Each attention head computes its output using Equation (12), where $Q_{i} K_{i}^{T}$ represents the similarity matrix among sequence positions. The Softmax function normalizes the similarity matrix, while the scaling by $\sqrt{d_{k}}$ keeps large inner products from saturating the Softmax and causing vanishing gradients:

$Attention(Q_{i}, K_{i}, V_{i}) = softmax\left(\frac{Q_{i} K_{i}^{T}}{\sqrt{d_{k}}}\right) V_{i}$

(12)
Outputs from all attention heads, $head_{i} = Attention(Q_{i}, K_{i}, V_{i})$, are concatenated and linearly projected through $W^{O}$ to obtain the final multi-head attention representation, denoted as $MultiHead$. The resulting output is reshaped back to its original spatial configuration, producing the axial feature representation $X_{axial}$, as expressed in Equations (13) and (14):

$MultiHead = concat(head_{1}, \dots, head_{h}) W^{O}$

(13)

$X_{axial} = reshape(MultiHead)$

(14)

This mechanism effectively improves the model’s capacity to represent globally continuous defect patterns.
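The flatten, attend, and reshape sequence of Equations (10)–(14) can be sketched as below. This follows the flattening described in the text, uses random projection matrices without biases, and assumes that $C$ is divisible by $h$; the function name `multi_head_attention` and all weight names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x_s, w_q, w_k, w_v, w_o, h):
    """x_s: (B, C, H, W); w_q, w_k, w_v, w_o: (C, C); h: number of heads."""
    b, c, height, width = x_s.shape
    seq_len = height * width
    d_k = c // h                                               # Eq. (11)
    x_flat = x_s.reshape(b, c, seq_len).transpose(2, 0, 1)     # Eq. (10): (H*W, B, C)

    def split_heads(t):
        # (H*W, B, C) -> (B, h, H*W, d_k)
        return t.reshape(seq_len, b, h, d_k).transpose(1, 2, 0, 3)

    q, k, v = (split_heads(x_flat @ w) for w in (w_q, w_k, w_v))
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_k)        # (B, h, H*W, H*W)
    heads = softmax(scores) @ v                                # Eq. (12)
    concat = heads.transpose(2, 0, 1, 3).reshape(seq_len, b, c)
    out = concat @ w_o                                         # Eq. (13)
    return out.transpose(1, 2, 0).reshape(b, c, height, width) # Eq. (14)

rng = np.random.default_rng(0)
B, C, H, W, h = 2, 8, 4, 4, 2
x_s = rng.standard_normal((B, C, H, W))
w_q, w_k, w_v, w_o = (rng.standard_normal((C, C)) * 0.1 for _ in range(4))
x_axial = multi_head_attention(x_s, w_q, w_k, w_v, w_o, h)
```

Because the attention matrix has one row and column per spatial position, its size grows with $(H \times W)^{2}$, which is worth keeping in mind for high-resolution feature maps.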

2.2.4. Residual Feature Enhancer

While the preceding attention operations enhance feature representation, they may induce redundancy or minor information loss. To address this, we incorporate a Residual Feature Enhancer (RFE) [23].

The RFE employs a dual-branch fusion structure that captures spatial details (e.g., defect edges) via a 3 × 3 convolution while managing computational costs through channel reduction. The implementation is as follows:

A 1 × 1 convolution performs channel reduction, lowering the channel count of the input axial features $X_{axial}$ to obtain $X_{down}$.
A 3 × 3 convolution is used for spatial feature extraction to derive $X_{spatial}$. This step captures local spatial correlations and enhances structural details such as edges.
A 1 × 1 convolution then restores the channel count, resulting in $X_{up}$ and ensuring that the dimensions align with the original input.
Finally, element-wise multiplication fuses the enhanced features with the original input $X_{r}$. This interaction maintains the integrity of the original information while reinforcing it with the learned spatial details, as shown in Equations (15)–(18).

$X_{down} = Conv_{1 \times 1}^{↓}(X_{axial})$

(15)

$X_{spatial} = Conv_{3 \times 3}(X_{down})$

(16)

$X_{up} = Conv_{1 \times 1}^{↑}(X_{spatial})$

(17)

$Y = X_{r} \otimes X_{up}$

(18)
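The RFE steps can be sketched in NumPy as follows. The sketch assumes that the three convolutions are chained in the order of the step list (reduce, 3 × 3, restore), that no bias or activation is used, and that $X_{r}$ is passed in explicitly by the caller; the function names and weight shapes are hypothetical.

```python
import numpy as np

def conv1x1(x, w):
    """x: (B, C_in, H, W); w: (C_out, C_in)."""
    return np.einsum("bchw,oc->bohw", x, w)

def conv3x3(x, w):
    """x: (B, C_in, H, W); w: (C_out, C_in, 3, 3); zero padding keeps H and W."""
    b, _, h, width = x.shape
    xp = np.pad(x, ((0, 0), (0, 0), (1, 1), (1, 1)))
    out = np.zeros((b, w.shape[0], h, width))
    for di in range(3):
        for dj in range(3):
            # Contribution of kernel offset (di, dj), mixing input channels.
            out += np.einsum("bchw,oc->bohw", xp[:, :, di:di + h, dj:dj + width], w[:, :, di, dj])
    return out

def residual_feature_enhancer(x_axial, x_r, w_down, w_spatial, w_up):
    x_down = conv1x1(x_axial, w_down)        # Eq. (15): channel reduction
    x_spatial = conv3x3(x_down, w_spatial)   # Eq. (16): local spatial features
    x_up = conv1x1(x_spatial, w_up)          # Eq. (17): channel restoration (assumed chained)
    return x_r * x_up                        # Eq. (18): element-wise fusion

rng = np.random.default_rng(0)
B, C, C_mid, H, W = 1, 8, 4, 6, 6
x_axial = rng.standard_normal((B, C, H, W))
w_down = rng.standard_normal((C_mid, C)) * 0.1             # random stand-in weights
w_spatial = rng.standard_normal((C_mid, C_mid, 3, 3)) * 0.1
w_up = rng.standard_normal((C, C_mid)) * 0.1
y = residual_feature_enhancer(x_axial, x_axial, w_down, w_spatial, w_up)
```

Running the 3 × 3 convolution on the reduced tensor is what keeps its cost at roughly $9 \cdot C_{mid}^{2}$ multiplications per position instead of $9 \cdot C^{2}$.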

2.2.5. Stage Difference Regularization

In the serial structure, the channel, spatial, and axial attention modules process the input features sequentially. Although the RFE can enhance expression quality, the serial setup tends to cause different stages to learn very similar feature representations, preventing the full utilization of the potential complementarity among multi-stage attention modules.

To encourage the stages to attend to different characteristics, we introduce Stage-Divergence Regularization (SDR). This regularization acts directly on the intermediate features of the cascaded attention modules, ensuring a moderate difference between stages and thereby improving the diversity of the learned representations. Let the input features be the following:

$X \in ℝ^{H \times W \times C}$

(19)

The features after channel attention, spatial attention, and axial attention are then denoted $A_{1}(X)$, $A_{2}(X)$, and $A_{3}(X)$, respectively. Global average pooling is applied to the input features, giving $h_{0} = GAP(X)$, and to the features of each stage:

$h_{t} = GAP(A_{t}(X)), \quad t = 1, 2, 3$

(20)

The stage difference regularization is defined as follows:

$L_{div} = \sum_{t = 1}^{3} ‖ h_{t} - h_{t - 1} ‖_{2}^{2}$

(21)

This regularization term explicitly promotes learning diverse features at different stages, thereby reducing redundancy in the multi-stage attention outputs and increasing complementarity among the three stages.

The final loss is the following:

$L = L_{\det} + λ_{div} L_{div}$

(22)

where $λ_{div}$ controls the intensity of the difference constraints.
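Equations (20)–(22) can be evaluated with a few lines of NumPy. The sketch below only computes the terms as written; it assumes that $h_{0}$ is the pooled input feature, uses constant toy feature maps, and the value of `lambda_div` is an arbitrary placeholder rather than a setting reported in the paper.

```python
import numpy as np

def gap(x):
    """Global average pooling of an (H, W, C) feature map over H and W -> (C,)."""
    return x.mean(axis=(0, 1))

def stage_divergence(x, stage_outputs):
    """Eq. (21): sum of squared L2 distances between consecutive pooled descriptors."""
    descriptors = [gap(x)] + [gap(a) for a in stage_outputs]   # h_0, h_1, h_2, h_3
    return sum(float(np.sum((h_t - h_prev) ** 2))
               for h_prev, h_t in zip(descriptors[:-1], descriptors[1:]))

def total_loss(l_det, l_div, lambda_div):
    """Eq. (22): detection loss plus the weighted regularization term."""
    return l_det + lambda_div * l_div

# Toy feature maps with constant values so the result is easy to check by hand.
x = np.ones((2, 2, 3))
stages = [2.0 * np.ones((2, 2, 3)), 2.0 * np.ones((2, 2, 3)), 4.0 * np.ones((2, 2, 3))]
l_div = stage_divergence(x, stages)                  # 3 * 1^2 + 0 + 3 * 2^2
loss = total_loss(l_det=1.0, l_div=l_div, lambda_div=0.1)   # placeholder weight
```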

2.2.6. Feature Alignment

The detection head requires features with specific dimensional ordering, distribution states, and channel counts. To ensure compatibility and stability, output processing is applied before feeding features into the detection head. The implementation steps are as follows:

Let the original feature be denoted as $X$. After a dimension permutation operation, the dimensions are rearranged into the order of batch $B$, height $H$, width $W$, and channel $C$, resulting in $X_{perm}$.

$X_{perm} = permute(X, [B, H, W, C])$

(23)
Layer normalization is applied to each channel of every sample in the spatial dimensions $(H \times W)$. This reduces the internal covariate shift problem and enhances the stability and speed of model training, as shown in Equation (24):

$LayerNorm(X) = \frac{X - μ}{σ} ⊙ γ + β$

(24)

where $μ$ is the mean of the feature along the spatial dimension, $σ$ is the standard deviation, $γ$ is the scaling factor used to adjust the amplitude of the feature distribution, and $β$ is the offset term.
Finally, a conditional branching strategy adapts the feature channels to match the downstream network’s requirement. Let $C_{in}$ be the current channel count and $C_{out}$ be the output channel count from the RFE module. The final output $Y_{out}$ is determined as follows:

$Y_{out} = \begin{cases} Conv_{1 \times 1}(Y) + BN(Y), & C_{in} \neq C_{out} \\ Y, & C_{in} = C_{out} \end{cases}$

(25)

where $Y$ is the output of the RFE, and $BN(Y)$ is the result of applying batch normalization to $Y$.

If $C_{in} \neq C_{out}$, a 1 × 1 convolution combined with Batch Normalization adjusts the channel count. If $C_{in} = C_{out}$, the features are passed directly. This ensures effective multi-dimensional feature transmission and compatibility with subsequent tasks.

2.3. Class Weight Balancing Strategy

In substation meter defect datasets, class imbalance is common—normal samples greatly exceed rare defect types. This imbalance biases the model toward high-frequency classes, leading to poor detection of low-frequency defects. To solve this issue, this paper presents a class weight balancing strategy that dynamically adjusts class weights in the loss calculation, helping the model learn features from all classes fairly and increasing its focus on low-frequency samples.

First, the number of occurrences of each class $c$ in the training set is counted to obtain the class frequency $count(c)$, as shown in Equation (26).

$count(c) = \text{number of times class } c \text{ appears in the training set}$

(26)

This frequency reflects the original distribution. For instance, in our dataset, “normal meter” samples far exceed “dial condensation” samples, resulting in a significant frequency disparity.
To prioritize rare defects, the paper defines an inverse frequency weighting function $weight(c)$. Lower frequency classes are assigned higher weights to amplify their contribution to the loss calculation:

$weight(c) = \frac{1}{count(c) + ε} \quad (ε = 1 \times 10^{-6})$

(27)

where $ε$ is a small constant that prevents division-by-zero errors.
Since inverse weights vary significantly in scale, they are normalized. The weight for each class $weight(c)$ is divided by the sum of weights across all $nc$ classes, $\sum_{i = 1}^{nc} weight(i)$, ensuring the values lie within a stable range:

$weight(c) = \frac{weight(c)}{\sum_{i = 1}^{nc} weight(i)}$

(28)
The normalized weight $weight(c)$ is multiplied by the number of classes $nc$ to amplify the differences between classes, completing the class weight balancing calculation and yielding the final balanced weight $class\_weights(c)$, as shown in Equation (29).

$class\_weights(c) = weight(c) \times nc$

(29)

These weights are applied to the loss function. By down-weighting easily classified (high-frequency) samples and balancing the positive–negative sample ratio, the strategy optimizes gradient calculation and parameter updates, effectively mitigating the impact of class imbalance.
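The weight computation in Equations (26)–(29) is short enough to sketch directly. The per-class counts below are hypothetical and chosen only to make the arithmetic easy to follow; they are not the counts of the dataset used in the paper, and the function name `class_weights` is illustrative.

```python
from collections import Counter

def class_weights(labels, num_classes, eps=1e-6):
    """Inverse-frequency class weights, normalized and scaled by the number of classes."""
    counts = Counter(labels)                                             # Eq. (26)
    inverse = [1.0 / (counts.get(c, 0) + eps) for c in range(num_classes)]   # Eq. (27)
    total = sum(inverse)
    # Eq. (28) normalizes the weights to sum to 1; Eq. (29) rescales them by num_classes.
    return [w / total * num_classes for w in inverse]

# Hypothetical label list: class 0 is frequent, class 3 is rare.
labels = [0] * 800 + [1] * 400 + [2] * 200 + [3] * 100
weights = class_weights(labels, num_classes=4)
```

After the final scaling the weights sum to the number of classes, so their average is 1 and the overall loss magnitude stays comparable to the unweighted case. A class that never appears in the training set would receive a weight close to $1/ε$ before normalization and dominate the result, so such classes are best handled separately.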

3. Experimental Results and Analysis

3.1. Experimental Environment

The hardware setup for the experiments in this paper is detailed in Table 1. The software environment included Ubuntu 16.04, PyTorch 1.8.1, Torchvision 0.9.1, Python 3.8, and CUDA 10.1.

3.2. Experimental Dataset

This paper established a domain-specific dataset collected from real-world substations, focusing on key metering equipment such as current transformers and surge arrester counters. The dataset contains images collected under various lighting and weather conditions, including sunny, overcast, low-light, and reflective scenarios. These variations provide natural diversity that enables the MDCAF module and YOLOv8 backbone to learn stable defect representations. These devices feature glass-encapsulated precision components essential for smart grid monitoring. Unlike generic public datasets, this collection captures specific industrial attributes like low contrast and reflective interference. Visual examples of the investigated equipment and these representative defect categories are presented in Figure 4.

The dataset contains 1702 images classified into four distinct categories based on their visual characteristics: the ‘Meter Normal’ class represents meters in optimal condition with clear visibility; ‘Dial fouling’ refers to the accumulation of external contaminants like dust, mud, or oil that reduces contrast and obscures the display; ‘Dial condensation’ manifests as internal moisture or misting caused by temperature fluctuations, which blurs the dial readings; and ‘Dial damage’ captures physical structural impairments, primarily cracked glass. The dataset is split into training and validation sets (80:20 ratio), as detailed in Table 2.

3.3. Training Details

All models were trained with the same optimizer and the same hyperparameters: initial learning rate, learning rate decay strategy, weight decay coefficient, batch size, and number of training epochs. The specific configuration of the training hyperparameters is shown in Table 3.

3.4. Evaluation Metrics

To comprehensively evaluate the model’s performance, we employ standard object detection metrics, including Precision ($P$) and Recall ($R$). Furthermore, Average Precision ($AP$) and mean Average Precision ($mAP$) are selected as primary metrics to assess detection capabilities in real-world scenarios. $AP$ quantifies accuracy for individual classes, while $mAP$ provides an overall performance measure by averaging $AP$ across all categories.

Prediction validity is determined by Intersection over Union (IoU) and confidence thresholds, which classify results into four categories: True Positive ($TP$), False Positive ($FP$), True Negative ($TN$), and False Negative ($FN$). $TP$ and $TN$ denote correct predictions for positive and negative samples, respectively, while $FP$ and $FN$ represent incorrect predictions. Based on these definitions, Precision and Recall are formulated as follows:

$P = \frac{TP}{TP + FP}$

(30)

$R = \frac{TP}{TP + FN}$

(31)

Subsequently, $AP$ and $mAP$ are derived:

$AP = \int_{0}^{1} P \, dR$

(32)

$mAP = \frac{\sum_{i = 1}^{N_{cls}} \int_{0}^{1} P_{i}(R_{i}) \, dR_{i}}{N_{cls}}$

(33)
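The metrics in Equations (30)–(33) can be illustrated with a small sketch. It approximates the integral in Equation (32) by a plain rectangle sum over the ranked detections, without the precision interpolation that common evaluation toolkits apply, so its values may differ slightly from theirs. The detection scores and true-positive flags are invented for the example, and the function names are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Eqs. (30) and (31); returns 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(scores, is_tp, num_gt):
    """Eq. (32) for one class, approximated by a rectangle sum over recall increments."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_gt
        ap += precision * (recall - prev_recall)   # area added since the previous detection
        prev_recall = recall
    return ap

def mean_average_precision(ap_per_class):
    """Eq. (33): average of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

# Invented detections for one class: confidence scores and whether each matched a ground truth.
scores = [0.9, 0.8, 0.7, 0.6]
is_tp = [True, False, True, True]
ap = average_precision(scores, is_tp, num_gt=4)
p, r = precision_recall(tp=8, fp=2, fn=2)
m_ap = mean_average_precision([ap, 1.0])
```

Whether a detection counts as a true positive is decided beforehand by the IoU and confidence thresholds; the sketch takes those flags as given.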

3.5. Detailed Analysis of Results

To ensure the stability of the experimental results, all models were independently trained three times under the same settings, and the average of the three results was used as the final metric. Due to the fixed distribution of the instrument defect data, the randomness during training has a relatively small impact on detection accuracy. The fluctuations in the indicators across repeated experiments are all less than 0.3%, which does not affect the overall conclusion. Therefore, the $AP$ and $mAP$ values reported in this paper are all statistically reliable.

3.5.1. Comparative Experiment

To verify the superiority of the proposed method, we compared it with current advanced object detection algorithms. Table 4 compares its detection accuracy with that of YOLOX, YOLOF, and Faster R-CNN on the substation transformer defect dataset [24,25].

From Table 4, the following conclusions can be drawn:

The proposed method achieves the best performance among the four algorithms, reaching an $mAP$ of 90.1%. It demonstrates significant advantages in fine-grained categories such as condensation and contamination, while also surpassing other models in detecting dial damage. Regarding inference efficiency, Faster R-CNN lacks real-time capability, and YOLOv8 sets the speed benchmark. The proposed algorithm strikes a favorable balance: although it incurs a slight FPS decrease compared to YOLOv8, it maintains a high speed (~80 FPS) that exceeds real-time requirements. This confirms that the method effectively trades marginal latency for significant accuracy gains. Ultimately, by synergizing the multi-domain collaborative attention fusion structure with the class-balancing mechanism, the model effectively captures weak textures and low-contrast defects, ensuring robust detection for substation instruments.

3.5.2. Ablation Experiment of the MDCAF Module

To verify the complementarity of the three-domain attention, the channel, spatial, and axial attention modules were tested both individually and in combination. The results are shown in Table 5.

From Table 5, the following conclusions can be drawn:

When introducing any single attention module alone, the overall $mAP$ improves to some extent compared to the baseline. Among them, the gain from axial attention is particularly noticeable, indicating that modeling long-range dependencies along the coordinate axes helps enhance the perception of the structural information of the instrument dial.
Combining two attention modules (such as channel + spatial, channel + axial, spatial + axial) further improves the $mAP$.
Introducing all three attention modules at once boosts the overall $mAP$ by about 0.8 percentage points over the baseline, highlighting the strong synergy between different attention mechanism dimensions.

It should be noted that after introducing the attention modules, the $AP$ of some categories may experience slight fluctuations. Still, the overall trend shows that most defect categories have improved to varying degrees, with no significant performance degradation. In summary, the designed multi-dimensional attention mechanism can effectively enhance the network’s ability to represent details, local stains, and small-scale defects of the instrument dial, thereby providing stable and continuous performance improvements in detecting appearance defects of substation instruments.

3.5.3. Ablation Experiments of the Algorithm Proposed in This Paper

To verify each module’s contribution to the overall performance, we conducted an ablation study by gradually adding each improvement module to the baseline model (YOLOv8). The experiments focused on three aspects: the data augmentation strategy, the MDCAF module, and the class weight balancing strategy. By systematically assessing the impact of each module on detection accuracy for different defect classes, we evaluated the effectiveness of each proposed enhancement. The experimental results are shown in Table 6.

From Table 6, the following conclusions can be drawn:

Compared to the baseline model, adding data augmentation, the MDCAF module, and the class weight balancing strategy helps the network learn defect features more effectively, which improves the accuracy of substation instrument defect detection.
When comparing the baseline model with the version that includes only the MDCAF module, the overall $m A P$ increased from 87.3% to 88.1%, a gain of 0.8 percentage points. The main performance boost was seen in the “dial fouling” class, where $A P$ rose by 1.9 percentage points. The $A P$ for the “dial damage” class remained pretty steady, while the “dial condensation” class showed a modest increase of 1.5 percentage points. This demonstrates the effectiveness of the MDCAF module, which improves the representation of complex defects by selectively emphasizing key features in the channel domain, spatially localizing defect regions, and modeling global continuity in the axial-relation domain. However, because the “dial condensation” class had only 186 samples, the attention mechanism alone could not fully compensate for the limited training data, resulting in only modest improvements in that category.
After adding the Mixup data augmentation strategy on top of the MDCAF module, the $A P$ for the “dial condensation” class increased significantly by 2.2 percentage points. In contrast, the $A P$ for the “normal meter” class slightly decreased. Mixup creates transitional defect samples via weighted blending, thereby enhancing the diversity of low-frequency defects and helping the model learn complex morphologies more effectively. Conversely, since the “normal meter” class already had ample samples, excessive mixing diluted its distinctive patterns, resulting in a small decrease in $A P$ . However, this minor decline had little impact on overall performance.
Comparing different two-module combinations, the model that combined Mixup and the class weight balancing strategy showed the most remarkable $A P$ improvement for the “dial fouling” class, increasing by 2.8 percentage points. In contrast, the $A P$ for the “dial condensation” class remained essentially unchanged. This suggests that class-weight balancing, using inverse-frequency weighting, increased the significance of low-frequency defects in the loss calculation and helped reduce overfitting to normal samples. However, without the MDCAF module, the weak features of condensation were still not effectively captured, so the performance on this low-frequency class did not improve much.
When all three components (Mixup, the MDCAF module, and the class weight balancing strategy) were used together, the overall $m A P$ increased by 2.8 percentage points compared to the baseline (87.3% to 90.1%). The $A P$ for the rare “dial condensation” class rose by 6.4 percentage points (75.7% to 82.1%), the largest improvement among all categories. The $A P$ for the “dial damage” and “dial fouling” classes increased by 1.8 and 3.6 percentage points, respectively, while the “normal meter” class declined by 0.9 percentage points (96.8% to 95.9%). The Mixup strategy created transitional defect samples to compensate for limited feature diversity; the MDCAF module captured the low-contrast features of condensation, the edges of damage, and the overall distribution of fouling via multi-domain attention; and the class weight balancing strategy dynamically focused training on low-frequency categories. Together, these three components produced the largest improvement in substation instrument defect detection accuracy among all tested configurations.
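The "weighted blending" that Mixup performs can be illustrated with a minimal sketch. It follows the general Mixup formulation (a convex combination of two images with a coefficient drawn from a Beta distribution). The function names, the value `alpha=1.0`, and the choice to keep the boxes of both source images with their blend weights are illustrative assumptions, not settings taken from this paper.

```python
import random


def mixup_images(img_a, img_b, lam):
    """Blend two equally sized grayscale images (nested lists) pixel by pixel."""
    same_rows = len(img_a) == len(img_b)
    if not same_rows or any(len(ra) != len(rb) for ra, rb in zip(img_a, img_b)):
        raise ValueError("images must have the same shape")
    return [
        [lam * pa + (1.0 - lam) * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(img_a, img_b)
    ]


def mixup_sample(img_a, boxes_a, img_b, boxes_b, alpha=1.0, rng=random):
    """Return a blended image, the boxes of both sources, and the coefficient.

    alpha and the per-box weighting are assumptions made for illustration.
    """
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
    mixed = mixup_images(img_a, img_b, lam)
    # Keep every box and attach the share of the image it came from.
    boxes = [(box, label, lam) for box, label in boxes_a]
    boxes += [(box, label, 1.0 - lam) for box, label in boxes_b]
    return mixed, boxes, lam


if __name__ == "__main__":
    # Toy 1x2 "images"; labels reuse the Table 2 names for readability.
    condensation = [[40.0, 60.0]]
    normal = [[200.0, 220.0]]
    mixed, boxes, lam = mixup_sample(
        condensation, [((0, 0, 2, 1), "bj_nl")],
        normal, [((0, 0, 2, 1), "bj_zc")],
    )
    print(lam, mixed, boxes)
```

A blended sample of this kind lies between a defect image and a normal image. This is consistent with the observation above that rare classes gain diversity while an already well-represented class can lose some of its distinctive appearance.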

3.6. Visual Comparison of Results

To further assess the proposed method, we visually compared its detection results with those of the baseline model on the substation instrument defect dataset.

As Figure 5 shows, compared to the baseline model, the proposed method accurately locates and correctly classifies the appearance defects of the instruments.

In some extreme cases (such as when the glass surface has strong reflections, heavy dirt, or large obstructions), false detections and missed detections can still occur. This shows that in more complex situations with significant lighting changes and irregular defect textures, there is still room to improve the model’s robustness.

4. Conclusions

To tackle the challenges of detecting appearance defects in substation instruments, such as subtle features, complex shapes, and severe class imbalance, this study developed and validated an improved detection algorithm based on Multi-Domain Collaborative Attention Fusion (MDCAF). By incorporating data augmentation, a multi-domain attention mechanism, and a class-weight balancing strategy, the algorithm improved the model’s overall performance. The experimental results confirmed the method’s effectiveness: it achieved a mean Average Precision ($m A P$) of 90.1%, a 2.8 percentage point increase over the baseline model. Importantly, the method mitigated the class imbalance issue, raising the $A P$ for the rare “dial condensation” class by 6.4 percentage points, while also improving the $A P$ for the more common “dial fouling” and “dial damage” classes by 3.6 and 1.8 percentage points, respectively. These outcomes demonstrate the synergistic effect of the proposed modules.

The proposed model is designed for deployment on edge computing modules embedded in mobile inspection robots or UAVs. In this workflow, the system performs real-time inference on captured meter images and, upon detecting defects, transmits structured alert data to the central system. This edge–cloud collaborative approach enables rapid maintenance responses, thereby ensuring the operational reliability of the grid’s sensing layer.
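One way to picture the "structured alert data" step is the sketch below. The paper does not specify a message format, so the field names, the `min_confidence` threshold, and the `build_alert` helper are all hypothetical. Only the label names (`bj_zc`, `bj_zw`, `bj_nl`, `bj_bpps`) come from Table 2.

```python
import json
from dataclasses import dataclass, asdict

# Label names follow Table 2; the rest of this schema is a hypothetical example.
DEFECT_LABELS = {
    "bj_zw": "dial fouling",
    "bj_nl": "dial condensation",
    "bj_bpps": "dial damage",
}


@dataclass
class Detection:
    label: str
    confidence: float
    box_xyxy: tuple  # (x1, y1, x2, y2) in pixels


def build_alert(device_id, image_id, detections, min_confidence=0.5):
    """Return a JSON alert string for defect detections, or None if none qualify."""
    defects = [
        d for d in detections
        if d.label in DEFECT_LABELS and d.confidence >= min_confidence
    ]
    if not defects:
        return None  # normal meters and low-confidence boxes trigger no alert
    payload = {
        "device_id": device_id,
        "image_id": image_id,
        "defects": [
            {**asdict(d), "category": DEFECT_LABELS[d.label]} for d in defects
        ],
    }
    return json.dumps(payload, sort_keys=True)


if __name__ == "__main__":
    detections = [
        Detection("bj_zc", 0.97, (12, 8, 240, 236)),
        Detection("bj_nl", 0.81, (60, 40, 180, 150)),
    ]
    print(build_alert("robot-01", "img-0001", detections))
```

Sending only a compact record of this kind, rather than the raw image, is one plausible reading of the edge–cloud split described above. The actual payload and transport would depend on the utility's own systems.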

Despite these positive results, this research has limitations. First, the model’s performance still depends on the diversity of the training data, and Mixup cannot fully mimic complex real-world environmental interference. Second, the MDCAF structure increases computational overhead. To address these shortcomings, future research will focus on two main directions: refining feature fusion methods to lower computational complexity for resource-constrained edge devices, and developing lightweight model variants to further enhance real-time processing capabilities in complex power system environments.
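The size of this overhead can be read off the FPS column of Table 4. The short calculation below converts the reported throughput of YOLOv8 (98.4 FPS) and of the proposed algorithm (81.6 FPS) into average per-frame latency. It assumes the FPS values are averages measured in the environment of Table 1 and says nothing about latency on actual edge hardware.

```python
# FPS values as reported in Table 4.
FPS = {"YOLOv8": 98.4, "Proposed": 81.6}


def latency_ms(fps):
    """Average per-frame latency in milliseconds implied by a throughput in FPS."""
    return 1000.0 / fps


baseline_ms = latency_ms(FPS["YOLOv8"])
proposed_ms = latency_ms(FPS["Proposed"])
fps_drop_pct = 100.0 * (FPS["YOLOv8"] - FPS["Proposed"]) / FPS["YOLOv8"]

print(f"baseline: {baseline_ms:.2f} ms/frame")
print(f"proposed: {proposed_ms:.2f} ms/frame")
print(f"throughput drop: {fps_drop_pct:.1f}%")
```

Under these assumptions the added modules cost roughly 2 ms per frame (about 10.2 ms versus 12.3 ms), a throughput reduction of about 17%.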

Author Contributions

Conceptualization, Z.L.; methodology, Z.Y.; investigation, Y.L.; data curation, S.W.; writing—original draft preparation, K.L.; writing—review and editing, Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Electric Power Company under Project (No. 522702250002, Research and Application on Performance Enhancement of Intelligent Detection for Irregular Visual Defects in Substation Equipment, conducted by Electric Power Company in 2025).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

Authors Kequan Liu, Yandong Li, and Shiwei Wang are employed by the State Grid Gansu Electric Power Company. Author Zhaoguang Yang is employed by the State Grid Gansu Comprehensive Energy Service Co., Ltd. Author Zhixin Li is employed by the Electric Power Research Institute of State Grid Gansu Electric Power Company. Author Zhenbing Zhao is employed by North China Electric Power University. The authors declare that this study received funding from Electric Power Company. Kequan Liu had the following involvement with the study: writing; Yandong Li had the following involvement with the study: literature search; Shiwei Wang had the following involvement with the study: meter image acquisition; Zhaoguang Yang had the following involvement with the study: methodological design; Zhixin Li had the following involvement with the study: experimental data analysis; Zhenbing Zhao had the following involvement with the study: manuscript review.

Figure 1. Overall framework diagram of the MDCAF module.

Figure 2. Channel domain calculation process.

Figure 3. Spatial domain computing process.

Figure 4. Example of some instruments in the domain-specific dataset.

Figure 5. Test results of models.

Table 1. Experimental environment.

Name	Model
Operating System	Ubuntu 16.04
CPU	E5-2620 v3
GPU	TITAN Xp
CUDA	10.1
Python	3.8
Deep Learning Framework	PyTorch 1.8.1

Table 2. Dataset details.

Category Name	Number of Images	Label Name
Meter Normal	845	bj_zc
Dial fouling	490	bj_zw
Dial condensation	186	bj_nl
Dial damage	272	bj_bpps
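The imbalance in Table 2 can be turned into class weights by inverse-frequency weighting. The sketch below is only an illustration: the normalisation `N / (K * n_c)` is a common convention assumed here, not necessarily the formula used in this paper, and Table 2 reports image counts, whereas a detector loss might instead use instance counts.

```python
# Image counts per class, copied from Table 2.
COUNTS = {"bj_zc": 845, "bj_zw": 490, "bj_nl": 186, "bj_bpps": 272}


def inverse_frequency_weights(counts):
    """Weights proportional to 1 / count, scaled so a balanced dataset gives 1.0 each.

    The normalisation N / (K * n_c) is an assumption, not taken from the paper.
    """
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * n) for name, n in counts.items()}


weights = inverse_frequency_weights(COUNTS)
for name, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{name}: {weight:.2f}")
```

With these counts, the rarest class (`bj_nl`, dial condensation) receives about 4.5 times the weight of the most common class (`bj_zc`, meter normal), which is simply the ratio 845/186.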

Table 3. Training hyperparameter configuration.

Name	Configuration
Optimizer	SGD
Initial learning rate	0.01
Learning rate strategy	Cosine annealing
Weight decay	0.0005
Epochs	200
Batch size	16
Image input size	640

Table 4. Performance comparison with advanced detection algorithms.

Algorithms	Meter Normal $A P$ (%)	Dial Condensation $A P$ (%)	Dial Damage $A P$ (%)	Dial Fouling $A P$ (%)	$m A P$ (%)	FPS
Faster R-CNN	96.1	77.2	87.4	86.3	87.1	21.5
YOLOv8	96.8	75.7	88.6	88.2	87.3	98.4
YOLOF	89.7	70.5	82.3	80.1	80.7	53.2
The algorithm proposed in this paper	95.9	82.1	90.4	91.8	90.1	81.6

Table 5. Ablation test results of the MDCAF Module.

Baseline	Channel Attention	Spatial Attention	Axial Attention	Meter Normal $A P$ (%)	Dial Condensation $A P$ (%)	Dial Damage $A P$ (%)	Dial Fouling $A P$ (%)	$m A P$ (%)
√	-	-	-	96.8	75.7	88.6	88.2	87.3
√	√	-	-	96.6	76.3	87.9	88.3	87.3
√	-	√	-	96.5	76.2	88.3	88.4	87.4
√	-	-	√	96.4	76.0	88.8	89.0	87.5
√	√	√	-	96.7	76.7	88.5	89.5	87.7
√	√	-	√	96.6	76.6	88.3	89.4	87.8
√	-	√	√	96.5	76.2	88.7	89.3	87.6
√	√	√	√	96.6	77.2	88.1	90.1	88.1

The bolded data denote the optimal values. A hyphen (“-”) indicates that the module was not adopted, while a checkmark (“√”) indicates that the module was adopted.

Table 6. Ablation test results of the algorithm proposed in this paper.

Data Augmentation	MDCAF Module	Class Weight Balancing Strategy	Meter Normal $A P$ (%)	Dial Condensation $A P$ (%)	Dial Damage $A P$ (%)	Dial Fouling $A P$ (%)	$m A P$ (%)
-	-	-	96.8	75.7	88.6	88.2	87.3
-	√	-	96.6	77.2	88.1	90.1	88.1
-	√	√	94.6	73.2	90.0	90.7	87.1
√	√	-	94.5	79.4	85.8	90.3	87.5
√	-	√	95.8	75.6	89.8	91.0	88.1
√	√	√	95.9	82.1	90.4	91.8	90.1

The bolded data denote the optimal values. A hyphen (“-”) indicates that the module was not adopted, while a checkmark (“√”) indicates that the module was adopted.
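The percentage-point changes quoted in the ablation discussion can be recomputed directly from Table 6. The sketch below hard-codes the table values; the helper names are illustrative. It also compares each reported $m A P$ with the unweighted mean of the four rounded per-class $A P$ values, assuming $m A P$ is that mean.

```python
CLASSES = ["Meter Normal", "Dial Condensation", "Dial Damage", "Dial Fouling"]

# (augmentation, MDCAF, class weighting) -> (per-class AP in %, reported mAP in %),
# copied from Table 6.
TABLE6 = {
    (False, False, False): ([96.8, 75.7, 88.6, 88.2], 87.3),
    (False, True, False): ([96.6, 77.2, 88.1, 90.1], 88.1),
    (False, True, True): ([94.6, 73.2, 90.0, 90.7], 87.1),
    (True, True, False): ([94.5, 79.4, 85.8, 90.3], 87.5),
    (True, False, True): ([95.8, 75.6, 89.8, 91.0], 88.1),
    (True, True, True): ([95.9, 82.1, 90.4, 91.8], 90.1),
}
BASELINE = (False, False, False)


def deltas_vs_baseline(config):
    """Per-class AP change in percentage points relative to the baseline row."""
    base_ap, _ = TABLE6[BASELINE]
    ap, _ = TABLE6[config]
    return {c: round(a - b, 1) for c, a, b in zip(CLASSES, ap, base_ap)}


def mean_ap(config):
    """Unweighted mean of the four (already rounded) per-class AP values."""
    ap, _ = TABLE6[config]
    return sum(ap) / len(ap)


full = (True, True, True)
print(deltas_vs_baseline(full))
print(round(TABLE6[full][1] - TABLE6[BASELINE][1], 1))
for config, (_, reported) in TABLE6.items():
    print(config, round(mean_ap(config), 2), reported)
```

For the full model this reproduces the changes of -0.9, +6.4, +1.8, and +3.6 percentage points and the overall gain of 2.8. The recomputed means agree with the reported $m A P$ values to within about 0.1 percentage points, which is consistent with rounding of the per-class figures.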

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
