Two-Stage Fault Diagnosis of Distribution Network Based on MS-CNN and Spatio-Temporal Dual Attention

Yang, Ying; Huang, Jinyi; Zhu, Hao; Cai, Zibin; Zheng, Weijia

doi:10.3390/electronics15122545


by Ying Yang ¹, Jinyi Huang ¹, Hao Zhu ¹, Zibin Cai ²,* and Weijia Zheng ²

¹ Zhaoqing Power Supply Bureau, Guangdong Power Grid Co., Ltd., Zhaoqing 526000, China
² School of Mechatronic Engineering and Automation, Foshan University, Foshan 528000, China
* Author to whom correspondence should be addressed.

Electronics 2026, 15(12), 2545; https://doi.org/10.3390/electronics15122545 (registering DOI)

Submission received: 27 April 2026 / Revised: 26 May 2026 / Accepted: 2 June 2026 / Published: 9 June 2026

(This article belongs to the Special Issue AI Applications for Smart Grid: 2nd Edition)


Abstract

To address the weak fault features and the difficulty of localizing faults at adjacent nodes in distribution networks, we construct a two-stage cascaded architecture that decouples the diagnosis task into fault classification and section location. The feature layer fuses MS-CNN, SimAM, and Transformer to form a spatio-temporal dual attention mechanism that jointly captures spatial saliency and global temporal logic. A prototype network is introduced at the fault location decision layer, and metric learning is used to resolve the feature aliasing of adjacent nodes. The experimental results show that the accuracies of fault classification and localization are 98.61% and 94.22%, respectively, and that the model degrades gracefully under extremely low-SNR conditions, which verifies the effectiveness of the proposed strategy for the refined fault diagnosis of distribution networks.

Keywords:

distribution network fault diagnosis; MS-CNN; SimAM; Transformer; spatio-temporal dual attention; prototype network

1. Introduction

As the critical infrastructure connecting the transmission system to end-users, the distribution network directly affects the stable operation of society and the economy through its power supply reliability. However, because of complex topologies, line aging, and environmental interference, distribution networks are highly susceptible to various types of faults. Traditional troubleshooting relies heavily on manual patrols or passive repair reporting, which are inefficient and struggle to meet the smart grid's requirement for rapid power restoration. Research into high-precision automated fault diagnosis and localization methods is therefore important for enhancing the perception capabilities and self-healing levels of distribution networks [1,2,3].

In the field of distribution network fault diagnosis, the complex physical features of fault signals represent a core challenge constraining accuracy. Distribution network faults primarily encompass single-line-to-ground (SLG) faults, line-to-line (L-L) short circuits, and single-phase open circuits. Among them, SLG faults, which have the highest occurrence rate [4,5,6], are typically accompanied by a surge in zero-sequence components, where the signal is weak and easily submerged by noise. L-L short circuits trigger violent current impulses containing abundant high-frequency transients. Meanwhile, single-phase open circuits result in three-phase imbalance and negative-sequence currents. These fault morphologies vary significantly, and, in practical operation, fault signals often possess both high-frequency transients and millisecond-level power frequency distortions. Such cross-scale non-stationary signal features easily lead to modal aliasing of critical fault information in the time–frequency domain. This not only exposes the limitations of early physical mechanism models, but also drives the continuous evolution of diagnostic technologies toward intelligent data-driven directions.

Early research mainly relied on physical mechanism models, such as the traditional impedance method. Zhu et al. [7] developed an automated fault location and diagnosis scheme specifically for radial distribution feeders, establishing one of the first systematic applications of the impedance-based method to distribution networks. Salim et al. [8] further advanced this approach by proposing a hybrid fault diagnosis framework integrating wavelet-based detection, impedance-based fault location, and neural network-based section determination. However, Mora-Flórez et al. [9] pointed out, in their comparative study, that the performance of impedance-based methods relies heavily on accurate system parameters, load models, and power system topology. Chang et al. [10] further demonstrated that dynamic parameter deviations in actual environments can easily lead to protection maloperation based on deterministic calculations. To compensate for the deficiencies of physical models, Xiao et al. [11] introduced the Improved S-Transform (MST) for time–frequency analysis. Although this enhanced the representation of non-stationary signals, it failed to break free from the dependence on cumbersome manual feature engineering.

With the development of artificial intelligence, deep learning has gradually become mainstream [12,13,14,15,16,17]. These architectures cover a wide spectrum, including recurrent neural networks for temporal dependency modeling, convolutional neural networks for local feature extraction, graph neural networks for topological reasoning, and attention mechanisms for feature recalibration. Ji et al. [12] utilized the dense connection mechanism of LSTM-DenseNet to effectively validate the advantages of end-to-end models in capturing long-term temporal dependencies of faults. Shafei et al. [18] proved that combining a CNN with the Park transformation can effectively eliminate time-varying load interference, enhancing the model’s generalization ability under variable operating conditions.

Addressing the challenges of complex topology and data sparsity, Mo et al. [19] proposed a Super-Resolution Graph Neural Network (SR-GNN), achieving full-network state reconstruction under sparse measurements. Lu and Hou [20] solved the problem of dynamic topological changes using a domain-adaptive graph attention algorithm. Guo et al. [21] combined the Hilbert–Huang Transform (HHT) with CNN to further enhance the time–frequency representation capability for high-frequency details of non-stationary signals.

Despite the impressive performance of existing architectures in classification tasks, design flaws persist in fine-grained localization. Most networks employ convolution kernels of fixed sizes, making them unable to achieve synchronous decoupling of full-frequency domain information. Although some studies, such as Hu et al. [22] and Yao et al. [23], incorporated channel attention mechanisms (SE-Net) and Convolutional Block Attention Modules (CBAMs), respectively, to optimize feature weights, the parameter redundancy introduced by their fully connected layers significantly increases the computational burden for edge deployment. Furthermore, these methods struggle to fully cope with signal redundancy in complex backgrounds. Additionally, traditional Softmax classifiers often fail to delineate clear decision boundaries when processing highly aliased features of adjacent nodes, thereby severely constraining fault localization accuracy.

Several key research gaps therefore remain in distribution network fault diagnosis. Li et al. [24] highlighted the importance of bridging fault diagnosis results to intelligent maintenance decisions, underscoring the growing demand for diagnosis systems that incorporate operating condition awareness. Meanwhile, from the perspective of diagnosis accuracy itself, existing deep learning-based methods still face three fundamental challenges: (1) single-scale convolution kernels cannot simultaneously capture high-frequency transient mutations and steady-state power frequency distortions; (2) parameter-intensive attention mechanisms hinder deployment on resource-constrained edge devices; and (3) Softmax-based classifiers struggle to separate highly aliased feature representations of electrically adjacent nodes. To address these challenges, this paper proposes a two-stage fault diagnosis framework based on MS-CNN and spatio-temporal dual attention. The main contributions are summarized as follows:

(1) A fusion mechanism of MS-CNN and SimAM is proposed. This achieves the synchronous decoupling of cross-scale fault signals and adaptive feature enhancement under noisy backgrounds, significantly improving the sensitivity to weak faults.
(2) The Transformer encoder is utilized to compensate for the limitations of CNNs in the temporal dimension. By mining the full-time-domain correlations of faults from transient inception to steady-state evolution via the multi-head self-attention mechanism, it solves the difficulty of synchronously decoupling high-frequency impulses and steady-state information.
(3) A two-stage decoupled decision mechanism and a prototype-based metric learning method are designed. By optimizing the geometric structure of the feature space rather than relying on probabilistic classification, this approach effectively overcomes the issue of feature aliasing among adjacent nodes, thereby achieving high-precision section localization.

2. Fault Diagnosis Framework Based on MS-CNN-SimAM-Transformer

2.1. Overview of the Overall Framework

A two-stage cascade diagnosis framework based on MS-CNN-SimAM-Transformer is proposed to address two difficulties that arise from the complex topology of distribution networks: weak fault features and waveform aliasing between adjacent nodes. The framework abandons the traditional end-to-end single-task mode and decouples fault diagnosis into two logically dependent cascade stages: “Stage 1: Fault Classification” and “Stage 2: Fault Section Localization”.

The proposed architecture, shown in Figure 1, combines spatio-temporal feature extraction with cascade decision-making. The input layer receives multi-channel signals: three-phase voltage and current, zero-sequence current, and active and reactive power (P, Q). These signals are processed by three parallel MS-CNN branches with kernel sizes of 3, 7, and 11, capturing transient features at different temporal scales. The SimAM module then applies parameter-free attention to suppress background noise and enhance fault-related signals. A Transformer encoder follows, modeling global temporal dependencies across the full time series. Together, SimAM (spatial screening) and Transformer (temporal modeling) form the spatio-temporal dual attention mechanism that extracts decoupled, high-dimensional feature representations.

After global average pooling and MLP projection, the resulting feature vectors enter the cascade decision-making process. As the first line of defense, Stage 1 uses a Softmax classifier to quickly determine the operating state of the system and acts as a logical filter: if a sample is judged to be in normal operation, the diagnosis terminates directly; only when a sample is judged to be a fault are its features passed to the next stage. For these fault samples, Stage 2 applies a metric learning-based localization strategy: it maintains a set of learnable fault prototype vectors, calculates the Euclidean distance between the query sample and each prototype, and assigns the fault section according to the nearest-neighbor principle.
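The cascade logic described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `cascade_diagnose`, the label strings, and the toy prototypes are hypothetical, and the Stage 1 logits and the embedding are taken as given rather than produced by the network.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cascade_diagnose(stage1_logits, embedding, prototypes, class_names,
                     normal_label="normal"):
    """Stage 1 gates the sample; Stage 2 runs only for fault samples.

    Returns (fault_type, section_index); section_index is None when
    Stage 1 judges the sample to be normal operation.
    """
    probs = softmax(stage1_logits)
    fault_type = class_names[probs.index(max(probs))]
    if fault_type == normal_label:
        return fault_type, None  # diagnosis terminates at Stage 1
    distances = [squared_distance(embedding, c) for c in prototypes]
    return fault_type, distances.index(min(distances))  # nearest prototype

# Hypothetical toy inputs: three Stage 1 classes, three section prototypes.
names = ["normal", "SLG", "LL"]
protos = [[0.0, 0.0], [1.0, 1.0], [3.0, 0.0]]
normal_case = cascade_diagnose([5.0, 0.1, 0.2], [0.9, 1.1], protos, names)
fault_case = cascade_diagnose([0.1, 4.0, 0.2], [0.9, 1.1], protos, names)
```

Because Stage 2 is skipped for samples judged normal, any fault that Stage 1 misses can never be localized, which is why the recall of Stage 1 bounds the end-to-end localization rate.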

2.2. Data Pre-Processing and Feature Construction

To fully exploit the time–frequency features of distribution network fault signals and to meet the input requirements of the MS-CNN-Transformer hybrid model, a complete data pre-processing pipeline is constructed. It consists of four steps: feature engineering, data standardization, label coding, and class imbalance processing.

2.2.1. Construction of Multi-Dimensional Time Series Feature Matrix

The original data are collected from an IEEE 33-node distribution network simulation model built in MATLAB/Simulink. Because a single voltage or current signal is limited in its ability to characterize weak fault features, a multi-dimensional feature space covering electrical quantities and power quantities is constructed in this study. In this work, “weak fault” refers to fault scenarios where the electrical signatures of adjacent nodes are highly similar due to short electrical distances, making accurate fault localization challenging.

Each sample in the original data set has a sampling length of L. For each sample, the three-phase voltages and currents (u_{abc}(t), i_{abc}(t)) are extracted as the basic channel features that reflect the operating state of the system. The zero-sequence component i_{0}(t), calculated from 3 i_{0}(t) = i_{a}(t) + i_{b}(t) + i_{c}(t), serves as a sensitive criterion for grounding faults, and the instantaneous active and reactive power (P, Q) are added to improve the identification of load fluctuations and fault impacts. Unlike voltage and current magnitudes alone, P and Q encode the phase relationship (power factor angle) between voltage and current. During faults, the abrupt change in system impedance causes a characteristic shift in the power factor angle, which cannot be directly captured from U or I individually. Teng et al. [25] demonstrated that power-based features provide complementary discriminative information for fault characterization. The input tensor X \in R^{C \times L}, where C is the total number of feature channels, is constructed by concatenating these physical quantities along the channel dimension. This multi-channel structure matches the multi-scale convolution kernels of MS-CNN and supports the parallel extraction of deep fault features. The construction process of this multidimensional time series feature matrix is illustrated in Figure 2.
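As a rough illustration of this channel construction, the sketch below stacks the three-phase voltages and currents, the zero-sequence current, and instantaneous P and Q into a nine-row matrix. The text does not state how P and Q are computed, so the instantaneous three-phase power formulas used here are an assumption, as are the helper names `build_feature_matrix` and `balanced_three_phase` and the choice of nine channels.

```python
import math

def build_feature_matrix(ua, ub, uc, ia, ib, ic):
    """Stack per-sample channels into a C x L list of lists (C = 9 here)."""
    n = len(ua)
    # Zero-sequence current from 3*i0 = ia + ib + ic.
    i0 = [(ia[t] + ib[t] + ic[t]) / 3.0 for t in range(n)]
    # Instantaneous active power (assumed definition): sum of phase products.
    p = [ua[t] * ia[t] + ub[t] * ib[t] + uc[t] * ic[t] for t in range(n)]
    # Instantaneous reactive power (assumed definition and sign convention).
    q = [((ub[t] - uc[t]) * ia[t] + (uc[t] - ua[t]) * ib[t]
          + (ua[t] - ub[t]) * ic[t]) / math.sqrt(3.0) for t in range(n)]
    return [ua, ub, uc, ia, ib, ic, i0, p, q]

def balanced_three_phase(amplitude, phase_lag, n=200, f=50.0, fs=10_000.0):
    """Synthetic balanced three-phase waveforms, used only as test input."""
    shifts = (0.0, -2.0 * math.pi / 3.0, 2.0 * math.pi / 3.0)
    return [[amplitude * math.cos(2.0 * math.pi * f * t / fs + s - phase_lag)
             for t in range(n)] for s in shifts]

ua, ub, uc = balanced_three_phase(1.0, 0.0)
ia, ib, ic = balanced_three_phase(0.5, math.pi / 6.0)  # current lags by 30 degrees
X = build_feature_matrix(ua, ub, uc, ia, ib, ic)
```

For a balanced healthy system the zero-sequence channel stays at zero and P and Q are constant, so any departure from that pattern in these three channels is a direct indicator of an asymmetric fault.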

2.2.2. Standardized Handling of Data Leakage

Feeding the raw signals directly into the network can slow gradient-descent convergence or even cause gradient explosion, because voltage (kV level), current (A level), and power (kW/kVar level) differ greatly in numerical magnitude. To this end, Z-Score standardization is applied using training set statistics:

x^{'} = \frac{x - μ_{train}}{σ_{train} + ϵ}

(1)

where μ_{train} and σ_{train} are the mean and standard deviation computed on the training set and ϵ is a small constant that avoids division by zero. The same μ_{train} and σ_{train} are used to transform the validation and test sets to prevent data leakage.
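The leakage-free use of Eq. (1) can be sketched as follows: statistics are fitted on the training split only and then reused unchanged for the other splits. The value of ϵ and the function names are assumptions made for illustration; the toy numbers are not from the paper.

```python
import statistics

EPS = 1e-8  # assumed value; the text only states that a small constant is used

def fit_channel_stats(train_values):
    """Mean and population standard deviation of one channel, training set only."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)

def standardize(values, mu, sigma, eps=EPS):
    """Apply Eq. (1) with statistics fitted elsewhere."""
    return [(v - mu) / (sigma + eps) for v in values]

train = [12.1, 12.7, 12.4, 12.9, 12.4]   # toy voltage readings in kV
test = [12.5, 3.0]                        # second value mimics a voltage sag

mu, sigma = fit_channel_stats(train)
train_z = standardize(train, mu, sigma)
test_z = standardize(test, mu, sigma)     # reuses training statistics
```

In practice one pair (μ, σ) would be fitted per feature channel, since voltage, current, and power live on different scales.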

2.2.3. Multi-Task Label Coding

For the two-stage diagnosis framework of “Stage 1” and “Stage 2” constructed in this paper, the hierarchical label coding strategy is adopted. For the fault classification task of the first stage, the label encoder is used to map various working conditions such as single-phase grounding, two-phase short circuit, and normal operation to discrete integer coding. In the second stage of the fault localization task, the processing flow only screens out the samples of abnormal operation state and converts their corresponding fault line section identifiers into specific localization labels.

2.2.4. Class Balance Strategy Based on Weighted Sampling

In the actual operating data of a distribution network, normal samples often far outnumber fault samples, and the fault types occur with different probabilities (for example, single-phase grounding is significantly more frequent than two-phase short circuits). This long-tailed distribution biases the model toward predicting the majority class. To address this, a weighted random sampler is introduced, where the sample weight of each category is calculated as

W_{c} = \frac{N_{total}}{N_{c}}

(2)

where N_{total} is the total number of samples and N_{c} is the number of samples of category c. During batch construction, samples are drawn with replacement based on these weights to balance the participation probability across all categories.
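A small sketch of Eq. (2) and the with-replacement draw is given below, using only the standard library. The class counts are made up for illustration, and `sample_balanced_batch` is a hypothetical helper that stands in for whatever sampler the training pipeline uses.

```python
import random
from collections import Counter

def class_weights(labels):
    """Eq. (2): W_c = N_total / N_c for every class present in labels."""
    counts = Counter(labels)
    n_total = len(labels)
    return {c: n_total / n_c for c, n_c in counts.items()}

def sample_balanced_batch(labels, batch_size, rng):
    """Draw sample indices with replacement, weighted by the class weight."""
    w = class_weights(labels)
    per_sample = [w[y] for y in labels]
    return rng.choices(range(len(labels)), weights=per_sample, k=batch_size)

# Toy long-tailed label list (counts are illustrative only).
labels = ["normal"] * 900 + ["SLG"] * 80 + ["LL"] * 20
weights = class_weights(labels)
rng = random.Random(0)
batch = sample_balanced_batch(labels, 3000, rng)
drawn = Counter(labels[i] for i in batch)
```

Since every sample of class c carries weight N_total / N_c and there are N_c such samples, each class receives the same total weight, so the classes appear in roughly equal proportions in the drawn batch.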

The overall data pre-processing pipeline is summarized in Figure 3.

2.3. MS-CNN Module

Fault signals in distribution networks exhibit pronounced multi-scale time–frequency features, ranging from transient high-frequency oscillations to steady-state power-frequency offsets. A fixed convolution kernel size struggles to balance the capture of local abrupt changes against the preservation of global waveform trends. In this study, a multi-scale convolutional neural network (MS-CNN) module is designed as the backbone feature extractor; it obtains feature representations under different receptive fields through three parallel convolutional branches. Each branch adopts a convolution kernel of size 1 \times k: the small kernel focuses on high-frequency transient components and small signal mutations, while the larger kernels extract low-frequency waveform contours and global trends. This multi-granularity parallel processing mechanism enables the model to adaptively decouple the frequency components in complex fault signals. The detailed architecture of the MS-CNN module is illustrated in Figure 4.

To mitigate vanishing gradients in deep network training and improve the efficiency of feature propagation, residual blocks with skip connections are integrated into each convolution branch:

y = F (x, {W_{i}}) + x

(3)

The feature maps of the multi-branch outputs are concatenated in the channel dimension to form a mixed feature tensor containing rich multi-scale information. Specifically, each convolutional branch has 64 output channels, resulting in 192 channels after concatenation, which are then fused into 64 channels via the 1 \times 1 convolution. The residual blocks further expand the channel dimensions to 128 (stride 2) and then to 256 (stride 2). In order to realize the effective interaction of different scale features and control the complexity of the model, a 1 \times 1 convolution layer was deployed at the end of the module as a feature fuser. This layer fused the multi-scale features through weighted linear combination across channels while reducing the dimension. The resulting compact feature representation serves as a refined input to the subsequent SimAM and Transformer modules.
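The channel bookkeeping above can be traced with a short shape calculation. The channel counts (64 per branch, 192 after concatenation, 64 after fusion, then 128 and 256) and the stride of 2 come from the text; the input length of 1000, the “same” padding of k // 2 in the branches, and the kernel size of 3 with padding 1 inside the residual blocks are assumptions made only so the arithmetic can be carried out.

```python
def conv1d_out_len(length, kernel, stride=1, padding=0):
    """Output length of a 1-D convolution (standard floor formula)."""
    return (length + 2 * padding - kernel) // stride + 1

def ms_cnn_shapes(in_len, kernels=(3, 7, 11), branch_channels=64):
    """Return (stage, channels, length) tuples for the MS-CNN backbone."""
    # Assumed 'same' padding so all three branches keep the input length.
    lengths = [conv1d_out_len(in_len, k, padding=k // 2) for k in kernels]
    if len(set(lengths)) != 1:
        raise ValueError("branches must agree in length before concatenation")
    length = lengths[0]
    shapes = [("concat", branch_channels * len(kernels), length),
              ("fuse_1x1", 64, length)]
    # Two stride-2 residual blocks (kernel 3, padding 1 are assumptions).
    for name, channels in (("res_block_1", 128), ("res_block_2", 256)):
        length = conv1d_out_len(length, 3, stride=2, padding=1)
        shapes.append((name, channels, length))
    return shapes

shapes = ms_cnn_shapes(1000)
for stage, channels, length in shapes:
    print(f"{stage:12s} channels={channels:3d} length={length}")
```

Under these assumptions the 256 output channels line up with the embedding dimension of 256 used by the Transformer encoder, while the two stride-2 blocks shorten the sequence by a factor of four.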

2.4. Spatial Attention: SimAM Attention

Although deep convolutional networks have a strong ability to abstract features, they face a dilemma when dealing with weak fault signals in distribution networks: high-frequency transient details are easily smoothed as the network deepens, while background noise may be mistakenly amplified. Traditional attention mechanisms, such as SE-Net or CBAM, learn feature weights by introducing additional parameters, which increases the risk of overfitting. To address this, this paper introduces SimAM (Simple Attention Module) [26], which directly derives three-dimensional attention weights from the statistical properties of the signal itself based on the spatial inhibition effect in neuroscience.

2.4.1. Waveform Singularity Detection and Feature Recalibration

To detect transient singularities submerged in noise, SimAM reinterprets the attention mechanism from a signal processing perspective. For each target point t in the feature map, with x_{i} denoting the other points in the same channel, a binary signal-to-noise separator is formulated as

e_{t} (w_{t}, b_{t}, y, x_{i}) = {(y_{t} - \hat{t})}^{2} + \frac{1}{M - 1} \sum_{i = 1}^{M - 1} {(y_{o} - {\hat{x}}_{i})}^{2}

(4)

where \hat{t} = w_{t} t + b_{t} and {\hat{x}}_{i} = w_{t} x_{i} + b_{t} are linear transforms of the target point and of the other points, y_{t} and y_{o} are the labels assigned to them, and M is the number of points in the channel [26]. This energy function quantifies the “singularity” of each point by measuring its deviation from the surrounding statistical distribution. As illustrated in Figure 5, the process consists of three steps: computing the global mean and variance of the input feature map as a statistical baseline, evaluating each feature point using the energy function, and applying a Sigmoid activation to generate attention weights in the range [0, 1]. These weights are multiplied element-wise with the original feature map to suppress background noise and amplify fault transient features [26,27,28].
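The three steps can be illustrated on a single 1-D channel. The sketch follows the closed-form solution of the SimAM energy function as given in the original SimAM work [26], where the inverse minimal energy of a point is (t − μ)² / (4(σ² + λ)) + 0.5; the regularization value λ = 1e-4 and the toy signal are assumptions, and the paper's own settings may differ.

```python
import math

def simam_weights(channel, lam=1e-4):
    """Parameter-free attention weights for one 1-D feature channel."""
    n = len(channel) - 1                      # M - 1 in Eq. (4)
    mu = sum(channel) / len(channel)          # step 1: channel mean
    sq_dev = [(v - mu) ** 2 for v in channel]
    var = sum(sq_dev) / n                     # step 1: channel variance
    # Step 2: inverse minimal energy; larger means more distinctive.
    inv_energy = [d / (4.0 * (var + lam)) + 0.5 for d in sq_dev]
    # Step 3: Sigmoid maps the inverse energy to an attention weight.
    return [1.0 / (1.0 + math.exp(-e)) for e in inv_energy]

def simam_apply(channel, lam=1e-4):
    """Element-wise reweighting of the channel by its attention weights."""
    return [w * v for w, v in zip(simam_weights(channel, lam), channel)]

# Toy channel: low-level background with one transient spike at index 3.
signal = [0.1, -0.1, 0.05, 3.0, -0.05, 0.1, 0.0, -0.1]
weights = simam_weights(signal)
reweighted = simam_apply(signal)
```

No learnable parameters appear anywhere in this computation, which is the property the text contrasts with SE-Net and CBAM.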

2.4.2. Module Embedding Strategy

SimAM is embedded into the core path of each multi-scale residual block, positioned after the second Batch Normalization layer and before the residual addition:

\tilde{X} = X + σ (E) ⊙ X

(5)

where X is the feature map after the two-layer convolution transform. This “residual embedding” ensures that the attention mechanism fine-tunes only the residual mapping while the identity path remains intact, preserving weak fault information across layers and preventing feature degradation in deep networks.

2.5. Temporal Attention: Transformer Encoder

Although the MS-CNN with SimAM can sensitively capture high-frequency transient shocks, its receptive field remains limited to a local window by the size of the convolution kernel. Over the complete evolution of a distribution network fault waveform, from transient mutation to steady-state distortion, it is difficult for the CNN to establish long-range correlations across the full time domain.

To this end, a Transformer encoder takes the output of the CNN as its input and captures the global dependencies in the time series. The encoder consists of one Transformer block with four attention heads, an embedding dimension of 256, a feed-forward network hidden dimension of 512, and a dropout rate of 0.3.

2.5.1. Sequence Reconstruction and Multi-Head Self-Attention

Unlike convolution operations, which focus on “Channels”, transformers focus on “Sequences”. Therefore, the output tensor of MS-CNN is reshaped with the time dimension as the sequence length and the channel dimension as the feature embedding.

The central mechanism is multi-head self-attention [29]. It allows each time step in the sequence to attend directly to all other time steps, no matter how far apart they are. For the input sequence X, the Query, Key, and Value matrices are generated as

Q = X W_{Q}, K = X W_{K}, V = X W_{V}

(6)

where W_{Q}, W_{K}, and W_{V} are learnable linear projection matrices. The attention output is calculated by the scaled dot product:

Attention (Q, K, V) = Softmax (\frac{Q K^{T}}{\sqrt{d_{k}}}) V

(7)

Here, \sqrt{d_{k}} is the scaling factor, where d_{k} is the dimension of the key vectors; it keeps the Softmax from saturating, and its gradients from vanishing, when the dot products become large.
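Equations (6) and (7) can be traced with a single attention head in plain Python. The identity projection matrices and the three-step toy sequence are illustrative assumptions; a real encoder would learn W_Q, W_K, and W_V and run several heads in parallel.

```python
import math

def matmul(a, b):
    """Matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(q, k, v):
    """Eq. (7): Softmax(Q K^T / sqrt(d_k)) V, plus the weight matrix."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy sequence: 3 time steps, embedding dimension 2.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_q = w_k = w_v = [[1.0, 0.0], [0.0, 1.0]]   # identity projections (assumed)
q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)   # Eq. (6)
out, attn = scaled_dot_product_attention(q, k, v)
```

Each row of `attn` is a probability distribution over all time steps, which is what lets a late time step place weight on an early transient regardless of their separation.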

Physically, the attention weight matrix quantifies the correlation between different instants in the development of a fault. For example, there is an inherent causal relationship between the transient mutation at the beginning of the fault (time t_{1}) and the steady-state distortion several cycles later (time t_{2}). The multi-head self-attention (MSA) mechanism can assign a high weight to this kind of cross-period correlation, reconstructing the logical chain of fault evolution and compensating for the CNN's focus on local features.

2.5.2. Feed-Forward Network and Global Aggregation

The output of MSA is fed into a feed-forward network (FFN) after a residual connection (Add) and layer normalization (Norm). As shown in the middle of Figure 6, to enhance the non-linear representation and prevent overfitting, we use the GELU (Gaussian Error Linear Unit) activation function instead of the traditional ReLU:

GELU (x) = x Φ (x) \approx 0.5 x (1 + \tanh [\sqrt{2 / π} (x + 0.044715 x^{3})])

(8)

The smoothness of the GELU near zero makes its gradient propagation more stable when dealing with high-dimensional electrical features. Finally, the sequence features output by the Transformer are compressed into a single high-dimensional semantic vector through the Global Average Pooling layer. This vector condenses both the local multi-frequency details of the fault and the global temporal logic, and it is fed into the MLP Head shown on the right side of Figure 6. The projection head contains two fully connected layers (Linear Large/Small), in which the features are further refined by SiLU activation and Dropout layers before being routed to the Softmax (Stage 1) and Prototype Matching (Stage 2) modules for decision-making.
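As a quick check of Eq. (8), the exact form x Φ(x) and the tanh approximation can be compared numerically. This is only a sanity sketch; the evaluation grid from −5 to 5 is an arbitrary choice.

```python
import math

def gelu_exact(x):
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation from Eq. (8)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

grid = [i / 10.0 for i in range(-50, 51)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
print(f"largest gap on the grid: {max_gap:.2e}")
```

Unlike ReLU, both forms return small negative values for moderately negative inputs, so the function has no kink at zero.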

2.6. Decision Mechanisms for Task Decoupling

In order to balance the accuracy of fault classification and the fineness of fault localization, this paper designs a two-branch decision architecture, as shown on the right side of Figure 6. The extracted high-dimensional features are first mapped by a shared-structure MLP Head and are then shunted to two different decision paths of Stage 1 and Stage 2 according to the task requirements.

2.6.1. Nonlinear Feature Projection (MLP Head)

A shared MLP projection head maps the Transformer output to the decision space, consisting of LayerNorm, a linear expansion layer, SiLU activation, Dropout regularization, and a linear compression layer that outputs a compact feature vector z for both Stage 1 and Stage 2.

2.6.2. Softmax-Based Fault Classification

For the Stage 1 fault classification task (single-phase grounding, two-phase short circuit, etc.), the feature vector output by the MLP is fed directly into the Softmax classifier. This path seeks a linear decision boundary for each fault type in the feature space and calculates the conditional probability that a sample belongs to the i-th fault type:

P (y = i | x) = \frac{e^{z_{i}}}{\sum_{j = 1}^{C} e^{z_{j}}}

(9)

In this stage, the model is guided to quickly lock the macroscopic waveform features of the fault by minimizing the cross-entropy loss.
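Equation (9) and the cross-entropy objective can be sketched as follows. The optional smoothing argument mixes the one-hot target with a uniform distribution over the C classes, which is one common form of label smoothing; whether the training code uses exactly this form is an assumption, and the logits are toy values.

```python
import math

def softmax(z):
    """Eq. (9), computed in a numerically stable way."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, target, smoothing=0.0):
    """Cross-entropy of logits z against class index target.

    With smoothing > 0 the target becomes
    (1 - smoothing) * one_hot + smoothing / C (assumed form).
    """
    probs = softmax(z)
    c = len(z)
    loss = 0.0
    for i, p in enumerate(probs):
        t = (1.0 - smoothing) * (1.0 if i == target else 0.0) + smoothing / c
        loss -= t * math.log(p)
    return loss

logits = [2.0, 0.5, -1.0, 0.0]        # toy logits for four operating states
probs = softmax(logits)
plain = cross_entropy(logits, target=0)
smoothed = cross_entropy(logits, target=0, smoothing=0.1)
```

When the target is already the most probable class, smoothing raises the loss slightly, which discourages the classifier from pushing its confidence toward 1.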

2.6.3. Fault Localization Based on Prototype Networks

For the fault section localization task of Stage 2, the traditional linear classification often fails due to the high feature similarity between adjacent nodes. To this end, we introduce a Prototype Matching strategy [30] at the MLP output, as shown in Figure 6, transforming the localization problem into a metric learning problem in geometric space.

Suppose that the embedding vector output by the MLP is f_{ϕ}(x). For each fault localization category k, the model maintains a learnable parameter vector c_{k}, called a prototype. During inference, the squared Euclidean distance between the query sample and every prototype vector is calculated:

d (x, c_{k}) = {∥ f_{ϕ} (x) - c_{k} ∥}_{2}^{2}

(10)

To force the model to learn discriminative feature distributions, we employ a prototypical loss function whose essence is Softmax in the distance dimension:

L_{proto} = - log \frac{exp (- d (x, c_{y}))}{\sum_{k^{'}} exp (- d (x, c_{k^{'}}))}

(11)

This loss pulls samples of the same class toward their prototype center while pushing different prototypes apart, effectively separating previously aliased adjacent node clusters (see Section 3 for details).
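Equations (10) and (11) amount to a Softmax over negative squared distances, which the sketch below evaluates in log-sum-exp form. The two-dimensional prototypes and the query embedding are toy values chosen for illustration, not learned parameters.

```python
import math

def squared_distance(a, b):
    """Eq. (10): squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def prototypical_loss(embedding, prototypes, target):
    """Eq. (11): -log softmax over negative distances, at index target."""
    d = [squared_distance(embedding, c) for c in prototypes]
    m = max(-v for v in d)
    # log of the sum over k of exp(-d_k), computed stably.
    log_norm = m + math.log(sum(math.exp(-v - m) for v in d))
    return d[target] + log_norm

def nearest_prototype(embedding, prototypes):
    """Inference rule: index of the closest prototype."""
    d = [squared_distance(embedding, c) for c in prototypes]
    return d.index(min(d))

prototypes = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]   # toy section prototypes
z = [0.9, 0.1]                                       # toy query embedding
predicted = nearest_prototype(z, prototypes)
loss_correct = prototypical_loss(z, prototypes, target=1)
loss_wrong = prototypical_loss(z, prototypes, target=0)
```

The loss is small when the embedding sits close to its own prototype and far from the others, so minimizing it both tightens each cluster and widens the gaps between neighboring sections.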

2.6.4. Multi-Task Joint Optimization

In order to balance the performance of the two phases, the whole framework adopts an end-to-end joint training strategy. The total objective function is defined as the weighted sum of the two-stage losses:

L_{total} = L_{S t a g e 1} + λ L_{S t a g e 2}

(12)

The balance factor λ controls the gradient contribution of each task. During early training, the model focuses on learning common fault waveform features. As training progresses, the prototypical loss gradually refines the feature space, keeping it globally separable while producing more compact local manifolds for precise localization of adjacent nodes.
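A batch-level reading of Eq. (12) might look like the sketch below. The weighted sum itself is from the text; averaging the Stage 2 loss only over fault samples is an assumption motivated by the fact that normal samples carry no section label, and `batch_total_loss` is a hypothetical helper name.

```python
def total_loss(stage1_loss, stage2_loss, lam=0.2):
    """Eq. (12): L_total = L_Stage1 + lambda * L_Stage2."""
    return stage1_loss + lam * stage2_loss

def batch_total_loss(stage1_losses, stage2_losses, is_fault, lam=0.2):
    """Average Eq. (12) over a batch.

    Stage 1 is averaged over all samples; Stage 2 only over samples
    flagged as faults (assumption), and contributes 0 if there are none.
    """
    l1 = sum(stage1_losses) / len(stage1_losses)
    fault_losses = [l for l, f in zip(stage2_losses, is_fault) if f]
    l2 = sum(fault_losses) / len(fault_losses) if fault_losses else 0.0
    return total_loss(l1, l2, lam)

# Toy batch: the first sample is normal, so its Stage 2 value is ignored.
mixed = batch_total_loss([0.4, 0.6], [9.9, 1.0], [False, True])
all_normal = batch_total_loss([0.4, 0.6], [9.9, 1.0], [False, False])
```

With λ below 1, the classification loss dominates the gradient and the prototype term acts as a refinement of the shared feature space.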

3. Experimental Setup and Basic Performance Evaluation

3.1. Dataset Generation and Experimental Environment

In this paper, a standard IEEE 33-node radial distribution network [31] is built based on MATLAB R2024b/Simulink (SimPowerSystems) platform. The system consists of 33 nodes and 37 branches. The reference voltage is 12.66 kV and the reference frequency is 50 Hz. The system topology is shown in Figure 7.

Fault scenarios cover 37 line sections, 4 operating conditions (normal, single-phase-to-ground, line-to-line, and single-phase open circuit, each on phases A, B, and C), 4 fault-resistance levels (0.1 Ω, 1 Ω, 10 Ω, 25 Ω), and 4 inception angles (0°, 30°, 60°, 90°). Faults were applied at the midpoint of each line section. System loads were maintained at nominal levels, and voltage and current signals were recorded at all 33 nodes. Each event lasted 0.1 s and was sampled at 10 kHz. The MS-CNN kernel sizes (3, 7, 11) correspond to temporal windows of 0.3 ms, 0.7 ms, and 1.1 ms, respectively, for local multi-scale feature extraction, while the Transformer encoder captures global temporal dependencies across the full 0.1 s event (1000 time steps). The final data set contains 240,111 samples, which were divided at the fault-event level into training, validation, and test sets in a 6:2:2 ratio, ensuring that all window samples from the same event belonged exclusively to one subset to prevent data leakage. Stratified sampling based on the Stage 1 label (fault type) was applied to maintain consistent proportions across subsets.
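The event-level, stratified 6:2:2 split can be sketched as below. The event identifiers, labels, and window counts are hypothetical; the point of the sketch is only that the subset is assigned per event, so every window inherits the subset of its parent event.

```python
import random
from collections import defaultdict

def split_events(event_labels, ratios=(0.6, 0.2, 0.2), seed=0):
    """Assign each event to train/val/test, stratified by Stage 1 label.

    event_labels maps event_id -> fault-type label; the result maps
    event_id -> subset name.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for event_id, label in event_labels.items():
        by_label[label].append(event_id)
    assignment = {}
    for label, ids in by_label.items():
        ids = sorted(ids)          # fixed order before the seeded shuffle
        rng.shuffle(ids)
        n_train = round(ratios[0] * len(ids))
        n_val = round(ratios[1] * len(ids))
        for i, event_id in enumerate(ids):
            if i < n_train:
                assignment[event_id] = "train"
            elif i < n_train + n_val:
                assignment[event_id] = "val"
            else:
                assignment[event_id] = "test"
    return assignment

# Hypothetical data: 20 events, two fault types, 5 windows per event.
events = {f"evt{i}": ("SLG" if i % 2 else "LL") for i in range(20)}
assignment = split_events(events)
windows = [(f"evt{i}", w) for i in range(20) for w in range(5)]
subset_of_window = [assignment[event_id] for event_id, _ in windows]
```

Splitting individual windows at random instead would place near-identical slices of the same waveform in both training and test sets and inflate the reported accuracy.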

For the fault localization task of Stage 2, its data set is directly derived from the above partitioning results. Specifically, the training set and the test set of Stage 2 are subsets of all non-normal (i.e., fault) samples in the corresponding set of Stage 1, respectively. This approach maintains the homology of the data distribution of the two stages.

All experiments were implemented in the PyTorch deep learning framework on a machine with an Intel Core i5-10200H CPU and an Nvidia GeForce RTX 1650 (4 GB) GPU. The AdamW optimizer (weight decay 1 \times 10^{-4}) was used for model training, with an initial learning rate of 0.001 and a cosine annealing learning rate scheduler. The label smoothing factor was set to 0.1, and the gradient clipping threshold was 1.0. Early stopping with a patience of 10 epochs and a minimum delta of 0.0001 was adopted to prevent overfitting. The batch size was 256 and the maximum number of epochs was 30. For the prototype network in Stage 2, the prototype vectors were initialized using nn.Parameter (torch.randn × 0.02), and the prototype loss weight λ was set to 0.2. These hyperparameters follow common practices in deep learning-based fault diagnosis: AdamW mitigates overfitting, cosine annealing ensures smooth convergence, gradient clipping stabilizes training, and early stopping prevents unnecessary epochs. The prototype loss weight λ = 0.2 balances the metric learning and cross-entropy losses.
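Two of these training controls are easy to illustrate without a deep learning framework. The sketch below uses the stated initial learning rate (0.001), maximum of 30 epochs, patience of 10, and minimum delta of 0.0001; the minimum learning rate of 0, the annealing period equal to the maximum epoch count, and the choice of validation loss as the monitored quantity are assumptions.

```python
import math

def cosine_annealing_lr(epoch, max_epochs=30, lr_init=1e-3, lr_min=0.0):
    """Cosine annealing from lr_init at epoch 0 to lr_min at max_epochs."""
    cos_term = 1.0 + math.cos(math.pi * epoch / max_epochs)
    return lr_min + 0.5 * (lr_init - lr_min) * cos_term

class EarlyStopping:
    """Stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best = math.inf
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

schedule = [cosine_annealing_lr(e) for e in range(31)]

# Toy loss history: improves twice, then plateaus.
stopper = EarlyStopping()
history = [1.0, 0.5] + [0.5] * 12
stop_epoch = next(i for i, loss in enumerate(history) if stopper.step(loss))
```

An improvement smaller than the minimum delta counts as no improvement, so a slowly flattening loss curve still triggers the stop after ten such epochs.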

3.2. Evaluation Indicators

In order to comprehensively evaluate the performance of the model in fault classification and fault localization tasks, this paper uses accuracy, precision, recall and F1-score as evaluation indicators. The calculation formulas are as follows:

\begin{matrix} Accuracy & = \frac{T P + T N}{T P + T N + F P + F N} \\ Precision & = \frac{T P}{T P + F P} \\ Recall & = \frac{T P}{T P + F N} \\ F 1 - Score & = 2 \times \frac{Precision \times Recall}{Precision + Recall} \end{matrix}

(13)

where TP, TN, FP, and FN represent the number of true positive, true negative, false positive, and false negative samples, respectively.
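For the multi-class case, the metrics in Eq. (13) are computed per class in a one-vs-rest fashion and then averaged without weighting (macro-average). The sketch below shows this on a six-sample toy example; the labels and predictions are invented for illustration.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1-score."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return {"accuracy": accuracy,
            "precision": sum(precisions) / n,
            "recall": sum(recalls) / n,
            "f1": sum(f1s) / n}

y_true = ["SLG", "SLG", "SLG", "LL", "LL", "normal"]
y_pred = ["SLG", "SLG", "LL", "LL", "LL", "normal"]
m = macro_metrics(y_true, y_pred)
```

Because every class contributes equally to the macro-average, a rare fault type that is classified poorly lowers the score as much as a common one would.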

3.3. Training Performance Analysis of the Model

3.3.1. Training Convergence and Stability Analysis

As can be seen from the Stage 1 curve in Figure 8a, the loss drops steeply at the initial stage of training. This fast convergence suggests that the MS-CNN backbone has a strong feature decoupling capability, enabling it to quickly lock onto the prominent transient impact and power frequency distortion components in the fault current and to leave the underfitting region within a few epochs. As training progresses, the test accuracy reaches a peak of 98.61% at epoch 16 (Best Ep.16). Thereafter, the training loss stabilizes within a very low range (approximately 0.05) without severe oscillations, indicating stable and robust training.

Figure 8b shows the training process for the more challenging Stage 2 fault localization task. The loss curve of Stage 2 exhibits a continuous, smooth, and monotonically decreasing convergence trend, driving the test accuracy to rise steadily and reaching a maximum of 94.22% at epoch 24 (Best Ep.24). This is primarily attributed to the prototypical loss introduced at this stage: unlike traditional cross-entropy, which merely seeks decision boundaries, the prototypical loss requires constant adjustment of geometric distances in a high-dimensional space to “pull” samples of the same class toward their class center. This strict geometric constraint smooths the optimization landscape, enabling the optimizer to perfectly avoid local minima during fine-grained discrimination.

3.3.2. Quantitative Evaluation of Diagnostic Performance

Table 1 summarizes the core test-set metrics of the proposed MS-CNN-SimAM-Transformer framework for each stage. Four metrics are reported: accuracy, precision, recall, and F1-score. The last three are macro-averaged to remove the evaluation bias caused by the uneven distribution of samples across classes.

As shown in Table 1, Stage 1 achieves an accuracy of 98.61% and a recall of 98.64%, ensuring reliable fault detection. Stage 2 achieves 94.22% accuracy on the more challenging 37-class fine-grained localization task. To further evaluate the practical performance of the cascaded system, Table 2 reports the end-to-end metrics. The missed detection rate is only 0.83%, meaning that only 0.83% of fault samples are misclassified as normal by Stage 1 and never reach Stage 2. This limits the upper bound of the cascaded localization rate to 99.17%. In practice, the final effective localization rate reaches 93.62%, only 0.60 percentage points below the standalone Stage 2 accuracy (94.22%). This indicates that the high recall of Stage 1 keeps error propagation small, and that the cascaded architecture incurs negligible performance loss while benefiting from pre-filtering of normal samples.
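The end-to-end quantities reported in Table 2 can be computed from per-sample cascade outputs roughly as sketched below. The definitions are assumptions inferred from the row labels of Table 2: the false trip rate is taken over normal samples, while the missed detection rate and the effective localization rate are taken over fault samples. The tuple layout and the name `cascade_metrics` are illustrative.

```python
def cascade_metrics(samples):
    """Evaluate a two-stage cascade.

    Each sample is (is_fault, stage1_says_fault, true_line, predicted_line).
    Stage 2 output is only meaningful when Stage 1 flags the sample as a fault.
    """
    normal = [s for s in samples if not s[0]]
    faults = [s for s in samples if s[0]]
    false_trips = sum(1 for s in normal if s[1])                    # Normal -> Fault
    missed = sum(1 for s in faults if not s[1])                     # Fault -> Normal
    located = sum(1 for s in faults if s[1] and s[2] == s[3])       # detected and localized
    return {
        "false_trip_rate": false_trips / len(normal),
        "missed_detection_rate": missed / len(faults),
        "effective_localization_rate": located / len(faults),
    }


# Synthetic toy data, not results from this study.
toy = [
    (False, False, None, None),
    (False, True, None, 3),
    (True, True, 5, 5),
    (True, True, 5, 6),
    (True, False, 7, None),
    (True, True, 2, 2),
]
print(cascade_metrics(toy))
```

Under these definitions the effective localization rate can never exceed one minus the missed detection rate, which is the 99.17% upper bound quoted above.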

Figure 9 shows the normalized confusion matrix on the test set, which intuitively reveals the discriminant details and error distribution features of the model at different task stages.

As shown in Figure 9a, the Stage 1 confusion matrix confirms the model’s high classification accuracy across all four operating conditions. Normal operation is correctly identified in 98.1% of cases, with only 1.9% misclassified as single-phase open circuit. Single-phase grounding achieves 99.7% recall, demonstrating the model’s sensitivity to weak transient signals. Phase-to-phase faults reach 98.5% recall, and single-phase open circuit reaches 98.2% recall. These results support the effectiveness of the MS-CNN backbone in extracting discriminative features from the multi-channel input, particularly for the most challenging single-phase grounding category.

Figure 9b shows the confusion matrix for the 37-line fault localization task. The matrix exhibits clear diagonal dominance, with most diagonal elements exceeding 94% recall. The sparse off-diagonal errors follow an “electrical proximity” pattern, concentrating between electrically adjacent lines such as Line 9 and Line 36, or Line 25 and Line 31 due to their extremely short electrical distances. Nevertheless, no large-scale cross-region misjudgment occurs, confirming that the spatio-temporal dual attention mechanism provides strong fine-grained discriminative ability.

To further explain the model’s performance from the perspective of feature-space evolution, the t-SNE algorithm is used to project the high-dimensional semantic features of the test samples into two dimensions; the results are shown in Figure 10. As shown in Figure 10a, the features of different fault types in Stage 1 form clearly separated clusters, indicating that the spatio-temporal dual attention mechanism effectively maps cross-scale fault signals into a discriminative feature space. In Figure 10b, the feature distribution of the 37 line sections in Stage 2 shows a well-clustered structure, with prototypes acting as compact class centers that facilitate accurate localization of electrically adjacent nodes.

3.4. Comparative Experimental Analysis

To assess the MS-CNN-SimAM-Transformer framework against existing methods, seven mainstream benchmark models are selected for comparison: the traditional machine learning method SVM and the deep learning models 1D-CNN, LSTM, OS-CNN, LSTM-DenseNet [12], HHT-CNN [21], and ResBlock-CBAM-CNN [23]. All comparison models were trained on the same dataset under the same experimental conditions.

The data in Table 3 reveal a clear pattern: all deep learning models exceed 96% accuracy in Stage 1 (fault classification), but most hit a performance bottleneck in Stage 2 (fault localization). Taking the basic 1D-CNN as an example, it achieves 96.82% accuracy in classification but falls to 53.79% in localization, a drop of more than 40 percentage points. Even the LSTM, despite its temporal memory, only reaches 65.68% in the localization task.

This dramatic gap reflects the inherent difficulty of the two-stage task: while fault classification relies on macroscopic morphological differences in voltage and current waveforms (such as the surge in short-circuit amplitude), fault localization requires the deduction of subtle changes in line impedance. In the IEEE 33-bus system, the extremely short electrical distance between adjacent nodes results in highly overlapping fault features. Conventional Softmax classifiers struggle to construct effective decision boundaries between such dense clusters, leading to the decline in accuracy.

Hybrid architectures significantly improve this situation. LSTM-DenseNet [12] reaches an accuracy of 82.88%, while HHT-CNN [21] and ResBlock-CBAM-CNN [23] reach 92.70% and 93.59%, respectively. The proposed framework achieves the highest accuracy in both stages, with 98.61% in Stage 1 and 94.22% in Stage 2, surpassing HHT-CNN by 1.52 percentage points and ResBlock-CBAM-CNN by 0.63 percentage points in Stage 2 localization. We attribute this advantage to the synergy between the spatio-temporal dual attention and the prototype metric learning. At the feature extraction level, SimAM offers better signal fidelity than the CBAM module in ResBlock-CBAM-CNN. Unlike CBAM, which uses dimensionality reduction, SimAM directly calculates three-dimensional attention weights based on an energy function, preserving the integrity of weak fault signals without losing structural information. At the same time, the Transformer encoder avoids the vanishing-gradient problem that RNNs face on long sequences. Its parallel self-attention mechanism captures the global dependence of the fault from transient inception to steady-state evolution. Finally, the prototype network optimizes the geometric structure of the feature space directly, giving better discriminative ability than the conventional Softmax-based decision in HHT-CNN.
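To make the energy-based weighting attributed to SimAM above more concrete, the following sketch applies the closed-form SimAM reweighting from the original SimAM paper [26] to a single feature channel. It is a plain-Python illustration rather than the implementation used in this study; the regularization constant `e_lambda = 1e-4` and the function name `simam_channel` are assumptions.

```python
import math


def simam_channel(x, e_lambda=1e-4):
    """Reweight one channel (a flat list of at least two activations) in SimAM style.

    Activations that deviate strongly from the channel mean receive a larger
    inverse energy and therefore a larger sigmoid weight. No parameters are learned.
    """
    n = len(x) - 1
    mu = sum(x) / len(x)
    sq_dev = [(v - mu) ** 2 for v in x]
    variance = sum(sq_dev) / n
    out = []
    for v, d in zip(x, sq_dev):
        inv_energy = d / (4.0 * (variance + e_lambda)) + 0.5
        weight = 1.0 / (1.0 + math.exp(-inv_energy))
        out.append(v * weight)
    return out


# A flat channel with one spike: the spike keeps a larger share of its amplitude.
print(simam_channel([1.0, 1.0, 1.0, 1.0, 10.0]))
```

Because the weight is computed per element from channel statistics alone, no pooling or channel reduction is involved, which is the property contrasted with CBAM in the paragraph above.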

The above results verify the effectiveness of the proposed model under standard working conditions. However, in the actual distribution network, the influence of noise interference and component contribution on the accuracy of the model still needs to be further analyzed in the following sections.

3.5. Cross-Scenario Generalization Validation

To further evaluate the generalization ability of the proposed framework across different network topologies, additional experiments were conducted on a 10 kV distribution network [32,33]. Unlike the IEEE 33-node system (33 nodes, 37 line sections), this network consists of seven nodes and seven line sections fed by four feeders, representing a structurally distinct topology. Faults are generated on all seven line sections under the same simulation configurations as the IEEE 33-node system, including four operating conditions (normal, single-phase-to-ground, line-to-line, and single-phase open circuit, each on phases A, B, and C), four fault-resistance levels, and four inception angles, ensuring comparability with the previous experiments.

As shown in Table 4, the model achieves 96.23% accuracy in Stage 1 and 93.19% accuracy in Stage 2 on the 10 kV network. Table 5 further reports the end-to-end cascade metrics on this network: the final effective localization rate reaches 91.85%, with a false trip rate of 3.82% and a missed detection rate of 1.56%. Compared with the IEEE 33-node results, the performance on the 10 kV network is slightly lower, which is expected given the different topology and the limited number of training samples available for the smaller network. Nevertheless, the model maintains high diagnostic accuracy across both topologies, demonstrating strong generalization capability.

The topology of the 10 kV distribution network is illustrated in Figure 11.

4. Model Ablation and Robustness Assessment

4.1. Ablation Experiments

As shown in the comparative analysis in Section 3.4, fault localization (Stage 2) is more challenging than fault classification and is the main bottleneck restricting the overall performance of the system. The ablation experiments therefore focus on the fault localization stage, in order to examine the specific contribution of each proposed module to this critical task.

To this end, we constructed a benchmark model (denoted as Backbone) to verify the effectiveness of the multi-scale (MS) module and the attention mechanism (SimAM). As the “Single-scale Backbone” of our framework, the benchmark model uses exactly the same training strategy as the full model (optimizer configuration, learning rate schedule, etc.) to ensure a fair internal comparison.

Specifically, to quantify the performance contribution of each core component, we designed five experimental configurations that add components progressively to the benchmark model, as summarized in Table 6:

Model 0 (Backbone): Basic single-stream feature extractor. Only standard convolutional layers are used to extract features, and a fully connected layer directly predicts the fault section label.
Model 1 (Model 0 + Multi-scale module): The multi-scale module is added to Model 0, using parallel convolution branches with different kernel sizes.
Model 2 (Model 1 + SimAM attention): A parameter-free attention mechanism is embedded at the back end of the feature extractor of Model 1, with three-dimensional attention weights computed from an energy function.
Model 3 (Model 2 + Transformer encoder): A Transformer encoder is added to process the sequence features through multi-head self-attention.
Model 4 (Ours): The complete proposed framework. A prototype network replaces the conventional classification head, and localization is performed by computing the metric distance between each sample and the class prototypes.
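The distance-based decision used in Model 4 can be sketched as follows. This is a minimal illustration in the spirit of prototypical networks [30], assuming squared Euclidean distance and a softmax over negative distances; the toy two-dimensional embeddings and the name `prototype_predict` are illustrative and do not reflect the dimensions used in this study.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def prototype_predict(embedding, prototypes):
    """Return (predicted_class, class_probabilities) for one embedded sample.

    The predicted class is the nearest prototype; probabilities are a softmax
    over negative squared distances.
    """
    dists = [squared_distance(embedding, p) for p in prototypes]
    logits = [-d for d in dists]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    pred = min(range(len(dists)), key=dists.__getitem__)
    return pred, probs


# Two nearby prototypes (standing in for electrically adjacent lines) and a distant one.
prototypes = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
print(prototype_predict([0.9, 0.1], prototypes))
```

In this formulation, two classes are only confused when their prototypes lie close together in the embedding space, so training that spreads the prototypes apart directly targets the adjacent-line errors discussed earlier.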

To quantitatively assess the specific contribution of each module to the fault localization performance, we recorded the accuracy, precision, recall, and F1-Score of five progressive variants on the test set. The detailed experimental results are summarized in Table 7. From the data in the table, it can be intuitively observed that, with the gradual introduction of key components, the indicators of the model show a steady upward trend, which verifies the rationality of the proposed framework design.

The ablation results in Table 7 show that each module contributes to the overall performance. Adding multi-scale convolution (M1) improves accuracy by 1.93 percentage points and F1-score by 0.94 percentage points over the single-scale backbone (M0). The SimAM module (M2) further raises precision to 92.05% by suppressing redundant features. The Transformer encoder (M3) brings the largest gain, with accuracy rising by 3.43 percentage points to 93.95%, confirming that global temporal modeling complements the CNN’s limited local receptive field. Finally, the prototype network (M4) refines the feature space and achieves the best overall performance.

With the prototype network added in the complete model M4, the F1-score reaches 93.94%. Although M3 already localizes faults well, it still misjudges adjacent nodes with similar features. The prototype network computes Euclidean distances to the class prototypes, which pulls samples of the same class together and pushes different classes apart; this reduces the boundary ambiguity caused by feature aliasing and further improves localization accuracy.
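The clustering effect described above can be illustrated with a small loss computation. The sketch assumes a joint objective of the form L = L_CE + λ·L_proto with λ = 0.2, where L_proto is the negative log-probability of the true class under a softmax over negative squared distances. This form, the omission of label smoothing, and all function names are assumptions for illustration; the exact objective is the one defined in Section 2.6.4.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def cross_entropy(logits, target):
    """Cross-entropy of a single sample with an integer class label."""
    return -log_softmax(logits)[target]


def prototype_loss(embedding, prototypes, target):
    """Cross-entropy over negative squared distances to the class prototypes."""
    logits = [-sum((e - p) ** 2 for e, p in zip(embedding, proto)) for proto in prototypes]
    return cross_entropy(logits, target)


def joint_loss(class_logits, embedding, prototypes, target, lam=0.2):
    """Assumed joint objective: classification loss plus weighted prototype loss."""
    return cross_entropy(class_logits, target) + lam * prototype_loss(embedding, prototypes, target)


protos = [[0.0, 0.0], [2.0, 0.0]]
# The same class-0 sample, embedded near its prototype and near the class boundary.
print(prototype_loss([0.1, 0.0], protos, 0), prototype_loss([0.9, 0.0], protos, 0))
```

The sample near the boundary incurs the larger prototype loss, so gradient descent on this term moves embeddings toward their own class center and away from neighboring prototypes.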

Further ablation experiments are conducted to validate the contribution of power features (P, Q), which are introduced in Section 2.2 to encode the phase relationship between voltage and current. As shown in Figure 12, incorporating P and Q significantly improves performance across all metrics. Stage 1 accuracy increases from 94.57% to 98.61%, and Stage 2 accuracy from 92.15% to 94.22%. The final effective localization rate rises from 87.54% to 93.62%, while the missed detection rate drops sharply from 4.62% to 0.83%. These results confirm that P and Q provide complementary phase-related information beyond voltage and current magnitudes.

4.2. Noise Robustness Experiments

In practical industrial settings, power equipment often operates in a complex electromagnetic environment, and the collected sensor signals are inevitably contaminated by various background noises [34,35]. To evaluate the robustness of the proposed model under diverse interference conditions, five types of noise are introduced during data preprocessing:

(1) Gaussian white noise: Random Gaussian noise N(0, σ²) is added directly to the raw signals. The noise amplitude is controlled by the target SNR, where a lower SNR corresponds to higher noise intensity.

(2) Pulse noise: Randomly selected sampling points are replaced by large-amplitude impulses. At 35 dB, 2% of elements are affected at 1.5× the signal amplitude; at 5 dB, 20% of elements are affected at 8× the signal amplitude. The parameters vary linearly between these two SNR extremes.

(3) Measurement error: All elements are multiplied by a random scaling factor. At 35 dB, the overall offset is ±2%; at 5 dB, the offset reaches ±20%, with linear variation between these bounds.

(4) Sensor dropout: Randomly selected elements are set to zero. At 35 dB, 5% of elements are dropped; at 5 dB, the dropout rate reaches 30%, with linear scaling in between.

(5) Communication delay: Randomly selected elements are cyclically shifted. At 35 dB, 8% of elements are shifted by 2 positions; at 5 dB, 40% of elements are shifted by 6 positions, with linear variation.

All noise types are tested across seven SNR levels from 35 dB to 5 dB with a step size of 5 dB, where the SNR is defined as the ratio of signal power to noise power. At 35 dB, the noise is barely perceptible; at 5 dB, the parameters reach their maximum settings, representing extreme interference conditions.
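The noise settings above can be sketched in a few lines. The example below shows the linear interpolation of a noise parameter between its 35 dB and 5 dB values, Gaussian noise scaled to a target SNR, and sensor dropout. It assumes that signal power is the mean square of the samples and that dropout is applied independently per element; the function names are illustrative, and the remaining three noise types would follow the same pattern.

```python
import math
import random


def interp_param(snr_db, value_at_35, value_at_5):
    """Linearly interpolate a noise parameter between its 35 dB and 5 dB settings."""
    t = (35.0 - snr_db) / 30.0
    return value_at_35 + t * (value_at_5 - value_at_35)


def add_gaussian_noise(signal, snr_db, rng):
    """Add zero-mean Gaussian noise whose power yields the requested SNR in dB."""
    power = sum(v * v for v in signal) / len(signal)
    sigma = math.sqrt(power / (10.0 ** (snr_db / 10.0)))
    return [v + rng.gauss(0.0, sigma) for v in signal]


def sensor_dropout(signal, snr_db, rng):
    """Zero random elements; the rate goes from 5% at 35 dB to 30% at 5 dB."""
    rate = interp_param(snr_db, 0.05, 0.30)
    return [0.0 if rng.random() < rate else v for v in signal]


rng = random.Random(0)
wave = [math.sin(2.0 * math.pi * k / 64.0) for k in range(640)]  # synthetic test waveform
for snr in range(35, 0, -5):  # the seven levels: 35, 30, ..., 5 dB
    noisy = add_gaussian_noise(wave, snr, rng)
    dropped = sensor_dropout(wave, snr, rng)
    print(snr, round(interp_param(snr, 0.05, 0.30), 3), sum(1 for v in dropped if v == 0.0))
```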

Figure 13 presents the accuracy under these five noise types. The model maintains classification accuracy above 82% even at 5 dB for all noise types. However, Stage 2 localization degrades more markedly, dropping to approximately 60–67% at 5 dB. This is because localization relies on fine-grained waveform features, which are easily submerged by strong noise, whereas classification depends on overall waveform morphology and is less affected. Among the noise types, sensor dropout shows the highest robustness at low SNRs (67.25% at 5 dB), suggesting that the model’s multi-channel input provides redundancy that mitigates the loss of individual channels.

Despite the performance degradation under extremely low SNRs, the model demonstrates graceful degradation rather than catastrophic failure and maintains practical accuracy for SNR ≥ 15 dB. For reference, an SNR of 10 dB corresponds to a noise amplitude of approximately 32% of the signal amplitude, while 5 dB corresponds to approximately 56%, meaning the fault waveform is heavily corrupted and barely distinguishable from the background noise. Such extreme noise conditions are rare in practical distribution networks and typically result from severe electromagnetic interference or sensor degradation. The progressive parameter design ensures that the evaluation covers a wide range of interference severities, providing a rigorous stress test of model robustness under worst-case scenarios.
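The amplitude figures quoted above follow directly from the power-ratio definition of the SNR. As a short worked derivation, with A_s and A_n denoting the RMS amplitudes of the signal and the noise (symbols introduced here for illustration):

```latex
\mathrm{SNR}_{\mathrm{dB}} = 10\log_{10}\frac{P_s}{P_n} = 20\log_{10}\frac{A_s}{A_n}
\quad\Longrightarrow\quad
\frac{A_n}{A_s} = 10^{-\mathrm{SNR}_{\mathrm{dB}}/20},
\qquad
10^{-10/20} \approx 0.316, \quad 10^{-5/20} \approx 0.562 .
```

These values correspond to the approximately 32% and 56% noise amplitudes stated for 10 dB and 5 dB, respectively.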

5. Conclusions

To address the difficulty of fault feature extraction and the low localization accuracy for adjacent nodes in distribution networks, this paper proposes a two-stage fault diagnosis framework based on MS-CNN and spatio-temporal dual attention. A spatio-temporal dual attention mechanism is constructed that integrates SimAM spatial screening with Transformer temporal modeling, compensating for the limitations of a single convolutional network in capturing multi-dimensional features.

At the decision level of the fault localization stage, a prototype-based metric learning strategy replaces the conventional Softmax classification. The geometry of the feature space is optimized by minimizing the Euclidean distance between samples and their class prototypes, which yields classification boundaries with a safety margin and alleviates the bottleneck in fault localization.

The experimental results show that the proposed method achieves 98.61% accuracy in fault classification and 94.22% in fault localization, significantly reducing misjudgment caused by similar electrical distances between adjacent nodes. Cross-scenario validation on a 10 kV distribution network further confirms its generalization capability across different topologies. Moreover, the model maintains reliable diagnostic performance at SNR ≥ 15 dB, with graceful degradation under extremely low-SNR conditions.

Future work will focus on improving the model’s noise robustness under extremely low-SNR conditions, extending the proposed framework to dynamic reconfiguration scenarios, and exploring lightweight deployment schemes suitable for edge computing devices.

Author Contributions

Conceptualization, Y.Y. and Z.C.; methodology, Z.C.; software, Z.C. and H.Z.; validation, J.H. and H.Z.; formal analysis, Y.Y.; investigation, W.Z.; resources, Y.Y.; data curation, J.H.; writing—original draft preparation, Z.C.; writing—review and editing, W.Z. and Y.Y.; visualization, H.Z.; supervision, W.Z.; project administration, Y.Y.; funding acquisition, Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Guangdong Power Grid Co., Ltd. under the project “Auxiliary Decision-Making for Distribution Network Faults Based on Computer Vision and Knowledge Graph Technologies” (Project No. 031200KC23120020).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

Authors Ying Yang, Jinyi Huang, and Hao Zhu were employed by the company Zhaoqing Power Supply Bureau, Guangdong Power Grid Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Subramanian, N.; Stonier, A.A. A comprehensive review on selective harmonic elimination techniques and its permissible standards in electrical systems. IEEE Access 2024, 12, 141966–141998.
2. Teng, Y.; Zhang, Z.; Li, X. Improved approach to high-frequency current injection-based protection for grounding electrode line in high-voltage direct current system. IEEE Trans. Ind. Appl. 2020, 56, 2409–2417.
3. Daniel, K.; Kütt, L.; Iqbal, M.N.; Shabbir, N.; Raja, H.A.; Sardar, M.U. A review of harmonic detection, suppression, aggregation, and estimation techniques. Appl. Sci. 2024, 14, 10966.
4. Zhou, T.; Yuan, T.; Wei, S.; He, H.; Yang, Q. Temperature and composition of AC arc plasma of medium voltage distribution networks in the air. J. Phys. D Appl. Phys. 2022, 55, 245201.
5. Zhou, T.; Yang, Q.; Yuan, T.; He, H.; Liu, H. A novel mathematical-physical arc model and its application to the simulation of high-impedance arc faults in distribution networks. IEEE Trans. Power Deliv. 2024, 39, 1794–1806.
6. Toader, D.; Vintan, M. Mathematical Models of the Phase Voltages of High-, Medium-and Low-Voltage Busbars in a Substation during a Phase-to-Ground Fault on High-Voltage Busbars. Mathematics 2023, 11, 3032.
7. Zhu, J.; Lubkeman, D.L.; Girgis, A.A. Automated fault location and diagnosis on electric power distribution feeders. IEEE Trans. Power Deliv. 1997, 12, 801–809.
8. Salim, R.H.; de Oliveira, K.R.C.; Filomena, A.D.; Resener, M.; Bretas, A.S. Hybrid fault diagnosis scheme implementation for power distribution systems automation. IEEE Trans. Power Deliv. 2008, 23, 1846–1856.
9. Mora-Florez, J.; Meléndez, J.; Carrillo-Caicedo, G. Comparison of impedance based fault location methods for power distribution systems. Electr. Power Syst. Res. 2008, 78, 657–666.
10. Chang, N.; Song, G.; Hou, J.; Chang, Z. Fault identification method based on unified inverse-time characteristic equation for distribution network. Int. J. Electr. Power Energy Syst. 2023, 146, 108734.
11. Xiao, F.; Wu, M.; Song, K.; Lu, T.; Ai, Q. Diagnosis of distribution network fault using multiresolution S-transform and modified convolution neural network. Int. J. Electr. Power Energy Syst. 2024, 162, 110294.
12. Ji, L.; Tian, X.; Wei, Z.; Zhu, D. Intelligent fault diagnosis in power distribution networks using LSTM-DenseNet network. Electr. Power Syst. Res. 2025, 239, 111202.
13. Chen, K.; Hu, J.; Zhang, Y.; Yu, Z.; He, J. Fault location in power distribution systems via deep graph convolutional networks. IEEE J. Sel. Areas Commun. 2019, 38, 119–131.
14. Rezapour, H.; Jamali, S.; Bahmanyar, A. Review on artificial intelligence-based fault location methods in power distribution networks. Energies 2023, 16, 4636.
15. Cao, Y.; Tang, J.; Shi, S.; Cai, D.; Zhang, L.; Xiong, P. Fault diagnosis techniques for electrical distribution network based on artificial intelligence and signal processing: A review. Processes 2024, 13, 48.
16. Ngo, Q.H.; Nguyen, B.L.; Zhang, J.; Schoder, K.; Ginn, H.; Vu, T. Deep graph neural network for fault detection and identification in distribution systems. Electr. Power Syst. Res. 2025, 247, 111721.
17. Zhang, H.; Qu, Z.; Qin, K.; Han, J.; Ren, K. Construction of Sample Data for Line Disconnection Faults in Distribution Networks Aimed at Deep Learning Fault Diagnosis. In Proceedings of the 5th Power System and Green Energy Conference (PSGEC), Hong Kong, China, 20 August 2025; pp. 376–380.
18. Shafei, A.P.; Silva, J.F.; Monteiro, J. Convolutional neural network approach for fault detection and characterization in medium voltage distribution networks. e-Prime-Adv. Electr. Eng. Electron. Energy 2024, 10, 100820.
19. Mo, H.; Peng, Y.; Wei, W.; Xi, W.; Cai, T. SR-GNN based fault classification and location in power distribution network. Energies 2022, 16, 433.
20. Lu, T.; Hou, S. Fault Location Algorithm for Distribution Network with Distributed Generation Based on Domain-Adaptive TGATv2. IET Gener. Transm. Distrib. 2025, 19, e70033.
21. Guo, M.F.; Yang, N.C.; Chen, W.F. Deep-learning-based fault classification using Hilbert-Huang transform and convolutional neural network in power distribution systems. IEEE Sens. J. 2019, 19, 6905–6913.
22. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
23. Yao, Y.; Ma, H.; Gong, C.; Li, Y.; Zhao, Q.; Wei, N.; Yang, B. A Real Data-Driven Fault Diagnosing Method for Distribution Networks Based on ResBlock-CBAM-CNN. Electricity 2025, 6, 19.
24. Li, X.; Li, G.; Wang, T.; Dong, Z.; Qin, Z.; Chu, F. Knowledge extraction and retrieval-augmented generation for intelligent maintenance of wind power equipment based on graph attention networks. Chin. J. Mech. Eng. 2025, 38, 100141.
25. Teng, J.; Sun, Y.; Song, N.; Gan, Y.; Li, Y.; Hou, X. A power system fault assessment methodology from a power-based perspective. In Proceedings of the 2025 International Conference on New Power System Technology (PowerCon), Hefei, China, 24–25 September 2025; pp. 1–6.
26. Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. Simam: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Virtual, 1 July 2021; pp. 11863–11874.
27. Tang, Z.; Hou, X.; Huang, X.; Wang, X.; Zou, J. Domain adaptation for bearing fault diagnosis based on SimAM and adaptive weighting strategy. Sensors 2024, 24, 4251.
28. Yang, W.; Gu, J.; Xie, X.; Wei, X.; Ye, H. Lightweight shuffle-SimAM network-based open-circuit fault diagnosis of grid-connected cascaded H-bridge inverters. J. Power Electron. 2025, 25, 565–577.
29. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
30. Snell, J.; Swersky, K.; Zemel, R. Prototypical networks for few-shot learning. Adv. Neural Inf. Process. Syst. 2017, 30, 4074–4082.
31. Baran, M.E.; Wu, F.F. Network reconfiguration in distribution systems for loss reduction and load balancing. IEEE Trans. Power Deliv. 1989, 4, 1401–1407.
32. Schneider, K.P.; Mather, B.A.; Pal, B.C.; Ten, C.-W.; Shirek, G.J.; Zhu, H.; Fuller, J.C.; Pereira, J.L.R.; Ochoa, L.F.; de Araujo, L.R.; et al. Analytic considerations and design basis for the IEEE distribution test feeders. IEEE Trans. Power Syst. 2018, 33, 3181–3188.
33. Wang, J.; Zhang, B.; Yin, D.; Ouyang, J. Distribution network fault comprehensive identification method based on voltage-ampere curves and deep ensemble learning. Int. J. Electr. Power Energy Syst. 2025, 164, 110403.
34. Dumkhana, L.; Biragbara, P.B. Review on the impact of electromagnetic interference in high voltage transmission systems. Adv. J. Sci. Eng. Technol. 2025, 10, 42–54.
35. Mariscotti, A. Assessment of human exposure (including interference to implantable devices) to low-frequency electromagnetic field in modern microgrids, power systems and electric transports. Energies 2021, 14, 6789.

Figure 1. Overall structure of two-stage fault diagnosis for distribution network.

Figure 2. Construction of multidimensional time series feature matrix.

Figure 3. Flow chart of data pre-processing.

Figure 4. MS-CNN module diagram.

Figure 5. Schematic diagram of SimAM.

Figure 6. Transformer encoder and two-phase decision architecture.

Figure 7. Topology of IEEE 33-node distribution network.

Figure 8. The curve of training loss and testing accuracy: (a) Stage 1; (b) Stage 2.

Figure 9. Confusion matrix: (a) Stage 1; (b) Stage 2.

Figure 10. t-SNE-based visual analysis of fault features: (a) Stage 1: Fault classification; (b) Stage 2: Fault localization.

Figure 11. Topology of the 10 kV distribution network.

Figure 12. Feature ablation results with and without P and Q.

Figure 13. Noise robustness results under five noise types: (a) Stage 1 classification, (b) Stage 2 localization, (c) End-to-end localization.

Table 1. Summary of model phased test performance.

Stage	Task Description	Accuracy	Precision (Macro)	Recall (Macro)	F1-Score (Macro)
Stage 1	Fault Classification	98.61%	97.84%	98.64%	98.23%
Stage 2	Fault Localization	94.22%	95.16%	94.19%	93.94%

Table 2. End-to-end cascade evaluation results.

Metric	Value
False Trip Rate (Normal → Fault)	0.96%
Missed Detection Rate (Fault → Normal)	0.83%
Final Effective Localization Rate	93.62%

Table 3. Comparison of experimental results.

Model	Stage 1 Accuracy (Classification)	Stage 2 Accuracy (Localization)
SVM	84.73%	56.62%
1D-CNN	96.82%	53.79%
LSTM	98.35%	65.68%
OS-CNN (Omni-Scale)	97.39%	70.84%
LSTM-DenseNet [12]	98.44%	82.88%
HHT-CNN [21]	98.18%	92.70%
ResBlock-CBAM-CNN [23]	98.37%	93.59%
Ours	98.61%	94.22%

Table 4. Performance metrics on the 10 kV distribution network.

Stage	Accuracy	Precision	Recall	F1-Score
Stage 1	96.23%	97.66%	96.55%	97.08%
Stage 2	93.19%	93.42%	93.19%	93.05%

Table 5. Summary of end-to-end cascade evaluation on the 10 kV network.

Metric	Value
False Trip Rate (Normal → Fault)	3.82%
Missed Detection Rate (Fault → Normal)	1.56%
Final Effective Localization Rate	91.85%

Table 6. Ablation protocol.

Models	Backbone	Multi-Scale	SimAM	Transformer	Prototype
Model 0	√	×	×	×	×
Model 1	√	√	×	×	×
Model 2	√	√	√	×	×
Model 3	√	√	√	√	×
Model 4	√	√	√	√	√

Table 7. Results of ablation experiments.

Model ID	Accuracy	Precision	Recall	F1-Score
M0 (Backbone)	87.35%	88.91%	86.95%	87.92%
M1 (M0 + Multi-scale)	89.28%	90.78%	87.02%	88.86%
M2 (M1 + SimAM)	90.52%	92.05%	90.54%	91.29%
M3 (M2 + Transformer)	93.95%	95.06%	92.34%	93.68%
M4 (Ours)	94.22%	95.16%	94.19%	93.94%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

