1. Introduction
As critical infrastructure connecting the transmission system to end-users, the power supply reliability of the distribution network is directly correlated with the stable operation of the socio-economy. However, influenced by factors such as complex topologies, line aging, and environmental interference, distribution networks are highly susceptible to various types of faults. Traditional troubleshooting relies heavily on manual patrols or passive repair reporting, which are inefficient and struggle to meet the urgent requirements of smart grids for rapid power restoration. Therefore, research into high-precision automated fault diagnosis and localization methods is of significant importance for enhancing the perception capabilities and self-healing levels of distribution networks [1, 2, 3].
In the field of distribution network fault diagnosis, the complex physical features of fault signals represent a core challenge constraining accuracy. Distribution network faults primarily encompass single-line-to-ground (SLG) faults, line-to-line (L-L) short circuits, and single-phase open circuits. Among them, SLG faults, which have the highest occurrence rate [4, 5, 6], are typically accompanied by a surge in zero-sequence components, where the signal is weak and easily submerged by noise. L-L short circuits trigger violent current impulses containing abundant high-frequency transients. Meanwhile, single-phase open circuits result in three-phase imbalance and negative-sequence currents. These fault morphologies vary significantly, and, in practical operation, fault signals often possess both high-frequency transients and millisecond-level power frequency distortions. Such cross-scale non-stationary signal features easily lead to modal aliasing of critical fault information in the time–frequency domain. This not only exposes the limitations of early physical mechanism models, but also drives the continuous evolution of diagnostic technologies toward intelligent data-driven directions.
Early research mainly relied on physical mechanism models, such as the traditional impedance method. Zhu et al. [7] developed an automated fault location and diagnosis scheme specifically for radial distribution feeders, establishing one of the first systematic applications of the impedance-based method to distribution networks. Salim et al. [8] further advanced this approach by proposing a hybrid fault diagnosis framework integrating wavelet-based detection, impedance-based fault location, and neural network-based section determination. However, Mora-Flórez et al. [9] pointed out, in their comparative study, that the performance of impedance-based methods relies heavily on accurate system parameters, load models, and power system topology. Chang et al. [10] further demonstrated that dynamic parameter deviations in actual environments can easily lead to protection maloperation based on deterministic calculations. To compensate for the deficiencies of physical models, Xiao et al. [11] introduced the Improved S-Transform (MST) for time–frequency analysis. Although this enhanced the representation of non-stationary signals, it failed to break free from the dependence on cumbersome manual feature engineering.
With the development of artificial intelligence, deep learning has gradually become mainstream [12, 13, 14, 15, 16, 17]. These architectures cover a wide spectrum, including recurrent neural networks for temporal dependency modeling, convolutional neural networks for local feature extraction, graph neural networks for topological reasoning, and attention mechanisms for feature recalibration. Ji et al. [12] utilized the dense connection mechanism of LSTM-DenseNet to effectively validate the advantages of end-to-end models in capturing long-term temporal dependencies of faults. Shafei et al. [18] proved that combining a CNN with the Park transformation can effectively eliminate time-varying load interference, enhancing the model’s generalization ability under variable operating conditions.
Addressing the challenges of complex topology and data sparsity, Mo et al. [19] proposed a Super-Resolution Graph Neural Network (SR-GNN), achieving full-network state reconstruction under sparse measurements. Lu and Hou [20] solved the problem of dynamic topological changes using a domain-adaptive graph attention algorithm. Guo et al. [21] combined the Hilbert–Huang Transform (HHT) with CNN to further enhance the time–frequency representation capability for high-frequency details of non-stationary signals.
Despite the impressive performance of existing architectures in classification tasks, design flaws persist in fine-grained localization. Most networks employ convolution kernels of fixed sizes, making them unable to achieve synchronous decoupling of full-frequency domain information. Although some studies, such as Hu et al. [22] and Yao et al. [23], incorporated channel attention mechanisms (SE-Net) and Convolutional Block Attention Modules (CBAMs), respectively, to optimize feature weights, the parameter redundancy introduced by their fully connected layers significantly increases the computational burden for edge deployment. Furthermore, these methods struggle to fully cope with signal redundancy in complex backgrounds. Additionally, traditional Softmax classifiers often fail to delineate clear decision boundaries when processing highly aliased features of adjacent nodes, thereby severely constraining fault localization accuracy.
Despite these advances, several key research gaps remain in distribution network fault diagnosis. Li et al. [24] highlighted the importance of bridging fault diagnosis results to intelligent maintenance decisions, underscoring the growing demand for diagnosis systems that incorporate operating condition awareness. Meanwhile, from the perspective of diagnosis accuracy itself, existing deep learning-based methods still face three fundamental challenges: (1) single-scale convolution kernels cannot simultaneously capture high-frequency transient mutations and steady-state power frequency distortions; (2) parameter-intensive attention mechanisms hinder deployment on resource-constrained edge devices; and (3) Softmax-based classifiers struggle to separate highly aliased feature representations of electrically adjacent nodes. To address these challenges, this paper proposes a two-stage fault diagnosis framework based on MS-CNN and spatio-temporal dual attention. The main contributions are summarized as follows:
(1) A fusion mechanism of MS-CNN and SimAM is proposed. This achieves the synchronous decoupling of cross-scale fault signals and adaptive feature enhancement under noisy backgrounds, significantly improving the sensitivity to weak faults.
(2) The Transformer encoder is utilized to compensate for the limitations of CNNs in the temporal dimension. By mining the full-time-domain correlations of faults from transient inception to steady-state evolution via the multi-head self-attention mechanism, it solves the difficulty of synchronously decoupling high-frequency impulses and steady-state information.
(3) A two-stage decoupled decision mechanism and a prototype-based metric learning method are designed. By optimizing the geometric structure of the feature space rather than relying on probabilistic classification, this approach effectively overcomes the issue of feature aliasing among adjacent nodes, thereby achieving high-precision section localization.
2. Fault Diagnosis Framework Based on MS-CNN-SimAM-Transformer
2.1. Overview of the Overall Framework
A two-stage cascade diagnosis framework based on MS-CNN-SimAM-Transformer is proposed to address two difficulties that arise from the complex topology of distribution networks: weak fault features and waveform aliasing between adjacent nodes. The framework abandons the traditional end-to-end single-task mode and decouples fault diagnosis into two logically dependent cascade stages, “Stage 1: Fault Classification” and “Stage 2: Fault Section Localization”.
The proposed architecture, shown in Figure 1, combines spatio-temporal feature extraction with cascade decision-making. The input layer receives multi-channel signals: three-phase voltage and current, zero-sequence current, and active and reactive power ($U_a, U_b, U_c, I_a, I_b, I_c, I_0, P, Q$). These signals are processed by three parallel MS-CNN branches with kernel sizes of 3, 7, and 11, capturing transient features at different temporal scales. The SimAM module then applies parameter-free attention to suppress background noise and enhance fault-related signals. A Transformer encoder follows, modeling global temporal dependencies across the full time series. Together, SimAM (spatial screening) and Transformer (temporal modeling) form the spatio-temporal dual attention mechanism that extracts decoupled, high-dimensional feature representations.
After global average pooling and MLP projection, the generated feature vectors enter the cascade decision-making process. As the first line of defense, Stage 1 uses the Softmax classifier to quickly determine the operating state of the system and acts as a logical filter: if the sample is judged to be in normal operation, the diagnosis terminates directly; only when the sample is judged to be a fault are its features passed to the next stage. For these fault samples, Stage 2 applies a metric learning-based localization strategy. It maintains a set of learnable fault prototype vectors and calculates the Euclidean distance between the query sample and each prototype. The fault section is then assigned according to the nearest neighbor principle.
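The cascade logic described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `cascade_diagnose`, the choice of class index 0 for normal operation, and the toy logits and prototypes are all assumptions made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cascade_diagnose(stage1_logits, embedding, prototypes, normal_class=0):
    """Return (fault_type, section); section is None for normal samples."""
    probs = softmax(stage1_logits)
    fault_type = max(range(len(probs)), key=probs.__getitem__)
    if fault_type == normal_class:
        return fault_type, None  # Stage 1 acts as a filter: stop here
    # Stage 2: nearest prototype in Euclidean distance
    dists = [math.dist(embedding, p) for p in prototypes]
    section = min(range(len(dists)), key=dists.__getitem__)
    return fault_type, section

# Toy data (hypothetical): 4 operating conditions, 3 section prototypes
prototypes = [[0.0, 0.0], [1.0, 1.0], [3.0, 0.5]]
normal_result = cascade_diagnose([4.0, 0.1, 0.2, 0.1], [0.9, 1.2], prototypes)
fault_result = cascade_diagnose([0.1, 3.5, 0.2, 0.1], [0.9, 1.2], prototypes)
```

In this sketch a sample classified as normal never reaches the distance computation, which mirrors the pre-filtering role of Stage 1.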
2.2. Data Pre-Processing and Feature Construction
To fully exploit the time–frequency features of the fault signal in the distribution network and match the input requirements of the MS-CNN-Transformer hybrid model, a complete data pre-processing pipeline is constructed in this paper. It consists of four steps: feature engineering, data standardization, label coding, and class imbalance processing.
2.2.1. Construction of Multi-Dimensional Time Series Feature Matrix
The original data are collected from the IEEE 33-node distribution network simulation model based on MATLAB/Simulink. In view of the limitation of single voltage or current signal in characterizing weak fault features, a multi-dimensional feature space covering electrical quantities and power quantities is constructed in this study. In this work, “weak fault” refers to fault scenarios where the electrical signatures of adjacent nodes are highly similar due to short electrical distances, making accurate fault localization challenging.
Each sample in the original data set has a sampling length of $L$. For each sample, the three-phase voltages and currents ($U_a, U_b, U_c, I_a, I_b, I_c$) are extracted as the basic channel features to reflect the basic operating state of the system. The zero-sequence component ($I_0$) serves as a sensitive criterion for grounding faults (calculated as $I_0 = (I_a + I_b + I_c)/3$) and is used as a feature together with the instantaneous active and reactive power ($P$, $Q$), which improves the ability to distinguish load fluctuations from fault impacts. Unlike voltage and current magnitudes alone, $P$ and $Q$ encode the phase relationship (power factor angle) between voltage and current. During faults, the abrupt change in system impedance causes a characteristic shift in the power factor angle, which cannot be directly captured from $U$ or $I$ individually. Teng et al. [25] demonstrated that power-based features provide complementary discriminative information for fault characterization. The input tensor $X$, in which $C$ denotes the total number of feature channels, is constructed by concatenating the above physical quantities along the channel dimension. The multi-channel structure is compatible with the multi-scale convolution kernels of MS-CNN and supports the parallel extraction of deep fault features. The construction process of this multidimensional time series feature matrix is illustrated in Figure 2.
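As an illustration of the channel construction, the sketch below derives the zero-sequence current and the instantaneous powers from three-phase waveforms and stacks all quantities as channels. The text does not give the power formulas, so the instantaneous three-phase expressions used here are an assumption, as are the function name `build_channels`, the resulting count of nine channels, and the balanced test waveform.

```python
import math

SQRT3 = math.sqrt(3.0)

def build_channels(ua, ub, uc, ia, ib, ic):
    """Stack voltages, currents, I0, P and Q as a list of channels (each of length L)."""
    # Zero-sequence current: average of the three phase currents
    i0 = [(a + b + c) / 3.0 for a, b, c in zip(ia, ib, ic)]
    # Instantaneous active power (assumed definition)
    p = [va * a + vb * b + vc * c
         for va, vb, vc, a, b, c in zip(ua, ub, uc, ia, ib, ic)]
    # Instantaneous reactive power (assumed definition; sign conventions vary)
    q = [((vb - vc) * a + (vc - va) * b + (va - vb) * c) / SQRT3
         for va, vb, vc, a, b, c in zip(ua, ub, uc, ia, ib, ic)]
    return [ua, ub, uc, ia, ib, ic, i0, p, q]

# Balanced 50 Hz system sampled at 10 kHz, current lagging voltage by 30 degrees
fs, f, L = 10_000, 50.0, 200
phi, shift = math.pi / 6, 2 * math.pi / 3
theta = [2 * math.pi * f * n / fs for n in range(L)]
ua = [math.cos(t) for t in theta]
ub = [math.cos(t - shift) for t in theta]
uc = [math.cos(t + shift) for t in theta]
ia = [math.cos(t - phi) for t in theta]
ib = [math.cos(t - shift - phi) for t in theta]
ic = [math.cos(t + shift - phi) for t in theta]
channels = build_channels(ua, ub, uc, ia, ib, ic)
```

For the balanced case the zero-sequence channel stays at zero and both power channels are constant, so any deviation in these channels during a fault stands out against a flat baseline.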
2.2.2. Standardized Handling of Data Leakage
Feeding the raw data directly into the network can lead to slow gradient descent convergence or even gradient explosion because of the large differences in numerical magnitude among voltage (kV level), current (A level), and power (kW/kVar level). To this end, Z-Score standardization is applied using training set statistics:
$$x' = \frac{x - \mu_{\mathrm{train}}}{\sigma_{\mathrm{train}}}$$
The same $\mu_{\mathrm{train}}$ and $\sigma_{\mathrm{train}}$ are used to transform the validation and test sets to prevent data leakage.
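A minimal sketch of leakage-free standardization follows: statistics are fitted per channel on the training data only and then reused unchanged for held-out data. The helper names and the toy values are assumptions for illustration.

```python
import statistics

def fit_standardizer(train_channels):
    """Compute (mean, std) per channel from training data only."""
    stats = []
    for ch in train_channels:
        mu = statistics.fmean(ch)
        sigma = statistics.pstdev(ch) or 1.0  # guard against constant channels
        stats.append((mu, sigma))
    return stats

def standardize(channels, stats):
    """Apply previously fitted statistics; never refit on validation or test data."""
    return [[(x - mu) / sigma for x in ch]
            for ch, (mu, sigma) in zip(channels, stats)]

# Toy channels (hypothetical): a kV-level voltage and an A-level current
train = [[12.5, 12.7, 12.6, 12.8], [100.0, 140.0, 120.0, 160.0]]
test = [[12.6, 12.9], [130.0, 90.0]]
stats = fit_standardizer(train)
train_z = standardize(train, stats)
test_z = standardize(test, stats)
```

The test channels are not guaranteed to have zero mean after the transform, which is expected: they are expressed in the training set's coordinate system.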
2.2.3. Multi-Task Label Coding
For the two-stage diagnosis framework of “Stage 1” and “Stage 2” constructed in this paper, a hierarchical label coding strategy is adopted. For the fault classification task of the first stage, a label encoder maps the operating conditions, such as single-phase grounding, two-phase short circuit, and normal operation, to discrete integer codes. For the fault localization task of the second stage, the processing flow retains only the samples in an abnormal operating state and converts their fault line section identifiers into localization labels.
2.2.4. Class Balance Strategy Based on Weighted Sampling
In the actual operation data of distribution networks, normal samples often far outnumber fault samples, and different fault types occur with different probabilities (for example, single-phase grounding is significantly more frequent than two-phase short circuit). This long-tailed distribution biases the model toward predicting the majority class. To address this, a weighted random sampler is introduced, where the sample weight of each category is calculated as
$$w_c = \frac{1}{N_c}$$
where $N_c$ is the total number of samples of category $c$. During batch construction, samples are drawn with replacement based on these weights to balance the participation probability across all categories.
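The effect of inverse-frequency weighting can be checked with a small sketch. It assumes each sample's weight is the reciprocal of its class count, and it uses the standard library's `random.choices` as a stand-in for a framework sampler; the class names and counts are made up for the example.

```python
import random
from collections import Counter

def sample_weights(labels):
    """Give every sample the weight 1 / (number of samples in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

# Long-tailed toy label set (hypothetical proportions)
labels = ["normal"] * 90 + ["slg"] * 8 + ["ll"] * 2
weights = sample_weights(labels)

# Draw with replacement, as described for batch construction
rng = random.Random(0)
batch = rng.choices(labels, weights=weights, k=3000)
batch_counts = Counter(batch)
```

Because the weights within each class sum to one, every class is drawn with roughly equal probability regardless of how rare it is in the raw data.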
The overall data pre-processing pipeline is summarized in
Figure 3.
2.3. MS-CNN Module
The fault signal of a distribution network shows significant multi-scale time–frequency features, covering information from transient high-frequency oscillation to steady-state power frequency offset. A single fixed convolution kernel size struggles to balance the capture of local mutation details against the preservation of global waveform trends. In this study, a multi-scale convolutional neural network (MS-CNN) module is designed as the backbone feature extractor, obtaining feature representations under different receptive fields through three parallel convolutional branches. Each branch adopts a convolution kernel of size $k \in \{3, 7, 11\}$, in which the small convolution kernel focuses on capturing high-frequency transient components and small signal mutations while the large convolution kernel is responsible for extracting low-frequency waveform contours and global trends. This multi-granularity parallel processing mechanism enables the model to adaptively decouple the frequency components in complex fault signals. The detailed architecture of the MS-CNN module is illustrated in Figure 4.
To mitigate vanishing gradients in deep network training and improve the efficiency of feature propagation, residual blocks with skip connections are integrated in each convolution branch, so that each block adds its learned residual mapping to the output of its shortcut path.
The feature maps of the multi-branch outputs are concatenated in the channel dimension to form a mixed feature tensor containing multi-scale information. Specifically, each convolutional branch has 64 output channels, resulting in 192 channels after concatenation, which are then fused into 64 channels via a $1 \times 1$ convolution. The residual blocks further expand the channel dimension to 128 (stride 2) and then to 256 (stride 2). To enable effective interaction between features of different scales and control model complexity, a $1 \times 1$ convolution layer is deployed at the end of the module as a feature fuser. This layer fuses the multi-scale features through a weighted linear combination across channels while reducing the dimension. The resulting compact feature representation serves as a refined input to the subsequent SimAM and Transformer modules.
2.4. Spatial Attention: SimAM Attention
Although deep convolutional networks have a strong ability to abstract features, they face a dilemma when dealing with weak fault signals in distribution networks: high-frequency transient details are easily smoothed as the network deepens, while background noise may be mistakenly amplified. Traditional attention mechanisms, such as SE-Net or CBAM, learn feature weights by introducing additional parameters, which increases the risk of overfitting. To address this, this paper introduces SimAM (Simple Attention Module) [26], which directly derives three-dimensional attention weights from the statistical properties of the signal itself based on the spatial inhibition effect in neuroscience.
2.4.1. Waveform Singularity Detection and Feature Recalibration
To detect transient singularities submerged in noise, SimAM reinterprets the attention mechanism from a signal processing perspective. For each sampling point $t$ in the feature map, a binary signal-to-noise separator is formulated through the minimal energy function of [26]:
$$e_t^{*} = \frac{4(\hat{\sigma}^2 + \lambda)}{(t - \hat{\mu})^2 + 2\hat{\sigma}^2 + 2\lambda}$$
where $\hat{\mu}$ and $\hat{\sigma}^2$ are the mean and variance of the feature map and $\lambda$ is a regularization coefficient. This function quantifies the “singularity” of each point by measuring its deviation from the surrounding statistical distribution: the lower the energy, the more distinctive the point. As illustrated in Figure 5, the process consists of three steps: computing the global mean and variance of the input feature map as a statistical baseline, evaluating each feature point using the energy function, and applying a Sigmoid activation to generate attention weights in the range $(0, 1)$. These weights are multiplied element-wise with the original feature map to suppress background noise and amplify fault transient features [26, 27, 28].
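The three steps can be illustrated on a single one-dimensional feature sequence. This sketch assumes the closed-form inverse energy used in the SimAM reference implementation and a regularization value of 1e-4; both the function name `simam_1d` and the toy signal are hypothetical, and the real module operates on full feature maps rather than plain lists.

```python
import math
import statistics

def simam_1d(x, lam=1e-4):
    """Parameter-free attention on one sequence: returns (reweighted, weights)."""
    n = len(x) - 1
    mu = statistics.fmean(x)                      # step 1: statistical baseline
    sq_dev = [(v - mu) ** 2 for v in x]
    var = sum(sq_dev) / n
    # step 2: inverse of the minimal energy, larger for more distinctive points
    inv_energy = [d / (4.0 * (var + lam)) + 0.5 for d in sq_dev]
    # step 3: sigmoid turns the inverse energy into a weight in (0, 1)
    weights = [1.0 / (1.0 + math.exp(-e)) for e in inv_energy]
    return [v * w for v, w in zip(x, weights)], weights

# Low-level noise with one transient spike at index 3 (toy data)
signal = [0.1, -0.1, 0.05, 2.0, -0.05, 0.1, 0.0, -0.1]
enhanced, weights = simam_1d(signal)
spike_index = max(range(len(weights)), key=weights.__getitem__)
```

No learnable parameters appear anywhere in the function, which is the property that makes this form of attention attractive for edge deployment.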
2.4.2. Module Embedding Strategy
SimAM is embedded into the core path of each multi-scale residual block, positioned after the second Batch Normalization layer and before the residual addition, so that the block output is the sum of the shortcut path and $\mathrm{SimAM}(X)$, where $X$ is the feature map after the two-layer convolution transform. This “residual embedding” ensures that the attention mechanism fine-tunes only the residual mapping while the identity path remains intact, preserving weak fault information across layers and preventing feature degradation in deep networks.
2.5. Temporal Attention: Transformer Encoder
Although the MS-CNN, with the introduction of SimAM, is able to sensitively capture high-frequency transient shocks, its receptive field is always limited to the local window due to the physical size of the convolution kernel. In the face of the complete evolution process of distribution network fault waveform from transient mutation to steady-state distortion, it is difficult for the CNN to establish long-distance logical correlation in the full time domain scale.
To this end, this paper uses a Transformer encoder to process the output of the CNN and capture the global dependencies in the time series. The encoder consists of one Transformer block with four attention heads, an embedding dimension of 256, a feed-forward network hidden dimension of 512, and a dropout rate of 0.3.
2.5.1. Sequence Reconstruction and Multi-Head Self-Attention
Unlike convolution operations, which focus on “Channels”, transformers focus on “Sequences”. Therefore, the output tensor of MS-CNN is reshaped with the time dimension as the sequence length and the channel dimension as the feature embedding.
The central mechanism is multi-head self-attention (MSA) [29]. It allows each time step in the sequence to attend directly to all other time steps, no matter how far apart they are. For the input sequence $X$, we generate the Query, Key, and Value matrices:
$$Q = XW^{Q}, \quad K = XW^{K}, \quad V = XW^{V}$$
where each $W$ is a learnable linear projection matrix. The attention score is calculated by scaling the dot product:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
Here, $\sqrt{d_k}$ is the scaling factor, which prevents the gradient from vanishing when the dot product becomes too large.
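A single attention head can be written out directly from the scaled dot-product definition. The sketch below is a simplified illustration: it skips the learned projections (equivalent to assuming identity matrices) and uses a three-step toy sequence, so the function name and data are assumptions rather than the paper's configuration.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head; inputs are lists of vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)  # one weight per time step, summing to 1
        weights.append(w)
        outputs.append([sum(wi * v[j] for wi, v in zip(w, V))
                        for j in range(len(V[0]))])
    return outputs, weights

# Toy sequence of 3 time steps with 2 features; projections assumed to be identity
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = attention(X, X, X)
```

Each row of `attn` spans every time step in the sequence, which is how a step late in the waveform can draw directly on the fault inception without passing through intermediate layers.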
In the physical sense, the attention weight matrix quantifies the correlation between different moments in the process of fault development. For example, the transient mutation at the beginning of the fault and the steady-state distortion several cycles later are causally related. The MSA mechanism can assign a high weight to this kind of cross-period correlation, reconstructing the logical chain of fault evolution and compensating for the CNN’s focus on local features.
2.5.2. Feed-Forward Network and Global Aggregation
The output of MSA is fed into a feed-forward network (FFN) after a residual connection (Add) and layer normalization (Norm). As shown in the middle of Figure 6, to enhance the non-linear representation and prevent overfitting, we use the GELU (Gaussian Error Linear Unit) activation function instead of the traditional ReLU:
$$\mathrm{GELU}(x) = x\,\Phi(x)$$
where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution. The smoothness of the GELU near zero makes its gradient propagation more stable when dealing with high-dimensional electrical features. Finally, the sequence features from the Transformer output are compressed into a single high-dimensional semantic vector through the Global Average Pooling layer. This vector condenses the local multi-frequency details of the fault and the global temporal logic and is passed to the MLP Head shown on the right side of Figure 6. The projection head contains two fully connected layers (Linear Large/Small), in which the features are further refined by SiLU activation and Dropout layers and finally routed to the Softmax (Stage 1) and Prototype Matching (Stage 2) modules for decision-making.
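The two operations at the end of the encoder are simple enough to write out explicitly. The sketch below uses the exact GELU definition through the error function and averages a toy sequence over its time axis; the function names and the example values are assumptions for illustration.

```python
import math

def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def global_average_pool(sequence):
    """Average a (time steps x features) sequence over the time axis."""
    steps = len(sequence)
    return [sum(col) / steps for col in zip(*sequence)]

# Toy Transformer output: 3 time steps, 2 features
seq = [[1.0, -2.0], [3.0, 0.0], [5.0, 2.0]]
pooled = global_average_pool(seq)
activated = [gelu(v) for v in (-1.0, 0.0, 1.0)]
```

Unlike ReLU, GELU passes a small negative value for moderately negative inputs, so its slope changes smoothly around zero instead of switching abruptly.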
2.6. Decision Mechanisms for Task Decoupling
In order to balance the accuracy of fault classification and the fineness of fault localization, this paper designs a two-branch decision architecture, as shown on the right side of
Figure 6. The extracted high-dimensional features are first mapped by a shared-structure MLP Head and are then shunted to two different decision paths of Stage 1 and Stage 2 according to the task requirements.
2.6.1. Nonlinear Feature Projection (MLP Head)
A shared MLP projection head maps the Transformer output to the decision space, consisting of LayerNorm, a linear expansion layer, SiLU activation, Dropout regularization, and a linear compression layer that outputs a compact feature vector z for both Stage 1 and Stage 2.
2.6.2. Softmax-Based Fault Classification
For Stage 1 fault classification tasks (such as single-phase grounding, two-phase short circuit, etc.), the feature vector $z$ output by the MLP is directly input into the Softmax classifier. The objective of this path is to find the linear decision boundary of each type of fault in the feature space and calculate the conditional probability of belonging to the $i$-th type of fault:
$$P(y = i \mid z) = \frac{\exp(w_i^{\top} z + b_i)}{\sum_{j}\exp(w_j^{\top} z + b_j)}$$
where $w_i$ and $b_i$ are the weight vector and bias of class $i$. In this stage, the model is guided to quickly capture the macroscopic waveform features of the fault by minimizing the cross-entropy loss.
2.6.3. Fault Localization Based on Prototype Networks
For the fault section localization task of Stage 2, the traditional linear classification often fails due to the high feature similarity between adjacent nodes. To this end, we introduce a Prototype Matching strategy [
30] at the MLP output, as shown in
Figure 6, transforming the localization problem into a metric learning problem in geometric space.
Suppose that the embedding vector of the MLP output is $z$. For each fault localization category $k$, the model maintains a learnable parameter vector called a prototype, $p_k$. During inference, the Euclidean distance between the query sample and every prototype vector is calculated:
$$d_k = \lVert z - p_k \rVert_2$$
To force the model to learn discriminative feature distributions, we employ a prototypical loss function whose essence is Softmax in the distance dimension:
$$\mathcal{L}_{\mathrm{proto}} = -\log \frac{\exp(-d_y)}{\sum_{k}\exp(-d_k)}$$
where $y$ is the true section label. This loss pulls samples of the same class toward their prototype center while pushing different prototypes apart, effectively separating previously aliased adjacent node clusters (see Section 3 for details).
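The distance-based decision and its loss can be sketched as follows. This is an illustration under assumptions: it applies a softmax over negative (unsquared) Euclidean distances with no temperature term, and the prototypes and query vector are toy values rather than learned parameters.

```python
import math

def prototype_distances(z, prototypes):
    """Euclidean distance from embedding z to every prototype."""
    return [math.dist(z, p) for p in prototypes]

def prototypical_loss(z, prototypes, target):
    """Negative log-softmax over negative distances, evaluated at the target class."""
    neg = [-d for d in prototype_distances(z, prototypes)]
    m = max(neg)
    log_norm = m + math.log(sum(math.exp(v - m) for v in neg))
    return -(neg[target] - log_norm)

# Three toy section prototypes and one query embedding (hypothetical values)
prototypes = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
z = [0.2, 0.1]
dists = prototype_distances(z, prototypes)
predicted = min(range(len(dists)), key=dists.__getitem__)
loss_correct = prototypical_loss(z, prototypes, 0)
loss_wrong = prototypical_loss(z, prototypes, 1)
```

The loss is smaller when the embedding lies close to its own prototype and far from the others, so minimizing it tightens each cluster around its center.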
2.6.4. Multi-Task Joint Optimization
To balance the performance of the two stages, the whole framework adopts an end-to-end joint training strategy. The total objective function is defined as the weighted sum of the two-stage losses, in which a balance factor controls the gradient contribution of each task. During early training, the model focuses on learning common fault waveform features. As training progresses, the prototypical loss gradually refines the feature space, keeping it globally separable while producing more compact local manifolds for precise localization of adjacent nodes.
3. Experimental Setup and Basic Performance Evaluation
3.1. Dataset Generation and Experimental Environment
In this paper, a standard IEEE 33-node radial distribution network [31] is built on the MATLAB R2024b/Simulink (SimPowerSystems) platform. The system consists of 33 nodes and 37 branches. The reference voltage is 12.66 kV and the reference frequency is 50 Hz. The system topology is shown in Figure 7.
Fault scenarios cover 37 line sections, 4 operating conditions (normal, single-phase-to-ground, line-to-line, and single-phase open circuit, each on phases A, B, and C), 4 fault-resistance levels (0.1 Ω, 1 Ω, 10 Ω, 25 Ω), and 4 inception angles (0°, 30°, 60°, 90°). Faults were applied at the midpoint of each line section. System loads were maintained at nominal levels, and voltage and current signals were recorded at all 33 nodes. Each event lasted 0.1 s and was sampled at 10 kHz. The MS-CNN kernel sizes (3, 7, 11) correspond to temporal windows of 0.3 ms, 0.7 ms, and 1.1 ms, respectively, for local multi-scale feature extraction, while the Transformer encoder captures global temporal dependencies across the full 0.1 s event (1000 time steps). The final data set contains 240,111 samples, which were divided at the fault-event level into training, validation, and test sets in a 6:2:2 ratio, ensuring that all window samples from the same event belonged exclusively to one subset to prevent data leakage. Stratified sampling based on the Stage 1 label (fault type) was applied to maintain consistent proportions across subsets.
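An event-level stratified split of this kind can be sketched as below. The function name `split_by_event`, the event identifiers, and the toy label distribution are assumptions; the point is only that the split is made over events, and every window then inherits the subset of its parent event.

```python
import random
from collections import defaultdict

def split_by_event(event_labels, ratios=(0.6, 0.2, 0.2), seed=0):
    """Split event ids into train/val/test, stratified by the Stage 1 label."""
    by_label = defaultdict(list)
    for event_id, label in event_labels.items():
        by_label[label].append(event_id)
    rng = random.Random(seed)
    train, val, test = set(), set(), set()
    for ids in by_label.values():
        ids = sorted(ids)          # fixed order before shuffling for reproducibility
        rng.shuffle(ids)
        n_train = round(ratios[0] * len(ids))
        n_val = round(ratios[1] * len(ids))
        train.update(ids[:n_train])
        val.update(ids[n_train:n_train + n_val])
        test.update(ids[n_train + n_val:])
    return train, val, test

# 40 toy events (hypothetical): every fourth one is normal, the rest are faults
events = {f"evt{i}": ("normal" if i % 4 == 0 else "fault") for i in range(40)}
train_ids, val_ids, test_ids = split_by_event(events)
# A window belongs to whichever subset contains its parent event id
window_subset = "train" if "evt3" in train_ids else "held-out"
```

Splitting windows directly instead of events would let near-identical windows of one fault land on both sides of the split, which is the leakage the event-level scheme avoids.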
For the fault localization task of Stage 2, its data set is directly derived from the above partitioning results. Specifically, the training set and the test set of Stage 2 are subsets of all non-normal (i.e., fault) samples in the corresponding set of Stage 1, respectively. This approach maintains the homology of the data distribution of the two stages.
All experiments were implemented in the PyTorch deep learning framework, with the hardware environment configured as an Intel Core i5-10200H CPU and an Nvidia GeForce RTX 1650 (4GB) GPU. The AdamW optimizer (with weight decay) was used for model training, with an initial learning rate of 0.001 and a cosine annealing learning rate scheduler. The label smoothing factor was set to 0.1, and the gradient clipping threshold was 1.0. Early stopping with a patience of 10 epochs and a minimum delta of 0.0001 was adopted to prevent overfitting. The batch size was 256 and the maximum number of epochs was 30. For the prototype network in Stage 2, the prototype vectors were initialized using nn.Parameter (torch.randn × 0.02), and the prototype loss weight was set to 0.2. These hyperparameters follow common practices in deep learning-based fault diagnosis: AdamW mitigates overfitting, cosine annealing ensures smooth convergence, gradient clipping stabilizes training, and early stopping prevents unnecessary epochs. The prototype loss weight balances metric learning and cross-entropy losses.
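Two of the training controls listed above can be written out in closed form. The sketch assumes the cosine schedule spans all 30 epochs and decays to a minimum learning rate of zero, and that early stopping monitors a validation loss; none of these details are stated in the text, and the class and function names are hypothetical.

```python
import math

def cosine_lr(epoch, base_lr=1e-3, max_epochs=30, min_lr=0.0):
    """Cosine annealing from base_lr at epoch 0 to min_lr at max_epochs."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * epoch / max_epochs))

class EarlyStopping:
    """Stop when the monitored loss has not improved by min_delta for `patience` epochs."""

    def __init__(self, patience=10, min_delta=1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best = math.inf
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

schedule = [cosine_lr(e) for e in range(31)]
stopper = EarlyStopping(patience=2)
decisions = [stopper.step(loss) for loss in (1.0, 0.5, 0.5, 0.49995)]
```

In the usage lines, the last loss improves by less than the minimum delta, so it counts as a second stagnant epoch and triggers the stop.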
3.2. Evaluation Indicators
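To comprehensively evaluate the performance of the model in the fault classification and fault localization tasks, this paper uses accuracy, precision, recall, and F1-score as evaluation indicators. The calculation formulas are as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where TP, TN, FP, and FN represent the number of true positive, true negative, false positive, and false negative samples, respectively.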
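For a multi-class task these quantities are computed per class in a one-vs-rest fashion and then averaged. The sketch below shows one way to do this with macro averaging; the function name and the toy labels are assumptions for illustration.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1 over all classes."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Toy labels for a 3-class problem
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
metrics = macro_metrics(y_true, y_pred)
```

Macro averaging weights every class equally, so a rare fault type influences the score as much as a common one.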
3.3. Training Performance Analysis of the Model
3.3.1. Training Convergence and Stability Analysis
As can be seen from the Stage 1 curve in Figure 8a, the loss drops steeply in the initial stage of training. This rapid convergence suggests that the MS-CNN backbone quickly captures the prominent transient impact and power frequency distortion components in the fault current, allowing the model to leave the underfitting region within very few epochs. As training progresses, the test accuracy reaches a peak of 98.61% at epoch 16 (Best Ep.16). Thereafter, the training loss stabilizes within a very low range (approximately 0.05) without severe oscillations, indicating stable training.
Figure 8b shows the training process for the more challenging Stage 2 fault localization task. The loss curve of Stage 2 exhibits a continuous, smooth, and monotonically decreasing convergence trend, with the test accuracy rising steadily to a maximum of 94.22% at epoch 24 (Best Ep.24). This is primarily attributed to the prototypical loss introduced at this stage: unlike traditional cross-entropy, which merely seeks decision boundaries, the prototypical loss continually adjusts geometric distances in a high-dimensional space to “pull” samples of the same class toward their class center. This geometric constraint appears to smooth the optimization landscape, helping the optimizer avoid poor local minima during fine-grained discrimination.
3.3.2. Quantitative Evaluation of Diagnostic Performance
To objectively evaluate the generalization performance of the proposed MS-CNN-SimAM-Transformer framework in different task phases,
Table 1 summarizes the core metrics on the test set. The evaluation includes four dimensions: accuracy, precision, recall and F1-Score. The last three indicators are calculated using Macro-average in order to eliminate the evaluation bias caused by the uneven distribution of category samples.
As shown in Table 1, Stage 1 achieves an accuracy of 98.61% and a recall of 98.64%, ensuring reliable fault detection. Stage 2 achieves 94.22% accuracy under the more challenging 37-class fine-grained localization task. To further evaluate the practical performance of the cascaded system, Table 2 reports the end-to-end metrics. The missed detection rate is only 0.83%, meaning that only 0.83% of fault samples are blocked by Stage 1 and never reach Stage 2. This limits the upper bound of the cascaded localization rate to 99.17%. In practice, the final effective localization rate reaches 93.62%, a degradation of only 0.60 percentage points compared with the standalone Stage 2 accuracy (94.22%). This demonstrates that Stage 1’s high recall ensures minimal error propagation, and the cascaded architecture imposes negligible performance loss while benefiting from pre-filtering of normal samples.
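The end-to-end quantities can be derived from per-sample predictions as sketched below. The exact definitions behind Table 2 are not spelled out, so this sketch assumes that a fault counts as missed when Stage 1 labels it normal, and as effectively localized when it passes Stage 1 and Stage 2 returns the correct section; the record layout and values are hypothetical.

```python
def cascade_metrics(records, normal_class=0):
    """records: (true_type, pred_type, true_section, pred_section) per sample."""
    faults = [r for r in records if r[0] != normal_class]
    n = len(faults)
    # Faults that Stage 1 labels as normal never reach Stage 2
    missed = sum(1 for r in faults if r[1] == normal_class)
    # Faults that pass Stage 1 and receive the correct section from Stage 2
    located = sum(1 for r in faults if r[1] != normal_class and r[3] == r[2])
    return {
        "missed_detection_rate": missed / n,
        "upper_bound": 1.0 - missed / n,
        "effective_localization_rate": located / n,
    }

# Toy records: 10 faults (1 missed, 1 mislocalized, 8 correct) and 2 normal samples
records = (
    [(1, 1, 5, 5)] * 8
    + [(1, 0, 5, None)]
    + [(2, 2, 7, 8)]
    + [(0, 0, None, None)] * 2
)
summary = cascade_metrics(records)
```

The effective rate can never exceed the upper bound, since every missed detection is also a localization failure.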
Figure 9 shows the normalized confusion matrix on the test set, which intuitively reveals the discriminant details and error distribution features of the model at different task stages.
As shown in
Figure 9a, the Stage 1 confusion matrix confirms the model’s high classification accuracy across all four operating conditions. Normal operation is correctly identified with 98.1% accuracy, with only 1.9% misclassified as single-phase open circuit. Single-phase grounding achieves 99.7% recall, demonstrating the model’s sensitivity to weak transient signals. Phase-to-phase faults reach 98.5% recall, and single-phase open circuit reaches 98.2% recall. These results validate the effectiveness of the MS-CNN backbone in extracting discriminative features from the multi-channel input, particularly for the most challenging single-phase grounding category.
Figure 9b shows the confusion matrix for the 37-line fault localization task. The matrix exhibits clear diagonal dominance, with most diagonal elements exceeding 94% recall. The sparse off-diagonal errors follow an “electrical proximity” pattern, concentrating between electrically adjacent lines such as Line 9 and Line 36, or Line 25 and Line 31 due to their extremely short electrical distances. Nevertheless, no large-scale cross-region misjudgment occurs, confirming that the spatio-temporal dual attention mechanism provides strong fine-grained discriminative ability.
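Because the diagonal entries above are read as per-class recall, the matrices are presumably normalized by row (by true class). A minimal sketch of that normalization, with a hypothetical function name and toy labels, is:

```python
def normalized_confusion(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix: entry [i][j] is the fraction of
    true-class-i samples predicted as class j, so [i][i] is the recall of class i."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        # A class with no samples yields an all-zero row instead of dividing by zero.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Hypothetical two-class example.
cm = normalized_confusion([0, 0, 1, 1, 1, 1], [0, 1, 1, 1, 1, 0], n_classes=2)
for row in cm:
    print(row)
```

With row normalization, each row sums to one, and the off-diagonal entries in a row show where the samples of that true class are misassigned.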
To further explain the model’s high accuracy from the perspective of feature-space evolution, the t-SNE algorithm is used to reduce the dimensionality of the high-dimensional semantic features of the test samples; the visualization results are shown in Figure 10. As shown in Figure 10a, the features of different fault types in Stage 1 form clearly separated clusters, indicating that the spatio-temporal dual attention mechanism effectively maps cross-scale fault signals into a discriminative feature space. In Figure 10b, the feature distribution of the 37 line sections in Stage 2 shows a well-clustered structure, with prototypes acting as compact class centers that facilitate accurate localization of electrically adjacent nodes.
3.4. Comparative Experimental Analysis
To demonstrate the advantages of the MS-CNN-SimAM-Transformer framework, seven mainstream baseline models are selected for comparison: the traditional machine learning method SVM and the deep learning models 1D-CNN, LSTM, OS-CNN, LSTM-DenseNet [12], HHT-CNN [21], and ResBlock-CBAM-CNN [23]. All comparison models were trained on the same dataset under the same experimental conditions.
The data in Table 3 reveal a clear pattern: most models achieve an accuracy above 96% in Stage 1 (fault classification) but generally hit a performance bottleneck in Stage 2 (fault localization). The basic 1D-CNN, for example, achieves an accuracy of 96.82% in classification but only 53.79% in localization, a drop of 43.03 percentage points. Even the LSTM, despite its temporal memory, reaches only 65.68% in the localization task.
This dramatic gap reflects the inherent difficulty of the two-stage task: while fault classification relies on macroscopic morphological differences in voltage and current waveforms (such as the surge in short-circuit amplitude), fault localization requires the deduction of subtle changes in line impedance. In the IEEE 33-bus system, the extremely short electrical distance between adjacent nodes results in highly overlapping fault features. Conventional Softmax classifiers struggle to construct effective decision boundaries between such dense clusters, leading to the decline in accuracy.
Hybrid architectures significantly improve this situation. LSTM-DenseNet [12] reaches an accuracy of 82.88%, while HHT-CNN [21] and ResBlock-CBAM-CNN [23] reach 92.70% and 93.59%, respectively. The proposed framework nevertheless achieves the highest accuracy in both stages, with 98.61% in Stage 1 and 94.22% in Stage 2, surpassing HHT-CNN by 1.52 percentage points and ResBlock-CBAM-CNN by 0.63 percentage points in Stage 2 localization. This advantage is attributed to the synergy between the spatio-temporal dual attention and the prototype metric learning. At the feature extraction level, SimAM offers better signal fidelity than the CBAM module in ResBlock-CBAM-CNN. Unlike CBAM, which uses dimensionality reduction, SimAM directly calculates three-dimensional attention weights based on an energy function, preserving the integrity of weak fault signals without losing structural information. In addition, the Transformer encoder avoids the vanishing-gradient problem that RNNs face on long sequences. Its parallel self-attention mechanism captures the global dependence of the fault from transient inception to steady-state evolution. Finally, the prototype network optimizes the geometric structure of the feature space directly, giving it better discriminative ability than the traditional Softmax-based decision-making in HHT-CNN.
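To make the energy-based weighting concrete, the sketch below follows the commonly published SimAM formulation, in which the inverse minimal energy of a neuron is (x − μ)² / (4(σ² + λ)) + 0.5 and the attention weight is its sigmoid. It operates on the activations of a single channel in plain Python; the function name, the regularizer value λ = 1e-4, and the toy input are assumptions, and how the module is wired into the backbone of this paper is not shown here.

```python
import math


def simam_channel(values, e_lambda=1e-4):
    """Apply SimAM-style reweighting to the activations of ONE channel.

    Returns (reweighted_values, weights). The weights are computed from
    channel statistics only, so no learnable parameters are involved.
    Requires at least two activations.
    """
    n = len(values) - 1
    mu = sum(values) / len(values)
    sq_dev = [(v - mu) ** 2 for v in values]
    var = sum(sq_dev) / n

    # Inverse of the minimal energy: activations that deviate strongly from
    # the channel mean have low energy and therefore receive a larger weight.
    inv_energy = [d / (4 * (var + e_lambda)) + 0.5 for d in sq_dev]
    weights = [1 / (1 + math.exp(-e)) for e in inv_energy]
    return [v * w for v, w in zip(values, weights)], weights


# Hypothetical channel: mostly flat, with one distinctive activation.
out, w = simam_channel([0.0, 0.0, 0.0, 4.0])
print([round(x, 3) for x in w])
```

Each activation receives its own weight without any pooling or channel reduction step, which is the property the comparison with CBAM above relies on.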
The above results verify the effectiveness of the proposed model under standard working conditions. In an actual distribution network, however, the influence of noise interference and the contribution of each component to model accuracy still need to be analyzed, which is done in the following sections.
3.5. Cross-Scenario Generalization Validation
To further evaluate the generalization ability of the proposed framework across different network topologies, additional experiments were conducted on a 10 kV distribution network [
32,
33]. Unlike the IEEE 33-node system (33 nodes, 37 line sections), this network consists of seven nodes and seven line sections fed by four feeders, representing a structurally distinct topology. Faults are generated on all seven line sections under the same simulation configurations as the IEEE 33-node system, including four operating conditions (normal, single-phase-to-ground, line-to-line, and single-phase open circuit, each on phases A, B, and C), four fault-resistance levels, and four inception angles, ensuring comparability with the previous experiments.
As shown in
Table 4, the model achieves 96.23% accuracy in Stage 1 and 93.19% accuracy in Stage 2 on the 10 kV network.
Table 5 further reports the end-to-end cascade metrics on this network: the final effective localization rate reaches 91.85%, with a false trip rate of 3.82% and a missed detection rate of 1.56%. Compared with the IEEE 33-node results, the performance on the 10 kV network is slightly lower, which is expected given the different topology and the limited number of training samples available for the smaller network. Nevertheless, the model maintains high diagnostic accuracy across both topologies, demonstrating strong generalization capability.
The topology of the 10 kV distribution network is illustrated in
Figure 11.
4. Model Ablation and Robustness Assessment
4.1. Ablation Experiments
As shown in the comparative experimental analysis in
Section 3.4, fault localization (Stage 2) faces greater challenges than fault classification and is the main bottleneck that restricts the overall performance of the system. Therefore, in order to further explore the specific contribution of each module proposed in this paper to this critical task, we focus the ablation experiment on the fault localization phase.
To this end, we constructed a baseline model (denoted as Backbone) against which the effectiveness of the multi-scale (MS) module and the attention mechanism (SimAM) is verified. As the “Single-scale Backbone” of our framework, the baseline model uses exactly the same training strategies as the full model (optimizer configuration, learning rate schedule, etc.) to ensure a fair internal comparison.
Specifically, to quantify the performance contribution of each core component, we designed five model variants that are built up progressively from the baseline model, as summarized in Table 6:
Model 0 (Backbone): Basic single-stream feature extractor. Only standard convolutional layers are used to extract features, and a fully connected layer directly predicts the fault section labels.
Model 1 (Model 0 + Multi-scale module): The multi-scale module is integrated into Model 0, adopting a parallel convolution branch structure with different kernel sizes.
Model 2 (Model 1 + SimAM attention): A parameter-free attention mechanism is embedded in the feature extraction back-end of Model 1, and the three-dimensional attention weights are calculated based on the energy function.
Model 3 (Model 2 + Transformer encoder): A Transformer encoder is further introduced to process sequence features through a multi-head self-attention mechanism.
Model 4 (Ours): The proposed complete framework. The prototype network strategy replaces the traditional fully connected classification head, and localization is performed by calculating the metric distance between the sample and each prototype.
To quantitatively assess the specific contribution of each module to the fault localization performance, we recorded the accuracy, precision, recall, and F1-Score of five progressive variants on the test set. The detailed experimental results are summarized in
Table 7. From the data in the table, it can be intuitively observed that, with the gradual introduction of key components, the indicators of the model show a steady upward trend, which verifies the rationality of the proposed framework design.
The ablation results in Table 7 show that each module contributes to the overall performance. Adding multi-scale convolution (M1) improves accuracy by 1.93 percentage points and F1 by 0.94 percentage points over the single-scale backbone (M0). The SimAM module (M2) further increases precision to 92.05% by suppressing redundant features. The Transformer (M3) brings the most significant gain, with accuracy rising by 3.43 percentage points to 93.95%, confirming that the CNN’s limited local receptive field is effectively complemented by global temporal modeling. Finally, the prototype network (M4) refines the feature space and achieves the best overall performance.
With the prototype network added, the complete model M4 reaches an F1-Score of 93.94%. Although M3 already localizes well, it still misjudges adjacent nodes with similar features. By computing the Euclidean distance between each sample and the class prototypes, the prototype network pulls samples of the same class together and pushes samples of different classes apart, which resolves the boundary ambiguity caused by feature aliasing and further improves localization accuracy.
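The distance-based decision can be illustrated with a small sketch. It assumes that each prototype is the mean embedding of its class and that squared Euclidean distance is used, as in standard prototypical networks; whether the prototypes in this paper are class means or learned vectors is not stated in this section, and the function names, labels, and two-dimensional embeddings are hypothetical.

```python
import math


def class_prototypes(embeddings, labels):
    """Compute one prototype per class as the mean of its embeddings."""
    sums, counts = {}, {}
    for vec, lab in zip(embeddings, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
            counts[lab] = 0
        sums[lab] = [s + v for s, v in zip(sums[lab], vec)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}


def predict(embedding, prototypes):
    """Return the nearest-prototype label and a softmax over negative
    squared Euclidean distances (closer prototypes get higher probability)."""
    dists = {
        lab: sum((a - b) ** 2 for a, b in zip(embedding, proto))
        for lab, proto in prototypes.items()
    }
    label = min(dists, key=dists.get)
    d_min = min(dists.values())  # subtract the minimum for numerical stability
    exps = {lab: math.exp(-(d - d_min)) for lab, d in dists.items()}
    total = sum(exps.values())
    return label, {lab: e / total for lab, e in exps.items()}


# Hypothetical embeddings for two line sections, "A" and "B".
protos = class_prototypes(
    [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]],
    ["A", "A", "B", "B"],
)
print(predict([1.0, 1.0], protos))
```

Training against such distance-based probabilities rewards embeddings that sit close to their own prototype and far from the others, which is the clustering and separation effect described above.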
Further ablation experiments are conducted to validate the contribution of power features (P, Q), which are introduced in
Section 2.2 to encode the phase relationship between voltage and current. As shown in
Figure 12, incorporating P and Q significantly improves performance across all metrics. Stage 1 accuracy increases from 94.57% to 98.61%, and Stage 2 accuracy from 92.15% to 94.22%. The final effective localization rate rises from 87.54% to 93.62%, while the missed detection rate drops sharply from 4.62% to 0.83%. These results confirm that P and Q provide complementary phase-related information beyond voltage and current magnitudes.
4.2. Noise Robustness Experiments
In practical industrial settings, power equipment often operates in a complex electromagnetic environment, and the collected sensor signals are inevitably contaminated by various kinds of background noise [34, 35]. To evaluate the robustness of the proposed model under diverse interference conditions, five types of noise are introduced during data preprocessing:
(1) Gaussian white noise: Random Gaussian noise is added directly to the raw signals. The noise amplitude is controlled by the target SNR, where a lower SNR corresponds to higher noise intensity.
(2) Pulse noise: Randomly selected sampling points are replaced by large-amplitude impulses. At 35 dB, 2% of elements are affected at 1.5× the signal amplitude; at 5 dB, 20% of elements are affected at 8× the signal amplitude. The parameters vary linearly between these two SNR extremes.
(3) Measurement error: All elements are multiplied by a random scaling factor. At 35 dB, the overall offset is ±2%; at 5 dB, the offset reaches ±20%, with linear variation between these bounds.
(4) Sensor dropout: Randomly selected elements are set to zero. At 35 dB, 5% of elements are dropped; at 5 dB, the dropout rate reaches 30%, with linear scaling in between.
(5) Communication delay: Randomly selected elements are cyclically shifted. At 35 dB, 8% of elements are shifted by 2 positions; at 5 dB, 40% of elements are shifted by 6 positions, with linear variation.
All noise types are tested across seven SNR levels from 35 dB to 5 dB with a step size of 5 dB, where the SNR is defined as the ratio of signal power to noise power. At 35 dB, the noise is barely perceptible; at 5 dB, the parameters reach their maximum settings, representing extreme interference conditions.
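The noise settings above can be sketched in a few lines. The sketch assumes that the severity parameters are interpolated linearly in dB between the 35 dB and 5 dB settings, that the Gaussian noise variance is derived from the mean signal power, and that dropout is applied independently per element; all function names are hypothetical and the code is an illustration, not the preprocessing pipeline used in the experiments.

```python
import math
import random


def interpolate(snr_db, at_35, at_5):
    """Linearly interpolate a severity parameter between its 35 dB and 5 dB values."""
    frac = (35.0 - snr_db) / 30.0  # 0 at 35 dB, 1 at 5 dB
    return at_35 + frac * (at_5 - at_35)


def add_gaussian_noise(signal, snr_db, rng):
    """Add white Gaussian noise so that signal power / noise power matches snr_db."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def sensor_dropout(signal, snr_db, rng):
    """Zero out elements at random; the rate grows from 5% (35 dB) to 30% (5 dB).

    Each element is dropped independently, so the realized fraction only
    approximates the target rate.
    """
    rate = interpolate(snr_db, 0.05, 0.30)
    return [0.0 if rng.random() < rate else x for x in signal]


rng = random.Random(0)  # fixed seed for reproducibility
clean = [math.sin(2 * math.pi * k / 200) for k in range(2000)]
for snr in range(35, 0, -5):  # the seven levels: 35, 30, ..., 5 dB
    noisy = add_gaussian_noise(clean, snr, rng)
    dropped = sensor_dropout(clean, snr, rng)
    print(snr, round(interpolate(snr, 0.05, 0.30), 3), len(noisy), len(dropped))
```

The same `interpolate` helper could drive the impulse, measurement error, and communication delay settings by passing their respective 35 dB and 5 dB values.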
Figure 13 presents the accuracy under these five noise types. Stage 1 classification remains reliable, with accuracy above 82% for all noise types even at 5 dB. Stage 2 localization degrades more markedly, dropping to approximately 60–67% at 5 dB. This is because localization relies on fine-grained waveform features, which are easily submerged by strong noise, whereas classification depends on overall waveform morphology and is less affected. Among the noise types, sensor dropout is tolerated best at low SNRs (67.25% at 5 dB), suggesting that the model’s multi-channel input provides redundancy that mitigates the loss of individual channels.
Although performance degrades at extremely low SNRs, the model degrades gracefully rather than failing catastrophically, and it maintains practical accuracy for SNR ≥ 15 dB. For reference, since the noise-to-signal amplitude ratio equals 10^(−SNR/20), an SNR of 10 dB corresponds to a noise amplitude of approximately 32% of the signal amplitude, while 5 dB corresponds to approximately 56%, meaning the fault waveform is heavily corrupted and barely distinguishable from the background noise. Such extreme noise conditions are rare in practical distribution networks and typically result from severe electromagnetic interference or sensor degradation. The progressive parameter design ensures that the evaluation covers a wide range of interference severities, providing a rigorous stress test of model robustness under worst-case scenarios.
5. Conclusions
To address the difficulty of extracting fault features and the low localization accuracy for adjacent nodes in distribution networks, this paper proposes a two-stage fault diagnosis framework based on MS-CNN and spatio-temporal dual attention. The spatio-temporal dual attention mechanism integrates SimAM spatial screening with Transformer time-series modeling, compensating for the limitations of a single convolutional network in capturing multi-dimensional features.
At the decision-making level of the fault localization stage, a prototype-based metric learning strategy replaces the traditional Softmax classification. The geometry of the feature space is optimized by minimizing the Euclidean distance between samples and their class prototypes, which yields classification boundaries with a safety margin and effectively relieves the bottleneck of fault localization.
The experimental results show that the proposed method achieves 98.61% accuracy in fault classification and 94.22% in fault localization, significantly reducing misjudgment caused by similar electrical distances between adjacent nodes. Cross-scenario validation on a 10 kV distribution network further confirms its generalization capability across different topologies. Moreover, the model maintains reliable diagnostic performance at SNR ≥ 15 dB, with graceful degradation under extremely low-SNR conditions.
Future work will focus on improving the model’s noise robustness under extremely low-SNR conditions, extending the proposed framework to dynamic reconfiguration scenarios, and exploring lightweight deployment schemes suitable for edge computing devices.