5.1. HIF Simulation System
The model performance was validated through simulations conducted in PSCAD/EMTDC 4.6.2. The simulation model is shown in Figure 4.
In the figure, solid lines are used to represent cable lines, while dashed lines are employed to denote overhead lines. The parameter l is defined as the length of the overhead line. F1–F23 represent the predefined fault locations. Zero-sequence current transformers (CTs) are installed at the beginning of each feeder.
The Emanuel model [39], whose structure is depicted in Figure 5, was utilized as the HIF model in this study. The model is composed of two anti-parallel branches, each consisting of a diode, a voltage source, and a resistor. To simulate the random fluctuation characteristics of arc voltage and arc resistance observed in real faults, the parameters
,
,
, and
were set to fluctuate randomly over time within a range of
of their predefined central values. Different grounding medium scenarios can be simulated by adjusting these parameters.
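The behavior of the two anti-parallel branches can be illustrated with a minimal numerical sketch. It is not the PSCAD/EMTDC implementation used in this study: the function name, the argument names (`vp`, `vn`, `rp`, `rn`), the ideal-diode assumption, the uniform distribution, and the `jitter` fraction are all assumptions made for illustration only.

```python
import numpy as np

def emanuel_hif_current(v, vp, vn, rp, rn, jitter=0.1, rng=None):
    """Fault current of a two-branch diode/source/resistor HIF model (sketch).

    v      : array of fault-point voltages (V)
    vp, vn : central DC source voltages of the positive/negative branch (V)
    rp, rn : central branch resistances (ohm)
    jitter : hypothetical fractional fluctuation range (0.1 means +/-10 %)
    """
    rng = np.random.default_rng() if rng is None else rng
    v = np.asarray(v, dtype=float)

    def fluctuate(center):
        # Independent draw per time step, uniform around the central value.
        return center * (1.0 + rng.uniform(-jitter, jitter, size=v.shape))

    vp_t, vn_t = fluctuate(vp), fluctuate(vn)
    rp_t, rn_t = fluctuate(rp), fluctuate(rn)

    i = np.zeros_like(v)
    pos = v > vp_t    # positive branch conducts (ideal diode assumed)
    neg = v < -vn_t   # negative branch conducts (ideal diode assumed)
    i[pos] = (v[pos] - vp_t[pos]) / rp_t[pos]
    i[neg] = (v[neg] + vn_t[neg]) / rn_t[neg]
    return i
```

Between the two source voltages neither branch conducts, which produces the zero-crossing distortion typical of HIF currents; unequal central values for the two branches produce the asymmetry between half-cycles.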
For the dataset generation, the sampling rate was set to 2133 Hz. Fault incidents were simulated across a phase angle range of 0–180 degrees and a resistance range of 500–1500 Ω. High-impedance grounding faults, inrush currents, capacitor switching events, and load switching events were configured at fault points F1–F23. A total of 8000 samples were generated, with a 1:1 ratio between fault data and non-fault data. The dataset was partitioned into 3000 samples for the training set, 3000 samples for the validation set, and 2000 samples for the test set. The first 150 sampling points from each signal were used as the raw input.
5.1.1. Validation of Wavelet Decomposition Effectiveness
The wavelet decomposition module serves as the first component of the model and holds significant importance. To validate its effectiveness, a comparative analysis was conducted between using the original time-domain signals directly as input and utilizing signals processed by the MLDWT. The performance differences between these two input representation methods were systematically evaluated.
In terms of network architecture, a symmetric encoder–decoder structure with two hidden layers is adopted for all SDAEs. The encoder consists of two fully connected layers with 128 and 64 hidden units, respectively. The decoder mirrors the encoder’s architecture, with the output dimensionality kept consistent with that of the input. During the training phase, white Gaussian noise (WGN) with an SNR of 15 dB is added to the input data to construct noisy samples.
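The construction of noisy samples at a prescribed SNR can be sketched as follows. This is a minimal illustration that assumes the SNR is defined per sample as the ratio of mean signal power to noise power; the function name `add_wgn` is hypothetical and not taken from the original implementation.

```python
import numpy as np

def add_wgn(signal, snr_db, rng=None):
    """Add zero-mean white Gaussian noise so that the result has the given SNR (dB)."""
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal ** 2)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise
```

Under this definition, the 15 dB training condition corresponds to a noise power of roughly 3.2% of the signal power.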
To compare the effectiveness of different input features, two input modalities have been defined:
Wavelet Transform: A 5-level decomposition is performed on the raw signal using the db1 wavelet basis, resulting in six sub-band components (D1–D5 and A5). Each component is reconstructed via inverse transformation into sequences matching the length of the original signal, and these are then used as multi-channel inputs.
Raw Input: The original one-dimensional time-series signal is directly used as the input.
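The wavelet input modality can be sketched in a few lines for the db1 (Haar) basis. For this basis, the reconstructed approximation at level j equals the mean over aligned blocks of 2^j samples, and each reconstructed detail band is the difference between two consecutive approximations. The edge padding used to handle the 150-point length and the function name `haar_subbands` are assumptions; a wavelet library may treat the signal boundaries differently.

```python
import numpy as np

def haar_subbands(x, levels=5):
    """Return full-length Haar (db1) sub-bands stacked as [D1, ..., D_levels, A_levels]."""
    x = np.asarray(x, dtype=float)
    n = x.size
    block = 2 ** levels
    padded_len = -(-n // block) * block  # round up to a multiple of 2**levels
    padded = np.pad(x, (0, padded_len - n), mode="edge")

    details = []
    approx_prev = padded  # A_0 is the signal itself
    for j in range(1, levels + 1):
        width = 2 ** j
        # Haar approximation at level j: mean over aligned blocks of 2**j samples.
        means = padded.reshape(-1, width).mean(axis=1)
        approx_j = np.repeat(means, width)
        details.append(approx_prev - approx_j)  # D_j = A_{j-1} - A_j
        approx_prev = approx_j

    bands = np.stack(details + [approx_prev])  # shape: (levels + 1, padded_len)
    return bands[:, :n]                        # crop back to the original length
```

For a 150-point input, the result has shape (6, 150), and the six channels sum back to the original signal, so no information is lost by the multi-channel representation.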
To ensure fairness in the comparative experiments, an identical min–max normalization strategy is applied at both the sample level and the channel level for both input modalities. The loss function for all models is defined as the MSE, and the Adam optimizer is employed for training, with an initial learning rate set to 0.001. The training and validation batch sizes are set to 256 and 512, respectively. The maximum number of training epochs is fixed at 1000, with early stopping enabled (patience = 100). All training hyperparameters, including network initialization, random seed, and learning rate scheduling, were strictly maintained at identical settings across all models, with the exception of the input modality.
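The normalization step can be sketched as below, assuming that "sample level and channel level" means each channel of each sample is scaled independently to [0, 1]. The function name and the small `eps` guard against constant channels are illustrative assumptions.

```python
import numpy as np

def minmax_per_channel(batch, eps=1e-12):
    """Scale every (sample, channel) sequence to [0, 1] along the time axis.

    batch: array of shape (samples, channels, length); a raw input can be
    passed with a single channel, i.e. shape (samples, 1, length).
    """
    batch = np.asarray(batch, dtype=float)
    lo = batch.min(axis=-1, keepdims=True)
    hi = batch.max(axis=-1, keepdims=True)
    return (batch - lo) / (hi - lo + eps)
```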
A comparative analysis of the denoising performance was conducted between the standard SDAE with raw signal input and the Channel-wise SDAE (C-SDAE) with wavelet-transformed input. The comparative results of the MSE for the two methods under different SNR levels are presented in Figure 6. Furthermore, the training loss curves for both approaches are shown in Figure 7.
The results demonstrate that the C-SDAE achieves significantly lower MSE values across all signal-to-noise ratio (SNR) conditions compared to the conventional SDAE. Specifically, the average MSE is reduced from p.u. to p.u., representing a decrease of approximately , which indicates a substantial performance improvement. This enhancement is most pronounced under high-noise conditions (), where the MSE decreases from p.u. to p.u. Meanwhile, the C-SDAE maintains a consistent performance advantage in the medium-to-low-noise conditions (15–). The training loss curves further reveal that the C-SDAE surpasses the conventional SDAE in both convergence speed and final loss level. On both the training and validation sets, the C-SDAE exhibits faster loss reduction and stabilizes at a lower value. These observations indicate that the multi-scale time–frequency information provided by the MLDWT effectively enhances the time–frequency resolution during feature extraction. This enables the network to better distinguish between useful signal components and noise, thereby significantly improving the overall denoising performance and enhancing the model’s robustness and feature representation capability. Consequently, the necessity of the wavelet decomposition module within the overall framework is successfully validated.
5.1.2. Interpretability of the Attention Mechanism
To validate the adaptive capability and interpretability of the cross-channel attention mechanism under different noise levels, the attention weight distributions across wavelet channels of the channel-wise attention stacked denoising autoencoder (CA-SDAE) model were statistically analyzed at three specific SNRs:
,
, and
. The resulting distributions are illustrated in
Figure 8, Figure 9, and Figure 10, with the corresponding average distribution values provided in Table 1. The analysis indicates that the attention weights undergo adaptive redistribution in response to varying noise intensities, and the observed variation pattern demonstrates clear physical significance.
Under high-noise conditions (), the attention is concentrated on the low-frequency detail channels D5 (0.237) and D4 (0.209), whereas substantially smaller weights are allocated to the high-frequency channel D1 (0.113) and the approximation component A5 (0.108). Under strong noise interference, the model thus emphasizes the low-frequency detail components, where signal energy is concentrated and noise contamination is weaker, which stabilizes the reconstruction, while suppressing the high-frequency components that are easily corrupted by noise.
Under medium-noise conditions (), the model maintains its preference for the low-frequency channels D5 (0.207) and D4 (0.194), while the weights assigned to the medium- and high-frequency channels D3 (0.173), D2 (0.165), and D1 (0.164) increase significantly. As noise interference diminishes, the model increasingly leverages finer-grained high-frequency features to enhance reconstruction quality, shifting from a “robustness-first” to a “detail-recovery” priority.
Under low-noise conditions (), the attention distribution is fundamentally shifted, with weights being significantly tilted toward the high-frequency detail channels D2 (0.241) and D1 (0.209). Compared to the high-noise conditions (), increases of and are observed for D1 and D2, respectively, whereas decreases of and are recorded for D5 and D4. This distribution confirms that under low-noise conditions, high-frequency details can be fully utilized by the model to recover the fine-grained structure of the signal.
The above analysis demonstrates that the proposed channel-wise attention mechanism can adaptively adjust the weight allocation according to the noise level. Its behavioral pattern, which shows reliance on low-frequency components under strong noise conditions and utilization of high-frequency components under weak noise conditions, is shown to align with physical intuition, thereby validating both the interpretability and adaptability of the attention mechanism.
5.1.3. Validation of Energy Proportion Guidance Effectiveness
Although the data-driven attention mechanism described above shows a promising adaptive trend, its fully data-driven learning may introduce distributional biases in certain complex noise scenarios, so that the model’s focus becomes misaligned with the actual energy characteristics of the signal. To address this limitation, an energy proportion guidance module is introduced in this work, which uses the energy distribution from the original domain as a physical prior to constrain the attention weights. By incorporating an energy target distribution term into the loss function, the energy-proportion-based channel-wise attention stacked denoising autoencoder (EPCA-SDAE) model can automatically rectify unreasonable weight biases during training, thereby ensuring that the allocation results better conform to physical principles.
In this section, four categories of methods were compared:
C-SDAE: The conventional SDAE without any attention or energy guidance mechanisms, serving as the baseline model.
CA-SDAE: The cross-channel attention is driven solely by data, autonomously learning the weights.
EPCA-SDAE: The global average relative energy is directly adopted as fixed channel weights, with specific values provided in Table 2.
EPGCA-SDAE: The wavelet channel energy distribution is estimated in the original domain for each sample, serving as a guidance term for attention learning, thereby achieving sample-wise dynamic constraints that preserve self-learning capability while preventing deviation from physical priors.
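The difference between a fixed energy prior and sample-wise guidance can be made concrete with a small sketch. The relative channel energy follows directly from the description above; the form of the guidance term, however, is not specified in this section, so the Kullback–Leibler divergence, the weighting factor `lam`, and both function names are assumptions used purely for illustration.

```python
import numpy as np

def energy_proportions(bands, eps=1e-12):
    """Relative energy of each wavelet channel of one sample.

    bands: array of shape (channels, length); the result sums to (almost) 1.
    """
    energy = np.sum(np.asarray(bands, dtype=float) ** 2, axis=-1)
    return energy / (energy.sum() + eps)

def guided_loss(recon, target, attention, bands, lam=0.1, eps=1e-12):
    """Reconstruction MSE plus a penalty pulling attention toward the energy prior.

    The KL form of the penalty is an assumption, not the loss of the original work.
    """
    mse = np.mean((np.asarray(recon, dtype=float) - np.asarray(target, dtype=float)) ** 2)
    p = energy_proportions(bands)            # sample-wise physical prior
    a = np.asarray(attention, dtype=float)   # attention weights, assumed to sum to 1
    kl = np.sum(p * (np.log(p + eps) - np.log(a + eps)))
    return mse + lam * kl
```

In this sketch, setting `lam = 0` recovers a purely data-driven objective in the spirit of CA-SDAE, whereas replacing the learned `attention` by `energy_proportions` averaged over the whole dataset would correspond to the fixed-weight idea behind EPCA-SDAE.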
The denoising performance of the four methods is compared in
Table 3 using the test set contaminated with noise levels ranging from 5–
, where the MSE (
) p.u. is employed as the evaluation metric. The baseline C-SDAE, which directly processes multi-level wavelet-decomposed signals without any attention or energy guidance, shows limited denoising capability, particularly under low-SNR conditions, while maintaining reasonable performance as the noise level decreases. The purely data-driven CA-SDAE performs poorly under low-SNR conditions, with an MSE of 282.28 recorded at
and an overall average MSE of 75.82, indicating that the unconstrained attention mechanism is prone to deviation. By introducing a fixed energy prior, the EPCA-SDAE establishes a physical basis for weight allocation, resulting in a significant reduction of the average MSE to 52.85. In contrast, the proposed EPGCA-SDAE achieves the best performance under all tested conditions, with the average MSE further reduced to 50.60. This corresponds to reductions of
,
, and
over the C-SDAE, CA-SDAE, and EPCA-SDAE, respectively. Notably, the advantage of the EPGCA-SDAE is particularly pronounced in high-noise scenarios, demonstrating that the sample-wise energy-proportion guidance mechanism can substantially enhance model robustness in complex noise environments.
Further analysis from the perspective of ACC provides clearer insight into model performance across different noise levels. For the detection task, an LSTM model [40,41] is employed, which is configured with a single LSTM layer containing 50 hidden units. The tanh activation function is utilized in the LSTM layer, while the output layer is implemented as a fully connected layer with a sigmoid activation function. During the training process, the Adam optimizer is adopted with a learning rate of 0.001, and the binary cross-entropy is chosen as the loss function. The batch size is set to 1024, and the number of iterations is specified as 300. An early stopping strategy is applied with a patience of 50. The dataset is partitioned into training, validation, and test sets according to a 3:4:3 ratio. As shown in
Table 4, the accuracy of the original noisy signal under high-noise conditions (
) is only
, indicating that strong noise severely corrupts signal characteristics. After applying different denoising methods, the baseline C-SDAE, which directly processes wavelet-decomposed signals without attention or energy guidance, achieves
accuracy, already providing a substantial improvement over the raw noisy input. The purely data-driven CA-SDAE and fixed-prior EPCA-SDAE further improve accuracies to
and
, respectively, while the proposed EPGCA-SDAE attains the highest accuracy of
. As the SNR increases, the accuracy of all methods gradually approaches
, with EPGCA-SDAE consistently maintaining the best performance across all test points. For instance, under low-noise conditions (
), EPGCA-SDAE achieves an accuracy of
, outperforming C-SDAE (
), CA-SDAE (
), and EPCA-SDAE (
). Similarly, at
, EPGCA-SDAE again attains the highest accuracy of
. Overall, average accuracies of
,
, and
are recorded for C-SDAE, CA-SDAE, and EPCA-SDAE, respectively. Although EPCA-SDAE incorporates a fixed energy prior to ensure physical consistency in weight distribution, its lack of sample-wise adaptation results in slightly lower overall accuracy. In contrast, EPGCA-SDAE further improves the average accuracy to
, representing increases of
,
, and
percentage points over C-SDAE, CA-SDAE, and EPCA-SDAE, respectively. This demonstrates that the sample-wise energy proportion guidance mechanism not only ensures physical consistency but also preserves adaptive capability in complex noise scenarios, thereby significantly enhancing both accuracy and stability of classification.
The distributions of attention weights are shown in Figure 11, Figure 12, and Figure 13. The weight distribution of CA-SDAE is characterized by strong randomness and near-uniformity, with insignificant differences observed among channels, making it difficult to highlight dominant channels. Furthermore, only a weak correlation is found with the physical characteristics of the signal, indicating the lack of a clear physical basis for its weight allocation. Although EPCA-SDAE employs a fixed global energy distribution as attention weights, introducing a degree of physical consistency, its adaptive capability is completely lost when dealing with different sample characteristics. In contrast, the proposed EPGCA-SDAE introduces a sample-wise dynamic correction mechanism while respecting the prior ordering of channel energy. This approach not only avoids the tendency toward complete equalization but also breaks through the limitations of fixed templates, thereby achieving a balance between global constraints and local adaptation.
5.1.4. Performance of Denoising Models
To comprehensively evaluate the performance of the proposed EPGCA-SDAE, it is compared in this section with traditional wavelet denoising, the classical SDAE, the DnCNN, and a Transformer-based method. The parameter settings for each method are configured as follows: the wavelet method employs the soft-thresholding strategy proposed in [20]; the SDAE structure remains consistent with that described in Section 5.1.1; the DnCNN adopts a one-dimensional convolutional residual learning framework with a depth of 17 layers, 64 channels per layer, a kernel size of 3, and a batch size of 256; the Transformer denoiser utilizes a 4-layer TransformerEncoder with a hidden dimension of 64, four attention heads, a feed-forward layer dimension of 256, and a batch size of 1024. All models in this section are trained using
noise and tested at SNR levels of 5, 10, 15, 20, 30, and
, with the MSE (
) p.u. employed as the evaluation metric.
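For the wavelet baseline mentioned above, the soft-thresholding operation itself can be sketched as follows. The operator shrinks each coefficient toward zero by the threshold. The threshold selection shown here (universal threshold with a median-based noise estimate) and the single-level Haar transform are assumptions chosen for brevity and do not necessarily match the rule or decomposition depth used in [20].

```python
import numpy as np

def soft_threshold(coeffs, thr):
    """Shrink coefficients toward zero by thr; magnitudes below thr become zero."""
    coeffs = np.asarray(coeffs, dtype=float)
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - thr, 0.0)

def haar_denoise_one_level(x):
    """Single-level Haar soft-threshold denoising of an even-length signal (sketch)."""
    x = np.asarray(x, dtype=float)
    if x.size % 2 != 0:
        raise ValueError("this sketch expects an even number of samples")
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    sigma = np.median(np.abs(detail)) / 0.6745     # assumed noise estimate (MAD)
    thr = sigma * np.sqrt(2.0 * np.log(x.size))    # assumed universal threshold
    detail = soft_threshold(detail, thr)
    out = np.empty_like(x)
    out[0::2] = (approx + detail) / np.sqrt(2.0)   # inverse Haar step
    out[1::2] = (approx - detail) / np.sqrt(2.0)
    return out
```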
As can be observed from Table 5, significant differences in performance are demonstrated among the various methods.
The traditional wavelet soft-thresholding approach is shown to provide moderate noise suppression under high-noise conditions and satisfactory performance under low-noise conditions. However, an overall elevated error level is maintained, with an average MSE of 179.76.
The SDAE is observed to perform adequately near its training noise level, but severe performance degradation is exhibited at higher noise levels, while performance saturation is encountered at lower noise levels. This indicates that the feature representations learned by the SDAE are rendered overly sensitive to noise variations, resulting in limited generalization capability and poor robustness.
The DnCNN demonstrates a strong preference for local features, achieving optimal performance at and . However, when the test noise levels deviate from the training distribution, its limited receptive field is found inadequate for modeling global noise structures, leading to significant performance deterioration. Complete failure is observed under high-noise conditions, confirming its inability to effectively extrapolate to unseen noise distributions.
The Transformer method excels at global dependency modeling. Consequently, superior robustness compared to DnCNN is demonstrated, particularly under high-noise conditions not encountered during training, where global information is required for effective noise suppression. However, under low-noise conditions where the primary task shifts to recovering fine local details, potential limitations in capturing such localized information efficiently are revealed, leading to performance plateaus.
The proposed EPGCA-SDAE, under identical training conditions, exhibits exceptional robustness and generalization capability. Across all SNR levels, significantly lower MSE values are achieved compared to all baseline models. A smooth, monotonically improving performance curve is observed as the noise level decreases, indicating that the true physical distribution is effectively learned by the model rather than merely memorizing the training distribution. Ultimately, its average MSE of 50.60 is substantially lower than those of other methods, providing compelling evidence for the superior and stable denoising effectiveness of the proposed architecture in unknown noise environments.
In addition to denoising effectiveness, the computational efficiency and real-time feasibility of the proposed framework were quantitatively evaluated. All experiments were conducted on a workstation equipped with an Intel® Core™ i9-13900HX CPU (Intel Corp., Santa Clara, CA, USA), 96 GB of DDR5 5200 MHz RAM, and an NVIDIA® GeForce RTX 4090 Laptop GPU with 16 GB of VRAM (NVIDIA, Santa Clara, CA, USA).
Under this configuration, the MLDWT module was observed to require an average processing latency of ms per sample, whereas the subsequent multi-channel attention and SDAE inference pipeline required 0.041 ms per sample. Consequently, an overall end-to-end inference latency of ms per sample was achieved, confirming that the proposed EPGCA-SDAE can operate well within real-time constraints.
To ensure fair comparison, identical batch sizes and data settings were applied to all baseline models. As presented in Table 6, the inference efficiency of the proposed EPGCA-SDAE was found to be comparable to that of the conventional SDAE (0.070 ms/sample) and DnCNN (0.154 ms/sample), slightly higher than the traditional wavelet thresholding method (0.084 ms/sample), while remaining significantly faster than the Transformer-based denoiser (0.435 ms/sample). These results demonstrate that, despite the inclusion of multi-stage physical–data fusion mechanisms, the overall architecture remains computationally lightweight and well suited for real-time deployment in distribution-network fault-detection systems.
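A per-sample latency figure of this kind can be obtained with a simple timing loop such as the one below. The sketch assumes a CPU-side callable `fn` that processes a whole batch; the function name, the warm-up count, and the number of repetitions are illustrative and are not the measurement protocol of this study.

```python
import time
import numpy as np

def latency_ms_per_sample(fn, batch, repeats=20, warmup=3):
    """Average wall-clock time of fn(batch) in milliseconds per sample."""
    for _ in range(warmup):
        fn(batch)  # warm-up runs are excluded from the measurement
    start = time.perf_counter()
    for _ in range(repeats):
        fn(batch)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / (repeats * len(batch))

if __name__ == "__main__":
    demo_batch = np.random.default_rng(0).normal(size=(256, 150))
    # Placeholder workload standing in for a denoising model.
    print(latency_ms_per_sample(lambda b: np.cumsum(b, axis=-1), demo_batch))
```

When the model runs on a GPU, the device would additionally need to be synchronized before each clock reading; otherwise only the kernel launch time is measured.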
5.2. Realistic System
HIF experiments were conducted in a
distribution network laboratory, from which measured data were obtained. The laboratory equipment is shown in Figure 14, and the line topology is illustrated in Figure 15. A CT sampler was installed at the beginning of each line. Multiple short-circuit points were introduced during the experiments to ensure data diversity. A total of 600 experimental samples were collected, with high-impedance grounding media including turf, cement brick, red brick, and sand. Additionally, normal scenarios with waveforms similar to HIFs, such as inrush current, capacitor switching, and load switching, were incorporated. The dataset therefore consisted of 600 samples with a 1:1 ratio between fault and non-fault data, which were split into 200 samples for training, 200 for validation, and 200 for testing. The first 150 sampling points were used as the raw input. The EPGCA-SDAE architecture remained consistent with that described in Section 5.1.1. To further examine the model’s stability and generalization under limited real-world data, a five-fold cross-validation accompanied by a guidance strength sensitivity analysis was additionally performed. The guidance strength was determined by the approximate ratio between the energy-guided loss and the reconstruction MSE loss within the total objective, corresponding roughly to no guidance (=0%), weak guidance (≈1%), moderate guidance (≈10%), and strong guidance (≈90%). The results, summarized in Table 7, demonstrate that the model remains stable across folds and achieves optimal denoising performance under moderate guidance, confirming that the proposed physically guided attention mechanism maintains robustness even under small-sample conditions.
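The fold construction behind a five-fold cross-validation can be sketched as below. The shuffling seed, the function name, and the absence of class stratification are assumptions; the original experiments may have partitioned the 600 samples differently.

```python
import numpy as np

def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)  # shuffle once, then cut into k folds
    folds = np.array_split(order, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx
```

With 600 samples and k = 5, each fold holds out 120 samples and trains on the remaining 480; the sensitivity analysis would then repeat this loop once for each guidance strength.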
A fundamental difference in distribution patterns between the two mechanisms can be clearly observed through the comparison of Figure 16 and Figure 17. The attention distribution of CA-SDAE is characterized by near-equal weighting across channels, with no dominant channels identified. This indicates that when relying solely on data-driven learning, only an “averaged” pattern is acquired by the model, failing to capture the physical imbalance in energy distribution. In contrast, the heatmap of EPGCA-SDAE demonstrates highly consistent and focused characteristics, with significantly elevated weights in channels D5 and D6 stably maintained across all samples. This cross-sample stability confirms that the physical prior has been successfully injected into the model through the energy-proportion-guidance mechanism, providing a clear physical basis and enhanced robustness for attention allocation. Consequently, the “defocused” attention problem inherent in purely data-driven models is fundamentally resolved, transforming the model into a signal processor capable of precisely focusing on critical information according to physical principles. This improvement is directly reflected in the performance metrics: the average MSE is reduced from 113.51 in CA-SDAE to 76.45 in EPGCA-SDAE.