1. Introduction
The distribution network, as a critical link connecting the power grid and end-users, directly impacts the reliability of social and economic operations. With the advancement of the “dual carbon” strategy, control and management strategies for distributed energy resources (DERs) have become increasingly complex [1,2]. The high penetration of distributed photovoltaics, combined with the widespread adoption of highly variable loads such as electric vehicles, has transformed traditional unidirectional radial distribution networks into complex active distribution networks (ADNs) characterized by strong source–load interaction. Particularly at the low-voltage distribution level, voltage violation issues caused by high-penetration photovoltaics have emerged as a major control challenge [3,4]. This transformation has significantly increased system vulnerability and operational uncertainty.
In the context of increasingly frequent extreme weather events (e.g., heatwaves) [5] and sharp load surges due to “coal-to-electricity” programs in rural areas, fault patterns in distribution networks now exhibit strong nonlinearity, high randomness, and complex evolution mechanisms [6]. Consequently, keeping pace with emerging trends in active distribution network fault detection [7], the proactive prediction and early identification of fault precursors based on multi-source data have gradually replaced the conventional post-fault emergency repair paradigm, and have become a key strategy for enhancing grid resilience [8].
The core of fault prediction lies in extracting effective features from massive time-series data and inferring evolutionary trends. Early studies mainly relied on physical models or simple signal processing techniques, such as wavelet transform-based methods for early short-circuit fault detection, which performed well in specific frequency band feature extraction [9]. However, these approaches show poor adaptability in active distribution networks with frequently changing topologies. Subsequently, data-driven methods have gradually taken the lead. Among them, shallow machine learning algorithms have been widely applied: some researchers utilized Random Forests with ensemble voting mechanisms to improve fault prediction accuracy [10]; others employed XGBoost combined with Particle Swarm Optimization (PSO) to achieve efficient fault identification [11,12]; and innovative approaches based on “fault gene sequences” and sequence alignment techniques have been proposed for precise state mapping of smart distribution networks [13]. Additionally, improved BP neural networks (e.g., those optimized by Harris Hawks Optimization) have demonstrated good performance in fault prediction considering weather factors [14]. Nevertheless, such shallow models typically rely heavily on manual feature engineering and struggle to capture long-term temporal dependencies, resulting in limited generalization capability when facing massive high-dimensional monitoring data (e.g., protection relays, disturbance recorders) [15].
To extract deep features from time-series data, deep learning approaches have emerged. Notably, deep learning has shown tremendous potential not only in time-series forecasting but also in vision-based detection of critical distribution equipment (e.g., the SGI-YOLOv9 model) [16]. In the time-series domain, recurrent neural networks (RNNs) and their variants, notably Long Short-Term Memory (LSTM), have become mainstream. LSTM effectively mitigates the vanishing gradient problem of traditional RNNs through its gating mechanism and has been widely applied to short-circuit current prediction [17] and early fault probability sequence classification [18]. Although LSTM performs excellently on short- to medium-term sequences, it still suffers from memory bottlenecks and low computational efficiency when dealing with ultra-long-term fault precursors in distribution networks (such as gradual insulation aging or persistent overloading).
In recent years, with the rise of large language models (LLMs) and Transformer architectures, attention-based time-series forecasting has become a research hotspot. Some studies have explored optimized LLMs for insulation fault prediction in distribution networks, demonstrating their great potential in capturing global contextual information [19]. However, while standard Transformers excel at modeling global dependencies, their self-attention mechanisms incur quadratic computational complexity with respect to sequence length. Moreover, they tend to produce a “smoothing effect,” making them insensitive to subtle transient features such as voltage sags and local disturbances [20]. In real distribution networks, faults often originate from minor local perturbations before eventually evolving into system-wide failures.
In summary, existing methods for distribution network fault prediction face a fundamental trade-off, which is the difficulty in capturing instantaneous abrupt changes while simultaneously accounting for long-term evolutionary trends. Standalone LSTM struggles to maintain long-term memory, whereas standard Transformers tend to overlook critical local details. To address this challenge, this paper proposes a hybrid prediction model that integrates Extended LSTM (XLSTM) with Informer, termed XLSTM-Informer. The model consists of two key components: (1) an XLSTM-based local feature encoder that leverages improved exponential gating and matrix memory structures to specifically capture high-frequency, transient fault precursors; (2) an Informer-based global decoder that employs the ProbSparse sparse self-attention mechanism to efficiently infer long-term fault evolution trends. Using real operational data from a regional distribution network, this study particularly evaluates the model’s performance under typical extreme scenarios, including summer peak loads and winter heating spikes, with the aim of providing a high-precision, robust solution for proactive defense of active distribution networks under complex operating conditions.
2. Related Work
To contextualize the contributions of this study, this section critically reviews the evolution of fault prediction methodologies in distribution networks. The existing literature is broadly categorized into three dominant paradigms: traditional deep learning baselines (RNNs/CNNs), long-sequence attention mechanisms (Transformer variants), and emerging hybrid architectures. The strengths and inherent limitations of each category are analyzed below.
RNN- and CNN-based approaches: Traditional deep learning models, such as Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs), have been widely deployed in grid fault diagnosis. LSTM is favored for its gating mechanisms that handle temporal dependencies, while CNNs excel at extracting local spatial features.
Critique: However, these methods face inherent limitations. LSTMs suffer from high memory costs and gradient vanishing problems when processing ultra-long sequences (e.g., continuous weekly monitoring). Similarly, while CNNs capture local fluctuations effectively, their limited receptive field makes them struggle to model the long-range global correlations required for trend prediction.
Transformer and its variants: To address the long-term dependency issue, attention-based models such as the Transformer and Informer have emerged. The Informer, in particular, utilizes a ProbSparse attention mechanism to reduce computational complexity to O(L log L), making it suitable for long-sequence forecasting.
Critique: Despite their success in capturing global trends, standard transformers exhibit a “smoothing effect.” The self-attention mechanism tends to average out high-frequency signals, treating valuable transient fault symptoms (e.g., voltage spikes) as noise. Consequently, they often fail to capture the sharp, instantaneous mutations that are critical for early fault warning.
Hybrid mechanisms and research gap: Recent studies have attempted to fuse RNNs and transformers to combine their strengths. However, most existing hybrid models rely on simple parallel concatenation or lack specialized memory structures for high-frequency signals.
Research Gap: There is a lack of a unified framework that can simultaneously lock onto local transient features (using matrix memory) and model global evolutionary trends (using sparse attention). This paper addresses this gap by proposing the XLSTM-Informer fusion model, specifically designed to balance sensitivity to mutations with long-term forecasting stability.
3. Materials and Methods
3.1. Overall Architecture
To address the challenges of weak fault symptoms and strong randomness in active distribution networks, this paper proposes a novel trend prediction framework named XLSTM-Informer. As illustrated in Figure 1, the framework consists of three integrated modules:
Data Preprocessing Module: Responsible for cleaning raw measurement data, normalizing features, and generating time-series samples via sliding windows.
Local Feature Extraction Module (Encoder): Utilizing the Extended LSTM (XLSTM) to capture instantaneous local variations and short-term dependencies in voltage/current sequences.
Global Trend Inference Module (Decoder): Employing the Informer architecture with ProbSparse self-attention to model long-range dependencies and output multi-step trend predictions.
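As a rough illustration of how the three modules fit together, the sketch below wires a stand-in local encoder and global decoder into one forward pass. The layer choices (a plain LSTM in place of the XLSTM encoder, a single Transformer encoder layer in place of the Informer decoder) and all sizes are illustrative assumptions, not the configuration used in this paper:

```python
import torch
import torch.nn as nn

class XLSTMInformerSketch(nn.Module):
    """Structural sketch of the three-stage pipeline. A plain LSTM stands
    in for the XLSTM encoder and one Transformer encoder layer for the
    Informer decoder; all sizes are illustrative assumptions."""
    def __init__(self, n_features=3, d_model=64, l_out=24):
        super().__init__()
        self.local_encoder = nn.LSTM(n_features, d_model, batch_first=True)
        self.global_decoder = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.head = nn.Linear(d_model, n_features)
        self.l_out = l_out

    def forward(self, x):                # x: (batch, L_in, n_features)
        h, _ = self.local_encoder(x)     # local short-term features
        g = self.global_decoder(h)       # global dependencies via attention
        # Generative-style output: emit all L_out steps in one pass.
        return self.head(g[:, -self.l_out:, :])

y_hat = XLSTMInformerSketch()(torch.randn(8, 96, 3))   # -> (8, 24, 3)
```

Normalized windows from the preprocessing module would be fed in as the `x` tensor; the real model replaces both stand-in layers with the components described in Sections 3.3 and 3.4.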
3.2. Data Acquisition and Processing
The dataset used in this study originates from real-world measurement data of a distribution network in the Tangshan area, Northern China. To ensure model convergence and prediction accuracy, rigorous data processing is required.
(1) Min–Max Normalization: Since the dataset contains multiple electrical quantities (e.g., voltage, current, active power) with varying magnitudes, direct input into the neural network may cause gradient oscillation. Min–max normalization maps all features to the [0, 1] range as follows:

x′ = (x − xmin) / (xmax − xmin)

where x represents the original observed value, and xmin and xmax denote the minimum and maximum values of the feature sequence, respectively.
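The normalization step can be sketched as follows; the function name and the zero-fill fallback for constant sequences are our own choices:

```python
import numpy as np

def min_max_normalize(x):
    """Map a 1-D feature sequence to [0, 1] via min-max scaling."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    # Guard against constant sequences to avoid division by zero.
    if x_max == x_min:
        return np.zeros_like(x)
    return (x - x_min) / (x_max - x_min)

# Example: a short voltage sequence in volts.
v_norm = min_max_normalize([238.0, 240.5, 242.0, 239.0])
```

In practice each electrical quantity (voltage, current, active power) would be scaled independently, with the training-set minima and maxima reused on the test set.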
(2) Sliding Window Construction: To transform the continuous time-series forecasting problem into a supervised learning task, a sliding window strategy is implemented. As shown in Figure 2, the input sequence length is set to Lin (historical horizon), and the prediction sequence length is set to Lout (forecasting horizon):

Input Sequence: X = [x(t − Lin + 1), …, x(t)]
Target Sequence: Y = [x(t + 1), …, x(t + Lout)]

In the experiments, based on the sampling frequency and fault evolution characteristics, Lin = 96 and Lout = 24 are set.
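A minimal sketch of the windowing procedure, assuming a multivariate series stored as a (time, features) array:

```python
import numpy as np

def make_windows(series, l_in=96, l_out=24):
    """Slice a (time, features) series into supervised (input, target) pairs:
    each input covers l_in past steps, each target the next l_out steps."""
    series = np.asarray(series, dtype=float)
    inputs, targets = [], []
    for t in range(len(series) - l_in - l_out + 1):
        inputs.append(series[t : t + l_in])
        targets.append(series[t + l_in : t + l_in + l_out])
    return np.stack(inputs), np.stack(targets)

# Example: 200 samples of 3 electrical features (e.g., V, I, P).
data = np.random.rand(200, 3)
X, Y = make_windows(data)        # X: (81, 96, 3), Y: (81, 24, 3)
```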
Figure 2.
Schematic diagram of the sliding window data segmentation.
3.3. The Local Feature Encoder: XLSTM
Traditional LSTMs suffer from limited storage capacity and gradient decay when processing high-frequency sampling data in distribution networks. To overcome this, the Extended LSTM (XLSTM) is employed as the local feature encoder.
Figure 3 presents the structure of the XLSTM unit with exponential gating and matrix memory.
The XLSTM introduces two key improvements over the standard LSTM:
Exponential Gating: The traditional Sigmoid activation function is replaced by an exponential gating mechanism. This allows for sharper selection of input information, enabling the model to be more sensitive to instantaneous fault symptoms (such as sudden voltage sags) while suppressing background noise.
Matrix Memory: The scalar cell state (ct) is expanded into matrix structures. This significantly increases the memory capacity, ensuring that critical local features are preserved even before entering the global attention module.
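To give a concrete sense of exponential gating, the toy step below follows an sLSTM-style formulation with exponential input/forget gates and a normalizer state. It is a heavily simplified scalar sketch (no output gate, no log-domain stabilizer), not the full XLSTM cell:

```python
import numpy as np

def slstm_step(x, h_prev, c_prev, n_prev, w):
    """One scalar step of an sLSTM-style cell with exponential gating.
    Simplified sketch: no output gate and no log-domain stabilizer."""
    z = np.tanh(w["z"] * x + w["rz"] * h_prev)    # candidate input
    i = np.exp(w["i"] * x + w["ri"] * h_prev)     # exponential input gate
    f = np.exp(w["f"] * x + w["rf"] * h_prev)     # exponential forget gate
    c = f * c_prev + i * z                        # cell state (matrix-valued in XLSTM)
    n = f * n_prev + i                            # normalizer state
    h = c / n                                     # normalized hidden state
    return h, c, n

# A spike in x drives the input gate exponentially, so the new candidate
# dominates the state far more sharply than a sigmoid gate would allow.
w = {"z": 0.5, "rz": 0.1, "i": 1.0, "ri": 0.0, "f": 0.0, "rf": 0.0}
h, c, n = 0.0, 0.0, 0.0
for x in [0.1, 0.2, 3.0, 0.1]:
    h, c, n = slstm_step(x, h, c, n, w)
```

Because the hidden state is a weighted average of the candidates, it stays bounded even though the gates themselves are unbounded exponentials; the full XLSTM additionally stabilizes the gates in log space and replaces the scalar state with the matrix memory described above.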
3.4. The Global Trend Decoder: Informer
After extracting local features, the high-dimensional hidden states are fed into the Informer module to infer future evolutionary trends. The Informer is specifically designed to solve the computational inefficiency of standard Transformers in long-sequence forecasting.
(1) ProbSparse Self-Attention: The standard self-attention mechanism requires O(L²) computational complexity, which is resource-intensive. Informer employs the ProbSparse mechanism, which selects only the “Top-u” queries with the highest dominant correlations to compute attention weights:

A(Q̄, K, V) = Softmax(Q̄Kᵀ / √d) V

where Q̄ represents the sparse query matrix containing only the Top-u dominant queries. This reduces the complexity to O(L log L), allowing the model to efficiently capture long-term dependencies and global periodicity in the distribution network data.
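The Top-u selection can be sketched as below. For clarity this toy version computes the full score matrix before ranking queries, whereas the actual Informer estimates the sparsity measure from a sampled subset of keys to stay within O(L log L); the mean-of-V fallback for the remaining queries follows the Informer design:

```python
import numpy as np

def probsparse_attention(Q, K, V, u=None):
    """Toy ProbSparse self-attention: rank queries by a sparsity measure
    (max minus mean of their key scores), attend fully for the Top-u
    queries, and fall back to mean(V) for the remaining 'lazy' queries."""
    L, d = Q.shape
    u = u if u is not None else max(1, int(np.ceil(np.log(L))))
    scores = Q @ K.T / np.sqrt(d)                      # (L, L) full scores
    sparsity = scores.max(axis=1) - scores.mean(axis=1)
    top = np.argsort(sparsity)[-u:]                    # dominant queries
    out = np.tile(V.mean(axis=0), (L, 1))              # lazy-query fallback
    w = np.exp(scores[top] - scores[top].max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)               # row-wise softmax
    out[top] = w @ V
    return out
```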
(2) Generative Decoder: Different from the step-by-step recursive prediction of RNNs, the Informer uses a generative-style decoder to output the entire prediction sequence (Lout) in one forward step. This avoids the accumulation of prediction errors during multi-step forecasting.
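A toy numerical illustration of why one-shot decoding avoids error accumulation, assuming a unit linear trend and a fixed per-step model bias (both invented for this example):

```python
def recursive_forecast(last, l_out, bias=0.1):
    """Autoregressive decoding: the (biased) prediction is fed back as
    input, so the per-step bias compounds over the horizon."""
    preds, x = [], last
    for _ in range(l_out):
        x = x + 1.0 + bias       # toy model of a unit linear trend
        preds.append(x)
    return preds

def generative_forecast(last, l_out, bias=0.1):
    """One-shot decoding: every horizon step is predicted from the observed
    history, so the bias stays constant instead of accumulating."""
    return [last + k * 1.0 + bias for k in range(1, l_out + 1)]

truth = [k * 1.0 for k in range(1, 25)]                      # ideal unit trend
err_rec = abs(recursive_forecast(0.0, 24)[-1] - truth[-1])   # grows to 2.4
err_gen = abs(generative_forecast(0.0, 24)[-1] - truth[-1])  # stays at 0.1
```

Over a 24-step horizon the recursive error is 24 times the single-step bias, while the one-shot error matches the single-step bias, which is the motivation for the Informer's generative decoder.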
3.5. Experimental Environment and Evaluation Metrics
All experiments were conducted on a workstation equipped with an NVIDIA GeForce RTX 3080 GPU and an Intel Core i9 CPU. The proposed model was implemented using the PyTorch (2.0.0) deep learning framework. The optimization algorithm used is Adam, with an initial learning rate of and a batch size of 32.
To quantitatively evaluate the prediction performance, three standard metrics were selected: Mean Squared Error (MSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE). The formulas are as follows:

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²
MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|
MAPE = (100%/n) Σᵢ |(yᵢ − ŷᵢ) / yᵢ|

where yᵢ represents the actual measured value, ŷᵢ represents the predicted value, and n is the number of samples.
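The three metrics can be computed directly; the MAPE form here reports percent and assumes no zero-valued ground truth:

```python
import numpy as np

def mse(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean((y - y_hat) ** 2))

def mae(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean(np.abs(y - y_hat)))

def mape(y, y_hat):
    # Reported in percent; assumes the ground truth contains no zeros.
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(100.0 * np.mean(np.abs((y - y_hat) / y)))

# Example on a short voltage segment (values in volts).
y_true = [240.0, 238.0, 242.0]
y_pred = [239.0, 239.0, 241.0]
```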
4. Results
4.1. Experimental Data and Seasonal Settings
To accurately capture the transient details of fault evolution, the data acquisition system is configured for high-frequency sampling at 1 min intervals. The dataset covers typical winter and summer load distribution periods and comprises approximately 40,000 multivariate time-series samples, with invalid records filtered out.
In the rural distribution network under study, the low-voltage side voltage at the transformer (head end of the distribution area) is typically maintained around 240 V. This intentionally elevated operating strategy compensates for line losses and ensures voltage stability at the end users, often exceeding the urban grid’s standard threshold of 235 V. This practice is closely related to the characteristics of rural power supply, such as long supply radii and dispersed loads.
In particular, samples with voltages exceeding 245 V, falling below 220 V, or approaching or dropping below 198 V are regarded as critical fault precursors or abnormal states (with 245 V and 220 V serving as the key upper and lower limit references).
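These thresholds can be encoded as a simple labeling rule; treating at or below 198 V as the “critical” boundary, and the band between 220 V and 245 V as acceptable, is our reading of the text:

```python
def classify_voltage(v):
    """Label a voltage sample in volts using the thresholds stated above.
    Treating exactly 198 V as 'critical' is our reading of the text."""
    if v <= 198.0:
        return "critical"          # at or below the deep-undervoltage bound
    if v > 245.0 or v < 220.0:
        return "abnormal"          # outside the 220-245 V acceptable band
    return "normal"

labels = [classify_voltage(v) for v in (240.0, 250.0, 210.0, 195.0)]
```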
To visually illustrate the aforementioned characteristics, Figure 4 presents the typical normal operating waveform of the distribution area. As observed, although the voltage is maintained at a high level of approximately 240 V, the curve remains smooth with no abrupt changes. In contrast, Figures 5 and 6 demonstrate two representative types of fault precursors: Figure 5 shows a rapid voltage drop under heavy load accompanied by high-frequency fluctuations, while Figure 6 depicts an abnormal voltage rise to 250 V, exhibiting a dangerous overvoltage trend. The model proposed in this paper aims to precisely distinguish the normal high voltage in Figure 4 from the abnormal voltage violations shown in Figures 5 and 6.
The experimental validation is conducted using the real-world dataset described in Section 3.2. To comprehensively verify the robustness of the proposed XLSTM-Informer model under complex load profiles, the test set is specifically structured to cover three typical seasonal scenarios in rural distribution networks:
Spring (Transition Season): Characterized by moderate load levels and relatively stable fluctuations.
Summer (Cooling Season): Characterized by persistently high loads due to the continuous operation of large-scale air conditioning equipment.
Winter (Heating Season): Characterized by sharp evening peaks and high volatility, driven by the centralized usage of electric heating devices (following the “coal-to-electricity” conversion policy) in rural areas.
To rigorously evaluate the model’s generalization ability under complex operating conditions, we established strict quantitative selection criteria for “extreme scenarios.” Based on the load characteristics of the distribution network in Northern China and the operational impact of the “Coal-to-Electricity” policy, three typical datasets were extracted. The specific definitions and thresholds are as follows:
- (1) Scenario I: Summer Sustained High Load: This scenario represents the grid status during summer heatwaves, driven by the continuous operation of large-scale air conditioning loads. The primary challenge here is the thermal accumulation in transformers rather than instantaneous volatility. Selection Criteria: Time Window: 10:00 to 16:00 (peak temperature period). Load Threshold: The Load Factor (LF = Pavg/Prated) must exceed 90% continuously. Duration: The high-load state must persist for at least 5 h. Rationale: Sustained high load reduces the thermal margin of the equipment, making the voltage baseline sensitive to minor fluctuations.
- (2) Scenario II: Winter Sharp Peak and High Volatility: This scenario captures the evening peaks in winter, heavily influenced by the “coal-to-electricity” policy. The synchronization of residential electric heating equipment creates sharp load impulses. Selection Criteria: Time Window: 18:00 to 22:00 (residential heating peak). Volatility Threshold (Ramp Rate): The Load Ramp Rate (Rt) must exceed 15% of the rated capacity within a 15 min interval. Rationale: The rapid ramp-up of heating loads causes inductive impact currents, testing the model’s ability to capture instantaneous voltage sags (transient features) rather than just trends.
- (3) Scenario III: Transitional Season (Baseline): This scenario serves as a control group, selected from spring/autumn data where the daily average load factor is between 0.3 and 0.5, with a voltage variance of less than 0.02.
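The scenario selection rules above can be sketched as a filter over a load window that has already been restricted to the relevant time-of-day period; the window lengths follow the stated thresholds (5 h sustained load, 15 min ramp) at the 1 min sampling interval, while the function and label names are our own:

```python
import numpy as np

def label_scenario(load, p_rated, dt_min=1):
    """Tag a load window (already restricted to the relevant time-of-day
    period) with the scenario rules stated above. Thresholds come from the
    text; the windowing arithmetic is our own sketch."""
    load = np.asarray(load, dtype=float)
    lf = load / p_rated                      # per-sample load factor
    # Scenario I: load factor above 90% sustained for at least 5 hours.
    steps_5h = int(5 * 60 / dt_min)
    over = (lf > 0.9).astype(int)
    if len(over) >= steps_5h:
        runs = np.convolve(over, np.ones(steps_5h, dtype=int), mode="valid")
        if (runs == steps_5h).any():
            return "summer_high_load"
    # Scenario II: ramp exceeding 15% of rated capacity within 15 minutes.
    steps_15m = int(15 / dt_min)
    ramps = np.abs(load[steps_15m:] - load[:-steps_15m]) / p_rated
    if (ramps > 0.15).any():
        return "winter_volatile"
    return "transitional"
```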
All models were trained and tested on the same hardware platform (NVIDIA GeForce RTX 3080) as detailed in Section 3.5.
4.2. Overall Performance Comparison
To rigorously evaluate the general trend prediction capability of the proposed model, we conducted a comprehensive benchmark on the complete test dataset. The XLSTM-Informer was compared against mainstream deep learning baselines (CNN, LSTM) as well as state-of-the-art Transformer variants, including Autoformer, FEDformer, and the original Informer. The quantitative results are presented in Table 1.
As shown in Table 1, the proposed model consistently achieves superior performance across all evaluation metrics. Notably, compared to the standard Informer, our model yields a significant reduction in error rates, lowering the MSE by approximately 30.1% and the MAPE by 10.6%. These results provide compelling evidence that the integration of the XLSTM module effectively enhances the capture of local transient details without compromising the global trend modeling advantages inherent to the Informer architecture.
As can be observed from Figure 7, the proposed model achieves satisfactory performance.
4.3. Ablation Studies
To verify the specific contributions of the XLSTM (Local Encoder) and the Informer (Global Decoder), ablation experiments are conducted. The results are shown in Table 2.
The following observations can be made regarding the internal mechanisms of the proposed method:
Contribution of Matrix Memory: Model D (XLSTM-Sigmoid) achieves a lower prediction error compared to Model C (LSTM + Informer). Since both models utilize the standard Sigmoid activation, this improvement suggests that replacing scalar memory cells with matrix memory structures contributes to expanding the state capacity, thereby enabling the model to retain more informative historical patterns under complex grid conditions.
Impact of Exponential Gating: Notably, the proposed Model E (XLSTM-Informer) shows further performance gains over the ablation variant Model D. By employing the exponential gating mechanism instead of the Sigmoid activation, the model effectively mitigates the gradient saturation issue. This modification allows Model E to assign larger weights to instantaneous error gradients, enhancing its sensitivity to transient fault symptoms (e.g., voltage sags) and reducing the prediction lag.
Overall Effectiveness: Consequently, Model E demonstrates the best overall performance across the evaluated metrics. This indicates that the serial fusion strategy effectively couples the high-frequency feature extraction of the XLSTM encoder with the long-term trend modeling of the Informer decoder, offering a competitive and reliable solution for distribution network fault symptom prediction.
4.4. Performance Analysis Under Different Seasonal Load Profiles
This section focuses on the model’s adaptability to the distinct load characteristics of summer and winter, which is critical for practical engineering applications.
Quantitative Analysis (across seasons): Table 3 details the prediction performance across the three seasons. As shown in Table 3, the model maintains high accuracy (MAPE < 3%) even in winter, where load volatility is highest due to the stochastic nature of rural electric heating.
Interpretation of Results:
Inverse Performance Trend: Interestingly, the results exhibit an inverse relationship between load volatility and prediction error. The model achieves its best performance in winter (lowest MSE of 0.0066 and MAPE of 0.45%), despite this season being characterized by “sharp peaks” due to electric heating loads.
Reason for Winter Superiority: This phenomenon highlights the specific advantage of the XLSTM module. Standard LSTMs often struggle with sharp peaks due to gradient saturation. However, the exponential gating mechanism in our proposed model is specifically designed to assign higher importance to these large, instantaneous gradients (the “sharp peaks”). Consequently, the model captures the regular heating cycles in winter more precisely than the “moderate but random” fluctuations observed in spring.
Overall Stability: The comparison confirms that the proposed method is not only accurate under stable conditions (spring) but becomes increasingly effective when handling complex, high-load scenarios (summer and winter), demonstrating strong robustness against load mutations.
5. Discussion
5.1. Mechanism Analysis of Model Superiority
The experimental results in Section 4 demonstrate that the proposed XLSTM-Informer model consistently outperforms traditional deep learning methods across various metrics. This superiority can be attributed to the complementary nature of its two core components:
Solving the Long-Term Dependency Problem: In the “summer” scenario (Figure 5), the load exhibits a continuously high level due to air conditioning. Traditional RNN-based models (like LSTM) suffer from memory forgetting in such long sequences, often failing to maintain the high-load trend prediction. The Informer module, with its ProbSparse self-attention mechanism, effectively captures these global dependencies, ensuring the prediction curve does not drift over time.
Overcoming the “Smoothing Effect”: A common drawback of standard transformer models is their tendency to produce smooth outputs, acting like a low-pass filter that ignores high-frequency mutations. This is fatal for fault prediction. In the “winter” scenario, where electric heating causes sudden load spikes, the XLSTM module plays a crucial role. Its exponential gating mechanism acts as a high-sensitivity trigger, allowing the model to respond instantly to these abrupt changes, thereby capturing potential fault symptoms that other models miss.
5.2. Practical Value for Active Warning in Rural Grids
The study utilized data from a rural distribution network, which presents unique challenges compared to urban grids, such as weaker infrastructure and higher load volatility due to the “coal-to-electricity” policy.
Robustness Across Seasons: As shown in Table 3, the model maintains high accuracy (MAPE < 3%) in both the high-load summer and the volatile winter. This indicates that the model is robust enough to be deployed in real-world environments with changing seasonal patterns without frequent retraining.
Transition from Passive to Active O&M: Traditionally, distribution network maintenance relies on post-fault repair. The proposed model provides accurate multi-step trend prediction (covering the next few hours). This capability allows grid operators to identify potential overloads or voltage violations before they occur, enabling proactive measures such as load shedding or voltage regulation. This shifts the operational paradigm from “passive defense” to “active warning.”
5.3. Limitations and Future Work
Despite the promising results in fault symptom trend prediction, this study has certain limitations that point towards future research directions. First, the current model focuses on the time-series forecasting of electrical quantities (voltage, current). While it can successfully predict future trends and identify potential deviations (early warnings) based on prediction errors, it does not yet possess the capability to autonomously diagnose the specific type of fault (e.g., single-phase grounding, inter-phase short circuit, or high-impedance fault). In practical engineering, after an early warning is triggered, operators need to know not only “that something is wrong” but also “what exactly is wrong” to take targeted measures.
Therefore, future work will focus on rapid and accurate fault classification based on the predicted trends. Specifically, we propose a “Two-Stage Predict-and-Diagnose” framework with the following detailed designs:
Connection Scheme (Serial Fusion Strategy): We plan to construct a serial pipeline where the proposed XLSTM-Informer acts as the upstream “symptom predictor.” Crucially, we will adopt a feature-level fusion strategy rather than simple data transmission. Input Interface: The high-precision waveform sequence generated by the model will serve as the primary input. Latent Feature Sharing: To enhance information density, the high-dimensional hidden state vectors extracted by the XLSTM encoder—which contain rich historical volatility patterns—will be concatenated with the predicted sequence and fed into the downstream classifier.
Classification Network Architecture: Instead of generic classifiers, we intend to design a Multi-Scale Temporal Convolutional Network (TCN) with Attention Mechanism. Structure: The network will utilize dilated causal convolutions with varying kernel sizes to extract features from the predicted trajectories at different time scales. Mechanism: An attention layer will be integrated to automatically assign weights to critical time steps (e.g., the exact moment of a voltage sag), enabling the system to distinguish fine-grained fault signatures.
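A minimal sketch of the envisioned multi-scale TCN with attention pooling is given below; all sizes, the three dilation rates, and the four-class head are illustrative assumptions for this future-work design, not a finished implementation:

```python
import torch
import torch.nn as nn

class TinyTCNClassifier(nn.Module):
    """Sketch of the envisioned classifier: parallel dilated causal
    convolutions at several scales, attention pooling over time steps,
    then a fault-type head. Sizes are illustrative assumptions."""
    def __init__(self, n_features=3, n_classes=4, channels=16):
        super().__init__()
        # Parallel dilated causal branches capture patterns at different scales.
        self.branches = nn.ModuleList([
            nn.Conv1d(n_features, channels, kernel_size=3,
                      padding=2 * d, dilation=d)
            for d in (1, 2, 4)
        ])
        self.attn = nn.Linear(3 * channels, 1)   # per-time-step attention score
        self.head = nn.Linear(3 * channels, n_classes)

    def forward(self, x):                        # x: (batch, T, n_features)
        x = x.transpose(1, 2)                    # -> (batch, n_features, T)
        T = x.shape[-1]
        # Causal trim: keep only the first T outputs of each padded conv.
        feats = torch.cat([torch.relu(b(x))[..., :T] for b in self.branches], dim=1)
        feats = feats.transpose(1, 2)            # (batch, T, 3*channels)
        w = torch.softmax(self.attn(feats), dim=1)
        pooled = (w * feats).sum(dim=1)          # attention-weighted pooling
        return self.head(pooled)                 # fault-type logits

logits = TinyTCNClassifier()(torch.randn(2, 24, 3))   # -> (2, 4)
```

In the proposed pipeline, the input would be the predicted trajectory (optionally concatenated with the XLSTM hidden states), and the attention weights would highlight critical time steps such as the onset of a voltage sag.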
Fast Pre-fault Diagnosis: By utilizing the multi-step prediction capability of the current model, this cascading system can analyze the predicted future data rather than waiting for the fault to fully develop. This aims to achieve “pre-fault diagnosis,” enabling the system to identify the fault type and isolate the faulty section faster and more accurately before the protection relay trips.