1. Introduction
As the core equipment of power systems, the operational state of power transformers is crucial to grid stability and safety. However, traditional monitoring methods such as Dissolved Gas Analysis (DGA) and manual inspection suffer from response delays and low efficiency, and struggle to meet the real-time demands of modern grids [1]. With advances in artificial intelligence (AI), machine learning and deep learning models have been introduced for transformer fault diagnosis. While models such as Support Vector Machines (SVM) and Random Forests (RF) perform well on structured data classification, they exhibit limitations in handling high-dimensional time-series data [2]. Recently, the Transformer model, whose attention mechanism captures long-range dependencies, has shown exceptional performance in anomaly detection tasks [3,4]. Advances in multi-source heterogeneous data fusion have further driven transformer monitoring from single sensors toward integrated IoT devices and intelligent inspection robots, laying the foundation for real-time data analysis [2,5].
Current research on predictive maintenance for power transformers has made progress under the impetus of artificial intelligence (AI) and Internet of Things (IoT) technologies. However, prominent problems remain at the data, model, and system levels. Firstly, data scarcity and heterogeneity severely constrain model generalizability. The literature indicates that while early fusion of multi-source data (e.g., Dissolved Gas Analysis (DGA), vibration signals, and thermal imaging) can enhance diagnostic accuracy, sample imbalance and noise interference complicate feature extraction. The particular scarcity of fault data heightens the risk of overfitting in supervised learning models such as deep neural networks [6]. The literature further shows that traditional threshold methods, reliant on fixed rules, are ill-suited to dynamic operating environments; in contrast, data-driven models (e.g., Random Forest and Long Short-Term Memory networks) require substantial annotated data, a requirement often hampered by high labeling costs in practical applications [7]. Secondly, the disconnection between model interpretability and physical mechanisms is a pronounced issue. The literature emphasizes that while black-box AI models (e.g., Convolutional Neural Networks) achieve high accuracy, they cannot elucidate causal relationships, such as that between vibration propagation paths and insulation ageing, thereby diminishing operational trust [8]. Concurrently, the literature points out that multi-modal fusion models, exemplified by hybrid Transformer-Gated Recurrent Unit-Convolutional Neural Network architectures, model nonlinear coupling effects (e.g., electromagnetic-thermal-mechanical stress) insufficiently, rendering them prone to false alarms during load fluctuations [9]. Furthermore, the lack of standardization exacerbates system heterogeneity. The literature notes that data from infrared and temperature sensors are difficult to align due to protocol disparities (e.g., Modbus vs. MQTT) [10]. It is also noted that although Physics-Informed Neural Networks (PINNs) introduce constraints based on heat conduction equations, their boundary condition settings rely on empirical parameters, which limits transferability across different transformer models [11].
Research challenges are concentrated in the dimensions of technical integration, real-time performance, and security. One significant challenge is balancing the real-time processing demands of multi-modal data fusion with computational efficiency. The literature demonstrates that while dynamic multi-scale attention mechanisms can capture spatiotemporal features effectively, their deployment on resource-limited edge devices increases inference latency, making it difficult to meet the requirements of online decision-making [6]. Another critical challenge is the insufficient depth of integration between physical constraints and data-driven approaches. The literature proposes embedding physically meaningful models, such as the hysteretic nonlinear Jiles–Atherton model, into PINNs; however, current methodologies incur prohibitively high computational costs when simulating multi-field coupling within transformers. Future efforts must focus on developing lightweight fusion architectures, establishing cross-platform standards, and advancing explainable AI to facilitate the transition of predictive maintenance from theoretical innovation to practical engineering application [11].
Recent integration of AI has significantly propelled this field, with domestic and international research showing complementary characteristics. Domestic studies focus on multi-modal fusion and temporal modeling innovations. For instance, Zhang et al. achieved a 96.67% diagnostic accuracy by converting time series into Gramian Angular Summation Fields and Recurrence Plots and extracting multi-scale features with a Dual Transformer-Bidirectional Long Short-Term Memory fusion network, highlighting the advantage of heterogeneous data fusion [12]. Addressing data imbalance, Chen and Zhang introduced Knowledge-Filtered Oversampling, combining DGA domain rules (e.g., IEC ratios) with genetic algorithms to enhance diagnostic robustness for minority classes [13]. Furthermore, Liu's multi-scale time-adaptive fusion network dynamically integrates multi-scale features through adaptive temporal encoding and Bidirectional Gated Recurrent Unit modules, strengthening the recognition of complex fault patterns [14].
In contrast, international research emphasizes model comparison and systematic reviews. One study, based on field data, compared models such as CNN and SVM, demonstrating CNN's leading accuracy of 96% and highlighting deep learning's potential for nonlinear feature extraction [15]. Another systematic analysis of AI applications in transformer Prognostics and Health Management revealed that while CNN and LSTM models can replace manual health index calculations, challenges such as data scarcity and computational efficiency remain [16]. Overall, global research trends point towards hybrid models, handling imbalanced data, and optimizing real-time monitoring, necessitating the future integration of traditional domain knowledge and deep learning to improve diagnostic accuracy and industrial applicability.
This paper proposes an innovative physics-informed enhanced transformer (PI-Transformer) framework for power transformer fault diagnosis, with the core contribution being the deep integration of physical priors into the attention mechanism through multi-physics coupling constraints. This primary innovation enables three key advancements:
1. A unified temporal representation scheme that provides appropriately processed input for the physical constraints, effectively resolving data heterogeneity through Dynamic Time Warping and physics-guided feature projection.
2. A multi-task diagnostic framework that validates the physical consistency of the model's predictions across fault classification, severity assessment, and localization tasks, optimized by a curriculum learning strategy.
3. Comprehensive experimental validation on 3000 samples from 76 transformers, demonstrating the framework's practicality with 89.70% accuracy and superior robustness under noisy conditions.
3. Construction of Unified Temporal Representation for Multi-Source Heterogeneous Monitoring Data
3.1. Overview
This section proposes an innovative physics-informed enhanced transformer (PI-Transformer). The core idea is the deep integration of physical priors (thermodynamic laws, fault evolution patterns) into the deep learning framework. The model employs a four-layer architecture: Input Layer, Encoding Layer, Diagnostic Layer, and Optimization Layer, achieving unity between data-driven and mechanistic models. The algorithm flowchart is shown in Figure 1.
3.2. Model Architecture Design and Integration of Physical Laws
The input layer of the model receives the pre-processed multi-source time-series data from Section 2, including DGA gas concentrations, infrared temperature measurements, and partial discharge characteristics. This data is first processed by a physics-guided feature projection layer, which transforms the raw monitoring values into feature representations with explicit physical significance. These features then enter a transformer encoder incorporating physical constraints, which captures temporal dependencies via a multi-head attention mechanism. Finally, a multi-task diagnostic head simultaneously outputs predictions for fault type, severity, and location.
Within the input layer design, we establish a feature projection mechanism guided by clear physical principles, tailored to the characteristics of the multi-source monitoring data. The central idea is to convert raw monitoring data, through mathematical transformations steered by physical laws, into feature representations that are more suitable for model processing and inherently more physically meaningful. Considering the physics of gas diffusion and dissolution, gas concentrations in transformer oil typically follow a long-tailed distribution. A logarithmic transformation effectively compresses the value range in high-concentration regions while enhancing discriminability in low-concentration regions. Thus, concentration features are processed using the following transformation:

$$\mathbf{h}_{\mathrm{DGA}} = \mathbf{W}_{\mathrm{DGA}}\,\log\left(1 + \mathbf{x}_{\mathrm{DGA}}\right) + \mathbf{b}_{\mathrm{DGA}}$$

where $\mathbf{h}_{\mathrm{DGA}}$ is the projected DGA feature vector; $\mathbf{W}_{\mathrm{DGA}}$ is the projection weight matrix; $\mathbf{x}_{\mathrm{DGA}}$ is the raw DGA feature vector (including the various gas concentrations); and $\mathbf{b}_{\mathrm{DGA}}$ is the bias term.
Based on thermodynamic principles, normalization is performed to convert absolute temperature values into relative temperature rise ratios. This approach aligns more closely with the physical essence of overheating faults, whose diagnosis typically relies on temperature rise rather than absolute values:

$$\mathbf{h}_{T} = \mathbf{W}_{T}\,\frac{T - T_{\mathrm{amb}}}{T_{\mathrm{rated}} - T_{\mathrm{amb}}} + \mathbf{b}_{T}$$

where $\mathbf{h}_{T}$ is the projected temperature feature vector; $\mathbf{W}_{T}$ is the projection weight matrix for temperature features; $T$ is the raw temperature measurement; $T_{\mathrm{amb}}$ is the ambient temperature baseline; $T_{\mathrm{rated}}$ is the transformer's rated temperature; and $\mathbf{b}_{T}$ is the bias term for temperature features.
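The two projections above can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation: the weight matrices and bias terms would in practice be learnable parameters of the input layer, the feature dimensions are assumed, and `log1p` is chosen so that zero gas concentrations remain well-defined.

```python
import numpy as np

def project_dga(x_dga, W, b):
    """Log-compress raw gas concentrations (long-tailed in oil), then
    apply a linear projection: h = W @ log(1 + x) + b."""
    return W @ np.log1p(x_dga) + b

def project_temperature(t_raw, t_amb, t_rated, W, b):
    """Convert an absolute temperature into a relative temperature-rise
    ratio before projection, so overheating is judged by rise above
    ambient rather than by absolute value."""
    rise_ratio = (t_raw - t_amb) / (t_rated - t_amb)
    return W @ np.atleast_1d(rise_ratio) + b

# Hypothetical sample: H2, CH4, C2H4, C2H6, C2H2, CO, CO2 in ppm
x = np.array([120.0, 35.0, 8.0, 4.0, 0.5, 600.0, 5200.0])
h = project_dga(x, np.eye(7), np.zeros(7))  # identity weights for demo
```

In a trained model the identity weights used in the demo would be replaced by learned matrices of shape (d_model, 7).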
In the model design, physical constraints are integrated into the deep learning framework through several key strategies. Primarily, during the feature projection stage, the aforementioned mathematical transformations ensure that the input features adhere to physical laws. Furthermore, constraint terms based on fundamental principles like energy conservation are incorporated into the attention mechanism, guiding the model to focus on hotspot regions consistent with heat propagation patterns. Additionally, physical consistency constraints are embedded within the loss function design, ensuring that the model’s predictions are not only accurate but also physically plausible.
3.3. Physically Constrained Attention Mechanism and Feature Fusion
This section details the specific implementation method of integrating physical laws into the attention mechanism. Based on the heat balance equation and gas diffusion laws, a multi-physics coupled attention weight calculation method is designed. The core innovation of this mechanism lies in transforming physical constraints into differentiable penalty terms that directly participate in the calculation of attention weights. The physical constraint matrix $\mathbf{C}_{\mathrm{th}}$, constructed from the heat balance equation, ensures that the hotspot regions the model focuses on comply with the law of energy conservation. A composite attention calculation framework is built as shown in the following equation:

$$\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_k}} + \lambda_{\mathrm{th}}\,\mathbf{C}_{\mathrm{th}} + \lambda_{\mathrm{gas}}\,\mathbf{C}_{\mathrm{gas}}\right)\mathbf{V}$$

where $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ are the query, key, and value matrices obtained through linear transformations; $\sqrt{d_k}$ is the scaling factor reflecting the dimensionality of the key vectors; $\lambda_{\mathrm{th}}$ and $\lambda_{\mathrm{gas}}$ are the weight coefficients for the thermal and gas constraints, respectively; $\mathbf{C}_{\mathrm{th}}$ is the heat balance constraint matrix; and $\mathbf{C}_{\mathrm{gas}}$ is the gas ratio constraint matrix.
The elements of the constraint matrix $\mathbf{C}_{\mathrm{th}}$ are computed to reflect the spatiotemporal characteristics of thermal diffusion. When the difference in power loss between two positions is small, their corresponding attention weights should be enhanced, reflecting the requirements of the law of energy conservation. Simultaneously, the temperature difference term ensures that the hotspot distribution the model focuses on conforms to the laws of heat conduction. The calculation formula is as follows:

$$C_{\mathrm{th}}(i, j) = \exp\!\left(-\frac{\left(P_i - P_j\right)^2}{\kappa}\right) - \left|T_i - T_j\right| + c$$

where $P_i$ is the power loss at position $i$; $\kappa$ is the thermal diffusion coefficient; $T_i$ is the temperature value at the corresponding position; and $c$ is a constant term.
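A minimal NumPy sketch of one plausible realization of this constraint matrix: similar power losses enhance an entry through a Gaussian-style kernel, while temperature differences penalize it. The kernel form and the default values of `kappa` and `c` are assumptions for illustration, not the authors' exact parameterization.

```python
import numpy as np

def thermal_constraint_matrix(P, T, kappa=1.0, c=0.0):
    """C_th[i, j] = exp(-(P_i - P_j)^2 / kappa) - |T_i - T_j| + c.
    P: (n,) power losses; T: (n,) temperatures at the same positions.
    Similar power losses -> larger (enhancing) entries; large temperature
    gaps -> penalized entries."""
    dP = P[:, None] - P[None, :]            # pairwise power-loss differences
    dT = np.abs(T[:, None] - T[None, :])    # pairwise temperature differences
    return np.exp(-dP**2 / kappa) - dT + c
```

The matrix is symmetric by construction, which matches the pairwise (position-to-position) interpretation of the constraint.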
Furthermore, based on industry standards such as the Duval triangle method, the gas ratio constraint matrix $\mathbf{C}_{\mathrm{gas}}$ is constructed. The design of this matrix fully considers the physical significance and diagnostic rules of gas ratios in fault diagnosis. The calculation formula is as follows:

$$C_{\mathrm{gas}}(i, j) = \begin{cases} s\!\left(\mathbf{r}_i, \mathbf{r}_j\right), & f(\mathbf{r}_i) = f(\mathbf{r}_j) \\[4pt] -\mu\, s\!\left(\mathbf{r}_i, \mathbf{r}_j\right), & f(\mathbf{r}_i) \neq f(\mathbf{r}_j) \end{cases}$$

where $f(\cdot)$ represents the fault type discrimination function based on gas ratios; $\mu$ is the penalty coefficient; $s(\mathbf{r}_i, \mathbf{r}_j)$ denotes the gas ratio similarity; and $\mathbf{r}_i$ is the feature gas ratio vector.
This design ensures that when the model computes attention weights, it automatically enhances the associations between samples of the same fault type while suppressing unreasonable associations between samples of different fault types.
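The gas ratio constraint can be sketched as follows. This is a hedged illustration: cosine similarity stands in for the gas ratio similarity, and the `argmax` rule is a toy stand-in for the actual ratio-based fault type discriminator (which would implement Duval-triangle or IEC-ratio rules); `mu` is the penalty coefficient.

```python
import numpy as np

def gas_constraint_matrix(R, mu=1.0):
    """R: (n, k) matrix of gas-ratio feature vectors, one row per sample.
    Same discriminated fault type -> +similarity (reinforce association);
    different types -> -mu * similarity (suppress association)."""
    unit = R / np.linalg.norm(R, axis=1, keepdims=True)
    S = unit @ unit.T                  # cosine similarity s(r_i, r_j)
    types = R.argmax(axis=1)           # toy stand-in for f(r_i)
    same = types[:, None] == types[None, :]
    return np.where(same, S, -mu * S)
```

In use, this matrix is added to the attention scores so that same-type sample pairs are reinforced and cross-type pairs are penalized, as the surrounding text describes.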
Integrating the aforementioned physical constraint matrices into the attention mechanism forms a multi-physics coupled attention calculation framework. The specific implementation process is as follows: first, compute the basic attention score $e_{ij}$; then incorporate the physical constraint terms to obtain $\tilde{e}_{ij}$; after normalization via the softmax function, the final output $\mathbf{z}_i$ is computed. The specific calculation formulas are as follows:

$$e_{ij} = \frac{\mathbf{q}_i \mathbf{k}_j^{\top}}{\sqrt{d_k}}, \qquad \tilde{e}_{ij} = e_{ij} + \lambda_{\mathrm{th}}\, C_{\mathrm{th}}(i, j) + \lambda_{\mathrm{gas}}\, C_{\mathrm{gas}}(i, j)$$

$$\alpha_{ij} = \frac{\exp\!\left(\tilde{e}_{ij}\right)}{\sum_{j'} \exp\!\left(\tilde{e}_{ij'}\right)}, \qquad \mathbf{z}_i = \sum_{j} \alpha_{ij}\, \mathbf{v}_j$$

where $e_{ij}$ is the original attention score between position $i$ and position $j$; $\tilde{e}_{ij}$ is the attention score corrected by the physical constraints; $\alpha_{ij}$ is the final attention weight after softmax normalization, representing the proportion of attention allocated from position $i$ to position $j$ after incorporating physical knowledge; and $\mathbf{z}_i$ is the final output vector at position $i$, a weighted fusion of information from all positions.
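The four steps (basic scores, physical correction, softmax normalization, weighted fusion) can be sketched in NumPy as below. Matrix shapes, the numerically stable softmax, and the default constraint weights are illustrative assumptions; a production model would implement this inside each attention head with learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable row-wise softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def physics_constrained_attention(Q, K, V, C_th, C_gas,
                                  lam_th=0.1, lam_gas=0.1):
    """Q, K, V: (n, d_k) query/key/value matrices.
    C_th, C_gas: (n, n) physical constraint matrices.
    Returns the fused outputs z_i and the attention weights alpha_ij."""
    d_k = K.shape[-1]
    e = Q @ K.T / np.sqrt(d_k)                       # basic scores e_ij
    e_tilde = e + lam_th * C_th + lam_gas * C_gas    # corrected scores
    A = softmax(e_tilde, axis=-1)                    # alpha_ij, rows sum to 1
    return A @ V, A                                  # z_i = sum_j alpha_ij v_j
```

Because the constraints enter before the softmax, they reshape the attention distribution without breaking differentiability, which is what allows end-to-end training.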
This design allows the model to maintain the powerful representational capacity of the traditional attention mechanism while, through physical constraints, guiding it to focus on feature combinations that conform to physical laws. This significantly enhances the interpretability and physical plausibility of the diagnostic results.
3.4. Multi-Task Diagnosis and Physical Consistency Optimization
The diagnostic layer employs a multi-task learning framework to simultaneously accomplish fault classification, severity assessment, and fault location. Based on the requirement for temporal consistency, a physics-guided output constraint mechanism is designed to ensure that the diagnostic results are both accurate and compliant with physical laws. The multi-task loss function is designed to fully reflect the physics-consistency requirements. Its formula is as follows:

$$\mathcal{L} = \mathcal{L}_{\mathrm{data}} + \lambda_1\, \mathcal{L}_{\mathrm{thermal}} + \lambda_2\, \mathcal{L}_{\mathrm{temporal}} + \lambda_3\, \mathcal{L}_{\mathrm{gas}}$$

where $\mathcal{L}_{\mathrm{data}}$ denotes the data fitting loss; $\mathcal{L}_{\mathrm{thermal}}$ represents the thermal balance consistency loss; $\mathcal{L}_{\mathrm{temporal}}$ is the temporal continuity loss; $\mathcal{L}_{\mathrm{gas}}$ is the gas ratio constraint loss; and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weighting coefficients for the respective loss terms.
Among them, the thermal balance consistency loss $\mathcal{L}_{\mathrm{thermal}}$ ensures that the prediction results conform to thermodynamic laws, maintaining the hotspot patterns emphasized by the attention mechanism in the final temperature prediction. Its calculation formula is as follows:

$$\mathcal{L}_{\mathrm{thermal}} = \frac{1}{L}\sum_{t=1}^{L} \left(\hat{T}_t - T_t^{\mathrm{phys}}\right)^2 + \gamma\,\frac{1}{L-1}\sum_{t=2}^{L} \left[\left(\hat{T}_t - \hat{T}_{t-1}\right) - \left(T_t^{\mathrm{phys}} - T_{t-1}^{\mathrm{phys}}\right)\right]^2$$

where $\hat{T}_t$ is the temperature value predicted by the model; $T_t^{\mathrm{phys}}$ is the theoretical temperature value calculated from the physical equations; $\gamma$ is the weight of the gradient consistency loss; and $L$ is the total length of the sequence.
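As a sketch, the thermal balance consistency loss can be computed with a value term plus a finite-difference gradient term. The first-order difference as the "gradient" and the default `gamma` are assumptions for illustration.

```python
import numpy as np

def thermal_consistency_loss(T_pred, T_phys, gamma=0.1):
    """Value term: MSE between predicted and physics-derived temperatures.
    Gradient term: MSE between their first-order temporal differences,
    weighted by gamma (the gradient-consistency weight)."""
    T_pred = np.asarray(T_pred, dtype=float)
    T_phys = np.asarray(T_phys, dtype=float)
    value_term = np.mean((T_pred - T_phys) ** 2)
    grad_term = np.mean((np.diff(T_pred) - np.diff(T_phys)) ** 2)
    return value_term + gamma * grad_term
```

A constant offset between the two series contributes only to the value term, while a mismatch in trend shows up in the gradient term; this is what keeps predicted hotspot dynamics aligned with the physical model.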
The gas ratio constraint loss $\mathcal{L}_{\mathrm{gas}}$ is based on industry-standard fault diagnosis rules. It ensures the model's outputs comply with physical laws by comparing the difference between predicted gas ratios and theoretical standard values. Its calculation formula is as follows:

$$\mathcal{L}_{\mathrm{gas}} = \frac{1}{L}\sum_{t=1}^{L} \left\| \hat{\mathbf{r}}_t - \mathbf{r}_t^{\mathrm{std}} \right\|^2$$

where $\hat{\mathbf{r}}_t$ is the predicted gas ratio vector at time step $t$ and $\mathbf{r}_t^{\mathrm{std}}$ is the corresponding theoretical standard value.
The temporal consistency loss captures temporal dependencies through the physics-constrained attention mechanism. It penalizes state transitions that violate fault progression patterns, thereby ensuring the continuity of fault evolution. The formula is as follows:

$$\mathcal{L}_{\mathrm{temporal}} = \frac{1}{L-1}\sum_{t=2}^{L} \left[\left\| \hat{\mathbf{y}}_t - \hat{\mathbf{y}}_{t-1} \right\|^2 + \eta\, \mathbb{1}\!\left(\left\| \hat{\mathbf{y}}_t - \hat{\mathbf{y}}_{t-1} \right\| > \tau\right)\right]$$

where $\hat{\mathbf{y}}_t$ and $\hat{\mathbf{y}}_{t-1}$ are the model's predicted outputs at time steps $t$ and $t-1$, respectively; $\eta$ is the weight coefficient of the penalty term; $\mathbb{1}(\cdot)$ is the logical indicator function; and $\tau$ is a hyperparameter threshold.
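A sketch of the temporal consistency loss: a squared smoothness term plus an indicator penalty on jumps that exceed the threshold. The default `eta` and `tau` are placeholder hyperparameters, not values from the paper.

```python
import numpy as np

def temporal_consistency_loss(Y, eta=1.0, tau=0.5):
    """Y: (L, d) predicted outputs over L time steps.
    Penalizes step-to-step jumps; jumps whose norm exceeds tau incur an
    extra indicator penalty weighted by eta."""
    diffs = np.linalg.norm(np.diff(Y, axis=0), axis=1)  # ||y_t - y_{t-1}||
    return np.mean(diffs**2 + eta * (diffs > tau))
```

Note that the indicator term is non-differentiable at the threshold; in training it would typically be relaxed (e.g., with a sigmoid), which is a common trick rather than something stated in the source.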
Constructing a multi-task optimization framework enhances the model’s generalization capability through joint optimization. The introduction of physics-based constraint terms reduces the risk of overfitting. Meanwhile, the multi-task outputs provide richer information for comprehensive diagnosis.
The training process employs a two-stage curriculum learning strategy, progressively incorporating physical constraints. The first stage focuses on physics-based pre-training, emphasizing the optimization of the physics-consistency loss. This stage enables the model to preliminarily grasp the physical laws, laying a foundation for subsequent training. The second stage involves supervised fine-tuning, balancing data fitting and physical constraints. The constraint weights are dynamically adjusted during the training process to ensure optimal training effectiveness.
In the first stage, an unsupervised pre-training approach is adopted, focusing on optimizing the physics-consistency loss function. The primary objective of this stage is to allow the model to learn the fundamental physical laws of transformer operation before being exposed to labeled data. In this way, the model establishes a preliminary understanding of fault evolution patterns, providing a favorable initialization for subsequent supervised learning. The pre-training stage utilizes a large-scale dataset of unlabeled monitoring data, including normal operational data and data indicating minor anomalies, enabling the model to learn the physical laws governing normal operating states:

$$\theta^{*} = \arg\min_{\theta}\; \mathcal{L}_{\mathrm{phys}}(\theta), \qquad \mathcal{L}_{\mathrm{phys}} = \lambda_1\, \mathcal{L}_{\mathrm{thermal}} + \lambda_2\, \mathcal{L}_{\mathrm{temporal}} + \lambda_3\, \mathcal{L}_{\mathrm{gas}}$$

where $\theta$ represents the set of learnable model parameters; $\arg\min_{\theta}$ denotes minimization of the loss with respect to $\theta$; and $\mathcal{L}_{\mathrm{phys}}$ is the sum of the physical constraint losses.
The second stage involves supervised fine-tuning of the pre-trained model using labeled data. This stage aims to adapt the model to the specific fault diagnosis task while maintaining consistency with physical laws. A smooth transition of the training focus is achieved by introducing a dynamic constraint weight $\lambda(t)$, calculated as follows:

$$\lambda(t) = \lambda_{\min} + \left(\lambda_0 - \lambda_{\min}\right) e^{-t/\tau_d}$$

where $\lambda(t)$ is the constraint weight; $\lambda_0$ is the initial constraint weight; $\tau_d$ is the decay time constant; $\lambda_{\min}$ is the minimum constraint weight; and $t$ denotes the training time step.
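The exponential decay schedule for the constraint weight is straightforward to implement; the default values of the initial weight, floor, and time constant below are placeholders, not values reported by the authors.

```python
import math

def constraint_weight(t, lam0=1.0, lam_min=0.1, tau_d=1000.0):
    """Exponentially decay the physics-constraint weight from lam0
    towards the floor lam_min as training step t grows:
    lambda(t) = lam_min + (lam0 - lam_min) * exp(-t / tau_d)."""
    return lam_min + (lam0 - lam_min) * math.exp(-t / tau_d)
```

Early in training the weight sits near `lam0`, enforcing strong physical guidance; as training progresses it settles at `lam_min`, shifting emphasis to data fitting exactly as the surrounding text describes.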
During the initial phase of training, the model parameters require strong physical constraints to guide the learning direction and prevent convergence to suboptimal local minima. As training progresses and the model gradually assimilates the underlying physical laws, the focus shifts towards enhancing its data-fitting capability to improve diagnostic accuracy. Consequently, the constraint weight is dynamically attenuated according to the prescribed exponential decay schedule.
3.5. Summary
This study proposed an innovative method for power transformer fault diagnosis based on a physics-informed enhanced transformer architecture. By deeply integrating physical prior knowledge—such as thermodynamic laws and gas diffusion principles—into the deep learning model, we successfully constructed an intelligent diagnostic system featuring synergistic multi-modal feature fusion and physical constraints.
In terms of model design, this study innovatively introduced a physics-guided feature projection mechanism and a multi-physics coupled attention calculation method, enabling the model to automatically adhere to physical laws during the feature extraction process. The design of a multi-task loss function incorporating thermal balance consistency, temporal continuity, and gas ratio constraints ensured that the diagnostic results maintain both accuracy and physical plausibility.
Regarding the training strategy, a two-stage curriculum learning approach and a dynamic constraint weight adjustment mechanism were employed, achieving an organic unification of physical principles and data-driven learning. Experimental validation demonstrated that the proposed method significantly outperforms traditional approaches in fault diagnosis accuracy, physical consistency, and generalization capability.
The main contribution of this study lies in establishing a deep integration framework that couples physical laws with deep learning models, providing a new technological pathway for transformer condition assessment. The core methodology is not only applicable to power equipment fault diagnosis but can also be extended to intelligent operation and maintenance in other complex industrial systems, holding significant theoretical value and prospects for engineering application.
4. Case Study
4.1. Case Setup
This section evaluates the proposed fault prediction model for power transformers based on multi-source heterogeneous data. The experiment utilizes multi-source heterogeneous monitoring data and Dissolved Gas Analysis (DGA) data from power transformers deployed in the traction power supply sections of an urban rail transit system. The dataset comprises 3000 state samples collected over a six-month period from 76 oil-immersed transformers with similar configurations, providing a representative foundation for evaluating fault evolution characteristics in real-world operational scenarios.
All experiments were conducted on a high-performance computing cluster running the Linux operating system. The hardware configuration includes an eight-core Intel i9 processor with a base frequency of 2.3 GHz and 32 GB of RAM. Model development, training, and testing were completed using Python 3.8.5. The comprehensive case study encompasses data description, model configuration, the training process, and a holistic performance evaluation.
This study utilized monitoring data collected over a three-year period (2020–2023) from 76 in-service 110-kV power transformers operated by a provincial grid company in China. The dataset integrates multi-source heterogeneous monitoring signals, including Dissolved Gas Analysis (DGA), vibration, infrared thermal imaging, partial discharge, oil chemistry, and operational parameters. Data preprocessing involved dynamic time warping alignment and physics-guided feature projection (e.g., logarithmic transformation of gas concentrations and conversion of absolute temperatures into relative temperature rise ratios) to construct a unified temporal representation. A total of 3000 valid samples were constructed, where each sample corresponds to a multi-modal feature vector at a specific timestamp, labeled with the transformer’s health state. Health states were determined based on DGA guidelines, maintenance records, and expert analysis, covering seven categories: Normal (N), Low-temperature Overheating (LT), Medium-temperature Overheating (MT), High-temperature Overheating (HT), Partial Discharge (PD), Low-energy Discharge (LD), and High-energy Discharge (HD). To ensure generalizability and avoid data leakage, a leave-transformers-out partitioning strategy was adopted: 61 transformers for training, 8 for validation, and 7 for testing.
4.2. Dataset Construction
To validate the effectiveness of the proposed transformer fault prediction model based on multi-source heterogeneous features, this study selected operational records from oil-immersed power transformers in an urban rail transit traction power supply area as the research subject. The final curated dataset includes key monitoring indicators such as dissolved gas features (H2, CH4, C2H4, C2H6, C2H2, CO, CO2) and multi-dimensional state variables (temperature parameters, partial discharge magnitude and its severity, vibration measurements, load rate, and cooling status). These indicators cover the most representative electrical, thermal, and mechanical multi-source features for fault diagnosis.
The operational sampling records were drawn from oil-immersed transformers deployed in the traction power supply section of an urban rail transit system, collected from multiple transformers with similar configurations over their routine operational cycles. To ensure that subsequent model training reflects the various typical operating states of the transformers, the typical values of the DGA data and multidimensional state variables used for modeling are presented in Table 2 and Table 3, respectively. For brevity, the tables present representative values for each transformer state; the remaining data are consistent with these examples. The data encompass gas characteristic profiles under different fault conditions as well as variations in operating conditions, providing a foundational dataset for identifying typical fault patterns such as low-temperature overheating, medium-temperature overheating, high-temperature overheating, partial discharge, low-energy discharge, and high-energy discharge. The final compiled samples are comprehensive and exhibit clear state distinctions, effectively supporting subsequent validation tasks such as fault classification, risk identification, and feature contribution analysis.
The raw dataset was randomly partitioned into a training set and a test set at a 4:1 ratio. Specifically, 80% of the total samples were allocated for model training, while the remaining 20% were set aside for performance evaluation. To gain an in-depth understanding of the distribution characteristics of the transformer state data, the DBSCAN clustering method was applied to the training set samples for state characterization and cluster analysis. Multiple dimensionality reduction and visualization techniques were subsequently employed to reveal the intrinsic structure of the data.
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction technique capable of effectively preserving the local structural relationships of data in the high-dimensional space.
Figure 2 illustrates the visualization of the distribution for the normal state versus fault states based on t-SNE. From the figure, it can be clearly observed that the normal state samples (green points) form a distinct, large, and densely populated cluster, indicating a high degree of consistency in the operational characteristics of transformers under normal conditions. In contrast, the fault state samples (red points) are more widely dispersed, reflecting the diversity of different fault types within the feature space.
Figure 2 provides a more refined classification display of fault types, presenting the distribution of seven specific states via t-SNE. It can be observed that the normal state maintains its compact clustering feature, while various fault states exhibit different distribution patterns. Among them, the sample points for certain fault types, such as partial discharge (PD) and low-energy discharge (LD), show partial overlap with relatively ambiguous boundaries, suggesting that these faults may share similar physical characteristics. Other fault types, like PD and different levels of overheating, demonstrate relatively independent distribution regions, indicating that these fault types possess good separability within the feature space.
Based on the aforementioned analytical results, an adaptive sampling strategy was employed during the model training process. For the normal state, which exhibits clear boundaries, its sample weight was appropriately reduced to prevent the model from overfitting to these easily classifiable samples. For fault types with ambiguous boundaries, their sample weights were increased to ensure the model could sufficiently learn these challenging classification boundaries. This targeted training strategy, informed by data characteristic analysis, effectively enhanced the model’s generalization capability and robustness in the transformer state assessment task, demonstrating superior classification performance, particularly when handling fault types with ambiguous boundaries.
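The adaptive sampling strategy described above amounts to a per-sample weighting rule. The sketch below is illustrative only: the specific weight values (0.7 and 1.5) and the class groupings are assumptions for demonstration, not the study's tuned parameters.

```python
import numpy as np

def adaptive_sample_weights(labels,
                            ambiguous=frozenset({"PD", "LD"}),
                            easy=frozenset({"N"}),
                            up=1.5, down=0.7):
    """Down-weight the well-separated normal class (to limit overfitting
    on easy samples), up-weight boundary-ambiguous fault classes, and
    leave all other classes at weight 1."""
    w = np.ones(len(labels))
    for i, lab in enumerate(labels):
        if lab in easy:
            w[i] = down
        elif lab in ambiguous:
            w[i] = up
    return w
```

These weights would typically be passed to the loss function (e.g., as per-sample weights in a cross-entropy objective) or used to drive a weighted sampler.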
4.3. Case Analysis
This study conducted a systematic performance evaluation of the proposed physics-informed transformer (PI-Transformer) model. Based on multi-state monitoring samples from power transformers, the improved physics-informed transformer model was employed for state discrimination and fault prediction. Its performance was compared and analyzed against several traditional machine learning algorithms, including Support Vector Machine (SVM), Random Forest (RF), LightGBM, and XGBoost, as well as a representative multimodal deep learning benchmark, the dynamic multiscale attention CNN–LSTM model [13]. These analyses collectively validate the superiority of the proposed model in terms of multi-type fault discrimination accuracy, adaptability, and physical consistency.
4.3.1. Accuracy Test
As shown in Figure 3, the average discrimination accuracy of the proposed PI-Transformer method across the seven transformer states reached 87.5% as the number of training epochs increased.
Figure 4 reveals differences in the identification difficulty among different fault states. The recognition accuracy for boundary-ambiguous faults, specifically low-energy discharge (LD) and partial discharge (PD), was relatively lower, with PI-Transformer achieving 83.6% and 77.9%, respectively. In contrast, the highest accuracies among the other compared methods were only 78.4% (LightGBM) for LD and 71.9% (LightGBM) for PD. This is primarily attributed to the high similarity in features between these two fault types, resulting in relatively ambiguous boundaries in the feature space, which makes it difficult for traditional classifiers to effectively distinguish them. Nevertheless, compared to traditional methods, PI-Transformer demonstrated a clear advantage for multiple critical fault types. Even for the boundary-ambiguous LD and PD faults, PI-Transformer still achieved a certain level of improvement. This advantage mainly stems from the introduction of the physical constraint mechanism. By integrating physical parameters such as operating temperature, load rate, and vibration frequency of the transformer, the model’s ability to learn the essential characteristics of faults is enhanced. In contrast to LD and PD, high-energy discharge (HD) exhibits more distinct features, enabling the proposed method to achieve a recognition accuracy exceeding 89%. Within the series of overheating faults, from low-temperature overheating (LT) to high-temperature overheating (HT), the recognition accuracy shows a gradually increasing trend, reflecting the physical principle that fault characteristics become more pronounced with increasing severity. Particularly in scenarios with small samples and overlapping features, the physical constraints provide an additional regularization effect, preventing the model from learning spurious features that contradict physical laws. 
Regarding the performance difference between the training and test sets, PI-Transformer exhibited a smaller accuracy drop on the test set (an average decrease of 2.8%), demonstrating good generalization capability. This indicates that the physical constraints not only improve the model’s discrimination accuracy but also enhance its stability across different scenarios.
As shown in Table 4, distinct performance variations are evident among the algorithms in fault identification. Among the comparative methods, the dynamic multiscale attention CNN–LSTM model proposed in [13] is selected as a representative multimodal deep learning benchmark for transformer anomaly detection. This method integrates convolutional neural networks for spatial feature extraction and LSTM networks for temporal dependency modeling, enhanced by a multiscale attention mechanism that adaptively fuses heterogeneous monitoring data. The proposed PI-Transformer consistently outperforms the CNN–LSTM benchmark on both the training and test datasets. Specifically, the PI-Transformer achieves an accuracy of 89.70% on the training set and 86.90% on the test set, whereas the CNN–LSTM model attains 84.49% and 81.28%, respectively. This performance gap indicates better generalization, which can be attributed to the physics-informed constraints that stabilize the proposed model under complex operating conditions.
Gradient boosting methods (LightGBM and XGBoost) demonstrate competitive performance across most fault types, although their accuracy declines when distinguishing faults with ambiguous boundaries. Traditional strong classifiers, such as Random Forest and SVM, perform reliably on simpler fault types but are limited in capturing complex fault patterns. Logistic Regression exhibits the lowest overall performance owing to its limited model capacity, reflecting the inherent difficulty linear models face with the nonlinear characteristics of transformer faults. These observations are consistent with the weighted average performance across multiple fault types and highlight the comparative advantages of the proposed PI-Transformer in multi-type fault diagnosis, as indicated by the overall trends in
Table 4.
In summary, the physics-informed transformer method proposed in this paper demonstrates advantages in multi-type fault diagnosis for power transformers, including high accuracy, strong adaptability to complex faults, and good cross-scenario stability. It provides effective technical support for transformer condition monitoring and fault early warning.
4.3.2. Scenario Adaptability Test
To validate the adaptability of the proposed physics-constrained transformer model to the complex operating environments of rail transit power supply transformers, the following tests were conducted under scenarios involving continuous monitoring samples and sample loss due to communication interruptions.
Continuous monitoring samples from 10 consecutive time periods (with 15-min intervals) of rail transit power supply transformers were selected to evaluate the fault state prediction performance of the physics-constrained transformer. As shown in
Figure 5, the PI-Transformer achieved a prediction accuracy of 83.2%, a relative decrease of only 4.9%, the smallest decline among all tested classifiers. In contrast, the accuracy degradation of the other classifiers ranged from 5.9% to 14.0%, with Logistic Regression (LR) performing worst at a 14.0% decline. These results indicate that the proposed method effectively captures the temporal characteristics of fault state evolution in power transformers, identifying changes in dissolved gas composition and operating condition indicators as faults progress. The high-accuracy early identification of fault states that the PI-Transformer provides gives maintenance personnel a reliable basis for timely inspections and protective measures, thereby helping to ensure the operational safety of the rail transit power supply.
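The relative accuracy decline quoted above can be computed with a simple helper. This is an illustrative sketch: the function names and the example figure of 87.5% static accuracy are hypothetical, not values taken from the paper's tables.

```python
def sequence_accuracy(preds, labels):
    """Fraction of correctly identified states over a monitoring window, in %."""
    correct = sum(p == y for p, y in zip(preds, labels))
    return 100.0 * correct / len(labels)

def relative_decline(static_acc, sequential_acc):
    """Relative accuracy decrease (%) when moving from static samples
    to consecutive monitoring windows."""
    return 100.0 * (static_acc - sequential_acc) / static_acc

# Hypothetical example: a classifier at 87.5% on static samples that
# drops to 83.2% on 10-step sequential windows declines by about 4.9%.
drop = relative_decline(87.5, 83.2)
```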
Furthermore, the widespread distribution of rail transit power supply networks and the susceptibility of communication channels to weather or network interference can lead to monitoring interruptions, which may introduce deviations in state identification. The adaptability of the proposed physics-constrained transformer method under strong interference conditions was further evaluated. Scenarios with random loss of monitoring data in certain time periods were simulated to represent communication interruption events. As shown in
Figure 6, the physics-constrained transformer algorithm maintained a high state identification accuracy of 82.0% under discontinuous conditions with 20% sample loss. Its accuracy degradation under interference was approximately 7.1%, whereas the other algorithms declined by about 9.2% to 14.4%. This demonstrates that the proposed algorithm, through its physics constraints and attention mechanism, effectively mitigates the impact of monitoring interference and adapts well to challenging scenarios. However, when communication interruptions are prolonged and the sample loss ratio exceeds 40%, the identification accuracy of the proposed method drops below 72%; in such cases, prompt on-site maintenance is required to restore communication and ensure operational safety.
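The communication-interruption scenario above can be emulated by randomly masking a fraction of the monitoring sequence. The sketch below is a minimal illustration under assumed conventions: the function name is hypothetical, and dropped time steps are represented as `None` so a downstream model can mask them.

```python
import random

def simulate_sample_loss(sequence, loss_ratio, seed=0):
    """Randomly drop a fraction of time steps to mimic communication
    interruptions; dropped steps become None for downstream masking."""
    rng = random.Random(seed)
    n_drop = int(round(loss_ratio * len(sequence)))
    drop_idx = set(rng.sample(range(len(sequence)), n_drop))
    return [None if i in drop_idx else x for i, x in enumerate(sequence)]

# Ten 15-min monitoring samples with a 20% simulated loss ratio:
seq = list(range(10))
degraded = simulate_sample_loss(seq, loss_ratio=0.2)
```

Sweeping `loss_ratio` over, e.g., 0.1 to 0.5 would reproduce the kind of degradation study reported here, where accuracy remains usable up to roughly 20% loss but deteriorates beyond 40%.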
4.3.3. Summary
The dataset used comprises 3000 state samples collected over six months from 76 oil-immersed transformers with similar configurations, covering diverse operational states and fault patterns. The extensive coverage ensures that the evaluation reflects realistic operational scenarios and supports robust multi-type fault assessment. The typical operating conditions (
Table 2) and DGA indicators (
Table 3) provide further evidence of the dataset’s representativeness and its support for accurate state identification.
Furthermore,
Figure 4 and
Figure 5 illustrate the temporal and scenario adaptability of the proposed model. In continuous monitoring scenarios, the PI-Transformer maintains high accuracy with minimal decline under temporal dependencies. Under simulated communication loss, it preserves robust performance even with partial data loss, indicating strong resilience to real-world operational uncertainties. Additionally, the monotonic increase in recognition accuracy from LT to HT faults shows that the model captures physically meaningful fault evolution patterns.
Overall, the effectiveness of the PI-Transformer is demonstrated through complementary analyses across multiple perspectives: static state classification, comparative algorithm performance, temporal adaptability, robustness under data loss, and physical consistency. The attention given to boundary-ambiguous faults such as LD and PD further highlights the model’s capability in challenging classification scenarios. Taken together, the evidence confirms that the proposed method achieves accurate, generalizable, and physically consistent multi-type fault diagnosis.
5. Conclusions
To address the challenges in transformer fault diagnosis, this study proposes a novel physics-informed transformer (PI-Transformer) model based on multimodal input data and the integration of physical constraints. The proposed framework constructs a unified temporal representation from multi-source heterogeneous monitoring data and embeds physically meaningful constraints to guide model learning.
A systematic case analysis based on long-term operational data from multiple oil-immersed transformers was conducted to evaluate the effectiveness of the proposed method. The experimental results show that the PI-Transformer achieves high fault identification accuracy on both training and test datasets, outperforming several traditional machine learning methods and a representative multimodal deep learning benchmark. In particular, the proposed model demonstrates improved discriminative capability for boundary-ambiguous fault types, such as low-energy discharge and partial discharge, which are typically difficult to distinguish using conventional approaches.
Furthermore, the introduction of physical constraints contributes to enhanced generalization performance and scenario adaptability. The model exhibits a relatively small performance gap between the training and test sets, indicating stable generalization behavior. Additional tests under continuous monitoring sequences and simulated communication interruptions further demonstrate the robustness of the proposed approach under realistic operating conditions.