Article

Multi-Source Data Fusion and Multi-Task Physics-Informed Transformer for Power Transformer Fault Diagnosis

1 School of Rail Transit, Guangdong Communication Polytechnic, Guangzhou 510000, China
2 School of Electric Power, South China University of Technology, Guangzhou 510000, China
3 College of Engineering, Shantou University, Shantou 515063, China
* Author to whom correspondence should be addressed.
Energies 2026, 19(3), 599; https://doi.org/10.3390/en19030599
Submission received: 22 December 2025 / Revised: 19 January 2026 / Accepted: 22 January 2026 / Published: 23 January 2026

Abstract

Power transformers are critical assets in power systems, and their reliable operation is essential for grid stability. Conventional fault diagnosis methods suffer from delayed response and limited adaptability, while existing artificial intelligence-based approaches face challenges related to data heterogeneity, limited interpretability, and weak integration of physical mechanisms. To address these issues, this paper proposes a physics-informed enhanced transformer-based framework for power transformer fault diagnosis. A unified temporal representation scheme is developed to integrate heterogeneous monitoring data using Dynamic Time Warping and physics-guided feature projection. Physical priors derived from thermodynamic laws and gas diffusion principles are embedded into the attention mechanism through multi-physics coupling constraints, improving physical consistency and interpretability. In addition, a multi-task diagnostic strategy is adopted to jointly perform fault classification, severity assessment, and fault localization. Experiments on 3000 samples from 76 power transformers demonstrate that the proposed method achieves high diagnostic accuracy and superior robustness under noise and interference, indicating its effectiveness for practical predictive maintenance applications.

1. Introduction

Power transformers are the core equipment of power systems, and their operational state is crucial to grid stability and safety. However, traditional monitoring methods such as Dissolved Gas Analysis (DGA) and manual inspection suffer from response delays and low efficiency, struggling to meet the real-time demands of modern grids [1]. With advances in artificial intelligence (AI), machine learning and deep learning models have been introduced for transformer fault diagnosis. While models such as Support Vector Machines (SVM) and Random Forests (RF) perform well on structured data classification, they exhibit limitations in handling high-dimensional time-series data [2]. Recently, the Transformer model, whose attention mechanism captures long-range dependencies, has shown exceptional performance in anomaly detection tasks [3,4]. The rise of multi-source heterogeneous data fusion technology has further promoted the evolution of transformer monitoring from single sensors to integrated IoT devices and intelligent inspection robots, laying the foundation for real-time data analysis [2,5].
Current research on predictive maintenance for power transformers has achieved progress under the impetus of artificial intelligence (AI) and Internet of Things (IoT) technologies. However, significant problems remain at the data, model, and system levels. Firstly, data scarcity and heterogeneity severely constrain model generalizability. The literature indicates that while early fusion of multi-source data (e.g., Dissolved Gas Analysis (DGA), vibration signals, and thermal imaging) can enhance diagnostic accuracy, sample imbalance and noise interference complicate feature extraction. The particular scarcity of fault data heightens the risk of overfitting in supervised learning models such as Deep Neural Networks [6]. The literature further shows that traditional threshold methods, reliant on fixed rules, are ill-suited to dynamic operating environments. In contrast, data-driven models (e.g., Random Forest and Long Short-Term Memory networks) require substantial annotated data, a requirement often hampered by high labeling costs in practical applications [7]. Secondly, the disconnection between model interpretability and physical mechanisms is a pronounced issue. The literature emphasizes that while black-box AI models (e.g., Convolutional Neural Networks) achieve high accuracy rates, they cannot elucidate causal relationships, such as the relationship between vibration propagation paths and insulation ageing, thereby diminishing operational trust [8]. Concurrently, the literature points out that multi-modal fusion models, exemplified by hybrid Transformer-Gated Recurrent Unit-Convolutional Neural Network architectures, provide insufficient modeling of nonlinear coupling effects (e.g., electromagnetic-thermal-mechanical stress), rendering them prone to false alarms during load fluctuations [9]. Furthermore, the lack of standardization exacerbates system heterogeneity.
The literature notes that data from infrared and temperature sensors are difficult to align due to protocol disparities (e.g., Modbus vs. MQTT) [10]. It is also mentioned in the literature that although Physics-Informed Neural Networks (PINN) introduce constraints based on heat conduction equations, their boundary condition settings rely on empirical parameters, which limits transferability across different transformer models [11].
Research challenges are concentrated across the dimensions of technical integration, real-time performance, and security. One significant challenge is balancing the real-time processing demands of multi-modal data fusion with computational efficiency. The literature demonstrates that while dynamic multi-scale attention mechanisms can capture spatiotemporal features effectively, their deployment on resource-limited edge devices increases inference latency, making it difficult to meet the requirements of online decision-making [6]. Another critical challenge is the insufficient depth of integration between physical constraints and data-driven approaches. The literature proposes embedding physically meaningful models, such as the hysteresis-nonlinear Jiles–Atherton model, into PINNs; however, current methodologies incur prohibitively high computational costs for simulating multi-field coupling within transformers. Future efforts must focus on developing lightweight fusion architectures, establishing cross-platform standards, and advancing explainable AI to facilitate the transition of predictive maintenance from theoretical innovation to practical engineering application [11].
Recent integration of AI has significantly propelled this field, with domestic and international research showing complementary characteristics. Domestic studies focus on multi-modal fusion and temporal modeling innovations. For instance, Zhang et al. achieved a 96.67% diagnostic accuracy by converting time series into Gramian Angular Summation Fields and Recurrence Plots and extracting multi-scale features with a Dual Transformer-Bidirectional Long Short-Term Memory fusion, highlighting the advantage of heterogeneous data fusion [12]. Addressing data imbalance, Chen and Zhang introduced Knowledge-Filtered Oversampling, combining DGA domain rules (e.g., IEC ratios) with genetic algorithms to enhance diagnostic robustness for minority classes [13]. Furthermore, Liu’s multi-scale time adaptive fusion network dynamically integrates multi-scale features through adaptive temporal encoding and Bidirectional Gated Recurrent Unit modules, strengthening the recognition capability for complex fault patterns [14].
In contrast, international research emphasizes model comparison and systematic reviews. One study, based on field data, compared models like CNN and SVM, demonstrating CNN’s leading accuracy of 96% and highlighting deep learning’s potential in nonlinear feature extraction [15]. Another systematic analysis of AI applications in Transformer Prognostic and Health Management revealed that while CNN and LSTM can replace manual health index calculations, challenges like data scarcity and computational efficiency remain [16]. Overall, global research trends point towards hybrid models, handling imbalanced data, and optimizing real-time monitoring, necessitating future integration of traditional knowledge and deep learning to improve diagnostic accuracy and industrial applicability.
This paper proposes an innovative physics-informed enhanced transformer (PI-Transformer) framework for power transformer fault diagnosis, with the core contribution being the deep integration of physical priors into the attention mechanism through multi-physics coupling constraints. This primary innovation enables three key advancements:
1. A unified temporal representation scheme that provides appropriately processed input for the physical constraints, effectively resolving data heterogeneity through Dynamic Time Warping and physics-guided feature projection.
2. A multi-task diagnostic framework that validates the physical consistency of the model's predictions across fault classification, severity assessment, and localization tasks, optimized by a curriculum learning strategy.
3. Comprehensive experimental validation on 3000 samples from 76 transformers, demonstrating the framework's practicality with 89.70% accuracy and superior robustness under noisy conditions.

2. Construction of Unified Temporal Representation from Multi-Source Heterogeneous Monitoring Data

2.1. Multi-Source Monitoring Data System and Fault State Characterization

This study constructs a multi-source information system comprising six data types. The features and physical significance of each source are detailed in Table 1.
A multi-level diagnostic strategy determines the fault state by combining key features: gas ratios from DGA (internal chemical changes), temperature information from infrared monitoring (external thermal performance), and PD signals (insulation state). An expert system performs comprehensive judgment, with decision rules represented as
D = f\left( \mathrm{C_2H_2}/\mathrm{C_2H_4},\ \mathrm{CH_4}/\mathrm{H_2},\ \Delta T_{\max},\ PD_{mag} \right)
where $D$ is the concluded transformer state; $f$ is the diagnostic logic function defined by expert knowledge or standards such as IEC 60599 [17]; $\Delta T_{\max}$ is the difference between the maximum surface temperature and the ambient/reference temperature; $PD_{mag}$ is the magnitude of the PD signal.

2.2. Data Preprocessing

A DTW-based preprocessing pipeline addresses heterogeneity and sampling rate differences. For missing data, a spatiotemporal KNN imputation algorithm is used:
x_i(t_j) = \frac{\sum_{k=1}^{K} w_{jk}\, x_i(t_k)}{\sum_{k=1}^{K} w_{jk}}, \qquad w_{jk} = \exp\left( -\frac{(t_j - t_k)^2}{2\sigma_t^2} \right)
where $x_i(t_j)$ is the imputed value; $K$ is the number of nearest neighbors; $w_{jk}$ is the temporal weight; $\sigma_t$ is the time decay coefficient.
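As a concrete illustration, the spatiotemporal KNN imputation above can be sketched in NumPy; the function name and the defaults for $K$ and $\sigma_t$ are our own choices, not values from the paper:

```python
import numpy as np

def knn_impute(times, values, t_missing, K=3, sigma_t=1.0):
    """Impute x_i(t_j) as the Gaussian-weighted mean of the K observed
    samples nearest in time, following the equation above."""
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    idx = np.argsort(np.abs(times - t_missing))[:K]      # K nearest neighbors in time
    w = np.exp(-((t_missing - times[idx]) ** 2) / (2.0 * sigma_t ** 2))
    return float(np.sum(w * values[idx]) / np.sum(w))
```

For example, a gap at $t = 2$ between equally distant readings of 12 and 16 is imputed as their plain average when $K = 2$, since the two temporal weights are equal.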
An improved RobustScaler handles dimensional differences:
x_{norm} = \frac{x - \mathrm{median}(X)}{\mathrm{IQR}(X)}
where $x_{norm}$ is the normalized value; $\mathrm{median}(X)$ and $\mathrm{IQR}(X)$ are the median and interquartile range of the feature set $X$.
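A minimal sketch of this median/IQR normalization (a hand-rolled equivalent of scikit-learn's RobustScaler, without the paper's further improvements):

```python
import numpy as np

def robust_scale(x, X):
    """Normalize x against the median and interquartile range of X,
    per the equation above; insensitive to extreme outliers."""
    med = np.median(X)
    q1, q3 = np.percentile(X, [25, 75])
    return (x - med) / (q3 - q1)
```

An outlier such as 100 in $X$ barely shifts the median or IQR, so ordinary values keep a stable scale, which is the motivation for preferring this over mean/standard-deviation scaling.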
This method is more robust to outliers. Temporal alignment adopts an elastic matching strategy based on Dynamic Time Warping (DTW), aligning sequences with different sampling rates by constructing a cost matrix $C \in \mathbb{R}^{N \times M}$. This alignment method effectively handles non-uniformly sampled data and lays the foundation for subsequent feature fusion. The calculation formula is as follows:
C(i, j) = \left\| x_i - y_j \right\|^2 + \min \left\{ C(i-1, j),\ C(i, j-1),\ C(i-1, j-1) \right\}
where $C(i, j)$ is the minimal cumulative cost; $\| x_i - y_j \|$ is the Euclidean distance; $C(i-1, j)$, $C(i, j-1)$, and $C(i-1, j-1)$ represent the cumulative costs of the deletion, insertion, and matching operations, respectively.
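The DTW recursion above can be sketched as a plain O(NM) dynamic program over scalar sequences; the variable names are ours:

```python
import numpy as np

def dtw_cost(x, y):
    """Fill the cumulative cost matrix C(i, j) with the DTW recursion
    above and return the minimal alignment cost between x and y."""
    N, M = len(x), len(y)
    C = np.full((N + 1, M + 1), np.inf)
    C[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            d = (x[i - 1] - y[j - 1]) ** 2       # squared Euclidean distance
            C[i, j] = d + min(C[i - 1, j],       # deletion
                              C[i, j - 1],       # insertion
                              C[i - 1, j - 1])   # match
    return C[N, M]
```

Repeating a sample, as happens when one signal is sampled faster than another, adds no alignment cost: `dtw_cost([1, 2, 3], [1, 2, 2, 3])` is still zero, which is exactly the elasticity needed for mixed-rate monitoring streams.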

2.3. Hierarchical Embedding Network Architecture for Feature Fusion

To address the spatial alignment problem of heterogeneous data, we design a hierarchical embedding network architecture. For numerical time-series data, a spatiotemporal attention encoder is employed, with the calculation formula as follows:
E_{num}(t) = M\left( W_q x(t),\ W_k X,\ W_v X \right)
where $E_{num}(t)$ represents the feature vector characterizing the equipment state at time $t$; $W_q$, $W_k$, and $W_v$ denote the weight matrices for query, key, and value, respectively; $x(t)$ is the feature vector at the current time point; $X$ is the historical time-series feature matrix; $M$ is the multi-head attention function.
For image data, deformable convolution is introduced on the basis of a pre-trained ResNet to enhance adaptability to the specific structures of power transformers. The calculation formula is as follows:
E_{img} = \mathrm{DeConv}\left( I(t); \Delta p \right) + P_o(t)
where $E_{img}$ represents the external thermal state of the equipment at time $t$; $\mathrm{DeConv}$ denotes the deformable convolution operation; $I(t)$ is the infrared image at time $t$; $\Delta p$ is the learnable positional offset; $P_o(t)$ is the positional encoding, which adds temporal information to the image features.
The feature fusion employs a dynamic weight allocation mechanism based on cross-modal attention. The salience weights for each modality are calculated, and the fused representation is achieved through temporal fusion using a Gated Recurrent Unit (GRU). This enables the model to dynamically adjust the contribution of each modality according to the current operating condition. The calculation formulas are as follows:
\alpha_m = \sigma\left( W_\alpha \left[ E_m \,\|\, h_{con} \right] + b_\alpha \right)
E_{fused}(t) = \mathrm{GRU}\left( \left[ \alpha_1 E_1(t);\ \ldots;\ \alpha_6 E_6(t) \right],\ h(t-1) \right)
where $\alpha_m$ represents the attention weight for modality $m$; $\sigma$ denotes the sigmoid activation function; $W_\alpha$ and $b_\alpha$ are the learnable weight and bias parameters; $[E_m \| h_{con}]$ signifies the concatenation of the modality feature and the context vector; $h_{con}$ is the context vector representing the operating condition; $E_{fused}(t)$ is the multi-modal fused feature, which serves as the input to the subsequent Transformer classifier for the final state diagnosis; $\mathrm{GRU}$ refers to the Gated Recurrent Unit used for temporal feature fusion; $[\alpha_1 E_1(t); \ldots; \alpha_6 E_6(t)]$ is the concatenation of the weighted multi-modal features; $h(t-1)$ is the hidden state from the previous time step.
The sequence of fused feature vectors $E_{fused}(t_1), E_{fused}(t_2), \ldots, E_{fused}(t_n)$ is transformed into a sequence of fixed-length time windows. Assuming the current diagnostic time is $t_c$ and the time window length is $L$, the constructed input sequence is $S = [E_{fused}(t_c - L + 1), E_{fused}(t_c - L + 2), \ldots, E_{fused}(t_c)]$. This process converts the continuous time stream into a discrete sequence of samples. The sliding window mechanism effectively augments the training data, providing the model with sufficient historical context to capture the evolution patterns of faults, which lays the data foundation for the subsequent fault diagnosis task.
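The sliding-window construction can be sketched as follows; the array layout (n_steps × feature dimension) is our assumption for illustration:

```python
import numpy as np

def sliding_windows(E_fused, L):
    """Slice the fused feature stream (n_steps x d) into overlapping
    length-L input sequences S = [E(t_c - L + 1), ..., E(t_c)]."""
    E = np.asarray(E_fused)
    return np.stack([E[c - L + 1:c + 1] for c in range(L - 1, len(E))])
```

Ten fused time steps with $L = 5$ yield six overlapping training samples, each carrying five steps of history, which is the data-augmentation effect described above.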

3. Physics-Informed Enhanced Transformer for Power Transformer Fault Diagnosis

3.1. Overview

This section proposes an innovative physics-informed enhanced transformer (PI-Transformer). The core idea is deep integration of physical priors (thermodynamic laws, fault evolution patterns) into the deep learning framework. The model employs a four-layer architecture: Input Layer, Encoding Layer, Diagnostic Layer, and Optimization Layer, achieving unity between data-driven and mechanistic models. The algorithm flowchart is shown in Figure 1 below.

3.2. Model Architecture Design and Integration of Physical Laws

The input layer of the model receives the pre-processed multi-source time-series data from Section 2, including DGA gas concentrations, infrared temperature measurements, and partial discharge characteristics. This data is first processed by a physics-guided feature projection layer, which transforms the raw monitoring values into feature representations with explicit physical significance. These features then enter a transformer encoder incorporating physical constraints, which captures temporal dependencies via a multi-head attention mechanism. Finally, a multi-task diagnostic head simultaneously outputs predictions for fault type, severity, and location.
Within the input layer design, we establish a feature projection mechanism guided by clear physical principles, tailored to the characteristics of the multi-source monitoring data. The central idea is to convert raw monitoring data through mathematical transformations steered by physical laws into feature representations that are more suitable for model processing and inherently more physically meaningful. Considering the physics of gas diffusion and dissolution, gas concentrations in transformer oil typically follow a long-tailed distribution. A logarithmic transformation effectively compresses the value range in high-concentration regions while enhancing discriminability in low-concentration regions. Thus, concentration features are processed using the following transformation:
h_{DGA}(t) = W_{DGA} \log\left( 1 + x_{DGA}(t) \right) + b_{DGA}
where $h_{DGA}(t)$ is the projected DGA feature vector; $W_{DGA}$ is the projection weight matrix; $x_{DGA}(t)$ is the raw DGA feature vector (comprising the individual gas concentrations); $b_{DGA}$ is the bias term.
Based on thermodynamic principles, normalization is performed to convert absolute temperature values into relative temperature rise ratios. This approach aligns more closely with the physical essence of overheating faults, as their diagnosis typically relies on temperature rise rather than absolute values.
h_{thermal}(t) = W_{thermal} \frac{x_{thermal}(t) - T_{ambient}}{T_{rated}} + b_{thermal}
where $h_{thermal}(t)$ is the projected temperature feature vector; $W_{thermal}$ is the projection weight matrix for temperature features; $x_{thermal}(t)$ is the raw temperature measurement; $T_{ambient}$ is the ambient temperature baseline; $T_{rated}$ is the transformer's rated temperature; $b_{thermal}$ is the bias term for temperature features.
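Both physics-guided projections can be sketched together. In the model, $W$ and $b$ are learned parameters, so the identity weights and zero biases used in the example are purely illustrative:

```python
import numpy as np

def project_dga(x_dga, W, b):
    """log(1 + x) compresses the long-tailed gas concentrations
    before the linear projection, as in the DGA equation above."""
    return W @ np.log1p(x_dga) + b

def project_thermal(x_thermal, T_ambient, T_rated, W, b):
    """Convert absolute temperatures into relative temperature-rise
    ratios before the linear projection, as in the thermal equation."""
    return W @ ((x_thermal - T_ambient) / T_rated) + b
```

With identity $W$ and zero $b$, a 65 °C reading against a 25 °C ambient baseline and a 100 °C rating projects to a temperature-rise ratio of 0.4, matching the physical intuition that overheating diagnosis depends on rise rather than absolute value.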
In the model design, physical constraints are integrated into the deep learning framework through several key strategies. Primarily, during the feature projection stage, the aforementioned mathematical transformations ensure that the input features adhere to physical laws. Furthermore, constraint terms based on fundamental principles like energy conservation are incorporated into the attention mechanism, guiding the model to focus on hotspot regions consistent with heat propagation patterns. Additionally, physical consistency constraints are embedded within the loss function design, ensuring that the model’s predictions are not only accurate but also physically plausible.

3.3. Physically Constrained Attention Mechanism and Feature Fusion

This section details the specific implementation method of integrating physical laws into the attention mechanism. Based on the heat balance equation and gas diffusion laws, a multi-physics coupled attention weight calculation method is designed. The core innovation of this mechanism lies in transforming physical constraints into differentiable penalty terms that directly participate in the calculation of attention weights. The physical constraint M thermal , constructed based on the heat balance equation, ensures that the hotspot regions the model focuses on comply with the law of energy conservation. A composite attention calculation framework is built as shown in the following equation:
\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left( \frac{Q K^T}{\sqrt{d_k}} + \lambda_1 M_{thermal} + \lambda_2 M_{gas} \right) V
where $Q$, $K$, $V$ are the query, key, and value matrices obtained through linear transformation; $\sqrt{d_k}$ is the scaling factor, with $d_k$ the dimensionality of the key vectors; $\lambda_1$ and $\lambda_2$ are the weight coefficients for the thermal and gas constraints, respectively; $M_{thermal}$ is the heat balance constraint matrix; $M_{gas}$ is the gas ratio constraint matrix.
The element calculation of the constraint matrix M thermal considers the spatiotemporal characteristics of thermal diffusion. When the difference in power loss between two positions is small, their corresponding attention weights should be enhanced, reflecting the requirements of the law of energy conservation. Simultaneously, the temperature difference term ensures that the hotspot distribution the model focuses on conforms to the laws of heat conduction. The calculation formula is as follows:
M_{thermal}(i, j) = \exp\left( -\frac{\left| P_{loss}(i) - P_{loss}(j) \right|}{\sigma^2} \right) \cdot \frac{1}{\left| T(i) - T(j) \right| + \epsilon}
where $P_{loss}(i)$ is the power loss at position $i$; $\sigma$ is the thermal diffusion coefficient; $T(i)$ is the temperature value at the corresponding position; $\epsilon$ is a small constant that prevents division by zero.
Furthermore, based on industry standards such as the Duval triangle method, the gas ratio constraint matrix M g a s is constructed. The design of this matrix fully considers the physical significance and diagnostic rules of gas ratios in fault diagnosis. The calculation formula is as follows:
M_{gas}(i, j) = \begin{cases} s_{gas}(i, j) & \text{if } D(r_i) = D(r_j) \\ -\alpha & \text{otherwise} \end{cases}
where $D(\cdot)$ represents the fault type discrimination function based on gas ratios; $\alpha > 0$ is the penalty coefficient; $s_{gas}(i, j)$ denotes the gas ratio similarity between samples $i$ and $j$, with $r_i$ the feature gas ratio vector of sample $i$.
This design ensures that when the model computes attention weights, it automatically enhances the associations between samples of the same fault type while suppressing unreasonable associations between samples of different fault types.
Integrating the aforementioned physical constraint matrices into the attention mechanism forms a multi-physics coupled attention calculation framework. The specific implementation process is as follows: first, compute the basic attention score e i j ; then, incorporate the physical constraint terms to obtain e ˜ i j ; after normalization via the softmax function, the final output o i is computed. The specific calculation formulas are as follows:
e_{ij} = \frac{q_i^T k_j}{\sqrt{d_k}}
\tilde{e}_{ij} = e_{ij} + \lambda_1 M_{thermal}(i, j) + \lambda_2 M_{gas}(i, j)
\alpha_{ij} = \frac{\exp\left( \tilde{e}_{ij} \right)}{\sum_{k=1}^{N} \exp\left( \tilde{e}_{ik} \right)}
o_i = \sum_{j=1}^{N} \alpha_{ij} v_j
where $e_{ij}$ is the original attention score between positions $i$ and $j$; $\tilde{e}_{ij}$ is the attention score corrected by the physical constraints; $\alpha_{ij}$ is the final attention weight after Softmax normalization, representing the proportion of "attention" allocated from position $i$ to position $j$ after incorporating physical knowledge; $o_i$ is the final output vector at position $i$, a weighted fusion of information from all positions.
This design allows the model to maintain the powerful representational capacity of the traditional attention mechanism while, through physical constraints, guiding it to focus on feature combinations that conform to physical laws. This significantly enhances the interpretability and physical plausibility of the diagnostic results.
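The chain $e_{ij} \rightarrow \tilde{e}_{ij} \rightarrow \alpha_{ij} \rightarrow o_i$ can be sketched as a single-head computation. The $\lambda$ defaults are illustrative, and the constraint matrices are assumed to be precomputed from the formulas above:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def constrained_attention(Q, K, V, M_thermal, M_gas, lam1=0.1, lam2=0.1):
    """Scaled dot-product attention with additive physics penalties:
    e_ij -> e~_ij -> alpha_ij -> o_i, as in the equations above."""
    d_k = Q.shape[-1]
    e = Q @ K.T / np.sqrt(d_k)                       # base scores e_ij
    e_tilde = e + lam1 * M_thermal + lam2 * M_gas    # physics-corrected scores
    alpha = softmax(e_tilde)                         # normalized weights alpha_ij
    return alpha @ V                                 # outputs o_i
```

With zero constraint matrices this reduces to ordinary scaled dot-product attention; a large positive entry in $M_{thermal}(i, j)$ pulls attention from position $i$ toward position $j$, and the negative $M_{gas}$ penalty suppresses cross-fault-type associations.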

3.4. Multi-Task Diagnosis and Physical Consistency Optimization

The diagnostic layer employs a multi-task learning framework to simultaneously accomplish fault classification, severity assessment, and fault location. Based on the requirement for temporal consistency, a physics-guided output constraint mechanism is designed to ensure that the diagnostic results are both accurate and compliant with physical laws. The multi-task loss function is meticulously designed to fully consider physics-consistency requirements. Its formula is as follows:
L_{total} = L_{data} + \beta_1 L_{thermal} + \beta_2 L_{temporal} + \beta_3 L_{ratio}
where $L_{data}$ denotes the data-fitting loss; $L_{thermal}$ represents the thermal balance consistency loss; $L_{temporal}$ is the temporal continuity loss; $L_{ratio}$ is the gas ratio constraint loss; $\beta_1$, $\beta_2$, and $\beta_3$ are the weighting coefficients for the respective loss terms.
Among them, the thermal balance consistency loss L thermal ensures that the prediction results conform to thermodynamic laws, maintaining the hotspot patterns emphasized by the attention mechanism in the final temperature prediction. Its calculation formula is as follows:
L_{thermal} = \frac{1}{T} \sum_{t=1}^{T} \left\| T_t^{pred} - T_t^{physics} \right\|_2^2 + \gamma \left\| \nabla T_t^{pred} - \nabla T_t^{physics} \right\|_1
where $T_t^{pred}$ is the temperature value predicted by the model; $T_t^{physics}$ is the theoretical temperature value calculated from the physical equations; $\gamma$ is the weight of the gradient consistency loss; $T$ is the total sequence length.
The gas ratio constraint loss L r a t i o is based on industry-standard fault diagnosis rules. It ensures the model’s outputs comply with physical laws by comparing the difference between predicted gas ratios and theoretical standard values. Its calculation formula is as follows:
L_{ratio} = \sum_{t=1}^{T} \left\| r_t^{pred} - r_t^{standard} \right\|^2
Temporal consistency loss captures temporal dependencies through a physics-constrained attention mechanism. It penalizes state transitions that violate fault progression patterns, thereby ensuring the continuity of fault evolution. The formula is as follows:
L_{temporal} = \frac{1}{T-1} \sum_{t=2}^{T} \left\| \hat{y}_t - \hat{y}_{t-1} \right\|_2^2 + \mu \sum_{t=2}^{T} I\left( \left| \Delta \hat{y}_t \right| > \tau \right)
where $\hat{y}_t$ and $\hat{y}_{t-1}$ are the model's predicted outputs at time steps $t$ and $t-1$, respectively; $\mu$ is the weight coefficient of the penalty term; $I(\cdot)$ is the indicator function; $\tau$ is a hyperparameter threshold.
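A simplified sketch of how the loss terms combine into $L_{total}$; for illustration we take $L_{data}$ as mean squared error, omit the gradient term of $L_{thermal}$, and all weight defaults are ours:

```python
import numpy as np

def multitask_loss(y_pred, y_true, T_pred, T_phys, r_pred, r_std,
                   beta1=0.1, beta2=0.1, beta3=0.1, mu=1.0, tau=0.5):
    """L_total = L_data + b1*L_thermal + b2*L_temporal + b3*L_ratio,
    with L_data as MSE and the thermal gradient term omitted."""
    L_data = np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)
    L_thermal = np.mean((np.asarray(T_pred) - np.asarray(T_phys)) ** 2)
    L_ratio = np.sum((np.asarray(r_pred) - np.asarray(r_std)) ** 2)
    dy = np.diff(np.asarray(y_pred), axis=0)          # y_t - y_{t-1}
    L_temporal = np.mean(dy ** 2) + mu * np.sum(np.abs(dy) > tau)
    return L_data + beta1 * L_thermal + beta2 * L_temporal + beta3 * L_ratio
```

Predictions that match the labels, agree with the physical temperature model, and evolve smoothly in time drive every term to zero, which is the behavior the joint objective rewards.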
Constructing a multi-task optimization framework enhances the model’s generalization capability through joint optimization. The introduction of physics-based constraint terms reduces the risk of overfitting. Meanwhile, the multi-task outputs provide richer information for comprehensive diagnosis.
The training process employs a two-stage curriculum learning strategy, progressively incorporating physical constraints. The first stage focuses on physics-based pre-training, emphasizing the optimization of the physics-consistency loss. This stage enables the model to preliminarily grasp the physical laws, laying a foundation for subsequent training. The second stage involves supervised fine-tuning, balancing data fitting and physical constraints. The constraint weights are dynamically adjusted during the training process to ensure optimal training effectiveness.
In the first stage, an unsupervised pre-training approach is adopted, focusing on optimizing the physics-consistency loss function. The primary objective of this stage is to allow the model to learn the fundamental physical laws of transformer operation before being exposed to labeled data. In this way, the model establishes a preliminary understanding of fault evolution patterns, providing a favorable initialization for subsequent supervised learning. The pre-training stage utilizes a large-scale dataset of unlabeled monitoring data, including normal operational data and data indicating minor anomalies, enabling the model to learn the physical laws governing normal operating states.
\min_{\theta}\ L_{physics} = L_{thermal} + L_{temporal} + L_{ratio}
where $\theta$ represents the set of learnable model parameters; $\min_{\theta}$ denotes minimization of the loss with respect to $\theta$; $L_{physics}$ is the sum of the physical constraint losses.
The second stage involves supervised fine-tuning on the pre-trained model using labeled data. This stage aims to adapt the model to the specific fault diagnosis task while maintaining consistency with physical laws. A smooth transition of the training focus is achieved by introducing a dynamic constraint weight β ( t ) , calculated as follows:
\min_{\theta}\ L_{total} = L_{data} + \beta(t)\, L_{physics}
\beta(t) = \beta_0 \exp\left( -\frac{t}{T_{decay}} \right) + \beta_{\min}
where $\beta(t)$ represents the constraint weight; $\beta_0$ is the initial constraint weight; $T_{decay}$ is the decay time constant; $\beta_{\min}$ is the minimum constraint weight; $t$ denotes the training time step.
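The decay schedule is a one-liner; the parameter values below are illustrative, not the paper's:

```python
import math

def beta_schedule(t, beta0=1.0, T_decay=100.0, beta_min=0.05):
    """Exponentially decaying physics-constraint weight beta(t),
    floored at beta_min, per the equation above."""
    return beta0 * math.exp(-t / T_decay) + beta_min
```

Early in training $\beta(t) \approx \beta_0 + \beta_{\min}$, so the physics losses dominate; the weight then decays smoothly toward $\beta_{\min}$, shifting the optimization focus to data fitting.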
During the initial phase of training, the model parameters require strong physical constraints to guide the learning direction and prevent convergence to suboptimal local minima. As training progresses and the model gradually assimilates the underlying physical laws, the focus shifts towards enhancing its data-fitting capability to improve diagnostic accuracy. Consequently, the constraint weight is dynamically attenuated according to the prescribed exponential decay schedule.

3.5. Summary

This study proposed an innovative method for power transformer fault diagnosis based on a physics-informed enhanced transformer architecture. By deeply integrating physical prior knowledge—such as thermodynamic laws and gas diffusion principles—into the deep learning model, we successfully constructed an intelligent diagnostic system featuring synergistic multi-modal feature fusion and physical constraints.
In terms of model design, this study innovatively introduced a physics-guided feature projection mechanism and a multi-physics coupled attention calculation method, enabling the model to automatically adhere to physical laws during the feature extraction process. The design of a multi-task loss function incorporating thermal balance consistency, temporal continuity, and gas ratio constraints ensured that the diagnostic results maintain both accuracy and physical plausibility.
Regarding the training strategy, a two-stage curriculum learning approach and a dynamic constraint weight adjustment mechanism were employed, achieving an organic unification of physical principles and data-driven learning. Experimental validation demonstrated that the proposed method significantly outperforms traditional approaches in fault diagnosis accuracy, physical consistency, and generalization capability.
The main contribution of this study lies in establishing a deep integration framework that couples physical laws with deep learning models, providing a new technological pathway for transformer condition assessment. The core methodology is not only applicable to power equipment fault diagnosis but can also be extended to intelligent operation and maintenance in other complex industrial systems, holding significant theoretical value and prospects for engineering application.

4. Case Study

4.1. Case Setup

This section evaluates the proposed fault prediction model for power transformers based on multi-source heterogeneous data. The experiment utilizes multi-source heterogeneous monitoring data and Dissolved Gas Analysis (DGA) data from power transformers deployed in the traction power supply sections of an urban rail transit system. The dataset comprises 3000 state samples collected over a six-month period from 76 oil-immersed transformers with similar configurations, providing a representative foundation for evaluating fault evolution characteristics in real-world operational scenarios.
All experiments were conducted on a high-performance computing cluster running the Linux operating system. The hardware configuration includes an eight-core Intel i9 processor with a base frequency of 2.3 GHz and 32 GB of RAM. Model development, training, and testing were completed using Python 3.8.5. The comprehensive case study encompasses data description, model configuration, the training process, and a holistic performance evaluation.
This study utilized monitoring data collected over a three-year period (2020–2023) from 76 in-service 110-kV power transformers operated by a provincial grid company in China. The dataset integrates multi-source heterogeneous monitoring signals, including Dissolved Gas Analysis (DGA), vibration, infrared thermal imaging, partial discharge, oil chemistry, and operational parameters. Data preprocessing involved dynamic time warping alignment and physics-guided feature projection (e.g., logarithmic transformation of gas concentrations and conversion of absolute temperatures into relative temperature rise ratios) to construct a unified temporal representation. A total of 3000 valid samples were constructed, where each sample corresponds to a multi-modal feature vector at a specific timestamp, labeled with the transformer’s health state. Health states were determined based on DGA guidelines, maintenance records, and expert analysis, covering seven categories: Normal (N), Low-temperature Overheating (LT), Medium-temperature Overheating (MT), High-temperature Overheating (HT), Partial Discharge (PD), Low-energy Discharge (LD), and High-energy Discharge (HD). To ensure generalizability and avoid data leakage, a leave-transformers-out partitioning strategy was adopted: 61 transformers for training, 8 for validation, and 7 for testing.

4.2. Dataset Construction

To validate the effectiveness of the proposed transformer fault prediction model based on multi-source heterogeneous features, this study selected operational records from oil-immersed power transformers in an urban rail transit traction power supply area as the research subject. The final curated dataset includes key monitoring indicators such as dissolved gas features (H2, CH4, C2H4, C2H6, C2H2, CO, CO2) and multi-dimensional state variables (temperature parameters, partial discharge magnitude and its severity, vibration measurements, load rate, and cooling status). These indicators cover the most representative electrical, thermal, and mechanical multi-source features for fault diagnosis.
The data were collected from multiple transformers with similar configurations in the traction power supply section of the urban rail transit system over their routine operational cycles. To ensure that subsequent model training reflects the various typical operating states of the transformers, typical values of the multidimensional state variables and the DGA data used for modeling are presented in Table 2 and Table 3, respectively. For brevity, each table lists representative values for each transformer state; the remaining data are consistent with these examples. The data encompass gas characteristic profiles under different fault conditions as well as variations in operating conditions, providing a foundational dataset for identifying typical fault patterns such as low-temperature overheating, medium-temperature overheating, high-temperature overheating, partial discharge, low-energy discharge, and high-energy discharge. The final compiled samples are comprehensive and exhibit clear state distinctions, effectively supporting subsequent validation tasks such as fault classification, risk identification, and feature contribution analysis.
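The Dynamic Time Warping alignment used to place channels with different sampling rates (Table 1) on a unified time axis can be sketched with the textbook dynamic-programming recursion below. This is a generic one-dimensional DTW, not the authors' exact alignment procedure.

```python
def dtw_align(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping.

    Returns the cumulative alignment cost and the warping path as (i, j)
    index pairs, which can be used to resample a slowly sampled channel
    (e.g. daily DGA) onto a faster one (e.g. per-minute operating data).
    """
    n, m = len(a), len(b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # match both
    # Backtrack the optimal warping path from (n, m) to (1, 1)
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    return cost[n][m], path[::-1]
```

For identical sequences the cost is zero and the path is the diagonal; for channels of unequal length the path stretches the shorter one, which is exactly the behavior the unified temporal representation needs.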
The raw dataset was randomly partitioned into a training set and a test set at a 4:1 ratio. Specifically, 80% of the total samples were allocated for model training, while the remaining 20% were set aside for performance evaluation. To gain an in-depth understanding of the distribution characteristics of the transformer state data, the DBSCAN clustering method was applied to the training set samples for state characterization and cluster analysis. Multiple dimensionality reduction and visualization techniques were subsequently employed to reveal the intrinsic structure of the data.
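For reference, a minimal DBSCAN pass over 2-D points (e.g. after dimensionality reduction) can be written as follows. The `eps` and `min_pts` values used in the study are not reported, so the parameters here are illustrative, and this sketch is not the authors' implementation.

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN for 2-D points.

    Returns one label per point: 0, 1, ... for clusters, -1 for noise.
    """
    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # provisionally noise
            continue
        cluster += 1                 # i is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point, reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_seeds = neighbors(j)
            if len(j_seeds) >= min_pts:
                queue.extend(k for k in j_seeds if labels[k] is None)
    return labels
```

Dense regions (such as the normal-state cluster in Figure 2) come out as single labels, while isolated samples are flagged as noise, which is what makes DBSCAN useful for characterizing the state distribution before training.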
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction technique capable of effectively preserving the local structural relationships of high-dimensional data. Figure 2a visualizes the distribution of the normal state versus the fault states based on t-SNE. The normal state samples (green points) form a single large, densely populated cluster, indicating a high degree of consistency in the operational characteristics of transformers under normal conditions. In contrast, the fault state samples (red points) are more widely dispersed, reflecting the diversity of fault types within the feature space. Figure 2b provides a more refined view, presenting the t-SNE distribution of the seven specific states. The normal state retains its compact clustering, while the fault states exhibit different distribution patterns. Sample points for certain fault types, such as partial discharge (PD) and low-energy discharge (LD), show partial overlap with relatively ambiguous boundaries, suggesting that these faults share similar physical characteristics. Other fault types, such as high-energy discharge (HD) and the different levels of overheating, occupy relatively independent regions, indicating good separability within the feature space.
Based on the aforementioned analytical results, an adaptive sampling strategy was employed during the model training process. For the normal state, which exhibits clear boundaries, its sample weight was appropriately reduced to prevent the model from overfitting to these easily classifiable samples. For fault types with ambiguous boundaries, their sample weights were increased to ensure the model could sufficiently learn these challenging classification boundaries. This targeted training strategy, informed by data characteristic analysis, effectively enhanced the model’s generalization capability and robustness in the transformer state assessment task, demonstrating superior classification performance, particularly when handling fault types with ambiguous boundaries.
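The adaptive sampling strategy above can be approximated by inverse-frequency sample weights with an extra multiplier for the boundary-ambiguous classes. The exact weighting scheme and the boost factor are not reported in the paper, so the values below are illustrative assumptions.

```python
from collections import Counter

def adaptive_weights(labels, boosted, boost=1.5, base=1.0):
    """Per-sample weights: inverse class frequency, multiplied by an
    extra `boost` for boundary-ambiguous classes (e.g. 'PD', 'LD').

    The boost factor 1.5 is an illustrative assumption.
    """
    freq = Counter(labels)
    n = len(labels)
    weights = []
    for y in labels:
        w = base * n / (len(freq) * freq[y])  # inverse-frequency weight
        if y in boosted:
            w *= boost                         # emphasize hard classes
        weights.append(w)
    return weights
```

Abundant, easily separable classes (such as the normal state) receive small weights, while the overlapping PD/LD classes receive larger ones, steering the loss toward the ambiguous decision boundaries.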

4.3. Case Analysis

This study conducted a systematic performance evaluation of the proposed physics-informed transformer (PI-Transformer) model. Based on multi-state monitoring samples from power transformers, the PI-Transformer was employed for state discrimination and fault prediction. Its performance was compared against several traditional machine learning algorithms, including Support Vector Machine (SVM), Random Forest (RF), LightGBM, and XGBoost, as well as a representative multimodal deep learning benchmark, the dynamic multiscale attention CNN–LSTM model [6]. These analyses collectively validate the superiority of the proposed model in terms of multi-type fault discrimination accuracy, adaptability, and physical consistency.

4.3.1. Accuracy Test

As shown in Figure 3, the average discrimination accuracy of the proposed PI-Transformer method for the seven transformer states reached 87.5% as the number of epochs increased. Figure 4 reveals differences in the identification difficulty among different fault states. The recognition accuracy for boundary-ambiguous faults, specifically low-energy discharge (LD) and partial discharge (PD), was relatively lower, with PI-Transformer achieving 83.6% and 77.9%, respectively. In contrast, the highest accuracies among the other compared methods were only 78.4% (LightGBM) for LD and 71.9% (LightGBM) for PD. This is primarily attributed to the high similarity in features between these two fault types, resulting in relatively ambiguous boundaries in the feature space, which makes it difficult for traditional classifiers to effectively distinguish them. Nevertheless, compared to traditional methods, PI-Transformer demonstrated a clear advantage for multiple critical fault types. Even for the boundary-ambiguous LD and PD faults, PI-Transformer still achieved a certain level of improvement. This advantage mainly stems from the introduction of the physical constraint mechanism. By integrating physical parameters such as operating temperature, load rate, and vibration frequency of the transformer, the model’s ability to learn the essential characteristics of faults is enhanced. In contrast to LD and PD, high-energy discharge (HD) exhibits more distinct features, enabling the proposed method to achieve a recognition accuracy exceeding 89%. Within the series of overheating faults, from low-temperature overheating (LT) to high-temperature overheating (HT), the recognition accuracy shows a gradually increasing trend, reflecting the physical principle that fault characteristics become more pronounced with increasing severity. 
Particularly in scenarios with small samples and overlapping features, the physical constraints provide an additional regularization effect, preventing the model from learning spurious features that contradict physical laws. Regarding the performance difference between the training and test sets, PI-Transformer exhibited a smaller accuracy drop on the test set (an average decrease of 2.8%), demonstrating good generalization capability. This indicates that the physical constraints not only improve the model’s discrimination accuracy but also enhance its stability across different scenarios.
As shown in Table 4, distinct performance variations are evident among the algorithms. Among the comparative methods, the dynamic multiscale attention CNN–LSTM model proposed in [6] is selected as a representative multimodal deep learning benchmark for transformer anomaly detection. This method combines convolutional neural networks for spatial feature extraction with LSTM networks for temporal dependency modeling, enhanced by a multiscale attention mechanism that adaptively fuses heterogeneous monitoring data. The proposed PI-Transformer consistently outperforms the CNN–LSTM benchmark on both the training and test datasets: it achieves an accuracy of 89.70% on the training set and 86.90% on the test set, whereas the CNN–LSTM model attains 84.49% and 81.28%, respectively. This performance gap indicates better generalization capability, which can be attributed to the physics-informed constraints guiding the model under complex operating conditions.
Gradient boosting methods (LightGBM and XGBoost) demonstrate competitive performance across most fault types, although their accuracy tends to decrease when distinguishing faults with ambiguous boundaries. Traditional strong classifiers, such as Random Forest and SVM, perform reliably on simpler fault types but show limitations in capturing complex fault patterns. Logistic Regression exhibits relatively low overall performance owing to its limited model capacity, reflecting the inherent limitations of linear models in handling the nonlinear characteristics of transformer faults. These observations are consistent with the weighted average performance across fault types and, as indicated by the overall trends in Table 4, highlight the comparative advantages of the proposed PI-Transformer in multi-type fault diagnosis.
In summary, the physics-informed transformer method proposed in this paper demonstrates advantages in multi-type fault diagnosis for power transformers, including high accuracy, strong adaptability to complex faults, and good cross-scenario stability. It provides effective technical support for transformer condition monitoring and fault early warning.

4.3.2. Scenario Adaptability Test

To validate the adaptability of the proposed physics-constrained transformer model to the complex operating environments of rail transit power supply transformers, the following tests were conducted under scenarios involving continuous monitoring samples and sample loss due to communication interruptions.
Continuous monitoring samples from 10 consecutive time periods (at 15-min intervals) of rail transit power supply transformers were selected to evaluate the fault state prediction performance of the physics-constrained transformer. As shown in Figure 5, the PI-Transformer achieved a prediction accuracy of 83.2%, a relative decrease of 4.9% and the smallest decline among all tested classifiers. In contrast, the accuracy degradation of the other classifiers ranged from 5.9% to 14.0%, with Logistic Regression (LR) performing worst at a 14.0% decline. The results indicate that the proposed method can effectively capture the temporal characteristics of fault state evolution in power transformers, identifying changes in dissolved gas composition and operating condition indicators during fault progression. The high-accuracy early identification of fault states provided by the PI-Transformer offers a reliable basis for maintenance personnel to implement timely inspections and protective measures, thereby helping to ensure the operational safety of the rail transit power supply.
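For clarity, the 4.9% figure is a relative (not absolute) decline, measured against the 87.5% average accuracy reported in Section 4.3.1; the arithmetic can be reproduced as:

```python
def relative_decline(baseline_pct, degraded_pct):
    """Relative accuracy decline, expressed in percent of the baseline."""
    return (baseline_pct - degraded_pct) / baseline_pct * 100.0

# PI-Transformer: 87.5% static accuracy vs 83.2% on time-series samples
drop = relative_decline(87.5, 83.2)  # approximately 4.9
```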
Furthermore, the widespread distribution of rail transit power supply networks and the susceptibility of communication channels to weather or network interference can lead to monitoring interruptions, which may introduce deviations in state identification. The adaptability of the proposed physics-constrained transformer method under strong interference conditions was further evaluated. Scenarios with random loss of monitoring data in certain time periods were simulated to represent communication interruption events. As shown in Figure 6, the physics-constrained transformer algorithm maintained a high accuracy of 82.0% in state identification under discontinuous conditions with 20% sample loss. The accuracy degradation of the proposed algorithm under interference was approximately 7.1%, whereas other algorithms exhibited declines of about 9.2% to 14.4%. This demonstrates that the proposed algorithm, through its physics constraints and attention mechanism, can effectively mitigate the impact of monitoring interference, exhibiting strong adaptability in challenging scenarios. However, the study also notes that when communication interruptions are prolonged, resulting in a sample loss ratio exceeding 40%, the identification accuracy of the proposed method drops below 72%. In such cases, prompt on-site maintenance is required to restore communication and ensure operational safety.
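The communication interruption scenario can be emulated by randomly masking a fraction of time steps. Forward-filling lost steps is one plausible recovery strategy sketched here; the paper does not state how missing samples are imputed, so this is an illustrative assumption.

```python
import random

def drop_samples(series, loss_ratio, seed=0):
    """Randomly mark a fraction of time steps as lost, then forward-fill
    so the diagnostic model still receives a fixed-length sequence.

    Forward-filling is an assumed recovery strategy, not the paper's.
    """
    rng = random.Random(seed)
    n = len(series)
    lost = set(rng.sample(range(n), int(n * loss_ratio)))
    out, last = [], None
    for t, x in enumerate(series):
        if t in lost and last is not None:
            out.append(last)   # reuse the last successfully received sample
        else:
            out.append(x)
            last = x
    return out
```

Sweeping `loss_ratio` from 0.2 to 0.4 and re-running the classifier reproduces the kind of degradation curve discussed above, where accuracy stays usable at 20% loss but collapses beyond 40%.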

4.3.3. Summary

The dataset used comprises 3000 state samples collected over six months from 76 oil-immersed transformers with similar configurations, covering diverse operational states and fault patterns. The extensive coverage ensures that the evaluation reflects realistic operational scenarios and supports robust multi-type fault assessment. The typical operating conditions (Table 2) and DGA indicators (Table 3) provide further evidence of the dataset’s representativeness and its support for accurate state identification.
Furthermore, Figures 5 and 6 illustrate the temporal and scenario adaptability of the proposed model. In continuous monitoring scenarios, the PI-Transformer maintains high accuracy with minimal decline under temporal dependencies. Under simulated communication loss conditions, it preserves robust performance even with partial data loss, indicating strong resilience against real-world operational uncertainties. Additionally, the monotonic increase in recognition accuracy from LT to HT faults demonstrates that the model captures physically meaningful fault evolution patterns.
Overall, the effectiveness of the PI-Transformer is demonstrated through complementary analyses across multiple perspectives: static state classification, comparative algorithm performance, temporal adaptability, robustness under data loss, and physical consistency. Its handling of boundary-ambiguous faults such as LD and PD further highlights the model's capability in challenging classification scenarios. The combined evidence confirms that the proposed method achieves accurate, generalizable, and physically consistent multi-type fault diagnosis.

5. Conclusions

To address the challenges in transformer fault diagnosis, this study proposes a novel physics-informed transformer (PI-Transformer) model based on multimodal input data and the integration of physical constraints. The proposed framework constructs a unified temporal representation from multi-source heterogeneous monitoring data and embeds physically meaningful constraints to guide model learning.
A systematic case analysis based on long-term operational data from multiple oil-immersed transformers was conducted to evaluate the effectiveness of the proposed method. The experimental results show that the PI-Transformer achieves high fault identification accuracy on both training and test datasets, outperforming several traditional machine learning methods and a representative multimodal deep learning benchmark. In particular, the proposed model demonstrates improved discriminative capability for boundary-ambiguous fault types, such as low-energy discharge and partial discharge, which are typically difficult to distinguish using conventional approaches.
Furthermore, the introduction of physical constraints contributes to enhanced generalization performance and scenario adaptability. The model exhibits a relatively small performance gap between the training and test sets, indicating stable generalization behavior. Additional tests under continuous monitoring sequences and simulated communication interruptions further demonstrate the robustness of the proposed approach under realistic operating conditions.

Author Contributions

Conceptualization, Y.H. and Z.H.; methodology, Y.H.; validation, Z.H. and J.C.; formal analysis, Y.H.; investigation, Z.H.; data curation, J.C.; writing—original draft preparation, Y.H.; writing—review and editing, Z.H. and J.C.; visualization, J.C.; supervision, Z.H.; funding acquisition, Z.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by: (1) the 2025 Scientific Research Platforms and Projects of Guangdong Provincial Education Department, grant number 2025KTSCX290; (2) the 2025 Central Government Guided Local Science and Technology Development Special Fund (Shantou Innovative City Construction), grant number STKJ2025095; and (3) the STU Scientific Research Initiation Grant, grant number NTF24030T.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Revera, A.; Abbas, M. Condition Monitoring of Power Transformers Using AI-Driven Diagnostic Systems. Natl. J. Electr. Mach. Power Convers. 2025, 1, 1–8. [Google Scholar]
  2. Saravanan, B.; Vengateson, A. Benchmarking Traditional Machine Learning and Deep Learning Models for Fault Detection in Power Transformers. arXiv 2025, arXiv:2505.06295. [Google Scholar]
  3. Zhao, Z.; Ding, X.; Prakash, B.A. Pinnsformer: A transformer-based framework for physics-informed neural networks. arXiv 2023, arXiv:2307.11833. [Google Scholar]
  4. Ma, M.; Han, L.; Zhou, C. Research and application of Transformer based anomaly detection model: A literature review. arXiv 2024, arXiv:2402.08975. [Google Scholar] [CrossRef]
  5. Wu, B.; Hu, Y. Analysis of substation joint safety control system and model based on multi-source heterogeneous data fusion. IEEE Access 2023, 11, 35281–35297. [Google Scholar] [CrossRef]
  6. Tamakloe, E.; Kommey, B.; Kponyo, J.J.; Tchao, E.T.; Agbemenu, A.S.; Klogo, G.S. Predictive AI Maintenance of Distribution Oil-Immersed Transformer via Multimodal Data Fusion: A New Dynamic Multiscale Attention CNN-LSTM Anomaly Detection Model for Industrial Energy Management. IET Electr. Power Appl. 2025, 19, e70011. [Google Scholar] [CrossRef]
  7. Nuruzzaman, M.; Limon, G.Q.; Chowdhury, A.R.; Khan, M.M. Predictive Maintenance in Power Transformers: A Systematic Review of AI And IOT Applications. ASRC Procedia Glob. Perspect. Sci. Scholarsh. 2025, 1, 34–47. [Google Scholar]
  8. Rao, S.; Nie, S.; Lv, X.; Ruan, W.; Yin, F.; Barmada, S.; Zhao, Y.; Deng, Y. Evaluating Power Transformer State Based on Tank Vibration: A Graphical Approach. IEEE Trans. Instrum. Meas. 2025, 74, 3539615. [Google Scholar] [CrossRef]
  9. Mo, B.Y.; Zhou, S.H.; Han, M.L.; He, Y.J.; Li, J.W.; Yang, Y.P. Transformer Anomaly Detection Based on Multi-Source Information Fusion of Insulation Oil. J. Comput. 2025, 36, 41–54. [Google Scholar] [CrossRef]
  10. Zheng, J.; Chen, T.; Wang, Z.; He, J. Multi-Modal Fusion for Substation Thermal Situational Awareness: Infrared Clustering and Time-Series Forecasting. In Proceedings of the 2025 IEEE 15th International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER), Shanghai, China, 15–18 July 2025; pp. 264–268. [Google Scholar]
  11. Tang, P.; Zhang, Z.; Tong, J.; Long, T.; Huang, C.; Qi, Z. Predicting transformer temperature field based on physics-informed neural networks. High Volt. 2024, 9, 839–852. [Google Scholar] [CrossRef]
  12. Zhang, X.; Zhang, Y.; Yang, Q. GRU-Dual Transformer–BiLSTM Fusion for Transformer Fault Diagnosis. Int. J. Circuit Theory Appl. 2025; Online version of record before inclusion in an issue. [Google Scholar] [CrossRef]
  13. Chen, H.C.; Zhang, Y. Dissolved Gas Analysis Using Knowledge-Filtered Oversampling-Based Diverse Stack Learning. IEEE Trans. Instrum. Meas. 2025, 74, 2505211. [Google Scholar] [CrossRef]
  14. Liu, X.; He, X.; Li, Y. A Multi-Scale Time Adaptive Fusion Network for Transformer Fault Diagnosis. Eng. Rep. 2025, 7, e70152. [Google Scholar] [CrossRef]
  15. Alsobhani, A.; Alwash, S.; Rahaim, L.A.A. Comparative Study of DGA-Based AI Models for Transformer Faults. In Proceedings of the 2025 9th International Conference on Man-Machine Systems (ICoMMS), Malacca, Malaysia, 18–19 August 2025; pp. 481–486. [Google Scholar]
  16. Vatsa, A.; Hati, A.S.; Rathore, A.K. Enhancing transformer health monitoring with ai-driven prognostic diagnosis trends: Overcoming traditional methodology’s computational limitations. IEEE Ind. Electron. Mag. 2023, 18, 30–44. [Google Scholar] [CrossRef]
  17. IEC 60599:2022; Mineral Oil-Filled Electrical Equipment in Service—Guidance on the Interpretation of Dissolved and Free Gases Analysis. International Electrotechnical Commission: Geneva, Switzerland, 2022.
Figure 1. Algorithm flowchart.
Figure 2. Clustering of transformer state samples. (a) Classification of normal and fault states; (b) classification of fault type.
Figure 3. Accuracy iteration of the PI-Transformer and comparative methods.
Figure 4. State-wise accuracy of the PI-Transformer for transformer fault diagnosis.
Figure 5. Testing accuracy of various algorithms under time-series samples.
Figure 6. Testing accuracy of various algorithms in a communication interruption environment.
Table 1. Description of Multi-Source Monitoring Data Characteristics.

| Data Type | Monitoring Parameters | Physical Meaning | Sampling Frequency |
|---|---|---|---|
| DGA data | H2, CH4, C2H2 and other gases | Thermal decomposition and electrical fault characteristics of insulating materials | Daily |
| Infrared image | Maximum temperature, temperature gradient, and hotspot distribution | Surface overheating and poor connection faults | Daily |
| Partial discharge | Discharge magnitude, frequency, and PRPD pattern | Insulation defects and discharge intensity | Real-time |
| Vibration signal | Vibration amplitude and frequency components | Mechanical state and loose iron core | Hourly |
| Oil chemical data | Breakdown voltage, moisture content, and acid value | Aging state of insulation medium | Quarterly |
| Operating data | Load current, oil temperature, and ambient temperature | Operating conditions and thermal stress | Per minute |
Table 2. Typical operating condition indicators for transformer state classification.

| State | Temp. Hotspot | Temp. Surface | Temp. Calculated Hotspot | PD Quantity | PD Severity | Vibration (100 Hz) | Vibration (200 Hz) | Load Rate | Cooling Status |
|---|---|---|---|---|---|---|---|---|---|
| NS | 41.55 | 40.75 | 54.31 | 37.99 | 6.84 | 0.44 | 0.2 | 65.84 | 0 |
| LD | 45.33 | 48.43 | 60.31 | 375.19 | 34.75 | 0.76 | 0.88 | 57.11 | 0 |
| HD | 55.07 | 59.12 | 83.81 | 1022 | 76.93 | 4.27 | 1.35 | 74.87 | 1 |
| PD | 47.76 | 42.07 | 62.61 | 1021.63 | 66.58 | 1.55 | 0.82 | 92.6 | 0 |
| LT | 72.71 | 69.31 | 86.22 | 98.46 | 19.01 | 0.31 | 0.39 | 100 | 0 |
| MT | 108.36 | 94.38 | 142.61 | 94.97 | 28.38 | 1.89 | 1.38 | 100 | 1 |
| HT | 155.4 | 145.77 | 187.08 | 102.84 | 37.95 | 2.36 | 2.62 | 115.76 | 2 |
Table 3. Typical DGA indicators for transformer state classification (ppm).

| State | H2 | CH4 | C2H4 | C2H6 | C2H2 | CO | CO2 | Sample Count |
|---|---|---|---|---|---|---|---|---|
| NS | 6.29 | 26.73 | 17.74 | 12.41 | 0.07 | 72.78 | 2055.46 | 420 |
| LD | 42.44 | 79.34 | 40.96 | 4.56 | 7.42 | 330.15 | 2479.17 | 407 |
| HD | 305.57 | 287.54 | 32.31 | 25.89 | 32.63 | 671.94 | 2987.94 | 441 |
| PD | 19.47 | 39.13 | 33.18 | 21.07 | 2.45 | 182.94 | 2820.99 | 438 |
| LT | 67.92 | 182.96 | 53.23 | 30.71 | 1.11 | 557.06 | 4347.14 | 426 |
| MT | 106.44 | 286.31 | 134.31 | 148.99 | 3.41 | 537.32 | 5787.7 | 438 |
| HT | 414.47 | 463.82 | 321.68 | 362.94 | 9.51 | 1540.64 | 11,166.93 | 430 |
Table 4. Accuracy of different methods.

| Accuracy | PI-Transformer | CNN–LSTM | LightGBM | XGBoost | SVM | RF | LR |
|---|---|---|---|---|---|---|---|
| Training set | 89.70% | 84.49% | 82.23% | 81.67% | 79.74% | 78.76% | 75.25% |
| Test set | 86.90% | 81.28% | 78.96% | 76.52% | 70.78% | 73.01% | 74.10% |
Huang, Y.; Huang, Z.; Chen, J. Multi-Source Data Fusion and Multi-Task Physics-Informed Transformer for Power Transformer Fault Diagnosis. Energies 2026, 19, 599. https://doi.org/10.3390/en19030599