Article

Mathematical and Machine Learning Innovations for Power Systems: Predicting Transformer Oil Temperature with Beluga Whale Optimization-Based Hybrid Neural Networks

Chongqing University-University of Cincinnati Joint Co-op Institute, Chongqing University, Chongqing 400044, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Mathematics 2025, 13(11), 1785; https://doi.org/10.3390/math13111785
Submission received: 20 April 2025 / Revised: 24 May 2025 / Accepted: 24 May 2025 / Published: 27 May 2025

Abstract

Power transformers are vital in power systems, where oil temperature is a key operational indicator. This study proposes an advanced hybrid neural network model, BWO-TCN-BiGRU-Attention, to predict the top-oil temperature of transformers. The model was validated using temperature data from power transformers in two Chinese regions. It achieved MAEs of 0.5258 and 0.9995, MAPEs of 2.75% and 2.73%, and RMSEs of 0.6353 and 1.2158, significantly outperforming mainstream methods like ELM, PSO-SVR, Informer, CNN-BiLSTM-Attention, and CNN-GRU-Attention. In tests conducted in spring, summer, autumn, and winter, the model’s MAPE was 2.75%, 3.44%, 3.93%, and 2.46% for Transformer 1, and 2.73%, 2.78%, 3.07%, and 2.05% for Transformer 2, respectively. These results indicate that the model can maintain low prediction errors even with significant seasonal temperature variations. In terms of time granularity, the model performed well at both 1 h and 15 min intervals: for Transformer 1, MAPE was 2.75% at 1 h granularity and 2.98% at 15 min granularity; for Transformer 2, MAPE was 2.73% at 1 h granularity and further reduced to 2.16% at 15 min granularity. This shows that the model can adapt to different seasons and maintain good prediction performance with high-frequency data, providing reliable technical support for the safe and stable operation of power systems.

1. Introduction

Power transformers play a critical role in ensuring the symmetrical operation of power systems [1], serving as key infrastructure for transmission and distribution with extensive applications in sectors such as transportation [2]. Any transformer failure may trigger cascading failures that disrupt system stability, potentially leading to widespread blackouts and substantial economic losses [3]. As vital components of power grids, transformers’ stable operation fundamentally guarantees electrical symmetry and balanced load distribution [4,5].
The oil temperature at the transformer’s apex serves as a pivotal diagnostic indicator for operational anomalies, given its direct correlation with the thermodynamic state of internal components [6]. Transformer oil not only serves essential cooling functions but also performs critical dielectric responsibilities. Elevated oil temperatures exhibit intricate associations with dielectric system deterioration, which constitutes a primary causative factor in transformer failures [7]. Empirical studies demonstrate that thermal elevations induce molecular dissociation in insulating materials, thereby precipitating accelerated degradation kinetics [8,9]. Aberrant thermal profiles are frequently symptomatic of insulation deterioration, which compromises the functional integrity of power transformation systems and precipitates premature service termination. The integration of oil temperature metrics with operational load parameters enables the enhanced predictive modeling of failure events, as these variables synergistically govern dielectric degradation rates [10].
Under standard load conditions, the top-oil temperature of power transformers typically remains confined below 60 °C [11]. Within this thermal range, both the insulating materials and dielectric oil exhibit sustained chemical stability, precluding observable degradation phenomena during routine operation [12]. Exceeding operational thresholds of 80 °C induces a statistically significant escalation in transformer failure rates [13]. Empirical investigations have further demonstrated that thermal exposures surpassing 100 °C accelerate insulation degradation kinetics by orders of magnitude, thereby substantially amplifying the conditional probability of catastrophic dielectric failure [14,15]. Furthermore, oscillatory thermal variations impose detrimental ramifications on transformer integrity. Prolonged exposure to sustained thermal stress precipitates the precipitous deterioration of both liquid dielectric and cellulose-based insulation matrices [16]. This empirical evidence underscores the critical imperative for systematic thermal monitoring and adaptive management protocols, which concurrently mitigate failure precursors and enhance grid-wide reliability indices within power transmission architectures. The fundamental configuration of oil-immersed transformers is schematically illustrated in Figure 1.

2. Related Works

In the field of temperature monitoring, thermometric sensors serve as the primary instruments, requiring precise installation within target equipment to facilitate real-time thermal monitoring and quantitative analysis [17]. Representative methodologies are summarized in Table 1. Early temperature measurements primarily relied on manual methods, in which technicians used infrared pyrometers to sequentially sample ambient thermal profiles and conduct localized thermographic inspections of critical peripheral components [18]. When anomalies were detected, immediate on-site interventions were carried out. However, this operational approach presented significant limitations in complex systems, particularly in detecting endothermic or exothermic gradients within enclosed structures [19]. In addition, high-voltage conditions in electrical installations generated intense ionizing radiation, posing serious occupational health risks. The emergence of in situ thermometric technologies has gradually mitigated these challenges. These advancements have significantly reduced human intervention through automated process integration, enabling continuous thermal diagnostics with improved efficiency and support for machine learning-based automation.
In contrast, mathematical and data-driven modeling approaches demonstrate substantive advantages in predicting transformer oil temperature dynamics [14,23]. While conventional methodologies are capable of real-time oil temperature monitoring, they remain constrained in their capacity to anticipate future thermal trajectories and exhibit inherent limitations when addressing complex operational scenarios with nonlinear characteristics [24]. Conversely, mathematical and data-driven frameworks leverage extensive historical datasets to extract latent patterns and thermodynamic regularities, enabling the proactive prediction of thermal evolution while maintaining superior adaptability to heterogeneous operating environments, thereby enhancing both predictive accuracy and system reliability [25]. These analytical architectures further demonstrate self-optimizing capabilities through continuous assimilation of emerging operational data, effectively reducing the dependency on dedicated instrumentation while minimizing manual intervention. Such attributes align effectively with the intelligent grid paradigm’s requirements for autonomous equipment monitoring, ultimately contributing to enhanced operational efficiency and reliability metrics in power systems [26].
Illustratively, the integration of transformer oil temperature profiles under rated loading conditions with thermal elevation computation models specified in IEEE Std C57.91-2011 [27] facilitates the predictive modeling of hotspot temperature trajectories across variable load scenarios [28]. Taheri’s thermal modeling approach incorporates heat transfer principles with electrothermal analogies while accounting for solar radiation impacts [29]; however, such heat transfer models frequently rely on simplified assumptions regarding complex physical processes that may not hold complete validity in practical operational contexts, thereby compromising predictive fidelity. Wang et al. developed thermal circuit models to simulate temporal temperature variations in transformers [30], though their computational efficiency remains suboptimal. Rao et al. implemented Bayesian networks and association rules for oil temperature prediction, yet such rule-based systems prove inadequate in capturing intricate multivariate interactions when oil temperature becomes subject to complex multi-factor interdependencies, resulting in diminished predictive precision [31].
The progressive evolution of smart grid technologies has precipitated the systematic integration of machine learning methodologies into transformer oil temperature forecasting. Support vector machines (SVMs), initially conceived for classification problem-solving, have been strategically adapted for thermal prediction in power transformers given their superior capabilities in processing nonlinear, high-dimensional datasets [32]. Nevertheless, SVM architectures exhibit notable sensitivity to hyperparameter configurations, wherein suboptimal parameter selection may substantially compromise predictive accuracy, necessitating rigorous optimization protocols to achieve algorithmic convergence [33]. In response to this constraint, researchers have developed enhanced computational frameworks that synergize SVM with particle swarm optimization (PSO) algorithms, thereby achieving marked improvements in forecasting precision through systematic parameter calibration [34]. Contemporary analyses concurrently reveal that, while conventional forecasting techniques—including ARIMA and baseline SVM implementations—demonstrate proficiency in capturing linear thermal trends, they exhibit diminished efficacy when confronted with complex multivariable fluctuations arising from composite environmental variables and dynamic load variations [35,36].
Concurrently, neural network-based predictive methodologies have witnessed substantial proliferation across diverse application domains in recent years [37,38,39]. Temporal convolutional networks (TCNs) [40,41,42] and bidirectional gated recurrent units (BiGRUs) [43,44,45] have demonstrated their exceptional predictive performance in multidisciplinary contexts. As an architectural variant of convolutional neural networks (CNNs), TCNs exhibit distinct advantages compared to CNN-based approaches for transformer hotspot temperature prediction proposed by Dong et al. and Wang et al. [46,47]. Specifically, TCN architectures inherently capture extended temporal dependencies without constraints from fixed window sizes, thereby enhancing the training stability and computational efficiency while addressing the structural deficiencies inherent in recurrent neural network (RNN) frameworks. Building upon this foundation and informed by Zou’s seminal work on attention mechanisms [48], this study further augments the architecture’s feature extraction capacity through integrated self-attention layers. This modification enables the refined identification and processing of critical sequential features, thereby achieving superior performance in complex operational scenarios through adaptive prioritization of salient data patterns.
The resolution of hyperparameter optimization challenges within algorithmic architectures necessitates the meticulous selection of computational strategies. As systematically compared in Table 2, prevalent optimization algorithms—including PSO [49,50], Genetic Algorithms [51,52], Grey Wolf Optimizer [53,54], Sparrow Search Algorithm [55], and Whale Optimization Algorithm [56]—demonstrate commendable efficacy in specific operational contexts. However, these methodologies exhibit inherent limitations in training velocity and global exploration capacity. A critical deficiency manifests as their susceptibility to premature convergence to local optima, thereby generating suboptimal solutions that systematically degrade model precision across complex parameter landscapes.
To address the aforementioned methodological challenges, this study introduces the Beluga Whale Optimization (BWO) algorithm, a novel metaheuristic framework. Originally proposed by Zhong et al. [68], BWO is a biologically inspired algorithm that simulates beluga whale predation behavior using collective swarm intelligence and adaptive search strategies in high-dimensional solution spaces. Comparative analyses demonstrate that BWO outperforms conventional optimization methods in terms of global exploration capability, convergence speed, parameter simplicity, robustness, and computational efficiency. Zhong et al. [68] validated its effectiveness through empirical testing on 30 benchmark functions, employing a comprehensive framework of qualitative analysis, quantitative metrics, and scalability evaluation. Experimental results show that BWO offers statistically significant advantages in solving both unimodal and multimodal optimization problems. Furthermore, nonparametric Friedman ranking tests confirm BWO’s superior scalability compared to other metaheuristic algorithms. In practical applications, Wan et al. [69] applied BWO to hyperparameter optimization in offshore wind power forecasting models, demonstrating higher predictive accuracy and better generalization performance than established methods including GA, HBO (Heap-Based Optimization), and COA algorithms.
In this study, BWO was selected due to its competitive performance in solving complex nonlinear optimization problems and its balance between exploration and exploitation. Although many metaheuristic algorithms are available, BWO has demonstrated robustness and simplicity in implementation, which makes it a suitable candidate for the current application. According to the No Free Lunch Theorem [70] for optimization proposed by Wolpert and Macready, no single optimization algorithm is universally superior for all problem types. This implies that the effectiveness of an algorithm depends on the specific nature of the problem at hand. Therefore, the choice of BWO in this work is justified by its adaptability to the characteristics of the proposed model and its prior successful application in similar scenarios.
Despite progress in the field of transformer top-oil temperature prediction, several critical challenges remain unresolved. Traditional sensor-based monitoring methods, while capable of real-time temperature data acquisition, struggle to accurately predict future temperature changes and are susceptible to environmental interference under complex operating conditions, failing to meet the high-precision requirements for equipment condition forecasting in smart grids. Meanwhile, existing data-driven and machine learning approaches, although improving prediction accuracy to some extent, still face limitations such as insufficient model generalization and a tendency to fall into local optima when dealing with large-scale, high-dimensional, nonlinear time-series data. For example, SVMs are sensitive to hyperparameter configurations, PSO algorithms are prone to local optima in high-dimensional problems, and traditional RNNs and their variants suffer from gradient vanishing or explosion when processing long sequence data. Additionally, existing research lacks sufficient exploration in multi-source data fusion and model adaptability to different seasons and time granularities, making it difficult to comprehensively address the complex operational variations in practical applications.
To enhance the precision of transformer top-oil temperature prediction and address these research gaps, this study proposes a hybrid model integrating BWO with advanced neural network architectures, namely BWO-TCN-BiGRU-Attention. The model leverages BWO to optimize hyperparameters such as learning rate, number of BiGRU neurons, attention mechanism key values, and regularization parameters, effectively avoiding local optima through its robust global search capability. The architecture synergistically combines TCN for multi-scale temporal dependency extraction with BiGRU’s bidirectional state transition mechanism to enhance temporal pattern representation. The hierarchical attention mechanism facilitates dynamic feature weight allocation across time steps, amplifying contextual salience detection through learnable importance scoring. Empirical evaluations demonstrate significant improvements in prediction accuracy (23.7% reduction in MAE) and generalization capability (18.4% improvement in RMSE).
The primary contributions of this study are three-fold in terms of methodological innovation:
  • The global spatial and temporal features of the oil temperature sequence can be fully extracted by utilizing the serial structure of TCN and BiGRU. This approach allows for the effective capture of feature information at different scales and leads to a significant improvement in prediction accuracy.
  • The self-attention mechanism selects useful features for prediction and filters out unimportant information that may cause interference, thereby making multi-feature prediction more accurate.
  • The study conducted multiple combinations of data input experiments based on actual transformer data from China, covering different transformers, different time spans, and different time windows. These experiments collectively demonstrated that our architecture outperforms five mainstream models, namely ELM, PSO-SVR, Informer, CNN-BiLSTM-Attention, and CNN-GRU-Attention, and has strong robustness and generalization ability.
The remainder of this paper is organized as follows: Section 3 details the architecture and principles of the proposed BWO-TCN-BiGRU-Attention model; Section 4 describes the application scenario, dataset, and experimental setup through a specific case; Section 5 presents a comprehensive display of the experimental results, verifying the significant advantages of BWO in the optimization process and the effectiveness of the proposed method in predicting the top-oil temperature of transformers; finally, Section 6 summarizes the research of this paper and looks forward to future work.

3. Model Framework

This section outlines the architectural framework of the proposed BWO-TCN-BiGRU-Attention model, developed for top-oil temperature forecasting in power transformers. The model adopts a sequential structure that integrates multiple neural network components, each contributing distinct strengths to enhance predictive performance. Specifically, TCN captures localized short-term patterns, effectively modeling immediate temperature variations. To address TCN’s limitations in representing causal dependencies, BiGRU is employed to incorporate both past and future contextual information, thereby modeling long-term temporal relationships. The attention mechanism further refines temporal representations by dynamically weighting critical input features. Complementing these components, BWO is utilized to fine-tune the hyperparameters of the entire framework, ensuring optimal performance. This integrative approach adopts a serial structure and enables the multiscale analysis of oil temperature dynamics and yields statistically significant improvements in forecasting accuracy.
Section 3.1 systematically examines the structural configuration of the TCN and demonstrates its effectiveness in capturing localized transient patterns within sequential data. Building on this, Section 3.2 offers an in-depth analysis of BiGRU, focusing on its architectural capacity to model bidirectional temporal dependencies. Section 3.3 then outlines the operational principles of the attention mechanism, highlighting its discriminative function in hierarchical feature extraction. Following this, Section 3.4 delves into the BWO algorithm, detailing its mathematical formulation and iterative optimization process. Finally, Section 3.5 integrates the preceding components into a unified framework, presenting the complete architecture and execution flow of the BWO-TCN-BiGRU-Attention model while emphasizing the synergistic interactions among its constituent modules.

3.1. Temporal Convolutional Network

The TCN is built on three key architectural elements: causal convolution, dilated convolution, and residual connections [71]. Causal and dilated convolutions work together to capture multi-scale temporal patterns by expanding the receptive field exponentially. Residual connections help address vanishing and exploding gradients, a common issue in deep CNNs, thereby improving training stability [72].

3.1.1. Causal Convolution

The purpose of causal convolution is to ensure that, when calculating the output of the current time step, it only relies on the current and previous time steps and does not introduce information from future time steps. Suppose that the filter $F$ consists of $K$ weights $f_1, f_2, \ldots, f_K$ and the sequence $X$ consists of $T$ elements $x_1, x_2, \ldots, x_T$. For any time point $t$ in the sequence $X$, the causal convolution is given by
$$(F * X)(x_t) = \sum_{k=1}^{K} f_k \, x_{t-K+k}$$
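As a minimal, self-contained illustration of Equation (1) (our own sketch, not code from the paper), the following NumPy snippet applies a causal filter to a toy sequence; outputs at early time steps that would need samples before $x_1$ are handled with zero padding.

```python
import numpy as np

def causal_conv(x, f):
    """Causal convolution: each output depends only on the current and earlier samples."""
    K = len(f)
    x_padded = np.concatenate([np.zeros(K - 1), x])  # zero-pad the past
    return np.array([np.dot(f, x_padded[t:t + K]) for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # toy oil-temperature sequence
f = np.array([0.2, 0.3, 0.5])             # K = 3 filter weights
print(causal_conv(x, f))                  # no output uses future values of x
```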

3.1.2. Dilated Convolution

Dilated convolution serves as the core mechanism in TCN for expanding the receptive field. In conventional convolutional operations, the spatial extent of the receptive field remains constrained by both kernel size and network depth. By strategically interleaving dilation rates—defined as the interspersed intervals between kernel elements—dilated convolution achieves expansive temporal coverage, thereby exponentially amplifying the receptive field without incurring additional computational overhead. A comparative schematic illustration of standard CNN versus dilated convolution with a dilation rate of 2 is presented in Figure 2.
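A hedged sketch of the idea: the same causal filter with a dilation rate $d$ spaces its taps $d$ steps apart, so a single layer covers $(K-1)\cdot d + 1$ time steps with no extra parameters. The helper name and toy values below are illustrative only.

```python
import numpy as np

def dilated_causal_conv(x, f, dilation=1):
    """Causal convolution whose kernel taps are spaced `dilation` steps apart."""
    K = len(f)
    span = (K - 1) * dilation + 1              # receptive field of this single layer
    x_padded = np.concatenate([np.zeros(span - 1), x])
    return np.array([
        sum(f[k] * x_padded[t + k * dilation] for k in range(K))
        for t in range(len(x))
    ])

x = np.arange(1.0, 9.0)                        # 8 toy samples
f = np.array([0.5, 0.3, 0.2])
print(dilated_causal_conv(x, f, dilation=1))   # standard causal convolution
print(dilated_causal_conv(x, f, dilation=2))   # same kernel, doubled temporal coverage
```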

3.1.3. Residual Connections

Residual connectivity is an important mechanism used in TCNs to alleviate deep network training problems. It avoids the loss of information during transmission by passing the input directly to the subsequent layers. The structure of the residual block can be represented as
$$\mathrm{ResidualBlock}(x) = \mathrm{ReLU}\big(\mathrm{LayerNorm}(\mathrm{CausalConv}(x)) + x\big)$$
where $\mathrm{CausalConv}(x)$ denotes the causal convolution operation; $\mathrm{LayerNorm}$ is the layer normalization operation used to stabilize the training process; $\mathrm{ReLU}$ is the activation function used to introduce nonlinearities. The structure of the residual connection is shown in Figure 3.
As illustrated in Figure 3, the architectural composition of TCN comprises iteratively stacked residual modules. Each module intrinsically integrates four computational components: causal dilated convolution, layer normalization, ReLU activation functions, and dropout layers, with their structural interconnections explicitly delineated in Figure 4. This hierarchical stacking paradigm not only facilitates the construction of deep feature hierarchies but also intrinsically circumvents the vanishing/exploding gradient pathologies pervasive in deep neural architectures through residual skip-connection mechanisms.
The TCN architecture utilizes dilated convolutional sampling that strictly enforces causal constraints in temporal analysis by ensuring that current predictions are unaffected by future information. Unlike conventional convolutional methods, TCN achieves a large temporal receptive field through strategically dilated kernel strides while maintaining output dimensionality. This enables the effective capture of long-range dependencies across sequential data. As a result, the architecture significantly improves computational efficiency and enhances the model’s ability to represent long-term dependencies and generalize predictive performance.
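To make the composition in Equation (2) concrete, the following sketch stacks a causal convolution, layer normalization, ReLU, and the skip connection; it is a simplified stand-in for the full residual block of Figure 4, which additionally contains dropout, and the filter values are arbitrary.

```python
import numpy as np

def causal_conv(x, f):
    K = len(f)
    x_padded = np.concatenate([np.zeros(K - 1), x])
    return np.array([np.dot(f, x_padded[t:t + K]) for t in range(len(x))])

def layer_norm(h, eps=1e-5):
    return (h - h.mean()) / np.sqrt(h.var() + eps)

def relu(h):
    return np.maximum(h, 0.0)

def residual_block(x, f):
    """Equation (2): ReLU(LayerNorm(CausalConv(x)) + x)."""
    return relu(layer_norm(causal_conv(x, f)) + x)

x = np.array([20.0, 21.5, 23.0, 22.0, 24.5])   # toy oil-temperature window
f = np.array([0.1, 0.3, 0.6])
print(residual_block(x, f))
```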

3.2. GRU and BiGRU

RNNs are well suited for modeling sequential data. However, they often face training difficulties due to vanishing or exploding gradients. To address these issues, the GRU was proposed. GRU retains the recursive structure of RNNs but adds gates to improve gradient stability and capture long-term dependencies [73]. Compared with LSTM networks, GRU offers similar accuracy with a simpler structure. It requires less computation and trains faster, although both models share similar principles [74,75]. GRU has two main gates: the reset gate and the update gate. The reset gate controls how much past information is ignored. The update gate balances old and new information, helping the model retain important context over time. Figure 5 shows the structure and computation process of the GRU. The detailed algorithm is given below [76]:
First, the reset gate $r_t$ and update gate $z_t$ are computed to determine the extent to which historical state information is forgotten and the proportion of prior state information retained, respectively:
$$r_t = \sigma\big(W_r \cdot [h_{t-1}, x_t] + b_r\big)$$
$$z_t = \sigma\big(W_z \cdot [h_{t-1}, x_t] + b_z\big)$$
where $\sigma$ is the activation function, which normalizes the input to the range (0, 1).
Next, the candidate output state $\tilde{h}_t$ is computed, combining the current input with the historical state information adjusted by the reset gate:
$$\tilde{h}_t = \tanh\big(W_{\tilde{h}} \cdot [r_t * h_{t-1}, x_t] + b_{\tilde{h}}\big)$$
where $\tanh$ is the activation function, which normalizes the input to the range (−1, 1).
Finally, the state $h_t$ of the current time step is obtained by fusing the previous state and the candidate state according to the weights of the update gate:
$$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$$
where $x_t$ is the input sequence value at the current time step; $h_{t-1}$ is the output state at the previous moment; $W_r$, $W_z$, $W_{\tilde{h}}$, $b_r$, $b_z$, and $b_{\tilde{h}}$ are the corresponding weight matrices and bias terms of each part; $\sigma$ is the sigmoid activation function used to normalize the gated signals to the (0, 1) interval; $\tanh$ is the hyperbolic tangent function used to normalize the state values to the (−1, 1) interval; and $*$ denotes the element-wise Hadamard product.
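The gate computations in Equations (3)–(6) can be traced with the following self-contained NumPy sketch of a single GRU step; the weight shapes and toy inputs are illustrative assumptions rather than the parameters used in the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, params):
    """One GRU update following Equations (3)-(6)."""
    W_r, b_r, W_z, b_z, W_h, b_h = params
    concat = np.concatenate([h_prev, x_t])
    r_t = sigmoid(W_r @ concat + b_r)                                   # reset gate, Eq. (3)
    z_t = sigmoid(W_z @ concat + b_z)                                   # update gate, Eq. (4)
    h_cand = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)   # candidate state, Eq. (5)
    return (1.0 - z_t) * h_prev + z_t * h_cand                          # new state, Eq. (6)

hidden, inputs = 4, 3
rng = np.random.default_rng(0)
params = (rng.normal(size=(hidden, hidden + inputs)), np.zeros(hidden),
          rng.normal(size=(hidden, hidden + inputs)), np.zeros(hidden),
          rng.normal(size=(hidden, hidden + inputs)), np.zeros(hidden))
h = np.zeros(hidden)
for x_t in rng.normal(size=(5, inputs)):      # run five toy time steps
    h = gru_step(x_t, h, params)
print(h)
```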
The conventional GRU architecture, confined exclusively to assimilating historical information preceding the current timestep, exhibits inherent limitations in incorporating future contextual signals. To address the temporal directional constraint, the BiGRU framework is adopted. This enhanced architecture synergistically integrates forward-propagating and backward-propagating GRU layers, enabling the concurrent extraction of both antecedent and subsequent temporal patterns. The schematic representation of this architectural configuration is illustrated in Figure 6.
As can be seen from Figure 6, the hidden layer state $h_t$ of BiGRU at time step $t$ consists of two parts: the forward hidden state $\overrightarrow{h}_t$ and the backward hidden state $\overleftarrow{h}_t$. The forward hidden state $\overrightarrow{h}_t$ is determined by the current input $x_t$ and the forward hidden state $\overrightarrow{h}_{t-1}$ of the previous moment; the backward hidden state $\overleftarrow{h}_t$ is determined by the current input $x_t$ and the backward hidden state $\overleftarrow{h}_{t+1}$ of the next moment. The formulas of BiGRU are as follows:
$$\overrightarrow{h}_t = \mathrm{GRU}_{\mathrm{forward}}\big(\overrightarrow{h}_{t-1}, x_t\big)$$
$$\overleftarrow{h}_t = \mathrm{GRU}_{\mathrm{backward}}\big(\overleftarrow{h}_{t+1}, x_t\big)$$
$$h_t = \big[\overrightarrow{h}_t, \overleftarrow{h}_t\big]$$
where $w_i\ (i = 1, 2, \ldots, 6)$ denotes the weight from one cell layer to another.
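A short sketch of the bidirectional wrapper in Equations (7)–(9): the sequence is scanned once forward and once in reverse with separate parameter sets, and the two hidden states at each step are concatenated. The compact `gru_step` below mirrors the cell sketched above (biases omitted for brevity); shapes and values are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    # Compact GRU cell (Equations (3)-(6)); p = (W_r, W_z, W_h), biases omitted for brevity.
    W_r, W_z, W_h = p
    c = np.concatenate([h_prev, x_t])
    r = sigmoid(W_r @ c)
    z = sigmoid(W_z @ c)
    h_cand = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))
    return (1.0 - z) * h_prev + z * h_cand

def bigru(X, p_fwd, p_bwd, hidden):
    """Equations (7)-(9): forward scan, backward scan, then per-step concatenation."""
    T = len(X)
    h_f, h_b = np.zeros(hidden), np.zeros(hidden)
    fwd, bwd = [], [None] * T
    for t in range(T):                        # Eq. (7): left-to-right pass
        h_f = gru_step(X[t], h_f, p_fwd)
        fwd.append(h_f)
    for t in reversed(range(T)):              # Eq. (8): right-to-left pass
        h_b = gru_step(X[t], h_b, p_bwd)
        bwd[t] = h_b
    return np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])  # Eq. (9)

rng = np.random.default_rng(1)
hidden, inputs, T = 4, 3, 6
make_params = lambda: tuple(rng.normal(size=(hidden, hidden + inputs)) for _ in range(3))
X = rng.normal(size=(T, inputs))
print(bigru(X, make_params(), make_params(), hidden).shape)   # (6, 8): 2 * hidden per step
```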

3.3. Attention Mechanism

Within temporal sequence processing frameworks, attention mechanisms have been strategically incorporated to optimize feature saliency through selective information prioritization. This computational paradigm operates via adaptive inter-state affinity quantification, employing a context-sensitive weight allocation scheme that dynamically enhances anomaly discernment capacity while ensuring system robustness [77].
First, the correlation between the hidden layer states $h_i$ and $h_j$ is calculated as shown in Equation (10):
$$g_{ij} = \tanh\big(W_1 h_i + W_2 h_j + b\big)$$
where $g_{ij}$ denotes the correlation between $h_i$ and $h_j$, obtained through a nonlinear transformation with the weight matrices $W_1$ and $W_2$ and the bias $b$.
Next, the correlation $g_{ij}$ is converted to an attention weight $a_{ij}$ using the softmax function:
$$a_{ij} = \frac{\exp(g_{ij})}{\sum_{j} \exp(g_{ij})}$$
In Equation (11), the attention weight $a_{ij}$ represents the importance of $h_j$ to $h_i$, reflecting the degree of contribution of different points in time to the current prediction.
Finally, the weighted hidden state $H_i$ is obtained by weighted summation:
$$H_i = \sum_{j} a_{ij} h_j$$
Equation (12) combines the contributions of all hidden states $h_j$ to $h_i$, where the contribution of each $h_j$ is determined by its corresponding attention weight $a_{ij}$.
Through the above process, the attention mechanism enables the model to adaptively focus on the most critical time points for the task at hand, leading to more accurate predictions and greater robustness in time series analysis [78].
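A minimal sketch of Equations (10)–(12): pairwise scores, a softmax over them, and the weighted sum of hidden states. The reduction of the vector score $g_{ij}$ to a scalar (here a plain sum) is an implementation assumption, since the paper leaves that projection implicit, and $W_1$, $W_2$, $b$ are random toy matrices.

```python
import numpy as np

def additive_attention(H, W1, W2, b):
    """Eqs. (10)-(12): pairwise scores, softmax weights, weighted hidden states."""
    T = H.shape[0]
    # Eq. (10): g_ij = tanh(W1 h_i + W2 h_j + b), reduced here to a scalar score per pair
    g = np.array([[np.sum(np.tanh(W1 @ H[i] + W2 @ H[j] + b)) for j in range(T)]
                  for i in range(T)])
    a = np.exp(g) / np.exp(g).sum(axis=1, keepdims=True)   # Eq. (11): softmax over j
    return a @ H                                           # Eq. (12): H_i = sum_j a_ij h_j

rng = np.random.default_rng(2)
T, d = 6, 8                       # 6 time steps of BiGRU outputs, dimension 8
H = rng.normal(size=(T, d))
W1, W2, b = rng.normal(size=(d, d)), rng.normal(size=(d, d)), np.zeros(d)
print(additive_attention(H, W1, W2, b).shape)   # (6, 8) re-weighted hidden states
```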

3.4. Beluga Whale Optimization

Regarding the overall structure of the GRU network, several hyperparameters need to be determined, and the BWO optimization algorithm is used to help achieve this. Its metaheuristic framework operationalizes three biomimetic behavioral phases: exploration (emulating paired swimming dynamics), exploitation (simulating prey capture strategies), and whale-fall mechanisms. Central to its efficacy are self-adaptive balance factors and dynamic whale-fall probability parameters that govern phase transition efficiency between exploratory and exploitative search modalities. Furthermore, the algorithm integrates Lévy flight operators to augment global convergence characteristics during intensive exploitation phases [68].
The BWO metaheuristic framework employs population-based stochastic computation, where each beluga agent is conceptualized as a candidate solution vector within the search space. These solution vectors undergo iterative refinement through successive optimization cycles. Consequently, the initialization protocol for these computational entities is governed by mathematical specifications detailed in Equations (13) and (14).
$$X = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,D} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,D} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N,1} & x_{N,2} & \cdots & x_{N,D} \end{bmatrix}$$
$$F_X = \begin{bmatrix} f(x_{1,1}, x_{1,2}, \ldots, x_{1,D}) \\ f(x_{2,1}, x_{2,2}, \ldots, x_{2,D}) \\ \vdots \\ f(x_{N,1}, x_{N,2}, \ldots, x_{N,D}) \end{bmatrix}$$
where $N$ is the population size, $D$ is the problem dimension, and $X$ and $F_X$ represent the positions of the individual beluga whales and the corresponding fitness values, respectively.
The BWO algorithm is gradually shifted from exploration to exploitation through the balance factor $B_f$: the exploration phase occurs when $B_f > 0.5$, while the exploitation phase occurs when $B_f \le 0.5$. The specific mathematical model is shown in Equation (15):
$$B_f = B_0 \left(1 - \frac{T}{2 T_{\max}}\right)$$
where $B_0$ varies randomly in the range $(0, 1)$ in each iteration, and $T$ and $T_{\max}$ are the current and maximum number of iterations, respectively. As the number of iterations $T$ increases, the fluctuation range of $B_f$ shrinks from $(0, 1)$ to $(0, 0.5)$, so the probability of entering the exploitation phase grows as the iterations proceed while the probability of exploration correspondingly declines.

3.4.1. Exploration Phase

The exploration phase of BWO is modeled on the swimming behavior of beluga whales, which typically swim closely together in a synchronized or mirrored manner, so their positions are treated in pairwise form. The position of each search agent is determined by this paired swimming and is updated as shown in Equation (16):
$$X_{i,j}^{T+1} = \begin{cases} X_{i,P_j}^{T} + \big(X_{r,P_1}^{T} - X_{i,P_j}^{T}\big)(1 + r_1)\sin(2\pi r_2), & j\ \text{even} \\ X_{i,P_j}^{T} + \big(X_{r,P_1}^{T} - X_{i,P_j}^{T}\big)(1 + r_1)\cos(2\pi r_2), & j\ \text{odd} \end{cases}$$
where $P_j$ is an integer randomly selected from the $D$ dimensions; $r_1$ and $r_2$ are random numbers in the $(0, 1)$ interval, used to increase the randomness of exploration; and $\sin(2\pi r_2)$ and $\cos(2\pi r_2)$ simulate the orientation of the beluga whales when updating their positions, realizing their synchronous or mirrored behavior when swimming or diving. Even and odd refer to the parity of the selected dimension index $j$, which determines which of the two update forms is applied.

3.4.2. Exploitation Phase

During the exploitation phase, the BWO framework emulates belugas’ coordinated foraging by sharing spatial information to locate prey collectively. The position update protocol integrates positional relativity between elite and suboptimal solutions and strategically uses Lévy flight operators to improve convergence. This multi-objective optimization process is formalized in Equation (17).
$$X_i^{T+1} = r_3 X_{\mathrm{Best}}^{T} - r_4 X_i^{T} + C_1 \cdot L_F \cdot \big(X_r^{T} - X_i^{T}\big)$$
where $T$ is the current iteration number; $X_i^{T}$ and $X_r^{T}$ are the current positions of the $i$th beluga and a randomly chosen beluga, respectively; $X_{\mathrm{Best}}^{T}$ is the best position in the beluga population; $r_3$ and $r_4$ are random numbers in the interval $(0, 1)$; and $C_1$ is the randomized jump strength that measures the intensity of the Lévy flight, computed as shown in Equation (18):
$$C_1 = 2 r_4 \left(1 - \frac{T}{T_{\max}}\right)$$
The Lévy flight function $L_F$ is used to simulate the capture of prey by beluga whales under this strategy, and its mathematical model is shown in Equation (19):
$$L_F = 0.05 \times \frac{u \times \sigma}{|v|^{1/\beta}}$$
$$\sigma = \left(\frac{\Gamma(1+\beta) \times \sin(\pi\beta/2)}{\Gamma\big((1+\beta)/2\big) \times \beta \times 2^{(\beta-1)/2}}\right)^{1/\beta}$$
where $\Gamma$ refers to the gamma function, the extension of the factorial function to the real and complex domains; $u$ and $v$ are normally distributed random numbers; and $\beta$ is a constant set to 1.5 by default.
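Equations (19) and (20) translate directly into code; a small sketch with β = 1.5 and u, v drawn from a standard normal distribution:

```python
import numpy as np
from math import gamma, sin, pi

def levy_flight(beta, rng):
    """Lévy flight step per Equations (19) and (20)."""
    sigma = ((gamma(1 + beta) * sin(pi * beta / 2)) /
             (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u, v = rng.normal(), rng.normal()          # u, v ~ N(0, 1)
    return 0.05 * (u * sigma) / abs(v) ** (1 / beta)

rng = np.random.default_rng(3)
print([round(levy_flight(1.5, rng), 4) for _ in range(3)])   # three sample step lengths
```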

3.4.3. Whale Fall Phase

Within the BWO framework, the whale-fall phase employs a probabilistic elimination mechanism that mimics stochastic population dynamics via controlled attrition. This biomimetic process is analogous to ecological dispersion patterns, where belugas migrate or descend into deep zones. By adaptively recalibrating positions based on spatial coordinates and displacement parameters, the algorithm maintains population equilibrium. The governing equations for this phase are formalized in Equation (21).
$$X_i^{T+1} = r_5 X_i^{T} - r_6 X_r^{T} + r_7 X_{\mathrm{step}}$$
where $r_5$, $r_6$, and $r_7$ are random numbers in the interval $(0, 1)$, and $X_{\mathrm{step}}$ is the step size of the whale fall, defined as follows:
$$X_{\mathrm{step}} = (u_b - l_b) \cdot e^{-C_2 T / T_{\max}}$$
The step size is affected by the boundaries of the problem variables, the current iteration number, and the maximum number of iterations. Here, $u_b$ and $l_b$ denote the upper and lower bounds of the variables, respectively, whilst $C_2$ is a step factor related to the whale fall probability and the population size, which is defined as shown in Equation (23):
$$C_2 = 2 W_f \times n$$
where the whale fall probability $W_f$ is calculated as a linear function, as shown in Equation (24):
$$W_f = 0.1 - 0.05\,\frac{T}{T_{\max}}$$
The whale-fall probability exhibits a progressive diminution from 0.1 during initial iterations to 0.05 in terminal optimization phases, signifying attenuated risk exposure as the search agents approach proximity to optimal solutions. This probabilistic adaptation mechanism parallels the thermodynamic convergence behavior in transformer oil temperature forecasting models, where parametric convergence toward global optima manifests as enhanced predictive fidelity with concomitant reduction in stochastic uncertainty.
Figure 7 shows the workflow of the Beluga Whale Optimization algorithm for transformer oil temperature prediction, demonstrating in detail how the BWO algorithm solves the optimization problem by simulating the exploration, exploitation, and whale fall behaviors of beluga whales.
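Putting Sections 3.4.1–3.4.3 together, the following compact sketch shows one way the BWO loop can be organized. It is simplified in several places relative to the full formulation (single-dimension exploration update, greedy replacement, scalar bounds, and the way whale fall is triggered) and uses a toy sphere function; it is not the authors' implementation.

```python
import numpy as np
from math import gamma, sin, pi

def levy_flight(beta, rng):
    sigma = ((gamma(1 + beta) * sin(pi * beta / 2)) /
             (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    return 0.05 * rng.normal() * sigma / abs(rng.normal()) ** (1 / beta)

def bwo(fitness, lb, ub, n_pop=30, n_dim=4, t_max=50, seed=0):
    """Compact Beluga Whale Optimization loop: exploration, exploitation, whale fall."""
    rng = np.random.default_rng(seed)
    X = lb + rng.random((n_pop, n_dim)) * (ub - lb)           # Eq. (13): initial positions
    fit = np.array([fitness(x) for x in X])                   # Eq. (14): fitness values
    for T in range(1, t_max + 1):
        best = X[fit.argmin()]
        Wf = 0.1 - 0.05 * T / t_max                           # Eq. (24): whale-fall probability
        for i in range(n_pop):
            Bf = rng.random() * (1 - T / (2 * t_max))         # Eq. (15): balance factor
            r_idx = rng.integers(n_pop)                       # a randomly chosen beluga
            if Bf > 0.5:                                      # exploration, Eq. (16), one dimension
                r1, r2 = rng.random(2)
                j = rng.integers(n_dim)
                trig = np.sin(2 * pi * r2) if j % 2 == 0 else np.cos(2 * pi * r2)
                new = X[i].copy()
                new[j] = X[i, j] + (X[r_idx, 0] - X[i, j]) * (1 + r1) * trig
            else:                                             # exploitation, Eq. (17)
                r3, r4 = rng.random(2)
                C1 = 2 * r4 * (1 - T / t_max)                 # Eq. (18)
                new = (r3 * best - r4 * X[i]
                       + C1 * levy_flight(1.5, rng) * (X[r_idx] - X[i]))
            if rng.random() <= Wf:                            # whale fall, Eqs. (21)-(23)
                r5, r6, r7 = rng.random(3)
                C2 = 2 * Wf * n_pop
                x_step = (ub - lb) * np.exp(-C2 * T / t_max)  # Eq. (22)
                new = r5 * X[i] - r6 * X[r_idx] + r7 * x_step
            new = np.clip(new, lb, ub)
            f_new = fitness(new)
            if f_new < fit[i]:                                # keep the better candidate
                X[i], fit[i] = new, f_new
    return X[fit.argmin()], fit.min()

# Toy usage: minimize a shifted sphere function over [-5, 5]^4.
best_x, best_f = bwo(lambda x: float(np.sum((x - 1.0) ** 2)), lb=-5.0, ub=5.0)
print(best_x.round(3), best_f)
```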

3.5. Framework of the Proposed Method

The proposed architecture synergistically integrates BWO, TCN, BiGRU, and attention mechanisms into a unified temporal forecasting framework. While many deep learning architectures have shown success in time series forecasting, the selection of TCN and BiGRU in this study is based on their complementary strengths. TCN is particularly effective in capturing long-range dependencies using dilated causal convolutions while maintaining training stability and low computational cost. BiGRU, on the other hand, can model bidirectional dependencies in sequences, thus improving contextual understanding. Compared with alternative models such as LSTM, Transformer, or Informer, TCN offers faster convergence and simpler structure, and BiGRU provides a more lightweight alternative to LSTM with comparable accuracy.
Specifically, the model initiates with BWO-driven hyperparameter optimization, leveraging its superior global search capability to derive optimal initial parameter configurations for subsequent network training. Subsequently, the TCN module employs causal convolutional layers and dilated convolution operations to hierarchically extract both short-term fluctuations and long-range dependencies within sequential data. Complementing this, the BiGRU component systematically captures bidirectional temporal dependencies through dual-directional state propagation, thereby mitigating the inherent limitations of causal convolution in modeling complex temporal interactions. Conclusively, an adaptive feature recalibration mechanism dynamically weights the BiGRU output states through context-sensitive attention allocation, emphasizing salient temporal patterns while suppressing noise interference. This multimodal integration facilitates enhanced modeling capacity for intricate temporal structures, with the comprehensive computational workflow formally illustrated in Figure 8.

4. Experimental Case Analysis

4.1. Data Source and Processing

4.1.1. Introduction to the Dataset

The dataset employed in this study comprises the temperature data of the power transformers from two distinct counties in China, spanning from July 2016 to June 2018 [79]. Each data point consists of the target value "OT" (oil temperature), a timestamp, and six power load features, namely "HUFL", "HULL", "MUFL", "MULL", "LUFL", and "LULL", the definitions of which are presented in Table 3. To evaluate the predictive performance of the different models across various real-world scenarios, our comparative experiments were conducted across different seasons and at two temporal granularities: hourly and 15 min intervals. Unless otherwise specified, the data for spring are presented at an hourly granularity. Within each season, the ratio of the training set to the testing set is 5:1, and a 10-fold cross-validation method is employed. The original transformer temperature data are shown in Figure 9.
Table 4 presents the statistical characteristics of temperature data for two power transformers at both hourly and 15 min temporal granularities, revealing differences in their oil temperature features. The average oil temperature (26.61 °C) and standard deviation of power Transformer 2 are both higher than those of Transformer 1 (11.89 °C), indicating that its oil temperature is not only higher overall but also more volatile. This necessitates that the predictive model possesses a stronger capacity for dynamic capture to handle drastic changes and extreme conditions. In contrast, the oil temperature of Transformer 1 changes relatively smoothly. In this study, the temperature dataset was meticulously examined to detect and address outliers and missing values, thereby ensuring the reliability of the dataset. By conducting comparative experiments across different transformers, seasons, and temporal granularities, one can comprehensively evaluate the predictive performance of various models in real-world scenarios.
In this study, the correlation between OT and other power load features for Transformer 1 and Transformer 2 was analyzed, as shown in Figure 10. For both transformers, oil temperature maintains a strong correlation with HULL and MULL. The correlation between oil temperature and power load features is generally higher for Transformer 1 than for Transformer 2, indicating that power load features have a more pronounced impact on oil temperature in Transformer 1. In contrast, the correlation between oil temperature and the features is weaker for Transformer 2, especially with HUFL having a minimal impact on oil temperature. Moreover, LUFL and LULL exhibit a negative correlation with oil temperature. This indicates that the impact of power load features on oil temperature varies across different transformers, providing a reliable data foundation for subsequent modeling.

4.1.2. Data Preprocessing

To enhance the training performance of the neural network, the min-max normalization technique was employed to standardize the data of each sample, with the mathematical expression presented in Equation (25):
$$X^{*} = \frac{X_i - X_{\min}}{X_{\max} - X_{\min}}$$
Here, $X_i$ and $X^{*}$ represent the transformer temperature data before and after normalization, respectively, while $X_{\min}$ and $X_{\max}$ denote the minimum and maximum values in the training sample, respectively. After min-max normalization, the data are scaled to the interval [0, 1], which accelerates the neural network's convergence, enhances training efficiency, and improves the model's robustness to varying feature scales.
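Equation (25) in code, with the minimum and maximum fitted on the training split only and then applied to new samples (toy values for illustration):

```python
import numpy as np

def fit_minmax(train):
    """Compute per-feature minimum and maximum on the training split."""
    return train.min(axis=0), train.max(axis=0)

def apply_minmax(x, x_min, x_max):
    return (x - x_min) / (x_max - x_min)          # Eq. (25), scales to [0, 1]

train = np.array([[20.0, 1.2], [35.0, 2.4], [50.0, 0.8]])   # toy OT + one load feature
test = np.array([[42.0, 1.5]])
x_min, x_max = fit_minmax(train)
print(apply_minmax(test, x_min, x_max))
```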

4.1.3. Simulation Environment

All simulation experiments were conducted on a personal computer equipped with an AMD Ryzen 7 5800H processor (AMD, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 3060 GPU (NVIDIA, Santa Clara, CA, USA), running MATLAB 2024a.

4.2. Evaluation Metrics

To comprehensively evaluate and compare the prediction performance of different models, this paper selects the mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean square error (RMSE) as the primary evaluation metrics. Lower values of these metrics indicate smaller deviations between the predicted values and the actual values, and thus a higher prediction accuracy. Here, $y_1$ denotes the actual value and $y_2$ denotes the predicted value. The calculation formulas for these metrics are shown below:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_1 - y_2 \right|$$
$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n} \left| \frac{y_1 - y_2}{y_1} \right| \times 100\%$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( y_1 - y_2 \right)^2}$$
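For reference, the three metrics can be computed directly with NumPy; the arrays below are toy values, not results from the paper.

```python
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def mape(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

y_true = np.array([21.3, 22.1, 24.8, 23.5])   # toy actual oil temperatures
y_pred = np.array([21.0, 22.7, 24.1, 23.9])   # toy predictions
print(mae(y_true, y_pred), mape(y_true, y_pred), rmse(y_true, y_pred))
```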

5. Experimental Results

5.1. Hyperparameter Optimization

The optimal combination of hyperparameters was identified by employing the BWO algorithm to search within the predefined parameter ranges. Specifically, the learning rate was set within [0.001, 0.01], the number of BiGRU neurons was set within [10, 50], the number of attention mechanism keys was set within [2, 50], and the regularization parameter was set within [0.0001, 0.001]. Through iterative optimization by the BWO algorithm, the optimal hyperparameter combination was ultimately determined.
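Conceptually, the search amounts to minimizing a validation-error fitness over the bounded hyperparameter space listed above. The sketch below illustrates that wiring: `train_and_validate_mse` is a hypothetical stand-in for training the TCN-BiGRU-Attention network with a candidate configuration and returning its validation MSE, and the simple random search is a placeholder for the BWO loop of Section 3.4; only the bounds mirror the text.

```python
import numpy as np

# Search-space bounds from this subsection:
# [learning rate, BiGRU neurons, attention keys, regularization coefficient]
LOWER = np.array([0.001, 10, 2, 0.0001])
UPPER = np.array([0.01, 50, 50, 0.001])

def train_and_validate_mse(hp):
    """Hypothetical stand-in: train the network with candidate hyperparameters `hp`
    and return its validation MSE (the fitness that BWO minimizes)."""
    lr, n_neurons, n_keys, l2 = hp[0], int(round(hp[1])), int(round(hp[2])), hp[3]
    # Smooth synthetic surrogate so the sketch runs without a real training loop.
    return ((lr - 0.004) ** 2 * 1e4 + (n_neurons - 32) ** 2 * 1e-3
            + (n_keys - 16) ** 2 * 1e-3 + (l2 - 0.0005) ** 2 * 1e6)

def random_search(fitness, lb, ub, n_candidates=600, seed=0):
    """Placeholder optimizer; in the paper this role is played by the BWO loop."""
    rng = np.random.default_rng(seed)
    X = lb + rng.random((n_candidates, len(lb))) * (ub - lb)
    scores = np.array([fitness(x) for x in X])
    return X[scores.argmin()], scores.min()

best_hp, best_mse = random_search(train_and_validate_mse, LOWER, UPPER)
print("best [lr, neurons, keys, l2]:", best_hp.round(4), "val MSE:", round(best_mse, 4))
```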
This study compared the iterative curves of Transformer 1 for the BWO optimization algorithm and the commonly used DBO [80], WOA [60], HHO [61], NGO [58], PSO [81], GA [82], and FA (Firefly Algorithm) [83]. The specific parameters of the metaheuristic algorithms are shown in Table 5. For all algorithms, the population size was set to 30, the maximum number of iterations was set to 20, and the number of replications was set to 30. As shown in Figure 11, BWO shows a rapid decline and stabilizes at a lower value. In contrast, HHO exhibits a slower convergence rate. Algorithms such as DBO, GA, PSO, and WOA also converge but are outperformed by BWO. NGO maintains a steady performance but does not reach lower values. In the optimization process, the fitness function used to evaluate the model and guide the optimization algorithm is the mean squared error (MSE). This metric is chosen for its ability to penalize larger errors, thereby driving the optimization towards minimizing the overall prediction error. As shown in Table 6, this method identifies the optimal hyperparameters for the TCN-BiGRU-Attention model.
To ensure a fair comparison among metaheuristic algorithms, each algorithm was independently executed 30 times. The results are summarized in Table 7, including the best, worst, average, and standard deviation (Std) of the fitness values. Along with Figure 12, it shows that BWO achieved the lowest average fitness and the smallest standard deviation, with both high accuracy and robustness across repeated runs.

5.2. Model Comparison

To demonstrate the superiority of the TCN-BiGRU-Attention model optimized by BWO, the study compared its performance on the test set with that of mainstream transformer oil temperature prediction methods such as ELM, PSO-SVR [84], Informer [79], CNN-BiLSTM-Attention, and CNN-GRU-Attention. To ensure a fair and credible performance comparison, all baseline models were carefully tuned under equivalent experimental settings. The same training and testing data splits, normalization techniques, and early stopping criteria were applied across all models. For Informer, CNN-BiLSTM-Attention, and CNN-GRU-Attention, empirical parameter tuning was performed to achieve optimal results. Additionally, PSO-SVR was optimized using particle swarm optimization with the same population size and number of iterations as BWO. Detailed hyperparameter configurations for each baseline model are listed in Table 8.
Table 9 presents the MAE, MAPE, and RMSE of different models, while Figure 12 visually illustrates the differences between the predicted and actual values of the first 230 samples for these models. Our proposed method outperforms all others on both the Transformer 1 and Transformer 2 datasets. Compared with the second-best models, CNN-BiLSTM-Attention and Informer, the reductions in MAE, MAPE, and RMSE are 26.2%, 26.3%, 22.2% and 39.5%, 41.3%, 41.2%, respectively. Notably, as shown in the enlarged detail in Figure 13, the predictions of the BWO-optimized TCN-BiGRU-Attention model maintain a small gap with the actual values even at temperature peaks and troughs. Moreover, in the Transformer 2 dataset with larger oil temperature fluctuations, although the model’s error margin has slightly increased, its superiority over other models has been further enhanced. This indicates that it has stronger robustness and stability in handling complex operating conditions and extreme situations, providing a more reliable safeguard for the safe and stable operation of the power system.
Figure 14 displays the relative error time-series curves for the two datasets across all samples, including both the training and prediction sets. As shown in panel (a), although the oil temperature sequence of Transformer 1 is relatively stable, significant relative errors still occur near the 300th, 500th, and 1000th samples, likely due to the small actual values and large predicted values. The magnified relative error plot indicates that the relative error of our model is generally lower than that of other models. As shown in panel (b), although Transformer 2 has smaller extreme values of relative error overall compared to Transformer 1, it still exhibits significant volatility and instability. The maximum RE (relative error) for ELM is approximately 27%, for PSO-SVR, it is about 77%, for Informer, it is less than 50%, for CNN-BiLSTM-Attention, it is 40%, for CNN-GRU-Attention, it is 54%, and for our model, it is 26%. This demonstrates the robustness and accuracy of our model in handling extreme and unstable oil temperature sequences.

5.3. Uncertainty Analysis

To further validate the stability and significance of the model performance, the prediction results of all models were tested on two cases with 10 independent runs conducted for each case. The MAE, RMSE, and MAPE metrics were calculated for each run. Subsequently, independent sample t-tests were conducted on these metric values to assess the statistical significance of differences between the proposed model and other models. The calculated t-values are shown in Table 10.
Based on the calculated t-values, the corresponding p-values were determined, and a 95% confidence interval was established. The uncertainty analysis results are presented in Table 11 and Table 12.
Based on the statistical analysis results presented in Table 11 and Table 12, it is evident that the p-values for all models are significantly lower than the 0.05 significance level. This indicates that there are statistically significant differences between the proposed model and the other models in terms of the three evaluation metrics: MAE, RMSE, and MAPE. Furthermore, the confidence intervals do not include zero, which further corroborates the significance of these differences. Taken together with the lower error values reported above, these findings indicate that the proposed model holds a distinct and stable performance advantage over the other models, and that this advantage is highly credible from a statistical standpoint.
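The testing procedure described above can be reproduced with an independent two-sample t-test on the per-run metric values; a sketch using scipy.stats is shown below, where the two arrays of ten MAE values are placeholders for illustration rather than the paper's measurements.

```python
import numpy as np
from scipy import stats

# Per-run MAE values from 10 independent runs (toy numbers for illustration only).
mae_proposed = np.array([0.52, 0.53, 0.51, 0.54, 0.52, 0.53, 0.52, 0.51, 0.54, 0.53])
mae_baseline = np.array([0.71, 0.69, 0.73, 0.70, 0.72, 0.74, 0.70, 0.71, 0.69, 0.72])

# Independent two-sample t-test on the metric values of the two models.
t_stat, p_value = stats.ttest_ind(mae_proposed, mae_baseline)

# 95% confidence interval for the difference in means (pooled variance, as in the t-test).
n1, n2 = len(mae_proposed), len(mae_baseline)
dof = n1 + n2 - 2
sp2 = ((n1 - 1) * mae_proposed.var(ddof=1) + (n2 - 1) * mae_baseline.var(ddof=1)) / dof
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
diff = mae_proposed.mean() - mae_baseline.mean()
ci = diff + np.array([-1.0, 1.0]) * stats.t.ppf(0.975, dof) * se

print(f"t = {t_stat:.3f}, p = {p_value:.2e}, 95% CI for mean difference = {ci.round(4)}")
```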

5.4. Ablation Studies

To further validate the contributions of each component in our proposed BWO-TCN-BiGRU-Attention model and demonstrate the necessity of integrating them into a unified framework, we conducted ablation studies. These studies isolate the performance of individual components and their combinations to highlight their respective contributions to the overall predictive model. The ablation experiments were conducted on both Transformer 1 and Transformer 2 datasets, and the performance metrics (MAE, MAPE, and RMSE) were recorded for each configuration.
The results summarized in Table 13 indicate that each component of our hybrid model plays a crucial role in enhancing the predictive performance. For instance, removing the attention mechanism leads to a significant increase in MAE, MAPE, and RMSE for both datasets. Specifically, the MAE increases from 0.5258 to 0.6134 for Transformer 1 and from 0.9995 to 1.3075 for Transformer 2. Similarly, the absence of BWO also results in higher error metrics. For Transformer 1, the MAE increases from 0.5258 to 0.5966, and for Transformer 2, it increases from 0.9995 to 1.2687. This underscores the importance of BWO in optimizing hyperparameters and avoiding local optima, thereby enhancing the overall performance.
The standalone performance of TCN and BiGRU further demonstrates their individual capabilities. TCN alone achieves an MAE of 1.0791 and a MAPE of 6.27% for Transformer 1, while BiGRU alone achieves an MAE of 1.3708 and a MAPE of 7.55%. However, their combined performance, especially when augmented with the attention mechanism and BWO, significantly outperforms any individual component or simpler architecture. For example, the full model achieves an MAE of 0.5258 and a MAPE of 2.75% for Transformer 1, which is substantially lower than the standalone TCN or BiGRU.
The synergy between TCN, BiGRU, the attention mechanism, and BWO enhances the ability to capture both short-term and long-term dependencies while prioritizing relevant features and optimizing hyperparameters. This comprehensive integration leads to a better performance in predicting transformer oil temperature compared to simpler architectures.

5.5. Evaluation of Parameters and Time Window Width

In the task of transformer oil temperature prediction, optimizing model training parameters and reasonably setting the time window width are key factors in improving prediction accuracy. To further validate the stability and generalization capability of the model, comparative experiments were conducted on Transformer 1 and Transformer 2 using different optimizers and time window widths. Figure 15a–f show that Adam outperformed the other optimizers on both transformers. In Transformer 1, the model with the Adam optimizer achieved an MAE of 0.5258, a MAPE of 2.75%, and an RMSE of 0.6353, significantly outperforming other optimizers such as RMSprop, Adadelta, and SGD. Similarly, it maintained the lead in Transformer 2 with an MAE of 0.9995, a MAPE of 2.73%, and an RMSE of 1.2158. This indicates that the Adam optimizer has stronger convergence ability and robustness in dealing with multi-scale, nonlinear time series prediction problems such as transformer oil temperature.
Furthermore, to evaluate the impact of time window width on model performance, the study conducted experiments with four different time window lengths (3, 6, 12, and 24). The results shown in Figure 15g–l indicate that a time window width of 6 achieved the best performance on both transformers. In Transformer 1, the MAPE was the lowest at 2.75% with a window size of 6, while in Transformer 2, the MAPE was 2.73% under the same setting, significantly outperforming the other window lengths. This suggests that a medium-length time window can more effectively capture the short-term dynamic changes in oil temperature, avoiding both the insufficient information of a window that is too short and the overfitting caused by an excessively long window. The comprehensive results indicate that the combination of the Adam optimizer and a time window length of 6 is most suitable for the BWO-optimized TCN-BiGRU-Attention model proposed in this study.
To assess the ability of each model to capture long-range temporal dependencies, we conducted a time window sensitivity experiment by varying the input sequence length (3, 6, 12, 24). As shown in Table 14, the proposed BWO-TCN-BiGRU-Attention model achieved the best performance at a window size of 6, with minimal performance degradation at longer windows. In contrast, the baseline models, including CNN-BiLSTM-Attention, CNN-GRU-Attention, and Informer, exhibited significantly higher error rates as the time window increased. For example, when the window size expanded from 6 to 24 on Transformer 1, the RMSE of CNN-GRU-Attention rose sharply from 1.5689 to 2.0367, while our model's RMSE only slightly increased from 0.6353 to 0.7905. This confirms that the temporal convolutional network with dilated convolutions allows the model to capture long-range dependencies without the usual loss of prediction accuracy. Furthermore, the bidirectional GRU component enhances contextual awareness, and the attention mechanism selectively emphasizes relevant temporal features. Thus, the proposed method can effectively capture long-range temporal dependencies and remains resilient to performance drops as the input length increases.

5.6. Effects of Different Seasons and Temporal Granularities

This study evaluated the MAPE performance of various prediction models for Transformer 1 and Transformer 2 across four seasons—spring, summer, autumn, and winter (see Table 15). Seasonal temperatures significantly affect the prediction accuracy of transformer oil temperature, particularly during the extreme heat of summer and the cold of winter. High temperatures lead to large oil temperature fluctuations, increasing modeling complexity and causing the MAPE of traditional models to rise in summer. In winter, the low base value of oil temperature means that even minor prediction deviations can result in large relative errors. However, the overall error values are relatively low due to the stable sequence. The results indicate that the proposed improved TCN-BiGRU-Attention model achieved the lowest MAPE for both transformers across all seasons. For Transformer 1, the MAPE values were 2.75% in spring, 3.44% in summer, 3.93% in autumn, and 2.46% in winter. For Transformer 2, the MAPE values were 2.73% in spring, 2.78% in summer, 3.07% in autumn, and 2.05% in winter. In contrast, traditional models such as ELM and PSO-SVR exhibited higher MAPE values, for example, 10.09% and 7.73% for Transformer 1 in spring, and 6.76% and 8.08% for Transformer 2 in summer. Even recently, well-performing models like Informer and CNN-BiLSTM-Attention had higher MAPE values than the proposed model in most seasons for Transformer 1, such as 8.17% for Informer in summer and 3.86% for CNN-BiLSTM-Attention in autumn.
In this study, ten experiments were conducted for each model during the fixed season of spring, recording their MAPE and variance, with the results shown in Figure 16. The experimental results indicate that the adjustment of temporal granularity significantly affects both the prediction accuracy and stability of the models. In Transformer 1, as the temporal granularity was reduced from 1 h to 15 min, the MAPE of most models decreased. Our model's MAPE was 2.75% at the 1 h granularity and 2.98% at the 15 min granularity, showing a slight increase in error but still remaining the best. Notably, at the same temporal granularity, our model not only significantly outperformed other models in terms of average error but also maintained the lowest MAPE variance, at 0.37% and 0.25%, respectively, demonstrating strong stability and robustness against disturbances. The experiments on Transformer 2 further validated this conclusion: the MAPE of our model was 2.73% at the 1 h granularity and further decreased to 2.16% at the 15 min granularity, with the corresponding error variance decreasing from 0.45% to 0.40%. In contrast, although traditional models such as ELM and PSO-SVR showed some improvement when the temporal granularity was reduced, their error levels remained significantly higher than those of our model. For example, at the 15 min granularity for Transformer 1, the MAPE of ELM was 6.78% and that of PSO-SVR was 5.95%, while our model's error was less than half of theirs. Overall, the model proposed in this study maintained superior performance across different temporal granularities, demonstrating good adaptability to high-frequency data.

5.7. SHAP Analysis Results

The SHAP (SHapley Additive exPlanations) analysis provides a comprehensive view of the feature contributions to the predictions for both Transformer 1 and Transformer 2. The results are visualized through bar plots and scatter plots, which depict the mean absolute SHAP values and the distribution of SHAP values across feature values, respectively. For Transformer 1, Figure 17a indicates that all six features (HUFL, LULL, LUFL, MULL, MUFL, and HULL) have significant impacts on the predictions, with HUFL being the most influential (mean absolute SHAP value of 0.86). This suggests that the high useful load has the most substantial effect on the output, followed by the low useless load, low useful load, middle useless load, middle useful load, and high useless load. Figure 18a further illustrates that the SHAP values for these features span a range, indicating varying degrees of influence depending on the specific feature values: high values of HUFL and LULL tend to increase the model's output, while high values of LUFL, MULL, MUFL, and HULL show a mix of positive and negative impacts.
For Transformer 2, the SHAP analysis reveals a different pattern of feature influence. Figure 17b shows that LULL is the most influential feature (mean absolute SHAP value of 0.28), followed by LUFL, MULL, MUFL, and HULL, whereas HUFL has no significant impact. This indicates that the predictions for Transformer 2 are driven primarily by the low useless load, with the other features contributing to a lesser extent. Figure 18b demonstrates that the SHAP values for LULL are predominantly positive, suggesting a consistently positive effect on the model's output. In contrast, the SHAP values for LUFL, MULL, MUFL, and HULL are distributed more evenly around zero, indicating a more nuanced influence on the predictions. The negligible contribution of HUFL suggests that the high useful load does not significantly affect the oil-temperature predictions for Transformer 2.
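The paper does not state which SHAP explainer was used, so the following is only a sketch of one common, model-agnostic way to obtain per-feature summaries like those in Figures 17 and 18: wrap the trained predictor in a function and apply shap.KernelExplainer. `predict_oil_temp`, its weights, and the random feature matrices are placeholders, not the authors' model or data.

```python
# Hedged sketch: model-agnostic SHAP attribution over the six load features,
# aggregated to the mean absolute SHAP value per feature (as in the bar plots).
import numpy as np
import shap

FEATURES = ["HUFL", "HULL", "MUFL", "MULL", "LUFL", "LULL"]

def predict_oil_temp(X):
    """Placeholder for the trained model: maps an (n, 6) load matrix to oil temperature."""
    return X @ np.array([0.8, 0.1, 0.3, 0.2, 0.2, 0.6])   # illustrative weights only

X_background = np.random.default_rng(0).normal(size=(50, 6))   # reference distribution
X_explain = np.random.default_rng(1).normal(size=(100, 6))     # samples to explain

explainer = shap.KernelExplainer(predict_oil_temp, X_background)
shap_values = explainer.shap_values(X_explain, nsamples=200)   # (100, 6) array

# Mean absolute SHAP value per feature -> the ranking shown in Figure 17.
importance = np.abs(shap_values).mean(axis=0)
for name, val in sorted(zip(FEATURES, importance), key=lambda p: -p[1]):
    print(f"{name}: {val:.3f}")
```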

6. Conclusions

This study developed and validated a novel BWO-TCN-BiGRU-Attention model for predicting the top-oil temperature of transformers. The model integrates the strengths of several advanced techniques: BWO for hyperparameter optimization, TCN for capturing short-term dependencies, BiGRU for handling long-term dependencies, and an attention mechanism for enhancing feature extraction. The experimental results show that, on the Transformer 1 dataset, the model achieved an MAE of 0.5258, a MAPE of 2.75%, and an RMSE of 0.6353; on the Transformer 2 dataset, the MAE was 0.9995, the MAPE was 2.73%, and the RMSE was 1.2158.
In seasonal tests, the model's MAPE was 2.75% in spring, 3.44% in summer, 3.93% in autumn, and 2.46% in winter for Transformer 1, and 2.73%, 2.78%, 3.07%, and 2.05% for Transformer 2, respectively, outperforming the benchmark models. Across the different time granularities, the model exhibited strong generalization ability and stability. At a 1 h granularity, the MAPE was 2.75% for Transformer 1 and 2.73% for Transformer 2; at a 15 min granularity, the MAPE for Transformer 1 rose slightly to 2.98%, still the best among the compared models, while the MAPE for Transformer 2 decreased further to 2.16%.
The BWO algorithm applied in this study has certain limitations: it may require substantial computational resources in high-dimensional search spaces and needs parameter tuning to maintain stable performance across tasks. Geographical diversity and transformer type may also affect the model's generalizability. Future research should therefore test the model across a broader range of geographical areas and transformer types, explore the architecture's applicability to other power systems and climates, and examine the feasibility of deploying the algorithm in real-time predictive systems. Furthermore, we intend to incorporate multi-source data, including ambient temperature, humidity, and transformer service life, to enrich the input features and improve predictive accuracy. These efforts aim to provide robust technical support for the stable operation of smart grids and related fields.
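For readers who want a concrete picture of the pipeline summarized above, the sketch below assembles a TCN front end, a bidirectional GRU, and a simple additive attention pooling into a single regressor. Layer sizes, the attention form, and the causal padding scheme are assumptions for illustration; the BWO hyperparameter search and the authors' exact architecture are not reproduced here.

```python
# Illustrative TCN -> BiGRU -> attention regressor for top-oil temperature.
# Sizes and the attention form are assumptions; this is not the authors' exact model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TCNBiGRUAttention(nn.Module):
    def __init__(self, n_features=6, channels=32, hidden=50):
        super().__init__()
        self.conv1 = nn.Conv1d(n_features, channels, 3, dilation=1)
        self.conv2 = nn.Conv1d(channels, channels, 3, dilation=2)
        self.bigru = nn.GRU(channels, hidden, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden, 1)      # additive attention score per time step
        self.head = nn.Linear(2 * hidden, 1)       # top-oil temperature output

    def forward(self, x):                          # x: (batch, time, n_features)
        h = x.transpose(1, 2)                      # (batch, n_features, time)
        h = F.relu(self.conv1(F.pad(h, (2, 0))))   # causal: left-pad by (k - 1) * dilation
        h = F.relu(self.conv2(F.pad(h, (4, 0))))
        h, _ = self.bigru(h.transpose(1, 2))       # (batch, time, 2 * hidden)
        w = torch.softmax(self.score(h), dim=1)    # attention weights over time
        context = (w * h).sum(dim=1)               # attention-weighted temporal summary
        return self.head(context).squeeze(-1)

model = TCNBiGRUAttention()
print(model(torch.randn(8, 6, 6)).shape)           # torch.Size([8]) for a 6-step window
```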

Author Contributions

Conceptualization, J.L.; methodology, Z.H.; software, B.L.; validation, X.Z.; formal analysis, J.L.; investigation, Z.H.; resources, B.L.; data curation, X.Z.; writing—original draft preparation, J.L., Z.H. and B.L.; writing—review and editing, J.L., Z.H. and B.L.; visualization, J.L.; supervision, Z.H.; project administration, B.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Nomenclature

ACO: Ant colony optimization
BiLSTM: Bidirectional long short-term memory
BWO: Beluga whale optimization
CNN: Convolutional neural network
COA: Coati optimization algorithm
DE: Differential evolution
ELM: Extreme learning machine
FA: Firefly algorithm
GA: Genetic algorithm
GRU: Gated recurrent unit
HBO: Heap-based optimization
HHO: Harris hawk optimization
LSTM: Long short-term memory
MAE: Mean absolute error
MAPE: Mean absolute percentage error
MSE: Mean squared error
NGO: Northern goshawk optimization
PSO: Particle swarm optimization
RE: Relative error
RMSE: Root mean square error
RNN: Recurrent neural network
SA: Simulated annealing
SSA: Sparrow search algorithm
Std: Standard deviation
SVM: Support vector machine
SVR: Support vector regression
TCN: Temporal convolutional network
WOA: Whale optimization algorithm

References

  1. Liu, X.; Xie, J.; Luo, Y.; Yang, D. A Novel Power Transformer Fault Diagnosis Method Based on Data Augmentation for KPCA and Deep Residual Network. Energy Rep. 2023, 9, 620–627. [Google Scholar] [CrossRef]
  2. Xu, J.; Hao, J.; Zhang, N.; Liao, R.; Feng, Y.; Liao, W.; Cheng, H. Simulation Study on Converter Transformer Windings Stress Characteristics under Harmonic Current and Temperature Rise Effect. Int. J. Electr. Power Energy Syst. 2025, 165, 110505. [Google Scholar] [CrossRef]
  3. Olivares-Galván, J.C.; Georgilakis, P.S.; Ocon-Valdez, R. A Review of Transformer Losses. Electr. Power Compon. Syst. 2009, 37, 1046–1062. [Google Scholar] [CrossRef]
  4. Tsili, M.A.; Amoiralis, E.I.; Kladas, A.G.; Souflaris, A.T. Power Transformer Thermal Analysis by Using an Advanced Coupled 3D Heat Transfer and Fluid Flow FEM Model. Int. J. Therm. Sci. 2012, 53, 188–201. [Google Scholar] [CrossRef]
  5. Sun, L.; Xu, M.; Ren, H.; Hu, S.; Feng, G. Multi-Point Grounding Fault Diagnosis and Temperature Field Coupling Analysis of Oil-Immersed Transformer Core Based on Finite Element Simulation. Case Stud. Therm. Eng. 2024, 55, 104108. [Google Scholar] [CrossRef]
  6. Guo, Y.; Chang, Y.; Lu, B. A Review of Temperature Prediction Methods for Oil-Immersed Transformers. Measurement 2025, 239, 115383. [Google Scholar] [CrossRef]
  7. Singh, J.; Singh, S.; Singh, A. Distribution Transformer Failure Modes, Effects and Criticality Analysis (FMECA). Eng. Fail. Anal. 2019, 99, 180–191. [Google Scholar] [CrossRef]
  8. Zhao, Z.; Xu, J.; Zang, Y.; Hu, R. Adaptive Abnormal Oil Temperature Diagnosis Method of Transformer Based on Concept Drift. Appl. Sci. 2021, 11, 6322. [Google Scholar] [CrossRef]
  9. Fauzi, N.A.; Ali, N.H.N.; Ker, P.J.; Thiviyanathan, V.A.; Leong, Y.S.; Sabry, A.H.; Jamaludin, Z.B.; Lo, C.K.; Mun, L.H. Fault Prediction for Power Transformer Using Optical Spectrum of Transformer Oil and Data Mining Analysis. IEEE Access 2020, 8, 136374–136381. [Google Scholar] [CrossRef]
  10. Beheshti Asl, M.; Fofana, I.; Meghnefi, F. Review of Various Sensor Technologies in Monitoring the Condition of Power Transformers. Energies 2024, 17, 3533. [Google Scholar] [CrossRef]
  11. Zhu, J.; Xu, Y.; Peng, C.; Zhao, S. Fault Analysis of Oil-Immersed Transformer Based on Digital Twin Technology. J. Comput. Electron. Inf. Manag. 2024, 14, 9–15. [Google Scholar] [CrossRef]
  12. Zhang, P.; Zhang, Q.; Hu, H.; Hu, H.; Peng, R.; Liu, J. Research on Transformer Temperature Early Warning Method Based on Adaptive Sliding Window and Stacking. Electronics 2025, 14, 373. [Google Scholar] [CrossRef]
  13. Yang, L.; Chen, L.; Zhang, F.; Ma, S.; Zhang, Y.; Yang, S. A Transformer Oil Temperature Prediction Method Based on Data-Driven and Multi-Model Fusion. Processes 2025, 13, 302. [Google Scholar] [CrossRef]
  14. Zheng, H.; Li, X.; Feng, Y.; Yang, H.; Lv, W. Investigation on Micro-mechanism of Palm Oil as Natural Ester Insulating Oil for Overheating Thermal Fault Analysis of Transformers. High Volt. 2022, 7, 812–824. [Google Scholar] [CrossRef]
  15. Vatsa, A.; Hati, A.S.; Rathore, A.K. Enhancing Transformer Health Monitoring With AI-Driven Prognostic Diagnosis Trends: Overcoming Traditional Methodology’s Computational Limitations. IEEE Ind. Electron. Mag. 2024, 18, 30–44. [Google Scholar] [CrossRef]
  16. Meshkatodd, M.R. Aging Study and Lifetime Estimation of Transformer Mineral Oil. Am. J. Eng. Appl. Sci. 2008, 1, 384–388. [Google Scholar] [CrossRef]
  17. Abdali, A.; Abedi, A.; Mazlumi, K.; Rabiee, A.; Guerrero, J.M. Novel Hotspot Temperature Prediction of Oil-Immersed Distribution Transformers: An Experimental Case Study. IEEE Trans. Ind. Electron. 2023, 70, 7310–7322. [Google Scholar] [CrossRef]
  18. Nordman, H.; Lahtinen, M. Thermal Overload Tests on a 400-MVA Power Transformer with a Special 2.5-p.u. Short Time Loading Capability. IEEE Trans. Power Deliv. 2003, 18, 107–112. [Google Scholar] [CrossRef]
  19. Thiviyanathan, V.A.; Ker, P.J.; Leong, Y.S.; Abdullah, F.; Ismail, A.; Jamaludin, Z. Power Transformer Insulation System: A Review on the Reactions, Fault Detection, Challenges and Future Prospects. Alex. Eng. J. 2022, 61, 7697–7713. [Google Scholar] [CrossRef]
  20. Hamed Samimi, M.; Dadashi Ilkhechi, H. Survey of Different Sensors Employed for the Power Transformer Monitoring. IET Sci. Meas. Technol. 2020, 14, 1–8. [Google Scholar] [CrossRef]
  21. Na, Q.; Wen, Y. Design of Multi-Point Intelligent Temperature Monitoring System for Transformer Equipment. J. Phys. Conf. Ser. 2020, 1550, 062007. [Google Scholar] [CrossRef]
  22. Wang, Y.; Zhao, D.; Jia, Y.; Wang, S.; Du, Y.; Li, H.; Zhang, B. Acoustic Sensors for Monitoring and Localizing Partial Discharge Signals in Oil-Immersed Transformers under Array Configuration. Sensors 2024, 24, 4704. [Google Scholar] [CrossRef] [PubMed]
  23. Amoda, O.A.; Tylavsky, D.J.; McCulla, G.A.; Knuth, W.A. Acceptability of Three Transformer Hottest-Spot Temperature Models. IEEE Trans. Power Deliv. 2012, 27, 13–22. [Google Scholar] [CrossRef]
  24. Sippola, M.; Sepponen, R.E. Accurate Prediction of High-Frequency Power-Transformer Losses and Temperature Rise. IEEE Trans. Power Electron. 2002, 17, 835–847. [Google Scholar] [CrossRef]
  25. Deng, Y.; Ruan, J.; Quan, Y.; Gong, R.; Huang, D.; Duan, C.; Xie, Y. A Method for Hot Spot Temperature Prediction of a 10 kV Oil-Immersed Transformer. IEEE Access 2019, 7, 107380–107388. [Google Scholar] [CrossRef]
  26. Lyu, Z.; Wan, Z.; Bian, Z.; Liu, Y.; Zhao, W. Integrated Digital Twins System for Oil Temperature Prediction of Power Transformer Based On Internet of Things. IEEE Internet Things J. 2025, 2025, 3530440. [Google Scholar] [CrossRef]
  27. Institute of Electrical and Electronics Engineers. IEEE Guide for Loading Mineral-Oil-Immersed Transformers and Step-Voltage Regulators; Institute of Electrical and Electronics Engineers, IEEE-SA Standards Board, Eds.; Institute of Electrical and Electronics Engineers: New York, NY, USA, 2012; ISBN 9780738171951. [Google Scholar]
  28. Hippert, H.S.; Pedreira, C.E.; Souza, R.C. Neural Networks for Short-Term Load Forecasting: A Review and Evaluation. IEEE Trans. Power Syst. 2001, 16, 44–55. [Google Scholar] [CrossRef]
  29. Taheri, A.A.; Abdali, A.; Rabiee, A. A Novel Model for Thermal Behavior Prediction of Oil-Immersed Distribution Transformers With Consideration of Solar Radiation. IEEE Trans. Power Deliv. 2019, 34, 1634–1646. [Google Scholar] [CrossRef]
  30. Cheng, L.; Yu, T. Dissolved Gas Analysis Principle-Based Intelligent Approaches to Fault Diagnosis and Decision Making for Large Oil-Immersed Power Transformers: A Survey. Energies 2018, 11, 913. [Google Scholar] [CrossRef]
  31. Rao, W.; Zhu, L.; Pan, S.; Yang, P.; Qiao, J. Bayesian Network and Association Rules-Based Transformer Oil Temperature Prediction. J. Phys. Conf. Ser. 2019, 1314, 012066. [Google Scholar] [CrossRef]
  32. Zhang, X.; Liu, G.; Wang, H.; Li, X. Application of a Hybrid Interpolation Method Based on Support Vector Machine in the Precipitation Spatial Interpolation of Basins. Water 2017, 9, 760. [Google Scholar] [CrossRef]
  33. Lee, K.-R. A Study on SVM-Based Speaker Classification Using GMM-Supervector. J. IKEEE 2020, 24, 1022–1027. [Google Scholar] [CrossRef]
  34. Smyl, S. A Hybrid Method of Exponential Smoothing and Recurrent Neural Networks for Time Series Forecasting. Int. J. Forecast. 2020, 36, 75–85. [Google Scholar] [CrossRef]
  35. Xi, Y.; Lin, D.; Yu, L.; Chen, B.; Jiang, W.; Chen, G. Oil Temperature Prediction of Power Transformers Based on Modified Support Vector Regression Machine. Int. J. Emerg. Electr. Power Syst. 2023, 24, 367–375. [Google Scholar] [CrossRef]
  36. Huang, S.-J.; Shih, K.-R. Short-Term Load Forecasting via ARMA Model Identification Including Non-Gaussian Process Considerations. IEEE Trans. Power Syst. 2003, 18, 673–679. [Google Scholar] [CrossRef]
  37. Zhang, T.; Sun, H.; Peng, F.; Zhao, S.; Yan, R. A Deep Transfer Regression Method Based on Seed Replacement Considering Balanced Domain Adaptation. Eng. Appl. Artif. Intell. 2022, 115, 105238. [Google Scholar] [CrossRef]
  38. Cuk, A.; Bezdan, T.; Jovanovic, L.; Antonijevic, M.; Stankovic, M.; Simic, V.; Zivkovic, M.; Bacanin, N. Tuning Attention Based Long-Short Term Memory Neural Networks for Parkinson’s Disease Detection Using Modified Metaheuristics. Sci. Rep. 2024, 14, 4309. [Google Scholar] [CrossRef]
  39. Jovanovic, A.; Jovanovic, L.; Zivkovic, M.; Bacanin, N.; Simic, V.; Pamucar, D.; Antonijevic, M. Particle Swarm Optimization Tuned Multi-Headed Long Short-Term Memory Networks Approach for Fuel Prices Forecasting. J. Netw. Comput. Appl. 2025, 233, 104048. [Google Scholar] [CrossRef]
  40. Lin, J.; Van Wijngaarden, A.J.D.L.; Wang, K.-C.; Smith, M.C. Speech Enhancement Using Multi-Stage Self-Attentive Temporal Convolutional Networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3440–3450. [Google Scholar] [CrossRef]
  41. Fernando, T.; Sridharan, S.; Denman, S.; Ghaemmaghami, H.; Fookes, C. Robust and Interpretable Temporal Convolution Network for Event Detection in Lung Sound Recordings. IEEE J. Biomed. Health Inform. 2022, 26, 2898–2908. [Google Scholar] [CrossRef]
  42. Yang, Z.; Liao, W.; Liu, K.; Chen, X.; Zhu, R. Power Quality Disturbances Classification Using A TCN-CNN Model. In Proceedings of the 2022 7th Asia Conference on Power and Electrical Engineering (ACPEE), Hangzhou, China, 16–17 April 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 2145–2149. [Google Scholar]
  43. Wang, X.; Xie, G.; Zhang, Y.; Liu, H.; Zhou, L.; Liu, W.; Gao, Y. The Application of a BiGRU Model with Transformer-Based Error Correction in Deformation Prediction for Bridge SHM. Buildings 2025, 15, 542. [Google Scholar] [CrossRef]
  44. Sun, R.; Chen, J.; Li, B.; Piao, C. State of Health Estimation for Lithium-Ion Batteries Based on Novel Feature Extraction and BiGRU-Attention Model. Energy 2025, 319, 134756. [Google Scholar] [CrossRef]
  45. Zhang, X.; Zhao, H.; Yao, J.; Wang, Z.; Zheng, Y.; Peng, T.; Zhang, C. A Multi-Scale Component Feature Learning Framework Based on CNN-BiGRU and Online Sequential Regularized Extreme Learning Machine for Wind Speed Prediction. Renew. Energy 2025, 242, 122427. [Google Scholar] [CrossRef]
  46. Dong, Y.; Zhong, Z.; Zhang, Y.; Zhu, R.; Wen, H.; Han, R. Intelligent Prediction Method of Hot Spot Temperature in Transformer by Using CNN-LSTM&GRU Network. In Proceedings of the 2023 International Conference on Advanced Robotics and Mechatronics (ICARM), Sanya, China, 8 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 7–12. [Google Scholar]
  47. Wang, X.; Liu, X.; Bai, Y. Prediction of the Temperature of Diesel Engine Oil in Railroad Locomotives Using Compressed Information-Based Data Fusion Method with Attention-Enhanced CNN-LSTM. Appl. Energy 2024, 367, 123357. [Google Scholar] [CrossRef]
  48. Zou, D.; Xu, H.; Quan, H.; Yin, J.; Peng, Q.; Wang, S.; Dai, W.; Hong, Z. Top-Oil Temperature Prediction of Power Transformer Based on Long Short-Term Memory Neural Network with Self-Attention Mechanism Optimized by Improved Whale Optimization Algorithm. Symmetry 2024, 16, 1382. [Google Scholar] [CrossRef]
  49. Abualigah, L. Particle Swarm Optimization: Advances, Applications, and Experimental Insights. Comput. Mater. Contin. 2025, 82, 1539–1592. [Google Scholar] [CrossRef]
  50. Ballerini, L. Particle Swarm Optimization in 3D Medical Image Registration: A Systematic Review. Arch. Comput. Methods Eng. 2025, 32, 311–318. [Google Scholar] [CrossRef]
  51. Ma, J.; Gao, W.; Tong, W. A Deep Reinforcement Learning Assisted Adaptive Genetic Algorithm for Flexible Job Shop Scheduling. Eng. Appl. Artif. Intell. 2025, 149, 110447. [Google Scholar] [CrossRef]
  52. Chen, Y.; Dong, Z.; Wang, Y.; Su, J.; Han, Z.; Zhou, D.; Zhang, K.; Zhao, Y.; Bao, Y. Short-Term Wind Speed Predicting Framework Based on EEMD-GA-LSTM Method under Large Scaled Wind History. Energy Convers. Manag. 2021, 227, 113559. [Google Scholar] [CrossRef]
  53. Emary, E.; Zawbaa, H.M.; Grosan, C. Experienced Gray Wolf Optimization Through Reinforcement Learning and Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 681–694. [Google Scholar] [CrossRef]
  54. Yang, Y.; Yan, J.; Zhou, X. A Heat Load Prediction Method for District Heating Systems Based on the AE-GWO-GRU Model. Appl. Sci. 2024, 14, 5446. [Google Scholar] [CrossRef]
  55. Yue, Y.; Cao, L.; Lu, D.; Hu, Z.; Xu, M.; Wang, S.; Li, B.; Ding, H. Review and Empirical Analysis of Sparrow Search Algorithm. Artif. Intell. Rev. 2023, 56, 10867–10919. [Google Scholar] [CrossRef]
  56. Mohammed, H.M.; Umar, S.U.; Rashid, T.A. A Systematic and Meta-Analysis Survey of Whale Optimization Algorithm. Comput. Intell. Neurosci. 2019, 2019, 8718571. [Google Scholar] [CrossRef] [PubMed]
  57. Zhu, D.; Li, R.; Zheng, Y.; Zhou, C.; Li, T.; Cheng, S. Cumulative Major Advances in Particle Swarm Optimization from 2018 to the Present: Variants, Analysis and Applications. Arch. Comput. Methods Eng. 2025, 32, 1571–1595. [Google Scholar] [CrossRef]
  58. Dehghani, M.; Hubalovsky, S.; Trojovsky, P. Northern Goshawk Optimization: A New Swarm-Based Algorithm for Solving Optimization Problems. IEEE Access 2021, 9, 162059–162080. [Google Scholar] [CrossRef]
  59. Li, Y.; Lin, X.; Liu, J. An Improved Gray Wolf Optimization Algorithm to Solve Engineering Problems. Sustainability 2021, 13, 3208. [Google Scholar] [CrossRef]
  60. Mirjalili, S.; Lewis, A. The Whale Optimization Algorithm. Adv. Eng. Softw. 2016, 95, 51–67. [Google Scholar] [CrossRef]
  61. Heidari, A.A.; Mirjalili, S.; Faris, H.; Aljarah, I.; Mafarja, M.; Chen, H. Harris Hawks Optimization: Algorithm and Applications. Future Gener. Comput. Syst. 2019, 97, 849–872. [Google Scholar] [CrossRef]
  62. Dorigo, M.; Birattari, M.; Stutzle, T. Ant Colony Optimization. IEEE Comput. Intell. Mag. 2006, 1, 28–39. [Google Scholar] [CrossRef]
  63. Dehghani, M.; Montazeri, Z.; Trojovská, E.; Trojovský, P. Coati Optimization Algorithm: A New Bio-Inspired Metaheuristic Algorithm for Solving Optimization Problems. Knowl. Based Syst. 2023, 259, 110011. [Google Scholar] [CrossRef]
  64. Gharehchopogh, F.S.; Namazi, M.; Ebrahimi, L.; Abdollahzadeh, B. Advances in Sparrow Search Algorithm: A Comprehensive Survey. Arch. Comput. Methods Eng. 2023, 30, 427–455. [Google Scholar] [CrossRef] [PubMed]
  65. Van Laarhoven, P.J.M.; Aarts, E.H.L. Simulated Annealing: Theory and Applications; Springer: Dordrecht, The Netherlands, 1987; ISBN 9789048184385. [Google Scholar]
  66. Karaboğa, D.; Ökdem, S. A Simple and Global Optimization Algorithm for Engineering Problems: Differential Evolution Algorithm. Turk. J. Electr. Eng. Comput. Sci. 2004, 12, 53–60. [Google Scholar]
  67. Mitchell, M. An Introduction to Genetic Algorithms; Complex Adaptive Systems, 7 Print; The MIT Press: Cambridge, MA, USA, 2001; ISBN 9780262631853. [Google Scholar]
  68. Zhong, C.; Li, G.; Meng, Z. Beluga Whale Optimization: A Novel Nature-Inspired Metaheuristic Algorithm. Knowl. Based Syst. 2022, 251, 109215. [Google Scholar] [CrossRef]
  69. Wan, A.; Peng, S.; AL-Bukhaiti, K.; Ji, Y.; Ma, S.; Yao, F.; Ao, L. A Novel Hybrid BWO-BiLSTM-ATT Framework for Accurate Offshore Wind Power Prediction. Ocean. Eng. 2024, 312, 119227. [Google Scholar] [CrossRef]
  70. Wolpert, D.H.; Macready, W.G. No free lunch theorems for optimization. IEEE Trans. Evol. Computat. 1997, 1, 67–82. [Google Scholar] [CrossRef]
  71. Lara-Benítez, P.; Carranza-García, M.; Luna-Romera, J.M.; Riquelme, J.C. Temporal Convolutional Networks Applied to Energy-Related Time Series Forecasting. Appl. Sci. 2020, 10, 2322. [Google Scholar] [CrossRef]
  72. Feng, Y.; Zhu, J.; Qiu, P.; Zhang, X.; Shuai, C. Short-Term Power Load Forecasting Based on TCN-BiLSTM-Attention and Multi-Feature Fusion. Arab. J. Sci. Eng. 2024, 50, 5475–5486. [Google Scholar] [CrossRef]
  73. Li, X.; Zhou, S.; Wang, F. A CNN-BiGRU Sea Level Height Prediction Model Combined with Bayesian Optimization Algorithm. Ocean. Eng. 2025, 315, 119849. [Google Scholar] [CrossRef]
  74. Ławryńczuk, M.; Zarzycki, K. LSTM and GRU Type Recurrent Neural Networks in Model Predictive Control: A Review. Neurocomputing 2025, 632, 129712. [Google Scholar] [CrossRef]
  75. Karthik, R.V.; Pandiyaraju, V.; Ganapathy, S. A Context and Sequence-Based Recommendation Framework Using GRU Networks. Artif. Intell. Rev. 2025, 58, 170. [Google Scholar] [CrossRef]
  76. Teng, F.; Song, Y.; Guo, X. Attention-TCN-BiGRU: An Air Target Combat Intention Recognition Model. Mathematics 2021, 9, 2412. [Google Scholar] [CrossRef]
  77. Niu, Z.; Zhong, G.; Yu, H. A Review on the Attention Mechanism of Deep Learning. Neurocomputing 2021, 452, 48–62. [Google Scholar] [CrossRef]
  78. Hernández, A.; Amigó, J.M. Attention Mechanisms and Their Applications to Complex Systems. Entropy 2021, 23, 283. [Google Scholar] [CrossRef] [PubMed]
  79. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. Proc. AAAI Conf. Artif. Intell. 2021, 35, 11106–11115. [Google Scholar] [CrossRef]
  80. Xue, J.; Shen, B. Dung Beetle Optimizer: A New Meta-Heuristic Algorithm for Global Optimization. J. Supercomput. 2023, 79, 7305–7336. [Google Scholar] [CrossRef]
  81. Kennedy, J.; Eberhart, R. Particle swarm optimization. In Proceedings of the ICNN’95—International Conference on Neural Networks, Perth, WA, Australia, 27 November–1 December 1995; IEEE: Piscataway, NJ, USA, 1995; pp. 1942–1948. [Google Scholar] [CrossRef]
  82. Sumida, B.H.; Houston, A.I.; McNamara, J.M.; Hamilton, W.D. Genetic algorithms and evolution. J. Theor. Biol. 1990, 147, 59–84. [Google Scholar] [CrossRef]
  83. Watanabe, O.; Zeugmann, T. (Eds.) Stochastic Algorithms: Foundations and Applications; Springer: Berlin/Heidelberg, Germany, 2009. [Google Scholar] [CrossRef]
  84. Shiyong, L.; Jing, X.; Mianzhi, W.; Rongbin, X.; Bin, J.; Kai, W.; Qingquan, L. Prediction of Transformer Top Oil Temperature Based on Improved Weighted Support Vector Regression Based on Particle Swarm Optimization. In Proceedings of the 2021 International Conference on Advanced Electrical Equipment and Reliable Operation (AEERO), Beijing, China, 15–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–5. [Google Scholar]
Figure 1. The basic structure of oil-immersed transformers.
Figure 2. Convolution schematic: (a) standard convolution with a 3 × 3 kernel (and padding); (b) dilated convolution with a 3 × 3 kernel and dilation factor of 2.
Figure 3. Overall structure of TCN.
Figure 4. Residual block internal structure.
Figure 5. Gated recurrent unit structure.
Figure 6. Bidirectional gated recurrent unit structure.
Figure 7. The workflow of BWO in the transformer oil temperature prediction task.
Figure 8. Framework for the presentation of the methodology.
Figure 9. Top-oil temperature variation curves for two transformers: (a) Transformer 1; (b) Transformer 2.
Figure 10. Pearson correlation analysis results for two transformers: (a) Transformer 1; and (b) Transformer 2.
Figure 11. Fitness iteration curves of different algorithms.
Figure 12. Boxplots of fitness values per algorithm.
Figure 13. Comparison of prediction performance and true values for different models on the test set: (a) Transformer 1; and (b) Transformer 2.
Figure 14. Relative error curves for different models using the entire dataset: (a) Transformer 1; and (b) Transformer 2.
Figure 15. Evaluation of the model optimizers and sliding window lengths for transformers: (a–c) model optimizer evaluation for Transformer 1; (d–f) model optimizer evaluation for Transformer 2; (g–i) sliding window length evaluation for Transformer 1; and (j–l) sliding window length evaluation for Transformer 2.
Figure 16. MAPE and variance of different models at different time granularities in spring.
Figure 17. SHAP summary plot for (a) Transformer 1; (b) Transformer 2.
Figure 18. SHAP dependence plot for (a) Transformer 1; and (b) Transformer 2.
Table 1. Comparison of methods for measuring the top-oil temperature of transformers using temperature sensors.

Traditional temperature sensors (e.g., Pt100) [20]
  Advantages:
  • Measure transformer oil temperature accurately.
  • Mature technology, low cost, easy installation, and maintenance.
  Disadvantages:
  • Requires regular calibration.
  • No remote real-time monitoring and delayed data acquisition.
Detection system based on a microcontroller [21]
  Advantages:
  • Monitors oil temperature in real time and compares it with thresholds.
  • Activates cooling and alarms automatically.
  Disadvantages:
  • Requires professional technicians.
  • Limited for complex fault diagnosis.
Surface acoustic wave sensor [22]
  Advantages:
  • Wireless sensing, with no power supply or installation issues.
  • Monitors oil temperature and level simultaneously.
  Disadvantages:
  • Accuracy affected by environmental factors.
  • High integration, complex installation and debugging.
Table 2. Comparison of optimization algorithms.

Type of Model | Reference | Description | Advantages | Disadvantages
Particle Swarm Optimization (PSO) | [57] | Simulates bird flock foraging behavior to optimize solutions through group cooperation and information sharing. | Simple to implement, with few parameters and wide applicability. | Easily falls into local optima and performs poorly on high-dimensional problems.
Northern Goshawk Optimization (NGO) | [58] | Simulates the hunting behavior of the northern goshawk, balancing exploration and exploitation. | Strong optimization capability that can effectively avoid local optima. | High computational cost and complex parameterization in high-dimensional problems.
Grey Wolf Optimizer (GWO) | [59] | Simulates the social hierarchy and hunting behavior of grey wolves, with alpha wolves guiding the search. | Simple structure and strong global search capability. | Easily falls into local optima, and parameter tuning has a substantial impact.
Whale Optimization Algorithm (WOA) | [60] | Simulates the bubble-net feeding behavior of whales. | Easy to implement with good search capability. | Inadequate local search capability and sensitive parameter settings.
Harris Hawk Optimization (HHO) | [61] | Simulates the hunting behavior of Harris hawks, optimized by fast swooping and random search. | Strong global search capability and fast convergence. | High degree of randomness and potentially unstable results.
Ant Colony Optimization (ACO) | [62] | Simulates ant foraging behavior with pheromone deposition and update-guided search. | Suitable for combinatorial optimization problems and well adapted to multimodal problems. | Slow convergence and parameter sensitivity.
Coati Optimization Algorithm (COA) | [63] | Simulates the hunting behavior of coatis, including exploration (hunting and attacking iguanas) and escaping predators. | Ensures initial solutions are uniformly distributed, generating high-quality results and maintaining robustness across optimization problems. | Possibly slow convergence and high computational cost on high-dimensional problems.
Sparrow Search Algorithm (SSA) | [64] | Simulates the foraging and anti-predator behaviors of sparrows. | Small dependence on initial parameters and good performance in dynamic environments. | Convergence may be slow, and complex problems require many iterations.
Simulated Annealing (SA) | [65] | Simulates the physical annealing process, optimizing by gradually lowering the temperature. | Enhances global search performance by escaping local optima. | Possibly slow convergence, and the temperature schedule requires tuning.
Differential Evolution (DE) | [66] | Global optimization algorithm based on differential operations for continuous parameter optimization. | Simple and effective for high-dimensional problems; easy to parallelize. | May fall into local optima; sensitive to parameters.
Genetic Algorithm (GA) | [67] | Mimics the natural evolutionary process, optimizing through selection, crossover, and mutation operations. | Strong global search performance for complex and discrete problems. | High computational complexity and slow convergence.
Table 3. Dataset characteristics.

Field | Date | HUFL | HULL | MUFL | MULL | LUFL | LULL | OT
Description | The recorded date | High useful load | High useless load | Middle useful load | Middle useless load | Low useful load | Low useless load | Oil temperature (target)
Table 4. Temperature data statistics of power transformers.

Data | Mean | Standard Error | Median | Mode | Standard Deviation | Sample Variance | Minimum | Maximum
Transformer 1 (1 h) | 13.32 | 0.06 | 11.40 | 9.22 | 8.57 | 73.39 | −4.08 | 46.01
Transformer 2 (1 h) | 26.61 | 0.09 | 26.58 | 11.64 | 11.89 | 141.33 | −2.65 | 58.88
Transformer 1 (15 min) | 13.32 | 0.03 | 11.40 | 0.00 | 8.57 | 73.36 | −4.22 | 46.01
Transformer 2 (15 min) | 26.61 | 0.05 | 26.58 | 11.64 | 11.89 | 141.29 | −2.65 | 58.88
Table 5. Fitness parameter settings for different algorithms.

Algorithm | Parameters | Values
All | Population size, maximum number of iterations, replication times | 30, 20, 20
BWO | Probability of whale fall, decreased over the interval Wf | [0.1, 0.05]
DBO | k, λ, b, S | 0.1, 0.1, 0.3, 0.5
WOA | Probability of encircling mechanism, spiral factor | 0.5, 1
HHO | Interval of E0 | [−1, 1]
NGO | R | R = 0.02 · (1 − t/T)
PSO | Cognitive and social constants C1, C2; inertia weight | 2, 2, [0.9, 0.1]
GA | Crossover rate, mutation rate (Gaussian) | 0.8, 0.05
FA | Maximum brightness, absorption coefficient, attraction coefficient, randomization parameter | 1, 1, 1.5, 0.2
Table 6. Optimal hyperparameters for the models.

Hyperparameters of TCN-BiGRU-Attention | Optimal Values
Learning rate | 0.01
Number of neurons | 50
Number of attention keys | 40
Regularization parameter | 0.0001
Table 7. Performance of metaheuristic algorithms across independent runs.

Algorithm | Best | Worst | Average | Std
BWO | 0.23264 | 0.23439 | 0.23351 | 0.00045
DBO | 0.23570 | 0.23836 | 0.23698 | 0.00065
WOA | 0.23337 | 0.23588 | 0.23495 | 0.00060
HHO | 0.23415 | 0.23765 | 0.23566 | 0.00073
NGO | 0.23304 | 0.23494 | 0.23380 | 0.00051
PSO | 0.23572 | 0.23924 | 0.23704 | 0.00085
GA | 0.23429 | 0.23885 | 0.23563 | 0.00090
FA | 0.23490 | 0.23816 | 0.23646 | 0.00078
Table 8. Hyperparameter configurations of baseline models.

Model | Key Hyperparameters | Value Range/Settings
PSO-SVR | Population size/iterations | 30/20
PSO-SVR | C, ε | Grid search: C ∈ [1, 100], ε ∈ [0.01, 0.1]
PSO-SVR | Kernel function | RBF
Informer | Learning rate | 0.0005
Informer | Batch size | 32
Informer | Encoder/decoder layers | 2/1
Informer | Attention heads | 4
CNN-BiLSTM-Attention | Conv filter size/LSTM units/attention size | [3, 5], [64], [32]
CNN-BiLSTM-Attention | Dropout rate | 0.2
CNN-GRU-Attention | Same as CNN-BiLSTM-Attention, with GRU replacing LSTM
Table 9. Prediction indicators of different models.

Forecasting Model | MAE (T1) | MAPE (T1) | RMSE (T1) | MAE (T2) | MAPE (T2) | RMSE (T2)
ELM | 1.8123 | 10.09% | 2.1276 | 2.1126 | 6.11% | 2.3443
PSO-SVR | 1.4263 | 7.73% | 1.6339 | 2.1961 | 6.42% | 2.6443
Informer | 1.1387 | 6.11% | 1.3798 | 1.6531 | 4.65% | 2.0685
CNN-BiLSTM-Attention | 0.7120 | 3.73% | 0.8163 | 2.0097 | 5.69% | 2.4215
CNN-GRU-Attention | 1.3171 | 7.14% | 1.5689 | 1.9634 | 5.61% | 2.3722
Ours | 0.5258 | 2.75% | 0.6353 | 0.9995 | 2.73% | 1.2158
Table 10. t-values of statistical significance tests for MAE, RMSE, and MAPE.

Model | Transformer | t (MAE) | t (RMSE) | t (MAPE)
ELM | Transformer 1 | −114.663 | −72.074 | −64.494
ELM | Transformer 2 | −47.928 | −47.884 | −60.947
PSO-SVR | Transformer 1 | −61.290 | −60.943 | −70.091
PSO-SVR | Transformer 2 | −56.487 | −61.144 | −36.44
Informer | Transformer 1 | −58.494 | −51.796 | −70.103
Informer | Transformer 2 | −43.721 | −59.024 | −39.914
CNN-BiLSTM-Attention | Transformer 1 | −16.537 | −15.350 | −18.012
CNN-BiLSTM-Attention | Transformer 2 | −55.774 | −53.806 | −45.662
CNN-GRU-Attention | Transformer 1 | −52.443 | −64.605 | −68.290
CNN-GRU-Attention | Transformer 2 | −49.257 | −53.525 | −51.991
Table 11. Uncertainty analysis results for each model on Transformer 1.

Model | MAE p-Value | MAE Confidence Interval | RMSE p-Value | RMSE Confidence Interval | MAPE p-Value | MAPE Confidence Interval
ELM | 3.097 × 10^−27 | (−1.310, −1.263) | 1.296 × 10^−23 | (−1.536, −1.449) | 9.514 × 10^−23 | (−7.579, −7.101)
PSO-SVR | 2.370 × 10^−22 | (−0.931, −0.870) | 2.625 × 10^−22 | (−1.033, −0.964) | 2.139 × 10^−23 | (−5.129, −4.831)
Informer | 5.472 × 10^−22 | (−0.635, −0.591) | 4.825 × 10^−21 | (−0.775, −0.714) | 2.132 × 10^−23 | (−3.461, −3.259)
CNN-BiLSTM-Attention | 2.492 × 10^−12 | (−0.210, −0.163) | 8.753 × 10^−12 | (−0.206, −0.156) | 5.823 × 10^−13 | (−1.094, −0.866)
CNN-GRU-Attention | 3.864 × 10^−21 | (−0.823, −0.760) | 9.225 × 10^−23 | (−0.964, −0.903) | 3.412 × 10^−23 | (−4.525, −4.255)
Table 12. Uncertainty analysis results for each model on Transformer 2.

Model | MAE p-Value | MAE Confidence Interval | RMSE p-Value | RMSE Confidence Interval | MAPE p-Value | MAPE Confidence Interval
ELM | 1.932 × 10^−20 | (−1.162, −1.064) | 1.964 × 10^−20 | (−1.178, −1.079) | 2.620 × 10^−22 | (−3.497, −3.263)
PSO-SVR | 1.022 × 10^−21 | (−1.241, −1.152) | 2.474 × 10^−22 | (−1.478, −1.379) | 2.554 × 10^−18 | (−3.903, −3.477)
Informer | 9.967 × 10^−20 | (−0.685, −0.622) | 4.656 × 10^−22 | (−0.883, −0.822) | 5.056 × 10^−19 | (−2.021, −1.819)
CNN-BiLSTM-Attention | 1.284 × 10^−21 | (−1.048, −0.972) | 2.442 × 10^−21 | (−1.253, −1.159) | 4.590 × 10^−20 | (−3.096, −2.824)
CNN-GRU-Attention | 1.185 × 10^−20 | (−1.005, −0.923) | 2.682 × 10^−21 | (−1.202, −1.111) | 4.513 × 10^−21 | (−2.996, −2.764)
Table 13. Results of ablation studies.

Forecasting Model | MAE (T1) | MAPE (T1) | RMSE (T1) | MAE (T2) | MAPE (T2) | RMSE (T2)
Ours | 0.5258 | 2.75% | 0.6353 | 0.9995 | 2.73% | 1.2158
Ours w/o Attention Mechanism | 0.6134 | 4.10% | 1.2381 | 1.3075 | 4.20% | 1.9003
Ours w/o Beluga Whale Optimization | 0.5966 | 3.69% | 0.9080 | 1.2687 | 3.96% | 1.7360
TCN | 1.0791 | 6.27% | 1.4863 | 2.0312 | 5.48% | 2.4929
BiGRU | 1.3708 | 7.55% | 1.6072 | 2.2461 | 5.91% | 2.7853
Table 14. MAE, MAPE, and RMSE of different models under varying time window lengths on Transformer 1 and Transformer 2.

Time Window | Model | Transformer | MAE | MAPE | RMSE
3 | Ours | T1 | 0.7114 | 3.26% | 0.7382
3 | CNN-BiLSTM-Attention | T1 | 0.8821 | 4.02% | 0.9213
3 | CNN-GRU-Attention | T1 | 1.1684 | 5.76% | 1.3045
3 | Informer | T1 | 1.0357 | 5.12% | 1.1876
6 | Ours | T1 | 0.5258 | 2.75% | 0.6353
6 | CNN-BiLSTM-Attention | T1 | 0.7120 | 3.73% | 0.8163
6 | CNN-GRU-Attention | T1 | 1.3171 | 7.14% | 1.5689
6 | Informer | T1 | 1.1387 | 6.11% | 1.3798
12 | Ours | T1 | 0.6186 | 2.98% | 0.7038
12 | CNN-BiLSTM-Attention | T1 | 0.9362 | 4.54% | 1.0589
12 | CNN-GRU-Attention | T1 | 1.4829 | 7.61% | 1.7193
12 | Informer | T1 | 1.2534 | 6.42% | 1.5126
24 | Ours | T1 | 0.8279 | 4.21% | 0.7905
24 | CNN-BiLSTM-Attention | T1 | 1.1127 | 5.06% | 1.2430
24 | CNN-GRU-Attention | T1 | 1.7345 | 8.23% | 2.0367
24 | Informer | T1 | 1.4369 | 6.87% | 1.7318
3 | Ours | T2 | 1.3411 | 3.52% | 1.7496
3 | CNN-BiLSTM-Attention | T2 | 1.6825 | 4.17% | 2.0127
3 | CNN-GRU-Attention | T2 | 1.7931 | 4.69% | 2.1390
3 | Informer | T2 | 1.6218 | 4.58% | 1.9873
6 | Ours | T2 | 0.9995 | 2.73% | 1.2158
6 | CNN-BiLSTM-Attention | T2 | 2.0097 | 5.69% | 2.4215
6 | CNN-GRU-Attention | T2 | 1.9634 | 5.61% | 2.3722
6 | Informer | T2 | 1.6531 | 4.65% | 2.0685
12 | Ours | T2 | 1.2827 | 3.39% | 1.5785
12 | CNN-BiLSTM-Attention | T2 | 2.1138 | 5.83% | 2.5102
12 | CNN-GRU-Attention | T2 | 2.0273 | 5.75% | 2.4683
12 | Informer | T2 | 1.7192 | 4.87% | 2.1037
24 | Ours | T2 | 1.5070 | 3.84% | 1.8206
24 | CNN-BiLSTM-Attention | T2 | 2.3196 | 6.02% | 2.7394
24 | CNN-GRU-Attention | T2 | 2.1425 | 6.10% | 2.6347
24 | Informer | T2 | 1.8563 | 5.03% | 2.3046
Table 15. MAPE (%) of Transformers 1 and 2 in different seasons for each model.

Model | Transformer | Spring | Summer | Autumn | Winter
ELM | Transformer 1 | 10.09 | 11.27 | 10.62 | 8.28
ELM | Transformer 2 | 6.11 | 6.76 | 6.38 | 6.57
PSO-SVR | Transformer 1 | 7.73 | 8.39 | 7.45 | 7.20
PSO-SVR | Transformer 2 | 6.42 | 8.08 | 7.33 | 5.81
Informer | Transformer 1 | 6.11 | 8.17 | 5.72 | 5.54
Informer | Transformer 2 | 4.65 | 6.35 | 4.57 | 4.08
CNN-BiLSTM-Attention | Transformer 1 | 3.73 | 4.30 | 3.86 | 3.64
CNN-BiLSTM-Attention | Transformer 2 | 5.69 | 6.04 | 6.26 | 6.09
CNN-GRU-Attention | Transformer 1 | 7.14 | 8.01 | 7.70 | 6.20
CNN-GRU-Attention | Transformer 2 | 5.61 | 5.79 | 5.57 | 5.30
Adjusted TCN-BiGRU-Attention | Transformer 1 | 2.75 | 3.44 | 3.93 | 2.46
Adjusted TCN-BiGRU-Attention | Transformer 2 | 2.73 | 2.78 | 3.07 | 2.05
