Article

A Hybrid Transformer-Generative Adversarial Network-Gated Recurrent Unit Model for Intelligent Load Balancing and Demand Forecasting in Smart Power Grids

1 Department of Management Information Systems, Spears School of Business, Oklahoma State University, Stillwater, OK 74074, USA
2 Department of Computer Science, Iowa State University, Ames, IA 50010, USA
3 Department of Systems Engineering, Stevens Institute of Technology, Hoboken, NJ 07030, USA
4 Department of Computer Science, Escuela de Ingeniería Informática de Segovia, Universidad de Valladolid, 40005 Segovia, Spain
5 Department of Applied Mathematics, Escuela de Ingeniería Informática de Segovia, Universidad de Valladolid, 40005 Segovia, Spain
* Author to whom correspondence should be addressed.
Electronics 2026, 15(8), 1579; https://doi.org/10.3390/electronics15081579
Submission received: 4 March 2026 / Revised: 3 April 2026 / Accepted: 6 April 2026 / Published: 10 April 2026

Abstract

Accurate demand forecasting and adaptive load balancing are critical for maintaining stability and efficiency in modern smart power grids. This study proposes a hybrid deep learning (DL) framework, termed Transformer-Generative Adversarial Network-Gated Recurrent Unit (Transformer-GAN-GRU), which integrates global attention-based temporal modeling, generative data augmentation, and sequential refinement into a unified architecture. The proposed framework captures both long- and short-term dependencies while improving representation of imbalanced demand patterns. The model is evaluated on three heterogeneous benchmark datasets, namely Pecan Street, the reliability test system-grid modernization laboratory consortium (RTS-GMLC), and the reference energy disaggregation dataset (REDD). Experimental results demonstrate that the proposed model consistently outperforms state-of-the-art baselines, achieving a maximum accuracy (Acc) of 99.49%, a recall of 99.67%, and an area under the curve (AUC) of 99.83%. In addition to high predictive performance, the framework exhibits strong stability, fast convergence, and low inference latency, confirming its suitability for real-time deployment in smart grid environments.

1. Introduction

The rapid evolution of modern smart power grids has fundamentally transformed the way electricity is generated, distributed, and consumed [1,2,3]. The increasing penetration of renewable energy sources, the proliferation of distributed energy resources, and the dynamic behavior of end-users have significantly increased the complexity of grid operation [4,5,6]. Unlike traditional power systems with relatively predictable load patterns, contemporary smart grids exhibit high variability, nonlinear demand dynamics, and strong temporal dependencies [7]. Under such conditions, accurate demand forecasting and intelligent load balancing are no longer optional optimization tools but essential components for maintaining grid stability, operational efficiency, and economic sustainability [8]. Inaccurate load prediction can lead to severe operational consequences, including voltage instability, frequency deviation, inefficient generator dispatch, and increased operational costs [9,10,11,12]. Moreover, peak demand misestimation may trigger unnecessary reserve activation or, conversely, insufficient supply allocation, both of which compromise grid reliability [13,14,15]. As power systems move toward more decentralized and data-driven architectures, the ability to anticipate demand fluctuations and dynamically balance supply becomes increasingly critical [16]. Therefore, developing robust, scalable, and high-precision forecasting frameworks has become a central research focus in smart energy management [17].
Traditional statistical methods, such as autoregressive models and regression-based approaches, often struggle to capture nonlinear interactions and multi-scale temporal dependencies inherent in smart grid data [18,19,20]. While these models may perform adequately under stationary conditions, their predictive capability deteriorates in the presence of renewable intermittency, demand spikes, and distribution imbalance. This limitation has motivated the adoption of advanced machine learning and deep learning techniques that can learn complex representations directly from high-dimensional time-series data [21]. In recent years, deep learning (DL) has emerged as a powerful paradigm for modeling complex temporal and nonlinear systems in energy forecasting applications. Recurrent architectures such as gated recurrent units (GRU) effectively capture short-term temporal dependencies and sequential dynamics, enabling responsive modeling of rapid demand fluctuations [22]. Attention-based architectures, particularly transformers, have demonstrated superior capability in modeling long-range dependencies and global contextual relationships across time horizons, making them well-suited for capturing seasonal patterns and delayed grid interactions [23]. Meanwhile, generative adversarial networks (GANs) provide a robust mechanism for learning underlying data distributions and addressing imbalance or rare-event representation challenges, which are common in peak load scenarios [24].
Despite the individual strengths of these architectures, existing approaches typically employ them in isolation, limiting their ability to simultaneously address distribution imbalance, long-term temporal structure, and short-term dynamic refinement. To overcome these limitations, this study proposes a unified Transformer–GAN–GRU framework that integrates generative data augmentation, global attention-based encoding, and gated sequential refinement within a single architecture. By combining these complementary mechanisms, the proposed model aims to enhance predictive accuracy, stability, and scalability for intelligent load balancing and demand forecasting in modern smart power grids.

1.1. Related Works

Fetaji et al. [25] propose an AI-driven framework for smart grid optimization that combines deep neural forecasting with reinforcement learning-based load management. Their approach addresses real-time optimization challenges and sustainability gaps using multiple public datasets, including ElectricityLoadDiagrams20112014, Pecan Street, and GEFCom2017. While the study demonstrates improved peak reduction and forecasting accuracy, its primary focus lies in operational policy optimization rather than deep architectural modeling of multi-scale temporal dependencies. Moreover, generative modeling and structured hybrid integration between attention and recurrent mechanisms are not explored, leaving room for more comprehensive temporal–distribution modeling strategies. Neubert et al. [26] investigate the detection of photovoltaic systems and electric vehicles in smart meter data using a hybrid CNN and multilayer perceptron classifier. By leveraging the Pecan Street dataset and augmented EV charging data, the study achieves strong detection accuracy for DER identification. However, the work concentrates on classification of distributed energy resources rather than forecasting or intelligent load balancing. Additionally, the architecture does not explicitly address long-range temporal dependencies or demand imbalance issues, which are critical for predictive grid management tasks.
Hadish et al. [27] introduce a Transformer-based architecture for residential energy consumption forecasting using the Pecan Street dataset. Their results highlight the strength of attention mechanisms in modeling long-term dependencies and achieving superior predictive accuracy compared to traditional models. Nevertheless, the study employs a standalone Transformer model without integrating generative or recurrent refinement modules. Consequently, short-term temporal smoothing and imbalance-aware modeling remain insufficiently addressed, limiting adaptability in highly dynamic grid conditions. Moreno Jaramillo et al. [28] develop a non-intrusive load monitoring (NILM) framework to identify distributed energy resources from smart meter net-demand data using conventional machine learning techniques. Their approach achieves high classification performance with relatively low computational complexity. Similarly, Brown et al. [29] propose a statistical disaggregation method to infer rooftop photovoltaic generation and installed capacity from censored smart meter readings. While both studies enhance grid observability, they focus primarily on disaggregation and DER identification rather than predictive load balancing. Furthermore, neither work incorporates advanced deep generative or attention-based architectures to capture multi-scale temporal complexity.
Liu [30] presents a hierarchical deep reinforcement learning framework for intelligent load balancing, integrating PPO with a dual-layer control architecture and a graph-based embedding network. The approach effectively addresses spatial and temporal variability in modern grids. Likewise, Zhou et al. [31] apply multi-agent DRL combined with graph neural networks for optimal power flow under high renewable penetration. Although reinforcement learning demonstrates strong adaptability, these works emphasize control optimization rather than high-precision multi-scale demand forecasting. The predictive backbone in such systems could benefit from more expressive hybrid deep learning architectures. Karuppiah and Thakur [32] propose a federated learning framework using GRU-based forecasting combined with metaheuristic appliance scheduling. Their privacy-preserving Split-FL architecture demonstrates strong performance across REDD, UK-DALE, and Pecan Street datasets. However, the model relies primarily on recurrent mechanisms without incorporating global attention modeling or generative distribution balancing. Similarly, Masoumi and Korkali [33] employ deep generative modeling with bidirectional GRU and attention for adversarially robust resource adequacy estimation. While their work highlights the robustness of generative modeling, it focuses on reliability estimation rather than integrated forecasting and load balancing.
Skorupa et al. [34] analyze probabilistic transmission planning under renewable uncertainties using RTS-GMLC simulations, emphasizing system-level risk assessment. Chen et al. [35] explore multimicrogrid load balancing through EV charging network coordination, modeling the interaction between transportation systems and smart grids. These studies underscore the growing complexity of grid operation but do not directly address unified deep learning architectures for predictive load management. Akbar et al. [36] propose a semi-supervised NILM framework using TCN and LSTM, demonstrating improved appliance state detection. Nonetheless, the approach targets disaggregation tasks rather than scalable, distribution-aware load forecasting.

1.2. Research Gaps in the Literature

Despite the substantial body of research on load forecasting and intelligent energy management, several critical limitations remain in the existing literature. First, many studies rely on single-paradigm models such as standalone long short-term memory (LSTM), deep convolutional neural network (DCNN), or transformer architectures, which focus primarily on either sequential dependency or feature extraction, but not both in a structured and complementary manner. This often results in partial modeling of smart grid dynamics, where long-term seasonal trends and short-term fluctuations are not simultaneously captured. Second, numerous works evaluate models primarily on predictive accuracy without addressing convergence stability, statistical robustness, or operational feasibility, leading to architectures that perform well numerically but lack reproducibility and scalability in real-world deployments.
A significant research gap also exists in modeling multi-scale temporal dependencies within heterogeneous grid datasets. While recurrent networks such as LSTM and GRU effectively capture short-term dynamics, they often struggle with long-range correlations and delayed demand effects. Conversely, Transformer-based approaches model global temporal dependencies but may overlook fine-grained local transitions if not properly structured. Most existing studies do not explicitly separate or hierarchically assign these roles within a unified framework. Additionally, few works investigate structured cooperation between attention-based and recurrent mechanisms, leaving an open gap in designing scale-aware hybrid architectures for energy forecasting.
Another critical gap lies in handling data imbalance and rare-event representation, particularly in peak demand and high-variance operational states. Smart grid datasets frequently exhibit skewed distributions, yet many forecasting models are trained directly on raw data without addressing imbalance, leading to biased predictions and reduced sensitivity to critical high-demand scenarios. Although GANs have been applied in some energy-related tasks, their integration is often limited to preprocessing stages rather than being embedded within a unified predictive pipeline. Furthermore, limited attention has been given to evaluating how generative augmentation influences convergence speed, stability, and statistical significance in time-series forecasting contexts.
Finally, most prior works focus either on residential-level datasets or system-scale simulations, without validating generalization across heterogeneous data granularities. This creates a gap in understanding whether a single architecture can scale from appliance-level variability to grid-level operational complexity. Additionally, limited studies conduct comprehensive evaluations that combine predictive metrics, convergence behavior, variance analysis, inference latency, and statistical hypothesis testing within one framework. As a result, there remains a need for a robust, distribution-aware, multi-scale, and statistically validated architecture capable of addressing the intrinsic complexity of intelligent load balancing in modern smart power grids. The key research gaps identified in the existing literature can be summarized as follows:
  • Over-reliance on single-paradigm deep learning models without structured hybridization.
  • Limited joint modeling of long-term and short-term temporal dependencies.
  • Lack of hierarchical separation between global attention and local sequential refinement.
  • Insufficient handling of data imbalance and rare peak-load events.
  • Limited integration of generative modeling within unified forecasting pipelines.
  • Inadequate evaluation of convergence stability and training robustness.
  • Weak cross-dataset generalization analysis across heterogeneous grid scales.
  • Absence of comprehensive statistical validation and deployment-oriented evaluation metrics.
The motivation of this paper arises directly from the limitations identified in existing smart grid forecasting research. Modern load dynamics exhibit multi-scale temporal behavior, nonlinear interactions, and distribution imbalance, yet most prior models address these characteristics in isolation. Architectures focused solely on recurrent mechanisms tend to capture short-term transitions but fail to model long-range dependencies effectively. Conversely, attention-based models capture global patterns but may overlook localized fluctuations critical for fine-grained load balancing decisions. This fragmentation in modeling strategies motivated the need for a unified framework that explicitly decomposes the forecasting problem into complementary learning tasks rather than relying on a single architectural paradigm.
The integration of a Transformer module is motivated by the necessity to capture long-term dependencies and global contextual relationships inherent in smart grid data. Electricity demand is influenced by daily cycles, weekly seasonality, and delayed system interactions, which require attention-based mechanisms capable of modeling non-local correlations. However, global attention alone is insufficient for stabilizing short-term demand transitions. Therefore, the inclusion of a GRU module is driven by the need to refine local sequential dynamics and provide gated memory control, ensuring stable modeling of rapid fluctuations and transitional states in load behavior.
Furthermore, the incorporation of a GAN module is motivated by the persistent issue of distribution imbalance and rare-event underrepresentation in smart grid datasets. Peak load states and abrupt demand changes are often sparsely distributed yet critically important for operational stability. Traditional supervised models trained on imbalanced data tend to bias toward dominant patterns. By integrating generative adversarial learning within the predictive pipeline, the framework enhances representation of under-sampled demand states and stabilizes feature distributions, thereby improving sensitivity and robustness.
The selection of the three benchmark datasets (Pecan Street, reliability test system–grid modernization laboratory consortium (RTS-GMLC), and reference energy disaggregation dataset (REDD)) also aligns with the core motivation of developing a scalable and generalizable solution. These datasets represent heterogeneous operational scales, from residential-level consumption to appliance-level granularity and grid-scale operational complexity. Evaluating the proposed architecture across these distinct environments ensures that the model addresses not only predictive performance but also adaptability and structural scalability. Collectively, these motivations guided the design of the proposed Transformer-GAN-GRU framework as a structured, distribution-aware, and multi-scale solution for intelligent load balancing and demand forecasting in modern smart power systems.

1.3. Paper Contributions and Organization

The main contributions of this paper are as follows:
  • This paper introduces a new hybrid DL framework termed Transformer–GAN–GRU for intelligent load balancing and demand forecasting in smart power grids. The proposed model integrates attention-based global modeling, generative distribution learning, and gated sequential refinement into a unified predictive architecture specifically designed to address multi-scale temporal dynamics and distribution imbalance in energy datasets.
  • The proposed framework structurally decomposes the forecasting task into complementary learning stages. The Transformer module captures long-range temporal dependencies and global contextual patterns; the GAN module enhances representation of imbalanced and rare demand states through generative distribution modeling; and the GRU module refines short-term temporal transitions using gated memory mechanisms. This synergistic integration improves predictive stability, convergence efficiency, and robustness compared to single-paradigm models.
  • The proposed model is extensively evaluated on three heterogeneous benchmark datasets (Pecan Street, REDD, and RTS-GMLC) representing residential, appliance-level, and grid-scale environments. Its performance is compared against seven state-of-the-art baseline architectures: Bidirectional Encoder Representations from Transformers (BERT), Transformer, deep belief network (DBN), GAN, LSTM, GRU, and DCNN.
  • The evaluation employs a comprehensive set of performance metrics, including accuracy (Acc), recall, area under the curve (AUC), root mean square error (RMSE), variance analysis, inference latency, runtime, and two-tailed t-test for statistical significance. Experimental results demonstrate that the proposed Transformer–GAN–GRU framework achieves superior predictive accuracy, faster convergence, extremely low variance, statistically significant improvements, and real-time inference feasibility, confirming its robustness and practical applicability for modern smart grid systems.
The remainder of this paper is structured as follows. Section 2 presents the overall research methodology, including the adopted datasets, the data preparation pipeline, and the detailed description of the core architectural components, namely the Transformer encoder, GAN module, GRU network, and the proposed integrated Transformer–GAN–GRU framework. Section 3 reports the experimental results, providing comprehensive quantitative evaluations of the proposed architecture against baseline models using multiple statistical and computational performance metrics. Section 4 discusses the findings in depth, analyzing model stability, scalability, statistical robustness, and real-world applicability for intelligent load balancing in smart power grids. Finally, Section 5 concludes the paper by summarizing the key outcomes and highlighting the broader implications of the proposed framework for next-generation energy management systems.

2. Materials and Methods

Figure 1 illustrates the overall workflow of the proposed intelligent load balancing and demand forecasting framework. The figure presents the complete processing pipeline, beginning with real-world smart grid data acquisition and ending with the evaluation of the trained hybrid model. It visually summarizes how raw energy consumption data are transformed through preprocessing, augmented and encoded via deep learning modules, and ultimately used to produce accurate load forecasts and adaptive balancing decisions. The flowchart reflects the hierarchical cooperation among the GAN, Transformer, and GRU modules within a unified architecture.
The process starts with the collection of real-world benchmark datasets representing heterogeneous smart grid environments. These datasets capture residential consumption dynamics, large-scale grid reliability conditions, and energy disaggregation patterns, ensuring that the proposed framework is evaluated under diverse operational scenarios. Before entering the learning architecture, the raw data undergo structured preprocessing, including missing value handling, normalization, temporal segmentation, and demand pattern encoding. This stage enhances statistical consistency and ensures that temporal dependencies are properly structured for sequential modeling. Detailed dataset descriptions and preprocessing procedures are presented in the following sections.
After preprocessing, the training data are passed through the generative module. The GAN component serves as a data-level enhancement mechanism rather than a forecasting tool. Its primary role is to mitigate distribution imbalance between peak and off-peak demand periods and to generate synthetic high-demand scenarios that are typically underrepresented in real datasets. By augmenting rare load spikes and extreme operating conditions, the GAN improves the robustness and generalization capability of the overall model. This generative augmentation strengthens the representation space before temporal encoding begins.
The augmented data are then processed by the Transformer encoder. The Transformer is responsible for capturing global temporal dependencies through self-attention mechanisms. Unlike conventional recurrent models that process sequences step-by-step, the Transformer evaluates the entire time window simultaneously, enabling it to learn long-range relationships such as daily cycles, seasonal variations, and trend-level demand correlations. This global attention-driven representation forms a high-level contextual embedding of the grid dynamics. The advantage of incorporating the Transformer lies in its ability to model complex, non-local temporal interactions that are critical in modern smart grids characterized by distributed generation and fluctuating demand patterns. Following global encoding, the contextual embeddings are passed to the GRU module. While the Transformer captures long-term dependencies, the GRU refines short-term sequential fluctuations and local temporal transitions. This recurrent refinement stage enhances prediction smoothness and improves responsiveness to rapid demand variations. The combination of attention-based global reasoning and gated recurrent local modeling provides a complementary hierarchical structure that increases forecasting precision and stability.
The final output layer produces the predicted load demand, which is then used to support intelligent load balancing decisions. By comparing predicted demand with available generation capacity, the framework enables proactive redistribution and adaptive control strategies. Performance is evaluated using multiple statistical metrics, including accuracy, recall, AUC, and runtime, ensuring both predictive reliability and computational efficiency. The key novelty of the proposed framework lies in its tri-layer integration strategy that unifies generative modeling, attention-based encoding, and recurrent refinement within a single end-to-end architecture. Unlike conventional forecasting models that rely solely on either recurrent networks or Transformers, this approach introduces data-level augmentation through GAN prior to feature-level temporal modeling. The hierarchical cooperation between global attention and local sequential learning enhances robustness against distribution shifts and extreme demand events. Furthermore, the unified design improves generalizability across heterogeneous grid scenarios without significantly increasing computational complexity. Detailed mathematical formulations and module-specific implementations are provided in the subsequent sections.

2.1. Dataset

The selection of appropriate benchmark datasets is critical for validating predictive load balancing frameworks in smart power grids. In this study, three widely recognized and heterogeneous datasets (Pecan Street, RTS-GMLC, and REDD) are utilized to ensure robustness, scalability, and generalizability of the proposed Transformer–GAN–GRU model. These datasets collectively represent residential-level consumption behavior, bulk power system operational dynamics, and appliance-level disaggregation scenarios. Their diversity allows the framework to be evaluated under different temporal resolutions, load variability patterns, and structural complexities, thereby strengthening the credibility of the proposed approach for real-world deployment.
The Pecan Street dataset is a high-resolution residential energy dataset collected from hundreds of households in the United States under the Dataport program. It provides granular electricity consumption data, typically sampled at 1 min or 15 min intervals, along with metadata including solar generation, electric vehicle charging, HVAC consumption, and weather information. The dataset contains millions of time-stamped measurements, enabling fine-grained temporal forecasting analysis. Key attributes include active power demand (kW), voltage measurements, solar photovoltaic output, temperature readings, and occupancy-related variables. Due to its high temporal resolution and renewable integration components, this dataset is particularly suitable for modeling short-term demand fluctuations and distributed generation effects. It does not contain cyber-attacks or fault labels, as it primarily represents normal operational residential behavior.
The RTS-GMLC dataset represents a synthetic but realistic large-scale transmission-level power system model developed for grid modernization research. It includes detailed generation profiles, load distributions, transmission constraints, contingency scenarios, and reliability metrics. The system consists of dozens of buses, generators, and transmission lines, providing structured data on power flows, generation dispatch, ramp rates, reserve requirements, and hourly load demand profiles. RTS-GMLC is particularly useful for studying grid-level balancing strategies and operational reliability under varying demand and renewable penetration conditions. Unlike residential datasets, it incorporates system-level constraints and variability but does not explicitly include cyber-attack annotations. Its structured topology and operational metadata make it ideal for validating scalability and grid-wide balancing performance.
The REDD dataset focuses on appliance-level electricity consumption within residential buildings. It contains both aggregate household power demand and individual appliance-level measurements, sampled at high frequency (up to 1 s resolution for some channels). The dataset includes multiple homes, with labeled appliance categories such as refrigerators, lighting systems, HVAC units, and washing machines. REDD enables evaluation of fine-grained load decomposition and short-term demand variability modeling. It contains hundreds of thousands of timestamped samples across multiple channels. While it is not an intrusion or attack dataset, its appliance-level labeling allows investigation of micro-pattern demand dynamics and load variability under heterogeneous consumption behaviors.
Collectively, these three datasets provide complementary perspectives on smart grid demand dynamics: residential variability (Pecan Street), system-level operational complexity (RTS-GMLC), and appliance-level granularity (REDD). The absence of cyber-attack labels reflects the forecasting-oriented nature of this study, which focuses on intelligent load balancing rather than intrusion detection. The combination of high-resolution temporal data, system-scale operational modeling, and fine-grained consumption patterns ensures that the proposed framework is validated across multiple real-world energy contexts. Detailed preprocessing strategies applied to these datasets are described in the next section.
To ensure statistical stability, temporal consistency, and reliable learning behavior in the proposed Transformer–GAN–GRU framework, a structured data preparation pipeline is applied prior to model training and evaluation. Real-world smart grid datasets inherently contain irregularities such as missing measurements, heterogeneous load magnitudes, and non-stationary temporal patterns. Therefore, the raw load sequences are systematically processed through missing data handling, normalization, temporal segmentation, and demand pattern encoding to construct a stable and informative input space for the hybrid architecture.
Energy consumption datasets frequently contain missing samples due to sensor malfunctions, communication dropouts, or recording inconsistencies. To mitigate distortion while preserving temporal continuity, a two-stage imputation strategy is adopted. For short missing intervals, linear interpolation is applied to maintain local smoothness without introducing artificial oscillations. The interpolated value at time index t is computed according to Equation (1):
$$x_t = x_{t_1} + \frac{t - t_1}{t_2 - t_1}\left(x_{t_2} - x_{t_1}\right)$$
where $t_1$ and $t_2$ denote the nearest valid timestamps surrounding the missing interval.
For longer missing intervals exceeding a predefined duration threshold τ , interpolation may generate unrealistic trends. In such cases, forward–backward filling is employed, where missing values are replaced with the closest available observation in time. Samples with missing ratios greater than a tolerance level γ are discarded to prevent bias during optimization.
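The two-stage imputation strategy described above can be sketched as follows. This is a minimal illustration, not the authors' released code; the function name `impute_series` and the parameter names `tau` (gap-length threshold) and `gamma` (missing-ratio tolerance) are assumptions chosen to mirror the symbols in the text.

```python
import numpy as np

def impute_series(x, tau=3, gamma=0.2):
    """Two-stage imputation sketch (hypothetical names tau, gamma).

    Gaps of length <= tau are linearly interpolated per Eq. (1); longer
    gaps are filled with the nearest available observation. Sequences
    whose missing ratio exceeds gamma are discarded (returns None).
    """
    x = np.asarray(x, dtype=float)
    if np.isnan(x).mean() > gamma:
        return None  # discard overly incomplete samples
    x = x.copy()
    isnan = np.isnan(x)
    n = len(x)
    i = 0
    while i < n:
        if isnan[i]:
            j = i
            while j < n and isnan[j]:
                j += 1                      # gap spans indices [i, j)
            t1, t2 = i - 1, j               # nearest valid neighbours
            if j - i <= tau and t1 >= 0 and t2 < n:
                for t in range(i, j):       # Eq. (1): linear interpolation
                    x[t] = x[t1] + (t - t1) / (t2 - t1) * (x[t2] - x[t1])
            else:                           # forward fill, else backward fill
                x[i:j] = x[t1] if t1 >= 0 else x[t2]
            i = j
        else:
            i += 1
    return x
```

For example, a single missing sample between 1.0 and 3.0 is interpolated to 2.0, while a four-sample gap (with `tau=2`) is forward-filled from the last valid reading.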
Since residential and grid-level load magnitudes vary significantly across datasets due to differences in infrastructure scale, appliance usage, and generation capacity, scale normalization is required to ensure balanced learning. Min–Max normalization is applied to each load sequence to bound the data within a fixed interval and prevent high-magnitude consumers from dominating the optimization process. The normalized load value $x'$ is computed as shown in Equation (2). This bounded transformation improves numerical stability during attention score computation in the Transformer and gated updates in the GRU. To avoid data leakage, normalization parameters are estimated exclusively from the training data and then consistently applied to the testing set.
$x' = \frac{x - x_{min}}{x_{max} - x_{min}}$
where $x_{min}$ and $x_{max}$ denote the minimum and maximum load values within each household profile.
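A minimal sketch of this leakage-free scaling, with the bounds fitted on the training split only (the function names are ours, not from the paper):

```python
import numpy as np

def fit_minmax(train: np.ndarray) -> tuple[float, float]:
    """Estimate normalization bounds from the training data only,
    so that no test-set information leaks into preprocessing."""
    return float(train.min()), float(train.max())

def apply_minmax(x: np.ndarray, x_min: float, x_max: float) -> np.ndarray:
    """Equation (2): x' = (x - x_min) / (x_max - x_min)."""
    return (x - x_min) / (x_max - x_min)
```

Note that test-set values may fall slightly outside [0, 1] when they exceed the training-set range, which is the expected behavior of leakage-free normalization.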
To enable structured sequential learning, the normalized load sequence $\{x_1, x_2, \ldots, x_T\}$ is segmented into fixed-length overlapping temporal windows using a sliding-window strategy. Each window is defined according to Equation (3):
$X_k = \{x_k, x_{k+1}, \ldots, x_{k+L-1}\}$
where L represents the window length; k denotes the starting index of each segment. A stride parameter s controls the degree of overlap, with s < L producing overlapping windows that enhance temporal continuity. The window length is selected to align with dominant consumption cycles, such as daily or sub-daily patterns, enabling the model to capture both short-term fluctuations and longer periodic structures.
Following segmentation, each window is transformed into a structured representation that captures intrinsic demand dynamics beyond raw magnitude values. For each segment X, statistical descriptors are extracted to summarize its behavior. The average load level is computed as shown in Equation (4), the intra-window trend is quantified using the gradient measure in Equation (5), and the variability within the segment is calculated as the variance given in Equation (6):
$\mu_k = \frac{1}{L}\sum_{i=0}^{L-1} x_{k+i}$
$g_k = \frac{x_{k+L-1} - x_k}{L}$
$\sigma_k^2 = \frac{1}{L}\sum_{i=0}^{L-1}\left(x_{k+i} - \mu_k\right)^2$
where $\mu_k$, $g_k$, and $\sigma_k^2$ denote the mean load level, directional trend, and intra-window variability, respectively. These descriptors provide a compact representation of rising, falling, stable, or highly dynamic demand conditions. By encoding both magnitude and temporal behavior, the resulting feature space improves the effectiveness of generative augmentation, enhances attention-based contextual modeling, and stabilizes sequential forecasting within the integrated Transformer–GAN–GRU framework.
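The segmentation and descriptor extraction steps can be combined in a short sketch; the window length `L=24` and stride `s=6` are placeholder choices standing in for the values selected via the sensitivity analysis:

```python
import numpy as np

def window_descriptors(x: np.ndarray, L: int = 24, s: int = 6) -> np.ndarray:
    """Segment a normalized load series into overlapping windows (stride s < L)
    and compute the per-window descriptors of Equations (4)-(6)."""
    feats = []
    for k in range(0, len(x) - L + 1, s):
        w = x[k:k + L]
        mu = w.mean()              # Equation (4): average load level
        g = (w[-1] - w[0]) / L     # Equation (5): intra-window trend
        var = w.var()              # Equation (6): intra-window variability
        feats.append((mu, g, var))
    return np.array(feats)
```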

2.2. Transformer Encoder

The transformer encoder is an attention-based sequence modeling architecture designed to capture global dependencies within temporal data through parallelized self-attention mechanisms. Unlike recurrent networks that process sequences sequentially, the Transformer evaluates all time steps simultaneously, enabling efficient modeling of long-range correlations and complex temporal interactions. This property is particularly suitable for smart power grid forecasting, where demand patterns exhibit multi-scale temporal structures such as hourly fluctuations, daily cycles, and seasonal trends. By leveraging attention weights to dynamically focus on relevant historical positions, the Transformer encoder enhances contextual representation quality and improves predictive robustness under variable demand conditions [37].
Figure 2 illustrates the internal structure of the Transformer encoder employed in this study. The architecture begins with input embeddings augmented by positional encoding to preserve temporal order information. The encoded inputs are then processed by a multi-head self-attention block, followed by a position-wise feed-forward neural network. Residual connections and layer normalization are applied after each sub-layer to stabilize training and improve gradient flow. The figure also details the scaled dot-product attention mechanism, where query, key, and value projections are computed, scaled, masked if necessary, and transformed through a softmax operation before weighted aggregation. The processing flow begins with transforming the input sequence into embedding vectors, which are enriched with positional encodings to retain sequence order. These representations are projected into query, key, and value spaces and passed into multiple parallel attention heads. Each head learns distinct relational patterns across time steps. The outputs of all heads are concatenated and linearly transformed before entering the feed-forward network. Residual connections ensure information preservation across layers, while normalization enhances convergence stability. The final encoder output provides a high-level contextual representation of the load sequence, which is subsequently refined by the GRU module [38].
The positional encoding mechanism is defined by Equations (7) and (8), which describe the sinusoidal functions used to inject deterministic position information into the embedding space. These equations ensure that each temporal index is represented uniquely while preserving relative positional relationships across the sequence [38].
$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right),$
$PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$
where $pos$ is the position index; $i$ is the dimension index; $d$ is the embedding size.
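Equations (7) and (8) can be implemented directly; this NumPy sketch assumes an even embedding size `d`:

```python
import numpy as np

def positional_encoding(seq_len: int, d: int) -> np.ndarray:
    """Sinusoidal positional encoding (Equations (7)-(8)): even dimensions
    use sine and odd dimensions use cosine of the same angle."""
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1) position indices
    i = np.arange(d // 2)[None, :]                # (1, d/2) dimension indices
    angles = pos / np.power(10000.0, 2 * i / d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```

Because the encoding is deterministic, it can be precomputed once for the maximum window length and added to the input embeddings at every forward pass.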
The linear projections that transform the input embedding into query, key, and value matrices are defined by Equations (9)–(11). These transformations map the same input representation into distinct subspaces, enabling relational comparison across time steps.
$Q = ZW^{Q},$
$K = ZW^{K},$
$V = ZW^{V}$
where $W^{Q}$, $W^{K}$, and $W^{V}$ are learnable weight matrices responsible for transforming the input into distinct subspaces.
The scaled dot-product attention mechanism is formally defined in Equation (12). This equation computes similarity scores between queries and keys, scales them to prevent gradient instability, applies softmax normalization, and aggregates the value vectors accordingly [39].
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_K}}\right)V$
where $d_K$ is the dimensionality of each attention head.
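A standalone NumPy sketch of Equation (12); the actual model uses PyTorch's native attention modules, so this version is purely illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Equation (12): softmax(Q K^T / sqrt(d_K)) V, computed row-wise."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # scaled similarity scores
    scores -= scores.max(axis=-1, keepdims=True)     # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax normalization
    return weights @ V                               # weighted aggregation of values
```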
To enhance representation diversity, multiple attention heads operate in parallel. The aggregation of these heads is defined in Equations (13) and (14), where each head performs independent attention before concatenation and linear transformation.
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h)\,W^{O},$
$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$
Following the attention layer, a position-wise feed-forward neural network refines the representation. Equation (15) defines this transformation, which introduces nonlinearity and increases expressive capacity [39].
$\mathrm{FFNN}(x) = \max(0,\ xW_1 + b_1)\,W_2 + b_2$
where $x$ is the input vector corresponding to a single token or position in the sequence; $W_1$ and $W_2$ are weight matrices; $b_1$ and $b_2$ are bias vectors.
Finally, residual connections combined with layer normalization are applied to stabilize deep stacking and preserve contextual information. These operations are expressed in Equations (16) and (17).
$Z' = \mathrm{LayerNorm}(Z + \mathrm{MultiHead}(Q, K, V)),$
$Z_{out} = \mathrm{LayerNorm}(Z' + \mathrm{FFNN}(Z'))$
Through this layered attention-driven encoding process, the Transformer module produces globally contextualized load representations capable of modeling complex inter-temporal dependencies. This structured embedding significantly enhances forecasting accuracy when integrated with the subsequent generative and recurrent components of the proposed hybrid framework.
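Equations (16) and (17) reduce to a simple "residual then normalize" pattern, sketched here in NumPy (without the learnable gain and bias that full layer normalization typically includes):

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Per-position layer normalization used in Equations (16)-(17)."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def encoder_sublayer(z: np.ndarray, sublayer) -> np.ndarray:
    """Residual connection followed by normalization: LayerNorm(z + f(z))."""
    return layer_norm(z + sublayer(z))
```

The residual path lets gradients bypass each sub-layer, which is what stabilizes the deep stacking described above.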

2.3. GAN

GANs are deep generative models designed to learn the underlying distribution of real data through an adversarial training process between two competing neural networks: a generator and a discriminator. The generator attempts to synthesize data samples that resemble real observations, while the discriminator aims to distinguish between authentic and synthetic samples. This competitive interaction drives the generator to progressively improve its synthetic outputs until they become statistically indistinguishable from real data. In the context of smart power grid forecasting, GANs are particularly valuable for addressing data imbalance and limited representation of rare demand conditions such as peak load spikes or abnormal consumption transitions. By augmenting the training dataset with high-quality synthetic samples, the GAN enhances robustness, reduces overfitting, and improves generalization of the downstream Transformer–GRU forecasting architecture [40].
Figure 3 illustrates the adversarial structure of the GAN employed in this study. The real dataset is fed directly to the discriminator, while the generator produces synthetic load samples from latent noise vectors. Both real and generated samples are evaluated by the discriminator, which outputs a probability indicating whether a sample is real or fake. Through iterative adversarial updates, the discriminator becomes better at classification, and the generator becomes better at deception. This adversarial dynamic enables the model to approximate the true load distribution more accurately. The operational flow begins by sampling latent noise vectors from a predefined distribution and passing them through the generator network to produce synthetic load profiles. These generated samples are combined with real observations and fed into the discriminator. The discriminator computes classification probabilities and updates its parameters to improve real–fake discrimination accuracy. Simultaneously, the generator updates its parameters based on the discriminator’s feedback, minimizing its ability to be detected as fake. Over successive training iterations, this process leads to improved synthetic data realism and enhanced representation diversity for rare demand patterns [40].
The fundamental adversarial objective that governs the interaction between the generator and discriminator is defined in Equation (18). This minimax formulation describes how the discriminator maximizes classification accuracy while the generator attempts to minimize it [41].
$\min_G \max_D V(D, G) = \mathbb{E}_{X \sim p_{data}(x)}[\log D(X)] + \mathbb{E}_{Z \sim p_Z(z)}[\log(1 - D(G(Z)))]$
where $p_Z(z)$ denotes the prior distribution over the latent noise vectors; $p_{data}(x)$ represents the distribution of real data; $D(G(Z))$ denotes the probability assigned to a sample synthesized by the generator; $D(X)$ refers to the probability output of the discriminator for a real input $X$.
To explicitly optimize the discriminator, Equation (19) defines the discriminator loss function, which penalizes incorrect classification of real and synthetic samples.
$L_D = -\mathbb{E}_{X \sim p_{data}(x)}[\log D(X)] - \mathbb{E}_{Z \sim p_Z(z)}[\log(1 - D(G(Z)))]$
where $L_D$ is the loss function for the discriminator; minimizing $L_D$ is equivalent to maximizing the discriminator term of the minimax objective in Equation (18).
The generator’s objective is defined in Equation (20), where the generator aims to minimize the probability that generated samples are identified as fake. Through this adversarial optimization process, the GAN learns to approximate the true distribution of smart grid load profiles. The resulting synthetic samples enrich underrepresented demand states, strengthen statistical diversity, and improve the stability of the integrated Transformer–GAN–GRU forecasting framework [41].
$L_G = \mathbb{E}_{Z \sim p_Z(z)}[\log(1 - D(G(Z)))]$
where $L_G$ is the loss function for the generator.
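Written as minimization objectives, the discriminator and generator losses of Equations (19) and (20) take the following form, where `d_real` and `d_fake` are discriminator output probabilities for real and synthetic batches:

```python
import numpy as np

def discriminator_loss(d_real: np.ndarray, d_fake: np.ndarray) -> float:
    """Equation (19): binary cross-entropy penalizing misclassification
    of real samples (should score near 1) and fake samples (near 0)."""
    return float(-(np.log(d_real).mean() + np.log(1.0 - d_fake).mean()))

def generator_loss(d_fake: np.ndarray) -> float:
    """Equation (20): the generator minimizes log(1 - D(G(z))), i.e., its
    loss drops as the discriminator is fooled more often."""
    return float(np.log(1.0 - d_fake).mean())
```

In practice many implementations swap Equation (20) for the non-saturating variant that maximizes log D(G(z)) to avoid vanishing generator gradients early in training, but the formulation above matches the objective stated in the paper.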

2.4. GRU

The GRU is a recurrent neural network (RNN) architecture specifically designed to model sequential dependencies while mitigating the vanishing gradient problem commonly observed in traditional RNNs. Unlike conventional recurrent structures, the GRU incorporates gating mechanisms that regulate information flow across time steps, enabling the model to selectively retain or discard historical information. In the context of smart power grid forecasting, demand sequences exhibit both smooth transitions and sudden fluctuations. The GRU is particularly suitable for capturing these localized temporal variations, complementing the global contextual representation provided by the Transformer encoder. Its relatively compact structure compared to long short-term memory (LSTM) reduces computational overhead while maintaining strong temporal modeling capability, making it well-suited for scalable load prediction systems [42].
Figure 4 illustrates the sequential configuration of the GRU-based temporal learning module. The structure consists of stacked GRU cells that process input sequences step-by-step, passing hidden states from one time index to the next. Each GRU cell receives the current input vector along with the previous hidden state, generating an updated hidden representation. The final hidden state is forwarded to a dense layer and subsequently to the output layer for load prediction. This chained architecture enables progressive refinement of temporal information and ensures smooth state transitions across time windows [42].
The internal operational mechanism of a GRU cell is depicted in Figure 5. The diagram highlights the two primary gating structures: the reset gate and the update gate. The reset gate determines how much past information should be ignored when computing the candidate hidden state, effectively controlling memory reset during abrupt changes. The update gate governs how much of the previous hidden state should be retained versus replaced by newly computed information. Together, these gates dynamically balance memory preservation and adaptation, allowing the network to respond efficiently to both stable demand patterns and rapid load variations. This adaptive gating mechanism is central to the GRU’s effectiveness in modeling smart grid temporal dynamics [43].
The mathematical formulation of the GRU begins with the computation of the update gate, which regulates the retention of historical information. Equation (21) defines this gate as a sigmoid transformation of the current input and previous hidden state [44].
$Z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$
where $Z_t$ is the update gate; $h_{t-1}$ is the previous hidden state; $x_t$ is the input vector; $W_z$ and $U_z$ are learnable weight matrices; $b_z$ is the bias term; $\sigma$ is the sigmoid function.
Next, the reset gate is computed as shown in Equation (22). This gate controls how strongly the previous hidden state contributes to the candidate state calculation.
$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$
where $r_t$ is the reset gate; $W_r$ and $U_r$ are learnable weight matrices; $b_r$ is the bias term.
The candidate hidden state, which represents newly computed information based on the current input and the gated previous state, is defined in Equation (23). The reset gate modulates the contribution of past memory during this computation [44].
$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$
where $W_h$ and $U_h$ are learnable weight matrices; $b_h$ is the bias term; $\odot$ denotes element-wise multiplication.
Finally, the updated hidden state is calculated as a weighted combination of the previous hidden state and the candidate state. Equation (24) formalizes this interpolation process, governed by the update gate. Through this gated update mechanism, the GRU effectively captures short-term temporal fluctuations while preserving relevant historical dependencies. When integrated with the attention-driven Transformer and the distribution-enhancing GAN, the GRU serves as the final temporal refinement layer, producing stable and high-precision load forecasts for intelligent smart grid balancing [44].
$h_t = (1 - Z_t) \odot h_{t-1} + Z_t \odot \tilde{h}_t$
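A single GRU update (Equations (21)-(24)) in NumPy; `params` is a hypothetical container for the gate weights, introduced here only for illustration:

```python
import numpy as np

def sigmoid(a: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t: np.ndarray, h_prev: np.ndarray, params) -> np.ndarray:
    """One GRU cell update: update gate, reset gate, candidate state,
    then gated interpolation between old and candidate states."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z_t = sigmoid(x_t @ Wz + h_prev @ Uz + bz)               # Equation (21): update gate
    r_t = sigmoid(x_t @ Wr + h_prev @ Ur + br)               # Equation (22): reset gate
    h_cand = np.tanh(x_t @ Wh + (r_t * h_prev) @ Uh + bh)    # Equation (23): candidate state
    return (1.0 - z_t) * h_prev + z_t * h_cand               # Equation (24): interpolation
```

With all weights at zero, both gates output 0.5 and the candidate state is zero, so the new hidden state is exactly half the previous one, illustrating the interpolation of Equation (24).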

2.5. Proposed Transformer-GAN-GRU

Figure 6 presents the complete architecture of the proposed Transformer–GAN–GRU framework. The figure illustrates how generative data augmentation, global attention-based encoding, and sequential gated learning are unified into a single end-to-end predictive system. The upper section represents the Transformer encoder, the lower-left block corresponds to the GAN module, and the central sequential block represents the GRU-based temporal predictor. The architecture highlights the flow of information from raw input sequences to synthetic augmentation, contextual encoding, and final demand prediction. The interaction between modules follows a hierarchical structure. Initially, preprocessed load sequences are passed to the GAN module, where synthetic samples are generated to enrich the distribution of rare or underrepresented demand patterns. These augmented sequences are merged with real data and fed into the Transformer encoder. The Transformer extracts global contextual representations by modeling long-range dependencies across the temporal window. The encoded embeddings are then passed into the GRU layer, which refines short-term transitions and produces the final load forecast. This layered interaction ensures that data diversity, global attention modeling, and local temporal adaptation are jointly optimized.
The GAN module operates at the data level. Its primary responsibility is distribution enhancement. By generating synthetic load profiles that resemble extreme or infrequent demand states, it mitigates class imbalance and improves the robustness of subsequent temporal modeling. The output of the generator is not directly used for prediction; instead, it expands the training space and strengthens the statistical representation of the dataset. The Transformer module operates at the feature level. It converts augmented input sequences into high-dimensional contextual embeddings. Through multi-head self-attention, it captures non-local temporal correlations, such as delayed demand responses and periodic cycles. This stage transforms raw sequential inputs into globally informed representations suitable for refined temporal processing.
The GRU module operates at the sequence refinement level. While the Transformer captures global structure, the GRU emphasizes local temporal continuity and short-term fluctuations. It processes contextual embeddings sequentially and produces the final forecast vector. This combination allows the architecture to balance long-term reasoning with immediate temporal adaptation. The novelty of the proposed framework lies in its tri-level integration strategy. Unlike conventional approaches that independently apply GAN-based augmentation or attention-based forecasting, this method integrates generative modeling, attention encoding, and gated sequential learning within a unified training pipeline. The GAN enhances data distribution, the Transformer extracts global dependencies, and the GRU ensures temporal smoothness. This layered cooperation improves stability under distribution shifts and enhances generalization across heterogeneous grid scenarios without significantly increasing computational complexity. Formally, the generative process is defined in Equation (25), which represents the mapping from latent noise vectors to synthetic load samples.
$\tilde{X} = G(z), \qquad z \sim p_z(z)$
where $G(\cdot)$ denotes the generator network; $p_z(z)$ is the prior latent distribution; $X \in \mathbb{R}^{L \times d}$ is the original sequence, with $L$ the sequence length and $d$ the input feature dimension.
The augmented training set is defined in Equation (26) as the union of real and generated samples.
$X_{aug} = \{X, \tilde{X}\}$
The Transformer encoder then maps the augmented sequence into contextual embeddings, as expressed in Equation (27).
$H = \mathrm{Transformer}(X_{aug})$
where $H \in \mathbb{R}^{L \times d}$ is the temporal embedding output; $d$ is the embedding size after transformation; $\mathrm{Transformer}(\cdot)$ denotes the full encoder stack applied to the sequence.
The GRU module refines these contextual embeddings to produce the final forecast sequence, as defined in Equation (28).
$\hat{Y} = \mathrm{GRU}(H)$
where $\hat{Y} \in \mathbb{R}^{L \times 1}$ denotes the predicted load output.
To jointly optimize the architecture, a unified objective function is defined in Equation (29), combining forecasting accuracy and adversarial distribution alignment. The forecasting loss is formulated in Equation (30) as the mean squared error between true and predicted loads.
$\varphi_{total} = \lambda_1 \varphi_{forecast} + \lambda_2 \varphi_{GAN}$
$\varphi_{forecast} = \frac{1}{N}\sum_{i=1}^{N}\left(Y_i - \hat{Y}_i\right)^2$
where $Y_i$ denotes the true load value; $\hat{Y}_i$ the predicted output; $N$ the number of training samples; and $\lambda_1$ and $\lambda_2$ are balancing coefficients controlling the contribution of each loss component.
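The joint objective of Equations (29) and (30) can be sketched as follows; the default weighting values are illustrative, not the coefficients tuned in the experiments:

```python
import numpy as np

def total_loss(y_true: np.ndarray, y_pred: np.ndarray, gan_loss: float,
               lam1: float = 1.0, lam2: float = 0.1) -> float:
    """Equations (29)-(30): weighted sum of the MSE forecasting loss and
    the adversarial loss term."""
    forecast = np.mean((y_true - y_pred) ** 2)   # Equation (30): MSE term
    return float(lam1 * forecast + lam2 * gan_loss)
```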
Through this integrated formulation, the Transformer–GAN–GRU framework simultaneously enhances data diversity, captures global temporal dependencies, and refines local sequential dynamics, resulting in a scalable and high-precision solution for intelligent load balancing in modern smart power grids.
Beyond the structural integration, the proposed Transformer–GAN–GRU framework introduces several methodological innovations tailored specifically to intelligent load balancing in smart power grids. Unlike conventional forecasting pipelines where data augmentation and temporal modeling are treated as independent preprocessing and prediction stages, the proposed architecture establishes a dependency-aware augmentation mechanism. The GAN is not used merely to increase data volume, but to strategically enrich underrepresented demand regimes, particularly peak-load and rapid-transition states that critically influence balancing decisions. This targeted distribution shaping ensures that the forecasting model is trained on a statistically more symmetric and operationally realistic representation of grid behavior.
Another key innovation lies in the hierarchical separation of temporal abstraction levels. The Transformer encoder is responsible for global dependency extraction, modeling long-range correlations and periodic structures across the entire temporal window, while the GRU focuses exclusively on local sequential refinement and short-term volatility adaptation. This deliberate functional decomposition prevents redundancy between attention-based and recurrent mechanisms, allowing each module to operate within its optimal temporal scale. As a result, the framework achieves improved stability under multi-scale demand fluctuations compared to standalone Transformer or GRU architectures.
Furthermore, the proposed method introduces a unified optimization perspective that jointly considers adversarial distribution alignment and forecasting accuracy within a single objective function. Rather than freezing the GAN after augmentation, the framework maintains adversarial learning influence during training, allowing the generative component to continuously adapt to evolving embedding representations. This coordinated learning strategy enhances generalization across heterogeneous datasets and reduces sensitivity to domain shifts, making the architecture particularly suitable for real-world smart grid environments characterized by distributed generation, dynamic consumption patterns, and operational uncertainty. Collectively, these contributions distinguish the proposed Transformer–GAN–GRU model from conventional hybrid approaches by transforming it from a simple architectural combination into a structured, scale-aware, and distribution-adaptive predictive system specifically optimized for intelligent load balancing applications.

3. Results

All experiments were implemented in Python 3.10.13 using a unified DL environment to ensure consistency across the proposed model and all benchmark architectures. The primary framework for model development was PyTorch 2.1.0 with CUDA 11.8 acceleration. Supporting libraries included NumPy 1.26, Pandas 2.1, and SciPy 1.11 for numerical processing and data handling. Dataset splitting, normalization utilities, and performance metrics were implemented using Scikit-learn 1.3. Visualization and result plotting were performed using Matplotlib 3.8 and Seaborn 0.13. For the Transformer–GAN–GRU architecture, native PyTorch modules were used for implementing multi-head attention, recurrent layers, and adversarial training. The BERT baseline was implemented using the HuggingFace Transformers 4.35 library with a time-series adaptation layer. The DBN model was implemented using stacked restricted Boltzmann Machines constructed via PyTorch-based custom layers. The LSTM and DCNN models were implemented using PyTorch’s nn.LSTM and convolutional modules to ensure architectural parity and fair comparison. All models were trained under identical preprocessing conditions and optimization strategies to maintain experimental consistency.
Experiments were conducted on a workstation equipped with an Intel Core i7-12700K CPU, 32 GB RAM, and an NVIDIA RTX 3080 GPU (10 GB VRAM) running Ubuntu 22.04 LTS with CUDA-enabled acceleration. For each dataset, samples were randomly divided into 70% training, 15% validation, and 15% testing sets. Hyper-parameters were tuned using the validation subset, and final performance metrics were computed solely on the unseen test set. Early stopping with a patience threshold was applied to prevent overfitting, and each experiment was repeated three times with different random seeds to ensure statistical stability of the reported results. To avoid any potential data leakage, all preprocessing steps, including normalization and GAN-based data augmentation, were applied exclusively to the training set. The validation and test sets were strictly kept unseen during both model training and synthetic data generation. In particular, the GAN model was trained only on the training data and used to augment training samples without introducing any information from validation or test sets. This ensures a fair and unbiased evaluation of the model performance.
The proposed Transformer-GAN-GRU framework is compared against seven established DL architectures: BERT, Transformer, deep belief network (DBN), GAN, LSTM, GRU, and deep convolutional neural network (DCNN). These architectures were selected to represent diverse modeling paradigms, including attention-based contextual learning, probabilistic generative pretraining, recurrent sequential modeling, and convolutional feature extraction. BERT is included as a strong transformer-based baseline capable of capturing bidirectional contextual dependencies, making it suitable for evaluating the benefit of adversarial augmentation and recurrent refinement in time-series forecasting. DBN is selected as a representative deep generative model that captures hierarchical feature representations through unsupervised pretraining, providing a probabilistic comparison framework. LSTM serves as a widely adopted recurrent benchmark for sequential demand forecasting, enabling direct comparison between traditional gated memory mechanisms and the proposed hybrid approach.
DCNN is incorporated to evaluate the effectiveness of purely convolutional temporal feature extraction, particularly for capturing local patterns without explicit recurrence or attention. These baselines collectively cover global attention, deep generative modeling, long-memory recurrence, and convolutional locality, ensuring comprehensive and fair performance evaluation relative to the proposed architecture. In addition to cross-model comparisons, an ablation study is conducted to evaluate the contribution of each module within the proposed framework. Specifically, performance is compared across individual components (Transformer-only, GRU-only, GAN-only augmentation), pairwise combinations (Transformer-GRU, GAN-GRU, Transformer-GAN), and the full Transformer–GAN–GRU configuration. This structured ablation analysis quantifies the incremental benefit of generative augmentation, attention-based encoding, and gated sequential refinement, thereby validating the necessity and complementary interaction of each module in addressing intelligent load balancing and demand forecasting challenges.
The performance of all models is evaluated using the following metrics: accuracy (Acc), recall, AUC, root mean square error (RMSE), a two-tailed t-test, variance analysis, inference latency, and runtime. These metrics collectively assess classification reliability, regression precision, statistical significance, prediction stability, and computational efficiency, ensuring a comprehensive evaluation of intelligent load forecasting and balancing performance. The accuracy metric, defined in Equation (31), measures the overall proportion of correctly predicted samples among all evaluated instances. It reflects the global classification reliability of the model, particularly when distinguishing between different demand states or load conditions. High accuracy indicates strong general prediction capability; however, it alone may not fully capture performance under imbalanced demand scenarios.
$\mathrm{Accuracy} = \frac{\text{true positive} + \text{true negative}}{\text{true positive} + \text{true negative} + \text{false positive} + \text{false negative}}$
Recall, defined in Equation (32), evaluates the model’s ability to correctly identify positive instances, such as peak or high-demand states. This metric is particularly important in smart grid applications, where failure to detect critical demand spikes may result in instability or inefficient load balancing decisions. A high recall value ensures that significant demand events are not overlooked.
$\mathrm{Recall} = \frac{\text{true positive}}{\text{true positive} + \text{false negative}}$
The area under the receiver operating characteristic (ROC) curve, defined in Equation (33), measures the model’s discrimination capability across different classification thresholds. It reflects the trade-off between true positive rate and false positive rate over varying decision boundaries. A higher AUC indicates stronger separability between different load states and improved robustness under threshold variations.
$\mathrm{AUC} = \int_{0}^{1} \mathrm{ROC}(t)\, dt$
where $\mathrm{ROC}(t)$ is the value of the ROC curve at threshold $t$.
The RMSE, defined in Equation (34), quantifies the deviation between predicted load values and observed ground truth measurements. RMSE directly evaluates regression precision and is particularly suitable for continuous load forecasting tasks. Lower RMSE values indicate higher forecasting accuracy and better load balancing precision.
$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \hat{x}_i\right)^2}$
where $x_i$ is the observed value; $\hat{x}_i$ is the predicted value.
Although load forecasting is inherently a regression task, the objective of this study extends beyond numerical prediction to intelligent load state classification for operational decision-making. Specifically, the predicted continuous load values are used to infer demand states (e.g., normal vs. high-demand conditions), which are critical for triggering load balancing actions in smart grid systems. Therefore, classification-oriented metrics such as accuracy, recall, and AUC are employed to evaluate the model’s ability to correctly identify these operational states under varying thresholds. In addition, regression-based evaluation is also incorporated through RMSE to assess numerical prediction accuracy. This combined evaluation framework ensures both precise forecasting and reliable decision-making capability, aligning with practical smart grid requirements.
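This combined evaluation can be sketched as follows; the fixed state threshold is an illustrative rule standing in for the paper's operational definition of demand states:

```python
import numpy as np

def evaluate(y_true: np.ndarray, y_pred: np.ndarray, threshold: float):
    """Derive binary demand states from continuous loads (high-demand when a
    value exceeds `threshold`), then compute accuracy, recall, and RMSE as
    in Equations (31), (32), and (34)."""
    t, p = y_true > threshold, y_pred > threshold
    tp = np.sum(t & p); tn = np.sum(~t & ~p)
    fp = np.sum(~t & p); fn = np.sum(t & ~p)
    acc = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return acc, recall, rmse
```

Sweeping `threshold` over a range of values yields the ROC curve whose integral gives the AUC of Equation (33).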
Beyond predictive metrics, statistical significance is assessed using a two-tailed t-test. This test evaluates whether the performance differences between the proposed model and baseline architectures are statistically meaningful rather than resulting from random variation. By comparing mean performance values across multiple runs, the t-test ensures that reported improvements are reliable and reproducible. Variance analysis is conducted to measure prediction stability across repeated experiments. Lower variance indicates that the model maintains consistent performance under different initialization seeds and training splits, which is essential for real-world deployment in dynamic smart grid environments. Inference latency measures the time required to generate predictions for a single input sequence. This metric evaluates real-time applicability, which is critical for intelligent load balancing systems that require timely decision-making. Finally, total runtime reflects the overall computational cost of training and evaluation. This includes model convergence time and resource utilization. Assessing runtime ensures that performance improvements are achieved without excessive computational overhead, supporting scalability in large-scale smart grid implementations.
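The significance check can be reproduced with SciPy's two-tailed t-test; the per-run accuracies below are hypothetical illustrative values, not results from the paper's experiments:

```python
import numpy as np
from scipy import stats

# Hypothetical per-run accuracies (three seeds) for the proposed model
# and one baseline; illustrative values only.
proposed = np.array([99.42, 99.49, 99.38])
baseline = np.array([97.10, 97.35, 96.88])

# Welch's two-tailed t-test (no equal-variance assumption across models)
t_stat, p_value = stats.ttest_ind(proposed, baseline, equal_var=False)
significant = p_value < 0.05   # reject H0 of equal mean performance at the 5% level
```

The same per-run score arrays also feed the variance analysis: a low `np.var(proposed)` across seeds indicates the prediction stability discussed above.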
Proper hyper-parameter configuration plays a critical role in the performance and stability of deep learning architectures. In hybrid models such as Transformer–GAN–GRU, suboptimal parameter choices may lead to unstable adversarial training, slow convergence, gradient explosion or vanishing, and poor generalization across unseen demand patterns. For example, an excessively large learning rate may cause oscillatory behavior in GAN training, while insufficient attention heads or hidden units may limit the model’s representational capacity. Similarly, inappropriate dropout or weight decay values may result in either overfitting or underfitting, both of which degrade predictive reliability in smart grid environments.
To ensure optimal parameter selection, a structured grid search strategy was employed. Grid search systematically evaluates combinations of predefined parameter values within selected ranges and identifies the configuration that yields the best validation performance. For each architecture, multiple candidate values for learning rate, batch size, number of layers, hidden units, dropout rate, and other architecture-specific parameters were defined. The models were trained across all combinations, and the configuration achieving the lowest validation error and highest stability was selected. This exhaustive yet controlled exploration ensures fair comparison across architectures and reproducible optimization.
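As an illustration of the procedure described above, a grid search can be sketched in a few lines of Python. The search space mirrors the candidate ranges reported in the text, but the `validation_error` function below is a hypothetical stand-in for a full training run, not the paper's actual pipeline:

```python
from itertools import product

# Hypothetical search space, mirroring the candidate ranges in the text.
grid = {
    "learning_rate": [0.001, 0.003, 0.005],
    "batch_size": [16, 32, 64],
    "dropout": [0.1, 0.2, 0.3],
}

def validation_error(cfg):
    """Stand-in for training a model and returning its validation error.
    A toy quadratic penalty makes (0.003, 32, 0.2) the optimum here."""
    return ((cfg["learning_rate"] - 0.003) * 1e3) ** 2 \
         + ((cfg["batch_size"] - 32) / 16) ** 2 \
         + ((cfg["dropout"] - 0.2) * 10) ** 2

def grid_search(grid):
    """Exhaustively evaluate every combination; keep the lowest error."""
    keys = list(grid)
    best_cfg, best_err = None, float("inf")
    for values in product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        err = validation_error(cfg)
        if err < best_err:
            best_cfg, best_err = cfg, err
    return best_cfg, best_err

best_cfg, best_err = grid_search(grid)
```

In practice, each call to `validation_error` would train the corresponding architecture to convergence and return its validation RMSE, which is what makes grid search exhaustive yet controlled.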
Table 1 presents the final hyper-parameter configurations obtained through grid search for the proposed Transformer–GAN–GRU framework and all baseline models. For the proposed architecture, the learning rate was selected from the range 0.001 to 0.005 and finalized at 0.003, while the batch size was chosen as 32 from 16, 32, and 64. The feed-forward hidden size was optimized within 1024, 2048, and 4096 and fixed at 2048. The dropout rate was selected as 0.2 from 0.1, 0.2, and 0.3, and weight decay was set to 0.01. The number of attention heads was tuned within 2, 4, and 6 and finalized at 4, while the number of encoder layers was selected as 4 from 2, 4, and 6. The GRU module was configured with 2 layers selected from 1, 2, and 3. The sequence length was set to 5 based on temporal window sensitivity analysis. GELU activation was used in the Transformer blocks, while tanh was employed in the recurrent units. The Adam optimizer was selected after comparison with SGD and RMSProp owing to its faster and more stable convergence.
For the BERT baseline, the learning rate was selected as 0.004 from 0.001, 0.004, and 0.006, batch size was fixed at 64, and dropout remained at 0.2. The number of self-attention heads per layer was tuned within 4, 6, and 8 and selected as 6, while the encoder depth was set to 6 layers. GELU activation and Adam optimization were retained for consistency. For the DBN model, the number of hidden layers was chosen as 6 from 4, 6, and 8, with a learning rate of 0.006 and momentum of 0.5 selected from 0.3, 0.5, and 0.7. Tanh and sigmoid activations were used in stacked RBM layers. The LSTM architecture employed a learning rate of 0.006, batch size 64, and 3 hidden layers selected from 1, 2, 3, and 4, with sequence length tuned to 6. For the DCNN model, the number of convolution layers was selected as 7 from 4, 5, and 7, kernel size was fixed at 5 × 5, pooling strategy used 2 × 2 max pooling, and hidden layers were set to 4. All baseline models used Adam optimization to maintain fairness across training conditions.
Table 2 presents the comparative classification performance of all evaluated architectures across three heterogeneous smart grid datasets. The metrics include accuracy, recall, and AUC, allowing assessment of overall correctness, sensitivity to high-demand states, and discrimination capability across varying thresholds. The results clearly demonstrate consistent superiority of the proposed Transformer–GAN–GRU framework over all baseline models across all datasets, confirming both robustness and generalization capability. For the Pecan Street dataset, the proposed model achieves an accuracy of 99.49%, recall of 99.67%, and AUC of 99.83%, significantly outperforming the closest competitor, BERT, which achieves 91.12% accuracy and 92.65% AUC. Compared to standalone Transformer (90.35% accuracy) and GRU (87.32% accuracy), the improvement ranges from roughly 9 to 12 percentage points. Traditional deep models such as DBN, LSTM, and CNN remain below 90% accuracy, highlighting the limitation of single-paradigm architectures in capturing complex residential demand patterns.
On the REDD dataset, which includes appliance-level variability and high-frequency fluctuations, the proposed framework achieves 99.08% accuracy, 99.44% recall, and 99.71% AUC. In contrast, BERT reaches 90.43% accuracy, while Transformer-only achieves 89.17%. The improvement margin over the best baseline exceeds 8.5 percentage points in accuracy and approaches 7.6 percentage points in AUC. This indicates that the hybrid architecture effectively handles fine-grained load variations and improves sensitivity to subtle demand transitions. For the RTS-GMLC dataset, representing grid-level operational complexity, the proposed model maintains strong performance with 99.02% accuracy, 99.32% recall, and 99.63% AUC. The second-best model, BERT, achieves 89.44% accuracy and 91.28% AUC, resulting in an absolute improvement of nearly 10 percentage points in accuracy and more than 8 percentage points in AUC. Other baselines show further degradation, particularly CNN and GRU, which remain below 87% accuracy. This consistent performance across residential, appliance-level, and system-scale datasets demonstrates strong scalability and adaptability of the proposed framework.
The superior performance of the Transformer–GAN–GRU architecture can be attributed to its structured tri-level integration. The GAN module enhances distribution diversity and reduces imbalance in peak-load conditions, improving sensitivity and recall. The Transformer encoder captures long-range temporal dependencies and periodic grid behavior, strengthening contextual representation. The GRU module refines short-term transitions and smooths prediction trajectories. Unlike standalone architectures that focus solely on either attention, recurrence, or convolution, the proposed framework combines generative distribution alignment, global dependency modeling, and gated sequential refinement within a unified optimization strategy. This complementary interaction explains the substantial and consistent performance gains observed across all evaluation metrics and datasets.
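The data flow behind this tri-level interaction can be conveyed with a deliberately simplified NumPy sketch. This is an illustrative toy, not the paper's implementation: a single attention head stands in for the Transformer encoder, one GRU cell stands in for the recurrent module, and the GAN stage is represented only by a comment, since augmentation acts on the training inputs rather than the forward pass. All dimensions and weights below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention: global dependencies."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

def gru_step(h, x, Wz, Wr, Wh):
    """One GRU update refining short-term transitions."""
    z = 1 / (1 + np.exp(-(np.concatenate([h, x]) @ Wz)))  # update gate
    r = 1 / (1 + np.exp(-(np.concatenate([h, x]) @ Wr)))  # reset gate
    h_tilde = np.tanh(np.concatenate([r * h, x]) @ Wh)    # candidate state
    return (1 - z) * h + z * h_tilde

T, d = 5, 8                       # toy sequence length and feature width
X = rng.normal(size=(T, d))       # inputs (GAN-augmented during training)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
ctx = self_attention(X, Wq, Wk, Wv)            # Transformer stage

h = np.zeros(d)
Wz, Wr, Wh = (rng.normal(size=(2 * d, d)) * 0.1 for _ in range(3))
for t in range(T):                             # GRU stage over attended features
    h = gru_step(h, ctx[t], Wz, Wr, Wh)

forecast = h.mean()                            # toy readout head
```

The sketch shows only the ordering of the stages (attention first, gated recurrence second); the actual framework stacks multiple heads, encoder layers, and GRU layers as specified in Table 1.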
Figure 7 presents the bar-chart visualization of the quantitative results reported in Table 2, illustrating the comparative performance of all evaluated models across the three datasets. For each dataset, the metrics accuracy, recall, and AUC are plotted, allowing a direct visual comparison of classification reliability, sensitivity to high-demand conditions, and discrimination capability. From a visual perspective, the superiority of the proposed framework is evident across all datasets and metrics. In Pecan Street (Figure 7a), the performance gap between the proposed model and the second-best method (BERT) exceeds approximately 8 percentage points across metrics. In REDD (Figure 7b), the difference remains consistently large, particularly in recall and AUC, indicating improved sensitivity to fine-grained load fluctuations. In RTS-GMLC (Figure 7c), which represents grid-level complexity, the separation becomes even more pronounced, with nearly 10 percentage points improvement in accuracy compared to BERT and substantially larger gains over CNN and GRU. Additionally, the error bars show relatively low variance for the proposed model, reflecting stability across repeated experiments. Overall, the bar charts visually reinforce the numerical findings by demonstrating consistent, dataset-independent superiority of the Transformer–GAN–GRU architecture in intelligent load forecasting tasks.
Figure 8 illustrates the ROC curves of all evaluated models across the three datasets. Each curve represents the relationship between sensitivity (true positive rate) and 1-specificity (false positive rate) under varying classification thresholds. The area under the ROC curve quantifies the overall discriminative ability of a model; a higher AUC indicates stronger capability to distinguish between different load states across all possible threshold values. Unlike single-point metrics such as accuracy, the ROC curve evaluates robustness under threshold variations, making it particularly important for operational smart grid environments where decision thresholds may dynamically change.
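The ROC construction just described can be reproduced with a short, dependency-free sketch: thresholds are swept over the predicted scores, a (FPR, TPR) point is recorded at each, and the AUC follows from trapezoidal integration. The labels and scores below are toy values, not data from the evaluated models:

```python
def roc_points(labels, scores):
    """Sweep thresholds over the observed scores; return (FPR, TPR) pairs."""
    P = sum(labels)
    N = len(labels) - P
    pts = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= thr)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= thr)
        pts.append((fp / N, tp / P))
    return [(0.0, 0.0)] + pts

def auc(points):
    """Trapezoidal area under the ROC curve."""
    pts = sorted(points) + [(1.0, 1.0)]
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# Toy demand-state scores: higher score = predicted high-demand state.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]       # perfectly separable
perfect_auc = auc(roc_points(labels, scores))  # 1.0 for this toy case
```

A perfectly separable classifier yields AUC = 1.0, while chance-level scores yield 0.5, which is why AUC serves as a threshold-independent summary of discrimination capability.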
The ROC curves clearly demonstrate that the proposed Transformer–GAN–GRU model consistently dominates the upper-left region of the plots across all datasets, indicating superior sensitivity at lower false positive rates. In all three subfigures, the curve of the proposed model rises sharply toward the top-left corner and remains above all baselines across the entire threshold spectrum. This behavior confirms strong separability between demand states. In contrast, CNN, GRU, and LSTM exhibit flatter curves, indicating weaker discrimination performance. Even advanced baselines such as BERT and standalone Transformer show noticeably lower curvature, particularly in the low false-positive region, suggesting reduced robustness when stricter decision thresholds are applied.
From a practical perspective, this result is particularly significant for intelligent load balancing systems. In real-world grid management, false alarms (false positives) may trigger unnecessary resource allocation, while missed detections (false negatives) may cause instability during peak demand. To mitigate the impact of false positives in practical grid management, the proposed framework supports threshold-aware decision tuning based on operational priorities. Specifically, decision thresholds can be adjusted using ROC characteristics to achieve a desired balance between false positive rate and detection sensitivity, depending on system conditions. In scenarios where unnecessary control actions are costly, a stricter threshold can be selected to reduce false alarms, whereas in critical stability conditions, a more sensitive threshold may be adopted to avoid missed detections. Furthermore, the integration of generative adversarial network (GAN)-based data augmentation improves the representation of rare and borderline demand states, enabling the model to learn more discriminative decision boundaries. This reduces ambiguity between normal and peak conditions and consequently lowers the likelihood of false triggering. In addition, the stability of the proposed model under variance analysis indicates consistent behavior across different operating regimes, which further supports reliable threshold selection in real-world deployments. The superior ROC behavior of the proposed model implies more reliable threshold-independent decision capability, allowing grid operators to maintain stability under varying operational constraints. Importantly, the strong curvature consistency across residential-level (Pecan Street), appliance-level (REDD), and system-scale (RTS-GMLC) datasets demonstrates that the hybrid architecture generalizes well across different grid granularities. 
This threshold-robust discrimination capability complements the high accuracy values reported earlier and further validates the effectiveness of integrating generative augmentation with global attention encoding and gated sequential refinement.
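The threshold-aware tuning discussed above reduces, in code, to selecting the most sensitive operating point whose false positive rate stays within an operator-defined budget. The following sketch uses toy labels and scores (not outputs of the evaluated models) to contrast a strict regime against a more sensitive one:

```python
def pick_threshold(labels, scores, max_fpr):
    """Choose the most sensitive threshold whose FPR stays within budget."""
    P = sum(labels)
    N = len(labels) - P
    best = None  # (tpr, threshold)
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= thr)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= thr)
        if fp / N <= max_fpr and (best is None or tp / P > best[0]):
            best = (tp / P, thr)
    return best

labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.9, 0.7, 0.55, 0.6, 0.4, 0.2, 0.1]

# Strict regime: no false alarms tolerated (costly control actions).
strict = pick_threshold(labels, scores, max_fpr=0.0)
# Sensitive regime: tolerate some false alarms to avoid missed peaks.
sensitive = pick_threshold(labels, scores, max_fpr=0.25)
```

Relaxing the FPR budget raises sensitivity at the cost of occasional false alarms, which is precisely the trade-off an operator navigates between normal and critical stability conditions.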
Figure 9, Figure 10 and Figure 11 illustrate the training convergence behavior of all evaluated models by plotting RMSE versus training epochs for the Pecan Street, REDD, and RTS-GMLC datasets, respectively. These curves provide insight into learning speed, convergence stability, and final error minimization capability. Unlike static performance tables, convergence plots reveal how efficiently each architecture optimizes its parameters and how rapidly it reaches a stable predictive state. In Figure 9, the proposed Transformer–GAN–GRU model exhibits the fastest convergence among all architectures. Its RMSE drops sharply within the first 50 epochs and approaches near-zero levels before epoch 120, indicating rapid stabilization and efficient gradient propagation. In contrast, BERT and Transformer show gradual reduction but require nearly 200–250 epochs to approach their minimum values. Recurrent models such as LSTM and GRU converge more slowly and stabilize at higher RMSE levels. CNN and DBN demonstrate even slower decay, reflecting weaker temporal modeling capacity for residential consumption dynamics. The accelerated convergence of the proposed framework can be attributed to distribution enrichment via GAN and hierarchical temporal decomposition between attention and gated recurrence.
In Figure 10, which includes high-frequency appliance-level variations, convergence differences become more pronounced. The proposed model again achieves rapid error reduction, reaching low RMSE values around epoch 80–100. Transformer and BERT converge more steadily but plateau at higher error levels after approximately 200 epochs. LSTM shows oscillatory behavior before stabilizing, indicating sensitivity to local fluctuations. GAN-only and DBN models reduce error gradually but lack the refinement capability of hybrid attention-recurrent integration. The early stabilization of the proposed architecture suggests improved gradient stability and better representation learning in fine-grained temporal conditions. In Figure 11, representing large-scale grid operational complexity, the convergence pattern remains consistent. The proposed model demonstrates sharp initial error reduction and stabilizes before epoch 100, whereas other models require 200–300 epochs to reach their minimum. CNN and GRU show relatively slower convergence and higher final RMSE, indicating limited capacity to capture system-level dependencies. The consistent early convergence across all datasets confirms that the tri-level integration enhances optimization efficiency. The GAN module improves initial distribution alignment, reducing gradient noise, while the Transformer captures global structure early in training. The GRU then refines local dynamics, leading to smoother and faster convergence compared to standalone architectures.
Table 3 presents the ablation study of the proposed Transformer–GAN–GRU framework, systematically evaluating the contribution of each architectural component across the Pecan Street, REDD, and RTS-GMLC datasets. By progressively removing or combining modules (Transformer, GAN, and GRU), the table quantifies how each component influences accuracy, recall, and AUC. The full Transformer–GAN–GRU model consistently achieves the highest results across all datasets, confirming that the three modules operate in a complementary manner. When GAN is removed (Transformer–GRU), performance drops moderately, indicating that synthetic feature enrichment plays a measurable role in enhancing class balance and representation diversity. Similarly, removing GRU (Transformer–GAN) leads to a slight decline compared to the full model, suggesting that while the Transformer captures global contextual dependencies effectively, the recurrent refinement provided by GRU improves sequential consistency and local temporal smoothing. The drop is more visible in recall and AUC, which reflects reduced sensitivity in complex load fluctuation scenarios.
The dual combinations further clarify the interaction mechanisms. Transformer–GAN performs better than Transformer–GRU in certain datasets, implying that feature distribution augmentation is particularly beneficial in datasets with higher imbalance or irregular load patterns. Conversely, Transformer–GRU outperforms GAN–GRU, highlighting that global attention-based modeling contributes more significantly than generative augmentation when long-range dependencies dominate the signal structure. The GAN–GRU combination, while stronger than a standalone GAN or GRU, still lacks the global relational modeling capacity of attention mechanisms, which explains its intermediate performance. Standalone architectures (Transformer, GAN, GRU) show noticeably lower performance compared to hybrid variants. Transformer alone models contextual relations but lacks distribution balancing and sequential refinement. GAN alone enhances representation diversity but cannot fully capture structured temporal dependencies. GRU alone effectively models short-term temporal dynamics but struggles with long-range interactions and class imbalance. The ablation results, therefore, confirm that demand forecasting and load pattern classification in these datasets require simultaneous modeling of global temporal dependencies, distribution enrichment, and sequential refinement. The integrated tri-module design addresses these challenges holistically, which explains the consistent superiority of the complete Transformer–GAN–GRU framework across all evaluation metrics and datasets.
Although the proposed model achieves very high performance across all evaluation metrics, it is important to analyze these results in the context of potential overfitting and generalization capability. The high accuracy and AUC values can be attributed to the structured nature of smart grid datasets, where temporal patterns exhibit strong regularity and predictable dependencies, especially under stable operating conditions. To mitigate overfitting, several measures were incorporated during model development. First, the model was evaluated across three heterogeneous datasets (Pecan Street, REDD, and RTS-GMLC), representing residential-, appliance-, and grid-level scenarios, which demonstrates cross-domain generalization capability. Second, a strict data split strategy (training/validation/testing) was employed along with early stopping to prevent over-training. Third, variance analysis and repeated experiments with different random seeds confirmed the stability and consistency of the model performance. Furthermore, the inclusion of GAN-based data augmentation enhances representation of rare and high-variance demand states, reducing bias toward dominant patterns and improving generalization. Together, these precautions ensure that the reported performance is not influenced by data leakage or artificial information transfer between training and evaluation sets, thereby supporting the validity of the results. Despite these measures, we acknowledge that real-world deployment may involve additional uncertainties such as unseen distribution shifts, measurement noise, and evolving grid dynamics. Therefore, further validation on real-time streaming data and cross-regional datasets will be considered as part of future work.
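The early-stopping safeguard mentioned above follows the standard patience pattern: training halts once the validation loss fails to improve for a fixed number of epochs, and the best epoch is retained. The sketch below operates on a synthetic validation curve; the patience value is illustrative, not the paper's configuration:

```python
def train_with_early_stopping(val_losses, patience=3):
    """Return the epoch at which training would stop, given a stream of
    per-epoch validation losses. Stops after `patience` epochs without
    improvement and reports the best epoch and loss seen."""
    best_loss, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:  # no improvement for `patience` epochs
                break
    return best_epoch, best_loss

# Synthetic validation curve: improves, then plateaus and overfits.
curve = [0.9, 0.6, 0.45, 0.40, 0.41, 0.43, 0.44, 0.46]
best_epoch, best_loss = train_with_early_stopping(curve, patience=3)
```

In a real training loop, the model weights at `best_epoch` would be checkpointed and restored, preventing the over-training that the later epochs of the curve exhibit.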
Although the absolute performance differences between model variants may appear moderate in percentage terms, the improvements are consistent across all datasets and evaluation metrics. In high-accuracy regimes (above 95%), even small numerical gains correspond to meaningful improvements in decision reliability, particularly in smart grid applications where misclassification of peak demand events can lead to instability or inefficient resource allocation. More importantly, the ablation results demonstrate that each component contributes complementary functionality rather than redundant complexity. The GAN module improves representation of rare and high-variance demand states, the Transformer captures long-range temporal dependencies, and the GRU refines short-term sequential dynamics. The consistent performance degradation observed when removing any component confirms that the full architecture is necessary to achieve balanced, robust, and reliable predictive behavior.

4. Discussion

In the previous section, the quantitative performance of the proposed Transformer–GAN–GRU framework was evaluated using standard predictive metrics such as accuracy, recall, AUC, and RMSE across three benchmark smart grid datasets. Those results demonstrated clear numerical superiority of the hybrid architecture compared with both standalone deep learning models and partially combined variants. However, while such metrics confirm classification and forecasting effectiveness, they do not fully capture model stability, robustness, statistical significance, or deployment feasibility. Therefore, beyond predictive accuracy, a deeper analytical evaluation is required to understand whether the observed improvements are consistent, statistically reliable, and practically meaningful.
In this section, we further examine the proposed architecture through variance analysis, runtime evaluation, inference latency measurement, and statistical hypothesis testing using a two-tailed t-test. These complementary analyses aim to assess generalizability across datasets, performance consistency under variability, computational efficiency, and real-world deployment suitability in smart grid environments. By analyzing stability, computational cost, and statistical significance alongside predictive performance, we provide a comprehensive evaluation of the proposed framework’s scalability, comparability, and applicability in operational power systems.
Table 4, Table 5 and Table 6 report the runtime required for each architecture to reach specific RMSE convergence levels, where predefined RMSE thresholds act as stopping criteria. Instead of measuring total training duration, this evaluation focuses on how efficiently each model approaches acceptable prediction error levels. The thresholds (RMSE < 15, <10, <5, and <2.5) represent progressively stricter accuracy targets, thereby allowing a direct comparison of convergence speed and optimization efficiency across architectures.
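The measurement protocol behind these tables can be sketched as follows: during training, record the elapsed time at which the RMSE first falls below each predefined threshold. The code below simulates this with a synthetic exponentially decaying RMSE curve and a fixed per-epoch cost; both are hypothetical stand-ins for actual training traces:

```python
import math

def time_to_thresholds(rmse_per_epoch, thresholds, seconds_per_epoch=1.0):
    """Record the (simulated) wall-clock time at which RMSE first falls
    below each target threshold; None if the threshold is never reached."""
    reached = {thr: None for thr in thresholds}
    elapsed = 0.0
    for rmse in rmse_per_epoch:
        elapsed += seconds_per_epoch
        for thr in thresholds:
            if reached[thr] is None and rmse < thr:
                reached[thr] = elapsed
    return reached

# Synthetic exponentially decaying training curve (arbitrary constants).
curve = [40 * math.exp(-0.05 * e) for e in range(200)]
times = time_to_thresholds(curve, thresholds=(15, 10, 5, 2.5))
```

Stricter thresholds are necessarily reached later (or not at all), which is exactly the pattern Table 4, Table 5 and Table 6 quantify across architectures.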
In Table 4 (Pecan Street), the proposed Transformer–GAN–GRU model significantly outperforms all baselines in convergence speed. It reaches RMSE < 15 in only 12 s, compared to 88–191 s for competing architectures. Even for stricter thresholds such as RMSE < 5, it converges in 84 s, while BERT and Transformer require 479 and 529 s, respectively. Notably, most baseline models fail to reach RMSE < 5 or RMSE < 2.5 within the training window. This rapid convergence is attributed to the synergistic design: the Transformer quickly captures global load dependencies, GAN stabilizes feature distributions early in training, and GRU refines local temporal dynamics, reducing gradient oscillations. In Table 5 (REDD), a similar pattern emerges. The proposed model reaches RMSE < 10 in 38 s, while the next best model (BERT) requires 262 s. For RMSE < 5, the hybrid model converges in 95 s, whereas other architectures require between 579 and 718 s or fail to reach that threshold. Since REDD contains high-frequency appliance-level fluctuations, the improved convergence suggests that the hybrid structure effectively decomposes both global and local temporal patterns, accelerating optimization and reducing training instability.
In Table 6, which represents large-scale grid operational data, the performance gap becomes even more pronounced. The proposed model reaches RMSE < 10 in 48 s, while BERT and Transformer require 293 and 428 s, respectively. None of the baseline architectures reach the strictest threshold (RMSE < 2.5), whereas the proposed model achieves it in 173 s. This indicates not only faster convergence but also superior final optimization capability. The results confirm that the integrated architecture reduces computational redundancy and improves gradient efficiency, making it particularly suitable for real-time smart grid applications where rapid model adaptation is essential. Overall, these tables demonstrate that the proposed Transformer–GAN–GRU framework achieves substantially faster convergence across all datasets and error thresholds. This efficiency enhances practical deployability, as lower runtime directly translates to reduced computational cost and improved adaptability in dynamic power grid environments.
Table 7 presents the inference latency comparison of the proposed Transformer–GAN–GRU model against baseline architectures across the Pecan Street, REDD, and RTS-GMLC datasets. Unlike training runtime, inference latency measures the time required to generate predictions for new unseen samples, which directly reflects deployment feasibility in real-time smart grid systems. The reported values (in milliseconds) indicate the average forward-pass computation time per sample for each architecture under identical hardware conditions. From the results, it can be observed that the proposed Transformer–GAN–GRU model exhibits slightly higher inference latency (8.2–8.6 ms) compared to lighter standalone models such as DCNN (6.8–7.1 ms) and GAN (6.9–7.2 ms). This difference is expected due to the additional architectural depth and multi-module integration. However, the latency increase over the lightest models remains marginal (approximately 1–1.5 ms on average), and notably heavier architectures such as BERT and LSTM exhibit inference times in a similar upper range (around 8.0–8.4 ms). Importantly, the proposed model maintains stable latency across all three datasets, indicating consistent computational behavior regardless of dataset scale or temporal resolution.
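Per-sample latency of the kind reported in Table 7 is typically measured by averaging many timed forward passes after a warm-up phase. The sketch below illustrates this protocol with a trivial stand-in for a trained model's `predict` function; the warm-up and run counts are illustrative choices:

```python
import time

def mean_latency_ms(predict, sample, warmup=10, runs=100):
    """Average single-sample forward-pass latency in milliseconds.
    Warm-up iterations are discarded to avoid cold-start effects
    (caching, JIT compilation, lazy allocation)."""
    for _ in range(warmup):
        predict(sample)
    start = time.perf_counter()
    for _ in range(runs):
        predict(sample)
    return (time.perf_counter() - start) / runs * 1e3

# Stand-in for a trained model's forward pass.
def dummy_predict(x):
    return sum(v * v for v in x)

latency = mean_latency_ms(dummy_predict, [0.1] * 512)
```

Using `time.perf_counter` rather than `time.time` matters here, since the former provides the monotonic, high-resolution clock appropriate for short intervals.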
Although the proposed architecture is not the absolute fastest in inference, its latency remains within a very low millisecond range, which is fully compatible with real-time smart grid operation requirements. Considering that it delivers substantially higher predictive accuracy and faster convergence during training, the slight increase in inference cost represents a favorable trade-off. Therefore, the Transformer–GAN–GRU framework achieves an effective balance between predictive performance and operational efficiency, making it suitable for practical deployment in intelligent load balancing and real-time demand forecasting systems.
From a computational perspective, the proposed Transformer–GAN–GRU framework demonstrates an efficient balance between model complexity and practical deployment requirements. As shown in Table 4, Table 5 and Table 6, the proposed model consistently achieves faster convergence compared to baseline architectures, reaching predefined RMSE thresholds in significantly fewer training seconds. This indicates improved optimization efficiency despite the hybrid multi-module design. In terms of inference, Table 7 shows that the proposed model maintains low latency within the millisecond range, comparable to lightweight architectures and significantly lower than more complex models such as BERT. Importantly, the GAN component operates only during training and does not contribute to inference overhead, ensuring that deployment complexity remains controlled. Furthermore, the stable latency observed across datasets of different scales (Pecan Street, REDD, RTS-GMLC) indicates good scalability with respect to both data resolution and system size. While the hybrid architecture introduces moderate additional training cost, the overall computational trade-off is favorable, as it achieves superior predictive performance and faster convergence without compromising real-time applicability.
Table 8 reports the variance of model performance across 35 independent training runs for the Pecan Street, REDD, and RTS-GMLC datasets. This evaluation measures the stability and robustness of each architecture under repeated experiments with different random initializations and data shuffling. Unlike single-run accuracy results, variance analysis reveals how sensitive a model is to stochastic training dynamics and whether its performance is consistently reproducible. The results show that the proposed Transformer–GAN–GRU model achieves extremely low variance values (0.00023–0.00062), which are several orders of magnitude smaller than all baseline models. In contrast, other architectures exhibit significantly higher variance, with values increasing progressively from BERT and Transformer to recurrent and convolutional models. For instance, DCNN reaches variance levels of 8–9 across datasets, indicating strong sensitivity to initialization and training fluctuations. The substantial gap between the proposed model and all baselines suggests that the hybrid integration improves optimization stability and reduces performance oscillation across runs.
Variance is a critical metric in smart grid applications because real-world deployment demands reliable and repeatable behavior under changing conditions. High variance implies unpredictable performance, which can be risky in load balancing and demand forecasting scenarios. The extremely low variance of the proposed architecture indicates strong convergence stability and robustness to stochastic variations, making it highly dependable for operational environments. This stability also implies better generalization potential for future datasets and evolving grid conditions, reinforcing the suitability of the Transformer–GAN–GRU framework for long-term intelligent energy management systems.
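The variance protocol of Table 8 amounts to repeating the full experiment under different seeds and computing the variance of the resulting metric. The sketch below uses a toy `run_experiment` whose accuracy fluctuates mildly around a fixed value; the function and its numbers are hypothetical stand-ins for actual training runs:

```python
import random
from statistics import mean, pvariance

def run_experiment(seed):
    """Stand-in for one full training run with a given seed; returns a
    final accuracy (%) with small stochastic fluctuation."""
    rng = random.Random(seed)
    return 99.4 + rng.gauss(0.0, 0.02)

# 35 independent runs, matching the protocol described for Table 8.
accs = [run_experiment(seed) for seed in range(35)]
acc_mean, acc_var = mean(accs), pvariance(accs)
```

A model whose 35-run variance stays in the 1e-4 range, as reported for the proposed framework, behaves almost deterministically across seeds, whereas variances near 8–9 imply run-to-run accuracy swings of several percentage points.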
Table 9 presents the results of two-tailed statistical t-tests conducted at a 0.01 significance level to evaluate whether the performance improvements of the proposed Transformer–GAN–GRU model over baseline architectures are statistically meaningful. The table reports the p-values obtained from pairwise comparisons between the proposed model and each competing architecture across the Pecan Street, REDD, and RTS-GMLC datasets. The “Results” column indicates whether the observed differences are statistically significant under the predefined threshold.
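A pairwise comparison of this kind can be reproduced with SciPy's two-sample t-test. The per-run accuracies below are illustrative values (not the paper's recorded runs), and Welch's variant (`equal_var=False`) is used here since the two models need not share variance; the paper does not specify which variant was applied:

```python
from scipy.stats import ttest_ind

# Hypothetical per-run accuracies (%): proposed model vs. one baseline.
proposed = [99.47, 99.51, 99.49, 99.50, 99.46, 99.52, 99.48, 99.50]
baseline = [91.10, 91.25, 90.98, 91.18, 91.05, 91.30, 91.02, 91.15]

# Two-tailed Welch t-test; the difference is declared significant at the
# 0.01 level when the p-value falls below 0.01.
t_stat, p_value = ttest_ind(proposed, baseline, equal_var=False)
significant = p_value < 0.01
```

With well-separated, low-variance samples such as these, the p-value is vanishingly small, mirroring the 0.0009 to 0.00001 range reported in Table 9.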
The reported p-values are consistently far below 0.01 for all comparisons and across all datasets, ranging from 0.0009 down to 0.00001. This confirms that the performance gains of the proposed model over BERT, Transformer, DBN, GAN, LSTM, GRU, and DCNN are not due to random variation or stochastic training effects. Even in comparisons against strong baselines such as BERT and Transformer, the p-values remain significantly small, reinforcing the reliability of the observed improvements.
The consistency of significance across three distinct datasets further demonstrates that the superiority of the proposed architecture generalizes across different grid scales and temporal resolutions. From a methodological perspective, these results highlight that the performance advantages of the Transformer–GAN–GRU are statistically robust and reproducible. The extremely low p-values indicate strong effect sizes and stable superiority rather than marginal or dataset-specific improvements. This strengthens the scientific validity of the proposed approach and confirms that its enhanced predictive performance is structurally grounded in the hybrid architecture rather than being an artifact of experimental randomness.
The comprehensive experimental analysis demonstrates that the proposed Transformer–GAN–GRU architecture is not only statistically superior but also practically aligned with the operational requirements of smart power grids. The combination of high predictive accuracy, low variance across repeated runs, fast convergence, and acceptable inference latency indicates that the framework maintains a strong balance between performance and computational efficiency. In real-world smart grid environments (where demand patterns are dynamic, partially unpredictable, and often imbalanced), such stability and responsiveness are essential. The ability to converge rapidly to low RMSE levels reduces retraining overhead, while low inference latency ensures compatibility with near real-time decision support systems for load redistribution and grid stability control.
From a scalability perspective, the architecture demonstrates consistent behavior across datasets of different sizes, temporal resolutions, and structural complexities. Whether modeling high-frequency residential consumption (Pecan Street and REDD) or system-level operational dynamics (RTS-GMLC), the model maintains stable performance and statistically significant improvements. This cross-dataset generalizability suggests that the hybrid structure can scale from microgrid environments to larger transmission-level systems without architectural redesign. Moreover, the modular design allows flexible adaptation: the Transformer handles global dependency modeling for seasonal and long-term trends, the GAN mitigates distribution imbalance and rare-event representation issues, and the GRU refines short-term fluctuations—together forming a multi-scale learning mechanism suitable for evolving grid infrastructures.
In the broader context of intelligent load balancing and demand forecasting, the problem itself inherently involves nonlinear dynamics, temporal hierarchy, and distribution irregularities. Traditional single-structure models often address only one aspect of this complexity. The proposed integrated architecture instead reflects the multidimensional nature of smart grid behavior: attention mechanisms capture systemic interactions, generative modeling improves robustness under skewed load states, and gated recurrence stabilizes sequential transitions. This holistic modeling strategy positions the framework as a viable candidate for deployment in advanced energy management systems, distributed energy resource coordination, and adaptive grid automation. Ultimately, the discussion confirms that the architectural design is not only experimentally validated but also structurally compatible with the technical and operational demands of next-generation smart power systems.
To further evaluate the robustness and reproducibility of the proposed Transformer–GAN–GRU framework, a comprehensive hyperparameter sensitivity analysis was conducted by varying several key parameters around their optimized values while keeping all other settings fixed. Specifically, the effects of the GAN augmentation ratio, learning rate, number of Transformer attention heads, and number of GRU hidden units were examined on the Pecan Street dataset. As shown in Table 10, the proposed model maintains consistently strong performance across moderate variations of these hyperparameters. The best results are achieved at a learning rate of 0.003, 8 attention heads, and 128 GRU hidden units. When the parameters deviate from these optimal values, only slight reductions in Acc, Recall, and AUC are observed, indicating that the model performance degrades gracefully rather than abruptly.
In particular, the GAN augmentation ratio demonstrates that moderate augmentation (around 30%) provides the best balance between data diversity and distribution consistency, while excessive augmentation slightly reduces performance due to potential distribution distortion. Similarly, increasing model complexity beyond optimal settings (e.g., excessive attention heads or GRU units) does not yield further gains, suggesting that the proposed architecture is well-calibrated.
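The one-at-a-time protocol used in this sensitivity analysis can be sketched as follows; `evaluate` is a hypothetical stand-in for a full train/validate cycle, and the toy score merely peaks at the reported optimum rather than reproducing the experiment.

```python
def one_at_a_time(evaluate, base, grid):
    """One-at-a-time sensitivity sweep: vary a single hyperparameter
    around its optimized value while all others stay fixed at `base`."""
    results = {}
    for name, values in grid.items():
        for v in values:
            config = dict(base, **{name: v})   # override exactly one parameter
            results[(name, v)] = evaluate(config)
    return results

# Toy score peaking at the reported optimum (lr = 0.003, 8 attention heads);
# a real `evaluate` would train and validate the full hybrid model.
base = {"lr": 0.003, "heads": 8}

def evaluate(cfg):
    return 99.49 - abs(cfg["lr"] - 0.003) * 100.0 - abs(cfg["heads"] - 8) * 0.02

results = one_at_a_time(evaluate, base, {"lr": [0.001, 0.003, 0.005], "heads": [2, 4, 8, 12]})
best = max(results, key=results.get)
```

Reading the swept scores row by row reproduces the "graceful degradation" check: performance should fall smoothly, not abruptly, as each parameter moves off its optimum.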
While the proposed framework focuses on point forecasting and classification-based decision support, we acknowledge that probabilistic forecasting can provide additional insights into uncertainty and risk in smart grid operation. Techniques such as prediction intervals, quantile regression, and probabilistic calibration can further enhance reliability assessment under uncertain demand conditions. Incorporating uncertainty-aware forecasting within the proposed hybrid architecture represents an important direction for future work, particularly for risk-sensitive energy management applications.
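As a concrete instance of the quantile-regression route mentioned here, the pinball loss below scores a forecast against a target quantile, and a pair of quantile forecasts yields a prediction interval; all demand values are illustrative, not drawn from the datasets.

```python
def pinball_loss(y_true, y_pred, q):
    """Average pinball (quantile) loss; minimizing it trains a
    forecaster toward the q-th conditional quantile of demand."""
    total = 0.0
    for y, yhat in zip(y_true, y_pred):
        diff = y - yhat
        total += max(q * diff, (q - 1.0) * diff)
    return total / len(y_true)

# Illustrative demand values: a 90% interval assembled from hypothetical
# 5% and 95% quantile forecasts.
actual = [10.0, 12.0, 11.0]
lower = [9.0, 10.5, 9.8]     # q = 0.05 forecast
upper = [11.5, 13.2, 12.4]   # q = 0.95 forecast
coverage = sum(lo <= y <= hi for y, lo, hi in zip(actual, lower, upper)) / len(actual)
```

Empirical coverage close to the nominal level (here 90%) is the basic calibration check for such intervals in risk-aware grid operation.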

5. Conclusions and Future Research Directions

The increasing complexity of modern smart power grids, driven by distributed generation, dynamic consumption behavior, and fluctuating demand patterns, has made intelligent load balancing and accurate demand forecasting critical for maintaining grid stability and operational efficiency. Traditional forecasting approaches often struggle to simultaneously capture long-term temporal dependencies, short-term fluctuations, and distribution imbalance inherent in real-world energy data. To address these challenges, this study proposed a novel hybrid Transformer–GAN–GRU framework that integrates attention-based global modeling, generative feature augmentation, and gated sequential refinement into a unified architecture. The model was evaluated on three benchmark datasets (Pecan Street, REDD, and RTS-GMLC) representing diverse grid environments and temporal granularities. Through extensive experiments including performance comparison, ablation analysis, convergence evaluation, latency assessment, variance analysis, and statistical significance testing, the proposed approach demonstrated superior accuracy, stability, scalability, and computational efficiency, confirming its effectiveness for intelligent load balancing and demand forecasting in modern smart grid systems.
The quantitative evaluation consistently demonstrates the clear superiority of the proposed Transformer–GAN–GRU framework across all datasets and performance dimensions. The model achieves approximately 99.49% accuracy, 99.67% recall, and 99.83% AUC on residential-level data, 99.08% accuracy and 99.71% AUC on appliance-level data, and 99.02% accuracy with 99.63% AUC on system-scale grid data, surpassing the strongest baseline models by roughly 8–10% in accuracy and more than 8% in AUC. The ROC behavior confirms strong discriminative capability, with high sensitivity maintained even under stricter decision thresholds. In addition to predictive strength, the framework exhibits rapid convergence, reaching strict RMSE targets several times faster than competing architectures, while maintaining low inference latency (around 8–9 ms), which is suitable for near real-time deployment. The extremely low variance across repeated runs further indicates strong training stability and reproducibility, and statistical hypothesis testing confirms that the observed improvements are highly significant (p-values well below 0.01). Together, these numerical and statistical findings validate the robustness, efficiency, scalability, and practical applicability of the proposed architecture for intelligent load balancing and demand forecasting in smart grid environments.
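The paired t-test behind these significance claims reduces to a simple statistic over per-run score differences; the run scores below are illustrative stand-ins, not the experiment's raw outputs.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """t statistic for paired samples: per-run scores of two models
    evaluated on the same splits. With n - 1 degrees of freedom, a
    large |t| corresponds to a small p-value."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(n))

# Illustrative per-run accuracies for two models (not the reported data).
proposed = [99.49, 99.47, 99.52, 99.48, 99.50]
baseline = [91.12, 91.30, 90.95, 91.21, 91.08]
t = paired_t_statistic(proposed, baseline)
```

When the per-run variance of the stronger model is as small as reported, even a modest mean gap produces a very large |t|, which is why the p-values fall far below the 0.01 threshold.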
The results of this paper demonstrate that intelligent load balancing and demand forecasting in modern smart power grids require a multi-level modeling strategy capable of simultaneously handling distribution imbalance, long-range temporal dependencies, and short-term dynamic fluctuations. The proposed Transformer–GAN–GRU framework confirms that integrating generative augmentation, attention-based contextual encoding, and gated sequential refinement leads to substantial improvements in predictive accuracy, robustness, and convergence efficiency. From a modeling perspective, the study shows that addressing data-level imbalance through GAN, capturing global structural patterns through Transformer, and refining local transitions via GRU is not merely an architectural combination, but a structured decomposition of the intrinsic characteristics of smart grid demand behavior. This layered design directly aligns with the statistical and temporal complexity observed in real-world energy datasets and validates the necessity of hybrid architectures for high-fidelity forecasting tasks.
Despite the strong performance of the proposed framework, several practical limitations should be acknowledged. First, the model relies on offline training using historical datasets, and its performance under real-time streaming conditions or rapidly evolving grid dynamics has not been explicitly validated. Second, although multiple heterogeneous datasets are used, cross-dataset transfer learning is not explored, which may limit adaptability under unseen distribution shifts. Third, the current framework focuses on point forecasting and classification-based decision support, without explicitly modeling prediction uncertainty, which can be important for risk-aware grid operation. From a practical deployment perspective, the proposed model is well-suited for scenarios where high prediction accuracy and fast convergence are critical, such as real-time load balancing and demand monitoring systems. However, integration into operational smart grid environments would require additional considerations, including continuous model updating, handling of streaming data, and integration with control and optimization modules. Addressing these aspects will be essential for transitioning the proposed approach from simulation-based evaluation to real-world deployment.
From the perspective of the smart grid problem itself, the findings indicate that reliable load forecasting cannot rely solely on high accuracy metrics but must also ensure stability, scalability, and operational feasibility. The proposed model achieves not only superior numerical performance but also low variance, rapid convergence, statistical significance, and real-time inference capability, making it suitable for deployment in adaptive energy management systems. The consistent performance across residential, appliance-level, and grid-scale datasets suggests strong generalizability to heterogeneous operational environments. Practically, this implies that advanced hybrid deep learning frameworks can enhance grid resilience, improve resource allocation efficiency, and reduce operational uncertainty in dynamic power systems. Collectively, the study supports the conclusion that integrated attention–generative–recurrent modeling constitutes a viable and effective paradigm for next-generation intelligent smart grid control and decision support systems.
Future work can extend the proposed Transformer–GAN–GRU framework in several meaningful directions. One potential enhancement involves incorporating real-time streaming adaptation mechanisms, allowing the model to update dynamically as new load patterns emerge in evolving smart grid environments. Additionally, integrating external contextual variables such as weather forecasts, market pricing signals, and distributed renewable generation data could further improve predictive precision and operational relevance. From an architectural perspective, exploring lightweight transformer variants or adaptive attention mechanisms may reduce computational overhead while preserving performance, facilitating large-scale deployment in edge or microgrid systems. Furthermore, extending the framework toward multi-task learning (such as jointly performing demand forecasting, anomaly detection, and load classification) could enhance system intelligence and resource optimization in fully autonomous energy management platforms.
Although the proposed model is evaluated across multiple heterogeneous datasets, it is important to note that cross-dataset transfer validation (i.e., training on one dataset and testing on another) is not explicitly investigated in this study. This is primarily due to inherent differences in data distributions, temporal resolutions, and feature characteristics across the datasets considered (Pecan Street, REDD, and RTS-GMLC), which represent distinct operational scenarios ranging from residential to grid-level environments. While the consistent performance of the proposed model across these datasets demonstrates its adaptability and generalization capability within each domain, cross-domain transfer learning introduces additional challenges such as distribution shift and feature misalignment, which typically require dedicated domain adaptation or normalization strategies. Therefore, investigating cross-dataset transfer performance and incorporating domain adaptation mechanisms is an important direction for future work to further enhance the generalization capability of the proposed framework.

Author Contributions

Conceptualization, A.L., E.G., A.V., D.M. and F.H.-G.; methodology, A.L., E.G. and D.M.; software, A.L. and F.H.-G.; validation, A.L., E.G. and D.M.; formal analysis, A.L. and F.H.-G.; investigation, A.L., E.G. and A.V.; resources, D.M.; data curation, A.V.; writing—original draft preparation, A.L., E.G., A.V., D.M. and F.H.-G.; writing—review and editing, A.L., E.G., A.V., D.M. and F.H.-G.; visualization, E.G. and A.V.; supervision, D.M.; project administration, D.M.; funding acquisition, D.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Workflow of the proposed model for intelligent load balancing and demand forecasting.
Figure 2. Architecture of the Transformer encoder.
Figure 3. Architecture of the GAN.
Figure 4. Sequential architecture of the GRU-based temporal prediction module.
Figure 5. Internal gating operations of a standard GRU cell.
Figure 6. Integrated architecture of the proposed Transformer–GAN–GRU framework.
Figure 7. Bar-chart comparison of models on: (a) Pecan Street; (b) REDD; (c) RTS-GMLC datasets.
Figure 8. ROC curve analysis of models on: (a) Pecan Street; (b) REDD; (c) RTS-GMLC datasets.
Figure 9. Training convergence curves of RMSE on the Pecan Street dataset.
Figure 10. Training convergence curves of RMSE on the REDD dataset.
Figure 11. Training convergence curves of RMSE on the RTS-GMLC dataset.
Table 1. Hyper-parameter configurations for the proposed algorithms.

| Model | Parameter | Value |
|---|---|---|
| Transformer-GAN-GRU | Learning rate | 0.003 |
| | Batch size | 32 |
| | Feed-forward hidden size | 2048 |
| | Weight decay | 0.01 |
| | Dropout rate | 0.2 |
| | Number of attention heads | 4 |
| | Number of encoder layers | 4 |
| | Sequence length | 5 |
| | Number of GRU layers | 2 |
| | Momentum term | 0.05 |
| | Convergence threshold | 0.066 |
| | Activation function | GELU, tanh |
| | Optimizer | Adam |
| BERT | Learning rate | 0.004 |
| | Batch size | 64 |
| | Dropout rate | 0.2 |
| | Self-attention heads per layer | 6 |
| | Transformer encoder layers | 6 |
| | Activation function | GELU |
| | Optimizer | Adam |
| DBN | Momentum | 0.5 |
| | Number of hidden layers | 6 |
| | Learning rate | 0.006 |
| | Activation function | Tanh, sigmoid |
| | Optimizer | Adam |
| LSTM | Learning rate | 0.006 |
| | Batch size | 64 |
| | Sequence length | 6 |
| | Activation function | Tanh, sigmoid |
| | Number of hidden layers | 3 |
| | Optimizer | Adam |
| DCNN | Number of convolution layers | 7 |
| | Kernel size | 5 × 5 |
| | Pooling type | Max pooling (2 × 2) |
| | Number of hidden layers | 4 |
| | Activation function | Tanh |
| | Optimizer | Adam |
Table 2. Quantitative performance comparison of the models for intelligent load forecasting (Accuracy / Recall / AUC, %).

| Model | Pecan Street | REDD | RTS-GMLC |
|---|---|---|---|
| Transformer-GAN-GRU | 99.49 / 99.67 / 99.83 | 99.08 / 99.44 / 99.71 | 99.02 / 99.32 / 99.63 |
| BERT | 91.12 / 91.93 / 92.65 | 90.43 / 91.08 / 92.13 | 89.44 / 90.09 / 91.28 |
| Transformer | 90.35 / 91.23 / 92.08 | 89.17 / 90.05 / 91.77 | 88.61 / 89.61 / 90.08 |
| DBN | 89.66 / 90.44 / 91.18 | 88.65 / 89.67 / 90.08 | 87.24 / 88.50 / 89.37 |
| GAN | 88.49 / 89.35 / 90.62 | 88.79 / 89.15 / 90.63 | 86.59 / 87.81 / 88.55 |
| LSTM | 87.68 / 88.74 / 89.91 | 86.76 / 87.14 / 88.52 | 85.37 / 86.34 / 87.31 |
| GRU | 87.32 / 88.30 / 89.65 | 86.08 / 87.64 / 88.91 | 84.19 / 85.92 / 86.89 |
| DCNN | 86.51 / 87.09 / 88.16 | 85.17 / 86.91 / 87.32 | 83.18 / 84.09 / 85.64 |
Table 3. Ablation study of Transformer-GAN-GRU components (Accuracy / Recall / AUC, %).

| Model | Pecan Street | REDD | RTS-GMLC |
|---|---|---|---|
| Transformer-GAN-GRU | 99.49 / 99.67 / 99.83 | 99.08 / 99.44 / 99.71 | 99.02 / 99.32 / 99.63 |
| Transformer-GAN | 95.09 / 95.86 / 96.24 | 94.11 / 95.08 / 96.91 | 93.12 / 94.48 / 95.80 |
| Transformer-GRU | 94.32 / 95.51 / 96.05 | 93.12 / 94.15 / 95.62 | 92.20 / 93.47 / 94.68 |
| GAN-GRU | 93.48 / 94.55 / 95.83 | 92.37 / 93.66 / 94.62 | 91.92 / 92.67 / 93.83 |
| Transformer | 90.35 / 91.23 / 92.08 | 89.17 / 90.05 / 91.77 | 88.61 / 89.61 / 90.08 |
| GAN | 88.49 / 89.35 / 90.62 | 88.79 / 89.15 / 90.63 | 86.59 / 87.81 / 88.55 |
| GRU | 87.32 / 88.30 / 89.65 | 86.08 / 87.64 / 88.91 | 84.19 / 85.92 / 86.89 |
Table 4. Run time (s) required by models to reach predefined RMSE on the Pecan Street dataset.

| Model | RMSE < 15 | RMSE < 10 | RMSE < 5 | RMSE < 2.5 |
|---|---|---|---|---|
| Transformer-GAN-GRU | 12 | 32 | 84 | 127 |
| BERT | 91 | 218 | 479 | - |
| Transformer | 88 | 241 | 529 | - |
| DBN | 110 | 348 | 681 | - |
| GAN | 146 | 368 | - | - |
| LSTM | 174 | 425 | - | - |
| GRU | 163 | 463 | - | - |
| DCNN | 191 | 528 | - | - |
Table 5. Run time (s) required by models to reach predefined RMSE on the REDD dataset.

| Model | RMSE < 15 | RMSE < 10 | RMSE < 5 | RMSE < 2.5 |
|---|---|---|---|---|
| Transformer-GAN-GRU | 17 | 38 | 95 | 143 |
| BERT | 115 | 262 | 579 | - |
| Transformer | 136 | 361 | 718 | - |
| DBN | 174 | 392 | - | - |
| GAN | 189 | 401 | - | - |
| LSTM | 205 | 493 | - | - |
| GRU | 218 | 483 | - | - |
| DCNN | 224 | 628 | - | - |
Table 6. Run time (s) required by models to reach predefined RMSE on the RTS-GMLC dataset.

| Model | RMSE < 15 | RMSE < 10 | RMSE < 5 | RMSE < 2.5 |
|---|---|---|---|---|
| Transformer-GAN-GRU | 21 | 48 | 117 | 173 |
| BERT | 135 | 293 | 779 | - |
| Transformer | 187 | 428 | - | - |
| DBN | 205 | 448 | - | - |
| GAN | 253 | 493 | - | - |
| LSTM | 289 | 583 | - | - |
| GRU | 305 | 588 | - | - |
| DCNN | 301 | 738 | - | - |
Table 7. Inference latency (ms) comparison of the Transformer–GAN–GRU and baseline models.

| Model | Pecan Street | REDD | RTS-GMLC |
|---|---|---|---|
| Transformer-GAN-GRU | 8.2 | 8.4 | 8.6 |
| BERT | 7.7 | 7.9 | 8.2 |
| Transformer | 7.1 | 7.3 | 7.4 |
| DBN | 7.5 | 7.8 | 8.1 |
| GAN | 6.9 | 7.1 | 7.2 |
| LSTM | 8.1 | 8.3 | 8.4 |
| GRU | 7.3 | 7.5 | 7.8 |
| DCNN | 6.8 | 6.9 | 7.1 |
Table 8. Variance of proposed models across 35 independent runs.

| Model | Pecan Street | REDD | RTS-GMLC |
|---|---|---|---|
| Transformer–GAN–GRU | 0.00023 | 0.00041 | 0.00062 |
| BERT | 1.81456 | 2.96325 | 3.14596 |
| Transformer | 3.32145 | 4.21458 | 5.32105 |
| DBN | 4.85632 | 5.32146 | 6.25103 |
| GAN | 5.21453 | 6.05236 | 7.36524 |
| LSTM | 6.97856 | 7.95236 | 8.96532 |
| GRU | 7.32156 | 8.01254 | 9.02365 |
| DCNN | 8.01452 | 9.02541 | 39.98546 |
Table 9. Statistical t-test results of proposed models at a 0.01 significance level (p-values; all comparisons significant on all three datasets).

| Comparison | Pecan Street | REDD | RTS-GMLC |
|---|---|---|---|
| Transformer–GAN–GRU vs. BERT | 0.0007 | 0.0008 | 0.0009 |
| Transformer–GAN–GRU vs. Transformer | 0.0006 | 0.0007 | 0.0008 |
| Transformer–GAN–GRU vs. DBN | 0.0003 | 0.0005 | 0.0007 |
| Transformer–GAN–GRU vs. GAN | 0.0002 | 0.0003 | 0.0005 |
| Transformer–GAN–GRU vs. LSTM | 0.00003 | 0.00005 | 0.00006 |
| Transformer–GAN–GRU vs. GRU | 0.00002 | 0.00004 | 0.00005 |
| Transformer–GAN–GRU vs. DCNN | 0.00001 | 0.00002 | 0.00003 |
Table 10. Hyperparameter sensitivity analysis of the proposed model on the Pecan Street dataset.

| Parameter | Tested Value | Acc (%) | Recall (%) | AUC (%) |
|---|---|---|---|---|
| GAN augmentation ratio | 10% | 99.11 | 99.28 | 99.47 |
| | 20% | 99.32 | 99.51 | 99.72 |
| | 30% | 99.49 | 99.67 | 99.83 |
| | 40% | 99.25 | 99.43 | 99.60 |
| Learning rate | 0.0005 | 98.91 | 99.08 | 99.21 |
| | 0.001 | 99.21 | 99.38 | 99.54 |
| | 0.003 | 99.49 | 99.67 | 99.83 |
| | 0.005 | 99.18 | 99.31 | 99.46 |
| Attention heads | 2 | 99.07 | 99.22 | 99.39 |
| | 4 | 99.31 | 99.49 | 99.68 |
| | 8 | 99.49 | 99.67 | 99.83 |
| | 12 | 99.27 | 99.44 | 99.59 |
| GRU hidden units | 32 | 99.02 | 99.16 | 99.34 |
| | 64 | 99.24 | 99.41 | 99.61 |
| | 128 | 99.49 | 99.67 | 99.83 |
| | 256 | 99.28 | 99.69 | 99.81 |
Share and Cite

Larijani, A.; Ghafourian, E.; Vaziri, A.; Martín, D.; Hernando-Gallego, F. A Hybrid Transformer-Generative Adversarial Network-Gated Recurrent Unit Model for Intelligent Load Balancing and Demand Forecasting in Smart Power Grids. Electronics 2026, 15, 1579. https://doi.org/10.3390/electronics15081579
