BiGTA-Net: A Hybrid Deep Learning-Based Electrical Energy Forecasting Model for Building Energy Management Systems

Abstract: The growth of urban areas and the management of energy resources highlight the need for precise short-term load forecasting (STLF) in energy management systems to improve economic gains and reduce peak energy usage. Traditional deep learning models for STLF present challenges in addressing these demands efficiently due to their limitations in modeling complex temporal dependencies and processing large amounts of data. This study presents a groundbreaking hybrid deep learning model, BiGTA-net, which integrates a bi-directional gated recurrent unit (Bi-GRU), a temporal convolutional network (TCN), and an attention mechanism. Designed explicitly for day-ahead 24-point multistep-ahead building electricity consumption forecasting, BiGTA-net undergoes rigorous testing against diverse neural networks and activation functions. Its performance is marked by the lowest mean absolute percentage error (MAPE) of 5.37% and a root mean squared error (RMSE) of 171.3 on an educational building dataset. Furthermore, it exhibits flexibility and competitive accuracy on the Appliances Energy Prediction (AEP) dataset. Compared to traditional deep learning models, BiGTA-net reports a remarkable average improvement of approximately 36.9% in MAPE. This advancement emphasizes the model's significant contribution to energy management and load forecasting, accentuating the efficacy of the proposed hybrid approach in power system optimizations and smart city energy enhancements.


Introduction
The rapid influx of populations into urban areas presents many challenges, ranging from resource constraints to heightened traffic and escalating greenhouse gas (GHG) emissions [1]. Many cities globally are transitioning towards 'smart cities' to handle these multifaceted urban issues efficiently [2]. At its core, a smart city aims to enhance its inhabitants' efficiency, safety, and living standards [3]. For example, smart cities tackle GHG emissions by reducing traffic congestion, optimizing energy usage, and incorporating alternatives, such as electric vehicles, energy storage systems (ESSs), and sustainable energy sources [4]. A significant portion of urban GHG emissions is attributed to building electricity consumption, which powers essential systems and amenities such as heating, domestic hot water (DHW), ventilation, lighting, and various electronic appliances [5]. Thus, advancing energy efficiency in urban buildings, especially through the integration of energy storage and renewable energy sources, is paramount. Recognizing this, many smart city designs have embraced integrated systems such as building energy management systems (BEMSs) to boost the energy efficiency of existing infrastructure [6].
Deep learning (DL) models stand out due to their learning capabilities and generalization ability [13]. Understanding the characteristics of building electricity consumption data, including time-based [23] and spatial features [24], is essential for efficiently applying these DL models. Despite these sophisticated techniques, the models often deliver unreliable, low-accuracy forecasts [25]. They struggle with several challenges, such as short-term memory limitations, overfitting, learning from scratch, and capturing complex variable correlations. Some researchers have investigated hybrid models to overcome these challenges, as single models frequently have difficulty learning time-based and spatial features simultaneously [13].
Building upon the aforementioned advances in forecasting techniques, a multitude of studies, including those mentioned above, have delved into the realm of STLF. These collective efforts spanning several years are comprehensively summarized in Table 1. Taking a leaf from hybrid model designs, Aksan et al. [26] introduced models that combined variational mode decomposition (VMD) with DL models, such as CNNs and RNNs. Their models, VMD-CNN-long short-term memory (LSTM) and VMD-CNN-gated recurrent unit (GRU), showcased versatility, adeptly managing seasonal and daily energy consumption variations. Wang et al. [27], in their recent endeavors, proposed a wavelet transform neural network that uniquely integrates time- and frequency-domain information for load forecasting. Their model leveraged three cutting-edge wavelet transform techniques, encompassing VMD, empirical mode decomposition (EMD), and empirical wavelet transform (EWT), presenting a comprehensive approach to forecasting. Zhang et al. [28] emphasized the indispensable role of STLF in modern power systems. They introduced a hybrid model that combined EWT with bidirectional LSTM. Moreover, their model integrated a Bayesian hyperparameter optimization algorithm, refining the forecasting process. Saoud et al. [29] ventured into wind speed forecasting and introduced a model that amalgamated the stationary wavelet transform with quaternion-valued neural networks, marking a significant stride in renewable energy forecasting. Kim et al. [30] seamlessly merged the strengths of RNN and one-dimensional (1D)-CNN for STLF, targeting the refinement of prediction accuracy. They adjusted the hidden state vector values to better reflect time points close to the prediction time, showing a marked evolution in prediction approaches. Jung et al. [31] delved into attention mechanisms (Att) with their Att-GRU model for STLF. Their approach was particularly noteworthy for adeptly managing sudden shifts in power consumption patterns. Zhu et al. [32] showcased an advanced LSTM-based dual-attention model, meticulously considering the myriad of influencing factors and the effects of time nodes on STLF. Liao et al. [33], with their innovative fusion of LSTM and a time pattern attention mechanism, augmented STLF methodologies, emphasizing feature extraction and model versatility. By incorporating external factors, their comprehensive approach improved feature extraction and demonstrated superior performance compared to existing methods. While effective, their model did not capitalize on the strengths of hybrid DL models, such as GRU and temporal convolutional network (TCN) combinations, which can handle both long-term dependencies and varying input sequence lengths [34].
BiGTA-net is introduced as a novel hybrid DL model that seamlessly integrates the strengths of a bi-directional gated recurrent unit (Bi-GRU), a temporal convolutional network (TCN), and an attention mechanism. These components collectively address the persistent challenges encountered in STLF. Conventional DL models often struggle to capture intricate nonlinear dependencies. In contrast, the amalgamation within the proposed model represents an innovative approach for capturing long-term data dependencies and effectively handling diverse input sequences. Moreover, the incorporation of the attention mechanism within BiGTA-net optimizes the weighting of features, thereby enhancing predictive accuracy. This research establishes its unique contribution within the energy management and load forecasting domains, which can be attributed to the following key contributions:

• BiGTA-net emerges as a pioneering hybrid DL model designed to enhance day-ahead forecasting within power system operation, prioritizing accuracy.

• The experimental framework employed for testing BiGTA-net's capabilities is strategically devised, showcasing its adaptability and resilience across different models and configurations.

• Utilizing data sourced from a range of building types, the approach employed in this study establishes the widespread applicability and adaptability of BiGTA-net across diverse consumption scenarios.
The structure of this paper is outlined as follows: Section 2 elaborates on the configuration of input variables that are crucial to the STLF model and discusses the proposed hybrid deep learning model, BiGTA-net. In Section 3, the performance of the model is thoroughly examined through extensive experimentation. Finally, Section 4 encapsulates the findings and provides an overview of the study.

Materials and Methods
This section provides an in-depth exploration of the meticulous processes utilized to structure the datasets, configure the models, and assess their performance. Serving as an initial reference, Figure 1 displays a block schema that visually encapsulates the progression of the approach from raw datasets to performance evaluation. This schematic illustration is essential in providing readers with a comprehensive perspective of the methodological steps, emphasizing critical inputs, outputs, and incorporated innovations.

Data Preprocessing
This section explains the procedure undertaken to identify crucial input variables necessary for shaping the STLF model. Central to this study is the forecast of the day-ahead hourly electricity load. This forecasting holds immense significance, primarily due to its role as a foundational element in the planning and optimization of power system operations for the upcoming day [35]. These forecasts contribute to the following aspects:

• Demand Response: An approach centered on adjusting electricity consumption patterns rather than altering the power supply. This method ensures the power system can cater to fluctuating demands without overextending its resources.

• ESS Scheduling: This entails critical decisions on when to conserve energy in storage systems and when to discharge it. Effective scheduling ensures optimal stored energy utilization, aligning with predicted demand peaks and troughs.


• Renewable Electricity Production: Anticipating the forthcoming electricity demand facilitates strategic planning for harnessing renewable sources. It ensures renewable sources are optimally utilized, considering their intermittent nature.
The study explores two distinct datasets that represent divergent building types, contributing to a comprehensive understanding of power consumption patterns and enhancing the formulation of the model. The first dataset originates from Sejong University, exemplifying educational institutions [36]. In contrast, the Appliances Energy Prediction (AEP) dataset represents residential buildings [37]. The objective was to enhance the precision of the STLF model by incorporating insights from these datasets, ensuring its adaptability to various electricity consumption scenarios.
Sejong University employed the Power Planner tool, which generates electricity usage statistics, to optimize electricity consumption. These statistics include predicted bills, electricity consumption, and load pattern analysis. Five years' worth of hourly electricity consumption data, spanning from March 2015 to February 2021, were compiled using this tool. From the collected dataset, approximately 0.006% of time points (equivalent to 275 instances) contained missing values, which were imputed based on prior research on handling missing electricity consumption data. Conversely, the publicly available AEP dataset provides residential electricity consumption data at 5 min intervals. To align with the study's objective of predicting day-ahead hourly electricity consumption, this dataset was resampled at 1 h intervals.
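The 5 min to 1 h resampling step can be sketched with pandas; the column name and the constant readings below are illustrative stand-ins, not values from the AEP files:

```python
import numpy as np
import pandas as pd

# Hypothetical 5-min appliance readings (the real AEP file reports Wh per interval).
idx = pd.date_range("2016-01-11", periods=24 * 12, freq="5min")
df = pd.DataFrame({"Appliances": np.full(len(idx), 10.0)}, index=idx)

# Aggregate to hourly totals to match the day-ahead hourly forecasting target.
hourly = df["Appliances"].resample("60min").sum()
```

Summing (rather than averaging) preserves total energy per hour when each row is an interval total.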
Details of the building electricity consumption, including statistical analysis, data collection periods, and building locations, are presented in Table 2, while Figure 2 illustrates the electricity consumption distribution through a histogram. Figures 3 and 4 illustrate boxplots representing the hourly electricity consumption. Figure 3 presents the consumption data segmented by hours for two datasets: the educational building dataset (Figure 3a) and the AEP dataset (Figure 3b). Similarly, Figure 4 provides boxplots of the same consumption data, segmented by days of the week, again for the educational building dataset (Figure 4a) and the AEP dataset (Figure 4b). The minimum and maximum values in Table 2 are omitted due to university privacy concerns. Analysis of Figure 3 revealed a clear distinction in electricity consumption during work hours and non-work hours for both datasets. While the educational building dataset showed a noticeable variation in electricity consumption between weekdays and weekends, the AEP dataset did not show such a clear distinction.

Timestamp Information
The study considered a spectrum of external and internal factors in determining the input variables. Among the external factors, timestamps and weather details held significance. These timestamp details encompass the month, hour, day of the week, and holiday indicators. Such details are crucial as they elucidate diverse electricity consumption patterns within buildings. For example, hourly electricity consumption can vary based on customary working hours, mealtime tendencies, and other factors. Similarly, distinct days of the week and holiday indicators can provide insights into contrasting consumption patterns, particularly when contrasting workdays with weekends.
A significant challenge emerges when considering time-related data: the disparity in representing cyclical time data. Specifically, within the hourly context, the difference between 11 p.m. and midnight, though consecutive hours, is illustrated as a substantial gap of 23 units. To address such disparities and effectively capture the cyclic essence of these variables with their inherent sequential structure, a two-dimensional projection was utilized. Equations (1) through (4) were employed to transition from representing these categorical variables in one-dimensional space to depicting them as continuous variables in two-dimensional space [30]:

Hour_x = sin(360°/24 × Hour), (1)

Hour_y = cos(360°/24 × Hour). (2)

For the day of the week (DOTW) component, considering the ISO 8601 standard where Monday is denoted as one and Sunday as seven, a similar challenge emerges, with a numerical gap of six between Sunday and Monday. This numerical gap can be addressed with the following equations:

DOTW_x = sin(360°/7 × DOTW), (3)

DOTW_y = cos(360°/7 × DOTW), (4)

where the x and y subscripts in Equations (1) to (4) indicate the two-dimensional coordinates to represent the cyclical nature of hours and days of the week. The transformation to a two-dimensional space allows for a more natural representation of cyclical time data, reducing potential discontinuities. Beyond these considerations, the analysis also encompassed the integration of holiday indicators [36]. These indicators, denoting weekends and national holidays, were represented as binary variables: '1' indicated a date falling on either a holiday or a weekend, while '0' indicated a typical weekday. By incorporating these indicators, the aim was to account for the evident influence of holidays and weekends on electricity consumption patterns. Notably, the month within a year significantly affects these patterns. However, due to constraints posed by the AEP dataset, which provides data for only a single year, the incorporation of monthly variations was not feasible. As a result, monthly data were not included in the analysis for the AEP dataset.
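Equations (1)-(4) amount to mapping each cyclical variable onto the unit circle. A small sketch (the helper name is ours) shows how the encoding removes the artificial 23-unit gap between 11 p.m. and midnight:

```python
import numpy as np

def cyclical_encode(value, period):
    """Project a cyclical variable (hour, day of week) onto the unit circle:
    x = sin(360°/period × value), y = cos(360°/period × value)."""
    angle = 2.0 * np.pi * value / period
    return np.sin(angle), np.cos(angle)

# Hour 23 and hour 0 become adjacent points on the circle,
# unlike on the raw 0-23 scale where they are 23 units apart.
x23, y23 = cyclical_encode(23, 24)
x0, y0 = cyclical_encode(0, 24)
dist = np.hypot(x23 - x0, y23 - y0)  # chord length for a 15° step
```

The same helper with `period=7` yields the DOTW encoding of Equations (3) and (4).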

Climate Data
Climate conditions exert a notable influence on STLF, primarily attributed to their integral role within the operational dynamics of high-energy-consuming devices. This influence extends to heating and cooling systems, whose operational patterns align closely with fluctuations in weather conditions [38]. The AEP dataset encompasses six distinct weather variables: temperature, humidity, wind velocity, atmospheric pressure, visibility, and dew point. Conversely, the Korea Meteorological Administration (KMA) offers a comprehensive collection of weather forecast data for each region in South Korea. These data include a range of variables, such as climate observations, forecasts for rainfall likelihood and quantity, peak and trough daily temperatures, wind metrics, and humidity levels [39]. To heighten the real-world applicability of the method, the primary input variables were selectively chosen as temperature, humidity, and wind velocity. This selection was motivated by two factors: firstly, these variables are present both in the AEP dataset and in KMA's forecasts; secondly, their well-documented strong correlation with power consumption patterns supports their significance [30].
The data reservoir was populated through the automated synoptic observing system of the Korea Meteorological Administration (KMA), maintained by the Seoul Meteorological Observatory. This observatory is located within a mere 10 km of the Sejong University campus. The objective was to contextualize the climatic variables with the environmental conditions of the university's academic buildings. To bridge the gap between raw climatic data and its tangible influence on electricity consumption, namely the human perceptual experience of temperature fluctuations, two distinct indices were extrapolated. The temperature-humidity index (THI) [40], colloquially known as the discomfort index, provides insights into the perceived discomfort caused by the summer heat, thereby influencing the use of cooling systems. Conversely, the wind chill temperature (WCT) [41] encapsulates the chilling effect of winter weather, often prompting the activation of heating appliances. These perceptual aspects are formulated mathematically in Equations (5) and (6), respectively, where Temp, Humi, and WS represent temperature, humidity, and wind speed.
Drawing from the feedback loop between temperature, humidity, and the human body's thermoregulation, Equation (5) for THI has been crafted. Its constants (1.8, 32, 0.55, 0.0055, and 26) are the outcome of rigorous empirical studies that evaluated human discomfort across a spectrum of temperature and humidity gradients [40]:

THI = 1.8 × Temp + 32 − (0.55 − 0.0055 × Humi) × (1.8 × Temp − 26). (5)

WCT = 13.12 + 0.6215 × Temp − 11.37 × WS^0.16 + 0.3965 × Temp × WS^0.16. (6)

The derivation of Equation (6) for the wind chill temperature (WCT) is grounded in a model that seeks to quantify the perceived decrease in ambient temperature due to wind speed, particularly in colder regions. The constants incorporated within the equation (13.12, 0.6215, 11.37, and 0.3965), as well as the exponent 0.16, trace their origins to comprehensive field experiments conducted across various weather conditions. These experiments were designed to establish a comprehensive model for human tactile perception of cold induced by wind [41]. Taking these considerations into account, the analysis encompassed a set of ten external determinants that were carefully selected as input variables for the model's training process.
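Equations (5) and (6) translate directly into code. The helper names are ours; we assume Temp in °C, Humi in percent, and WS in km/h, as in the cited wind-chill model:

```python
def thi(temp_c, humidity_pct):
    """Temperature-humidity (discomfort) index, Equation (5)."""
    return (1.8 * temp_c + 32
            - (0.55 - 0.0055 * humidity_pct) * (1.8 * temp_c - 26))

def wct(temp_c, wind_speed_kmh):
    """Wind chill temperature, Equation (6); wind speed assumed in km/h."""
    return (13.12 + 0.6215 * temp_c
            - 11.37 * wind_speed_kmh ** 0.16
            + 0.3965 * temp_c * wind_speed_kmh ** 0.16)
```

For example, a humid 30 °C summer day (70% humidity) yields a THI above the discomfort threshold, while 0 °C with a 10 km/h wind feels a few degrees below freezing.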

Past Power Consumption
Past power consumption data were treated as internal factors, as they capture recent patterns in electricity usage [31]. Data from the same time point one day and one week prior were utilized. Power consumption data from the preceding day could provide insight into the most recent hourly trends, while power consumption data from the preceding week could capture the most recent weekly patterns [21]. Given the potential variation in power usage patterns between regular days and holidays, holiday indicators were also integrated for both types of power consumption [36].
Furthermore, an innovative inclusion was made of a past electricity usage value as an internal factor, effectively capturing the trend in electricity consumption leading up to the prediction time point over a span of one week [36]. To achieve this, two distinct scenarios, illustrated in Figure 5, were considered. In the first scenario, if the prediction time point fell on a regular day, the mean electricity consumption of the preceding seven regular days was computed. In the second scenario, if the prediction time point corresponded to a holiday, the average electricity consumption of the preceding seven holidays was calculated. As a result, five internal factors were incorporated for model training, and a comprehensive list of all input variables and their respective details can be found in Table 3.
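The two-scenario feature above can be sketched as a trailing mean over the preceding days of the same type. This is a simplified, unoptimized illustration; the column names and `n_days` default are ours:

```python
import numpy as np
import pandas as pd

def trailing_same_type_mean(load, is_holiday, hour, n_days=7):
    """For each row, the mean load at the same hour over the preceding
    n_days days of the same type (regular day vs. holiday), as in Figure 5.
    Returns NaN until enough same-type history has accumulated."""
    df = pd.DataFrame({"load": load, "holiday": is_holiday, "hour": hour})
    out = np.full(len(df), np.nan)
    for i in range(len(df)):
        # Only strictly earlier rows with the same day type and same hour.
        mask = (df["holiday"][:i] == df["holiday"][i]) & (df["hour"][:i] == df["hour"][i])
        prior = df["load"][:i][mask].tail(n_days)
        if len(prior) == n_days:
            out[i] = prior.mean()
    return out

# Ten regular days at the same hour, with loads 1..10.
feat = trailing_same_type_mean(list(range(1, 11)), [0] * 10, [12] * 10)
```

With fewer than seven same-type days of history, the feature is undefined; from the eighth day onward it is the mean of the previous seven loads.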

BiGTA-Net Modeling
The BiGTA-net model, illustrated in Figure 6, presents a meticulously crafted hybrid architecture that adeptly merges the advantages of both Bi-GRU and TCN, effectively transcending their respective limitations. The primary objective is to formulate a three-stage prediction model that systematically enhances predictive accuracy by harnessing the inherent strengths of these constituent components. To achieve this objective, a significant attention mechanism is seamlessly integrated to facilitate the harmonious fusion of Bi-GRU and TCN. This orchestrated synergy serves the purpose of constructing a predictive model for building electricity consumption that boasts high accuracy and encompasses multiple stages of prediction refinement. For an in-depth comprehension of the theoretical foundations underpinning Bi-GRU and TCN, readers are referred to Appendix A, which provides comprehensive details. This supplementary resource offers a thorough exploration of the conceptual underpinnings, operational principles, and pertinent prior research pertaining to these two pivotal elements within the model.


Bidirectional Gated Recurrent Unit
The modeling journey commences with the Bi-GRU, an advancement over traditional RNNs designed to excel in processing sequential time-series data. While conventional RNNs are recognized for their capability to recall historical sequences, they have encountered challenges such as gradient vanishing and exploding. To address these challenges, the GRU was introduced, incorporating specialized gating mechanisms to effectively manage long-term data dependencies [42]. Within the architecture, two distinct GRUs, forward and backward, are integrated to compose the Bi-GRU, enabling a comprehensive analysis of sequence dynamics [43]. Despite its computationally demanding dual-structured design, this two-pronged approach empowers the model to discern intricate temporal patterns. For an in-depth comprehension of the mathematical intricacies underpinning the Bi-GRU's design, readers are referred to the extensive elaboration in the Keras official documentation [44].
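In Keras, such a bidirectional layer wraps two GRUs run in opposite directions and concatenates their per-step outputs. The layer sizes and input shape below are illustrative, not the paper's configuration:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative shapes: 24 time steps, 15 input features, 32 units per direction.
n_steps, n_features = 24, 15
inputs = keras.Input(shape=(n_steps, n_features))
# Bidirectional concatenates forward and backward GRU outputs (32 + 32 = 64).
x = layers.Bidirectional(layers.GRU(32, return_sequences=True))(inputs)
model = keras.Model(inputs, x)

out = model.predict(np.zeros((2, n_steps, n_features)), verbose=0)
```

With `return_sequences=True`, the full per-step output can feed a downstream TCN block, as in the architecture described above.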

Temporal Convolutional Network
The TCN emerges as a groundbreaking solution tailored explicitly for time-series data processing, offering a countermeasure to challenges encountered by sequential models such as the Bi-GRU. The TCN employs causal convolutions at its core, ensuring predictions rely solely on current and past data, preserving the temporal sequence's integrity [45]. A defining characteristic of TCNs is their adeptness in capturing long-term patterns through dilated convolutions. These convolutions expand the network's receptive field by introducing fixed steps between neighboring filter taps, enhancing computational efficiency while capturing extended dependencies [46]. The TCN architecture also incorporates residual blocks, addressing the vanishing gradient problem and ensuring stable learning across layers. The TCN's adaptability to varying sequence lengths and seamless integration with Bi-GRU outputs form a hierarchical structure that boosts computational efficiency and learning potential. However, the TCN's lack of inherent consideration for future data points can impact tasks with significant forward-looking dependencies.
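The causal, dilated behavior at the heart of a TCN can be illustrated without a framework. In this sketch (our own helper, single channel, no residual blocks), output[t] depends only on x[t], x[t − d], x[t − 2d], and so on:

```python
import numpy as np

def causal_dilated_conv(x, kernel, dilation):
    """1D causal convolution: out[t] = sum_j kernel[j] * x[t - j*dilation],
    with zero padding on the left so no future value is ever used."""
    k = len(kernel)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), np.asarray(x, float)])
    return np.array([sum(kernel[j] * xp[t + pad - j * dilation] for j in range(k))
                     for t in range(len(x))])

# Kernel [1, 1] with dilation 2 adds each value to the one two steps earlier.
out = causal_dilated_conv([0, 1, 2, 3, 4, 5], [1, 1], 2)
```

Stacking such layers with dilations 1, 2, 4, 8 and kernel size 3 gives a receptive field of 1 + 2 × (1 + 2 + 4 + 8) = 31 past steps, which is how TCNs cover long histories with few layers.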

Attention Mechanism
The innovation becomes prominent through the introduction of the attention mechanism, a dynamic concept within the realm of deep learning. This mechanism assigns significance or 'attention' to specific segments of sequences, ensuring the model captures essential features for precise predictions. Within the context of the BiGTA-net architecture, this concept has been ingeniously adapted, resulting in a distinctive approach that seamlessly integrates Bi-GRU, TCN, and the attention mechanism. The attention mechanism introduced is referred to as the dual-stage self-attention mechanism (DSSAM), situated at the junction of the TCN's output and the subsequent stages of the model [47]. By establishing correlations across various time steps and dimensions, the DSSAM enhances computational efficiency while strategically highlighting informative features.
The role of the attention mechanism is pivotal in refining the output generated by the TCN. Instead of treating all features uniformly, it adeptly identifies and amplifies the most relevant and predictive elements. This dynamic allocation of attention ensures that while the Bi-GRU captures temporal dynamics and the TCN captures long-term dependencies, the attention mechanism focuses on crucial features. As a result, the model achieves enhanced predictive capabilities by synergizing the strengths of Bi-GRU, TCN, and the attention mechanism. The approach incorporates the scaled exponential linear units (SELU) [48] activation function, a strategic choice made to address challenges linked to long-term dependencies and gradient vanishing. This integration of SELU enhances stability in the learning process and ultimately contributes to more accurate predictions [49].
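The DSSAM itself is not specified in full here; as a simplified stand-in, plain scaled dot-product self-attention (no learned projections) shows how weights over time steps are formed and applied to re-weight a sequence:

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over time steps.
    x: (batch, steps, features). Each step attends to every step;
    the softmax weights along the last axis sum to 1."""
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x, w

x = np.random.default_rng(0).normal(size=(1, 24, 8))
out, w = self_attention(x)
```

A learned variant adds query/key/value projections; the dual-stage design in the paper additionally attends over feature dimensions.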

Evaluation Criteria
To evaluate the predictive capabilities of the forecasting model, a variety of performance metrics were utilized, including mean absolute percentage error (MAPE), root mean square error (RMSE), and mean absolute error (MAE). These metrics hold widespread recognition and offer a robust assessment of prediction accuracy [50].
The MAPE serves as a valuable statistical measure of prediction accuracy, particularly pertinent in the context of trend forecasting. This metric quantifies the error as a percentage, rendering the outcomes intuitively interpretable. While the MAPE may become inflated when actual values approach zero, this circumstance does not apply to the dataset under consideration. The calculation of MAPE is performed using Equation (7):

MAPE = (100/n) × Σ_{t=1}^{n} |(Y_t − Ŷ_t)/Y_t|, (7)
where Y_t and Ŷ_t represent the actual and predicted values, respectively, and n represents the total number of observations.
The RMSE, or the root mean square deviation, aggregates the residuals to provide a single metric of predictive capability. The RMSE, calculated using Equation (8), is the square root of the average squared differences between the forecast values (Ŷ_t) and the actual values (Y_t). For an unbiased estimator, the RMSE equals the standard deviation of the errors, indicating the standard error:

RMSE = √((1/n) × Σ_{t=1}^{n} (Y_t − Ŷ_t)²). (8)
The MAE is a statistical measure used to gauge the proximity of predictions or forecasts to the eventual outcomes. This metric is calculated by considering the average of the absolute differences between the predicted and actual values. Equation (9) outlines the calculation for the MAE:

MAE = (1/n) × Σ_{t=1}^{n} |Y_t − Ŷ_t|. (9)
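Equations (7)-(9) in code form (the helper names are ours; inputs are the actual and predicted series):

```python
import numpy as np

def mape(y, yhat):
    """Equation (7): mean absolute percentage error, in percent."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return 100.0 * float(np.mean(np.abs((y - yhat) / y)))

def rmse(y, yhat):
    """Equation (8): root mean squared error."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def mae(y, yhat):
    """Equation (9): mean absolute error."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return float(np.mean(np.abs(y - yhat)))
```

MAPE is scale-free (useful across buildings of different sizes), while RMSE and MAE stay in the original consumption units.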

Experimental Design
The experiments were conducted in an environment that utilized Python (v.3.8) [51], complemented by machine learning libraries such as scikit-learn (v.1.2.1) [52] and Keras (v.2.9.0) [44,53]. The computational resources included an 11th Gen Intel(R) Core(TM) i9-11900KF CPU operating at 3.50 GHz, an NVIDIA GeForce RTX 3070 GPU, and 64.0 GB of RAM. The proposed BiGTA-net model was evaluated against several well-regarded RNN models, namely LSTM, Bi-LSTM, GRU, and GRU-TCN. The hyperparameters were standardized across all models to ensure a fair and balanced comparison, minimizing potential bias in the evaluation results due to model-specific preferences or advantageous parameter settings. The common set of hyperparameters for all models comprised 25 training epochs, a batch size of 24, and the Adam optimizer with a learning rate of 0.001 [54]. The MAE was chosen as the key metric for evaluating the performance of the models, providing a standardized measure of comparison.
The training dataset for the BiGTA-net model was constructed using hourly electrical consumption data starting from 1 to 7 March 2015 for the educational building dataset, and from 11 to 17 January 2016 for the AEP dataset. For the educational building dataset, the data spanning 8 March 2015 to 28 February 2019 was allocated for training, while the subsequent period, 1 March 2019 to 29 February 2020, was designated as the testing set. For the AEP dataset, data from 18 January to 30 April 2016 was employed for training, with the timeframe between 1 and 27 May 2016 reserved for testing. These splits correspond to an approximate 80:20 ratio between the training (in-sample) and testing (out-of-sample) subsets. Prior to the division, min-max scaling was fitted to the training data, standardizing the raw electricity consumption values within a specific range. This scaling transformation was subsequently applied to the testing data, ensuring uniformity in the range of both subsets so that the original data scale did not influence the model's performance.
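The scaling step described above (fit on the training split only, then reuse the same parameters on the test split) can be sketched in plain NumPy; it is equivalent to scikit-learn's `MinMaxScaler` fit/transform pattern. The example values are hypothetical:

```python
import numpy as np

def fit_minmax(train):
    """Learn per-feature min and max from the training data only."""
    return train.min(axis=0), train.max(axis=0)

def apply_minmax(x, lo, hi):
    # Test values outside the training range can map outside [0, 1];
    # this is expected and avoids leaking test statistics into training.
    return (x - lo) / (hi - lo)

train = np.array([[100.0], [300.0], [500.0]])  # hourly consumption (illustrative)
test = np.array([[200.0], [600.0]])
lo, hi = fit_minmax(train)
train_scaled = apply_minmax(train, lo, hi)     # spans [0, 1]
test_scaled = apply_minmax(test, lo, hi)       # 0.25 and 1.25
```

Fitting the scaler on the full dataset instead would contaminate the out-of-sample evaluation, which is why the transform is learned before the split is used.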

Experimental Results
In the experimental outcomes, the performance of diverse model configurations was first investigated, as presented in Table 4. Specifically, a total of 16 models with varying network architectures, activation functions, and incorporation of the attention mechanism were evaluated. Among the configurations detailed in Table 4, the prominent focus is the Bi-GRU-TCN-I model, alternatively known as BiGTA-net, which was proposed in this study. This model adopts the Bi-GRU-TCN architecture, utilizes the SELU activation function, and integrates the attention mechanism, setting it apart from the remaining models. The performance of these models was evaluated using three key metrics, MAPE, RMSE, and MAE, as presented in Tables 5-10. The experimental results are divided into two main categories: results obtained from the educational building dataset and from the AEP dataset. On the educational building dataset, the proposed model (Bi-GRU-TCN-I) consistently showcased superior performance compared with the alternative configurations. As illustrated in Table 5, the proposed model achieved the lowest MAPE, underscoring its heightened predictive accuracy. Its superior performance is further corroborated by Tables 6 and 7, where the proposed model attains the lowest RMSE and MAE values, respectively, signifying close alignment between the model's predictions and the actual values.
• Table 5 demonstrates that, among all models, the proposed Bi-GRU-TCN-I model achieves the best average MAPE of 5.37. The Bi-GRU-TCN-II model follows closely with a MAPE of 5.39. Among the LSTM-based models, LSTM-TCN-III emerges as the top contender with a MAPE of 5.59, which, although commendable, is still higher than that of the leading Bi-GRU-TCN-I model. The Bi-LSTM-TCN results range from 6.90 to 8.53, further emphasizing the efficacy of the BiGTA-net, while the traditional GRU-TCN models displayed a wider variation in MAPE values, from 5.68 to 6.50.

• In Table 6, which assesses RMSE values, the proposed BiGTA-net model (Bi-GRU-TCN-I) records an RMSE of 171.3, with Bi-GRU-TCN-II at 169.5. Among the LSTM variants, LSTM-TCN-I holds the most promise, with an RMSE of 134.8. Across the full set of results, however, the Bi-GRU models are generally superior in predicting values close to the actual values, underscoring their robustness.
• Table 7 indicates the reliability of BiGTA-net, which attains the lowest MAE of 122.0. Bi-GRU-TCN-II follows closely with an MAE of 122.7, while the other models, including the LSTM and Bi-LSTM series, reported higher MAE scores, ranging between 131.6 and 153.7.
In the context of the AEP dataset, as demonstrated in Tables 8-10, the proposed model (Bi-GRU-TCN-I) showcased competitive performance. While only marginal differences were observed among the various model configurations, the Bi-GRU-TCN-I model consistently outperformed the alternatives in terms of the MAPE, RMSE, and MAE metrics.

• In Table 8, which presents the MAPE comparison for the AEP dataset, the proposed Bi-GRU-TCN-I model again records the lowest average MAPE of 26.77, emphasizing its predictive accuracy among all tested models. Within the LSTM family, LSTM-TCN-I achieved an average MAPE of 28.42, while Bi-LSTM-TCN-I recorded an average MAPE of 29.12. Although these models exhibit competitive performance, neither outperformed the BiGTA-net.

• Table 9, focused on the RMSE comparison, shows the Bi-GRU-TCN-I model registering an RMSE of 375.9 at step 1. When averaged, this performance proves competitive with the other models, whose results range from as low as 369.1 for Bi-GRU-TCN-III to as high as 622.2 for Bi-LSTM-TCN-III. Within the LSTM family, LSTM-TCN-I started with an RMSE of 473.6, whereas Bi-LSTM-TCN-I began at 431.1, both well above the BiGTA-net.

• Finally, in Table 10, where MAE values are compared, the Bi-GRU-TCN-I model again stands out with an MAE of 198.4. This consistently low error rate across different evaluation metrics underscores the robustness of the BiGTA-net across datasets.
In summary, the proposed model, Bi-GRU-TCN-I, designated BiGTA-net, exhibited exceptional performance across both datasets, affirming its efficacy and dependability for precise electricity consumption forecasting. These outcomes substantiate the benefits of the Bi-GRU-TCN architecture, the SELU activation function, and the attention mechanism, thereby validating the chosen design approach.
To further evaluate the performance of the BiGTA-net model, a comprehensive comparative analysis was conducted. This analysis included models such as Att-LSTM, Att-Bi-LSTM, Att-GRU, and Att-Bi-GRU, all of which integrate the attention mechanism, a characteristic known to enhance prediction capability. The evaluation also incorporated several state-of-the-art methodologies introduced over the past three years, offering a robust understanding of BiGTA-net's performance relative to contemporary models:

• Park and Hwang [55] introduced the LGBM-S2S-Att-Bi-LSTM, a two-stage methodology that merges the light gradient boosting machine (LGBM) with a sequence-to-sequence Bi-LSTM. By employing the LGBM for single-output predictions from recent electricity data, the system transitions to a Bi-LSTM reinforced with an attention mechanism, adeptly addressing multistep-ahead forecasting challenges.

• Khan et al. [57] introduced the Att-CNN-GRU, blending a CNN and GRU enriched with a self-attention mechanism. This model specializes in analyzing refined electricity consumption data, extracting pivotal features via the CNN and subsequently passing the output through GRU layers to grasp the temporal dynamics of the data.
Table 11 presents the comparative performance of several attention-incorporated models on the educational building dataset, with the BiGTA-net model distinctly superior. Specifically, BiGTA-net records a MAPE of 5.37 (±0.44%), an RMSE of 171.3 (±15.0 kWh), and an MAE of 122.0 (±10.5 kWh). The Att-LSTM model, a unidirectional approach, records a MAPE of 8.38 (±1.57%), an RMSE of 242.1 (±48.2 kWh), and an MAE of 188.8 (±39.5 kWh). Its bidirectional sibling, the Att-Bi-LSTM, delivers a slightly better MAPE of 7.85 (±0.70%) but comparable RMSE and MAE values. Interestingly, the GRU-based models, Att-GRU and Att-Bi-GRU, lag with higher error metrics, the former recording a MAPE of 13.42 (±3.39%). The 2023 Att-CNN-GRU model reports a MAPE of 6.35 (±0.23%) and an RMSE of 189.6 (±5.3 kWh), still falling short of the BiGTA-net. The RABOLA model from 2022 registers an impressive MAPE of 7.17 (±0.63%), but again, BiGTA-net outperforms it. In essence, these results demonstrate the BiGTA-net's efficiency when measured against both traditional unidirectional models and newer advanced techniques.
Table 12 presents the comparative performance metrics of the attention-incorporated models on the AEP dataset. Here, too, the BiGTA-net model consistently outperforms its peers, equipped with its blend of the attention mechanism and SELU activation within a bidirectional framework. The model returns a MAPE of 26.77 (±0.90%), an RMSE of 386.5 (±6.3 Wh), and an MAE of 198.4 (±3.2 Wh). The Att-LSTM model offers a MAPE of 30.91 (±1.03%), an RMSE of 447.4 (±5.3 Wh), and an MAE of 239.8 (±5.3 Wh). Its bidirectional counterpart, the Att-Bi-LSTM, shows a modest enhancement, delivering a MAPE of 30.54 (±2.58%), an RMSE of 402.7 (±8.9 Wh), and an MAE of 214.0 (±7.3 Wh). The GRU-based models present a close-knit performance; for instance, the Att-GRU model achieves a MAPE of 30.03 (±0.25%), an RMSE of 443.5 (±3.9 Wh), and an MAE of 234.5 (±2.4 Wh), while the Att-Bi-GRU mirrors this with slightly varied figures. The 2023 Att-CNN-GRU logs a MAPE of 29.94 (±1.73%) and an RMSE of 405.1 (±9.7 Wh), yet its precision remains overshadowed by BiGTA-net. RABOLA, a 2022 entrant, exhibits a MAPE of 35.89 (±5.78%), emphasizing the continual advancements in the domain.
The disparities in performance underscore BiGTA-net's advantage: models lacking its refined structure falter in forecast accuracy, highlighting the merits of the introduced architecture. The combination of the Bi-GRU and TCN, together with the attention mechanism and the SELU activation function, synergistically reinforced BiGTA-net as a robust model. The experimental results consistently demonstrated BiGTA-net's strong performance across diverse datasets and metrics, underscoring the model's efficacy and flexibility in different forecasting contexts and decisively endorsing the effectiveness of the hybrid approach utilized in this study.

Discussion
To assess the effectiveness of the BiGTA-net model, rigorous statistical analysis was employed, utilizing both the Wilcoxon signed-rank test [58] and the Friedman test [59].

• Wilcoxon signed-rank test: The Wilcoxon signed-rank test [58], a non-parametric counterpart of the paired t-test, gauges differences between two paired samples. Given two paired sets of observations, x and y, the differences $d_i = y_i - x_i$ are computed. Ranks are assigned to the absolute values of these differences, and each rank is then given a positive or negative sign according to the sign of the original difference. The test statistic W is the sum of these signed ranks and, under the null hypothesis, follows a known symmetric distribution. If the computed p-value is less than the chosen significance level (often 0.05), the null hypothesis is rejected, implying a statistically significant difference between the paired samples.

• Friedman test: The Friedman test [59] is a non-parametric alternative to the repeated-measures ANOVA. At its core, the test ranks each row (block) of data separately; the differences among the columns (treatments) are then evaluated using these ranks. The test statistic is given by Equation (10):

$\chi^2_F = \frac{12}{Nk(k+1)} \sum_{j=1}^{k} R_j^2 - 3N(k+1)$, (10)

where N is the number of blocks, k is the number of treatments, and $R_j$ is the sum of the ranks for the jth treatment. The observed value of $\chi^2_F$ is then compared with the critical value from the $\chi^2$ distribution with k − 1 degrees of freedom.
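Equation (10) translates directly into NumPy as follows. This sketch assumes no ties within a block (with ties, average ranks and a tie correction are needed, as in `scipy.stats.friedmanchisquare`), and the data matrix is hypothetical:

```python
import numpy as np

def friedman_chi2(data):
    """Friedman statistic of Equation (10).

    data: (N, k) array with one row per block and one column per treatment.
    """
    data = np.asarray(data, float)
    n_blocks, k = data.shape
    # Rank 1..k within each row (valid when there are no ties in a row).
    ranks = data.argsort(axis=1).argsort(axis=1) + 1
    rank_sums = ranks.sum(axis=0)                    # R_j for each treatment
    return 12.0 / (n_blocks * k * (k + 1)) * float((rank_sums ** 2).sum()) \
        - 3.0 * n_blocks * (k + 1)

# Three blocks in which treatment 3 always scores highest (k - 1 = 2 dof).
chi2 = friedman_chi2([[1, 2, 3],
                      [2, 4, 6],
                      [3, 5, 9]])
```

With a perfectly consistent ordering across blocks, as here, the statistic reaches its maximum for the given N and k, which is the pattern behind the consistently low p-values in Tables 13 and 14.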
The validation demonstrated in Tables 13 and 14 underscores the proficiency of BiGTA-net in the context of energy management. The analyses were anchored on the three metrics, MAPE, RMSE, and MAE, with data aggregated across all deep learning models over 24 h forecasts at hourly intervals. The comprehensive results of the Wilcoxon and Friedman tests, each grounded in these metrics, are presented in Tables 13 and 14. The tables illustrate the distinct advantage of BiGTA-net, with p-values consistently falling below the 0.05 significance threshold across varied scenarios and metrics: the exceptionally low p-values from both tests indicate statistically significant differences between BiGTA-net and its competitors, and in almost every instance the other models were lacking when juxtaposed against BiGTA-net's results. This empirical evidence is vital to understanding the superior capabilities of BiGTA-net in energy forecasting and emphasizes its robustness and reliability. The variations in MAPE, RMSE, and MAE across Tables 13 and 14 vividly portray the margin by which BiGTA-net leads in accuracy and precision; its unique architecture and methodology have positioned it as a front-runner in this domain.
In the intricate realm of BEMSs, the gravity of data-driven decisions cannot be overstated; they bear the twofold onus of economic viability and environmental stewardship. The need for precise and decipherable modeling is therefore paramount. BiGTA-net, envisaged as an advanced hybrid model, seeks to meet these exacting standards: its amalgamation of the Bi-GRU and TCN accentuates its proficiency in parsing the intricate temporal patterns that remain at the heart of energy forecasting.
In the complex BEMS landscape, BiGTA-net's hybrid design brings a distinctive strength in capturing intricate temporal dynamics. However, this prowess is not without its challenges. Particularly in industrial environments or regions heavily dependent on unpredictable renewable energy sources, the model may struggle to adapt swiftly to abrupt shifts in energy consumption patterns. This adaptability issue is further accentuated by the sheer volume of data that the energy sector typically handles. Given the influx of granular data from many sensors and IoT devices, BiGTA-net's intricate architecture could face scalability issues, especially when implemented across vast energy distribution networks or grids. Furthermore, the predictive nature of energy management demands an acute sense of foresight, especially with the increasing reliance on renewable energy sources. In this context, the TCN's inherent limitation in accounting for prospective data poses challenges when energy matrices constantly change, demanding agile and forward-looking predictions.
Within the multifaceted BEMS domain, the continuous evolution and refinement of models such as BiGTA-net are essential. One avenue of amplification lies in broadening its scope to account for external determinants. By incorporating influential factors such as climatic fluctuations and scheduled maintenance events directly into the model's input parameters, BiGTA-net could enhance its responsiveness to unpredictable energy consumption variances. Further bolstering its real-time applicability, an adaptive learning mechanism designed to self-tune based on the influx of recent data could ensure that the model remains abreast of ever-changing energy dynamics. Additionally, enhancing the model's interpretability is vital in a sector where transparency and clarity are paramount. Integrating principles from the explainable AI domain into BiGTA-net can provide a deeper understanding of its decision-making process, enabling stakeholders to discern the rationale behind specific energy consumption predictions and insights.
As the forward trajectory of BiGTA-net within the energy sector is contemplated, several avenues of research come into focus. Foremost is the potential enhancement of the model's attention mechanism, tailored explicitly to the intricacies of energy consumption dynamics; by designing attention strategies that highlight domain-specific energy patterns, the model's ability to discern and emphasize critical patterns could be substantially elevated. Furthermore, while BiGTA-net showcases an intricate architecture, the ongoing challenge lies in balancing this inherent complexity with optimal predictive accuracy. By addressing this balance, models could be engineered to be more streamlined and suitable for decentralized or modular BEMS frameworks while retaining their predictive capabilities. Lastly, a compelling proposition emerges for integrating BiGTA-net's forecasting prowess with existing BEMS decision-making platforms. Such integration holds the promise of a future where real-time predictive insights seamlessly inform energy management strategies, advancing both energy utilization efficiency and a tangible reduction in waste.
While BiGTA-net has demonstrated commendable forecasting capabilities in its initial stages, a thorough exploration of its limitations, in conjunction with potential improvements and future directions, can enhance its role within the BEMS domain. By incorporating these insights, the relevance and adaptability of BiGTA-net can be advanced, positioning it as a frontrunner in the continuously evolving energy sector landscape.

Conclusions
Our study presents BiGTA-net, a transformative deep learning model tailored for urban energy management in smart cities, enhancing the accuracy and efficiency of STLF. This model harmoniously integrates the capabilities of the Bi-GRU, TCN, and an attention mechanism, effectively capturing both recurrent and convolutional data patterns. A thorough examination of BiGTA-net against other models on the educational building dataset showcased its distinct superiority. Specifically, BiGTA-net excelled with a MAPE of 5.37, an RMSE of 171.3, and an MAE of 122.0. Notably, the closest competitor, Bi-GRU-TCN-II, lagged slightly, with a MAPE of 5.39 and an MAE of 122.7. This superiority was mirrored on the AEP dataset, where BiGTA-net again led with a MAPE of 26.77, an RMSE of 386.5, and an MAE of 198.4. Such consistent outperformance underscores the model's capability, especially when juxtaposed with the other configurations.
Furthermore, the integration of the attention mechanism enhances the performance of BiGTA-net, reinforcing its effectiveness in forecasting tasks. The distinct bidirectional architecture of BiGTA-net also demonstrated superior performance, further establishing its advantage. This performance edge becomes notably apparent when contrasted with models such as Att-LSTM, which exhibited higher errors across the pivotal metrics, highlighting the resilience and dependability of the proposed model. The evident strength of BiGTA-net lies in its innovative amalgamation of the Bi-GRU and TCN, harmonized with the attention mechanism and bolstered by the SELU activation function. Its consistent dominance across diverse datasets and metrics robustly validates the efficacy of this hybrid approach.
Despite these promising results, it is important to explore BiGTA-net's capabilities further and identify areas for improvement. Its generalizability has yet to be extensively tested beyond the datasets used in this study, which presents a limitation. Future research should apply the model across various consumption domains, such as the residential or industrial sectors, and compare its effectiveness with a wider range of advanced machine learning models. By doing so, researchers can further refine the model for specific scenarios and delve deeper into hyperparameter optimization. From a forward-looking perspective, the TCN's stackable nature allows layers to be combined, amplifying the model's capacity to perceive intricate temporal nuances.

Figure 1. Schematic flow of data preprocessing and BiGTA-net modeling.


Figure 2. Distribution of hourly electricity consumption of a building. (a) Educational building dataset; (b) Appliances Energy Prediction dataset.


Figure 3. Boxplots for hourly electricity consumption by hours. (a) Educational building dataset; (b) Appliances Energy Prediction dataset.

Figure 4. Boxplots for hourly electricity consumption by days of the week. (a) Educational building dataset; (b) Appliances Energy Prediction dataset.


Figure 5. Average electricity use per hour for each day of the week and holidays.


Table 4. Comparison of hybrid deep learning model architectures.

Table 1. Comparative analysis of previous studies and the current research concerning short-term load forecasting.

• Renewable Electricity Production: Anticipating forthcoming electricity demand facilitates strategic planning for harnessing renewable sources, ensuring they are optimally utilized given their intermittent nature.

Table 2. Building electricity consumption dataset information.

Table 3. Input variables and their information for BiGTA-net modeling.


Table 5. MAPE comparison for the educational building dataset.

Table 6. RMSE comparison for the educational building dataset.

Table 7. MAE comparison for the educational building dataset.

Table 9. RMSE comparison for the AEP dataset.

Table 10. MAE comparison for the AEP dataset.

• Moon et al. [21] presented RABOLA, previously touched upon in the Introduction section. This model is an innovative ranger-based online learning strategy for electricity consumption forecasting in intricate building structures. At its core, RABOLA utilizes ensemble learning strategies, specifically bagging, boosting, and stacking, employing the random forest, gradient boosting machine, and extreme gradient boosting for STLF while integrating external variables such as temperature and timestamps for improved accuracy.
• Khan et al. [56] unveiled the ResCNN-LSTM, a segmented framework targeting STLF. The primary phase is data driven, ensuring data quality and cleanliness; the next phase combines a deep residual CNN with stacked LSTMs. This model has shown commendable performance on the Individual Household Electricity Power Consumption (IHEPC) and Pennsylvania-Jersey-Maryland (PJM) datasets.

Table 11. Performance comparison of attention-inclusive models on the educational building dataset. Left values denote the mean values across all steps, while the values in parentheses on the right represent the corresponding standard deviations.

Table 12. Performance comparison of attention-inclusive models on the AEP dataset. Left values denote the mean values across all steps, while the values in parentheses on the right represent the corresponding standard deviations.

Table 13. Results of the Wilcoxon signed-rank and Friedman tests with BiGTA-net on the educational building dataset.

Table 14. Results of the Wilcoxon signed-rank and Friedman tests with BiGTA-net on the AEP dataset.