Hyperparameter Optimization of Bayesian Neural Network Using Bayesian Optimization and Intelligent Feature Engineering for Load Forecasting

This paper proposes a new hybrid framework for short-term load forecasting (STLF) that combines Feature Engineering (FE) and the Bayesian Optimization (BO) algorithm with a Bayesian Neural Network (BNN). The FE module comprises feature-selection and feature-extraction phases. Firstly, by merging the Random Forest (RaF) and Relief-F (ReF) algorithms, we develop a hybrid feature selector based on grey correlation analysis (GCA) to eliminate feature redundancy. Secondly, a radial basis kernel function and principal component analysis (KPCA) are integrated into the feature-extraction module for dimensionality reduction. Thirdly, the BO algorithm is used to fine-tune the control parameters of the BNN, providing more accurate results by avoiding local-optimum trapping. The proposed FE-BNN-BO framework is designed to ensure stability, convergence, and accuracy. The model is tested on hourly load data obtained from the PJM electricity market, USA. In addition, the simulation results are compared with other benchmark models such as Bi-Level, long short-term memory (LSTM), an accurate and fast-convergence-based ANN (ANN-AFC), and a mutual-information-based ANN (ANN-MI). The results show that the proposed model significantly improves accuracy with a fast convergence rate and a reduced mean absolute percentage error (MAPE).


Introduction
An accurate electric load forecast (ELF) is essential for smart grids (SGs) to make strategic decisions such as operational and planning management [1], load switching [2], energy generation expansion, maintenance scheduling, security, demand monitoring inspections, and providing a reliable energy supply [3], since inaccurate forecast results may pose serious challenges to short- and long-term decision making and planning for SGs. Overestimation in a forecast may lead to excessive spinning reserves, excess production capacity, and limited energy distribution, resulting in higher operational costs. In contrast, underestimation may create consistency, power quality, safety, and monitoring issues. Therefore, distribution system operators (DSOs) need acceptable accuracy to guarantee endurance and stable grid operation [4]. For this purpose, much attention is devoted to providing instant, accurate, and stable load forecasts to ensure the safe and reliable operation of the grid [5]. However, the accuracy of ELF often cannot meet societal requirements: it is affected by probabilistic and uncertain factors that have not been fully characterized, owing to the small amount of experimental data in related studies and to seasonal diversity.
Hence, a new feature engineering (FE) and optimization concept is introduced. The proposed BO algorithm has been selected to adjust the control parameters because of its fast convergence and robustness in finding the optimal solution [32][33][34]. The BO algorithm optimizes the threshold weights for the filters and finds the optimized thresholds to be used in the FE module for feature selection. The complete forecasting framework consists of an integrated framework of feature engineering (FE), a stochastic BNN model, and the BO algorithm (FE-BNN-BO). The performance of the proposed FE-BNN-BO model is validated by comparing the results with the existing models in terms of mean absolute percentage error (MAPE).

Contributions
The main contributions are presented as follows:
• An ingenious and robust framework, FE-BNN-BO, is proposed that integrates the FE module and the BO algorithm with a BNN. The FE module solves the concerns associated with redundancy and irrelevance (dimension reduction), while the BO algorithm optimizes the hyper-parameters of the BNN predictor to enhance accuracy and secure fast convergence. The combination of the FE module and the BO algorithm significantly improves the performance and effectiveness of the BNN model.
• BNN models are computationally complex, and repetitive or irrelevant features further increase this complexity, slow down the BNN training process, and degrade prediction accuracy. The proposed FE module addresses this problem by combining a Random Forest and Relief-F-based feature selector with a radial-basis-kernel principal component analysis (RBKPCA)-based feature extractor. The feature selector fuses Random Forest with Relief-F, calculates each feature's importance, selects the relevant features, and discards the irrelevant ones. This further enhances the computational performance and efficacy of the BNN model.
• Moreover, the BO algorithm automatically searches for the best ensemble configuration. The devised BO algorithm is more controllable and more efficient in time and complexity than the widely used grid-search methods.
• The proposed model is validated against recent hourly load data obtained from the USA electricity grid. The proposed framework outperformed the benchmark frameworks LSTM, ANN-MI, ANN-AFC, and Bi-Level in both accuracy and convergence speed.

Paper Organization
The rest of this paper is organized as follows: recent and relevant work is reviewed in Section 2; Section 3 illustrates the proposed system model; Section 4 presents and discusses the simulation results; and the paper is concluded in Section 5.

Literature Survey
Statistical and ML models are widely used in the literature. They can be divided into two major categories to better understand how they perform and the impediments that come with them. A detailed description follows.

Individual ELF Models
The individual models are used for ELF without fusing any other algorithm; therefore, only the algorithmic efficacy is estimated using various performance parameters. The authors of [35] proposed a distribution practice for meteorological data to predict the prospective load. The energy system is divided into two sub-systems depending on the climate, and two distinct forecasting models, Grey and ARIMA, are used in the two sub-systems. The fitted models are assessed by comparing them to the definitive models using MAPE as a performance metric. In [36], an individual approach based on a deep recurrent neural network (DRNN) is introduced to forecast household load. This approach handles the overfitting issue more efficiently than classical DNN systems. Furthermore, the results show the improved performance of the proposed strategy compared to other single methods, such as ARIMA, SVM, Grey, and traditional RNNs. The authors of [37] proposed an RNN based on the long short-term memory (RNN-LSTM) framework to forecast household loads. The forecast accuracy is enhanced by utilizing the embedded appliance-usage series strategy for the training data. However, the authors ignore the convergence rate and computational complexity, focusing only on accuracy. A demand response scheme in [38] considers an ANN for price forecasting. The proposed price-forecasting model uses mixed-integer linear programming (MILP) to lessen energy costs. Simulation results depict that hourly demand response is more promising than day-ahead response, with an enhanced ability to serve the industrial market by diminishing cost. The authors of [39] presented a probabilistic prediction model for predicting PV output, electrical energy consumption, and scalability. Quantile regression (QR) models and dynamic Gaussian processes (GP) are applied to Sydney metropolitan-area data for probabilistic prediction.
Simulation results demonstrate that the proposed model excels in all three predictive scenarios. In [40], a long-term predictive model was proposed to improve the relative prediction accuracy of the integrated power resource plan. The authors of [41] investigated new aspects using loads and temperatures over the past few hours. The primary purpose is to determine the hourly moving-average temperature with a time lag to improve prediction accuracy. The impact of timeliness is examined in three scenarios: at the aggregated geographic hierarchy level, at the lowest geographic hierarchy level, and at each time step. However, accuracy is improved at the cost of model complexity. Though individual models are robust and fast converging, they remain insufficiently accurate and do not reach the required level.
The above discussion concludes that single strategies are not helpful in all facets (rate of convergence, accuracy, stability) due to each technique's unique flaws, imperfections, defects, and intrinsic limitations [41]. For example, non-linear and seasonal behavior cannot be learned by a linear-regression-based model [12]; the grey model is suited only to exponential growth trends [11]; expert systems rely on a solid knowledge base [16]; and intelligent methods rely on thresholds, weights, biases, and hyperparameter adjustments [14]. These annoyances affect ELF and result in inconsistent performance. Due to these shortcomings, individual methods cannot achieve all goals (accuracy, rate of convergence, stability) simultaneously. Multiple optimization algorithms, such as metaheuristic [42], bio-inspired [43], and heuristic [44] algorithms, are integrated into single models to devise hybrid models that overcome the problems and limitations of the single methods [33]. The goal is to attain increased precision and suppress the fluctuation of final forecasts by optimizing thresholds, hyper-parameter adjustments, random weight initialization, and biases of the individual models.

Hybrid ELF Models
The new integrated and hybrid predictive model is an intelligent solution that maximizes the desired characteristics of individual models to ensure superior performance [45,46]. The hybrid model integrates the FE engine and the optimization engine with a combination of prediction algorithms to improve accuracy by optimizing the control parameters of the prediction engine. For instance, a hybrid wavelet neural network (WNN) based on a stepped-forward differential evolution (SFDE) framework has been devised; the SFDE optimization algorithm efficiently tunes the hyperparameters of the proposed WNN. Experimental results show that the framework is efficient compared with frameworks such as ANN-based particle swarm optimization (ANN-PSO), ANN-based genetic algorithm (GA-ANN), and ANN-based evolutionary programming (ANN-EP) in terms of accuracy, efficiency, effectiveness, and hyperparameter tuning for ELF [47]. A hybrid model of nonlinear AR with GA and an exogenous-input NN is proposed in [48] for STLF. Statistical and pattern-recognition-based schemes are used to fine-tune the input parameters of the proposed model, and GA is used to select the weights and biases for NN training. The proposed model is validated by comparing it to existing models such as mean and regression-tree models with exogenous inputs. The authors of [49] proposed a robust STLF framework with an automated data-cleaning method for load prediction of distribution feeders. A day-ahead building-level LF model based on DL was proposed in [50] and validated by comparing its accuracy with traditional models. An integrated framework of VMD, LSTM, and the BO algorithm has been proposed [51]; this model aims to surpass existing models in accuracy and stability. A modified hybrid model of the multipurpose cuckoo search algorithm (MCSA) and GNRR has also been proposed [52].
The proposed model was tested against existing models for predictive accuracy using real-time load data from the Australian Energy Market Operator (AEMO). The authors of [53] proposed a prediction engine based on the Elman neural network to predict the future load of the SG; intelligent algorithms optimally adjust the biases and weights of this network to acquire accurate predictions. The authors of [54] proposed an STLF model based on SSVR, whose main goal is to enhance the accuracy and efficiency of comparative predictions: the output of the prediction engine passes through the optimization engine, which fine-tunes the parameters to improve accuracy and efficiency. However, the prediction accuracy is enhanced at the expense of computational complexity. A fused framework is presented in [55] based on SVR and DE to enhance forecasting performance by adapting the SVR parameters; the developed framework surpasses backpropagation ANN, regression frameworks, and typical SVR. In [56], a hybrid SVR and fruit-fly (Ff) algorithm framework is designed to address hyperparameter selection and improve forecasting accuracy. In addition, a new approach has been developed to achieve accurate ELF by merging the firefly optimization algorithm (FFO) with the SVR model and fitting optimal hyperparameters [57,58]. Hybrid prediction strategies have been proposed to solve the above problems and offer improved modeling capabilities compared to non-hybrid methods; still, slow convergence and long execution times remain problems due to the many adjustable parameters. In [59], the authors used a Bi-Level strategy based on ANN and the DE algorithm for ELF. Methods based on AFC-ANN and the modified extended DE algorithm (MEDEA) [60] have been proposed to predict future loads [61].
Static optimal-forecast combination, which computes the weight of each model through pairwise performance fitting on training and validation data, has been studied both empirically and theoretically. The authors of [62] developed a hybrid model consisting of the novel switching delayed PSO (NSDPSO) algorithm and the ELM for STLF, in which weights and biases are optimized using the proposed NSDPSO algorithm. The tanh function is chosen as the activation function because it generalizes better and avoids unwanted hidden nodes and overfitting issues. Experimental results show that the proposed framework is superior to the RBFNN, and the devised model applies successfully to STLF of the energy system. A hybrid prediction framework has also been developed that combines a feature-extraction method with a two-stage prediction engine, which uses a Ridgelet NN (RNN) and an Elman NN (ENN) to provide accurate predictions; optimization algorithms are applied to determine the control parameters of the prediction engine [63]. The hybrid models above can be considered optimistic and valuable in improving prediction accuracy by adequately modifying the hyperparameters. However, the authors of these articles focus either on bias initialization and random-weight optimization or on proper adjustment and selection of hyperparameters. Moreover, none of these models considers accuracy, rate of convergence, and stability simultaneously. From numerous analyses and investigations, we have concluded that addressing one aspect (bias initialization and random-weight optimization, or proper setting and selection of hyperparameters) and one measurement (convergence, accuracy, or stability) is not sufficient. Therefore, a robust hybrid model is needed to overcome the problems of current models while improving prediction accuracy and stability with fast convergence rates.
From the literature, we can safely conclude that ELF has made great strides in energy management. However, existing approaches are not practical when dealing with large amounts of data: adjusting the control parameters is complex, and redundancy, irrelevance, and the need for dimensionality reduction are unavoidable, which makes the computation very demanding and prevents quick convergence. Furthermore, the above literature does not consider forecast accuracy and rate of convergence at the same time. To address these problems, a fast and accurate model is required. An SVM and gradient descent (GD) algorithm-based model was proposed [64]; however, this model introduces computational complexity and fails to converge. Some authors have focused on feature-selection algorithms, traditional decision-tree (DT) classifiers, and ANNs [65]. However, the DT faces the problem of overfitting: while it works well in training, it does not work well in prediction. ANNs have limited generalizability, and their convergence is challenging to control. The authors of [41] proposed a hybrid feature selection (HFS), extraction, and classification model for STLF; however, this method is too complex to converge quickly.

Proposed Model
This study proposes a novel hybrid framework based on the FE method, a neural network model (BNN), and the BO algorithm for ELF, as shown in Figure 1. This work targets daily load forecasting using a new concept of scalability and robustness evaluation. The proposed model is an integrated framework of three modules: (i) an FE module comprising a hybrid feature selector (HFS) and a feature extractor (FX), (ii) a forecasting module based on the BNN model, and (iii) an optimizing module based on the BO algorithm.


FE Module
The first module is FE. In this phase, abstract and critical features are selected from the preprocessed data, while repetitious and irrelevant elements are discarded. The desired features are selected from the dataset by GCA and extracted by radial basis KPCA (RB-KPCA). The FS relies on GCA to drive feature selection; it includes the Relief-F (ReF) and Random Forest (RaF) algorithms for estimating the importance of features, as depicted in Figure 2. In addition, the FS decides whether to retain or abandon a feature based on its importance. The RBKPCA-based FX uses kernel functions to process high-dimensional nonlinear data; its feature extraction aims to reduce redundant features. A brief description of the FE module follows.

FS
The feature-selection system is based on GCA, developed with the aid of the RaF and ReF algorithms and managed by a combined controlling threshold (ϕ). GCA selects a feature space in which the most applicable and preferred features are kept and inappropriate features are discarded, based on the feature significance and the selection threshold ϕ. Let L be the electric load data matrix, whose columns show the feature index and whose rows present the timestamps; l_mn is the mth component of the data, i.e., the electrical energy consumption pattern n hours ahead that is to be forecasted. Equation (1) can also be expressed in the form of a time sequence. Many factors affect the ELF pattern in different ways. GCA estimates the importance of each component and its impact on ELF, and effectively controls the feature-selection process by determining the correlation between each feature and the final ELF pattern. The correlation is directly related to the proximity of the data signals; as a result, GCA determines the closeness of two data signals. Each feature has its own physical meaning and dimensions, so the data are made dimensionless by standardizing with the mean or maximum before GCA is executed. Since the original sequence has the characteristic of "the larger the better" [66], it can be normalized as

Z*_j(k) = (Z_j(k) − Zmin_j(k)) / (Zmax_j(k) − Zmin_j(k)),

where Z_j(k) is the original sequence, Z*_j(k) is the sequence after data preprocessing, Zmax_j(k) is the largest value of Z_j(k), and Zmin_j(k) is the smallest value of Z_j(k).
The grey relational coefficient (η) after normalization [66] is calculated in Equation (5) as follows:

η_j(k) = (min_j min_k Σ_0j(k) + ν · max_j max_k Σ_0j(k)) / (Σ_0j(k) + ν · max_j max_k Σ_0j(k)), (5)

where Σ_0j(k) = |Z*_0(k) − Z*_j(k)| is the deviation sequence of the reference sequence Z*_0(k) and the comparability sequence Z*_j(k), represented in Equation (6), and ν is a distinguishing coefficient, fixed to 0.50 [67]. The grey relational grade (G_j) [68,69] is a weighted sum of the η, defined in Equation (7) as:

G_j = Σ_k w_k η_j(k). (7)

Grey relational analysis measures the absolute value of the data difference between sequences and can be used to calculate an approximate correlation. The low-correlated features are deleted, and the remaining selected items l_kn are arranged from least to most significant, providing the time sequence t_j as illustrated in Equation (8), where δ illustrates the dropped features and t_j is the time series. The RaF evaluator β processes the bootstrap-bagging (BSB) samples [70]. BSB samples are split into out-of-bag (OoB) samples and training samples. In the first evaluator β, all weights are initialized to zero, and RaF training begins; the feature importance is then determined from the OoB data with noise. For the second evaluator α, the weights are updated using the concept of distance among hits and misses. Both the α and β evaluators forward the determined feature importance to the FS, which performs feature selection based on ϕ. F_i^f and F_i^r represent the feature importance calculated by ReF and RaF, respectively. The parameters are updated in Equations (9) and (10), where C is the class, b* is a randomly selected item in C, and the function D(b*, N_j(C)) computes the attribute difference D between r_1 and r_2.
The function is mathematically modeled in Equation (11). The ReF-based feature importance F_i^f and the RaF-based feature importance F_i^r are normalized for feature selection, as depicted in Equations (12) and (13). A combined feature-importance value greater than ϕ marks a key feature, while a value less than ϕ renders the feature irrelevant. The key features are retained, while the irrelevant features are eliminated; this procedure is represented mathematically in Equation (14). The selected features are passed to the feature-extraction phase, which uses RBKPCA to reduce redundancy between features.
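As a rough illustration of this hybrid feature-selection idea, the sketch below combines a grey-relational grade with Random Forest importances under a threshold ϕ; the Relief-F evaluator is omitted for brevity (it has no standard sklearn implementation), and the data, equal combination weights, and threshold value are hypothetical stand-ins rather than the paper's configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def grey_relational_grade(features, target, nu=0.5):
    """Grey relational grade of each feature w.r.t. the target sequence."""
    def norm(x):  # "larger the better" min-max normalization
        return (x - x.min(0)) / (x.max(0) - x.min(0) + 1e-12)
    F, t = norm(features), norm(target.reshape(-1, 1)).ravel()
    dev = np.abs(F - t[:, None])                     # deviation sequences
    dmin, dmax = dev.min(), dev.max()
    eta = (dmin + nu * dmax) / (dev + nu * dmax)     # grey relational coefficients
    return eta.mean(axis=0)                          # grade = mean coefficient

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

g = grey_relational_grade(X, y)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
imp = rf.feature_importances_

# Combine the two normalized importance scores; keep features above phi
combined = 0.5 * (g / g.sum()) + 0.5 * imp
phi = 1.0 / X.shape[1]        # hypothetical combined controlling threshold
selected = np.where(combined > phi)[0]
print(selected)
```

Feature 0, which dominates the target, receives a high combined importance and survives the threshold; noise features tend to fall below it.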

FX
The feature-extraction procedure based on RB-KPCA is performed in the second stage. This operation removes redundant data to solve the dimensionality-reduction problem. The output of the FS is sent to the RB-KPCA-based FX, which produces the dimensionally reduced matrix presented in Equation (15), containing the most relevant features of interest, modeled as in [71]:

R = [r_1, r_2, r_3, . . . , r_j]^T, (15)

where r_j is the jth variable associated with the EL. The correlation between eigenvalues and features is calculated from Equation (16):

V_{f*} ev = λ ev, (16)

where V denotes the covariance matrix of R in the feature space f*, ev represents the eigenvector, and λ is the eigenvalue. Furthermore, V_{f*} ev is determined using Equation (17):

V_{f*} ev = (1/j) Σ_{i=1}^{j} ⟨φ(r_i), ev⟩ φ(r_i), (17)

where φ denotes the mapping from the input data to the feature space and ⟨r, z⟩ expresses the inner product of r and z. With these modifications, Equation (16) becomes Equation (19), and for λ ≠ 0 the eigenvector ev lies in the span of the mapped samples, so it can be determined as in Equation (20):

ev = Σ_{i=1}^{j} γ_i φ(r_i), (20)

where γ_j denotes the coefficient corresponding to r_j. The kernel function mentioned in [72] is now utilized as in Equation (21):

K_{ik} = ⟨φ(r_i), φ(r_k)⟩. (21)

Combining Equations (19) and (20) yields the combined form j λ K γ = K² γ, so that Equation (19) may be rewritten as

j λ γ = K γ. (25)

To conduct dimensionality reduction, the eigenvectors γ and eigenvalues λ are normalized; substituting Equation (20) into Equation (25) gives the consequential Equation (26), whose LHS equals λ_n ⟨γ^n, γ^n⟩, leading to the normalization condition λ_n ⟨γ^n, γ^n⟩ = 1. The principal-component extraction can then be calculated as in Equation (31):

P_n = ⟨ev_n, φ(r)⟩ = Σ_{i=1}^{j} γ_i^n K(r_i, r), (31)

where P signifies the principal component. The generalized versions of the kernel function are: • Linear kernel function: the linear kernel is used when the data are linearly separable, i.e., separable by a single line. It is one of the most commonly used kernels, primarily when a dataset contains a large number of features.
Mathematically, it may be formulated as in Equation (32):

K(x, y) = x^T y. (32)

• Kernel function based on the logistic sigmoid: this function is equivalent to a two-layer perceptron model of the neural network and is used as an activation function for artificial neurons. Equation (33) shows the mathematical representation of the sigmoid-based kernel function:

K(x, y) = tanh(a x^T y + c). (33)

• Kernel function based on radial basis: radial basis function (RBF) kernels are common kernel functions used in various kernel-learning algorithms; in particular, they are often used for SVM classification. Mathematically, an RBF kernel is represented in Equation (34):

K(x, y) = exp(−‖x − y‖² / (2σ²)). (34)

After the FE step, the selected and extracted feature matrix is provided as input to the BNN-based prediction engine for predicting consumption patterns.
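A minimal sketch of the RB-KPCA extraction step, using scikit-learn's `KernelPCA` with an RBF kernel; the random data and the parameter values (`n_components`, `gamma`) are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))   # stand-in for the selected feature matrix

# RBF-kernel PCA: project onto the leading principal components in feature space
kpca = KernelPCA(n_components=4, kernel="rbf", gamma=0.1)
X_reduced = kpca.fit_transform(X)
print(X_reduced.shape)  # (300, 4)
```

The reduced matrix `X_reduced` plays the role of the extracted feature set handed to the forecaster.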

BNN-Based Forecasting Module
The fundamental purpose of NN training is to obtain an appropriate network architecture A and weight vector w. The NN offers an implicit function f(x, w) that maps the input variable x to the output variable y, given A and w. For the dataset D = {(x_1, τ_1), (x_2, τ_2), . . . , (x_n, τ_n)} with assumed A, training seeks the weight vector w for which the mapping function f(x, w) has the lowest error E_D(w) [73], presented in Equation (35):

E_D(w) = (1/2) Σ_{i=1}^{n} ( f(x_i, w) − τ_i )². (35)

Overfitting and reduced generalization performance are ever-present concerns in the NN training process. Therefore, a regularization approach is used during training, and a new error measure, called the generic error, is obtained by augmenting the mapping error E_D(w) [73], as depicted in Equation (37):

E(w) = β E_D(w) + α E_W(w), (37)

where

E_W(w) = (1/2) Σ_{k=1}^{m} w_k². (38)

The factor E_W(w) in Equation (38) is known as the weight-decay term, and α and β are the driving parameters; they control the complexity and adaptability of the NN. The prediction performance of the ANN family is generally measured using the root mean squared error (RMSE) [74], defined in Equation (39):

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² ). (39)

The mean absolute percentage error (MAPE) [74] is commonly used in the electricity market and may be defined as in Equation (40):

MAPE = (100/n) Σ_{i=1}^{n} | (y_i − ŷ_i) / y_i |. (40)

Using the Bayesian approach, Equation (41) gives the posterior probability distribution of w [75,76]:

P(w | D, α, β) = P(D | w, β) P(w | α) / P(D | α, β), (41)

where P(D | w, β) is the likelihood function and P(w | α) is the prior over w. If each error item presented in Equation (42) has a normal distribution with zero mean and variance 1/β, and the weights w_k, k = 1, . . . , m, are likewise normal with zero mean and variance 1/α [76], then

P(D | w, β) = (1/Z_D(β)) exp(−β E_D(w)),  P(w | α) = (1/Z_W(α)) exp(−α E_W(w)).

Substituting these into Equation (41), the posterior becomes

P(w | D, α, β) = (1/Z_E(α, β)) exp(−E(w)),

and the evidence function has the accompanying structure [77]:

P(D | α, β) = Z_E(α, β) / ( Z_D(β) Z_W(α) ). (47)

Let w* be the maximum point of P(w | D, α, β), i.e., the minimum point of E(w) [24].
Using the Taylor expansion of E(w) around w* and retaining terms up to second order,

E(w) ≈ E(w*) + (1/2)(w − w*)^T H (w − w*),

where H is the Hessian matrix of E(w) at w*. Thus, the posterior distribution can be written as in Equation (51):

P(w | D, α, β) = (1/Z*_E(α, β)) exp( −E(w*) − (1/2)(w − w*)^T H (w − w*) ), (51)

where Z*_E(α, β) is the normalization factor. Picking a legitimate basis of the weight space such that H becomes the identity I [78], let λ_1, λ_2, . . . , λ_p be the eigenvalues of the matrix A; then H has eigenvalues λ_1 + α, λ_2 + α, . . . , λ_p + α [79]. If the logarithm of the evidence in Equation (47) is maximized with respect to the hyper-parameters, the most probable values of α and β are obtained. The predicted power-consumption pattern is then dispatched to the optimization module to further minimize errors and enhance accuracy.
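The two error measures in Equations (39) and (40) can be computed directly; the toy load values below are illustrative only.

```python
import numpy as np

def rmse(actual, forecast):
    """Root mean squared error, Equation (39)."""
    return np.sqrt(np.mean((actual - forecast) ** 2))

def mape(actual, forecast):
    """Mean absolute percentage error, Equation (40), in percent."""
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))

actual = np.array([100.0, 120.0, 80.0])
forecast = np.array([110.0, 115.0, 82.0])
print(round(rmse(actual, forecast), 3), round(mape(actual, forecast), 3))
# → 6.557 5.556
```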

Optimization Module Based on BO
The optimizing module interacts with the BNN to optimize its hyper-parameters and thereby provide accurate, dependable, and robust predictive outputs. Many evolutionary algorithms built on probabilistic models have been constructed for this purpose; among these optimization systems, the BO algorithm has gained a great deal of attention.

BO-Based Optimizer
Significant hyper-parameters in traditional models such as the BNN depend heavily on the dataset, and ensuring a correct fit for these hyper-parameters is an art. DL frameworks employ a variety of hyper-parameter tuning algorithms, such as grid search and random search [32][33][34]. BO is a viable strategy for locating the extrema of a given objective function (OF). The OF is estimated as a Gaussian process (GP) and treated through a proxy (surrogate) function. BO performs well when the closed-form expression of the provided OF is unknown but specific observations may be derived from it. In our devised model, BO is employed to find the hyper-parameters that minimize the test or validation loss. The hyper-parameter search space is denoted by S, and model parameters such as the number of hidden layers are represented by N_h, the dropout rate by T_d, the batch size by B_s, etc. Thus, the OF can be expressed as F(s) for s ∈ S, and the optimal model hyper-parameter arrangement is characterized as s* ∈ S such that

s* = arg min_{s ∈ S} F(s).

The observations of the OF can be expressed as D_{1:t} = {(s_1, F(s_1)), . . . , (s_t, F(s_t))}. This allows BO to develop a probabilistic model of F(s), which is used to choose the next position in S to sample; this choice follows from Bayes' theorem, as BO builds a posterior distribution over the OF, and subsequent hyperparameter configurations are selected from this distribution. The information from the initial sampling points is used to infer the shape of the OF and the hyperparameters that optimize the expected result. In our article, the framework's hyper-parameters define the OF, and the goal is to maximize the negative of the validation loss. BO is used to calculate the critical BNN hyperparameter values; since these variables affect prediction accuracy and robustness, a system is designed in which BO selects and optimizes the hyper-parameters of the BNN framework.
After fine-tuning with BO, the BNN is trained under these optimal conditions (best features plus fine-tuned hyper-parameters); this is the final forecasting model that is put to the test. The overall step-by-step procedure of the proposed framework is depicted in Figure 3, a step-by-step working flow chart of the proposed schematic framework for ELF: the red box shows the data pre-processing and feature-engineering module, the green box shows the BNN-based forecaster module, and the purple box represents the BO-algorithm-based optimization module.

BO Algorithm for Hyperparameters Tuning
As a model-based hyperparameter-tuning technique, the BO algorithm models the conditional probability of the validation-set performance given the selected hyperparameters using a surrogate function. In contrast to grid or random search, the BO algorithm tracks all historical evaluations and therefore avoids wasting computation on evaluating bad hyperparameters. In addition, its acquisition function identifies the most promising hyperparameters to assess in the next iteration. The proposed model applies the BO algorithm to find the optimal hyperparameters in the dynamic ensemble module, achieving better tuning efficiency in a much shorter evaluation time. The BO algorithm consists primarily of five parts: the hyperparameter space, the OF, the acquisition function, the history of evaluations, and the surrogate function. In this article, we define the hyperparameter domain in Table 1. The OF is the forecasting error on the augmented validation data. We implement the tree-structured Parzen estimator (TPE) to accomplish the probabilistic modeling of the surrogate function and adopt expected improvement as the acquisition function A, defined in Equation (66):

A(ν) = ∫_{−∞}^{g*} (g* − g) p(g | ν) dg, (66)

where g is the OF and g* is the threshold of the OF, given the hyperparameter choice ν.
The simplified algorithmic description of the TPE-based BO algorithm is shown below in Algorithm 1.
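The loop described above can be sketched compactly. The example below is a hedged illustration, not the paper's implementation: it uses a Gaussian-process surrogate with the expected-improvement acquisition function in place of the TPE surrogate, over a hypothetical one-dimensional learning-rate space with a stand-in validation-loss function.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(lr):
    # Stand-in validation loss with its minimum near lr = 0.1 (hypothetical)
    return (np.log10(lr) + 1.0) ** 2

space = np.logspace(-4, 0, 200)          # hyperparameter search space
rng = np.random.default_rng(0)
X = list(rng.choice(space, size=3))      # initial random evaluations
y = [objective(x) for x in X]

for _ in range(10):
    # Fit the GP surrogate to all historical evaluations (in log space)
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(np.log10(np.array(X)).reshape(-1, 1), np.array(y))
    mu, sigma = gp.predict(np.log10(space).reshape(-1, 1), return_std=True)
    best = min(y)
    # Expected-improvement acquisition function
    with np.errstate(divide="ignore", invalid="ignore"):
        z = (best - mu) / sigma
        ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
        ei[sigma == 0.0] = 0.0
    x_next = space[np.argmax(ei)]        # most promising next candidate
    X.append(x_next)
    y.append(objective(x_next))

best_lr = X[int(np.argmin(y))]
print(best_lr)
```

As the history of evaluations grows, the surrogate concentrates the search near the loss minimum rather than sweeping the grid exhaustively, which is the efficiency argument made above.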

Simulation Setup
The CPU and GPU used in this task are an Intel Core i7-10700K @ 3.80 GHz and an NVIDIA GeForce RTX 2070 SUPER. Modeling, training, tuning, and testing are programmed in Python 3.7. The libraries used for this task are statsmodels 0.12.0 (for Relief-F and Random Forest) and sklearn 0.23.1 (for LR, SVR, EWTFCMSVR, BNN, and ANN-MI).

Compared Models
The BO-algorithm-based optimization module directly affects both the convergence rate and the accuracy. The devised model outperforms existing models such as ANN-MI [80], LSTM, Bi-Level [59], and AFC-ANN [81]. These models were chosen as benchmarks due to their structural resemblance to the developed model. Convergence rate and accuracy are the two quantifiable metrics used to evaluate performance.
• Convergence rate is the time consumed during execution by the forecasting approach; the execution time is measured in seconds (s).
• Accuracy (A) is defined as: and is measured in (%).
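A minimal sketch of how these two metrics can be computed is shown below. The load values are made up for illustration, and the accuracy definition A = 100% − MAPE is an assumption (the paper's accuracy equation is not reproduced here); execution time is measured with a wall-clock timer as in the benchmarks.

```python
import time

def mape(actual, forecast):
    # Mean absolute percentage error (%) over an hourly load profile
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)

# Hypothetical hourly loads (MW) for three time steps
actual = [950.0, 1020.0, 1100.0]
forecast = [940.0, 1040.0, 1080.0]

t0 = time.perf_counter()
err = mape(actual, forecast)
accuracy = 100.0 - err                 # assumed definition: A = 100% - MAPE
elapsed = time.perf_counter() - t0     # convergence-rate proxy, in seconds
```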
The simulation parameters are enumerated in Table 2 and are kept the same for the proposed and benchmark schemes. The simulation results are presented explicitly as follows:

Description of Dataset
Historical EL data from the publicly available PJM electricity market are used to evaluate the effectiveness of the proposed scheme. The training of the forecasting model is characterized by various factors (temperature, humidity, dew point, and time of day). Hourly load data for the USA electrical system for the last four years (2017-2020) are used; the data include humidity, temperature, and load parameters. The electric grid (FE) has the highest load profile and covers the most densely inhabited area. The dataset passes through the feature engineering module, where abstracted features are extracted from the specified dataset. This subset of abstracted features is divided into training and test samples. We used three years of data for network training and one year for network testing. The training samples (2017-2019) include the input vector, the above-mentioned variables, and the main target load profile; test samples from 2020 are used for testing. A validation set is created from the training sequence data to improve parameter selection based on validation error.
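The chronological split described above can be sketched as follows. The record layout (timestamp, temperature, humidity, dew point, load) and the synthetic values are hypothetical placeholders for the PJM data; the point is that the split is by year, in time order, with a validation slice carved from the end of the training period.

```python
from datetime import datetime, timedelta

# Hypothetical hourly records: (timestamp, temperature, humidity, dew_point, load)
start = datetime(2017, 1, 1)
records = [(start + timedelta(hours=h), 20.0, 55.0, 10.0, 900.0 + (h % 24))
           for h in range(4 * 365 * 24)]

# Chronological split: 2017-2019 for training, 2020 for testing
train = [r for r in records if r[0].year <= 2019]
test = [r for r in records if r[0].year == 2020]

# Hold out the final 10% of the training period (in time order) for validation
split = int(0.9 * len(train))
train, val = train[:split], train[split:]
```

Splitting in time order (rather than shuffling) avoids leaking future load information into the training set.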

Learning Curve Evaluation
A learning curve is a pictorial illustration that approximates the efficiency of a framework on training and testing samples across different numbers of epochs. We can use the learning curve to see whether the selected model is learning or merely memorizing. If bias and variance are high, the learning curve is poor and the model neither memorizes nor learns. High bias results in higher training and testing errors as well as faster convergence; high variance appears as a substantial gap between training and test errors. In either case, the model is inappropriate and generalizes inadequately. Overfitting occurs when test error begins to increase while training error keeps decreasing, showing that the model memorizes but does not learn, so such a model is under-generalized. The dropout method and early stopping prevent overfitting problems [82]. For the BNN, however, the test error gradually decreases along with the training error on the USA power grid (FE), so the BNN model avoids overfitting. In addition, the gap between training and testing errors is small, with no pronounced bias or variance, as shown in Figure 4.

Figure 5 shows the day-ahead ELF profile at hourly resolution using the developed framework and benchmark frameworks such as LSTM, Bi-Level, ANN-MI, and AFC-ANN for the USA power grid (FE). The graphical representation shows that all predictive models, including the proposed model, can capture nonlinear load behavior from historical data and predict future electrical loads based on the captured behavior. Models such as Bi-Level, MI-ANN, AFC-ANN, and LSTM use the Levenberg-Marquardt algorithm, the sigmoid activation function, and multivariate AR algorithms for network training, whereas the customized BNN is trained with the hyperbolic tangent (Tanh) function due to its short execution time. This can be seen in Figure 5.
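The overfitting signature described above (test error rising while training error keeps falling) can be detected programmatically. The following helper and its threshold of three consecutive epochs are illustrative assumptions, not part of the paper's method:

```python
def overfit_epoch(train_err, test_err, patience=3):
    """Return the first epoch after which test error rises for `patience`
    consecutive epochs while training error keeps falling, else None."""
    for e in range(1, len(test_err) - patience + 1):
        rising = all(test_err[e + k] > test_err[e + k - 1] for k in range(patience))
        falling = all(train_err[e + k] < train_err[e + k - 1] for k in range(patience))
        if rising and falling:
            return e
    return None

# Hypothetical curves: training error keeps dropping, test error turns up at epoch 4
train_e = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3, 0.25]
test_e = [1.1, 0.9, 0.7, 0.65, 0.7, 0.75, 0.8]
epoch = overfit_epoch(train_e, test_e)
```

On well-behaved curves like the BNN's in Figure 4, where both errors decrease together, the helper returns `None`.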
The day-ahead load predictions of the BNN-based model and the benchmark models are shown in Table 3. The MAPE of the devised BNN-based framework is 0.4920%, while the MAPEs of the ANN-AFC, ANN-MI, and Bi-Level models are 2.9186%, 4.3371%, and 2.4741%, respectively. The developed model has a lower MAPE than the benchmark models, resulting in superior accuracy. Figure 6 presents a performance rating of the devised and benchmark models with respect to the rate of convergence for the USA power grid (FE). There is an inverse relation between the rate of convergence and forecast accuracy. The ANN-AFC framework is more accurate than the ANN-MI framework, but this gain in accuracy comes at the expense of longer execution times: as shown in Figure 6, the execution time rises from 20 s to 110 s. The execution time of the proposed framework has been reduced for two reasons:

Convergence Rate Evaluation
• Abstracted features are fed into the training and forecasting module, reducing network training time.
• The BO algorithm is used due to its significantly faster convergence rate.

With these adjustments, the proposed STLF framework lowered the execution time from 110 s to 42 s. In contrast, ANN-MI performs well in terms of convergence rate even though no optimization module is integrated into it. This tendency is clearly seen in Figure 6.

Scalability Analysis
Scalability analysis shows whether the developed framework scales to the scenario under consideration. Bias, threshold, input samples, and random weights are adjusted and tuned via Equation (1). These factors affect accuracy (measured by the errors) and convergence rate (measured by the execution time) of the proposed framework, as depicted in Figure 7a,b. Forecast accuracy increases from 0 to 700 data samples and then tends to stabilize as more samples are added. This effect can be traced to the value of l in Equation (1), which is closely related to BNN training: a large value of l during training indicates fine-tuning and increases forecast accuracy. Similarly, Figure 7b shows the relationship between sample size and execution time. Using FE for feature selection, BNN for prediction, and the BO algorithm for optimization, the developed model shows relatively good scalability compared to the benchmark models.

Computational Time Analysis
The individual models LSTM, BNN, and Bi-Level do not integrate the FE and optimization modules, resulting in short computational times (τc) but the worst error performance for daily time horizons, as listed in Table 4. The performance of the proposed and benchmark frameworks in terms of computational time and MAPE is depicted in Figure 8a,b. When the FE module, the optimization module, or both are integrated into these individual models, τc increases; this increase reflects the trade-off between convergence speed and accuracy, achieving higher accuracy at the expense of a larger τc. The proposed FE-BNN-BO framework reduces τc through the changes made to the BO algorithm: the FE module reduces the feature space by removing redundant and irrelevant features, and the BO-algorithm-based optimization module adjusts the control parameters of the BNN to ensure an accurate ELF.

Robustness Evaluation
Stochastic noise (white noise, harmonic noise, asymmetric dichotomous noise, and Lévy noise) has a strong adverse effect on the prediction accuracy of the electric power load. Pre-filtering in real time can effectively improve measurement accuracy, so preprocessing and statistically inspecting the load data are essential to characterize its stochastic noise. The proposed feature engineering (FE) module is used to denoise the load data and significantly reduces the stochastic noise amplitude of the power load data. Therefore, the proposed time-series model with the FE method effectively suppresses the stochastic noise of the power load data and improves prediction accuracy, maintaining the robustness of the proposed model. Figure 9 shows the robustness assessment of the proposed FE-BNN-BO model and benchmark models such as LSTM, Bi-Level, ANN-AFC, and MI-ANN. The evaluation is performed by adding noise to each feature and observing the accuracy of each scheme. The proposed framework is more robust than the benchmark frameworks because noise in a feature has little effect on accuracy: most unimportant and irrelevant features are already dropped during the FE phase. Therefore, the proposed FE-BNN-BO framework is also robust against feature noise.
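The noise-injection test behind Figure 9 can be sketched as follows. The load series, noise strengths, and the "perfect forecaster fed noisy features" setup are illustrative assumptions; the point is that each scheme's MAPE is re-measured as feature noise of increasing amplitude is injected.

```python
import random

def mape(actual, forecast):
    # Mean absolute percentage error (%)
    return 100.0 * sum(abs(a - f) / a for a, f in zip(actual, forecast)) / len(actual)

random.seed(7)
load = [900.0 + 50.0 * (h % 24) / 23.0 for h in range(168)]  # one clean week, hourly
noise = [random.gauss(0.0, 1.0) for _ in load]               # fixed unit-variance draws

def corrupt(series, sigma):
    # Inject feature noise of strength sigma (MW) into the series
    return [x + sigma * n for x, n in zip(series, noise)]

# Error of an otherwise-perfect forecast as feature noise grows
errs = [mape(load, corrupt(load, sigma)) for sigma in (0.0, 5.0, 20.0)]
```

Because the same noise draws are scaled, the error grows linearly with sigma here; a robust model is one whose error curve rises slowly under this test.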

Remark 1.
Energy consumption forecasting is of prime importance for the restructured environment of energy management in the electricity market. Accurate energy consumption forecasting is essential for efficient energy management in the smart grid (SG); however, the consumption pattern is nonlinear with a high level of uncertainty and volatility. Given the nonlinearity and complexity of the investigated problem, a BO algorithm is proposed for the optimization module of the proposed model to further improve the accuracy, with reasonable convergence, of the forecasting results returned by the BNN-based forecaster. The proposed FE-BO-BNN model is examined on FE power grid data from the USA in terms of MAPE and convergence rate. Simulation results validated that the proposed FE-BO-BNN model achieved a MAPE of 0.4920%, better than the benchmark models Bi-Level (2.4721%), AFC-ANN (2.9286%), and MI-ANN (4.3371%). The proposed model reduced the average execution time by 21.1%, 35.5%, and 61% compared to MI-ANN, AFC-ANN, and Bi-Level, respectively. It is concluded that the proposed FE-BO-BNN model outperformed benchmark electrical-energy-consumption forecasting models in terms of both accuracy and convergence rate.

Conclusions
ELF is an essential component of the reliable operation of the energy system, since accurate LF helps reduce the generation-demand mismatch through optimal decision-making and advance planning. Short- and/or long-term power generation and infrastructure planning depend on accurate forecast results with only marginal error, and although great effort has gone into developing accurate forecasting algorithms, there is still room to improve algorithmic accuracy by tuning the control parameters on which performance and accuracy depend. In this regard, the paper has presented a new hybrid load-forecasting model based on BNN and BO. The proposed framework uses the BO algorithm to fine-tune the hyperparameters of the BNN to improve its accuracy. The FE module is integrated with the BNN model to further improve computational efficiency and address dimensionality reduction. Through this combination, the proposed model simultaneously achieves higher stability, convergence, and accuracy. The devised framework is assessed using an hourly load dataset obtained from the USA energy grid (FE) and is evaluated against other recent models such as Bi-Level, ANN-AFC, ANN-MI, and LSTM in terms of accuracy and convergence rate. The proposed ELF model outperforms Bi-Level by 15.73%, MI-ANN by 29.1%, and AFC-ANN by 3.97%.

Conflicts of Interest:
The authors declare no conflict of interest.