An Integrated Model of Deep Learning and Heuristic Algorithm for Load Forecasting in Smart Grid

Abstract: Accurate load forecasting plays a crucial role in the effective energy management of smart cities. However, the load profile of smart-city residents is nonlinear and exhibits high volatility, uncertainty, and randomness. Forecasting such nonlinear profiles requires accurate and stable prediction models. To this end, a hybrid prediction model, namely FPP-MLP-GWDO, has been developed with three parts: (i) feature preprocessing (FPP), (ii) a multilayer perceptron (MLP), and (iii) a genetic wind-driven optimization (GWDO) algorithm. The MLP is the key part of the developed model and uses a multivariate autoregressive algorithm and the rectified linear unit (ReLU) for network training. The FPP-MLP-GWDO model is evaluated using Dayton Ohio grid load data in terms of accuracy (the mean absolute percentage error (MAPE), Theil's inequality coefficient (TIC), and the correlation coefficient (CC)) and convergence speed (computational time (CT) and convergence rate (CR)). The findings endorse the validity and applicability of the developed model compared to other literature models such as the feature selection-support vector machine-modified enhanced differential evolution (FS-SVM-mEDE) model, the feature selection-artificial neural network (FS-ANN) model, the support vector machine-differential evolution algorithm (SVM-DEA) model, and the autoregressive (AR) model. The developed FPP-MLP-GWDO model achieved an accuracy of 98.9%, thus surpassing benchmark models such as the FS-ANN (96.5%), FS-SVM-mEDE (97.9%), SVM-DEA (97.5%), and AR (95.7%). Furthermore, the FPP-MLP-GWDO recorded a CT of 299 s, compared with 350 s for the FS-SVM-mEDE, 240 s for the SVM-DEA, 159 s for the FS-ANN, and 132 s for the AR model.


Introduction
Accurate load prediction is indispensable for the effective planning, operation, and energy management of the smart power grid (SPG). It is essential for ensuring the SPG's sustainable, secure, and reliable operation, thus benefiting supply-side and demand-side stakeholders [1][2][3]. On the supply side, precise load prediction enables effective resource allocation to meet residents' energy demands and optimize resource utilization [4]. Conversely, on the demand side, accurate load prediction is imperative for proactive equipment management, load scheduling, efficient energy utilization, and optimal energy management [5]. However, the accuracy of load forecasting is affected by inherent data uncertainties and randomness. These uncertainties make the task of consistently improving forecast accuracy complex and challenging. Consequently, there is a pressing need to develop models that are capable of enhancing forecast accuracy by effectively addressing the inherent uncertainties in load patterns.
In recent decades, numerous load forecasting techniques have been developed, including time series methods such as exponential smoothing [6,7], Kalman filters [8], regression methods [9,10], the grey forecasting model (GM) [11], and the autoregressive integrated moving average (ARIMA) and ARMAX methods [12][13][14]. In [15], the authors developed the ARIMA-MPSO model for load forecasting. These prediction methods are capable of forecasting electric load. However, their accuracy is limited by inherent shortcomings. For instance, linear regressors are suitable for linear problems and perform poorly on nonlinear problems. Methods such as ARIMA use historical and current records for prediction while ignoring other influencing parameters. GM methods can only handle problems with exponential growth trends. Artificial intelligence (AI) emerged as a smart solution to resolve the issues of traditional methods. Examples include expert systems [16], radial basis fuzzy logic models [17,18], machine learning models [19], neural networks [20][21][22], and multilayer perceptron (MLP) models [23]. AI models outperform traditional models in terms of accuracy. However, these methods also suffer from some limitations. For instance, expert systems rely on knowledge acquisition and are challenged when handling uncertainty, radial basis logic models are computationally expensive and have limited generalization capability, and neural network models become trapped in local optima. In [24,25], deep learning models were introduced to resolve the drawbacks of existing models and to improve forecast accuracy. However, these models have high computational complexity. These deep layer models and hybrid methods are superior to intrinsic methods in terms of accuracy. However, they ignore data preprocessing approaches, which are important for improved accuracy.
Considering the limitations of AI methods, hybrid methods have been developed. For example, in [26][27][28], a hybrid method combining the regression neural network (RGNN) and the fruit fly optimization (FFO) algorithm is developed for load forecasting. In [29], an efficient hybrid model using a neural network optimized with the artificial bee colony optimization algorithm is introduced to address the load forecasting issue. The paper in [30] cascaded the support vector machine (SVM) with AI for electric load forecasting in an SPG. Data-driven models have been developed to identify the services needed for load forecasting in smart cities [31]. The authors of [32,33] developed the Mc-SVNN model for sunspot number time series, USD-to-euro currency exchange rate forecasting, daily temperature prediction, and power demand and wind speed forecasting in Abu Dhabi. The proposed model is compared with the literature works in Table 1.
As previously discussed, neural network models have limitations such as poor interpretability, a tendency to become trapped in local minima due to limited extrapolation and generalization ability, and an inability to extract abstract features from datasets due to shallow layouts. The MLP model employs learning principles (the multivariate autoregressive algorithm (MVARA) and ReLU) and a heuristic optimization algorithm to address these limitations and to minimize the error metric, thereby enhancing forecast accuracy. With this motivation, the FPP-MLP-GWDO model is developed in this work. First, the developed model employs FPP, which uses the candidate interaction concept together with redundancy and relevancy filters to return suitable features to the MLP forecaster. Second, the developed model uses the MLP as a forecaster with the MVARA and ReLU as learning principles to enhance model generalization and to facilitate accurate prediction. Finally, the developed model employs the genetic wind-driven optimization (GWDO) algorithm [34] as the optimizer because of its powerful search ability for the optimal solution and its faster convergence rate [35]. The GWDO further improves the prediction accuracy by optimizing the filter thresholds (irrelevancy and redundancy) as well as the weights and biases of the MLP forecaster. This work is a continuation of the earlier work [36], where the FS-FCRBM-GWDO model was developed. The developed model is compared with existing models with respect to error metrics and the CC for validation. The novelty and technical contributions are highlighted below:

• An FPP-MLP-GWDO model has been developed, in which the preprocessing FPP and the postprocessing GWDO are cascaded with the MLP for accuracy improvement.
• Based on existing feature selection techniques [37], the FPP has been developed by introducing the feature interaction concept in addition to the filters (irrelevancy and redundancy) for the selection of key features.
• The GWDO has been applied to the predictions returned from the MLP to further improve the accuracy by optimizing the filter thresholds (irrelevancy and redundancy), weights, and biases.
• The developed hybrid model, FPP-MLP-GWDO, is evaluated using Dayton Ohio grid load data in terms of accuracy (the mean absolute percentage error (MAPE), Theil's inequality coefficient (TIC), and the correlation coefficient (CC)) and convergence speed (the CT and CR). The findings endorse the validity and superiority of the developed model compared to literature models such as the feature selection-support vector machine-modified enhanced differential evolution (FS-SVM-mEDE) [38], the feature selection-artificial neural network (FS-ANN) [26], the support vector machine-differential evolution algorithm (SVM-DEA) [39], and the autoregressive (AR) model.
The remaining sections of this work are organized as follows: Section 2 presents the proposed FPP-MLP-GWDO model. Section 3 presents simulations and discussions. Finally, the conclusion of this work is presented in Section 4.
The FPP-MLP-GWDO hybrid model offers several advantages over existing models in load forecasting. Its improved forecast accuracy stands out, making it a valuable model for precise predictions. Moreover, its ability to converge quickly suits real-time applications, while its adaptability allows it to handle various scenarios and datasets effectively. The model captures nonlinear load patterns, which supports accuracy in complex situations, and its robustness to data variability supports confidence in its reliability. Furthermore, it enhances resource allocation, thus leading to cost savings, and it is designed for scalability to meet the evolving demands of expanding smart city infrastructures. Lastly, the integration of feature preprocessing simplifies data preparation, thereby streamlining the forecasting process. Overall, the FPP-MLP-GWDO model advances load forecasting by offering improved accuracy, efficiency, and fast convergence speed.

Developed Hybrid FPP-MLP-GWDO Model
The developed FPP-MLP-GWDO is a hybrid model with three parts: FPP, an MLP forecaster, and a GWDO optimizer, as depicted in Figure 1. The goal of the developed model is to enhance the accuracy and convergence speed simultaneously. The first part, FPP, uses filters (redundancy and irrelevancy) and interaction operations. The FPP takes load data and other influencing parameters such as the dew point, wind speed, humidity, and temperature as input. In the FPP, the received data are first normalized and then fed to the filters (redundancy and irrelevancy) and interaction operations. The goal of the FPP is to clean the data and return key features (maximum relevancy and feature interaction, minimum redundancy) to the MLP forecaster. The MLP forecaster is trained on the received data to learn and forecast future load patterns for the Dayton Ohio electric grid. The predicted load pattern returned from the MLP forecaster is fed to the GWDO-based optimizer to further improve the accuracy by reducing the error. A comprehensive explanation follows.

Feature Preprocessing
The historical load and exogenous parameters are fed into the FPP part. First, the cleansing operation fills missing values with the average of the earlier day. Then, the cleaned data are passed through the normalization step to bring them within the activation function bounds as follows:
$$\mathrm{Norm} = \frac{X - \mathrm{mean}(X)}{\mathrm{std}(X)} \tag{1}$$
where X represents the input data, mean(X) is its mean, std indicates the standard deviation, and Norm is the normalized data. X includes the load demand data denoted by P(hr, d), the temperature T(hr, d), the dew point D(hr, d), and the humidity H(hr, d), where (hr, d) indicates the hour and the specific day, respectively. The wind speed, dew point, humidity, and temperature are influencing parameters, which are also known as exogenous parameters. The FPP part includes filters (redundancy and irrelevancy) and feature interaction operations. The FPP aims to discard redundant, irrelevant, and nonconstructive features from the dataset, because redundant information causes execution overhead during training, and irrelevant features act as outliers. A comprehensive explanation of the FPP filters and feature interaction operations is given below.
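The cleansing and normalization steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes hourly data (24 samples per day), interprets "earlier day average" as the mean of the preceding day's valid readings, and assumes z-score normalization; the function names are hypothetical.

```python
import numpy as np

def fill_missing_with_previous_day_mean(series, hours_per_day=24):
    """Replace NaNs with the mean of the preceding day's valid readings (assumed rule)."""
    filled = np.asarray(series, dtype=float).copy()
    for idx in np.flatnonzero(np.isnan(filled)):
        day_start = (idx // hours_per_day) * hours_per_day
        previous_day = filled[max(day_start - hours_per_day, 0):day_start]
        previous_day = previous_day[~np.isnan(previous_day)]
        if previous_day.size:  # leave the gap if no earlier day is available
            filled[idx] = previous_day.mean()
    return filled

def zscore(x):
    """Normalize to zero mean and unit standard deviation (assumed form of Norm)."""
    x = np.asarray(x, dtype=float)
    return (x - np.nanmean(x)) / np.nanstd(x)

# Two synthetic days of hourly load with one missing reading on day 2.
day1 = np.arange(1.0, 25.0)
day2 = np.arange(1.0, 25.0)
day2[6] = np.nan
load = fill_missing_with_previous_day_mean(np.concatenate([day1, day2]))
norm = zscore(load)
```

The same two functions would be applied column by column to the exogenous inputs (temperature, dew point, humidity, wind speed) before filtering.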

Relevant Feature Selection: Relevancy Filter
The relevancy filter in FPP plays a significant role in selecting key features. It selects key features by correlating input features with the target. Relevancy can be measured by many techniques [40]. This work uses mutual information (MI) to measure feature relevancy, i.e., how closely a and y are correlated in the data. The MI quantifies the information about y obtained by observing a. The MI for a and y, denoted by I(a; y), is computed from the individual probability distributions (PDs) p(a) and p(y) and the joint PD p(a, y):
$$I(a; y) = \sum_{a}\sum_{y} p(a, y)\,\log_2 \frac{p(a, y)}{p(a)\,p(y)} \tag{2}$$
Let S denote the set containing the input variables ($a_1, \ldots, a_M$) and the target variable y. The computation checks the information that is common between the input $a_i$ and the target y to determine their degree of association. When the common information between the two variables exceeds a certain threshold, it indicates a strong relationship. Moreover, the significance of the input $a_i$ with respect to the target y is computed as follows:
$$D(a_i) = I(a_i; y) \tag{3}$$
where $D(a_i)$ represents the measure of relevance between the input and the target variables.
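To make the relevancy measure concrete, the sketch below estimates the MI between a candidate input and the target from a two-dimensional histogram. The estimator (equal-width binning with 8 bins) and the function name are assumptions for illustration; the source does not state how the probability distributions are estimated.

```python
import numpy as np

def mutual_information(a, y, bins=8):
    """Histogram estimate of I(a; y) in bits (binning scheme is an assumption)."""
    joint, _, _ = np.histogram2d(a, y, bins=bins)
    p_ay = joint / joint.sum()                 # joint PD p(a, y)
    p_a = p_ay.sum(axis=1, keepdims=True)      # marginal p(a), shape (bins, 1)
    p_y = p_ay.sum(axis=0, keepdims=True)      # marginal p(y), shape (1, bins)
    mask = p_ay > 0                            # skip empty cells (0 * log 0 = 0)
    ratio = p_ay[mask] / (p_a @ p_y)[mask]
    return float(np.sum(p_ay[mask] * np.log2(ratio)))

rng = np.random.default_rng(0)
target = rng.normal(size=5000)
relevant = target + 0.1 * rng.normal(size=5000)   # strongly related to the target
irrelevant = rng.normal(size=5000)                # independent of the target

mi_relevant = mutual_information(relevant, target)
mi_irrelevant = mutual_information(irrelevant, target)
```

Ranking candidates by this value, and keeping those above a relevancy threshold, corresponds to the role the relevancy filter plays in the FPP.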

Redundant Feature Elimination: Redundancy Filter
Redundant feature elimination is important for improving the convergence speed, because redundant features slow down convergence. For this reason, the FPP employs a redundancy filter that uses the MI mechanism to find redundant features in the input feature set. The aim is to refine the input feature set by discarding redundant features and keeping relevant ones. According to the research conducted in [40], closely correlated input variables have a negative impact on feature selection. The explanation is that two input variables may share a significant amount of common information while sharing very little redundant information about the target variable. As a result, an input with limited redundant information related to the target variable might be mistakenly considered redundant and excluded, even though it could provide essential abstract features for the proposed model. To address these challenges, a redundancy measure called interaction gain (Ig) was introduced in [41], which is defined in Equation (4):
$$RM(a_i, a_s) = I_g(a_i; a_s; y) = I(\{a_i, a_s\}; y) - I(a_i; y) - I(a_s; y) \tag{4}$$
The redundancy measure, denoted as $RM(a_i, a_s)$, represents the degree of redundancy between the potential inputs $a_i$ and $a_s$, while y denotes the target variable. The interaction gain Ig can be expressed in terms of the joint and individual entropies as follows:
$$I_g(a_i; a_s; y) = H(a_i, a_s) + H(a_i, y) + H(a_s, y) - H(a_i) - H(a_s) - H(y) - H(a_i, a_s, y) \tag{5}$$
The individual entropy values of $a_i$, $a_s$, and y are represented by $H(a_i)$, $H(a_s)$, and $H(y)$, respectively. In contrast, the joint entropy values are denoted by $H(a_i, a_s)$, $H(a_i, y)$, $H(a_s, y)$, and $H(a_i, a_s, y)$.
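The entropy form of the interaction gain can be checked numerically. The sketch below assumes discrete (already binned) variables and uses the standard identity Ig = H(a_i, a_s) + H(a_i, y) + H(a_s, y) - H(a_i) - H(a_s) - H(y) - H(a_i, a_s, y), which follows from expanding each MI term into entropies; the function names are illustrative.

```python
import numpy as np

def entropy(*columns):
    """Joint Shannon entropy in bits of one or more discrete columns."""
    stacked = np.column_stack(columns)
    _, counts = np.unique(stacked, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def interaction_gain(a_i, a_s, y):
    """I({a_i, a_s}; y) - I(a_i; y) - I(a_s; y), written with entropies."""
    return (entropy(a_i, a_s) + entropy(a_i, y) + entropy(a_s, y)
            - entropy(a_i) - entropy(a_s) - entropy(y)
            - entropy(a_i, a_s, y))

# Interacting pair: y is the XOR of two inputs that are individually uninformative.
a1 = np.array([0, 0, 1, 1])
a2 = np.array([0, 1, 0, 1])
ig_xor = interaction_gain(a1, a2, a1 ^ a2)      # +1 bit

# Redundant pair: the second input duplicates the first, and y equals both.
b = np.array([0, 1])
ig_duplicate = interaction_gain(b, b, b)        # -1 bit
```

The XOR case yields a positive gain (the pair is informative only jointly), while the duplicated input yields a negative gain, matching the sign convention used to separate interaction from redundancy.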

Feature Interaction
The authors in [37] introduced the concept of filters (irrelevancy and redundancy) with the aim of eliminating redundant and irrelevant features while retaining the relevant ones for subsequent steps. However, a drawback of filter-based methods is that they may discard features that are individually deemed irrelevant, even though such features may be relevant when considered in combination with other features in the set. Building upon this observation, the FPP introduces the notion of interaction in addition to the filters (irrelevancy and redundancy) approach. If two variables from the input set, $a_i$ and $a_s$, possess redundant features with respect to the target variable y, the joint MI estimate of $a_i$ and $a_s$ with y will be lower than the sum of their individual MI estimates. Consequently, Equation (4) yields a negative value, which signifies that $a_i$ and $a_s$ are redundant features for the model. Taking the absolute value of the result from Equation (4) gives a measure that quantifies the extent of redundancy. In contrast, when the input variables $a_i$ and $a_s$ interact with respect to the target variable y, their joint MI with y exceeds the sum of their individual MIs. Therefore, a positive value in Equation (4) indicates the presence of interacting features, and its magnitude reflects the extent of the interaction. To account for both redundancy and interaction, Equation (4) may be expressed with reference to the interaction gain Ig:
$$RM(a_i, a_s) = \begin{cases} |I_g(a_i; a_s; y)|, & \text{if } I_g(a_i; a_s; y) < 0 \\ 0, & \text{otherwise} \end{cases} \tag{6}$$
Equation (6) is derived by adjusting Equation (4) to measure redundancy. Equation (7) calculates the interaction measure. The interaction measure $IM(a_i)$ for each potential feature is computed below.
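The sign-based split of the interaction gain into a redundancy measure and an interaction measure can be sketched as below. The redundancy branch follows the description of Equation (6); the interaction branch is assumed to be its mirror image for positive gains, and aggregating pairwise values with a maximum is likewise an assumption, since the exact forms of Equations (7) and (8) are not reproduced here. All names are hypothetical.

```python
def redundancy_measure(ig):
    """Magnitude of a negative interaction gain, otherwise zero."""
    return abs(ig) if ig < 0 else 0.0

def pairwise_interaction(ig):
    """Assumed counterpart for interacting pairs: a positive gain, otherwise zero."""
    return ig if ig > 0 else 0.0

def interaction_measure(candidate, pair_gains):
    """Assumed aggregate: strongest positive gain of `candidate` with any partner."""
    partners = [pairwise_interaction(g) for (a, b), g in pair_gains.items()
                if candidate in (a, b)]
    return max(partners, default=0.0)

# Hypothetical interaction gains (bits) between pairs of candidate inputs.
gains = {("temp", "dew"): -0.40, ("temp", "wind"): 0.15, ("dew", "wind"): 0.05}
rm_temp_dew = redundancy_measure(gains[("temp", "dew")])
im_wind = interaction_measure("wind", gains)
```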
The objective of this adapted feature selection technique is to maximize the relevance and interaction measures while minimizing the redundancy using a filter-based approach. Unlike existing techniques such as those proposed in [37,41,42], the FPP technique takes into account the interaction between candidates in addition to the relevance and redundancy filters. Figure 2 illustrates the flow chart of the FPP technique [36]. A stepwise explanation is given below.
Step 1-Potential inputs: The technique takes an input set consisting of the potential inputs and the target value y.
Step 2-Prefiltering: The prefiltering part of the FPP proceeds as follows:
• The blocks enclosed within the dashed box represent the prefiltering part, during which the relevancy and interaction measures are computed. The potential inputs are then ranked according to these computed measures.
• The individual and the gained information are assessed to measure the information content. This is done using an adapted form of Equation (4), which is illustrated in the flowchart presented in Figure 2. The function $f(\cdot,\cdot)$ used in the equation is monotonically increasing, while the weight factor α balances the relevancy and interaction measures. Depending on the specific forecasting problem, this factor can be tuned.
• The potential inputs identified in the prefiltering step ($S_p$) are arranged in descending order of their information value.
Step 3-Filtering stage: The filtering stage [36] of the FPP is illustrated in Figure 3 and proceeds as follows:
• The output of the prefiltering stage serves as the input for the filtering stage, where the preselected features are divided into selected ($S_s$) and nonselected ($S_n$) features, as illustrated in Figure 2. The redundancy measure is computed using Equation (9), where $R(a_i^p)$ represents the measure of redundancy for every potential input $a_i^p$ belonging to the set $S_p$.
• The assessment of the informational significance of the potential features comprises three metrics: redundancy, relevancy, and interaction. In mathematical terms, this evaluation can be expressed as follows: Here, α, β > 0, $V(a_i^p)$ represents the information content, $g(\cdot,\cdot;\cdot)$ denotes a monotonically increasing linear function, and β corresponds to a tunable parameter.
• Each feature is assigned to a set using the following decision process: the information content is compared with the redundancy threshold, which is denoted as $R_{th}$. If the information value is equal to or greater than this threshold, the feature is added to the list of selected features ($S_s$). Otherwise, it is included in the list of nonselected features ($S_n$).
• Features, both selected and nonselected, are arranged in descending order based on their information content. Then, a union operator is applied to create a unified set. Subsequently, the postfiltering phase takes both sets and their combination as input [36], as presented in Figure 4.
Step 4-Postfiltering: In this phase, adjustments are made to both the selected ($S_s$) and nonselected ($S_n$) inputs, thus resulting in updates to the information value $V(\cdot)$. These updated information contents are then reassessed via Equation (11) to determine whether the potential inputs should be included in the selected or nonselected features.
The FPP stops when the nonselected feature set $S_n$ is empty.
During each iteration, the potential input set undergoes prefiltering, filtering, and postfiltering. The stopping rule ensures that the process does not become stuck in an infinite loop and that it returns the selected feature set. Finally, the selected features are passed to the MLP forecaster.
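The filtering decision in Step 3 can be illustrated with a short sketch. It assumes a simple linear information value, V = relevancy + α·interaction − β·redundancy, as one possible instance of the monotonically increasing linear function mentioned above, and a single threshold for the selected/nonselected split. The feature names, scores, and parameter values are hypothetical.

```python
def split_by_information_value(relevancy, interaction, redundancy,
                               alpha=0.5, beta=0.5, threshold=0.1):
    """Rank candidates by an assumed linear information value and split on a threshold."""
    values = {name: relevancy[name] + alpha * interaction[name] - beta * redundancy[name]
              for name in relevancy}
    ranked = sorted(values, key=values.get, reverse=True)   # descending information value
    selected = [name for name in ranked if values[name] >= threshold]
    nonselected = [name for name in ranked if values[name] < threshold]
    return selected, nonselected, values

# Hypothetical per-feature scores (bits).
relevancy = {"temp": 0.60, "dew": 0.50, "wind": 0.05}
interaction = {"temp": 0.10, "dew": 0.00, "wind": 0.00}
redundancy = {"temp": 0.00, "dew": 0.40, "wind": 0.00}

selected, nonselected, values = split_by_information_value(relevancy, interaction, redundancy)
```

In the full technique, α, β, and the thresholds are the quantities the optimizer later tunes, so the defaults above stand in for values that would be returned by that stage.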

MLP Forecaster
This part of the proposed model is the MLP forecaster, which is trained to accurately predict future load patterns. The literature review concludes that the currently available models are able to forecast nonlinear electrical load. Among them, an MLP has been selected as the preferred intelligent forecaster because it predicts nonlinear load patterns with a satisfactory level of accuracy, converges early, and is scalable, with performance that improves as it scales. The MLP is a variant of the ANN comprising input, hidden, and output layers of interconnected artificial neurons. Each neuron in a layer is connected to the neurons in the subsequent layer through synaptic weights, thus forming a fully connected feed-forward architecture, as illustrated in Figure 5. The MLP acquires intricate patterns and relationships within load data through a training process, where the weights and biases of neurons are adjusted using input-output pairs. The MLP forecaster employs the MVARA and ReLU as the learning rules to predict load patterns. These rules are chosen because their early convergence ensures a low CT and a fast CR, and because they address common network challenges such as overfitting and vanishing gradients. This allows the MLP to make accurate load predictions.
The MLP selects potential inputs from a given dataset and maps the input vector x(t) to the output vector $F_t$. The MLP's output is represented in Equation (12), in which $f(y_i)$ is the ReLU, modeled in Equation (13):
$$f(y_i) = \max(0, y_i) \tag{13}$$
The output vector $F_t$ gives the day-ahead forecast results, and it is obtained through a combination of factors. These factors include the weight factor $W_i$, the linear weight $\beta_j$ between the input and output nodes, the input elements $a_j$, and the input to the hidden nodes $y_i$. The training of the MLP utilizes the MVARA and the ReLU transfer function. The calculation of $y_i$ is expressed in Equation (14):
$$y_i = \sum_{j} w_{ij}\,a_j + b_i \tag{14}$$
In Equation (14), $w_{ij}$ represents the weight between the neurons in the input and hidden layers, and $b_i$ denotes the bias for the hidden layer. The learning process continues until one of the following conditions is satisfied: the maximum number of iterations is reached, the stopping criterion is fulfilled, or the error function is minimized. The error function is modeled in Equation (15).
In Equation (15), the actual output of the network pattern is denoted by $R_t$, while the forecast output is represented by $F_t$. Additionally, N corresponds to the number of training samples used in the process.
By incorporating ReLU, the MLP forecaster is able to capture nonlinearities and interactions. In the literature, various algorithms have been used to update the weight and bias vectors during the training process, including the Levenberg-Marquardt algorithm [41], the MVARA [43], back-propagation, and gradient descent [44]. Of the available training algorithms, the MVARA was chosen for network training because of its ability to converge rapidly and deliver enhanced performance. The selected features from the FPP stage, denoted as $S_1, S_2, \ldots, S_n$, are input into the MLP forecaster stage. In this stage, network training uses data samples from the first three years, while the testing phase uses data samples from the last year. The ultimate goal is to train the MLP forecaster to accurately predict load patterns. The MLP forecaster generates an error signal, and its weights and biases are adjusted using the MVARA [45]. The MAPE serves as the objective function for the optimizer, which aims to enhance the accuracy by adjusting the error signal.
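A forward pass through the forecaster can be sketched as follows. The sketch assumes one hidden ReLU layer plus a direct linear input-to-output path, which is one reading of the factors listed for the output ($W_i$, $\beta_j$, $a_j$, $y_i$), and it uses the mean squared error as a stand-in error function. Training with the MVARA is not shown, and all numeric values are made up for illustration.

```python
import numpy as np

def relu(y):
    """Rectified linear unit: max(0, y) element-wise."""
    return np.maximum(0.0, y)

def mlp_forecast(a, w, b, W, beta):
    """Assumed output form: sum_i W_i * relu(y_i) + sum_j beta_j * a_j."""
    y = w @ a + b                     # hidden-node inputs y_i = sum_j w_ij a_j + b_i
    return float(W @ relu(y) + beta @ a)

def mean_squared_error(actual, forecast):
    """Stand-in error function over N samples (the exact form is an assumption)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.mean((actual - forecast) ** 2))

a = np.array([1.0, 2.0])                      # two normalized input features
w = np.array([[1.0, -1.0], [0.5, 0.5]])       # input-to-hidden weights
b = np.array([0.0, 0.0])                      # hidden biases
W = np.array([2.0, 2.0])                      # hidden-to-output weights
beta = np.array([0.1, 0.1])                   # direct input-to-output weights

forecast = mlp_forecast(a, w, b, W, beta)     # hidden inputs are [-1.0, 1.5]
error = mean_squared_error([3.0], [forecast])
```

The first hidden node receives a negative input and is switched off by the ReLU, so only the second node and the linear path contribute to the forecast.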

GWDO Optimizer
The MLP forecaster produces a load pattern with a certain level of MAPE, which is minimized as far as the capabilities of the MLP, MVARA, and ReLU allow. To further reduce the MAPE in the predicted load, the MLP forecaster output is fed into the proposed GWDO optimizer. The GWDO optimizer aims to further decrease the error in the predicted load pattern. Therefore, the optimizer treats the minimization of the error as its objective, which is mathematically represented in Equation (16).
In Equation (16), $I_{th}$, $R_{th}$, and $C_i$ represent the irrelevancy threshold, redundancy threshold, and potential input interaction, respectively. The GWDO optimizer tunes these parameters and provides feedback to the FPP. In the FPP stage, the feature selection approach uses the optimized $I_{th}$, $R_{th}$, and $C_i$ for the selection of key features.
Integrating the GWDO optimizer with the MLP forecaster improves accuracy, albeit at the expense of a degraded convergence rate. Usually, such integration of an optimizer with a forecaster is used in applications where the main emphasis is on accuracy rather than on the speed of convergence.
To optimize the forecaster hyperparameters, numerous techniques have been suggested by researchers. These techniques encompass heuristic approaches, as well as quadratic, convex, linear, and nonlinear programming. In this study, linear programming was not used because of the nonlinear nature of the problem. Nonlinear programming was excluded because of the extended execution time it entails. Convex optimization was rejected because it converges slowly.
Heuristic algorithms such as DE [46], EDE [47], and mEDE [38] were rejected because of challenges such as inadequate precision, sluggish convergence, and the inclination to get stuck in local optima [48]. To overcome the constraints of these approaches, the GWDO is used to optimize the hyperparameters effectively with rapid convergence. The GWDO is a hybrid approach that combines the key merits of the WDO [35] and the GA [49]. This hybridization is advantageous, as it leverages the fast convergence speed of the WDO while benefiting from the population diversity provided by the GA.
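The idea of combining a WDO-style search with GA operators can be sketched as below. This is not the GWDO of [34]: the velocity update follows the commonly cited WDO form (friction, gravity, pressure-gradient, and Coriolis terms), and the hybridization scheme (replacing the lower-ranked half of the population with crossover and mutation offspring) is an assumption made for illustration. The parameter values and the toy objective, which stands in for the forecast MAPE as a function of the tuned thresholds, are hypothetical.

```python
import numpy as np

def gwdo_like_minimize(objective, dim, pop=20, iters=100, seed=0,
                       alpha=0.4, g=0.2, rt=3.0, c=0.4, v_max=0.3, p_mut=0.1):
    """Minimize `objective` over [-1, 1]^dim; returns (best_x, best_f)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, (pop, dim))        # air-parcel positions
    u = rng.uniform(-v_max, v_max, (pop, dim))    # air-parcel velocities
    best_x, best_f = x[0].copy(), np.inf
    half = pop // 2
    for _ in range(iters):
        f = np.array([objective(p) for p in x])
        order = np.argsort(f)                     # rank 1 = lowest error
        x, u, f = x[order], u[order], f[order]
        if f[0] < best_f:                         # keep the best solution seen so far
            best_x, best_f = x[0].copy(), float(f[0])
        rank = np.arange(1, pop + 1)[:, None]
        u_other = np.roll(u, 1, axis=1)           # velocity taken from another dimension
        u = ((1.0 - alpha) * u                    # friction
             - g * x                              # gravity toward the origin
             + rt * np.abs(1.0 / rank - 1.0) * (best_x - x)   # pressure gradient
             + c * u_other / rank)                # Coriolis-like coupling
        u = np.clip(u, -v_max, v_max)
        x = np.clip(x + u, -1.0, 1.0)
        # GA step (assumed): rows keep their pre-move ranking, so the lower-ranked
        # half is replaced by uniform-crossover children of the better half.
        parent_a = x[rng.integers(0, half, half)]
        parent_b = x[rng.integers(0, half, half)]
        children = np.where(rng.random((half, dim)) < 0.5, parent_a, parent_b)
        mutate = rng.random((half, dim)) < p_mut
        children = np.where(mutate, rng.uniform(-1.0, 1.0, (half, dim)), children)
        x[pop - half:] = children
    return best_x, best_f

def toy_error(p):
    """Hypothetical stand-in for the forecast error as a function of two thresholds."""
    target = np.array([0.3, -0.2])
    return float(np.sum((p - target) ** 2))

best_x, best_f = gwdo_like_minimize(toy_error, dim=2)
```

In the forecasting setting, each position would be mapped to candidate values of the thresholds, and evaluating the objective would mean rerunning feature selection and the forecaster, which is why the optimizer adds to the computational time.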

Results and Discussions
To assess the effectiveness of the FPP-MLP-GWDO model, MATLAB simulations were performed on a laptop with an Intel(R) Core i3 CPU @ 2.4 GHz and 8 GB of RAM. The performance of the FPP-MLP-GWDO framework was evaluated by comparing it with the literature models: the FS-SVM-mEDE [38], FS-ANN [26], SVM-DEA [39], and AR model. These benchmark frameworks were selected because they share architectural similarities with the developed FPP-MLP-GWDO model.
The FPP-MLP-GWDO model was evaluated using load data from the Dayton Ohio grid. This dataset was obtained from the PJM electricity market, which is publicly available and openly accessible [50], and was also used in a previous study [37]. Figure 6 illustrates the four years of load data from the Dayton Ohio grid, which span from 2014 to 2017. The MLP forecaster uses eighty percent of the data for training and the remaining twenty percent for testing. The learning curve assesses the effectiveness of a model by comparing its performance on training and testing data samples over multiple epochs. The objective is to determine whether a model is genuinely learning from the data or merely memorizing it. A poor learning curve is indicative of high variance and bias in the model, thereby suggesting that it is more focused on memorizing the training data than on extracting meaningful patterns. Such a model, characterized by both high variance and bias, typically shows decreased accuracy and a limited ability to generalize effectively. The MLP forecaster learning curve exhibited favorable characteristics for two key reasons. First, there was minimal bias and variance, as evidenced by the small difference between the errors observed during training and testing. Second, both the training and testing errors declined as the number of epochs grew. The MLP forecaster learning curve is illustrated in Figure 7.
Initially, when the number of epochs was zero, the MAPE was high, thus indicating that the model was not yet well trained. However, as the number of epochs increased, the MAPE gradually decreased, eventually converging to a minimum acceptable value. This point of convergence, known as the saturation point, signifies that the model was effectively trained and had achieved satisfactory performance. During the simulations, we used the control parameters listed in Table 2, and their rationale is documented in the literature [36]. These control parameters were kept the same for the FPP-MLP-GWDO and the comparative models, thereby ensuring a fair comparative analysis.
The evaluation of the developed model focused on two performance aspects: accuracy (MAPE, TIC, and CC) and convergence speed (CT and CR). The MAPE and TIC are modeled in Equations (17) and (18), respectively.
The CC metric is defined in Equation (19).
In Equations (17)-(19), $R_t$ and $F_t$ denote the actual and predicted load values, respectively, while $\mu_a$ and $\mu_F$ correspond to the mean values of the actual and predicted load, respectively.
The accuracy is computed from the error using Equation (20),
where A represents the accuracy. The convergence speed encompasses two aspects: the CT and the CR. The CT refers to the time it takes for the forecaster to return the predicted load pattern. The CR represents the number of iterations the model needs to reach an optimal result, i.e., the point beyond which the error no longer decreases significantly with further iterations. Forecasters with lower CT and CR values (requiring fewer epochs to converge) are considered faster, while higher CT and CR values indicate slower convergence. In this study, the CT is expressed in seconds, while the CR is expressed in iterations. A comprehensive analysis of the performance metrics for the FPP-MLP-GWDO model and existing models is presented below.
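The accuracy metrics can be sketched with their standard definitions. These are assumptions: the MAPE is taken as the mean of absolute percentage errors, the TIC as Theil's U1 coefficient (root mean squared error divided by the sum of the root mean squares of the actual and predicted series), the CC as the Pearson correlation coefficient, and the accuracy as 100 minus the MAPE. The load values in the example are made up.

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(100.0 * np.mean(np.abs((actual - forecast) / actual)))

def tic(actual, forecast):
    """Theil's inequality coefficient (U1 form, between 0 and 1)."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    rmse = np.sqrt(np.mean((actual - forecast) ** 2))
    return float(rmse / (np.sqrt(np.mean(actual ** 2)) + np.sqrt(np.mean(forecast ** 2))))

def cc(actual, forecast):
    """Pearson correlation coefficient between actual and predicted load."""
    return float(np.corrcoef(actual, forecast)[0, 1])

def accuracy(actual, forecast):
    """Assumed form of the accuracy: 100 - MAPE, in percent."""
    return 100.0 - mape(actual, forecast)

actual = np.array([100.0, 200.0, 400.0])      # hypothetical hourly loads
forecast = np.array([110.0, 190.0, 400.0])    # hypothetical predictions
error_percent = mape(actual, forecast)        # absolute errors of 10%, 5%, and 0%
```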

Accuracy Evaluation
The accuracy of the proposed model was evaluated for both day-ahead and week-ahead load forecasting. In this work, the developed FPP-MLP-GWDO model leveraged the MVARA and ReLU to capture nonlinear load trends. In contrast, the comparative models (the FS-ANN, SVM-DEA, and FS-SVM-mEDE) used the Levenberg-Marquardt algorithm and the sigmoid function to capture nonlinear load trends. The MVARA and the ReLU activation function were selected for their advantages, including fast convergence and the ability to address challenges such as overfitting and vanishing gradients. Figure 8 shows that the developed FPP-MLP-GWDO model closely followed the actual load pattern, thereby outperforming the benchmark models (the AR, FS-ANN, SVM-DEA, and FS-SVM-mEDE) in terms of load prediction accuracy. The MAPE of the FPP-MLP-GWDO model was 1.10%, while the SVM-DEA achieved 2.5%, the FS-SVM-mEDE achieved 2.1%, the FS-ANN achieved 3.5%, and the AR achieved 4.3%. This comparison, illustrated in Figure 8, affirms the superior accuracy of the FPP-MLP-GWDO model.
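As a quick arithmetic check, and assuming the accuracy is computed as A = 100 - MAPE (an assumption about the form of Equation (20)), the day-ahead MAPE values reported above reproduce the accuracies quoted in the abstract:

```python
# Day-ahead MAPE values (percent) as reported in the text.
day_ahead_mape = {
    "FPP-MLP-GWDO": 1.10,
    "FS-SVM-mEDE": 2.1,
    "SVM-DEA": 2.5,
    "FS-ANN": 3.5,
    "AR": 4.3,
}

# Assumed relation: accuracy (percent) = 100 - MAPE (percent).
day_ahead_accuracy = {model: round(100.0 - value, 1)
                      for model, value in day_ahead_mape.items()}
```

The resulting values are 98.9, 97.9, 97.5, 96.5, and 95.7 percent, respectively.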

Week-Ahead Load Prediction
Figure 9 displays the week-ahead load forecasting at hourly resolution for the proposed FPP-MLP-GWDO model and the existing benchmark models (the AR, FS-ANN, SVM-DEA, and FS-SVM-mEDE) on the Dayton Ohio grid data. The FPP-MLP-GWDO model stood out with its accurate, fast, and stable load prediction, thus surpassing the comparative models. Notably, the load profile returned by the FPP-MLP-GWDO closely aligned with the target load, as depicted in Figure 9.
The proposed hybrid model, FPP-MLP-GWDO, achieved a MAPE of 1.12%. In contrast, the comparative models, namely the AR, FS-ANN, FS-SVM-mEDE, and SVM-DEA, exhibited MAPE values of 4.6%, 3.5%, 2.1%, and 2.5%, respectively. The superior performance of the FPP-MLP-GWDO model can be attributed to its combination of the MLP with the MVARA, ReLU, and GWDO optimizer. The load forecasting curve generated by the proposed model closely aligned with the target curve, thus further confirming its superior performance compared to the benchmark models. The developed FPP-MLP-GWDO and the comparative models (the FS-ANN, SVM-DEA, and FS-SVM-mEDE) were also evaluated in terms of the cumulative distribution function (CDF) of errors, as shown in Figure 10.
The findings reveal that the FPP-MLP-GWDO was superior to the comparative models in terms of the CDF. The utilization of the MLP, with its deep layers designed to capture essential features, enabled reliable prediction, even in highly uncertain situations. Therefore, the proposed model presents an optimal choice for distribution system operators aiming to enhance the efficiency of the SPG.
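For readers who wish to reproduce a CDF-of-errors comparison of this kind, a minimal Python sketch is given here. It assumes that the errors are hourly absolute percentage errors and that the CDF is the empirical one; the function names and the sample numbers are hypothetical and are not taken from the Dayton Ohio data.

```python
def absolute_percentage_errors(actual, predicted):
    """Per-sample absolute percentage errors (in percent)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return [abs(a - p) / abs(a) * 100.0 for a, p in zip(actual, predicted)]


def empirical_cdf(values):
    """Return (sorted values, cumulative fraction of samples at each of them).

    Plotting the second list against the first as a step curve gives the
    empirical CDF; a curve that rises earlier means smaller errors overall.
    """
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]


if __name__ == "__main__":
    # Hypothetical hourly loads and forecasts, for illustration only.
    actual = [100.0, 200.0, 400.0, 500.0]
    predicted = [99.0, 204.0, 400.0, 485.0]
    xs, cdf = empirical_cdf(absolute_percentage_errors(actual, predicted))
    for x, f in zip(xs, cdf):
        print(f"error <= {x:.2f}% for {f:.0%} of samples")
```

Under this reading, one model is "superior in terms of the CDF" when its curve lies above and to the left of the others, since a larger share of its forecasts fall under any given error threshold.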
The statistical evaluation of the accuracy is presented in Figure 11. The MAPE quantifies the deviation between the predicted and the actual values: a lower MAPE indicates higher accuracy, while a larger MAPE indicates poorer accuracy. The MAPE values for the proposed model, FS-ANN, SVM-DEA, and FS-SVM-mEDE were 1.10%, 3.5%, 2.5%, and 2.1%, respectively.
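The relation between the MAPE and the reported accuracy can be sketched in a few lines of Python. The MAPE below follows its standard definition; the accuracy rule A = 100 − MAPE is an assumption, since Equation (20) is not reproduced in this section, although it is consistent with the reported pair of 1.10% MAPE and 98.9% accuracy. The function names are illustrative.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def accuracy_from_mape(mape_percent):
    """Accuracy in percent, ASSUMING A = 100 - MAPE (not confirmed by the source)."""
    return 100.0 - mape_percent


if __name__ == "__main__":
    # MAPE values as reported in the text for the day-ahead case.
    reported = {"FPP-MLP-GWDO": 1.10, "FS-SVM-mEDE": 2.1,
                "SVM-DEA": 2.5, "FS-ANN": 3.5, "AR": 4.3}
    for name, m in reported.items():
        print(f"{name}: MAPE = {m:.2f}%, accuracy = {accuracy_from_mape(m):.1f}%")
```

Applied to the reported day-ahead MAPE values, this rule reproduces the accuracies quoted for all five models (98.9%, 97.9%, 97.5%, 96.5%, and 95.7%).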
The performance evaluations and discussions presented above indicate that the SVM-DEA is superior to the FS-ANN in terms of the MAPE. The improved accuracy of the SVM-DEA model can be attributed to the integration of the DEA optimizer with the SVM forecaster. Such modifications contributed to enhanced accuracy. Additionally, the developed FPP-MLP-GWDO model exhibited better accuracy than the comparative models (the AR, FS-ANN, SVM-DEA, and FS-SVM-mEDE), as illustrated in Figure 11. However, it is noteworthy that the integration of the DEA optimizer led to an increase in the CT.

Convergence Speed Evaluation
The convergence speed of the developed model and the comparative models is evaluated in terms of two aspects, the CT and the CR, which are discussed in the following subsections.

Convergence Speed in Terms of the CT
First, the convergence speed evaluation in terms of the CT is presented in Figure 12. The findings reveal that the FPP-MLP-GWDO had a CT of 299 s, whereas the AR, FS-ANN, SVM-DEA, and FS-SVM-mEDE had CTs of 132 s, 159 s, 240 s, and 350 s, respectively. The results demonstrate that the CT increased from 159 s (FS-ANN, without an optimizer) to 240 s (SVM-DEA, with the DEA optimizer) when the optimizer was integrated. Forecasters without an optimizer or feature selector had low CTs; conversely, adding a preprocessing/feature selection technique or an optimizer to an intelligent forecaster increased the CT. The proposed model had a lower CT than the hybrid forecasting model of a similar nature, the FS-SVM-mEDE (in which both the FS and the mEDE optimizer are integrated with the forecaster). On the other hand, the AR, FS-ANN, and SVM-DEA models had CTs of 132 s, 159 s, and 240 s, respectively, which were lower than that of the proposed model, because the FS-ANN integrates only a feature selector and no optimizer, while the SVM-DEA adds only the DEA optimizer and no feature selector.
The proposed FPP-MLP-GWDO model reduced the CT compared to the similar hybrid model for several reasons: the use of the fast-converging GWDO optimizer [34] instead of the EDE or mEDE optimizer [38,46,47], the utilization of the MVARA and ReLU in lieu of the sigmoid function, the adoption of the MLP, which is superior to the ANN, and the introduction of a novel feature interaction concept in the FPP for feature selection in addition to the filters (irrelevancy and redundancy). In contrast, the comparative models only employ the MI filters approach (irrelevancy and redundancy) or the DE optimizer, which are computationally expensive. It is noteworthy that the developed model needs more CT than the FS-ANN. This difference is due to the absence of an optimizer in the FS-ANN model (see Figure 12). Thus, the discussion concludes that a tradeoff exists between the accuracy and the convergence speed.
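The tradeoff can be made concrete by expressing the reported CT values as relative changes with respect to the proposed model. The short Python sketch below uses only the CT figures quoted in the text; the helper name `relative_change` is illustrative.

```python
# CT values in seconds, as reported in the text for Figure 12.
reported_ct = {
    "AR": 132,
    "FS-ANN": 159,
    "SVM-DEA": 240,
    "FS-SVM-mEDE": 350,
    "FPP-MLP-GWDO": 299,
}


def relative_change(new, reference):
    """Percentage change of `new` relative to `reference` (negative = faster)."""
    return 100.0 * (new - reference) / reference


if __name__ == "__main__":
    proposed = reported_ct["FPP-MLP-GWDO"]
    for name, ct in reported_ct.items():
        if name != "FPP-MLP-GWDO":
            change = relative_change(proposed, ct)
            print(f"FPP-MLP-GWDO vs {name}: {change:+.1f}% CT")
```

On these figures, the proposed model is roughly 15% faster than the FS-SVM-mEDE but slower than the AR, FS-ANN, and SVM-DEA models, which is the price paid for the higher accuracy.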

Convergence Speed in Terms of the CR
The performance analysis regarding the convergence speed in terms of the CR for 100 iterations is presented in Figure 13. The figure demonstrates the fast convergence and effective search capability of the FPP-MLP-GWDO compared to the FS-ANN, SVM-DEA, and FS-SVM-mEDE models. Figure 13 illustrates the MAPE of the FPP-MLP-GWDO model and the comparative models (the AR, FS-ANN, SVM-DEA, and FS-SVM-mEDE) over more than 110 iterations. The MAPE decreased as the number of iterations increased for both the FPP-MLP-GWDO model and the comparative models. Nevertheless, the proposed model converged at approximately the 18th iteration, whereas the AR, FS-ANN, SVM-DEA, and FS-SVM-mEDE models converged around the 55th, 39th, 35th, and 31st iterations, respectively. This analysis demonstrates that the GWDO is more suitable as an optimizer in hybrid models.
An overall evaluation of the FPP-MLP-GWDO, AR, FS-ANN, SVM-DEA, and FS-SVM-mEDE models is summarized in Table 3. This evaluation covers the computational complexity, CT, CR, and accuracy.
The simulations, performance analysis, and discussions presented above indicate that the hybrid model FPP-MLP-GWDO demonstrates superior performance compared to benchmark models such as the FS-SVM-mEDE, SVM-DEA, and FS-ANN in terms of the accuracy, CR, CT, and complexity.

Conclusions
Load forecasting is imperative for decision-making processes in the SPG, thus enabling the efficient utilization of available generation, operational planning, load scheduling, and the assessment of contracts. To address these needs, a novel model called FPP-MLP-GWDO was introduced, in which the FPP, the MLP forecaster, and the GWDO optimizer are cascaded with the aim of achieving accurate load prediction while maintaining an affordable convergence speed. In the FPP-MLP-GWDO model, an innovative approach to feature interaction, in addition to the filters (irrelevancy and redundancy), was used in the FPP to find favorable features for the MLP forecaster. Considering the nonlinear and intricate nature of the problem at hand, the GWDO was used to optimize the forecasting results obtained from the MLP forecaster, thereby enhancing accuracy while ensuring reasonable convergence. To assess the performance of the developed model, experiments were conducted using the Dayton Ohio grid dataset, employing metrics of accuracy (MAPE, TIC, and CC) and convergence speed (CT and CR). The findings confirm that the developed FPP-MLP-GWDO model achieved an accuracy of 98.9%, thus surpassing benchmark models such as the AR (95.7%), FS-ANN (96.5%), FS-SVM-mEDE (97.9%), and SVM-DEA (97.5%).

Figure 7. Developed model learning curve on testing and training sets with respect to the MAPE.

Day-Ahead Load Prediction
Figure 8 illustrates the comparison of the day-ahead load predictions for the Dayton Ohio grid between the proposed FPP-MLP-GWDO model and the existing models (the AR, FS-ANN, SVM-DEA, and FS-SVM-mEDE). The results demonstrate the capability of the proposed model to accurately predict the day-ahead load for the Dayton Ohio grid. It is evident that all the forecasters, including both the developed and comparative models, are able to capture the nonlinear historical load trends. This capability is due to their activation functions (ReLU, sigmoid, tanh, etc.) and learning algorithms (the Levenberg-Marquardt algorithm, MVARA, back-propagation, gradient descent, etc.).

Figure 8. The day-ahead load forecasting of the developed model using Dayton Ohio grid data.

Figure 9. The week-ahead load forecasting of the developed model using Dayton Ohio grid data.

Figure 10. Evaluation of FPP-MLP-GWDO and comparative models in terms of the CDF using the MAPE.

Figure 11. Proposed model evaluation in terms of the MAPE using Dayton Ohio grid data.

Figure 12. Developed model evaluation in comparison with existing models in terms of computational time using Dayton Ohio grid data.

Figure 13. Proposed model evaluation in comparison with existing models' convergence speed values in terms of the CR.

Table 1. Recent related works compared with the proposed model.

Table 3. Evaluating the complexity, CT, CR, and accuracy of the suggested model and existing models such as AR, FS-ANN, SVM-DEA, and FS-SVM-mEDE.