PLS-CNN-BiLSTM: An End-to-End Algorithm-Based Savitzky–Golay Smoothing and Evolution Strategy for Load Forecasting

: This paper proposes an effective deep learning framework for Short-Term Load Forecasting (STLF) of multivariate time series. The proposed model consists of a hybrid Convolutional neural network-Bidirectional Long Short-Term Memory (CBiLSTM) based on the Evolution Strategy (ES) method and the Savitzky–Golay (SG) ﬁlter (SG-CBiLSTM). The adopted methodology incorporates the virtue of different prepossessing blocks to enhance the performance of the CBiLSTM model. In particular, a data-augmentation strategy is employed to synthetically improve the feature representation of the CBiLSTM model. The augmented data is forwarded to the Partial Least Square (PLS) method to select the most informative features above the predeﬁned threshold. Next, the SG algorithm is computed for smoothing the load to enhance the learning capabilities of the underlying system. The structure of the SG-CBiLSTM for the ISO New England dataset is optimized using the ES technique. Finally, the CBiLSTM model generates output forecasts. The proposed approach demonstrates a remarkable improvement in the performance of the original CBiLSTM model. Furthermore, the experimental results strongly conﬁrm the high effectiveness of the proposed SG-CBiLSTM model compared to the state-of-the-art techniques. network-bidirectional long short-term memory. The experimental results strongly reveal that the SG-CBiLSTM model is advantageous for coping with the sudden variation of the load demand and improving the overall accuracy to reach a mean R = 99.18%. The simulation results have demonstrated that the proposed methodology has the following characteristics: (1) The ten polynomial levels provide the most suitable form after the variation of the levels of the SG filter; (2) the proposed SG-CBiLSTM model acquires a higher extrapolation performance to achieve better results than the single CBiLSTM-AE; (3) the proposed model outperforms nine deep learning models used as benchmarks. The SG-CBiLSTM-AE is perfectly tailored to accurately predict the hourly load data. However, its advantages actually hide some limitations related to the poor performance under an extended forecasting horizon. Future work will broaden the scope to include an additional evaluation of the proposed SG-CBiLSTM model for other power systems and a variety of forecasting applications such as the electricity price and renewable energy prediction.


Introduction
For power systems, grid stability is becoming very sensible to the bulk integration of renewable energy sources due to the irregularity and seasonality of numerous external factors. These factors fall into two major aspects: (i) Meteorological, such as temperature, irradiation, and wind speed, and (ii) social, such as working days, lifestyles, and number of consumers [1]. Thus, grid sustainability is threatened by the high fluctuations in the load profiles. These fluctuations cause financial penalties and serious technical issues related to the electrical grid. Load forecasting (LF) is a central solution to effectively cope with the dynamic operational environment. The LF is mainly classified into three categories based on the time horizon: was found very sensible to the inconsistency of residents' lifestyles which significantly diminishes its extrapolation ability [19].
Authors in [14] introduce a hybrid CNN-LSTM model for residential energy consumption forecasting. While the proposed model outperforms several benchmarks, the visualization graphs and the score metrics clearly show that the experimental results are not satisfactory with a MAPE = 34.84% due to the irregular trend of power consumption data. To the authors' best knowledge, the research works focus to date on the development of the forecasting engine rather than the deep investigation of the feature representations related to load data itself. The DL architectures can automatically carry out the data processing. However, there are very few available research works oriented towards the hybridization of DL models along with different features engineering techniques to achieve higher prediction accuracy [20,21]. Therefore, this paper aims to tackle the non-linearity and volatility of STLF by coupling data augmentation, variable importance analysis, and feature smoothing alongside with hybrid CBiLSTM model with the aim to provide a highly reliable hourly STLF system. The major contributions of this paper are the following: • A novel forecasting framework is comprehensively presented for STLF. The proposed SG-CBiLSTM model is recommended for highly nonlinear time series, whose patterns are difficult to capture by sample models. • A novel method for STLF is introduced and verified based on real-world dataset and simulation results. Regardless the numerous benefits of STLF, the actual forecasting methods remain less effective yielding non-negligible forecast errors due to the non-linearity of the load patterns. The proposed methodology greatly enhances the overall accuracy of the prediction engine for the STLF application. • The performance of the hybrid model is investigated using point forecast assessment, ten-fold cross-validation, score metrics, reference DL models, and simulation graphs.
The structure of this paper is organized as follows: Section 2 introduces the existing state-of-the-art techniques. Section 3 presents the preliminaries relative to the current study and describes the proposed model. Section 4 evaluates the feasibility of the model with the numerical analysis. Moreover, the comparison of the proposed method with some relevant ML techniques considered as benchmarks for this study is conducted in Section 4. Section 5 concludes the study and ends the paper.

Related Works
The need for very precise load forecasts requires the use of well-defined techniques in the feature engineering procedure. In this research study, the proposed techniques are divided into three folds: data smoothing, features selection, and Data Augmentation (DA). A brief introduction of these techniques is given as follows:

Data Augmentation (DA)
The DA technique is employed to reduce the underfitting problem of DL architectures by describing the data in a meaningful way [19]. DA improves the size and quality of the load information. This improvement lies in generating more information from the data representation to give a better insight about the load profile for a specific forecasting horizon. However, the application of DA is not very common on time series contrary to its vast applicability to image recognition and natural language processing. For instance, the authors in [22] derive from the datetime data type additional information to the celebration days and holidays such as Christmas Day, Thanksgiving Day, and Independence Day. Authors in [23] proposed an augmented CNN method to provide better learning potential covering unexplored input space for the CNN training. The key idea of the proposed DA procedure consists of concatenating heterogeneous residual load data generated from the electrical load of a single household. A technical paper in [24] provides a complete survey on the latest taxonomy of time series DL technique.
In our work, the hourly temperature is used as indicator feature for the load state, the proposed DA technique consists of two stages: firstly apply the time decomposition to reveal the behavior of consumers and then lagged load generation to follow the chains of load variation in time.

Feature Importance and Dimensionality Reduction
Feature importance is highly required to identify the most relevant features of the data representation. Feature dimension optimization is essential to improve the forecasting accuracy comprehensibility, and avoid overfitting problems. For a high dimensional data analysis, The removal of irrelevant and repetitive feature labels is the simplest technique to perform Dimensionality Reduction (DR). In [23], the authors studied the effect of DR on the load demand. Their study consists of generating 67 vectors regarded as possible potential inputs of the forecasting system. Next, they demonstrated that DR-based Principal Components Analysis (PCA) and Canonical Variate Analysis (CVA) succeed to enhance the accuracy of the forecasting results with the performance superiority of the PCA method. PCA is a nonparametric classical method frequently used to extract a smaller representative data set from high-dimensional data while conserving its spatial characteristics. The mechanism of PCA lies in using orthogonal linear transformation to transfer a dataset to another coordinate system, In the new coordination, the first parameter has the greatest variance followed by the remaining coordinates, respectively. The number of coordinates is fixed based on the data variability.
However, Feed Foreword Neural Network (FFNN) is not very effective for high dimensional space. Thus, the FFNN-based PCA produces non-negligible forecast errors. On the other hand, PCA presents a high sensitivity to the data distribution in the search space. Another study in [25] employed Echo State Networks with PCA decomposition for STLF application. While PCA achieves an acceptable results for correlated and accelerates the forecasting system, it has been noticed that the selection of the number of Principal Components (PC) requires a level of expertise to achieve the desired results. A wrong PC parameter causes a significant decrease in the performance on the model performance since some relevant features are omitted. Further, the interpretability of the features of PCA decomposition is relatively a difficult task.

Data Smoothing
To improve temporal cognition of data representation, the removal of the noisy samples is mandatory to boost the forecasting system accuracy by allowing important patterns to stand out. Furthermore, smoothing techniques are very important to track the seasonality of the data and thus improve the performance of ML techniques. Most of the commonly used techniques implement moving average and seasonal exponential smoothing due to their simplicity and good performance in time series forecasting [26]. For instance, the authors in [27] reviewed exponential smoothing techniques, specifically, single exponential smoothing, double exponential smoothing, holt's method, and adaptive response rate exponential smoothing. The cited methods are advantageous in terms of ease of interpretability and integration of the changing pattern effect of the data series into the prediction model. However, most of those approaches face problems with seasonal data and parameter tuning difficulties [28]. To overcome these problems, a Stavisky Golay (SG) filter is proposed in this paper to tackle the rapid variations from signals. In [29], the authors used an SG filter to remove the noise from phonocardiogram signals in the preprocessing stage. Then, an automatic classification CNN model receives the smoothed results to generate forecasts. In [30], an LSTM SG model was proposed to predict the Large-Scale Water Quality Time Series. However, the SG smoothing receives little attention in the field of smart grid applications despite its considerable importance. From our paper, an SG filter is implemented for the noise reduction of the load demand to further supplement the existing smoothing techniques for STLF in generating more accurate results.

The Proposed Methodology
The various parts of the hybrid SG-CBiLSTM, particularly, SG, PLS, and CBiLSTM have been presented. Furthermore, the workflow of the proposed technique is explicitly explained.

Bidirectional Lstm (BiLSTM) Model
The major success of LSTM model is accorded to its excellent ability in solving the vanishing problem of vanilla Recurrent Neural Network (RNN). The LSTM mechanism consists of the deployment of three memory gates, namely, input gate (i t ), forget ( f t ) gate, and output gate (o t ). Therefore, LSTM network carries out long-term dependencies of sequential data with high efficiency. The mathematical equations of LSTM gates are given as follows [31]: where x t , σ, and c t denote the input sample at time t, the sigmoid activation function, and the memory unit, stands for the bias and weight matrix for each gate, respectively. The symbol is the corresponding multiplication of the elements. Firstly, the h t−1 , c t−1 , and x t pass the input information to the LSTM unit. Then, the LSTM gates interact with the inputs to generate an action based on a logic function. After passing by f t , a new cell state c t is built. x t and h t−1 move to the forget layer to quantify the importance of the information between 0 and 1. At this stage, the f t gate takes a decision whether if the information has to be stored, maintained, or removed. Then, the forget gate will update the cell state c t with the new important information based on the proportion of the information occupied by the actual and the previous cell state. The final hidden layer of the LSTM is computed to obtain the remaining state value [31].
To enhance the learning capabilities of the traditional LSTM model, the temporal structure considers two-way relationship of the input data. In other words, instead of using one direction of input processing through the LSTM gates, the Bi-directional LSTM (BiLSTM) model considers the next information when dealing with the current time series data as shown in Figure 1.
This bidirectional processing enhances the information-wise fashion manner with a door mechanism by capturing more structural information. The BiLSTM model encodes the information on back to front. This leads to acquiring additional context information that leads to better generalization ability. First, the LSTM cells are forwarded from the input sequence. Then, the reverse form of the input sequence is integrated into the LSTM network. The output of the forward layer (h f t ) and backward layer (h b t ) of the BiLSTM model are generated as [32]: where α and σ are the numerical factors respecting the equality α + σ = 1 [32]. The experimental results verify that the BiLSTM significantly improves the accuracy of the original LSTM using the spatial correlation mechanism [32].

Convolutional Neural Network (CNN)
The CNN model is a DL network that maps local features from the input at higher layers and combines them into more complex features at lower layers. The general architecture of this model consists of convolutional layers, activation function, pooling layers, fully connected layer, and dropout layer. The convolutional layers apply a set of filters to the input to generate a set of feature maps. The convolutional layers are advantageous for preventing the model from overfitting, accelerating the training process, decreasing the input volume. The max-pooling layers present the commonly used type of pooling layers. The max-pooling consists of selecting the maximum value of the field covered by the pooling kernels with its parametric equation given as [33]: where P l(i,j) and a l(i,j) denote the width of the pooling layer and the activation value of t neuron. The activation function permits changing the linearly of an inseparable problem into a separable one to enhance the adaptability of the model. The Rectified Linear Unit (ReLU) function is used as an activation function do to its inherent ability to derive values between 0 and 1. The equation of ReLU function is given by [33]: where a l(i,j) denotes the activation value of the layer output y l(i,j) . The fully connected layers combine the previously extracted information to generate the final prediction as shown in the following function [33]: where w l ij , a l(i) and b l j denote the weight of the ith neuron, the ith neuron of the l layer, and the bias values. n is the length of the input data. Finally, the dropout function is employed as a regularization technique to avoid overfitting at the end of iterations by dropping randomly selected neurons during the training process. Much of the research work has emphasized CNN due to its better ability to approximate complex functions.

Convolutional Bi-Directional Lstm Network (CBiLSTM)
The CBiLSTM model consists of initial convolution layers that receive feature embeddings as input. Its output is pooled to a smaller dimension, then fed into BiLSTM layers. The association of the CNN layers is beneficial for its strong capabilities in local feature extraction. The proposed mechanism consists of extracting the inherent characteristics of the load demand by CNN model and extracting the temporal dependencies by BiLSTM layer. The motivation behind the fusion between CNN and BiLSTM is explained by (1) the excellent feature-extracting ability of CNN model in capturing short-term trends in the time series data. (2) the BiLSTM network is efficient in saving the temporal order between the data in both directions and avoiding the gradient disappearance. Owing to the association of CNN with BiLSTM, the pattern recognition for time series data is tracked spatially and temporally. Figure 2 presents a flowchart of CBiLSTM model.

Savisky Golay Filter (SG)
The SG filter is a popular smoothing technique for removing the noise from signals of abruptly fluctuating and non-stationary data [34]. This technique lies in moving average the least square polynomial approximation with conserving the main characteristics of the signal(width, peak, and high). The core concept of SG consists of generating 2n + 1 equidistant points (centered at n = 0) to represent a polynomial of degree p. Thus, SG filter uses three elements, specifically, the input signal, the order of the polynomial, and the frame size. These elements require a specific tuning to achieve the desired results. The polynomial order value attenuates the width of the signals. SG filter computes the value of the least square polynomial at point i = 0, over the entire sample space. Let x(n) be a time series of n elements. We consider y(n) = x(n) + w(n) as the observed time series where w(n) denote the noise of y(n) signal. The output of SG filter denoted byx is equal [35]: where h(n) and M denote the filter impulse response and the SG filter parameter, respectively. The polynomial p(n) and the squared error between the smoothed and noisy signal E are given as follows [35]: where k denotes the polynomial order that ranges between 0 and 2 M inclusive. C(0) is the best fitting polynomial coefficient satisfying p(0) = c(0) as [36]: Analytically, the optimal coefficients c k could be calculated by interchanging the order of the summations. In order to find the optimal polynomial coefficients, we differentiate E as follows [36]: where i ∈ (0, 1, ..., N). Equation (15) could be reformulated as (15) passes by setting the derivatives equal to zero [36]. Typically, this digital filter uses the technique of linear least squares for data smoothing, which helps to obtain a high signal-to-noise ratio and retains the original shape of the signal. The SG filter has been employed in this paper for two major merits: (1) the SG filter retains the width and height of waveform peaks in noisy load which allows the system to achieve higher performance; (2) the SG method transforms the load demand without destroying its fundamental properties.

Partial Least Square Method (PLS)
The PLS is a widely used dimentionality reduction technique. Unlike PCA dimension reduction method, which maximizes the variance-based objective function, PLS maximizes the covariance-objective function [37]. The covariance-objective function maximization is conducted by the optimization of the square of the L 2 norm. Assuming we have two sets of dependent variables Y = {y 1 , y 2 , ..., y m } and X = {x 1 , x 2 , ..., x n } representing the input variables and the response variables, respectively. For the STLF application, we have s samples. So, X 0 and Y 0 could be written as: The PLS method consists of extracting two components (t 1 , u 1 ) from X and Y, respectively. The vectors t 1 and u 1 acquire the variation information from the initial dataset where the correlation is set to its maximum. Thus, the regression of y k , k ∈ N and t 1 is conducted. In case the regression result provides the best accuracy, the PLS stop improving. Otherwise, the PLS algorithm passes to the next pair of components (t r , u r ). with r presents the pair number until the PLS algorithm achieves a satisfactory result. Mathematically, the first component could be written as [38]: where w 1 is the eigenvector corresponding to the maximum eigenvalue of the matrix X T 0 Y 0 Y T 0 X 0 . The component score vectort 1 and the residual matrix X 1 are calculated as [38]: This process is repeated untill the construction of the r components for (X,Y). The optimization of the objective function is based on the covariance cov(Xw,Y) between the output vector Y and the input matrix Xw of a size (N × P). Therefore, the Kth PLS component is identified by searching the weight vector w k as [37]: where the equation Xw k = w k X Xw j = 0 is valid for all 1 ≤ j < k. The PLS method is advantageous for its ease of implementation and unique biased solutions.

Evolution Strategy for Hyperparameter Optimization
Hyperparameter Optimization (HO) is one of the most difficult problems for DL models due to two major points: large number of parameters involved and expansive computational training. These points dramatically complicate testing all the possible values of the search space. In this paper, ES meta-heuristic method is considered the key solution to tackle the HO problem with higher efficiency. The ES is a meta-heuristic search technique inspired by the natural biological evolution. It consists of mutations and combinations of the best individuals (σ) of a certain population of solutions (λ) [39]. According to fitness values, natural selection empowers the best candidate solutions for mutation and reproduction [40]. These solutions dictate the distribution of future generations. Reciprocally, the weak candidates are removed from the population. The mechanism of the ES is presented in Figure 3. Regarding the flowchart in Figure 3. the ES algorithm starts with a random population of σ parents. Successive recombination and mutation take place to compute the fitness values. An assessment of the fitness values to quantify the solution effectiveness. In case there is an area of improvement, the ES algorithm generates a λ children to improve the objective function. This process is done by the Gaussian distributed random modifications G(x P i , λ S i ) , where x P i and λ S i are the mean and derivation parameters, respectively. i(1 ≤ i ≤ n), P(1 ≤ i ≤ σ),and S(1 ≤ S ≤ µ) are the parameter index predecessor index and successor index, respectively. The derivation parameter is calculated as follows [40]: with τ 1 = 0.1 and τ 2 = 0.2 are the standard derivations [40]. The whole multiplicative σ S mutation operation determine vector of strategy parameters σ S = (σ S i , ..., σ S N ) [41]. Each vector component of σ S is mutated individually. The final σ S is mutatively scaled in order to learn axes-parallel mutation ellipsoids [41]. The evolutionary dynamics could lead to undesirable fluctuations which require the removal to keep the high optimization performance of ES method [41]. The individual parameters are calculated as [40]: From Equation (22), the individual parameters x i are modified in order to create a λ solutions. The mutation operation of ES is conducted through the Gaussian distributed random modifications between x P i and σ S i . The best individuals are selected to be used as parents for the next iterations. The iterative process stops when the new individuals have minimal objective function value from the whole population. In this paper, the Evolution strategy is used as a key solution for the global optimization of the complex CBiLSTM black-box function. After fixing the search space, ES uses the batch-sequential process to find the most suitable hyperparameters of a specific model [42]. The ES presents a robust algorithm with a highly parallelizable architecture. Hence, the ES could face the risk of being trapped in a local minimum in a large search space environment.

Proposed Architecture
The proposed SG-CBilSTM is a cupola of models defined in three major stages: data engineering and preprocessing, object determination, model construction and model evaluation as illustrated in Figure 4. According to Figure 4, the proposed methodology is conducted using four main stages: (1) Data processing and feature engineering, object determination, model construction and evaluation stages, for further explanation of the proposed methodology, a complete flow chart is illustrated in Figure 5.
According to Figure 5, the proposed methodology is presented as follows: The data is acquired from an open-sourced portal and cleaned from misleading values in the data preprocessing and feature engineering stage. Then, a data augmentation strategy is adopted to enhance the significance of the data and add value to the learning ability of the whole system. Further, data transformation is conducted to convert all the data to its numeric form using on-hot encoder function. Next, data smoothing-based SG filter and data normalization are conducted to remove the noise from the load demand and standardize the data under a defined interval between 0 and 1. At this stage, SG filter is able to reject the noise efficiently along with the least distortion from the original load data. All the feature inputs are given probability values to assess their relevance to the load demand determination. Above a specific threshold, the features are kept and the rest are neglected. Afterwards, the object determination stage split the data into training, validation, and testing intervals. The hyperparameter tuning is mandatory for CBiLSTM model to generate more accurate results. The final stage is the evaluation of the model. In this stage, the point forecast assessment is conducted using score metrics. Moreover, K-fold cross validation technique is used in this paper as a reliable means to evaluate the proposed model accuracy to an unknown testing data [43]. K-fold cross validation schemes is widely used in regression problems [44,45]. The methodology of K-fold cross validation is illustrated in Figure 6.  Regarding Figure 6, the procedure of K-fold cross validation is conducted as follows. Firstly, k equal sized sub-samples are partitioned from the original dataset. Ten-fold cross-validation (10-CV) is preferred to verify the model robustness with cross-validated data. The ten parts evaluate the model generalization to an independent set. For each 10-CV round, a random data separation into k folds {D 1 , D 2 , . . . , D k } is constructed. The proposed model is repeatedly trained by nine training set folds and tested against the remaining testing k − 1 set fold. The holdout method is repeated 10 times. Consequently, the bias and variance are remarkably reduced due to the use of the original data set in the fitting and validation procedure in the same time. For checking the model competitiveness, the proposed SG-CBilSTM is compared to a bench of DL models to validate its performance superiority. The comprehensive model design and the implementation details of each part are given in the following section.

Simulation Results
In this section, the validation of the proposed SG-CBiLSTM for STLF is demonstrated through extensive experimental studies with the an open source dataset. The evaluation procedure includes dataset description, model construction stages, score metrics, and benchmarks comparison analysis.

Data Pre-Processing and Feature Engineering
The experimental settings mainly uses the ISO New England incorporation (ISO-NE) dataset [46]. The ISO-NE is non-profit incorporation to manage day-to-day operations with more than 200 market participants in New England [46]. The publicly available data covers the hourly temperature and load data for the public and private electricity market of New England from 2003-2014 as plotted in Figure 7.  The total data samples collected consists of 103,753 rows. The used data is cleaned from erroneous and Not a Number (NAN) samples. The cleaning stage is mandatory to feed the model with complete data samples and avoid any outliers or misleading values. Nevertheless, the hourly temperature for the ISO-NE data is the only indicator of the load demand. This data is insufficient for accurate prediction and may lead to underfitting. Therefore, data augmentation is conducted to identify and generate artificial features using Microsoft Excel software as shown in Table 1.
According to Table 1, the total features are the DateTime components (hour, day, month, year, workday, week number, weekend), season of the year, and hourly lagged load for the last 23 h. The Boolean features are converted to numerical values using hot encoder function. Then, the load demand is smoothed via SG filter to improve the pattern recognition of the forecasting system. The cleaned data is forwarded to the feature engineering phase. Four case scenarios were investigated according to the value of the polynomial function specifically 3, 5, 7, and 10. The load demand and its smoothed derivations are illustrated in Figure 8.  According to Figure 8, the increase of the probability values (P-values) results in a more smoothed signal. The generated signals are assessed together in the same conditions with the aim select the optimum SG polynomial value based on the less error produced. Further, the data normalization is conducted using MinMaxScaler function form Scikit-Learn library [47]. The normalized data from 0 to 1 to keep the values within limits of the activation function and to reduce the influence of different units on the forecasting model.
The selection of the most informative features in the dataset is an essential step of the forecasting system performance. This technique is exploited the P-values for each feature to map the domain representation, and assess the weights for each feature input. Therefore, the irrelevant and redundant inputs are discarded from the feature vectors. These artificial features are associated in tandem with the temperature to make a prediction. The data augmentation technique is commonly used for low dimensional systems to track the seasonality and trends of the target variation and acquire more information about domain knowledge. The ranking of the feature is essential at this stage to make the system more efficient with less computational time due to the high number of features. The PLS technique is used to reduce the dimension of the data. PLS is a straightforward dimensionality reduction technique that maps the variables in a new feature space with lower dimensions. The Variable Importance of load Patterns (VIP) for 32 features is shown in Figure 9.   Regarding Figure 9, the most important features are hour, workday, temperature and lagged load (t − x) with x ∈ [1, 2, 3, 4, 5, 6,7,11,12,13,17,18,19,20,21,22,23]. Thus, the selected threshold is VIP = 0.5. The rest of the simulation results adopted a feature representation including a 20 feature inputs. The data is split into three sets: training, validation, and testing sets with a proportion of 75%, 15%, and 10%, respectively. More precisely, the training, validation, and testing instances comprise 77,814, 15,562, and 10,376 samples, respectively.

Score Metrics
The assessment of the model performance is conducted using Percentage Error Measure (PEM). The PEM method is a percentage of the error regardless of the units of the measurements (scale-independent). In order to standardize the SDEM measures, particularly, the Rooted Mean Squared Error (RMSE) and Mean Absolute Error (MAE) measures, the Min-Max normalization method was kept for the evaluation process. In other words, the data is normalized with a magnitude range of [0, 1]. The data normalization is calculated as follows: where x t denote the normalized value, x r is the real value. Here, x min and x max are the minimum and maximum values. The error measures in this work include the coefficient of determination (R 2 ), normalized RMSE (nRMSE), normalized MAE (nMAE), Mean Squared Logarithmic Error (MSLE) and MAPE. The selection of the aforementioned scores is based on their popularity for regression-based techniques evaluation. The parametric equations of the score metrics are given as follows [48]: whereŷ i and y i present the ith forecasted values and the actual values. Here N denotes the total number of samples.

Model Construction and Parameter Sittings
In this part, the assessment of the proposed SG-CBiLSTM model is reported. The simulation results were conducted using PYTHON programming and Keras library [49]. Keras is a high-level framework specialized in DL. All simulations were run on a Lenovo Intel ® i7 ® Nvidia Geforce GTX 1650@ 2.30 GHz. The characteristics of the experiments in this study are resumed in Table 2. The simulation analysis-based on trial and error method is firstly adopted to limit the search space as much as possible for the optimization process. This method is conducted by selecting arbitrary extreme values of BiLSTM and CNN layers and measuring the accuracy of the forecasting engine in terms of R 2 score. Next, a new random values are defined to further decrease the possibilities of hyperparmaters values. The final results of trial and error method shows that three convolutional layers with a kernel size equal to 1 × 1 associated with three successive BiLSTM layers achieve better results according the highest the R 2 criteria. For the rest of optimization work, ES is employed to tune the critical hyperparameters of the proposed computing framework, specifically, the batch size, the number of units, the number of training epochs [49,50]. To tune these hyperparameters, Hyperactive library is employed on open source [39]. It worth noting that the search space of the ES method only considers the key hyperparameters of the proposed C-BiLSTM to reduce the number of combinations and thus accelerate the computational time. In the model construction stage, the search space of the proposed CBiLSTM model is resumed in Table 3. The best hyperparameters are selected as follows. Firstly, the population of ES is randomly initialized. Each candidate solution represents a possible configuration of the optimized CBiLSTM model. From the data training, the CBiLSTM model is trained with a specific structure and parameter sittings included in Table 3. Finally, the testing set is used to evaluate each population performance based on MAPE. Regarding the ES performance in Table 3, the ES algorithm is characterised by a fast convergence speed with 8.77 min for every iteration. From Table 3, the input layer consists of a 3 successive one-dimensional Convolutional layer (Conv1D), with 32, 64 and 64 filters of the convolution, respectively. The Conv1D layers are associated with a ReLU activation function. The ReLu function is employed to solve the vanishing gradient with fast convergence speed. The convolution layers are followed by a maximum one-dimensional sampling (MaxPooling1D) layer, three BiLSTM hidden layers, a dropout layer, and a dense layer. The one-dimensional CNN layer is mainly adopted for the sequential data processing of the load demand. The sampling size of MaxPooling1D is 1. The number of features of the BiLSTM hidden layers is 150, 50, and 32, respectively. The proposed CBiLSTM takes advantage of the nonlinear fitting ability form BiLSTM and the feature extraction ability from CNN. The final structure of the SG-CBiLSTM model is resumed in Table 4. The general architecture of reference models is fixed through repeated training. While, the units' number and the batch size and the activation function is selected using the ES method. The settings of reference models are given in Table 5. Table 5. Hyperparameter settings of nine methods.

SG-CBiLSTM
The convolution layers are 3 with a number of filters of 32, 64, and 64, respectively; the activation function is ReLU; the BiLSTM layers are 3 with a number of neurons 150, 50, and 32, respectively; the batch size is set to 512; the dropout function between layers is fixed at 0.5.

BiGRU
The layers number is 4 with a neurons number of 128, 32, and 4 ended by a fully connected layer, respectively; the dropout function between layers is fixed at 0.4, The optimizer function is Adam; the activation function of the BiGRU layers is ReLU; the batch size is set to 512

BiLSTM
The layers number is 7 with a neurons number of 256, 128, 64, 32, 16, 4 finished by a fully connected layer, respectively; the dropout function between layers is fixed at 0.2, The optimizer is Adam; the activation function of the BiLSTM layers is ReLU; the batch size is set to 512.

CBiLSTM
The convolution layers are 3 with a number of filters of 32, 64, and 64; the activation function is ReLU; the BiLSTM layers are 3 with a number of neurons 150, 50, and 32, respectively; the batch size is set to 512; the dropout function between layers is fixed at 0.5.

BiLSTM-AE
The biLSTM layers are three with a number of neurons of 500, 400, and 300, respectively; the repeat vector layers are two between the located between the BiLSTM layers; the optimizer function is Adam.

ConvLSTM-AE
The convLSTM2D is the initial layer with a filters number of 64; the kernel size is with a dimention of 1×1; the activation function is ReLU; the two next layers are a flatten layers and repeat vector layers; the LSTM layers has a number of neurons equal to 200; the number of time distributed layers is equal to 2; the optimization function is Adam optimizer.

CNN
The Conv1D layer has a filter units of 64; The kernel size is 2×2; The activation function is ReLU; The max pooling layer has a pooling size of 2; The third layer is a flatten layer; The fully connected layers are two with a 50 units and 1 units; the optimizer function is Adam.

RNN
The RNN layers are two with a number of neurons of 500 and 100, respectively; the dropout function between layers is fixed at 0.2; the Adam optimizer function is used; the fully connected layer uses a tanh activation function.

ELM
The neuron number of the hidden layer is 560; the activation function is tanh; the alpha is set to 0.5.

Experimental Results
The SG-CBiLSTM evaluation is conducted in this section. The empirical results at this level are firstly conducted by presenting the performance of the proposed framework individually and with several case studies of different SG filter characteristics. Second, a comparative study is computed and presented where the simulation settings are unified for the sake of fair assessment. The conclusive remarks and interpretations are based on the actual results simulated in this work. During the validation experiments of this paper, the selection of the most appropriate Polynomial (P) degree is based on the hit-and-trial method of 4 p values. For the consistency and coherence of this section, the validation of our proposed SG-CBiLSTM model is firstly conducted compared to the original CBiLSTM using the proposed framework explained in the previous section. Secondly, the supremacy of the proposed method was verified compared to different benchmarks. In order to yield credible results, each case scenario is performed 10 times and the results are averaged. The simulation results as shown in Figure 10. According to Figure 10, the proposed SG-CBiLSTM is generating a good result and follows the actual load demand with high precision. Further, the SG filter demonstrated its ability to improve the performance of the forecasting system. As can be seen, the proposed model performed best for p = 10 where the gap between the prediction curve and the original curve is the smallest compared to the original CBiLSTM and the other models for different p values. However, the polynomial degree variates the errors from a point to another. It could be concluded that the higher the signal is smoothed the better the model will be performing. A quantitative comparison of the proposed SG-CBiLSTM model performance under different p values is given in Table 6. According to Table 6, the best performance has been attributed to 10 SG-CBiLSTM which achieves the highest results. The registered MSLE and MAPE for 10 SG-BiLSTM is equal to 0.93% and 56.66%.
Compared to 3SG-CbiLSTM which reaches to RMSE = 1.36% and MAPE = 59.85%. This leads to conclude that the SG filters are a key element for achieving higher performance. Furthermore, the selection of the most suitable parameters is mandatory to achieve the desired results. For the rest of the work, it will be considered the 10 SG-CBiLSTM as a reference for the proposed forecasting system. For notation simplicity and flow of reading, the 10 polynomial SG-CBiLSTM is referred to as the SG-CBiLSTM hereinafter. The loss function is the training and validation of the model simulation is an essential criterion to follow the learning accumulation in the model construction. The parametric equation of the loss function is calculated as follows [51]: where B, O are the output matrix size. Here, the proposed model is trained with a training set stating from 2 March 2003 with the first 75% portion of the whole set. In order to track the loss values when the proposed SG-BiLSTM fetches the next batch of data. it has been used the MAPE as a loss function with an Adam optimizer [52]. The training process was fixed at 100 epochs with an early stopping function. This function is adopted to automatically track the best-trained model without searching for the most suitable epoch number. Figure 11 displays the variation of the training loss of the first 40 epochs for the training and testing sets. As seen in Figure 11, the training loss value starts at 6.93% and continuously decreases until the 10th epoch starts to stabilize. At this epoch, the loss value reaches to 2.08%. At this level, the training loss value decreases slowly to achieve 1.59% at the 40th epoch. On the other side, the validation loss started at 4.415% to reach 2.407% in the 10th epoch where it decreases slowly aside with the training loss. This indicates the training and validation loss reaches their equilibrium at the same iteration. Moreover, it is verified that the convergence speed is quite fast. The training and validation loss are following the same behavior. The simulation results of the proposed method in tandem with its single components (CNN, BiLSTM, and CBiLSTM) with 10-CV have been computed and shown in Table 7. According to Table 7, the 10-CV results clearly demonstrate that our proposed framework significantly outperform single models in terms of the error measures. The SG-CBiLSTM achieves a high mean testing result of R 2 = 99.18% compared to 86.96%, 89.43%, and 95.41% for CBiLSTM, CNN, BiLSTM, respectively. The overall increase of the performance for the DL models reported in Table 7 is explained by adopting the data augmentation strategy for the original dataset. Further, employing the S-G filter with the 10th polynomial fitting empowers the CBiLSTM the accuracy with a 12.22% in terms of R 2 . An illustration of the 10-CV results in terms of R 2 is shown in Figure 12 to highlight the performance improvement of the SG-BiLSTM model. According to Figure 12, the SG-BiLSTM model clearly outperforms the single models in terms of R 2 . The SG smoothing technique helps to better see patterns and detects trends. According to the CV variation results, the proposed model is characterized by a high location independency. The achieved results indicate that the SG-BiLSTM could be applied for STLF at different weather and load conditions. The experimental results of the SG-BiLSTM compared to the original load values are shown in Figure 13. According to Figure 13, the proposed system achieves high accuracy and follows the real curve with great precision. The proposed model is compared to nine other DL models trained at the same conditions for the integrity of the evaluation. The reference models assessed here include Extreme Learning Machines (ELM), BiLSTM, BiGRU, ConvLSTM, CNN, BiLSTM-Autoencoders (BiLSTM-AE), RNN, CNN, CBiLSTM to verify its competitiveness with the existing benchmarks. It should be noted that the used dataset and extracted features from the proposed model construction process are conserved for the comparison of all the models and the rest of the validation analysis. The comparative simulation results is shown in Figure 14. Regarding Figure 14, the experimental results clearly show that the proposed SG-CBiLSTM outperforms the rest of the models for hourly STLF due to the low gap between the actual and generated forecasts. In order to quantify the quality of the forecasts, Table 8 presents the score errors of the proposed SG-CBiLSTM and reference models. According to Table 8, the SG-CBiLSTM hybrid model is performing best comparing to the rest of his counterparts on the scale of the nRMSE, nMAE, MAPE, and R 2 performance measures. The R 2 reaches 99.22% due to SG filtering. Furthermore, the model performance was slightly close to BiLSTM-AE model which achieves R 2 = 98.56%. Due to the complex nonlinear relationship between the load demand and its drivers, the ELM model achieves the lowest forecasting performance accuracy with an R 2 = 83.98%. Moreover, it is evident that the proposed model is preferable compared to numerous DL baseline and top-of-the-line models. The proposed SG-CBiLSTM is found more efficient to enhance the profitability form the electrical load operations.

Multi-Step Validation
In this study, the model performance is assessed for multi-step forecasting using multiple case scenarios. The multi-step validation is crucial to evaluate the model robustness under an extended time horizon. The investigated forecasting horizons include one hour, twelve hours, twenty four hours, and thirty-six hours ahead. In the simulations results, we conserve the same repartition of the data set with 75%, 10%, and 24% for the training, validation, and testing sets of the whole data. For multi-step-ahead forecasting, the structure of the proposed SG-CBiLSTM model conserves the shape of CNN layers compared to one-step forecasting. For the on-step-ahead, the n forecasted outputsx i+1 = f (x i , , x i−1 , . . . , x i−n ) takes into consideration the real values of the load data. However, in multi-step-ahead forecasting, the previously generated forecasts are fed as inputs asx i+2 = f (x i+1 , x i , x i−1 , . . . , x i−n ). The experimental results of multi-step forecasting for ISO-NE data were repeated ten times and the results are averaged in Table 9. Furthermore, the error between the predicted and actual load demand according to the forecasting horizon are shown in Figure 15.  According to Table 9, it can be deduced that the most suitable scale is the hourly LF. Considering the scores indicators, the achieved results for one hour ahead reports an R 2 = 99.11% and nRMSE = 0.54% compared to R 2 = 87.22% and nRMSE = 3.59% for 36 h ahead. Thus, the lowest error measure is produced by single-step forecasting and the highest is generated by thirty-six step forecasting ahead as shown in Figure 15. In order to increase the model effectiveness, a state-of-the-art optimization approaches such as Coyote Optimization Algorithm (COA) and Covariance Matrix Adaptation Evolution Strategy (CMA-ES) could be employed as alternatives to ES for hyper parameter optimization procedure [53,54]. CMA-ES is an enhanced version of ES where the search individuals are generated according to multivariate normal random distribution. Consequently, the corresponding covariance matrix is updated and assessed in each iterations [53]. Similarly to ES, COA is a population-based algorithm inspired by the coyotes adaptation the environmental conditions [54]. In our work, the lowest performance is attributed to a sequence of thirty-six-steps ahead with a score R 2 = 87.22%. This demonstrates the relative weak stability of the SG-CBiLSTM for longer forecasting horizon which needs further investigations.

Conclusions
The major contribution of this paper is to propose an end-to-end forecasting framework called SG-CBiLSTM model for hourly STLF. The proposed system implements a smoothing technique called Savistky Golay (SG) filter to enhance the extrapolation capability for convolutional neural network-bidirectional long short-term memory. The experimental results strongly reveal that the SG-CBiLSTM model is advantageous for coping with the sudden variation of the load demand and improving the overall accuracy to reach a mean R = 99.18%. The simulation results have demonstrated that the proposed methodology has the following characteristics: (1) The ten polynomial levels provide the most suitable form after the variation of the levels of the SG filter; (2) the proposed SG-CBiLSTM model acquires a higher extrapolation performance to achieve better results than the single CBiLSTM-AE; (3) the proposed model outperforms nine deep learning models used as benchmarks. The SG-CBiLSTM-AE is perfectly tailored to accurately predict the hourly load data. However, its advantages actually hide some limitations related to the poor performance under an extended forecasting horizon. Future work will broaden the scope to include an additional evaluation of the proposed SG-CBiLSTM model for other power systems and a variety of forecasting applications such as the electricity price and renewable energy prediction.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: