Advanced ML-Based Ensemble and Deep Learning Models for Short-Term Load Forecasting: Comparative Analysis Using Feature Engineering

Abstract: Short-term load forecasting (STLF) plays a pivotal role in the electricity industry because it helps reduce generation and operating costs by balancing supply and demand. Recently, the main challenge in STLF has been the load variation that occurs in each period, day, and season. This work proposes a bagging ensemble combining two machine learning (ML) models: linear regression (LR) and support vector regression (SVR). For comparative analysis, the performance of the proposed model is evaluated and compared with three advanced deep learning (DL) models, namely, the deep neural network (DNN), long short-term memory (LSTM), and hybrid convolutional neural network (CNN)+LSTM models. These models are trained and tested on data collected from the Electricity Generating Authority of Thailand (EGAT) with four different sets of input features. The forecasting performance is measured using the mean absolute percentage error (MAPE), mean absolute error (MAE), and mean squared error (MSE). Across the different input feature sets, experimental results show that the proposed ensemble provides better accuracy than the other models, suggesting that our approach could improve accuracy on different data in different forecasting fields.


Introduction
The prime role of power producers is to maintain an equilibrium between energy supply and load consumption, which makes load forecasting a crucial factor for the electric power industry. Load forecasting is considered over three different horizons, short-term, medium-term, and long-term, which are used for forecasting a day, a month, or a year ahead. Short-term load forecasting (STLF) focuses on the load of each hour or 30-min period of the day. Electricity availability is improved by making use of appropriate forecasting techniques, resulting in a reduction in both generating and operating costs for the electricity industry. Additionally, it decreases costs associated with the performance of short-term scheduling functions and security assessment of the power system [1].
Traditional statistical models such as regression analysis [2], moving average [3], exponential smoothing [4], and stochastic time series models [5] are applied for time series forecasting. Moreover, artificial intelligence models including support vector machines [6], artificial neural networks (ANN) [7], and fuzzy time series [8] are widely used in many forecasting applications. Even though the neural network is highly significant among artificial models for nonlinear time series problems [9], its efficiency is questionable because of the backpropagation method used during the training process with multiple hidden layers [10], which lengthens processing time and slows convergence. Furthermore, a simple neural network cannot memorize continuous data, as not all inputs and outputs are related to each other. Therefore, long short-term memory (LSTM), convolutional neural network (CNN), and deep neural network (DNN) models have been proposed to address these limitations.

Prior Works
In recent times, ML models have been widely used in the field of forecasting to solve nonlinear, complex problems that could not be solved by traditional time series models. Among them, the LR model is one of the simplest and most popular supervised learning algorithms for regression tasks. This model depends on independent variables for the target predictions and captures the relationship between variables for forecasting [2]. The SVR algorithm is another useful algorithm for supervised regression tasks. Its reasonable forecasting results are achieved at the cost of much higher computation time, caused by training with gradient descent to update the parameters and reduce the loss function [11]. Moreover, most ML models cannot handle large data volumes in the training network.
To handle the weaknesses of ML models, DL models were introduced at the beginning of the 21st century and have successfully been applied since then. They have demonstrated strengths in handling complex nonlinear relationships, model complexity, and computational efficiency [10]. One of the popular DL models is the deep neural network (DNN), which consists of a large number of processing layers. It is a class of neural network models comprising an input layer, an output layer, and a large number of hidden layers. Unlike the simple ANN model, the DNN model can handle more than two hidden layers and has a better backpropagation algorithm for the backward training pass. This algorithm uses stochastic gradient descent rather than simple gradient descent to overcome slow convergence and avoid local minima [12]. Nevertheless, the weakness of the DNN is that it cannot memorize sequential data, and it has no pre-training process.
Typically, a deep belief network (DBN) model is based on an ANN with multiple hidden layers, which can reduce the approximation error by adding more hidden layers between the input and output layers. This architecture is chosen based on its performance in related studies [12]. Deep architectures are required to detect higher-level representations and capture prominent abstractions in the network. The DBN has become an efficient model in the fields of regression, image classification, automatic speech recognition, face recognition, natural language processing, bioinformatics, etc. [13]. In [14], a DBN application was implemented and developed for modeling generator bearing temperature and produced more accurate predictions of generator bearing failures for wind turbines than SVR, ANN, and the extreme learning machine (ELM). Moreover, deep architectures include LSTM [15], the recursive neural network (RNN), CNN [16], and so on. Each model uses computational methods, trained with multiple hidden layers, to learn representations of data with numerous abstraction levels. These DL models can detect complex structures in large data sets by using the backpropagation process, thereby overcoming the drawbacks of ML techniques.
Various researchers have investigated different approaches, such as combining two or more forecasting models, to improve performance efficiency. Recently, ensemble methods have become popular for converting weak ML learners into strong learners to overcome the weaknesses of ML algorithms. In the cited work [17], five ML models were combined using the ensemble method to lessen forecast errors. In addition to the selected models, some studies based on the feedforward multilayer perceptron using supervised learning algorithms have been conducted [18]. The DBN was successfully applied to forecast load demand with hourly electricity consumption data in Macedonia [19]. Moreover, the combination of classification and regression tree (CART) and DBN models was proposed to improve forecasting accuracy by classifying load data [20]. Quan et al. applied the DBN model to one artificial dataset and three regression datasets to execute time series and regression predictions [21], while El-Sharkh presented a parallel-structure ANN combining multilayer perceptron, radial basis, and RNN networks, whose results outperform general time series models [22]. Rashid et al. proposed an RNN with an internal feedback structure for electricity load prediction with reliable and robust results [23]. A nonlinear auto-regressive RNN model produced smooth forecasted results for hourly predictions of high-resolution wave power [24]. In [25], Kelo and Dudul used a novel hybrid technique that concatenates a wavelet and an Elman network to increase one-day-ahead prediction accuracy in all seasons. Cheng et al. in [26] used LSTM for power demand forecasting and, performance-wise, their proposed model proved better than the gradient boosting tree (GBT) and SVR. Further, Bouktif et al. used a deep learning LSTM model for electric load forecasting with feature selection and a genetic algorithm, capturing the characteristics of complex time series [27]. Additionally, Syed et al.
[28] proposed a hybrid model based on stacking RNN fully connected layers and unidirectional LSTM on bi-directional LSTM to improve energy consumption forecasting accuracy. Experimental results were reported and compared with other hybrid models such as the convolutional (Conv) neural network-LSTM, ConvLSTM, and LSTM encoder-decoder models.
In the article by Ullah et al. [29], an ensemble stacked generalization (ESG) method was presented for better prediction of the energy consumption of electric vehicles (EVs). The ESG meta-regression model was a weighted combination of decision tree (DT), random forest (RF), and K-nearest neighbor (KNN) models to improve prediction and reduce model variability compared to a single regression model. Their forecast results showed that the ESG outperformed other models, providing a more stable and acceptable standard for the proposed diagnostic parameters in energy consumption forecasting. The study by Khan et al. [30] provided a statistical model that predicts the short-term energy costs of a multifamily residential building. Their approach developed a common forecast model for predicting the short-term energy demand of residential buildings in South Korea using long short-term memory (LSTM) and Kalman filters (KF). Experimental results were compared and analyzed against traditional ML models for efficient energy planning and management.
Considering the operating conditions of buildings at different times, the study by Dong et al. [31] demonstrated a predictive approach to building energy consumption based on a combined analysis and classification of energy consumption patterns. In their system, the DT classified the energy consumption patterns and the energy consumption statistics of the respective sectors. Then, a cluster analysis method was used to determine the energy cost forecast model for each pattern. The proposed method was finally evaluated against SVR and ANN without pattern classification, and showed reliability and effectiveness. In the study by Ngo et al. [32], an ensemble ML model was proposed as an integrated approach to predicting energy consumption in non-residential buildings. Their proposed ML model used artificial neural networks, support vector regression, and the M5Rules model as base models. The analytical results confirmed the effectiveness of the ensemble model against the base models in predicting the next 24 h of energy consumption in buildings. The article by Xuan et al. [33] designed an innovative multi-load forecasting model for power systems based on in-depth analysis and multi-task ensemble methods. It comprised the following four aspects: a hybrid network based on a combined CNN and gated recurrent unit (GRU) to absorb dimensional abstract features, three GRU networks with different structures designed to meet prediction requirements, enhanced multi-task learning with homoscedastic uncertainty (HUMTL) for better predictions under different load variations, and an ensemble approach based on the gradient boosting regressor tree (GBRT) for the final prediction results of various energy features learned to different degrees.
In addition, ensemble learning has been commonly used in wind energy forecasting. Due to the uncertainty and fluctuations in wind speed, it is difficult to estimate wind power with high accuracy. The research by [34] investigated an innovative method that combines complete ensemble empirical mode decomposition (CEEMD) and stacking ensemble learning (STACK) based on five ML algorithms to predict the wind power of wind farm turbines; their model proved efficient and accurate for wind energy prediction. Li et al. [35] also proposed a hybrid system, including bilinear transformation, effective data decomposition techniques, LSTM-RNN, and error decomposition correction methods, for wind direction prediction. The stability and performance of the proposed system were verified using data collected from different wind farms. The computational results indicated that the proposed hybrid method worked better than other individual techniques on the short-term horizon. Consequently, considering the prior works mentioned above, this research also aims to improve the accuracy of STLF by using the ensemble method. The main contributions of this work are summarized as follows:

1.
A bagging ensemble consisting of LR and SVR is first proposed to improve forecasting accuracy by converting ML models from weak learners into strong learners.

2.
Advanced DL models, including DNN, LSTM, and CNN-LSTM, are implemented with tuned hyperparameters for STLF to handle backpropagation learning and time series problems.

3.
A detailed comparative analysis of the proposed model and the other DL models is provided. The comparison considers the mean absolute percentage error (MAPE), the mean absolute error (MAE), and the mean squared error (MSE) as the main performance metrics, computed on the provided dataset for every month.

4.
The data used in this work are obtained from EGAT and are first smoothed using a filtering technique. The filtering process is performed to handle missing values and outliers.

5.
Different input features are applied to all models and compared to check the correlation between load and external influential factors, because external factors such as temperature, holidays, and month of the year commonly affect load demand.

Paper Organization
The rest of the paper is structured as follows. Section 2 gives a brief explanation of the proposed model, including the framework of the integrated forecasting system. Section 3 presents and discusses the results for the proposed model and compares it with the three DL baseline models. Finally, the last section concludes the work.

Methodology
This section mainly highlights the framework of the integrated system with its key parameters. Readers are encouraged to review the DNN, LSTM, CNN-LSTM, and LR models in [11,12,36], as the theory of these models is adopted from the cited works. Figure 1 demonstrates the framework of the proposed models used for forecasting. The integrated system consists of three main parts: (1) the data pre-processing module, (2) the training module, and (3) the forecasting module. A detailed explanation of each part is given in the subsequent subsections.

Figure 1. Overview of the integrated system for STLF, including 1. data pre-processing module, 2. training module, and 3. forecasting module.

Data Pre-Processing Module
The Electricity Generating Authority of Thailand (EGAT) collects load data from five regions of Thailand: the Central area, Bangkok, the South, the North, and the North-East. The collected load data have been recorded every 30 min from 2009 to 2021. In this paper, the net peak load for the whole country from 2019 to 2021 is used to test our prediction models. As indicated in Figures 2 and 3, the peak load varies from time to time and day to day, so our load data form a time series with seasonal variation repeating regularly over time. A better insight into the seasonal component of the time series load data can also improve the performance of ML modeling. Consequently, the historical data from the previous week, previous day, and previous time step are chosen to train the forecasting models. Moreover, temperature data are used as an input because temperature is one of the external factors affecting load associated with the meteorological situation in Thailand. The data pre-processing module is further divided into three processes: data cleaning, data segmentation, and data arrangement.

Data Cleaning
The historical data need to be smoothed because there are many missing values and outliers in the original raw data. If these outliers are included in the training data, the accuracy of the load predictions would suffer. An additional issue concerns normalizing the data: a model cannot predict the load correctly if it is higher than that seen in previous data. To filter and smooth the raw data, a local regression filtering technique is used. It can be classified into four types: (1) locally weighted scatterplot smoothing (lowess) regression, which uses linear regression analysis; (2) locally estimated scatterplot smoothing (loess) regression, which uses polynomial least-squares regression analysis; (3) robust lowess local regression; and (4) robust loess local regression, where the robust variants can be used to get rid of outlier values. If outliers are present in the dataset, load demand fluctuations can reduce forecasting accuracy. Therefore, a robust lowess/loess (rlowess, rloess) procedure is used to overcome the problem of distorted values in the electricity data [37,38]. The rloess local regression filtering technique is applied for cleaning the data in this experimentation.
The filtering technique fits a local regression function to the data within a chosen neighborhood of data points. The neighborhood is specified by the percentage of data points used, known as the smoothing parameter (0 < smooth ≤ 1); the larger the smoothing parameter, the smoother the fitted function. For calculating smoothed values, the filtering technique assigns a weight to every data point in the selected window by using the tricube regression weight function:

w_i = (1 − |(x − x_i)/d(x)|³)³ (1)

Once the regression function values are calculated with flexible weights and polynomial degrees, the rloess fit is complete.
In Equation (1), x is the predictor value associated with the response value to be smoothed, w_i is the regression weight of point i, x_i is one of the nearest neighbors of x as defined by the selected window, and d(x) is the distance along the abscissa from x to the most distant predictor value within the selected window. The data at each period are separated into seven groups based on the day of the week. Therefore, the data are cleaned 48 × 7 times, once for each period of each day, using the rloess function. For example, the load for Mondays at 11 a.m. from 2019 to 2021 is smoothed using the filtering technique, as shown in Figure 4. Once the data are filtered, the MinMax Scaler function, as used in the models, is applied for normalizing.
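The local weighting and fitting steps above can be illustrated with a short numpy sketch. This is a minimal, degree-1 (lowess-style) illustration assuming the standard tricube weight of Equation (1); it is not the exact rloess routine used in this work, and the function names are ours.

```python
import numpy as np

def tricube_weights(x, x_neighbors):
    """Tricube regression weights as in Equation (1): points nearer
    to x receive more influence on the local fit."""
    d = np.max(np.abs(x_neighbors - x))  # distance to farthest neighbor
    return (1 - np.abs((x_neighbors - x) / d) ** 3) ** 3

def lowess_point(x, xs, ys, smooth=0.5):
    """Smoothed value at x via locally weighted linear regression
    over the fraction `smooth` of nearest data points."""
    k = max(2, int(smooth * len(xs)))
    idx = np.argsort(np.abs(xs - x))[:k]  # nearest-neighbor window
    xw, yw = xs[idx], ys[idx]
    w = tricube_weights(x, xw)
    # polyfit minimizes sum((w_i * residual_i)^2), so pass sqrt weights
    coeffs = np.polyfit(xw, yw, deg=1, w=np.sqrt(w))
    return np.polyval(coeffs, x)
```

Applying `lowess_point` over every predictor value yields the smoothed series; a robust (rloess-style) variant would additionally reweight by residual size to downweight outliers.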

Data Segmentation
In all proposed forecasting models, the dataset is divided into training and testing datasets. Each dataset is arranged into seven segments based on the day of the week. For instance, a training dataset consisting of Monday's loads can only forecast the load for a Monday. There are 104 data points in the training dataset and 53 test points in 2021. The proposed models use the whole 104-point training dataset to test the first Monday of 2021. Then, the window slides to the next 104 training points to test the next day, and the same procedure is performed until the end of the testing dataset. This is called "walk-forward" validation, as shown in Figure 5. The training dataset from 2019-2020 is applied to train both the proposed ensemble and the DL models. The empirical results are compared between the bagging ensemble and the three DL models.
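The walk-forward procedure above can be sketched as a simple index generator; the sizes (104 training points, 53 test points) come from the paper, while the helper name is ours.

```python
def walk_forward_splits(n_train, n_test):
    """Yield sliding (train_indices, test_index) pairs: each test point
    is forecast from the n_train observations immediately before it."""
    for t in range(n_train, n_train + n_test):
        yield list(range(t - n_train, t)), t
```

For the paper's setting, the first split trains on points 0-103 and tests point 104; the window then slides forward by one for each of the 53 test Mondays.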

Selection of Input Features
For input selection and better data understanding, Spearman's correlation coefficient (γ) was determined to capture nonlinear monotonic correlation between two variables. Figure 6 represents the correlation among all input variables as a value between −1 and +1. A negative γ refers to negative correlation, while a positive one indicates positive correlation; there is no correlation between the two variables if the coefficient is zero. Concerning the correlation illustration, the target peak load (L(t, d)) positively correlates with SI, and vice versa. It is slightly negatively correlated with MoY, DoW, and H, whereas its correlation with BH is close to zero. According to the correlation coefficients, we consider historical peak load and temperature data as main inputs, while other seasonal variables are used as dummy variables. Spearman's rank correlation coefficient (γ) is expressed as

γ = 1 − (6 Σ D_i²) / (N(N² − 1)), (2)

where D_i is the difference between the two ranks of each observation and N is the number of observations. Both the bagging ensemble model and all DL models are tested with four different input feature sets, comprising five, six, nine, and ten input features. Temperature is not included in the five- and nine-feature sets, whereas temperature, associated with the meteorological situation, is included in the six- and ten-feature sets. Other features depend on calendar effects, such as day of the week (DoW), month of the year (MoY), holiday (H), and bridging holiday (BH). The forecasting input feature sets for five, six, nine, and ten input features are represented by Equations (3)-(6), respectively, and summarized in Table 1. The dataset from 2019 to 2020 is used for training, and the dataset from January 2021 to December 2021 is applied for testing.
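Equation (2) can be checked with a short numpy sketch. This assumes tie-free ranks (the rank-difference formula is exact only without ties); the function name is ours.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman's rank correlation per Equation (2), assuming no ties:
    rho = 1 - 6 * sum(D_i^2) / (N * (N^2 - 1))."""
    x, y = np.asarray(x), np.asarray(y)
    rx = np.argsort(np.argsort(x))  # ranks 0..N-1 (offset cancels in D_i)
    ry = np.argsort(np.argsort(y))
    d = rx - ry
    n = len(x)
    return 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))
```

A perfectly monotone-increasing pair gives γ = +1 and a monotone-decreasing pair gives γ = −1, matching the interpretation in Figure 6.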

Table 1. Training datasets used by the forecasting models.

Training Module
After arranging the training and test sets according to the different input features, the training set is fed into the respective model for training.

Bagging Ensemble Training Process
The bagging ensemble combines the LR and SVR models. For the LR training process, M5 prime is used as the expert parameter, indicating the feature selection method to be applied during regression. Additionally, the minimum tolerance is set to 0.05 to eliminate collinear features during training of the algorithm. For the SVR training process, the model is updated using stochastic gradient descent (SGD), which updates the parameters at every step and converges to a global minimum much faster than ordinary gradient descent. Moreover, the squared error loss function, a learning rate (alpha) of 0.0001, and one thousand iterations are also set for the SVR model.
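A bagging ensemble of this kind can be sketched with scikit-learn: each bootstrap resample trains an LR learner and an SGD-trained linear regressor (default squared-error loss, alpha = 1e-4, 1000 iterations, as in the paper), and predictions are averaged. This is an illustrative sketch, not the paper's exact toolchain (the M5-prime expert parameter belongs to that toolchain and is not reproduced); the bag count and function name are our assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_bagging_lr_svr(X, y, n_bags=10, seed=0):
    """Fit LR and SGD base learners on bootstrap resamples; the
    ensemble forecast is the average over all fitted learners."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_bags):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample
        models.append(LinearRegression().fit(X[idx], y[idx]))
        sgd = make_pipeline(StandardScaler(),
                            SGDRegressor(alpha=1e-4, max_iter=1000,
                                         random_state=0))
        models.append(sgd.fit(X[idx], y[idx]))
    return lambda Xq: np.mean([m.predict(Xq) for m in models], axis=0)
```

Averaging over bootstrap-trained learners is what turns the two weak base regressors into the stronger combined predictor described above.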

LSTM Training Process
During training of the LSTM model, the training and test sets are normalized to rescale the original feature values between 0 and 1 using the MinMax Scaler function and then fed into the model. The normalization method is indicated in Equation (7):

new_x_i = (x_i − min(x)) / (max(x) − min(x)), (7)

where x_i, min(x), max(x), and new_x_i represent the original value of the input feature, the minimum value of the input feature, the maximum value of the input feature, and the new rescaled value of x_i, respectively.
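The min-max rescaling of Equation (7) amounts to a one-line transform; a minimal numpy sketch (helper name ours):

```python
import numpy as np

def minmax_scale(x):
    """Rescale feature values to [0, 1] per Equation (7):
    new_x_i = (x_i - min(x)) / (max(x) - min(x))."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())
```

In practice the minimum and maximum are taken from the training set only and reused to transform the test set, so that no test information leaks into training.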
The following parameters are selected for the LSTM training process: 100 epochs for the number of iterations and mean squared error as the loss function. The Adam optimizer is employed instead of the classical stochastic gradient descent procedure to update network weights iteratively based on the training data. In addition, a batch size of 256 is used before updating the internal model parameters. The rectified linear unit (ReLU) is chosen as the activation function to address vanishing-gradient problems. This activation function can learn much faster than the sigmoid function in networks with many layers, allowing the training of deeply supervised networks without unsupervised pretraining.
The sequence length of the LSTM network is one of the parameters that must be considered when training on sequential data. EGAT data are recorded as a time series every 30 min, so each day has 48 periods in total. Consequently, 48 lag features of the load data are used to forecast the time steps for the next day, as shown in Figure 7.
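Turning the half-hourly series into 48-lag supervised samples can be sketched as follows; this is an illustrative helper (name ours), not the paper's exact data pipeline.

```python
import numpy as np

def make_lag_matrix(series, n_lags=48):
    """Build supervised samples: each row holds the previous n_lags
    half-hourly loads; the target is the observation that follows."""
    series = np.asarray(series, dtype=float)
    X = np.stack([series[i:i + n_lags]
                  for i in range(len(series) - n_lags)])
    y = series[n_lags:]
    return X, y
```

For the LSTM, each row would additionally be reshaped to (n_lags, 1) so the network sees one feature per time step.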

CNN-LSTM Training Process
Similar to the LSTM, the hybrid CNN-LSTM also uses 100 epochs, a batch size of 256, the Adam optimizer, and the ReLU activation function for fitting the model. However, this hybrid model combines two different DL models, CNN and LSTM, so the data must be reshaped into a subsequence format. First, a one-dimensional CNN is built with 64 filters, a kernel size of one, and the ReLU function, followed by a maximum pooling layer with a pool size of two and a flattening layer. The next layer is an LSTM with 50 units and the ReLU function to execute the final predictions.

DNN Training Process
The DNN model is also well known as the multilayer perceptron (MLP). During the DNN training process, 100 hidden layers and 100 hidden nodes are used in the network. Other parameters, including 100 epochs, mean squared error as the loss function, the Adam optimizer, and ReLU as the activation function, are also selected to train the DNN model.

Forecasting Module
The test data for 2021 are predicted using the corresponding trained model. Forecasting performance on the test data is measured using accuracy metrics. In this study, the mean absolute percentage error (MAPE), the mean absolute error (MAE), and the mean squared error (MSE) in Equations (8)-(10) are regarded as the accuracy measurements, which represent how far the forecast values deviate from the actual demand values:

MAPE = (100/N) Σ_{t=1}^{N} |L_t(d) − F_t(d)| / L_t(d) (8)

MAE = (1/N) Σ_{t=1}^{N} |L_t(d) − F_t(d)| (9)

MSE = (1/N) Σ_{t=1}^{N} (L_t(d) − F_t(d))² (10)

where L_t(d) denotes the actual load on day t, F_t(d) is the forecasted load on day t, and N is the number of test observations.
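The three error measurements can be computed in a few lines of numpy; a minimal sketch (helper name ours) using the standard definitions of MAPE, MAE, and MSE:

```python
import numpy as np

def forecast_errors(actual, forecast):
    """Return (MAPE in %, MAE, MSE) for paired actual/forecast loads,
    following the standard definitions in Equations (8)-(10)."""
    a = np.asarray(actual, dtype=float)
    f = np.asarray(forecast, dtype=float)
    mape = 100.0 * np.mean(np.abs(a - f) / a)  # percentage error
    mae = np.mean(np.abs(a - f))               # same units as the load (MW)
    mse = np.mean((a - f) ** 2)                # squared units (e.g., GW^2)
    return mape, mae, mse
```

Note that MAPE is undefined when an actual load is zero, which is not a concern for national peak-load data.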

Results and Discussions
The proposed models are evaluated on a MacBook with an Intel Core i5 1.8 GHz processor, 8 GB 1600 MHz DDR3 RAM, and Intel HD Graphics 6000 with 1536 MB, running Anaconda Navigator and Spyder V3.28 with the Python language, with the TensorFlow and Keras libraries installed. For each data segment, the computation time is set to 10 min. Table 2 presents the error measurements (MAPE, MAE, and MSE) for the bagging ensemble and the DL baseline models. These metrics are computed and compared monthly using the five-input feature set. In general, the proposed model yields lower errors than the other training models. According to the results in Table 2, the proposed bagging model attains a MAPE of 6.05%, whereas the MAPEs for LSTM, CNN+LSTM, and DNN are 6.74%, 6.85%, and 6.79%, respectively. This shows that the proposed model outperforms all other training models on the MAPE measurement. A similar trend can also be observed for the other metrics. Considering the MAE, the bagging ensemble achieves 1258.98 MW, followed by LSTM with 1413.74 MW and DNN with 1424.50 MW; the weakest performance is shown by the CNN+LSTM model at 1439.48 MW. Similar behavior is observed for the MSE, where bagging realizes 3099.11 GW². After the proposed model, LSTM performs best with 3712.15 GW², while the MSEs of the DNN and CNN+LSTM models are 3783.01 GW² and 3772.21 GW², respectively. Although the proposed ensemble gives better performance, January, April, May, and December still incur higher errors in all models because these months contain long holidays and tourism seasons. Another factor is the peak temperatures in April and May, which drive electricity consumption higher than usual.
In contrast, the proposed model records its lowest errors in September, with 2.36% MAPE, 512.57 MW MAE, and 421.70 GW² MSE. The DNN model performs next best, followed by the LSTM and CNN+LSTM models. Table 3 indicates the error measurements for all models trained with six input features; similar to Table 2, the best forecasting accuracy is achieved by the proposed model. As illustrated in Table 5, ten inputs, including additional temperature and dummy variables, are used to train all models and measure accuracy. In this case, our proposed model still provides the minimum error, with 6.00% MAPE, 1264.31 MW MAE, and 2840.87 GW² MSE. The LSTM and DNN come next with almost identical MAPEs of around 6.75%, followed by the CNN+LSTM model with 6.87% MAPE. According to all four result tables (Tables 2-5), our proposed model shows little sensitivity to the selection of input features, as the resulting errors are similar; only the additional temperature input causes a slightly higher error in the proposed model. In the comparative analysis among input features, each forecasting model obtains its minimum error with a different feature set: the LSTM, CNN-LSTM, and DNN models provide the lowest error with five, six, and ten inputs, respectively. All in all, our proposed model delivers the best performance in all experiments.

Five categories, namely holidays, bridging holidays, Mondays, weekdays, and weekends, are grouped for all four forecasting models to examine the MAPE results in each category. Table 6 reports the average MAPE of the 2021 test predictions for all models under the different input structures. Overall, holidays influence the MAPE values immensely, producing a higher range of values irrespective of the input features. In the holiday category, the proposed model achieves a MAPE around 2% to 4% lower than the other models under all forecasting structures. Similarly, the other models are inferior to the proposed model in the bridging holidays group. The Mondays category shows the smallest influence for all DL models; nevertheless, the proposed model performs worse than the others there for all feature sets except ten inputs, where it attains the minimum MAPE. Regarding weekdays and weekends, the bagging ensemble yields similar error percentages across all input features, varying between approximately 5% and 6%, while the other models range from 6% to 7%. Overall, the proposed model outperforms the other baseline models (LSTM, CNN+LSTM, and DNN) in all categories except the Mondays group.

Conclusions
In this paper, an ML-based bagging ensemble model combining LR and SVR was proposed. Moreover, three advanced DL models, i.e., the LSTM, CNN+LSTM, and DNN models, were implemented as benchmarks for a forecasting comparative analysis. The data were gathered from the Electricity Generating Authority of Thailand (EGAT).
All models were trained and tested using cleaned data from 2019 to 2020 to forecast daily load demand in 2021. The proposed and benchmark models were trained both without temperature, using five and nine input features, and with temperature, using six and ten input features; the nine- and ten-feature sets include dummy features based on the calendar. Feature selection did not have much effect on any of the forecasting models, which produced almost stable errors across all feature sets; however, the temperature input did not help to improve accuracy. In addition, the results were compared across five categories (holidays, bridging holidays, Mondays, weekdays, and weekends). The proposed model provides better performance than the other models in all categories except Mondays. To sum up, apart from our proposed model, each forecasting model performs well on different feature sets, giving reasonable accuracies. These results indicate that our proposed model can improve forecasting performance while remaining robust to the sensitivity of feature variation.