Diving Deep into Short-Term Electricity Load Forecasting: Comparative Analysis and a Novel Framework

In this article, we present an in-depth comparative analysis of conventional and sequential learning algorithms for electricity load forecasting and select the most appropriate algorithm for energy consumption prediction (ECP). ECP reduces the misuse and wastage of energy using mathematical modeling and supervised learning algorithms. However, existing ECP research lacks a comparative analysis of various algorithms to reach the optimal model with real-world implementation potential and convincingly reduced error rates. Furthermore, these methods are less friendly towards the energy management chain between smart grids and residential buildings, with limited contributions to saving energy resources and maintaining an appropriate equilibrium between energy producers and consumers. Considering these limitations, we dive deep into load forecasting methods, analyze their performance, and finally present a novel three-tier framework for ECP. The first tier applies data preprocessing to refine and organize the data prior to training. The second tier is the learning process, employing ensemble learning algorithms (ELAs) and sequential learning techniques to train over energy consumption data. In the third tier, we obtain the final ECP model, evaluate our method, and visualize the data for energy data analysts. We experimentally show that deep sequential learning models outperform mathematical modeling techniques and their variants on available residential electricity consumption data, reaching an optimal proposed model with the smallest mean square error (MSE) of 0.1661 and root mean square error (RMSE) of 0.4075 against recent rivals.


Introduction
In recent decades, the consumption of energy in different sectors such as industries, factories, transportation, and residential buildings has increased tremendously due to overpopulation and economic growth. About 39% of total global energy is consumed by buildings and 38% is dissipated in CO2 emissions [1]. Given this, we need to reduce excess energy consumption in buildings to protect and preserve energy for more efficient usage [2]. Accordingly, the prediction of future energy usage has encouraged several researchers to boost the performance of smart grids, which directly affects energy production and consumption. Predictions of energy usage ensure that proper plans are available to meet the energy demands of certain buildings and control energy distribution in ways that add extra benefits to the setup of smart grids. Similarly, intelligent building profiling plays a vital role in making decisions for energy conservation and management [3,4]. It gives users insights into energy consumption behavior that residential owners can use for building operations related to energy usage, and it helps in designing proper infrastructure [5]. Intelligent building profiling also helps to detect outliers in consumption and to perceive any risks in advance [6].
Figure 1. (a) Renewable energy is obtained by the smart grid, where it is organized and made suitable for usage. After passing through the initial stages, it is supplied to the consumers according to their demands. These consumers include industries, households, offices, and public transport systems. (b) Statistical overview of the energy used in different sectors of South Korea, where the industries consume a huge amount of energy due to heavy machinery.
Similar to a number of production resources, energy is also consumed at varied points and locations, depending upon the application under consideration. For instance, Figure 1b highlights energy consumption in South Korea, which is distributed among industries, residential buildings, transportation, etc. [7]. Statistical analysis [8] suggests that the energy consumed by residential and commercial buildings, the public sector, industries, and transportation in South Korea is 38%, 6%, 55%, and 1% of total energy consumption, respectively.
Analyzing the historical data from these varied consumption sources, i.e., residential buildings and industries, assists in planning future energy production and its efficient consumption. Maintaining a sufficient energy supply plays a vital role in human welfare, while energy producers tend to maintain and continuously improve their distribution services. The amount of energy distributed over a certain time and under varied weather conditions helps to predict future energy consumption. From a time-series analysis perspective, the energy patterns recorded in varied scenarios are fed to either a mathematical model or a machine learning model for energy forecasting. Several energy forecasting methods are used to estimate the future energy demand.
The existing energy forecasting literature contains many contributions that effectively analyze the time-series data produced by smart meters. However, studies reveal that the majority of the methods model various features, such as weather prediction, product price, and household expenses, and their relationships through statistical methods. These features allow the changes to be explained and the cost to be estimated relatively easily. As shown in the energy domain and its related literature, research attention has increased widely with the practice of deep learning and sequential time-series methods. These methods have shown tremendous performance in many data science applications, including the energy forecasting domain. Although the usage of deep learning in energy forecasting has boosted the precision of current models, the existing literature has rarely focused on household energy management. Regarding energy management and forecasting problems, ensemble learning and its variants have remained unexplored for residential energy consumption data. Moreover, the literature on deep sequential learning, such as recurrent neural networks (RNNs), reveals reasonable outputs across different tasks. However, its several variants are the missing pieces of residential energy management. With the aforementioned observations in mind, this article presents a novel three-tier framework for energy consumption prediction (ECP). The notable contributions of the proposed method are given below:

• The data obtained via smart sensors and meters have several abnormalities, uncertainties, and outliers that occur due to weather variations, government influences, etc. To handle this issue, we employ a preprocessing step that includes data cleansing, organization, noise removal, normalization, and arrangement in rolling windows to obtain refined data that best fit the subsequent training step.

• To deeply evaluate and consider mathematical modeling in the ECP domain, we apply several ELAs in this research and compare their performance with several deep sequential learning methods to learn about their effectiveness and real-world implementation potential.

• The literature on sequential learning, such as recurrent neural networks (RNNs), shows promising outcomes for several tasks, improving on traditional learning methods. Inspired by this, we employ an RNN and several of its variants, such as long short-term memory (LSTM), bidirectional LSTM (Bi-LSTM), and multilayer LSTM (M-LSTM). These variants are more suitable for ECP of residential buildings.

• We experimentally show that the sequential learning models are the most appropriate algorithms to handle the energy forecasting problem, verified by the lowest error rates using publicly available data.

The remainder of the paper is organized as follows: Section 2 covers the literature review and Section 3 provides the three-tier method for ECP, while Section 4 discusses the experimental results. Section 5 concludes the proposed method with some future directions.

Literature Review
Due to the widespread usage of ECP applications across the world, ECP has gained considerable research attention. Several techniques have produced realistic and promising results to manage energy consumption. To thoroughly overview the existing methods, we divide them into conventional and deep learning-based methods.

Conventional Energy Management
In the initial stages, statistical methods were broadly used by researchers. For instance, Zhong et al. [9] proposed a support vector regression (SVR)-based method for ECP where the multidistortion generated the optimal feature space in the data. They approximated a highly nonlinear relationship between the input and output through linearity. Next, Guo et al. [10] introduced a machine learning-based model to predict the response time of a building's thermal energy. They considered multiple linear regression (MLR), support vector machine (SVM), and extreme learning machine models for energy prediction and analyzed the performance of each model for heating analysis. Similarly, Liu et al. [11] analyzed SVM for ECP in buildings. Zhang et al. [12] explored the SVM model to predict the energy consumption in the iron making process and further considered a particle swarm algorithm to improve the consumption prediction. Cauwer et al. [13] detected and quantified the correlation between energy consumption and kinematic parameters. They considered the vehicle dynamics as the underlying physical model and used MLR for the construction of three other models. The authors formed different levels of aggregation used by the models to allow predictions via different types of input parameters. Cai et al. [14] classified the consumption ratings of 16,000 residential houses based on the data collected from the whole region. They shortened the electric patterns using data mining approaches and used the K-means algorithm to accomplish clustering, where the electricity usage was distributed through the obtained centers of each cluster, and then SVM was used for classification. Moreover, Fumo et al. [15] established simple, multiple, and quadratic regression models for hourly and daily energy consumption in residential buildings.
Data quality is an important issue to consider in the forecasting problem and has a high impact on the forecasting algorithm. For instance, Luo et al. [16] investigated the integrity of data over the load forecasting problem. For this purpose, they considered several regression models and simulated a few data integrity attacks to identify their effects on the model's performance. They demonstrated that the existing regression methods failed to give reasonable forecasting results. Similarly, Zhang et al. [17] studied the impact of data attacks over the accuracy of forecasting models. In their method, they verified that the most robust and representative power forecasting models are SVM and K-nearest neighbors (KNNs) combined with kernel density.

Deep Learning-Based Energy Management
Nowadays, research communities use deep learning methods due to their reasonable results in solving different energy and computer vision related applications such as energy systems [18]. Recently, He et al. [19] proposed a deep learning algorithm-based data driven approach in an unsupervised learning manner to extract the sensitive consumption features from the machinery data and developed a prediction model in a supervised manner. Similarly, Hu et al. [20] developed the stacked hierarchy of reservoirs (DeepESN) to predict the energy consumption and wind power generation through a deep learning framework. DeepESN combined the time-series ability of the state network and learning ability of the framework. Further, Ullah et al. [21] presented a clustering-based analysis of energy consumption and categorized the usage of electricity. Similarly, Gao et al. [22] proposed deep learning models such as a sequence to sequence model and two-dimensional convolutional neural network (CNN). They used a transfer learning approach to empower the prediction accuracy obtained for residential buildings. In addition, the sequential learning techniques have also been considered for the forecasting problem. For instance, Somu et al. [23] presented eDemand, which is an energy consumption model that employs LSTM. They improved the sine and cosine optimization algorithms for building energy forecasting. In this regard, Hussain et al. [24] enveloped the energy forecasting methods into one platform that covers both the deep learning and conventional methods. They also provided a statistical analysis of the energy forecasting methods. Furthermore, Li et al. [25] proposed an evolutionary algorithm known as teaching-learning-based optimization (TLBO) for short-term energy consumption in residential buildings. They further improved the prediction process using an artificial neural network where the CNN layers are capable of extracting spatial and temporal features from the data sequence. 
Recently, Ullah et al. [26] proposed a method where the CNN is combined with multilayer bidirectional LSTM for energy consumption of a household. Inspired by LSTM, Wen et al. [27] used a deep RNN along with LSTM for power load and photovoltaic power forecasting in the microgrid. They showed that the deep RNN with LSTM performed very well compared to a multilayer perceptron. They optimized the load dispatch using particle swarm optimization (PSO).

Material and Method for ECP
In this section, we discuss each step of our method in detail. We discuss the ELA and sequential learning methods for ECP along with the statistical methods. The load data considered are from a household and are used to evaluate the effectiveness of the proposed method.

Data Setting and Preprocessing
In this section, we discuss the data setting and how the data are preprocessed. Usually, during data collection, smart meters are connected to the main board and measure the power, current, voltage, etc., of all the appliances installed inside the house. However, the data contain uncertainty, which is a major problem that drastically affects the ECP. These uncertainties emerge from different conditions such as the occupants' behavior, the building's infrastructure, and noisy values generated by the system software during its setup. In the case of ECP, the uncertainties include outliers and missing values, which negatively affect the final prediction results. Energy forecasting is nevertheless one of the critical steps in efficient energy management, as it helps to balance the smart grids and the residential buildings. Prediction of energy consumption in the short term has been studied extensively, while forecasting is less explored when going deeper at the aggregate level. The uncertainty increases as the sample size becomes smaller. Several methods [28,29] have focused on the difficulties and uncertainties that arise when dealing with the consumption prediction problem. The proposed method is evaluated on a publicly available benchmark, the household power consumption dataset, which contains missing and noisy values. The initial data contain outliers and uncertainties that lead the system to an incorrect prediction of energy consumption. To deal with these uncertainties, the data are first passed through a preprocessing layer that applies smoothing filters to prepare the data for the actual modeling. The missing values are substituted by the previous values. Furthermore, some data outliers were replaced through the normalization technique, which brought all the values into the same range to help in the smooth processing of the data. Once the data are refined and ready for processing, different horizons are formed from the data, such as minutes, hours, days, and weeks, for detailed investigation. This step is visually shown in Figure 2 as step 1 and step 2.
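The preprocessing steps described above (previous-value substitution, horizon formation, normalization, and rolling windows) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `preprocess` and `make_windows` are hypothetical, aggregation by summation and min-max scaling are assumed choices, and the smoothing filter is omitted.

```python
import numpy as np
import pandas as pd

def preprocess(series: pd.Series, rule: str = "60min") -> pd.Series:
    """Fill gaps with the previous value, aggregate to a horizon, and scale to [0, 1]."""
    filled = series.ffill()                 # missing value <- previous value
    coarse = filled.resample(rule).sum()    # assumed aggregation: sum per horizon
    lo, hi = coarse.min(), coarse.max()
    return (coarse - lo) / (hi - lo)        # assumed min-max normalization

def make_windows(values: np.ndarray, window: int, horizon: int = 1):
    """Slice a 1-D array into (input window, following targets) pairs."""
    X, y = [], []
    for start in range(len(values) - window - horizon + 1):
        X.append(values[start:start + window])
        y.append(values[start + window:start + window + horizon])
    return np.array(X), np.array(y)

# Synthetic one-minute readings with a few gaps (illustrative data only).
idx = pd.date_range("2007-01-01", periods=180, freq="min")
raw = pd.Series(np.arange(180, dtype=float), index=idx)
raw.iloc[[5, 6, 100]] = np.nan

hourly = preprocess(raw, "60min")
X, y = make_windows(hourly.to_numpy(), window=2)
print(hourly.shape, X.shape, y.shape)
```

Other horizons follow by changing `rule` (for example `"1D"` for days); the window length is a free parameter that the text does not fix.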

Learning Mechanism
We applied several ELAs to evaluate their performances for ECPs. Similarly, we used the most popular deep sequential learning techniques that are abundantly used due to their promising results, as given below.

ELA
In this section, we discuss several ELAs to assess their effectiveness and performance for ECP. ELAs blend the forecasts of multiple predictors to increase generalization and robustness. These algorithms can be categorized into (1) averaging algorithms, which build several independent predictors and average their forecasts, such as bagging methods or the random forest (RF) method, and (2) boosting algorithms, which combine several weak learners into a powerful ensemble, such as Adaboost (AB) and gradient boosting. To perform short-term consumption prediction, the actual power is considered from the energy data. A detailed explanation is given below.

AB Algorithm
AB is a popular machine learning algorithm introduced by Freund et al. [30] and was originally designed for classification. The core concept of this algorithm is to repeatedly fit a sequence of weak learners on modified versions of the data. The modification is brought about by changing the weights of the training samples. Initially, all the weights are distributed equally, and in each iteration the algorithm updates the weights. The weights are increased for those samples that the current classifier predicts wrongly. The algorithm is adaptive in the sense that subsequent weak learners are tweaked in favor of the instances that were wrongly predicted by the previous classifier. AB takes as input a training set (x_1, y_1), ..., (x_m, y_m), where each x_i belongs to a certain domain space while each y_i is a label in the set Y. AB repeatedly calls the base algorithm in a series of rounds n = 1, ..., N. The core idea of this algorithm is the maintenance of a distribution, or set of weights, over the training set. There is another flavor of the ensemble algorithm, ABR2 [31], which is a modified version of the AB ensemble [30] for regression. This algorithm sequentially fits the estimators, where each estimator focuses on the samples that were predicted with high loss. The core features of ABR2 are the dataset and the sampling distribution. Each training data element has a value in the sampling distribution, which gives the probability of including that element in the training set. The detailed pseudocode of AB is given in Table 1, while the pseudocode for ABR2 is given in Table 2.
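The weight-maintenance idea behind AB can be illustrated with a small from-scratch sketch. It uses the commonly published discrete AdaBoost update (learner weight 0.5·ln((1−ε)/ε) and exponential reweighting), which is assumed here rather than taken from the text; decision stumps from scikit-learn serve as the weak learners, and the function names and synthetic data are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=30):
    """Discrete AdaBoost with decision stumps; labels y must be in {-1, +1}."""
    m = len(y)
    D = np.full(m, 1.0 / m)                  # equal initial weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=D)     # weak learner trained under D
        pred = stump.predict(X)
        eps = np.sum(D[pred != y])           # weighted error of this round
        if eps >= 0.5:                       # no better than chance: stop
            break
        eps = max(eps, 1e-12)                # guard against log(1/0)
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        D = D * np.exp(-alpha * y * pred)    # raise weights of wrong samples
        D = D / D.sum()                      # normalization factor Z_n
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    """Sign of the alpha-weighted vote of all weak hypotheses."""
    score = sum(a * h.predict(X) for h, a in zip(learners, alphas))
    return np.sign(score)

# Synthetic data with a diagonal boundary that one stump cannot represent.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

learners, alphas = adaboost_fit(X, y)
stump_acc = np.mean(learners[0].predict(X) == y)
ensemble_acc = np.mean(adaboost_predict(learners, alphas, X) == y)
print(f"single stump: {stump_acc:.3f}, ensemble: {ensemble_acc:.3f}")
```

For regression, scikit-learn's `AdaBoostRegressor` implements the AdaBoost.R2 variant and would be the more direct counterpart of ABR2.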
Figure 2. Three-tier ECP framework. First, the preprocessing step is applied for noise and outlier removal from the data, normalization, and formation of horizons. In the second tier, the model's learning is carried out, which comprises conventional learning and sequential learning-based methods, while the third tier gives the final prediction of energy, visualization, and performance evaluation using basic metrics.

Table 1. Mathematical step-by-step explanation of the working flow of the ensemble learning algorithm (ELA) representing Adaboost (AB) [32].
For n = 1, ..., N:
(1) Train the weak learner using the distribution D_n.
(2) Obtain the weak hypothesis h_n: X → {−1, +1} with its error.
(3) Update the distribution, where Z_n represents the normalization factor.
Output: the final hypothesis.

Table 2. Working flow of ABR2.
1. Initialize the sampling distribution S_1 over the dataset D.
2. Construct the training set from D using S_1.
3. Develop the network h_n and train it using the training set.
4. Find the maximum loss, L_max, over the dataset.
5. Calculate the loss for each sample in the dataset.
6. Calculate the weighted loss.
7. Set the weight-update factor.
8. Update the distribution, where Z_n is the normalization factor selected so that D_{n+1} sums to 1.
9. n = n + 1; repeat steps 2-8 till L < 0.5.

GBR Algorithm
The gradient boosting regression (GBR) algorithm was originally developed for regression and classification problems. This algorithm produces a prediction model in the form of an ensemble of weak prediction models, such as decision trees (DTs). It builds the model stage-wise, like other conventional boosting techniques, and generalizes them by allowing the optimization of a differentiable loss function. It uses weakly performing DTs as base learners for the ensemble. GBR sequentially keeps the model updated and uses the optimization of the loss function to reach its generalization.
This algorithm considers the additive model given in Equation (1) [34], where h_m(x) represents the basis functions, known as weak learners in the boosting context, while γ_m is the step length, which is chosen by the function given in Equation (2).
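For reference, the usual gradient-boosting formulation that matches the symbols used in this subsection can be written as below. This is a hedged restatement of the standard form, not necessarily the authors' exact notation: the number of stages M, the samples (x_i, y_i), and the correspondence of these four lines to Equations (1)-(4) are assumptions.

```latex
\begin{align}
F(x) &= \sum_{m=1}^{M} \gamma_m h_m(x) \\
\gamma_m &= \arg\min_{\gamma} \sum_{i=1}^{n} L\bigl(y_i,\; F_{m-1}(x_i) + \gamma\, h_m(x_i)\bigr) \\
F_m(x) &= F_{m-1}(x) + \gamma_m h_m(x) \\
h_m &= \arg\min_{h} \sum_{i=1}^{n} L\bigl(y_i,\; F_{m-1}(x_i) + h(x_i)\bigr)
\end{align}
```

The first line is the additive model, the second the step-length search, the third the forward stage-wise update, and the fourth the choice of the weak learner at stage m.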
Like other boosting algorithms, GBR builds the additive model in a forward stage-wise fashion, as illustrated in Equation (3). At every stage, h_m(x) is chosen to minimize the loss function L given the current model F_{m−1} and its fit F_{m−1}(x_i), as shown in Equation (4). The GBR model is implemented in Python with the scikit-learn library. The total number of estimators used is 50. The depth of the individual regression estimators is tuned. The pseudocode of GBR is given in Table 3.
Table 3. Mathematical step-by-step explanation of the working flow of ELA representing gradient boosting regression (GBR) [35].
GBR steps:
(1) initialize f_0 using a constant
(2) for n = 1 to I_n do
(3) find the negative gradient g_n(x)
(4) fit the base learner function L(x, θ_n)
(5) compute the best gradient descent step size ρ_n

RF Algorithm
RF is also an ensemble learning method developed for regression, classification, and other tasks, and it operates by constructing a multitude of DTs during the training process. This algorithm was first introduced by Tin Kam Ho et al. [36] as a random decision forest, defined as ensemble learning for regression and classification. It uses the technique of bagging to build an ensemble of DTs: based on a random selection of data and variables, it develops many randomized DTs [37]. Each tree in the forest is trained on a random subset of the training samples and a random subset of the features. To predict a certain example, each tree is traversed until it reaches a leaf node, and the outputs of all trees are averaged to find the overall output. Probability scores are assigned according to the ratio of training examples that belong to the leaf node. These scores are averaged over all the trees in the forest, which gives the overall probability score of that sample. In this algorithm, the number of estimators is 100. Furthermore, the pseudocode of RF is given in Table 4.
Table 4. Mathematical step-by-step explanation of the working flow of ELA representing random forest (RF) [32].
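Separately from the pseudocode tables, the two ensemble configurations stated in the text (GBR with 50 estimators and RF with 100 estimators) can be sketched with scikit-learn. Everything else here is an assumption made for illustration: the synthetic daily-cycle load, the 24-step input window, the chronological 80/20 split, and `max_depth=3` (the text only says the depth is tuned).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic hourly load with a 24-hour cycle plus noise (illustrative only).
rng = np.random.default_rng(0)
t = np.arange(600)
load = np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(t.size)

# Rolling windows: the previous 24 values predict the next one.
window = 24
X = np.array([load[i:i + window] for i in range(len(load) - window)])
y = load[window:]

# Chronological holdout split, so the test set lies in the "future".
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

models = {
    "GBR": GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=0),
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: test MSE = {scores[name]:.4f}")
```

GBR adds its trees sequentially, each one correcting the current model, whereas RF trains its trees independently and averages them, which is the averaging-versus-boosting distinction drawn earlier in this section.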

KNN, SVR, DT
KNN is a popular supervised machine learning method. For a test sample, the K nearest neighbors are selected from the training samples. The search is carried out using a distance metric, and the prediction is made on the basis of these K neighbors. This algorithm is also known as a nonparametric method. In KNN, the prediction is performed over labeled data [38]. First, the Euclidean distance is computed from the query sample to each labeled sample. Next, the labeled samples are ordered by increasing distance. Increasing the K-distance changes the order of the samples. Third, the number K of nearest neighbors is found heuristically based on the RMSE, a process carried out via cross-validation. Finally, an inverse-distance-weighted average over the K nearest neighbors is computed, and the unlabeled data are labeled accordingly.
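The KNN steps above (Euclidean distance, ordering, choosing K by cross-validated RMSE, and inverse-distance weighting) can be sketched in a few lines of NumPy. The function names, the small constant that avoids division by zero, and the contiguous fold layout are assumptions made for this illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k):
    """Inverse-distance-weighted average of the k nearest training targets."""
    dist = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances
    nearest = np.argsort(dist)[:k]                     # order by distance
    weights = 1.0 / (dist[nearest] + 1e-12)            # closer means heavier
    return float(np.sum(weights * y_train[nearest]) / np.sum(weights))

def choose_k(X, y, candidates, n_folds=5):
    """Pick the k with the lowest cross-validated RMSE."""
    folds = np.array_split(np.arange(len(X)), n_folds)
    best_k, best_rmse = None, np.inf
    for k in candidates:
        sq_err = []
        for test_idx in folds:
            train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
            for i in test_idx:
                pred = knn_predict(X[train_idx], y[train_idx], X[i], k)
                sq_err.append((pred - y[i]) ** 2)
        rmse = float(np.sqrt(np.mean(sq_err)))
        if rmse < best_rmse:
            best_k, best_rmse = k, rmse
    return best_k, best_rmse

# Tiny illustrative data set.
X_demo = np.linspace(0.0, 1.0, 40).reshape(-1, 1)
y_demo = X_demo[:, 0] ** 2
best_k, best_rmse = choose_k(X_demo, y_demo, candidates=[1, 3, 5])
print("chosen k:", best_k, "cross-validated RMSE:", round(best_rmse, 4))
```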
Furthermore, the SVR algorithm is commonly used to solve several machine learning tasks, and we use it to evaluate its performance for the prediction of energy consumption. In regression, a function is fitted to the training samples and used as a mapping from the input domain to a real number. A hyperplane lies in the middle of the samples, with two outer lines that represent the decision boundary. The plane that contains the largest number of points is taken as the hyperplane. The goal is to construct a decision boundary at a certain distance from the hyperplane such that the data samples closest to the hyperplane, the support vectors, lie within it. Only those points are considered that fall within the decision boundary and have the lowest error rate.
Similarly, we also used DTs, developed by J. R. Quinlan et al. [39], which build regression models in the form of a tree structure. A DT breaks the dataset into smaller and smaller subsets while an associated tree is incrementally constructed at the same time. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches, where each branch represents a value of the attribute that is tested, whereas a leaf node indicates the decision made on the numerical target. The topmost node of the tree corresponds to the best predictor and is known as the root node. Further details are out of the scope of the paper.
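As a small illustration of these two baselines, the sketch below fits scikit-learn's `SVR` and `DecisionTreeRegressor` on a synthetic curve. The kernel, `C`, `epsilon`, and `max_depth` values are assumptions chosen for the example, since the text does not report the settings used. The `epsilon` parameter plays the role of the two outer lines around the fitted function.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional regression problem (illustrative only).
rng = np.random.default_rng(1)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0])

# epsilon sets the half-width of the tube around the regression function.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
# max_depth bounds the number of decision nodes between root and leaf.
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

inside_tube = np.abs(svr.predict(X) - y) <= 0.1 + 1e-6
print("share of samples inside the epsilon tube:", inside_tube.mean())
print("number of support vectors:", len(svr.support_))
print("tree depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```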

Sequential Learning Algorithm
Time-series data can be referred to as sequential data. The significant property of these data is the order of the information. Different methods have been developed to handle such data, and RNNs have gained overwhelming popularity as tools to deal with sequential data. The important property of RNNs is the usage of feedback connections inside them. The network starts by reading the initial piece of information and then proceeds to read the rest of the data. Inspired by this, we applied sequential learning techniques for the prediction of energy consumption in households, the details of which are given below.

Long Short-Term Memory
This section discusses the internal architecture of LSTM and how it processes information. LSTM comprises special memory blocks in its recurrent layer that overcome the vanishing gradient problem. The memory blocks have internal memory cells that are self-connected. These blocks have three multiplicative units, known as gates, that control the temporal information of the sequence: the input, output, and forget gates. The input gate controls the flow of current information into the memory cell, while the output gate controls the flow of information to the rest of the network. The forget gate sets the state of the previous cell information and retains part of that information in the current network. The internal details of the LSTM architecture are given in Figure 3a. The three gates multiply the previous information by a value ranging from 0 to 1. The information is discarded if the value is 0 and retained if the value is 1. The gates use the sigmoid function to map the data into the range between 0 and 1. This function is given in Equation (5) [40].
Here, i_t, f_t, and o_t represent the input, forget, and output gates, respectively, while the intermediate value is represented by C_t. W_xi, W_hi, W_ci, W_xf, W_hf, W_cf, W_xo, W_ho, and W_co denote the weights, while b_i, b_f, b_C, and b_o represent bias vectors. x_t indicates the current input, while h_t and h_{t−1} show the output at the current time t and the previous time t − 1, respectively.
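For readers who want the gate equations in one place, a commonly used formulation that is consistent with the symbols named here is given below. It is a hedged restatement, not necessarily the authors' exact equations: the cell-to-gate weights W_ci, W_cf, and W_co suggest an LSTM with peephole connections, which is assumed, and the candidate weights W_xc and W_hc are assumed names that do not appear in the text.

```latex
\begin{align}
\sigma(x) &= \frac{1}{1 + e^{-x}} \\
i_t &= \sigma\left(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} C_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} C_{t-1} + b_f\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tanh\left(W_{xc} x_t + W_{hc} h_{t-1} + b_C\right) \\
o_t &= \sigma\left(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} C_t + b_o\right) \\
h_t &= o_t \odot \tanh\left(C_t\right)
\end{align}
```

Here ⊙ denotes element-wise multiplication. A gate value near 0 in f_t discards the corresponding entry of C_{t−1}, and a value near 1 retains it, which matches the description of the gates above.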

Bidirectional LSTM
We also studied and evaluated the performance of bidirectional LSTM for energy consumption, which is another variant of RNNs proposed by Schuster et al. [41]. The core concept of Bi-LSTM derives from bidirectional RNNs [41], where the input data sequence is processed in both forward and backward directions inside the hidden layers. Two layers operate with opposite time-step directions: one layer processes the sequence in the forward direction while the other processes it in the backward direction. This representation is given in Figure 3b. This network has shown promising results in ECP and in computer vision tasks such as activity analysis [42,43], as well as in forecasting problems. In addition, M-LSTM is also used for the prediction of energy consumption. In M-LSTM, each layer obtains its input from the previous layer and from the previous state of the same layer. In conventional deep neural networks, the neurons have high-dimensional activation values, which are capable of learning the sequences in big data. Therefore, stacking multiple layers of LSTM supports the extraction of long-term sequence information from the data.
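The forward and backward passes of a bidirectional layer can be sketched with plain NumPy. This is an illustrative forward pass only, under stated assumptions: the LSTM cell below omits peephole connections for brevity, the weights are random rather than trained, and all function names are hypothetical. It is not the Keras model used in the experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n_in, n_hidden, rng):
    """One stacked weight matrix and bias for the i, f, o gates and the candidate."""
    W = rng.standard_normal((4 * n_hidden, n_in + n_hidden)) * 0.1
    b = np.zeros(4 * n_hidden)
    return W, b

def lstm_forward(seq, params):
    """Run an LSTM over seq of shape (T, n_in); return hidden states (T, n_hidden)."""
    W, b = params
    n_hidden = b.size // 4
    h = np.zeros(n_hidden)
    c = np.zeros(n_hidden)
    outputs = []
    for x in seq:
        z = W @ np.concatenate([x, h]) + b
        i = sigmoid(z[:n_hidden])                  # input gate
        f = sigmoid(z[n_hidden:2 * n_hidden])      # forget gate
        o = sigmoid(z[2 * n_hidden:3 * n_hidden])  # output gate
        g = np.tanh(z[3 * n_hidden:])              # candidate cell value
        c = f * c + i * g
        h = o * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)

def bi_lstm_forward(seq, fwd_params, bwd_params):
    """Concatenate a forward pass with a time-reversed backward pass."""
    forward = lstm_forward(seq, fwd_params)
    backward = lstm_forward(seq[::-1], bwd_params)[::-1]  # realign to time order
    return np.concatenate([forward, backward], axis=1)

rng = np.random.default_rng(0)
seq = rng.standard_normal((6, 3))          # 6 time steps, 3 features
fwd = init_params(3, 4, rng)
bwd = init_params(3, 4, rng)
out = bi_lstm_forward(seq, fwd, bwd)
print(out.shape)  # (6, 8)
```

A multilayer variant follows the same pattern by feeding the output sequence of one `lstm_forward` call into a second call with its own parameters, so that each layer receives the previous layer's outputs together with its own previous state.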

Results
In this section, we discuss and investigate in depth the experimental results obtained by the proposed method. We also give the software and implementation details and describe how the data are structured and analyzed for use.

System Software Settings and Implementation Details
Several kinds of experiments were performed to verify the effectiveness of the proposed method. The proposed three-tier framework was implemented in Python (Version 3.5) using the deep learning framework Keras with TensorFlow as the backend, and Adam was used as the optimizer. We used 10-fold cross-validation during the experiments, where the data are divided into N = 10 parts; N − 1 parts are used for training and the remaining part is used for testing. This validation process is repeated until every data point has passed through both the training and testing phases. Furthermore, we also used the holdout method, where separate training and testing sets were formed for validation. As we are dealing with a regression problem, we used different error metrics, including MSE, RMSE, mean absolute percentage error (MAPE), and mean absolute error (MAE). Their detailed formulations are given as follows:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \tilde{y}_i\right)^2 \tag{11}$$

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \tilde{y}_i}{y_i}\right| \tag{12}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \tilde{y}_i\right)^2} \tag{13}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \tilde{y}_i\right| \tag{14}$$

Here, $\tilde{y}_i$ represents the predicted value of energy consumption for each of the $n$ predictions, while $y_i$ denotes the observed value, so Equations (11)-(14) show MSE, MAPE, RMSE, and MAE, respectively.
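The four error metrics named above follow their standard definitions and can be sketched in plain Python. The function names and the toy values are illustrative assumptions, and expressing MAPE as a percentage is likewise an assumption about the convention used.

```python
import math


def mse(y_true, y_pred):
    """Mean square error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error, the square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error; observed values must be non-zero."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy observed and predicted loads in kilowatts (made-up numbers).
observed = [2.0, 4.0, 5.0]
predicted = [1.0, 4.0, 7.0]
scores = {
    "MSE": mse(observed, predicted),
    "RMSE": rmse(observed, predicted),
    "MAE": mae(observed, predicted),
    "MAPE": mape(observed, predicted),
}
```

Because RMSE is the square root of MSE, the two always rank models identically; MAE and MAPE are less sensitive to single large errors.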

Dataset
To evaluate the proposed method, we used the household power consumption dataset.

Household Power Consumption Dataset
We used the public power consumption dataset that is available in the UCI machine learning repository [44] and consists of data measured through a smart meter from 2006 to 2010, adding up to four years of data. This dataset consists of 2,075,259 instances, of which 25,979 contain missing values, equating to 1.25% of the data. These missing values were handled in the preprocessing step. The data are recorded at one-minute resolution over the four years. In these data, the total active power is represented using submetering 1, submetering 2, and submetering 3, which are measured every minute and given in watt-hours. To predict future energy consumption, we used different resolutions such as minutes, hours, days, and weeks, where the input data were given to the training network as a series of windowed sequences. The detailed descriptions of the variables are given in Table 5, and their quantitative details are given in Table 6. The variables described in Table 5 are as follows:

Time: This variable provides the timing information in hours, minutes, and seconds. The hour values range from 0 to 23, while the minute values range from 0 to 59.
Global Active Power (GAP)/kilowatts: The total active power consumed by all appliances in the household in each minute.
Global Reactive Power (GRP)/kilowatts: The total reactive power consumed by all appliances in the household in each minute.
Voltage (V)/volts: The voltage measured in each minute.
Global Intensity (GI)/amperes: The total intensity of current measured in each minute.
Sub-metering (S1)/watt-hours: The energy consumed in the kitchen, which includes the oven, dishwasher, and microwave.
Sub-metering (S3)/watt-hours: The energy consumed by the electric water-cooler, heater, and air-conditioner.
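The share of missing instances quoted for the dataset can be checked with simple arithmetic, and one common way to handle such gaps is to carry the last valid reading forward. The sketch below is an assumption for illustration only: this section does not state which imputation strategy was used in the preprocessing step, and the function names and toy readings are hypothetical.

```python
TOTAL_ROWS = 2075259   # instances reported for the dataset
MISSING_ROWS = 25979   # instances reported as missing


def missing_share(missing, total):
    """Percentage of instances that are missing."""
    return 100.0 * missing / total


def forward_fill(values):
    """Replace each None with the most recent valid reading.

    Leading gaps stay None because there is nothing to carry forward.
    """
    filled = []
    last_seen = None
    for value in values:
        if value is None:
            filled.append(last_seen)
        else:
            last_seen = value
            filled.append(value)
    return filled


share = missing_share(MISSING_ROWS, TOTAL_ROWS)

# Toy minute-level GAP readings in kilowatts, with gaps marked as None.
gap_kw = [None, 1.2, None, None, 0.8]
filled_gap_kw = forward_fill(gap_kw)
```

Forward filling keeps the series at a fixed one-minute spacing, which matters when the data are later cut into windows; other strategies, such as interpolation or using the value from the same time on the previous day, would be equally plausible choices.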

Data Interpretation
We visualized the variables and analyzed their patterns, as can be seen in Figure 4, which is provided at minute resolution. This presentation provides a clear interpretation of each variable and its consumption pattern over time. In the proposed method, we focused exclusively on Global Active Power (GAP), which covers the power consumed by all the appliances; therefore, we further analyzed GAP over different horizons for consumption prediction.

Before feeding the data into the network, a significant step is to clarify the objectives against which the performance of a prediction model is verified and assessed. Therefore, we created different time horizons of the dataset, including hours, days, weeks, and months, and performed an energy prediction for each horizon. The representation of each horizon is illustrated in Figure 4. In Figure 5, it can be seen that the distribution of the GAP observations (in kilowatts) steadily decreases. Next, Figure 6 presents the data at the minutes' horizon. The distribution is bimodal with a peak, and a long tail extends towards higher kilowatt values. The drop in energy is observable in Figure 7 when nobody is at home or all the occupants are asleep, while the peak usage occurs when all the appliances in the household are in operation. The distribution of the observations for all variables is given as a histogram in Figure 7.
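Building the coarser horizons from minute-level readings amounts to grouping the readings by hour or day and aggregating each group. The following sketch shows this with the standard library only; averaging within each bucket, the function name, and the synthetic timestamps are assumptions, since the aggregation rule is not stated in this section.

```python
from collections import OrderedDict
from datetime import datetime, timedelta


def resample_mean(readings, horizon):
    """Average (timestamp, kW) readings per hour or per day.

    `readings` is assumed to be in chronological order.
    """
    buckets = OrderedDict()
    for ts, value in readings:
        if horizon == "hour":
            key = ts.replace(minute=0, second=0, microsecond=0)
        elif horizon == "day":
            key = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        else:
            raise ValueError("horizon must be 'hour' or 'day'")
        buckets.setdefault(key, []).append(value)
    return [(key, sum(vals) / len(vals)) for key, vals in buckets.items()]


# Two hours of synthetic minute-level readings (values are made up).
start = datetime(2007, 1, 1, 17, 0)
minute_readings = [
    (start + timedelta(minutes=m), float(m % 60)) for m in range(120)
]
hourly = resample_mean(minute_readings, "hour")
daily = resample_mean(minute_readings, "day")
```

Summing instead of averaging would give energy-like totals per bucket; either choice changes the scale of the error metrics, so results across horizons are only comparable when the same rule is applied throughout.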

Results On ELA
This section discusses the detailed experiments carried out using ELAs. We used the K-fold cross-validation method to evaluate the effectiveness of the methods, where the value of K was set to 10. In this way, the data were divided into 10 equal parts; K − 1 parts were used for training and the remaining part was used for testing. First, we used the AB algorithm in a practice run to assess its performance against other ELAs. AB stands for adaptive boosting, which is widely used as an ELA for regression problems in machine learning. The data were first separated into X and Y (label) sets. The model was defined as an AB regression class; the number of estimators was left at its default (n = 50), and the other parameters were also left at their defaults. The metrics used to evaluate its performance are MSE, RMSE, MAE, and MAPE; the results for the 10-fold method are given in Table 7, while the results of the hold-out method are given in Table 8.
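In the usual K-fold convention, each fold serves once as the test set while the remaining K − 1 folds form the training set. A minimal sketch of the index bookkeeping is given below; the function name is hypothetical, and keeping the folds contiguous and unshuffled is an assumption that suits time-ordered data but is not confirmed by the text.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


folds = list(k_fold_indices(25, k=10))
```

Every sample appears in exactly one test fold, so averaging the per-fold errors uses each observation once for evaluation.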

Table 7. Results obtained using ELA on different horizons using the 10-fold method.
The MSE obtained for AB is 9.6452. After AB, we applied GBR, which improves weak learners to create the final prediction model. DTs were used as base learners in this algorithm. These learners were identified through the gradient of the loss function: the prediction made by the weak learner was compared with the actual results to calculate the error, and based on this error, the model computed the gradient and changed the parameters so as to decrease the error rate in the next training round. Similar to AB, we used 50 estimators and the same strategy to form X and Y. The MSE value obtained for GBR is 9.2423. Furthermore, we used RF, an ensemble algorithm that is also based on learning DTs. Here, the estimator fits multiple trees on extracted subsets of the data and averages their predictions. The number of estimators and the other variables are set to their default values. RF gives an MSE of 7.1582.
Moreover, we evaluated DT, a well-known machine learning algorithm commonly used in regression problems. This model is based on decision rules extracted from the training data. Instead of the class, the model uses the MSE for decision accuracy. DT does not generalize well and is very sensitive to variation in the training data: a minute change in the training data can substantially affect the prediction accuracy. The minimum number of samples per leaf was 4 and the maximum depth was set to 2 in the model. The MSE value obtained for DT was 8.7545 at minute resolution.

KNN is a supervised learning strategy where K is a constant; we used its default value (K = 5). The distances to the nearest neighbors are computed according to this value. The MSE obtained using KNN was 8.1876 at minute resolution. Similarly, we used SVR, which applies a procedure similar to that of SVM for regression analysis. As regression targets are continuous numbers, SVR fits the model by approximating the best values within a margin known as the epsilon-tube, considering both the model complexity and the error rate. The MSE of SVR at minute resolution is 6.0581. All these algorithms take different amounts of time for training and testing. These times depend strongly on the configuration settings and the data horizons, such as the number of given instances and their resolutions. We calculated the training and testing times of three algorithms: AB, GBR, and RF. The time calculated for AB was used for ABR2, as it is the latest version of AB and is used in the proposed method. The time complexity of these algorithms is given in Table 9; it is calculated for the days' resolution, as we mainly focus on short-term analysis. The short-term analysis shows that RF is the most expensive in terms of training and testing time.

Results On Sequential Learning
In this section, we discuss the results obtained for ECP using deep sequential learning and examine their predictions. The same strategy of K-fold validation and hold-out validation was applied to the dataset. First, we performed experiments on the LSTM network, a type of RNN that processes sequential energy data. The LSTM learns the input data sequence by iterating over it and acquires information about the observed sequence. Based on the learned information, it predicts the next value in the sequence. We created X and Y sequences from the dataset and applied the window method, with the step size matching the resolution, i.e., minutely, hourly, etc. The Y-value was taken as the value following the sequence of X-values. The window was then shifted to the next element of X, Y was predicted again, and this process continued.
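The window method described above can be sketched as follows: each X sample is a run of consecutive readings, and the matching Y value is the reading that immediately follows it. The function name, the window length of 3, and the toy series are illustrative assumptions.

```python
def make_windows(series, window):
    """Slide a fixed-size window over the series, one step at a time.

    Returns (X, y), where X[k] holds `window` consecutive readings and
    y[k] is the reading that comes directly after them.
    """
    X, y = [], []
    for start in range(len(series) - window):
        X.append(series[start:start + window])
        y.append(series[start + window])
    return X, y


# Toy GAP series in kilowatts (made-up numbers).
series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
X, y = make_windows(series, window=3)
# X == [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
# y == [4.0, 5.0, 6.0]
```

A series of length n yields n − window samples, so longer windows give the network more context per sample at the cost of fewer training pairs.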
After applying LSTM, we used Bi-LSTM, which processes the data sequence in the forward and backward directions to remember past information and predict future data. We also used M-LSTM, in which multiple LSTM layers are stacked to reduce the error and enhance the accuracy. The MSEs obtained for LSTM, Bi-LSTM, and M-LSTM were 0.2821, 0.1855, and 0.1661 at minute resolution, respectively. These results were obtained using the 10-fold cross-validation method and are presented in Table 10, while the results of the hold-out method are presented in Table 11.

Comparative Analysis
In this section, we compare the techniques used for energy prediction. The employed techniques are grouped into ELA and sequential learning. Sequential learning is a subset of machine learning and functions in a similar fashion. If an algorithm based on artificial intelligence (AI) makes an incorrect prediction, we have to make adjustments. In deep/sequential learning, the algorithms determine on their own, using their neural networks, whether a prediction is accurate or not. Here, in the prediction problem, the machine learning algorithm parses the data and learns from it to make an informed prediction based on what it has learned, whereas deep sequential learning creates an artificial neural network that learns and makes intelligent predictions. In the proposed study, the sequential learning algorithms perform better than the ensemble/conventional machine learning algorithms, which can be verified by the error values given in Tables 7 and 10: at minute resolution, the lowest MSE among the conventional algorithms is 6.0581 (SVR), against 0.1661 for M-LSTM. Additionally, their architectural details are explained in each section, which helps account for the good performance of the deep sequential learning algorithms. The best performers in ensemble learning and sequential learning are SVR and M-LSTM, respectively. Overall, the most appropriate model is M-LSTM from sequential learning. Energy predictions for minutes and hours are given in Figure 8, while those for days and weeks are given in Figure 9. Furthermore, Figure 10 shows a detailed visual representation of the comparative analysis of the methods.

Comparison with State-of-the-Art Techniques
In this section, we compare the proposed method with the state of the art to verify its effectiveness on the household power consumption dataset. For a fair comparison, we used the same one-minute resolution and considered the 10-fold results. First, we investigated the method presented by Kim et al. [45], where a deep learning-based autoencoder was used to forecast future energy, with the backpropagation through time algorithm applied to train the model.
Figure 10. Visual representation of each method using MSE and RMSE for the minutes' horizon, which shows the better performance of the deep sequential learning algorithms over ELA.

Conclusions
Technological growth and advancements in industrial electrical machinery have resulted in a large amount of energy consumption in terms of power, fuel, oil, and gas without proper infrastructure. Over the decades, smart energy management has received considerably less research attention than computer vision and many other data science problems. A large amount of energy is wasted due to improper management and the lack of proper coordination between residential buildings/areas and smart grids. To handle this problem, researchers apply several machine learning techniques to forecast and efficiently manage energy consumption. However, the existing techniques focus largely on the study of a single strategy and are selective towards conventional approaches, and their performance is still far from real-world implementation. Thus, in this paper, we developed a novel three-tier ECP framework and carried out a detailed comparative analysis of conventional and deep sequential learning methods for forecasting by investigating their predictions and error rates. Conventional forecasting learning includes several ELAs, while deep sequential learning contains popular techniques such as LSTM, Bi-LSTM, and M-LSTM. In the first tier, the input data sequence is given to the preprocessing layer for noise and outlier removal. The second tier feeds the refined data into the training phase for learning, while the third tier gives the final ECP through actual and prediction graphs. We also evaluated the effectiveness of the proposed method using basic error metrics. Furthermore, we visualized the data and the results to analyze the data patterns for better interpretation. The proposed method used the individual household power consumption dataset that is publicly available in the UCI machine learning repository.
In the future, we aim to consider different forms of energy-related data acquired from the real-world environment, such as temperature, humidity, heating, and cooling data, as well as solar, water, and wind energy data. Based on these data, we aim to explore distinct forecasting models using mathematical and theoretical modeling, AI, complex neural networks, and reinforcement learning. Similarly, energy storage systems are important aspects to be considered in energy management to aid efficient energy consumption. Next, energy generation is not properly aligned with residential areas for proper energy usage; therefore, a system that matches the smart grid and residential areas for energy production and consumption needs to be included in the energy-related literature. Moreover, we intend to focus on edge computing and include resource-constrained devices with lower computational costs. Using the edge concept will ease ECP in terms of lower complexity and ensure timely responses.