Towards Modified Entropy Mutual Information Feature Selection to Forecast Medium-Term Load Using a Deep Learning Model in Smart Homes

Over the last decades, load forecasting has been used by power companies to balance energy demand and supply. Among the several load forecasting horizons, medium-term load forecasting is necessary for grid maintenance planning, electricity price setting, and harmonization of energy sharing arrangements. Forecasting month-ahead electrical loads provides the information required for the interchange of energy among power companies. For accurate load forecasting, this paper proposes a model for medium-term load forecasting that uses hourly electrical load and temperature data to predict month-ahead hourly electrical loads. For data preprocessing, modified entropy mutual information-based feature selection is used; it eliminates the redundancy and irrelevancy of features in the data. We employ the conditional restricted Boltzmann machine (CRBM) for load forecasting. A meta-heuristic optimization algorithm, Jaya, is used to improve the CRBM's accuracy and convergence rate. In addition, the consumers' dynamic consumption behaviors are investigated using a discrete-time Markov chain, and adaptive k-means is used to group their behaviors into clusters. We evaluated the proposed model using the GEFCom2012 US utility dataset. Simulation results confirm that the proposed model achieves better accuracy, faster convergence, and lower execution time compared to other existing models in the literature.


Introduction
In the energy sector, load forecasting assists the utility to estimate the energy needed to balance energy supply and demand. In addition, load forecasting provides information that is used for easy energy interchange with other utilities. Long-term load forecasting (LTLF) of more than a year ahead is required to determine the grid's regulatory policies and prices as well as the planning and construction of new electricity generation capacity. On the contrary, short-term load forecasting (STLF) of a few hours to a couple of weeks ahead is required for economic planning of electricity generation capacity, security analysis, fuel purchases, and short-term maintenance of the grid. Very short-term load forecasting (VSTLF) of a few minutes to a couple of hours ahead is required for real-time security evaluation. The main contributions of this paper are as follows:

1. We propose an entropy MI-based FS method that can handle both linear and nonlinear electricity load data; in particular, we improve the work of [21] to remove irrelevancy and redundancy of features. We also demonstrate how a modified entropy MI is applied to work systematically on load time series.

2. We propose auxiliary variables for the FS method based on four joint discrete random variables. Furthermore, the efficiency of the selected candidate features is examined based on ranking.

3. An accurate and robust MTLF (AR-MTLF) model based on the conditional restricted Boltzmann machine (CRBM) is proposed to forecast month-ahead hourly electrical loads. We use "robust" in our work to denote the efficiency of our proposed model in terms of execution time and the dynamic analysis of consumers' behaviors. In addition, the Jaya meta-heuristic optimization algorithm is used to improve the forecasting accuracy.

4. We analyze the consumers' energy consumption behaviors by adopting a discrete-time Markov chain (DTMC) that determines the state-dependent features. Furthermore, adaptive k-means [22] is used to classify electricity load into five groups, i.e., lowest, low, average, high, and extremely high consumption. In addition, it also derives the number of transitions, and the obtained values serve as the quantization levels of the load consumption.

5. The proposed model was implemented using the dataset of the GEFCom2012 US utility [23].
In addition, we compared the AR-MTLF model with the accurate fast converging STLF (AFC-STLF) model [21].

One related study used variables that help in producing the operational forecasts; in addition, different temperature scenarios were considered for load forecasting, and electricity price forecasting was performed by fitting a sparse linear regression to a large set of covariates. The authors of [29] used CRBM and factored CRBM (FCRBM) for energy forecasting, where the load profile is classified based on measured data. Load forecasting is performed for 15-min, 1-h, and one-week time resolutions. The simulation results show that aggregated active power consumption yields the best forecasting results as compared to the load demand of intermittent appliances. However, the proposed network does not consider the weather temperature and other factors that may affect load forecasting. Quilumba et al. [30] addressed the challenge of enhancing intra-day load forecasting by using a clustering method to detect groups of consumers with similar load consumption from smart meter data. K-means is used for clustering and an NN is used for load forecasting. The authors also focused on sub-hourly forecasts with various time horizons of up to one day ahead. However, the value of k in k-means must be known in advance, and an incorrectly chosen k can create inaccurate clusters. Similarly, the NN is prone to return solutions that are locally but not globally optimal. Singh et al. [31] proposed an intelligent data-mining model to forecast and analyze time series loads, which visually reveals several temporal consumption patterns. The operation of an appliance on an hourly, weekly, monthly, and seasonal basis defines patterns that are used to examine consumers' behavior. Unsupervised data clustering, frequent pattern analysis, and a Bayesian network are trained to forecast the time series loads.
In addition, the proposed model performs better than both the support vector machine and the multi-layer perceptron.
Smart meters are an important part of the power grid. The use of smart meter data helps to study and examine consumers' behaviors. Yuancheng et al. [19] analyzed the electricity behavior of consumers to improve the accuracy of load forecasting. At first, individual daily loads of consumers are examined using various forecasting horizons, i.e., workdays, the day before holidays, and holidays. Subsequently, electrical loads are grouped to classify consumers with the same behavior into a cluster. Furthermore, the electricity load forecasting of different groups is carried out using the online sequential extreme learning machine (OS-ELM). This approach provides a summary of the entire system's load as well as examines consumers' electricity behaviors extensively. Therefore, it improves the accuracy of the load forecasting via clustering of consumers, which uncovers the relationship between electricity behavior and cluster number. Although numerous works focus on STLF, in the smart grid, VSTLF is essential for facilitating and improving the quality of real-time electricity management. Yu-Hsiang Hsiao [32] proposed a novel model for VSTLF that examines households' data based on daily scheduling patterns and context information. Distinctive behavioral patterns are used to examine the everyday electricity consumption and context features from different sources. In addition, they are used to anticipate the behavioral patterns on a particular day. Thus, the volume of electricity consumption is modeled to predict an individual's behavioral patterns within a specific period of a day.
With the advent of the smart grid, many renewable energy resources such as solar and wind are introduced into the power system. This creates intricate power system loads, which makes STLF analysis difficult. Pei et al. [33] proposed an STLF framework to address these limitations. Clustering analysis classifies the daily load patterns of the different loads collected by smart meters. Afterwards, critical influential factors are determined by association analysis, and the established classification criteria are applied via a decision tree. Finally, the best load forecasting models for an individual's load patterns and the associated critical factors are selected. The selected models are used to produce the load forecast by aggregating the different load forecasting results and line losses. Table 1 provides the summary of the literature review in terms of the size of data, time resolutions, FS, techniques, and objectives.

Problem Statement
In the literature, several forecasting techniques mainly focus on conventional models such as fuzzy polynomial regression [8,9,11]. The conventional models are used to capture hidden information within the data. However, these models cannot capture the complex nonlinear relationships between time series factors, e.g., the daily time rhythm, which may cause substantial load forecasting errors. In addition, these models do not pay sufficient attention to the time-lag effects of external economic factors. Presently, machine learning techniques, such as ANNs, are used to forecast continuous time series and also provide adaptability [12,14,16,17]. However, the entire computational process is mostly a black box, which is less interpretable than conventional methods.
Nevertheless, the accuracy and convergence of machine learning techniques are not fully improved. For example, Liu et al. [34] proposed a hybrid ANN-based strategy to improve the forecasting accuracy. Despite this improvement, the strategy suffered from a low convergence rate and high complexity. Similarly, the authors of [35] enhanced the convergence rate through an ANN-based strategy, but achieved low forecasting accuracy. The authors of [36] further enhanced the work of [35] by incorporating an optimizer, which resulted in high execution time. In addition, Reference [21] improved the work in [35] by integrating a modified enhanced differential evolution algorithm (mEDE), modified MI-based FS, and an ANN. In the MI-based FS process of [21], the downsized inputs do not further reduce the training time, and information loss is observed. This is due to the unstable convergence of the mEDE and the inefficiency of the model when trained on a massive amount of data.
In this paper, we improve the forecasting model of [21] by integrating CRBM. CRBM is preferred over ANN because it is a multi-layered neural network capable of deep learning. In addition, CRBM uses the conditional hidden layer values as input to the next layer, whereas an ANN is fine-tuned on whatever labeled input it receives using the traditional back propagation training method. CRBM is trained with a predefined energy function, while ANN is trained by back propagation to achieve a least-squares objective. In this paper, the Jaya meta-heuristic optimization algorithm is used to minimize the forecasting error via an iterative process. Jaya is chosen over the mEDE algorithm used in [21] because mEDE requires parameter tuning, which may not guarantee the globally optimal solution, whereas Jaya does not require algorithm-specific control parameters such as mutation and crossover rates. In addition, we examine the dynamic behaviors of customers using adaptive k-means, which solves the problem of selecting k in k-means, and a Markov chain (MC) is used to formulate the dynamic behavior of consumers, which indicates that the future energy consumption state correlates with the present state.

System Model
Figure 1 depicts our proposed system model, which consists of FS, forecaster, optimizer, and customer dynamics modules. At first, the electrical load data are normalized. In the FS module, a modified entropy MI-based FS method is proposed to eliminate redundancy and irrelevancy from the data. It generates the candidate sets, and the candidates are sorted based on their ranking. Note that the candidates are designed based on the target, the average of the observed data, and the moving average of the data, which are partitioned into training, validation, and testing sets and used by the forecaster module for load forecasting. The forecasting error is minimized by the optimizer module through an iterative search process. A DTMC is used to examine the consumers' behaviors, and its states indicate the patterns of load consumption: lowest, low, average, high, and extremely high. In Figure 1, P denotes the probability matrix of the input data; u denotes the input to the forecasting model; W denotes the random weights for the hidden, visible, and history layers, respectively; and the forecasting model biases are denoted by "a" and "b" for the visible and hidden layers. The number of iterations is denoted by r; α is the learning rate; the subscript "data" signifies the settings of the forecasting model after it is fed with the training data; the subscript "recon" represents the settings of the forecasting model after the MC step is performed; and N and sigm denote the Gaussian and sigmoid functions, respectively. Figure 2 depicts the flowchart of our system model. Details of each step in the flowchart are presented in subsequent subsections of this paper. The processing step starts by combining the electrical load and the temperature data, which have an impact on the consumers' electricity consumption behaviors. Based on this fact, the moving average of the load data T_{h,d} of the dth day is calculated by Equation (1).

Data Preparation and Preprocessing
The total time horizon is denoted by z. Preprocessing ensures that zeros and outliers are removed from the data. In addition, the processed dataset is normalized to the range [0, 1] while maintaining the temporal order. Three datasets are created from the normalized data for training, validation, and testing.
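A minimal sketch of this preprocessing step (the helper names are hypothetical and the 3-sigma outlier rule is an assumption; the paper does not publish its code):

```python
import numpy as np

def preprocess(load):
    """Remove zero readings and outliers, then min-max normalize to [0, 1],
    keeping the temporal order of the series intact."""
    load = np.asarray(load, dtype=float)
    load = load[load > 0]                      # drop zero readings
    mu, sd = load.mean(), load.std()
    load = load[np.abs(load - mu) <= 3 * sd]   # drop 3-sigma outliers (assumed rule)
    return (load - load.min()) / (load.max() - load.min())

def split(series, train=0.5, val=0.25):
    """Chronological split into training, validation, and testing sets."""
    n = len(series)
    a, b = int(n * train), int(n * (train + val))
    return series[:a], series[a:b], series[b:]

data = [0.0, 5.0, 7.0, 6.0, 8.0, 5.5, 7.5, 6.5]
norm = preprocess(data)
tr, va, te = split(norm)
```

A chronological split (rather than a random one) is what "maintaining the temporal order" requires for time series data.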

Modified MI Based FS
A survey [37] provides a detailed discussion of the different types of FS techniques, such as filter and wrapper methods. In the filter methods, ordering-based variable selection is used, which is driven by variable ranking, e.g., correlation criteria and MI methods. In the wrapper methods, variable selection is performed by the predictor, e.g., heuristic search algorithms and sequential selection algorithms. Other FS methods include semi-supervised learning, unsupervised learning, and ensemble FS. In this paper, a modified entropy-based MI FS method is proposed to eliminate redundancy and irrelevancy of features by choosing the best subsets for accurate load forecasting. In this way, the curse of dimensionality is prevented.
The proposed MI-based FS method consists of four joint discrete random variables defined in Equation (2).
where the joint probability of the four discrete random variables is represented as pr(p, p^t, p^m, p^q), and pr(.) is a probability. The input discrete random variable is denoted as p_i, p^t_j denotes the target value, p^m_k represents the mean value, and p^q_l denotes the moving average data. We formulate the proposed MI-based FS method in Equation (3).
If MI(p, p^t, p^m, p^q) = 0, the variables are independent. If MI(p, p^t, p^m, p^q) > 0, they are slightly related. Otherwise, if MI(p, p^t, p^m, p^q) < 0, they are not related. Ahmad et al. [21] added the target data as the values of the previous day and the average behavior to improve the forecasting results of their model. However, adding an average behavior is not sufficient. In this paper, the temperature and the moving average of the target data are included in the proposed forecasting model to achieve high forecasting accuracy. In Equation (3), the proposed MI is coded to binary values using Equation (4).
We introduce an auxiliary variable τ_m for the individual elements, and the joint probability is given in Equation (6).

where τ_m ∈ {0, 1, . . . , 15}; τ_0m is the number of zeros, τ_1m is the number of ones, τ_2m is the number of twos, and so on. Figure 3 reports the simulation results for the auxiliary variables. Note that the empty spaces of auxiliary variables 8, 9, 10, and 11 indicate that there are no corresponding matching elements. The joint probability of the individual value of τ_m is calculated using Equation (7).
In the proposed MI-based FS method in Equation (7), L denotes the length of the input data. The candidates are sorted based on the values of MI. With the sorted values, redundancy and irrelevancy of features are eliminated. Based on the proposed forecasting model, the selected candidates, S_{1,1}, . . . , S_{1,n}, are coded to binary values using Equation (4).
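The ranking step above can be sketched as follows. This is an illustration under assumptions: the paper's exact MI expression (Equation (3)) is not reproduced here, so the sketch scores a candidate set by the entropy of the empirical joint distribution over the sixteen 4-bit auxiliary codes of Equation (6), and binarization against the median stands in for Equation (4):

```python
import numpy as np
from collections import Counter

def binarize(x):
    """Code a series to binary against its median (a stand-in for Equation (4))."""
    x = np.asarray(x, dtype=float)
    return (x > np.median(x)).astype(int)

def aux_codes(p, pt, pm, pq):
    """Pack the four binarized variables into one 4-bit auxiliary variable
    tau in {0, ..., 15}, mirroring Equation (6)."""
    return (binarize(p) << 3) | (binarize(pt) << 2) | (binarize(pm) << 1) | binarize(pq)

def joint_entropy_score(p, pt, pm, pq):
    """Entropy (in bits) of the empirical joint distribution over the 16 codes,
    used here as a ranking score for a candidate feature set."""
    codes = aux_codes(p, pt, pm, pq)
    probs = np.array([c / len(codes) for c in Counter(codes.tolist()).values()])
    return float(-(probs * np.log2(probs)).sum())

rng = np.random.default_rng(0)
load = rng.random(500)
score = joint_entropy_score(load, np.roll(load, 24), load, load)
```

Candidates would then be sorted by this score and thresholded, which is the ranking-based elimination described above.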

Forecaster Module
The forecaster module depicted in Figure 1 shows the configuration of the CRBM model, which is adopted from [29]. In our proposed model, CRBM performs three steps: cost function minimization, gradient update, and probability inference. Interested readers can find the details of these steps in the work by [29]. The aim of our proposed forecasting model is to forecast the energy consumption for a given time slot or series of time slots in the future. Given the historical energy consumption data, the time interval between any two measurements should be the same for both input and output. We consider the time slot k of the historical energy consumption as the current data. Here, the vector that denotes the historical energy consumption data is expressed in terms of P_k, the kth measurement. The forecasting model should be able to forecast the energy consumption for the next h time slots, where P_k denotes the kth measurement of the forecasted energy consumption data. The input vector of our proposed AR-MTLF model is composed as follows: I_k is the kth input to the hidden layer of the forecaster module, O_k is the kth output from the forecaster module, p^q_k is the kth moving average value, p^m_k is the kth mean value, p^t_k is the kth target value, p_k is the kth historical value, and F_k is the kth flag that defines whether the first forecasting time slot falls on a weekend. Note that, if the historical data for the previous four time slots are used as input to make the forecast for the next time slot, the fifth time slot is used as the input to the hidden layer of the forecasting model.
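A sketch of how such an input vector might be assembled (the feature layout, the 24-h lag for the target feature, the 24-h moving-average window, and the start date are illustrative assumptions; the paper only lists the components):

```python
from datetime import datetime, timedelta
import numpy as np

def build_input(history, k, window=4, start=datetime(2007, 1, 1)):
    """Assemble a forecaster input for time slot k: the previous `window`
    historical loads, a target feature (same hour, previous day), a mean
    feature, a 24-h moving-average feature, and a weekend flag F_k."""
    hist = history[k - window:k]                # p_{k-4}, ..., p_{k-1}
    p_t = history[k - 24]                       # target: same hour, previous day
    p_m = float(np.mean(hist))                  # mean of the recent history
    p_q = float(np.mean(history[k - 24:k]))     # 24-h moving average
    slot = start + timedelta(hours=k)
    f_k = 1.0 if slot.weekday() >= 5 else 0.0   # 1 if the slot falls on a weekend
    return np.concatenate([hist, [p_t, p_m, p_q, f_k]])

series = np.sin(np.arange(200) * 2 * np.pi / 24) + 1.0
x = build_input(series, k=100)
```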
To evaluate the accuracy of the proposed forecasting model, RMSE is used.
where A_k represents the kth actual load and Â_k denotes the kth predicted load. The value of N corresponds to the time trend considered, i.e., hourly, daily, weekly, or monthly. Note that, after sequences of iterations, the final value of RMSE becomes the validation error.
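The RMSE computation over N time slots can be sketched as:

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error over N time slots (hourly, daily, weekly, ...)."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

err = rmse([2.0, 4.0, 6.0], [1.0, 4.0, 8.0])  # sqrt((1 + 0 + 4) / 3)
```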

Optimizer Module
In this paper, the forecasting model's accuracy is improved using the Jaya optimization algorithm, which is adopted from [38]. It has been used to solve both unconstrained and constrained optimization problems [39]. We define the objective function as the minimization of the forecasting error, where N is the total time horizon. Table 2 provides the simulation parameters of the optimizer. Note that this paper aims at achieving, firstly, fast convergence, i.e., a short time spent by the system during simulation to execute the forecasting strategy, and, secondly, an acceptably low forecasting error.
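A minimal sketch of the Jaya update rule of [38], which, as noted above, needs no algorithm-specific control parameters (the population size, iteration count, and sphere-function objective are illustrative stand-ins for the forecasting-error objective):

```python
import numpy as np

def jaya(f, bounds, pop=20, iters=200, seed=0):
    """Minimal Jaya optimizer: each candidate moves toward the current best
    solution and away from the current worst; only common parameters
    (population size, iterations) are required."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    X = rng.uniform(lo, hi, size=(pop, len(lo)))
    F = np.apply_along_axis(f, 1, X)
    for _ in range(iters):
        best, worst = X[F.argmin()], X[F.argmax()]
        r1, r2 = rng.random(X.shape), rng.random(X.shape)
        # Jaya update: X' = X + r1*(best - |X|) - r2*(worst - |X|)
        Xn = np.clip(X + r1 * (best - np.abs(X)) - r2 * (worst - np.abs(X)), lo, hi)
        Fn = np.apply_along_axis(f, 1, Xn)
        better = Fn < F                      # greedy acceptance of improved candidates
        X[better], F[better] = Xn[better], Fn[better]
    return X[F.argmin()], float(F.min())

# Stand-in objective for the forecasting error: the sphere function.
x_best, f_best = jaya(lambda x: float((x ** 2).sum()), [(-5, 5), (-5, 5)])
```

In the paper's setting, the objective evaluated by `f` would be the validation RMSE of the CRBM forecaster rather than this toy function.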

Customers' Dynamic Behavior
In this section, we discuss the consumers' dynamic energy consumption behaviors using the approach shown in Figure 4. The related work on load profiling mainly focuses on a single residential customer, which shows weak regularity. It is important to note that the dynamic characteristics are best observed in aggregated consumers and can be illustrated using different consumers' load profiles. However, due to the randomness of these load profiles, the real consumption behavior of consumers cannot be efficiently examined. To handle this problem, a DTMC is used to formulate the dynamic behavior of consumers by considering the state-dependent features. This indicates that the future load consumption behaviors of consumers correlate with their present states.
To determine the number of states, we use adaptive k-means. To determine the state transitions, we use maximum likelihood estimation of the transition probabilities.

Relationship States
The relations and transitions between consumption behaviors in adjacent periods are known as dynamics [40]. In this paper, recording the dynamics as a grouping factor is required, which is useful in deriving vital information about the consumers' consumption patterns within the shortest period of time. It also helps to establish the potential demand response target and reduces the dimensionality of the dataset.
Based on the load profile, this paper uses the adaptive k-means from [41] to deduce the number of k Markovian states. We classify consumers' consumption into five states, namely lowest, low, average, high, and extremely high, which are denoted by 1-5 in Figure 5, respectively. To group the consumption into a well-defined classification, the k centroids obtained from the adaptive k-means are used to ascertain the required k states. The centroid values serve as the quantization levels of the load profile, and the quantized values are used to derive the number of transitions. Algorithm 1 illustrates the proposed adaptive k-means. To determine the state transition probability, we use the maximum likelihood estimation given in Equation (14) [42].
where n_{x,i} is the number of transitions from state x to state i and ∑_{i=1}^{t} n_{x,i} is the total number of transitions from state x. Table 3 shows the five-state transition probabilities obtained by transitioning from state x to state i in a single step; it is a square transition matrix. When pr_{5,2} = 0.0000, it means that, starting from state x on row 5, the probability of moving to state i on column 2 is zero; likewise, a zero entry in the reverse direction means the transition from state 2 to state 5 cannot occur. Hence, whichever of these states the consumers start in, the transition probability between them is zero; a consumer that cannot leave its state is absorbed in that state. The MC will either stay at the current state or move to an adjacent state. The transition probability is used to examine the current behavior of the consumer for decision making.

Algorithm 1: Adaptive k-means
 1: i ← 0                              ▷ initialize iteration counter
 2: j ← 0                              ▷ initialize iteration counter
 3: data ← double(actual dataset)
 4: index ← data(:)                    ▷ copy values as an array
 5: while true do
 6:     initialize the mean point M_1
 7:     i ← i + 1                      ▷ increment counter for each iteration
 8:     while true do
 9:         j ← j + 1                  ▷ increment counter for each iteration
10:         ds ← (index − M_1)^2       ▷ find the distance between index and data
11:         ▷ N is the number of data points
12:         best ← ds − ds_1           ▷ check whether it is selected accurately
13:         M_1^new ← mean(M_1(best))  ▷ update mean point
14:         if M_1^new then
15:             break
16:         end if
17:         if M_1 == M_1^new or j > β then    ▷ β is the distortion threshold
18:             j ← 0
19:             index(best) ← [ ]      ▷ remove values already assigned to a cluster
20:             Center(i) ← M_1^new    ▷ store center of cluster
21:             break
22:         end if
23:         update the mean point
24:     end while
25:     if index == 0 or i > β then    ▷ check the maximum number of clusters
26:         i ← 0
27:         break
28:     end if
29: end while
30: Center ← sort(Center)              ▷ sort centers
31: Center_new ← diff(Center)          ▷ find the differences between adjacent centers
32: Center(Center_new <= intercluster) ← [ ]   ▷ ignore cluster centers closer than the inter-cluster distance
33: distance ← data − Center           ▷ find the distance between clusters and data
34: choose the cluster index of minimum distance
35: return indx, Center

The choice of a suitable distance function is challenging, and a cluster analysis of the same dataset may yield different results under different distance functions. Choosing a wrong distance function may not capture the variability of the data correctly. For example, based on the experiments and surveys conducted by the authors of [43-45], we conclude that distance values differ across different distance functions. The distance function plays a vital role in the clustering algorithm, as the distance between two points depends on the properties of the data as well as the dimension of the dataset. In addition, when random initialization of centroids is used, different runs of k-means will produce different sums of squared errors (SSE). The experimental studies of [44,45] show that the Euclidean distance function requires more iterations than the Manhattan distance function, which makes k-means with the latter less computationally complex. Besides, the city block distance function shows better performance on both datasets (iris and wine) in terms of computational time as compared to the Euclidean and Manhattan distance functions. A well-known distance function is the Euclidean distance, which is used to analyze continuous numerical variables that reflect absolute distances; however, it does not remove redundancy. The Mahalanobis distance is also a popular distance function that is used for continuous numerical variables reflecting absolute distances while also removing redundancy. Besides, if we are concerned about making a distinction between variables, the family of Hellinger, species profile, and chord distances are appropriate; these distances are weighted by the overall quantity of each sample.
Hence, smaller distances are obtained when the variables of each sample are similar while the absolute magnitude varies. In this paper, we use the squared Euclidean distance, where each centroid is the mean point in the cluster, as defined in Equation (15).
where x is the observation, c is the centroid, and the total number of observations is denoted by M.
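Equation (15) and the resulting nearest-centroid assignment can be sketched as (the sample observation and centroids are illustrative):

```python
def sq_euclidean(x, c):
    """Squared Euclidean distance of Equation (15): skipping the square root
    preserves the nearest-centroid ordering while saving computation."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def nearest_centroid(x, centroids):
    """Assign an observation to the cluster with the closest centroid."""
    return min(range(len(centroids)), key=lambda j: sq_euclidean(x, centroids[j]))

idx = nearest_centroid([0.2, 0.3], [[0.0, 0.0], [0.25, 0.25], [1.0, 1.0]])
```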
The importance of using the squared Euclidean distance is that it avoids computing the square root while still yielding the squared distance between two data points, which saves computation costs. In the DTMC, we consider the random variables Y_1, Y_2, Y_3, . . . with the Markov property, which states that the probability of moving to the next state depends only on the present state and not on the previous states [42].
We verify this fact in our simulation by assuming a static state of the MC, and then we record the transition of the state using Equation (16).
If both conditions are well defined, i.e., pr(Y_{n+1} = y | Y_n = y_n) > 0, then the values of Y form the state space. An MC is mostly explained by a directed graph, where each edge carries the probability of going from one state at time n to another state at time n + 1. This can also be represented using a transition matrix from time n to n + 1. An MC can be assumed to be time-homogeneous, in which case the transition matrices are independent of n. A time-homogeneous MC can be described as a state machine that assigns a probability of moving from a state to an adjacent one [42]. Thus, the probability can be studied as the statistical behavior of the machine's state with the first element of the state space as input.
The probability of a transition does not depend on n. An MC can also be extended to an MC with memory, in which the future state depends on several past states. Certain properties of MCs [42] that are relevant to this paper are as follows.

1. Irreducible: An MC is irreducible if all states communicate with one another.

2. Periodicity: A state x has period d, the greatest common divisor (gcd) of the numbers of steps k at which a return to x can occur. The state is aperiodic if d = 1 and periodic if d > 1.

3. Transient: A state x is transient if, starting in state x, there is a nonzero probability that we never return to x.

4. Recurrent: A state is recurrent if the number of visits to it is infinite.

5. Absorbing: If, once in a state, it is impossible to leave it, the state is absorbing. Thus, a state x is absorbing if pr_{x,x} = 1 and pr_{x,y} = 0 for y ≠ x.
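The maximum likelihood estimate of Equation (14), together with the absorbing-state property above, can be sketched as follows (the state sequence is an illustrative example, not data from the paper):

```python
import numpy as np

def transition_matrix(states, n_states=5):
    """Maximum likelihood estimate of Equation (14):
    pr[x, i] = n_{x,i} / sum_i n_{x,i}, counted from an observed state
    sequence (states coded 0..4 for lowest ... extremely high)."""
    counts = np.zeros((n_states, n_states))
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1                      # count one-step transitions
    rows = counts.sum(axis=1, keepdims=True)
    rows[rows == 0] = 1.0                      # leave never-visited rows as all zeros
    return counts / rows

def is_absorbing(P, x):
    """A state x is absorbing iff pr[x, x] = 1 (so pr[x, y] = 0 for y != x)."""
    return P[x, x] == 1.0

seq = [0, 0, 1, 2, 1, 0, 1, 1, 2, 3, 4, 4, 4]
P = transition_matrix(seq)
```

Each row of `P` sums to one, matching the row-stochastic transition matrix of Table 3.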

Simulations and Discussion
The performance of our proposed AR-MTLF was tested using a dataset comprising 5000 residential customers taken from GEFCom2012 [23]. The dataset consists of four years of hourly electrical load and temperature data across 21 zones of a US utility. It is split into a training set (2004-2005), a validation set (2006), and a testing set (2007). For example, Figure 6 shows the electricity load dataset for Zone 1 (Z_1). Note that the historical datasets used for the simulations were chosen because they are widely used in the forecasting community [46,47]. All simulations were performed in MATLAB 2018 on a personal computer with a 64-bit processor and 8 GB RAM.
To show the performance of the AR-MTLF model, we compared its forecasting results with the AFC-STLF-, ANN-, NB-, KNN-, SVR-, and ensemble-based forecasting models. For interested readers, the details of these algorithms are discussed in the work by [48]. We discuss the simulation results in the following subsections. Section 5.1 presents the discussion of hourly load forecasting. Section 5.2 discusses seasonal load forecasting. Section 5.3 presents the simulation results based on performance evaluations in terms of forecasting error and convergence rate. Section 5.5 provides the results of consumers' load consumption dynamics. We considered Zones 1-6 (Z_1-Z_6) for further evaluations. As shown in Figure 3, the total number of elements in the MI-based FS is 2500 using Equation (6). Each value of the discrete random variables is derived using Equations (2) and (4). We obtain the time-lag data from the temperature data and the moving average using Equation (1). We define a threshold of 0.05 for the proposed MI-based FS method. Note that the threshold value of 0.05 is derived from the median of the normalized dataset. Candidate sets greater than the defined threshold are selected as the irrelevant features; otherwise, they are selected as the redundant features. The selected features are used by the forecasting model, and the entire process is described in Figure 2. Our proposed MI-based FS method is used to solve the probability distribution of the joint entropy variables, and the continuous features are discretized into 15 partitions. Table 4 shows the probability results for MI of the 15 binary partitions. It is observed that binary partitions 1-4 and 9-12 have probabilities of zero, which means that the feature variables are independent. On the other hand, partitions 5-8 and 13-15 have probabilities of 0.91, which means that the feature variables are slightly related.
In Table 5, we denote F_1 as the historical data, while F_2, F_3, and F_4 denote the target value, mean, and moving average value, respectively. For example, when features F_1 and F_2 are selected, we evaluate F_3 and F_4 by making an equivalent combination C = {F_1, F_2, F_3, F_4}. Because the combinations of F_1, F_2, F_3, and F_4 are (0,0,0,0), (0,0,0,1), (0,0,1,0), (0,0,1,1), (0,1,0,0), (0,1,0,1), (0,1,1,0), (0,1,1,1), (1,0,0,0), (1,0,0,1), (1,0,1,0), (1,0,1,1), (1,1,0,0), (1,1,0,1), (1,1,1,0), and (1,1,1,1), we assign 0 = (0,0,0,0), 1 = (0,0,0,1), 2 = (0,0,1,0), 3 = (0,0,1,1), 4 = (0,1,0,0), and so on until 15 = (1,1,1,1). If F_3 > F_4, then F_3 will be selected when its binary value is 1; however, if MI(C, F_3) = 0.0 and MI(C, F_4) = 0.91, it implies that F_4 is more relevant to C than F_3. Nevertheless, F_4 is redundant to the combination C, while F_3 and C complement each other (i.e., MI(C, F_3) = 0.0). Note that this approach is not restricted to the selection of a single feature; multiple features can be selected as well. In addition, the performance of the proposed MI-based FS method depends on the forecasting model.

Table 4. MI for the four joint discrete random variables.

Hourly Load Forecasting
The hourly demand for electricity has periodic trends that can be hourly, daily, weekly, monthly, and yearly, which immediately creates the calendar variables. Figure 7 shows the 24-h load forecasting of Z_1. From the results, it is observed that the SVR and ensemble models over-forecast the actual load; the reasons are explained in Section 5.2. The KNN and NB models do not learn the actual load, but only memorize it. The AR-MTLF and AFC-STLF forecasts are not far from the actual load. Slight rises and falls are observed in the forecasted values relative to the actual load, which is due to consumers' behaviors. The results show that, at the start of the day, a consumer's electrical consumption rises and falls alternately until the peak hours and drops suddenly after the peak hours. Thus, it illustrates the behavior of electrical energy consumption for a consumer. The electrical load consumption during 16-18 June 2007 is higher than that during the working days, which occurs when cooling loads are mostly active. The sudden decrease of consumers' load means that the utility restricts the use of power by switching off the energy supply in these periods (20-23). Another plausible explanation could be that a higher temperature can spur load demand as the air conditioning continues to operate. On the other hand, the results in Figure 9 show that heat pumps or other heating devices continue to operate during 15-17 November of the severe winter periods. The continuous use of heating equipment increases the electrical energy consumption, which is observed during 15-20 November. From the figure, low electrical energy consumption is observed during the start of the week, when the temperature was high and people did not use the heating equipment. Figure 11 shows the month-ahead hourly load forecasting for a year.
From the results, it is observed that the high electrical energy consumption during the extreme winter and summer periods is influenced by the rise and fall of temperatures. The forecasting results clearly show that our proposed model AR-MTLF outperforms the other models. Note that the NB model could not perform well for the following reasons. The NB model depends on class labels and assumes that the features are conditionally independent given the output class, which rarely holds for load data. In addition, the NB model requires the continuous features to be discretized, and a lot of information is lost in the process. Furthermore, in the NB model, the classes may be unbalanced, since the method of deriving class labels is based on assumption.

Load Forecasting Based on Seasons
In the ANN model, the curse of dimensionality problem is observed, where the approximation quality degrades as the dimensionality of the input space grows. Other problems of the ANN model include slow convergence, poor generalization, and over-fitting. The KNN model, by our assumption, suffers from a learning issue: it does not build a model but stores the training data for its classification, and a wrong choice of the value of k in KNN can affect the results. The ensemble model, on the other hand, has a diversity problem in training the dataset of the six zones. In the SVR model, we select the kernel based on trial and error; we noticed that, as the sample size increases, SVR becomes inefficient. The simulation results confirm that our proposed AR-MTLF achieves higher accuracy than AFC-STLF using the proposed modified MI FS technique as well as the deep learning method. This further indicates the importance of our proposed model in resolving the MTLF problem with respect to forecasting accuracy and convergence rate. In addition, the benefit of our scheme over the existing schemes is that it does not memorize the training dataset and can also be implemented on devices with low memory. Furthermore, our proposed scheme is scalable to large datasets, unlike the ANN and SVR methods. The multi-layer structure of our proposed scheme does not suffer from the limited representational power of the ANN and ensemble methods. Besides, once the data are trained, our proposed scheme allows additional layers to be added.

Proposed Model's Performance Evaluations
RMSE is used to measure the forecasting accuracy of a model; a smaller RMSE means a higher accuracy of the forecasting model. Figure 12 shows the forecasting accuracy of the existing models: RMSE = 8.22 for SVR, RMSE = 9.01 for NB, RMSE = 8.92 for Ensemble, RMSE = 0.98 for KNN, and RMSE = 5.81 for ANN. It is noticed that the AFC-STLF model has RMSE = 0.42, higher than the RMSE = 0.32 of our proposed AR-MTLF model. The results show that our proposed model outperforms the existing models with the smallest RMSE value. Figure 13 depicts the comparison of the proposed model with the existing models based on execution time. The execution times observed for our proposed AR-MTLF and the AFC-STLF models are 100.00 and 125.50 s, respectively. The AR-MTLF model minimizes execution time because its optimizer finds the global optimal solution with less execution time than the AFC-STLF model. However, there is no general way to verify that a solution is the global optimum. Moreover, in some cases, the fitness of the global optimum, or some bounds on its value, may be known. In our case, we assess the optimality of the solution by inspecting the fitness value. In addition, we consider the number of iterations and the stopping criteria of the Jaya-based optimization algorithm to determine whether the optimum solution is reached. Furthermore, if the current solutions are better than the former solutions, the current solutions are taken as the optimum solutions once the stopping criteria are reached. In the figure, the other existing models have less execution time than the AR-MTLF and AFC-STLF models, which incorporate optimization techniques such as Jaya and mEDE. Table 7 shows the execution time in seconds for all models. From the results, the highest execution times are observed for AR-MTLF and AFC-STLF. The high execution time occurs because of the extra time spent in implementing the FS and optimization algorithms.
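The RMSE metric used for the comparison above can be computed as follows; the toy actual/forecast series and the function name are illustrative, not the paper's data.

```python
# Minimal sketch of the RMSE metric used to compare forecasting models.
# The actual/forecast values below are illustrative assumptions.
from math import sqrt

def rmse(actual, forecast):
    """Root-mean-square error between two equal-length load series."""
    return sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast))
                / len(actual))

actual   = [100.0, 110.0, 95.0, 120.0]  # observed hourly loads (toy)
forecast = [101.0, 108.0, 97.0, 119.0]  # model outputs (toy)
print(round(rmse(actual, forecast), 3))  # → 1.581
```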
Note that there is a trade-off between forecasting accuracy and execution time, as shown in Tables 6 and 7, respectively. Figure 14 shows the convergence of our proposed AR-MTLF model. The convergence starts at the 20th iteration. The results clearly indicate that the model achieves a global optimum solution within a reasonable number of iterations. In addition, our proposed AR-MTLF FS process reflects good performance for the lagged temperature data and the average observed data. We show the RMSE values for the different time periods using a heat map in Figure 15. The heat map shows only the first week to avoid verbosity.
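The Jaya update rule that drives this convergence can be sketched as below. This is a hedged illustration on a simple sphere objective, not the paper's CRBM tuning: the objective, population settings, bounds, and function names are all assumptions made for the example.

```python
# Sketch of the Jaya metaheuristic: each candidate moves towards the
# best solution and away from the worst, with greedy replacement.
# Objective and settings are illustrative assumptions.
import random

def jaya_minimise(objective, dim=2, pop=10, iters=150,
                  lo=-5.0, hi=5.0, seed=1):
    rng = random.Random(seed)
    swarm = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop)]
    for _ in range(iters):
        scores = [objective(x) for x in swarm]
        best = swarm[scores.index(min(scores))]
        worst = swarm[scores.index(max(scores))]
        for i, x in enumerate(swarm):
            # Jaya rule: x' = x + r1*(best - |x|) - r2*(worst - |x|)
            cand = [min(max(xj + rng.random() * (bj - abs(xj))
                            - rng.random() * (wj - abs(xj)), lo), hi)
                    for xj, bj, wj in zip(x, best, worst)]
            if objective(cand) < scores[i]:  # keep only improvements
                swarm[i] = cand
    return min(swarm, key=objective)

sphere = lambda x: sum(v * v for v in x)  # toy fitness to minimise
best = jaya_minimise(sphere)
print(sphere(best))  # fitness close to 0 after convergence
```

Because replacement is greedy, the best fitness is monotonically non-increasing, which is why inspecting the fitness value across iterations is a practical stopping check.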

Operational Forecast
We consider the MTLF distribution of consumers' electricity load for seven consecutive days using the temperature values for a year. However, the forecasting results do not reflect the uncertainty in the temperature; hence, the confidence interval (CI) is small. The CI tells us the range of values covered by the given distribution. Since we are concerned with capturing the true distribution of the consumers' electrical load consumption, a wider interval would be much better. Thus, the accuracy of forecasting increases with the width of the CI; however, precision may decrease. Figure 16 shows the MTLF distribution of consumers' electrical load for seven consecutive days. It is observed that our proposed AR-MTLF model ensures accuracy and precision. We also consider the true distribution of the consumers' electrical load, while the certainty of the future temperature is unknown. To resolve this issue, we average the consumers' electrical load and the temperature to derive a new MTLF distribution [28]. Note that we perform forecasting on the new MTLF distribution and examine the distribution of 30 consecutive days with a CI of 95%, as shown in Figure 17. It is observed that this CI is much larger than the CI of Figure 16, which means that more of the uncertainty in the future temperature is captured.
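A 95% CI of the kind discussed above can be computed as in the sketch below, using a normal approximation around the sample mean; the sample load series and the helper name are illustrative assumptions, not the paper's data.

```python
# Sketch of a 95% confidence interval for a forecast load distribution,
# using the normal approximation mean ± 1.96 * sd / sqrt(n).
# The sample series is an illustrative assumption.
from math import sqrt

def confidence_interval_95(samples):
    n = len(samples)
    mean = sum(samples) / n
    sd = sqrt(sum((s - mean) ** 2 for s in samples) / (n - 1))
    half = 1.96 * sd / sqrt(n)  # normal-approximation half-width
    return mean - half, mean + half

loads = [98.0, 102.0, 100.0, 97.0, 103.0, 101.0, 99.0]  # toy daily loads
lo, hi = confidence_interval_95(loads)
print(round(lo, 2), round(hi, 2))  # → 98.4 101.6
```

A wider interval (larger `hi - lo`) covers more of the underlying distribution, which is the trade-off between accuracy and precision noted above.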

Consumers' Consumption Dynamics
A way of determining the unique behavior of consumers' consumption is to partition the consumers into groups. An adaptive k-means is applied to the entire dataset and, afterwards, the number of centroids k is obtained. Five groups are formed from the k centroids, which are distinguished in the bar chart shown in Figure 5. Each bar in the figure represents electricity consumption, and consumers with the same behavior are placed in the same bar. It can be observed that the energy consumption of the different bars is not equally distributed: by our assumption, 90% of the consumption belongs to the last large bar, whereas the rest is distributed over the other four bars. In addition, consumers' consumption in the same bar relates to their electricity consumption behaviors over a specific period of time.
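The grouping step can be sketched as plain k-means on one-dimensional consumption values; the paper's adaptive variant chooses k itself, whereas this toy uses a fixed k = 2, and the consumption figures and function name are illustrative assumptions.

```python
# Simplified sketch of grouping consumers by consumption level with
# 1-D k-means (fixed k; the paper's adaptive variant derives k itself).
# All numbers are illustrative assumptions.
def kmeans_1d(values, k, iters=50):
    # Spread initial centroids across the sorted values.
    centroids = sorted(values)[:: max(1, len(values) // k)][:k]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            # Assign each value to its nearest centroid.
            idx = min(range(k), key=lambda i: abs(v - centroids[i]))
            groups[idx].append(v)
        # Recompute each centroid as the mean of its group.
        centroids = [sum(g) / len(g) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return centroids, groups

# Mean daily consumption (kWh) of ten hypothetical consumers.
consumption = [2.1, 2.3, 2.0, 8.9, 9.2, 9.0, 2.2, 9.1, 2.4, 8.8]
centroids, groups = kmeans_1d(consumption, k=2)
print(sorted(round(c, 2) for c in centroids))  # → [2.2, 9.0]
```

Each resulting group corresponds to one bar in Figure 5: consumers whose consumption falls nearest the same centroid share a behavior class.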
We are rarely concerned with the consumers' load consumption dynamic characteristics in all periods; we rather concentrate on a specific period of time. A DTMC process is employed on the sequence of demand response of dynamic consumers' load consumption behavior from one state to another. This is possible only if we consider the grouping of consumers' consumption in different adjacent periods. Figure 18 shows the transition states of the consumers' load consumption behaviors in different adjacent periods. We observe that the consumers' load consumption behaviors are viewed in five adjacent periods: let Period 1 be the lowest consumption; Period 2 be the low consumption; Period 3 be the average consumption; Period 4 be the high consumption; and Period 5 be the extremely high consumption. It is worth mentioning that all periods belong to the aperiodic class. All periods are transient, and the dynamic behaviors of consumers show much diversity, since consumers' load consumption depends on temperature as well as on the activities of daily living (ADL), i.e., the classification of consumers' activities and tasks in the house. A consumer can move from one state to another without being absorbed in any state.
This makes it possible to study consumers' load consumption behavioral patterns. Figure 19 shows the probability distribution of each state after 20 simulation iterations. In the figure, it is observed that the consumers' average energy consumption probability of 0.27 is the highest among the energy consumption classes. It means that the consumers have applied the demand response strategy to reduce their cost of electricity. Because of the different events of a day, consumers may change their states from average consumption to extremely high consumption to satisfy their load demands. To avoid being charged at a significantly higher rate than normal for all energy consumption, consumers immediately change their states from extremely high consumption to high consumption and subsequently to very low consumption. The ability to measure the electricity consumption of consumers throughout the period can provide an insight into the consumers' ADL. This helps both the consumer and the utility in proper electricity management and planning.
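The DTMC computation behind such state probabilities can be sketched as below: a state distribution is propagated through a transition matrix for 20 steps. The 3-state matrix and state names are illustrative assumptions, not the paper's estimated 5-state chain.

```python
# Sketch of the DTMC analysis: propagate a state distribution through
# a transition matrix for 20 steps. The 3-state matrix below is an
# illustrative assumption, not the paper's estimated 5-state matrix.
def step(dist, P):
    """One DTMC step: new_j = sum_i dist_i * P[i][j]."""
    n = len(dist)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

# Toy states: low, average, high consumption. Rows sum to 1.
P = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.4, 0.5],
]
dist = [1.0, 0.0, 0.0]  # start in the "low consumption" state
for _ in range(20):
    dist = step(dist, P)
print([round(p, 3) for p in dist])  # → [0.283, 0.413, 0.304]
```

After 20 steps the distribution is essentially stationary, which is how a single per-state probability (such as the 0.27 for average consumption in Figure 19) can be read off.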

Conclusions
MTLF is an emerging paradigm for electricity load forecasting. Many MTLF methods in the literature focus on the forecasting of daily peak load, daily energy consumption, and annual peak load consumption. However, this work focuses on month ahead hourly electricity load forecasting, which is important for the grid's maintenance planning and for harmonizing energy sharing arrangements. In addition, a modified MI-based FS model is proposed to eliminate redundancy and irrelevancy of features from the dataset. The proposed AR-MTLF model is used for electricity load forecasting and resolves the limitations of the AFC-STLF model through its ability to learn from massive amounts of data with less computational overhead. From the forecasting results, the relationship between temperature and electricity loads is examined. The existing AFC-STLF model focuses on day-to-day industrial smart grid operations using lagged input samples; however, month ahead hourly load forecasting is not considered in the AFC-STLF model. In the FS process of AFC-STLF, the downsized inputs do not reduce the training time, and information loss is observed. This is due to the unstable convergence of mEDE and the inefficiency of the model in learning from a massive amount of data. The AR-MTLF model is proposed to overcome the stated trade-offs. The newly proposed AR-MTLF achieves approximately 99.68% accuracy as compared to 99.58% for AFC-STLF. In addition, the AR-MTLF model achieves an execution time reduction of up to 54.64%, as compared to 46.12% for AFC-STLF. Furthermore, we compared our proposed model with the KNN, ANN, NB, SVR, and Ensemble forecasting models, and the results clearly show that our model outperforms its counterparts.
This paper also proposes a novel approach to group consumers based on their energy consumption behaviors. Since the behaviors of consumers change from time to time, the need to examine their behaviors has become imperative to ensure proper management and planning. A DTMC is performed to uncover the typical electricity consumption dynamics and divide consumers into several distinct groups using the adaptive k-means.
As consumers do not follow well-defined energy consumption patterns, there is a tendency that consumers' behavior will be repeated. Thus, if we can learn consumers' behavior, we may be able to deduce their next behavior. Based on this fact, our future work aims to employ reinforcement learning for real-time feedback. With this approach, consumers can be modeled through the set of actions they establish over time.
Author Contributions: All authors agreed on the main idea. O.S., R.J.U.H.K., and H.F. implemented the proposed schemes and also wrote the proposed system models and results. F.A.A., M.S. and M.K.A. wrote the rest of the paper and organized and refined the paper as well. All authors together responded to the reviewers' comments. N.J. supervised the overall work. All authors have read and agreed to the published version of the manuscript.
Acknowledgments: This work was funded by the National Research Foundation of Korea (NRF) through the Brain Korea 21 Plus Program under Grant 22A20130012814.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this paper:

ANN        Artificial neural network
AFC-STLF   Accurate fast converging short-term load forecasting
AG         Antigen
AIS        Artificial