Short Term Electrical Load Forecasting Using Mutual Information Based Feature Selection with Generalized Minimum-Redundancy and Maximum-Relevance Criteria

Abstract: A feature selection method based on the generalized minimum redundancy and maximum relevance (G-mRMR) is proposed to improve the accuracy of short-term load forecasting (STLF). First, mutual information is calculated to analyze the relations between the original features and the load sequence, as well as the redundancy among the original features. Second, a weighting factor selected by statistical experiments is used to balance the relevance and redundancy of features when using the G-mRMR. Third, each feature is ranked in descending order according to its relevance and redundancy as computed by G-mRMR, and a sequential forward selection method is utilized for choosing the optimal subset. Finally, an STLF predictor is constructed based on random forest with the obtained optimal subset. The effectiveness and improvement of the proposed method were tested with actual load data.


Introduction
Short-term load forecasting (STLF) predicts future electric loads over a prediction horizon ranging from one hour up to several days. The primary targets of smart grids, such as reducing the difference between peak and valley electric loads, absorbing large-scale renewable energy, demand-side response, and optimal economic operation of the power grid, all require accurate STLF results [1]. In addition, with the development of competitive electricity markets, an accurate STLF is an important basis for drafting a reasonable electricity price and improving the stability of electricity market operation [2].
The existing STLF methods can be divided into traditional methods and artificial intelligence methods. Among the traditional methods, autoregressive integrated moving average (ARIMA) [3], regression analysis [4], Kalman filtering [5], and exponential smoothing [6] are commonly used. The combination of autoregressive and moving-average terms makes ARIMA a good time series model for STLF [7]: according to the historical time-varying load data, an ARIMA model is established and applied to predict the forthcoming electrical load. Regression analysis uses historical data to establish simple but highly efficient regression models [8]. The Kalman filter improves the accuracy of STLF by apportioning the load into random and fixed components and estimating each of them. Exponential smoothing eliminates the noise in the load time series, and the degree to which recent load data influence the future load can be reflected by adjusting the weights of the data, which helps improve the accuracy of STLF [9]. Overall, the traditional STLF methods can analyze the linear relationships between input and output, but not the nonlinear ones [10]. If the load presents large fluctuations caused by environmental factors, the traditional methods may provide inaccurate forecasts.
In recent years, predictors based on artificial intelligence algorithms have been widely used in the STLF of power systems [10][11][12][13][14][15][16][17]. Methods such as fuzzy logic [14], expert systems [16,17], artificial neural networks (ANNs) [18,19], and support vector machines (SVMs) [20,21] are currently used in STLF. Fuzzy logic methods divide the input and the output into different kinds of membership functions, and the relationship between input and output is then established by a set of fuzzy rules [22]. However, fuzzy systems with simple if-then rules lack the self-learning and adaptive ability needed to learn the input information effectively. An ANN acquires the complicated nonlinear relationship between input and output variables by learning from the training samples. However, there is no scientific way of determining the optimal network architecture when establishing an ANN model; in addition, ANNs encounter the problems of local optima and over-fitting [15,23]. SVMs overcome these deficiencies by solving a quadratic programming problem that yields the global optimal solution, which gives the SVM many advantages over an ANN. However, the SVM parameters, such as the type and variance of the kernel function and the penalty factor, are selected empirically. To achieve the optimal parameters, SVMs combined with genetic and particle swarm optimization algorithms have been utilized [24,25]. The random forest (RF) combines classification and regression trees (CARTs) with the bagging learning method. By randomly sampling from the training samples and randomly selecting features for node splitting, the RF resists noise and is largely free from over-fitting [26]. Furthermore, in actual practice only two parameters (the number of trees and the number of features for node splitting) need to be set when RF is applied to STLF [15], making RF highly suitable for this task.
Considering the effect of various factors, artificial intelligence methods analyze the complicated nonlinear relationships between power load and related factors to achieve higher prediction precision. However, the features that the predictor employs influence both the accuracy and the efficiency of STLF. Therefore, a feature selection schedule should be generated for choosing the optimal feature subset for a predictor. The common features, including historical load, time, and meteorology, are used for STLF modeling [11,27,28]. Historical load reflects the variation of the load accurately and contains plenty of information. Time features, such as the hour of day, the day of week, and whether the day is a working day, can also indirectly reveal the load pattern. In addition, short-term power load is mainly affected by changing weather conditions, which have a strong correlation with load demand. Accurate meteorological information from numerical weather prediction (NWP) can improve the accuracy of STLF effectively; conversely, NWP errors will reduce it [29].
Feature selection is the process of choosing the most effective features from an original feature set. An optimal feature subset extracted from a given feature set can improve the efficiency and accuracy of the predictor in STLF [30]. The manner of selecting features has become a hot topic in short-term load forecasting research. Reference [31] adopted conditional mutual information for feature selection: the mutual information values between features and load were measured, the features were ranked by these values, and the first 50 features were retained, filtering out the irrelevant and weakly relevant features. Reference [10] constructed an original feature set using phase space reconstruction theory, and the correlation between features and load was analyzed to discover the optimal feature subset. In reference [29], mutual information was applied to extract the effective features from the weather features, and historical load data features were also extracted to improve the accuracy of holiday load forecasting. Reference [32] used a memetic algorithm to extract a proper feature subset from an original feature set for medium-term load forecasting. Reference [33] analyzed the daily and weekly patterns by the autocorrelation function and chose 50 features as the best features for very short-term load forecasting. Mutual-information-based feature selection was used in reference [23]: by calculating the mutual information values between the feature vectors and the target variable, a lower-bound criterion was defined to filter the features, and the optimal feature subset was obtained for STLF. All of these studies [10,23,29,31-33] made important contributions to feature selection in STLF. However, these feature selection methods only analyzed the correlation between features and load; the redundancy among the features was not considered.
To improve the accuracy of STLF, mutual-information-based generalized minimum-redundancy and maximum-relevance feature selection combined with RF is proposed. First, an original feature set is formed by extracting historical load features and time features from the original load data. Second, G-mRMR is used to generate the candidate features, ranked in descending order. Third, the sequential forward selection (SFS) method and a decision criterion based on the mean absolute percentage error (MAPE) are utilized to obtain the optimal feature subset by adding one feature at a time to the input feature set of RF. Finally, the RF-based predictor is constructed with the optimal feature subset. The proposed method is validated through STLF experiments using actual load data from a city in Northeast China, and the experimental results are compared with those of different feature selection methods and predictors.

Mutual Information-Based Generalized Minimal-Redundancy and Maximal-Relevance
The minimum-redundancy and maximum-relevance (mRMR) method uses mutual information (MI) to measure the dependence between two variables. The MI-based mRMR not only considers the effective information between a feature and the target variable, but also accounts for the repetitive information among features [34]. It has the advantage of obtaining helpful features accurately when dealing with high-dimensional data.
Given two random variables X and Y, the MI between them can be estimated as:

I(X; Y) = Σ_{x∈X} Σ_{y∈Y} P(x, y) log [P(x, y) / (P(x) P(y))]  (1)

where P(x) and P(y) are the marginal probability density functions, and P(x, y) is the joint probability density function.
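For discrete variables, the MI defined above can be estimated with empirical (plug-in) probabilities. The following minimal sketch (function name illustrative, not from the paper) computes MI in bits from paired samples:

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired samples of two
    discrete variables, using empirical marginal and joint frequencies."""
    n = len(xs)
    px = Counter(xs)            # marginal counts of X
    py = Counter(ys)            # marginal counts of Y
    pxy = Counter(zip(xs, ys))  # joint counts of (X, Y)
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        p_marg = (px[x] / n) * (py[y] / n)
        mi += p_joint * log2(p_joint / p_marg)
    return mi

# A perfectly dependent binary pair carries 1 bit of information,
# while an independent-looking pair with a flat joint carries none.
dep = mutual_information([0, 1, 0, 1], [0, 1, 0, 1])
ind = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

In practice the load sequence is continuous, so the paper would discretize it (or use a density estimator) before applying such a plug-in estimator.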
The target of MI-based feature selection methods is to find a feature subset J with n features that has the largest dependency on the target variable l, out of a feature set F_m with m features (n < m).
The maximum-relevance criterion uses the mean value of the MI between each feature x_i and the target l:

max D(J, l),  D = (1/|J|) Σ_{x_i∈J} I(x_i; l)  (2)

The redundancy indicated by the MI value describes the overlapping information among features: a larger MI signifies more overlapping information, and vice versa. In the process of feature selection, the features selected by the maximum-relevance criterion alone can be highly redundant, and redundant features, carrying similar information to previously selected features, cannot improve the accuracy of the predictor. Therefore, the redundancy among features should also be evaluated in the process of feature selection.
The minimum-redundancy criterion requires minimum dependency among the features:

min R(J),  R = (1/|J|²) Σ_{x_i, x_j∈J} I(x_i; x_j)  (3)

The mRMR criterion combines Equations (2) and (3):

max [D − R]  (4)

Generally, an incremental search method is used to search for the optimal features [34]. Supposing a feature set J_{n−1} with n−1 features has been selected, the aim is to select the nth feature from the remaining set {F_m − J_{n−1}} according to Equation (4). The incremental search condition is:

mRMR: max_{x_j∈F_m−J_{n−1}} [ I(x_j; l) − (1/|J_{n−1}|) Σ_{x_i∈J_{n−1}} I(x_j; x_i) ]  (5)

where |J_{n−1}| refers to the number of features in J_{n−1}.
Restructuring Equation (5) with a weighting factor α to balance the redundancy and relevance of the feature subset yields the generalized mRMR (G-mRMR) [35]:

G-mRMR: max_{x_j∈F_m−J_{n−1}} [ (1 − α) I(x_j; l) − α (1/|J_{n−1}|) Σ_{x_i∈J_{n−1}} I(x_j; x_i) ]  (6)
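A minimal sketch of the incremental G-mRMR ranking follows, assuming the weighted criterion score(x_j) = (1 − α)·I(x_j; l) − α·mean over selected features of I(x_j; x_i); the exact weighting in [35] may differ, and the MI values are passed in precomputed here:

```python
def g_mrmr_rank(features, mi_with_target, mi_between, alpha):
    """Rank all features by incremental G-mRMR under the assumed criterion
    score(x) = (1 - alpha) * I(x; l) - alpha * mean_{s in J} I(x; s).
    `mi_with_target[f]` holds I(f; l); `mi_between[(f, g)]` holds I(f; g)
    keyed on the sorted pair. A larger alpha penalizes redundancy more."""
    remaining = list(features)
    # The first pick is pure maximum-relevance: the largest I(f; l).
    selected = [max(remaining, key=lambda f: mi_with_target[f])]
    remaining.remove(selected[0])
    while remaining:
        def score(f):
            redundancy = sum(mi_between[tuple(sorted((f, s)))]
                             for s in selected) / len(selected)
            return (1 - alpha) * mi_with_target[f] - alpha * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected  # candidate list in descending G-mRMR order

# Feature 'b' is nearly as relevant as 'a' but highly redundant with it,
# so the weakly relevant yet non-redundant 'c' is ranked ahead of 'b'.
mi_l = {'a': 0.9, 'b': 0.85, 'c': 0.2}
mi_ff = {('a', 'b'): 0.8, ('a', 'c'): 0.05, ('b', 'c'): 0.05}
ranking = g_mrmr_rank(['a', 'b', 'c'], mi_l, mi_ff, alpha=0.5)
```

The toy example shows the intended effect of the redundancy term: a relevant but duplicated feature is pushed down the ranking list.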

Random Forest
The random forest (RF) is a machine learning algorithm presented by Leo Breiman that integrates the classification and regression tree (CART) and the bagging algorithm [26]. An RF generates many different CARTs by sampling with replacement, wherein each CART produces one result. The final forecasting result is obtained by averaging the results of all CARTs.

CART
The CART employs binary recursive partitioning for solving classification and regression problems [36]. A CART, which consists of a root node, non-leaf nodes, branches, and leaf nodes, is shown in Figure 1. Each non-leaf node is split according to the Gini index as the CART grows.

Supposing there is a dataset D with d samples that includes C classes, the Gini index of D can be defined as:

Gini(D) = 1 − Σ_{i=1}^{C} (d_i/d)²  (7)

where d_i is the number of samples in the ith class. Afterward, a feature f is used to divide D into subsets D_1 and D_2, wherein the Gini index after the split is:

Gini(D, f) = (|D_1|/d) Gini(D_1) + (|D_2|/d) Gini(D_2)  (8)

Entropy 2016, 18, 330 5 of 19
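The Gini index of a node and its weighted value after a binary split can be sketched directly from the definitions above (helper names are illustrative):

```python
def gini(labels):
    """Gini index of a set of class labels: 1 - sum_i (d_i / d)^2."""
    d = len(labels)
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return 1.0 - sum((di / d) ** 2 for di in counts.values())

def gini_split(left, right):
    """Weighted Gini index of a binary split D -> (D1, D2)."""
    d = len(left) + len(right)
    return (len(left) / d) * gini(left) + (len(right) / d) * gini(right)

# A 50/50 binary node has Gini 0.5; a split that separates the two
# classes perfectly drives the weighted index down to 0.
mixed = gini([0, 0, 1, 1])
perfect = gini_split([0, 0], [1, 1])
```

When a CART grows, the feature and threshold minimizing the post-split Gini index are chosen at each non-leaf node.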

Bagging
Bagging is an integrated learning algorithm proposed by Leo Breiman [37]. Given a dataset B with M features and a learning rule H, bootstrapping is carried out to generate training sets B_1, B_2, . . ., B_q; a sample in dataset B may appear many times in a training set or not at all. A forecasting system consisting of a group of learning rules H_1, H_2, . . ., H_q, each of which has learned its training set, is thus achieved. Breiman pointed out that bagging can improve the prediction accuracy of unstable learning algorithms such as CART and ANN [37].
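The bootstrap step can be sketched as follows (a minimal illustration; the function name is not from the paper):

```python
import random

def bootstrap_samples(dataset, q, seed=0):
    """Draw q bootstrap training sets B_1..B_q, each the same size as
    `dataset`, by sampling with replacement; a given sample may appear
    several times in one training set or not at all (out-of-bag)."""
    rng = random.Random(seed)
    n = len(dataset)
    return [[dataset[rng.randrange(n)] for _ in range(n)]
            for _ in range(q)]

training_sets = bootstrap_samples(list(range(10)), q=3)
```

Each learning rule H_i is then fitted on its own bootstrap set B_i, and the q fitted rules together form the bagged predictor.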

RF
The RF is a group of predictors {p(x, Θ_k), k = 1, 2, . . .} composed of a number of CARTs, where x is the input vector and {Θ_k} represents independent identically distributed random vectors. The modeling process of RF is: (1) k training sets are sampled with replacement from the dataset B by bootstrapping.
(2) Each training set grows into a tree according to the CART algorithm: supposing dataset B has M features, mtry features are randomly selected from B at each non-leaf node, and the node is split by a feature selected from these mtry features. (3) Each tree grows completely without pruning. (4) The forecasting result is obtained by averaging the predictions of all trees.
The flow chart of the RF model is illustrated in Figure 2.
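The four modeling steps can be sketched with depth-1 regression trees standing in for full unpruned CARTs (a deliberate simplification for brevity; identifiers are illustrative, not from the paper):

```python
import random

def train_stump(rows, mtry, rng):
    """Fit a depth-1 regression tree (stand-in for a full CART): try
    `mtry` randomly chosen features and every observed threshold,
    keeping the split with the smallest squared error."""
    n_feat = len(rows[0][0])
    best = None
    for f in rng.sample(range(n_feat), mtry):   # random feature subset
        for x, _ in rows:
            thr = x[f]
            left = [y for xi, y in rows if xi[f] <= thr]
            right = [y for xi, y in rows if xi[f] > thr]
            if not left or not right:
                continue
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((y - ml) ** 2 for y in left)
                   + sum((y - mr) ** 2 for y in right))
            if best is None or sse < best[0]:
                best = (sse, f, thr, ml, mr)
    _, f, thr, ml, mr = best
    return lambda x: ml if x[f] <= thr else mr

def random_forest(rows, n_trees, mtry, seed=0):
    """Steps (1)-(4): bootstrap a training set per tree, grow each tree
    with random feature selection, and average the trees' predictions."""
    rng = random.Random(seed)
    n = len(rows)
    trees = [train_stump([rows[rng.randrange(n)] for _ in range(n)],
                         mtry, rng)
             for _ in range(n_trees)]
    return lambda x: sum(t(x) for t in trees) / n_trees

# Toy regression: the target jumps from 0 to 10 at x = 5.
rows = [([float(i)], 0.0 if i < 5 else 10.0) for i in range(10)]
model = random_forest(rows, n_trees=25, mtry=1, seed=1)
```

Even this toy forest recovers the step function: averaging many bootstrap-trained trees keeps the prediction near 0 below the jump and near 10 above it.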

The bagging and the random selection of features for splitting ensure the good performance of RF: (1) the equal size of the training sets sampled by bootstrapping guarantees each sample in dataset B an equal chance of being appraised, and the fact that one sample may appear many times in a given training set while another may not appear at all keeps the correlation among the trees low; (2) the randomness in selecting features for node splitting ensures the generalization performance of RF.
The number of features mtry and the number of trees nTree should be set when applying RF. Generally, the suggested setting for mtry is mtry = [log2(M) + 1], mtry = √M, or mtry = M/3. The scale of the RF is generally selected empirically to be as large as possible, in order to improve the diversity of the trees and guarantee the performance of the RF.
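The three rules of thumb for mtry are simple to compute; the following sketch evaluates them for a given number of features M (the rounding conventions are one reasonable reading of the bracketed formulas):

```python
from math import log2, sqrt

def mtry_candidates(M):
    """The three common rules of thumb for the number of features
    examined at each node split, given M features in the dataset."""
    return {
        "log-rule": int(log2(M)) + 1,   # [log2(M) + 1]
        "sqrt-rule": round(sqrt(M)),    # sqrt(M)
        "third-rule": max(1, M // 3),   # M / 3, common for regression
    }

# For a 15-feature subset the three rules give 4, 4, and 5 features.
candidates = mtry_candidates(15)
```

In the experiments later in the paper, the regression-oriented rule mtry = p/3 is the one adopted [40].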

Data Analysis
The historical load data used in this paper are archived data from a city in Northeast China from 2005 to 2012. As shown in Figure 3a,b, the load demand from 2005 to 2012 increased rapidly with the growth of the local population and economy. It is difficult to generate a highly accurate STLF for this kind of load pattern. Figure 3c shows the correlation analysis of the historical load by the autocorrelation function [38]. Evidently, the autocorrelation coefficient decreases gradually as the hour lag increases; load far from the current moment has low correlation. Only the correlation of the load data from 2011 to 2012 lies above the confidence interval, indicating positive correlation (above the blue line). With the growth of the load, historic load with a large lag has very low correlation with the forecasting point. Therefore, the data from 2011 to 2012 are preferred for further research.
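The lag analysis behind Figure 3c can be sketched with the sample autocorrelation coefficient (a minimal stdlib version; the paper presumably uses a standard statistical package):

```python
def autocorrelation(series, lag):
    """Sample autocorrelation coefficient at a given hourly lag, as
    used to judge how far back the load history stays relevant."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return cov / var

# A toy series repeating a 4-hour pattern correlates strongly at its
# own period and poorly at half the period.
toy = [10, 12, 15, 11] * 144
at_period = autocorrelation(toy, 4)    # close to 1
off_period = autocorrelation(toy, 2)   # negative for this pattern
```

Applied to the real load series, the same computation at lags of 24, 48, . . ., 168 hours reveals the daily and weekly periodicity discussed below.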

By observing Figure 5, it can be seen that the load demand presents a cyclic pattern with a period of 7 days. The load demand from Monday to Friday is similar, whereas Saturday and Sunday differ from the weekdays. This pattern is due to the changing of the load level with the varying electricity consumption behavior of people within a week. The load at the forecasting point is highly correlated with the load at the same time on the previous day, as well as with that of the previous week. As shown in Figure 6, the load points throughout the week at lag 1, lag 24, lag 48, lag 72, lag 96, lag 120, lag 144, and lag 168 have strong relevance, with each lag being a 1 h difference. Furthermore, load values at other moments also show varying degrees of dependence.
The original feature set for STLF can be constructed based on the above analysis. The 168 load variables {L_{t-168}, L_{t-167}, . . ., L_{t-2}, L_{t-1}} are extracted as part of the original feature set. When performing day-ahead load forecasting, assuming the current moment is t, the load values from moment t-1 to t-24 are unknown. Therefore, the variables {L_{t-24}, L_{t-23}, . . ., L_{t-1}} are eliminated from the original feature set. In addition, features such as the hour of day, whether the day is a weekday or a weekend day, the day of week, and the season are considered for constructing the original feature set.
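Assembling one input sample under the day-ahead constraint can be sketched as follows (an illustrative helper, not from the paper; note that the usable lags 25-168 give 144 endogenous values, to which the four calendar features are appended):

```python
def build_feature_row(load, t, hour, day_of_week, season):
    """Assemble one input sample for day-ahead forecasting at time t:
    lagged loads L(t-25)..L(t-168) (lags 1..24 are unknown a day
    ahead) plus the exogenous calendar features."""
    lags = [load[t - k] for k in range(25, 169)]   # 144 endogenous lags
    is_weekday = 1 if day_of_week <= 5 else 0      # 1 weekday, 0 weekend
    return lags + [hour, is_weekday, day_of_week, season]

series = list(range(1000))   # stand-in for an hourly load record
row = build_feature_row(series, t=500, hour=13, day_of_week=3, season=2)
```

Sliding t across the historical record produces the training matrix whose columns are then screened by the G-mRMR procedure.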
Though meteorological factors affect the load demand, they are not considered in this paper, because the error of NWP influences the accuracy of STLF [29]. If needed, meteorological features can be added into the original feature set and selected in the same manner. There are 168 features in the original feature set F, as shown in Table 1. The meaning of the features in Table 1 is as follows.

Exogenous features: F_Hour is the hour of day, tagged by the numbers from 1 to 24. F_WW indicates weekday or weekend, marked by binary numbers, wherein 0 means weekend and 1 means weekday. F_DW refers to the day of week, labeled by the numbers from 1 to 7. F_S denotes the season, labeled by the numbers from 1 to 4.

Endogenous features:
F_L(t-25) is the load 25 h before, F_L(t-26) is the load 26 h before, and so on.

The Proposed Feature Selection Method and STLF Model
A feature selection method combining G-mRMR and RF is proposed. First, the redundancy of the features in F and the relevance between the features and the load are measured by G-mRMR, and each feature is ranked in descending order of its mRMR value. Afterward, an SFS-based RF is used to search for the optimal feature subset. The MAPE, used as the performance index in the feature subset selection process, is defined as:

MAPE = (1/N) Σ_{i=1}^{N} |Z_i − Ẑ_i| / Z_i × 100%  (9)

where Z_i is the actual value of the load, Ẑ_i is the forecast value, and N is the number of samples.
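The MAPE criterion of Equation (9) is straightforward to implement:

```python
def mape(actual, forecast):
    """Mean absolute percentage error (Equation (9)), the decision
    criterion used in the wrapper stage of feature-subset selection."""
    n = len(actual)
    return 100.0 / n * sum(abs(z - zf) / z
                           for z, zf in zip(actual, forecast))

# Errors of 10% and 5% on two points average to a MAPE of 7.5%.
example = mape([100.0, 200.0], [90.0, 210.0])
```

Note that MAPE is undefined when an actual load value is zero, which is not an issue for city-level demand data.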

G-mRMR for Feature Selection
Suppose an original feature set F_m including m features and a selected feature set J. The feature selection process is enumerated below:
(1) Initialization: J ← Ø.
(2) Compute the relevance between each feature and the target variable l. Pick out the feature from F_m which satisfies Equation (2) and add it into J.
(3) Find the feature among the remaining features in F_m that satisfies Equation (4) and add it into J.
(4) Repeat step (3) until F_m becomes Ø.
(5) Rank the features in the feature set J in descending order according to the measured mRMR values.

Wrapper for Feature Selection
Common wrappers are sequential forward and backward selection, neither of which considers feature weighting [34,39]. The effects of feature subsets of different dimensions must therefore all be measured, making the wrapper a complex and computationally expensive feature selection method. Based on the result of the G-mRMR feature selection, however, a wrapper can be applied in a simpler manner: since the features selected by mRMR are ranked in descending order, the features at the front of the ranking list contain the most effective information, so SFS is used to find a small feature subset.
In SFS, features are sequentially added to an initially empty candidate set until the addition of another feature no longer decreases the criterion. Defining an empty set S and an original feature set F_m, in the first step the wrapper searches for the feature subset with only one feature, marked as S_1, wherein the selected feature x_1 leads to the largest prediction error reduction. In the second step, the wrapper selects the feature x_2 from {F_m − S_1} whose combination with S_1 leads to the largest prediction error reduction. The search is repeated until the prediction error stops decreasing.
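Because the candidates arrive pre-ranked by G-mRMR, the wrapper can simply walk down the list, which is the simplified reading sketched below (the `evaluate` callback stands for retraining and scoring the RF predictor; names are illustrative):

```python
def sequential_forward_selection(ranked_features, evaluate):
    """Greedy SFS over the G-mRMR-ranked list: grow the subset one
    feature at a time and stop as soon as the error criterion
    (e.g. validation MAPE of the RF predictor) stops decreasing."""
    subset, best_err = [], float("inf")
    for f in ranked_features:
        err = evaluate(subset + [f])   # retrain and score the predictor
        if err >= best_err:
            break                      # adding f no longer helps
        subset, best_err = subset + [f], err
    return subset, best_err

# Toy error oracle: the error bottoms out at the subset ['a', 'b'].
errors = {('a',): 5.0, ('a', 'b'): 3.0, ('a', 'b', 'c'): 4.0}
subset, err = sequential_forward_selection(
    ['a', 'b', 'c', 'd'], lambda s: errors[tuple(s)])
```

In the paper's setting, `evaluate` is the RF retrained on the candidate subset and scored by MAPE on the validation set.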

The Proposed STLF Model
Based on the methods in Sections 4.1 and 4.2, the method of feature selection with RF for STLF is proposed. The feature selection and short-term load forecasting process are shown in Figure 7, where p is the number of features and α is the weighting factor, varied from 0.1 to 0.9 with an increment of 0.1.


Case Study and Results Analysis
The data for the experiment consist of the actual load data from 2011 to 2012 from a city in Northeast China. For the purpose of feature selection and STLF, the data are divided into three parts: (1) a training set (eight months extracted randomly from 2011); (2) a validation set (the remaining four months of 2011); and (3) a test set (one week extracted from each season of 2012). More information about the data set is shown in Table 2. The number of variables mtry, to which RF is not overly sensitive, is recommended as mtry = p/3 [40]. The complexity of RF is affected by the number of trees; under the premise of not reducing the prediction accuracy, the initial number of trees nTree is set to 500 [15].
Equation (9) is used as one of the evaluation criteria of RF. In addition, the root mean square error (RMSE) is also used, defined as:

RMSE = sqrt( (1/N) Σ_{i=1}^{N} (Z_i − Ẑ_i)² )  (10)
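The RMSE criterion can be computed alongside MAPE as follows:

```python
from math import sqrt

def rmse(actual, forecast):
    """Root mean square error, the second evaluation criterion used
    alongside MAPE; it penalizes large individual errors more heavily."""
    n = len(actual)
    return sqrt(sum((z - zf) ** 2 for z, zf in zip(actual, forecast)) / n)

# Two errors of magnitude 10 give an RMSE of exactly 10.
example = rmse([100.0, 200.0], [90.0, 210.0])
```

Unlike MAPE, RMSE is expressed in the load's own units (MW), so the two criteria together cover relative and absolute error.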

Feature Selection Results Based on G-mRMR and RF
In this subsection, the optimal subset is obtained according to the minimum MAPE under different weighting factor values of G-mRMR. Figure 8 shows the MAPE curves of the RF predictions under different weighting factors α. As shown in Figure 8a, the MAPE decreases and reaches a minimum as the number of features increases; subsequently it ceases to decrease and gradually rises, indicating that the later-added features do not improve the performance of RF but only bring adverse effects. As shown in Figure 8b, the error decreases rapidly when a small value of α is adopted, for instance α = 0.1, which indicates that the top-ranked features carry useful information for improving the performance of RF. When the redundancy among features is weighted excessively with a large value of α, the selected feature subset does not provide enough relevant information for the RF-based predictor.

Table 3 presents the results of feature selection. When α = 0.4, the feature subset has the least number of features and the RF generates the minimum MAPE. This feature subset is therefore selected as the optimal one.
An RF with too few trees forecasts poorly, while too many trees make it an unnecessarily complicated predictor. In order to obtain a reasonable number of trees, an experiment is designed as follows: (1) the training set and test set with the optimal features are used; (2) the initial number of trees is nTree = 1; (3) the RF is trained and tested with nTree increased in steps of 1 until nTree = 500.
The experimental result is shown in Figure 9. The prediction error decreases as the number of trees grows. When nTree > 100, the error tends to be steady. By analyzing the result, nTree = 184 with a minimum MAPE of 2.5389% is obtained, and this number of trees is used as the RF parameter in the subsequent experiments.
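This sweep can be sketched as below, with synthetic placeholder data and a shortened range (in the paper the sweep runs to nTree = 500 and finds the optimum at 184).

```python
# Sketch of the tree-number sweep: train RF for each nTree and keep the
# value giving the minimum MAPE on the held-out data. Placeholder data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((120, 5))
y = X.sum(axis=1)                       # synthetic target, strictly positive
X_train, y_train = X[:100], y[:100]
X_test, y_test = X[100:], y[100:]

def mape(y_true, y_pred):
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

best_ntree, best_mape = None, np.inf
for ntree in range(1, 51):              # 1..500 in the paper; shortened here
    rf = RandomForestRegressor(n_estimators=ntree, random_state=0)
    rf.fit(X_train, y_train)
    err = mape(y_test, rf.predict(X_test))
    if err < best_mape:
        best_ntree, best_mape = ntree, err
```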

Comparison Experiments for STLF
The data shown in Table 2 are used in the comparison experiments.

Comparison of Different Feature Selection Methods
Using RF as the predictor, feature selection methods such as the Pearson correlation coefficient (PCC), MI, and SFS are compared with the proposed method to evaluate the feature selection effect of G-mRMR. The results of these feature selection methods are presented in Figure 10.
In Figure 10, with the same predictor, SFS provides the best performance, followed by G-mRMR (α = 0.4) and MI, and finally PCC. SFS, which is a wrapper around RF, selects 22 features and achieves the minimum MAPE of 2.4925%. Considering both the relevance between features and load and the redundancy among features, G-mRMR (α = 0.4) selects 15 features with a minimum MAPE of 2.5597%. The error of the feature subset selected by MI, which does not consider the redundancy among features, is higher than that of G-mRMR (α = 0.4). PCC only analyzes the linear relation between features and load, and the feature subset selected by this method is not as good as that of G-mRMR (α = 0.4).
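The limitation of PCC noted above can be seen on a toy example: a purely quadratic relation has (near-)zero Pearson correlation but clearly positive mutual information. The histogram-based MI estimator below is illustrative, not the paper's estimator.

```python
# PCC misses a symmetric nonlinear relation that MI captures.
import numpy as np
from collections import Counter

x = np.linspace(-1, 1, 201)
y = x ** 2                              # deterministic but nonlinear relation

pcc = np.corrcoef(x, y)[0, 1]           # essentially zero by symmetry

def mi(a, b, bins=8):
    """Histogram-based MI estimate in nats (toy estimator)."""
    da = np.digitize(a, np.histogram_bin_edges(a, bins))
    db = np.digitize(b, np.histogram_bin_edges(b, bins))
    n = len(da)
    pab, pa, pb = Counter(zip(da, db)), Counter(da), Counter(db)
    return sum((c / n) * np.log(c * n / (pa[i] * pb[j]))
               for (i, j), c in pab.items())
```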

In order to verify the validity of the feature subsets applied to STLF, four weeks distributed among the four seasons of 2012 are used to test each feature subset with RF. For comparison, the full feature set with RF is also tested. The experimental results are shown in Figure 11. By examining the results in Figure 11a-d, generalized minimum redundancy and maximum relevance-random forest (G-mRMR-RF) (α = 0.4), mutual information-random forest (MI-RF), sequential forward selection-random forest (SFS-RF), and RF (full features) fit the true load values accurately, whereas the accuracy of Pearson correlation coefficient-random forest (PCC-RF) is low. The fifth-day prediction in Figure 11a and the seventh-day prediction in Figure 11c show that G-mRMR-RF fits better than MI-RF, indicating the necessity of considering the redundancy among features. The fifth-day prediction in Figure 11a shows that SFS-RF performs better than G-mMR-RF, while the seventh-day prediction in Figure 11c indicates that G-mRMR-RF predicts better.
By analyzing Figures 10 and 11 and Tables 4-7 comprehensively, although SFS achieved the best forecasting results in the feature selection process, the proposed method achieved the better result on the test set. When predicting the 28 days in the test set, the proposed method yields the best forecast on 20 days, and the MAPE on the remaining eight days is higher than that of other methods by 0.04% to 0.37%. The average MAPE and the average RMSE indicate that G-mRMR-RF performs best among the methods, which demonstrates the validity and advancement of G-mRMR.
The new method also has the minimum value of the maximum error of STLF on the test set. As shown in Table 6, the maximum MAPE and maximum RMSE of the proposed method are 6.12% and 208.00 MW. Although the maximum error of the new method is high, the proposed method still performs better than the other methods. The high prediction error can be caused by two factors. On the one hand, the load of the forecast day is much larger than the historical load data in the training set. In this paper, most features in the original feature set are extracted from the historical load data; without considering other features, the prediction results cannot be advanced merely by improving the feature selection and forecasting method. On the other hand, with the significant economic rise of China from 2005 to 2012, the growth rate of the gross domestic product of the city exceeded 10%. Under this premise, the electric load of the city increases rapidly, which makes STLF a challenging task.

Comparison of Different Predictors
To compare the influence of different predictors on STLF, support vector regression (SVR) and back propagation neural network (BPNN) are examined with G-mRMR for feature selection in this subsection. The parameters of SVR are set as follows: the penalty factor is C = 100, the insensitive loss function is ε = 0.1, and the kernel width is δ² = 2 [41].
The parameters of BPNN are set as follows: the number of neurons in the hidden layer is N_neu = 2p + 1 [42], and the number of iterations is T = 2000 [43].
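These settings can be sketched with scikit-learn. This is an assumption on two counts: the paper's implementation is unspecified, and the mapping of the kernel width δ² to the RBF gamma parameter, gamma = 1/(2δ²), is one common convention, not stated in the paper.

```python
# Assumed scikit-learn configuration of the two comparison predictors.
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor

p = 15                                  # assumed number of selected features
svr = SVR(kernel="rbf", C=100, epsilon=0.1,
          gamma=1.0 / (2 * 2))          # gamma = 1/(2*delta^2), delta^2 = 2
bpnn = MLPRegressor(hidden_layer_sizes=(2 * p + 1,),  # N_neu = 2p + 1
                    max_iter=2000,      # T = 2000 iterations
                    random_state=0)
```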
The division of the data into training, validation, and test sets is similar to that in Section 4.2. The SVR and BPNN are used to generate the optimal feature subsets.
Table 8 presents the feature subsets selected by the different intelligent STLF methods. With different predictors, the weighting factors differ, and thus the selected features vary. Although the final numbers of features selected with SVR and BPNN are smaller than with RF, the RF-based predictor has higher prediction accuracy, which is the main target of STLF. The test sets, with four weeks distributed over the four seasons, are used to evaluate each predictor with the features chosen above. Figure 12 shows the MAPE for comparison, and Table 9 gives the predictive accuracy of each model through the maximum, minimum, and average MAPE. In addition, a direct comparison between G-mRMR-RF, generalized minimum redundancy and maximum relevance-back propagation neural network (G-mRMR-BPNN), and generalized minimum redundancy and maximum relevance-support vector regression (G-mRMR-SVR), in terms of MAPE, is also presented in this figure. Except for the seventh-day prediction shown in Figure 12c, the MAPE of G-mRMR-RF is between 1% and 2%, with only one point above 2%. In the whole experiment, only four days show G-mRMR-RF forecasting worse than the other models. Clearly, G-mRMR-RF is the best prediction model for its low MAPE and small fluctuation of error. G-mRMR-BPNN performs slightly better than G-mRMR-SVR. The maximum MAPEs of G-mRMR-RF for the four weeks are 2.26%, 2.04%, 6.12%, and 1.98%, respectively, which are smaller than those of the other models. The same conclusion can be drawn by analyzing the minimum and average MAPE. Based on the comprehensive analysis above, compared with BPNN and SVR, RF combined with G-mRMR is more suitable for STLF.

Conclusions
To address the issue of selecting reasonable features for STLF, a feature selection method based on G-mRMR and RF is proposed in this paper. The experimental results show that the proposed approach selects fewer features than other feature selection methods, and that the features it identifies are useful for STLF. In addition, the experimental results show that the forecasts produced by RF are better than those of the other predictors.
The advantages of the proposed method are as follows: (1) MI is adopted as the criterion to measure the relevance between the features and the load time series as well as the dependency among features, which is the basis of the quantitative analysis of feature selection by mRMR. (2) Both the correlation between features and load and the redundancy among features are considered. Compared with the maximum relevance method, G-mRMR reduces the size of the optimal feature subset and prevents the accuracy of STLF from being degraded by redundant features. Meanwhile, the relevance and redundancy are balanced by a variable weighting factor, so the features selected by G-mRMR make RF more accurate than those selected by mRMR. (3) The structure of RF is optimized to reduce the complexity of the model and to improve the accuracy of STLF.

Figure 2 .
Figure 2. Random Forest modeling and predicting process.


Figure 3.
Figure 3. Yearly load curve analysis: (a) Average daily load from 8 January 2005 to 31 December 2012; (b) The population and GDP from 2005 to 2012; (c) Hourly load autocorrelation of historical load data.

Figure 4 shows the average daily load pattern occurring in different seasons. These loads have visibly different patterns, which are caused by the varying climate.


Figure 4 .
Figure 4. Four seasons average daily load profile from December 2010 to November 2011.


Figure 7.
Figure 7. The feature selection process based on G-mRMR and RF for STLF.


Figure 8.
Figure 8. Prediction error curves: (a) Prediction error curves corresponding to different weighting factor α; (b) The enlarged figure of the red box in (a).


Figure 9 .
Figure 9. Correlation between tree number and prediction of RF.

Figure 10.
Figure 10. Prediction error curves: (a) Prediction error curves corresponding to different feature selection methods; (b) The enlarged figure of the red box in (a).


Figure 11.
Figure 11. Load curves of forecasting results of four weeks in four seasons and the true values: (a) Forecasting from 23 to 29 February 2012; (b) Forecasting from 13 to 19 May 2012; (c) Forecasting from 21 to 27 August 2012; (d) Forecasting from 24 to 30 November 2012.


Table 1 .
The original feature set.

Table 2 .
Detail information about the data set.


Table 8.
Feature subsets selected by the different intelligent STLF methods.

Table 9 .
Maximum, minimum, and average daily MAPEs of the test set corresponding to different predictors.