A Permutation Importance-Based Feature Selection Method for Short-Term Electricity Load Forecasting Using Random Forest

: The prediction accuracy of short-term load forecast (STLF) depends on prediction model choice and feature selection result. In this paper, a novel random forest (RF)-based feature selection method for STLF is proposed. First, 243 related features were extracted from historical load data and the time information of prediction points to form the original feature set. Subsequently, the original feature set was used to train an RF as the original model. After the training process, the prediction error of the original model on the test set was recorded and the permutation importance (PI) value of each feature was obtained. Then, an improved sequential backward search method was used to select the optimal forecasting feature subset based on the PI value of each feature. Finally, the optimal forecasting feature subset was used to train a new RF model as the ﬁnal prediction model. Experiments showed that the prediction accuracy of RF trained by the optimal forecasting feature subset was higher than that of the original model and comparative models based on support vector regression and artiﬁcial neural network.


Introduction
Load forecast (LF) is the basis for the planning, operating, and scheduling of traditional power networks.LF is also the basis for creating an efficient power system by reducing operational costs and using the renewable energy source of the smart grid [1].LF can be divided into three categories according to forecast horizon: long-term LF, mid-term LF, and short-term LF (STLF) [2].STLF generally refers to the prediction of load one hour, one day, or one week ahead [3].The prediction accuracy of STLF is directly related to the safety, stability, and economy of power system operation.Considering an electrical utility in the United Kingdom as an example, a decrease of 1% in prediction error can result in a decrease of approximately 10 million pounds in operational costs [4].Therefore, in the smart grid and deregulated power market environment, power systems require high prediction accuracy of STLF.
STLF methods are divided into traditional and artificial intelligence methods.The traditional methods mainly include Kalman filtering [5], exponential smoothing [6], regression analysis [7], and autoregressive integrated moving average (ARIMA) methods [8,9].These methods have simple principles and mature technologies, but lack self-learning capability and have difficulty describing complex non-linear models accurately [10].The artificial intelligence methods mainly include the fuzzy logic (FL) method [11,12], the artificial neural network (ANN) [13][14][15], support vector regression Energies 2016, 9, 767 2 of 24 (SVR) [16][17][18], and random forest (RF) [19][20][21].The FL method has strong adaptability and can deal with fuzzy phenomena in power systems.However, FL can only roughly map the output and has weak learning capability.ANN has been extensively used in the field of STLF because of its excellent self-learning and fault-tolerant capabilities.Nevertheless, ANN easily falls to overfitting and local optimum.A unified approach for choosing the network structure and connection weights is also lacking [16], thereby resulting in several errors and instabilities for the application of ANN to STLF.SVR follows the structural risk minimization principle to improve the generalization capability, a principle different from that of empirical risk minimization of ANN.Therefore, SVR overcomes the numerous disadvantages of ANN [22].However, when constructing the forecast model, the structure and parameters of SVR must be adjusted according to different inputs.The optimization process is relatively complex.Unlike other methods, RF is an artificial intelligence method that combines decision tree (DT) and integrated algorithm together.RF has the advantages of good anti-noise capability and strong resistant to overfitting.Only a few parameters of RF need to be optimized as compared with other methods [23].
The feature set used as the input of these artificial intelligence forecast methods can directly affect the prediction accuracy and efficiency of the forecasting model [24][25][26].Electrical LF refers to the accurate prediction of future electric power load based on a large quantity of historical load data and other related factors.Therefore, the present study aims to obtain the optimal feature subset for the forecasting model through feature selection methods.However, most artificial intelligence forecast methods cannot conduct the feature selection process and need to be combined with another feature selection algorithm to do so.Che et al. combined SVR with an approximation convexity optimization framework with three different initial values of the optimal feature subset dimension m converged to the same value at the stop condition to obtain the optimal feature subset [24].Ghofrani et al. and Kouhi et al. combined ANN and correlation analysis [25,26].By calculating the correlation coefficient of model input and output, the relevant features are retained and the redundant features are eliminated.Although the above studies have made progress in the feature selection of STLF, the constantly changing input feature subsets used in the process of feature selection require the adjustment and optimization of forecasting model parameters.Furthermore, forecasting models have difficulty achieving optimal forecasting results, and the optimal feature subset needs to be evaluated by the prediction error of the forecasting model.SVR and ANN have difficulty obtaining the minimum prediction error of different feature subsets, thus affecting the feature selection result.
When the original feature set is used to train an RF model, the permutation importance (PI) value of each feature for prediction can be obtained in the training process.On this basis, the optimal features can be selected through the sequential backward search (SBS) method.Thus, when used for load forecasting, RF need not be combined with complex feature selection algorithms.Furthermore, if the number of DTs in RF is sufficiently large, then only one parameter needs to be adjusted when the feature subset dimension changes.This parameter can be conveniently calculated using the empirical formula.Unlike SVR and ANN, RF is more suitable for feature selection of STLF.Determining the threshold for feature selection methods and modifying the numerous parameters of the predictor according to the different feature subsets are difficult.However, the optimal feature subset can be easily selected by combining PI with the improved SBS strategy.Consequently, the abovementioned shortcomings can be overcome and the accuracy of STLF can be improved.
In this paper, a novel RF-based feature selection method for STLF is proposed.First, 243 related features are extracted from historical load data and time information of predicted point to form the original feature set.The load data for a full year are divided into a training set and a test set by random sampling.Subsequently, the original feature set is used to train an RF as the original model.After the training process, the prediction error of the original model on the test set is recorded as the threshold to evaluate the forecasting capability of different feature subsets and obtain the PI value of each feature.Then, the improved SBS method is used to select the optimal forecasting feature subset according to the PI value of each feature.Finally, the optimal forecasting feature subset is used to train a new RF as the final prediction model.The historical load data for 2012 of a city in Northeast China are used for comparative experiments and to demonstrate the superiority of the proposed method in feature selection and load forecasting.
The rest of the paper is organized as follows.Section 2 introduces RF, and Section 3 describes the RF learning process and the proposed feature selection method.Section 4 presents the real load data for experiments, and analyzes and discusses the feature selection and STLF results.Finally, Section 5 elaborates the conclusions and future work.

Mathematical Preliminaries
RF is an intelligent algorithm that combines DT with integrated algorithm.RF not only possesses the numerous advantages of DT, but also overcomes the poor generalization capability of DT.As compared with DT, RF enhances the precision of classification and regression without significantly increasing its computational complexity.

Decision Tree
DT is a type of classical machine learning algorithm.As an example of DT models, classification and regression tree (CART) can be used for classification and regression analysis [27,28].DT is an inverted tree.The top of DT is the root node, which contains all the training samples.The optimal feature is selected from the original feature space to split each of the non-leaf nodes in DT until the stop condition is reached.If all the samples contained in a node belong to one class, then this node is defined as a leaf node.Splitting the leaf node is unnecessary, and each path connecting the root node and the leaf node represents a partition rule.
DT has a simple principle and structure and can thus be constructed easily with high efficiency.Although DT can explain the training set perfectly, its dependency on the training samples increases when grown freely.Accordingly, the generalization capability of DT weakens, thereby lowering its resistance to overfitting.Therefore, a pruning operation must be conducted to restrict the free growth of DT.DT may also fall into the local optimal state, thereby weakening the explanation capability of the single DT.

Random Forest
RF was proposed by Breiman in 2001 [23] to overcome the shortcomings of DT.RF combines CART and the bagging algorithm and builds a new DT set based on ensemble learning methods.
where t(x, s Θ k ) is the base classifier, which represents a CART (k = 1, 2, . . ., m); x is the input vector of CART; and s Θ k is a random vector, which determines the random extraction process of training samples for the kth tree.The growth process of the kth tree is also determined by s Θ k .Meanwhile, all s Θ k values are independent of one another but share the same distribution.For integrated algorithms, the difference of base classifiers can significantly affect their performance [29].The two methods of RF randomness described below ensure significant difference among the base classifiers.

1.
Assume S is the original sample set with n samples.When bagging is used to generate the training set for each CART, each of the samples in the original sample space has 1/n probability to be selected.Based on the characteristics of bagging, several samples may never be selected, whereas other samples may be selected more than once.All samples that have never been selected are the out-of-bag (OOB) dataset of this tree.Therefore, the training set of each CART is different, thus reducing the correlation between the trees in RF.The diversity of CART increases the capability of RF to resist noise and reduces its sensitivity to outliers.These are the main advantages that bagging brings to RF.

2.
RF differs from DT in terms of selecting a feature to split a non-leaf node.Specifically, instead of searching the entire original feature space M to select the best feature, RF randomly generates a candidate segmentation feature set m for each non-leaf node.The set m is a subset of the original feature set with m try features (m try is no longer changed once determined).Thereafter, the optimal feature is selected from m to split this node.
The two ways of RF randomness make its CARTs different from one another.Thus, RF has a wide range of applications and requires few parameters to be adjusted and optimized.Only two parameters can affect the forecasting performance of RF, that is, the tree number n tree and the dimension m try of the candidate segmentation feature set.When the number of trees in RF is low, the performance of RF is very poor and its precision fails to meet the requirements of regression prediction.According to the Strong Law of Large Numbers and the tree structure, the generalization error of RF will tend to be a stable upper bound with the rise in tree number [23].As a result, RF becomes resistant to overfitting.Compared with that of n tree , m try has a larger effect on the performance of RF.After considerable experimental research, the default experience value for m try when RF is used for regression has been obtained [21]: In this equation, t represents the dimension of the original feature set.RF can provide two kinds of useful indices, namely OOB error and the importance value of each feature, after completing the training process.According to the characteristics of bagging, approximately one-third of the samples in the original sample space will never be selected when the training set is generated for each tree.Therefore, for each sample j, approximately one-third of trees exist in RF that are not contained in this sample.These trees are then used to predict sample j.The OOB error is calculated by: where n represents the number of all samples, y r is the true value of sample i, and y p is the predictive value of sample i.The OOB error can be used to estimate the generalization error of RF.Meanwhile, the PI value of each feature can be calculated based on the OOB dataset, which can significantly benefit the following feature selection stage.After the training process, the final predictive result can be obtained by averaging the output of all trees: In this equation, c = n tree represents the number of trees in RF and y i is the predictive value of the ith tree.

Random Forest (RF) Learning Process and Feature Selection for Short-Term Load Forecast (STLF)
In the STLF field, a large number of features must be considered, such as historical load data and hourly and daily information.If all the features are used for load forecasting, then the prediction accuracy and efficiency of the forecasting model can decrease because of the existence of redundant features.Thus, the optimal feature subset must be constructed by removing the redundant features.In the early stage, the feature subset is artificially specified by expert experience.However, this process is unreasonable and less credible.In the current load forecasting field, the feature selection methods are necessary to optimize numerous parameters for every feature subset.As a result, the feature selection becomes time-consuming and the error caused by the unreasonable design of the parameters cannot be avoided easily.RF can guarantee the optimization of the model by adjusting only a small number of parameters.After completing the training process, RF can provide the importance value of each Energies 2016, 9, 767 5 of 24 feature for prediction.The redundant features can then be removed step by step according to the importance value.Accordingly, the feature selection process of STLF can be significantly simplified.

RF Learning Strategy
RF is a collection of numerous DTs.Therefore, RF has a simple structure with strong anti-noise capability and can overcome the interpretation capability disadvantage of a single DT.The Strong Law of Large Numbers guarantees that RF will nearly never fall to overfitting.Its numerous advantages make RF suitable for application in power system load forecasting [19].
By obtaining the historical load data and hourly and daily information, the original sample space with dimension n is constructed.On this basis, n samples are randomly selected with replacement from the original sample space according to the bagging principle.The selected samples form a new training set for the CART.The rest of the samples form the OOB dataset of this tree.The process is repeated for n tree times.These n tree training sets are then used to construct n tree CARTs.All the trees grow freely without pruning operation.
When a non-leaf node is split, the segmentation effect of a feature is determined by Gini index.Assuming that a non-leaf node A contains the dataset D, a total of d samples exist in D. The Gini index of set D before being split by a feature is as follows: where l represents the number of categories contained in D; and d j (j = According to Equation (6), the Gini index is inversely proportional to the segmentation effect.Therefore, the feature with smaller Gini a f ter can achieve better performance.RF can determine the importance of a feature to STLF by calculating the PI value of each feature.When calculating the importance value of feature F j based on the ith tree, OOBError i is first calculated based on Equation (3).Then, the values of feature F j in the OOB dataset are randomly rearranged and those of the other features are unchanged, thereby forming a new OOB dataset OOB i .With the new OOB i set, OOBError i can also be calculated using Equation (3).The PI value of feature F j based on the ith tree can be obtained by subtracting OOBError i from OOBError i .
The calculation process is repeated for each tree.The final PI value of feature F j can be obtained by averaging the PI values of each tree: Energies 2016, 9, 767 6 of 24 where c = n tree represents the tree number.If a feature is important, then its values of different samples will be dissimilar.After the values of this feature are randomly rearranged on the OOB dataset, the discrimination of different samples will be reduced.The feature with high PI value is more important than the other features.On the basis of the PI value of all the features combined with the SBS method, the optimal feature subset can be determined.However, considering that the dimension of the original load feature set is relatively high, the process of feature selection will be time-consuming when using the SBS method.Thus, an improved SBS method is proposed.
A preselection stage is added before using the traditional SBS method.The steps of the preselection stage are described as follows.

1.
The original load feature set is used as the input to train an RF.After the training process is completed, the test set is used to evaluate the performance of this RF.Thereafter, the prediction error P all can be obtained and set as a threshold value.The PI value of each feature can also be obtained.

2.
According to the PI value, all features are rearranged in a descending order and are resaved to the original feature set M.

3.
The first 10 features with the highest PI value are added to the preselection feature set Q pre , which is an empty set at first.Subsequently, these features are removed from set M.

4.
Let set Q i pre (superscript i represents the number of features in the set) be equal to set Q pre .Set Q i pre is then used to retrain a new RF and the prediction error is recorded as P i pre .

5.
If P i pre ≤ P all , then the first 10 features of set M are added to set Q i pre to form the set Q i+10 pre .The training and testing processes are repeated using set Q i+10 pre .

6.
If P i+10 pre ≥ P i pre , then adding another feature to set Q pre is unnecessary.Otherwise, the first 10 features of set M must be added to set Q pre and must be removed from set M. This condition indicates that the stop conditions are P i pre ≤ P all and P i+10 pre ≥ P i pre .

7.
The preceding steps are repeated until the stop condition is met or set M is empty.8.
The preselection feature set Q pre is obtained and is equal to set Q i pre , and the preselection stage ends.
After determining the preselection feature set Q pre , the traditional SBS method is applied to set Q pre .The features in set Q pre are removed one by one, from smallest to largest according to their PI values, until set Q pre is empty.Whenever a feature is removed, set Q pre is used to retrain a new RF and the prediction error on the test set is recorded.The optimal forecasting feature subset Q best is obtained by considering prediction error and feature set dimension.The flowchart of the algorithm is shown in Figure 1.
Unlike the traditional SBS method, a preselection stage is added in the method proposed in this paper.By spending a small amount of time, a preselection feature set Q pre that is much smaller than the original feature set M can be obtained.On this basis, the traditional SBS method is performed within a relatively small amount of time.Therefore, the proposed algorithm is very suitable for load forecasting with high dimensional feature sets.Unlike the traditional SBS method, a preselection stage is added in the method proposed in this paper.By spending a small amount of time, a preselection feature set pre Q that is much smaller than the original feature set M can be obtained.On this basis, the traditional SBS method is performed within a relatively small amount of time.Therefore, the proposed algorithm is very suitable for load forecasting with high dimensional feature sets.

Experimental Results and Analysis
The data used in the experiment are all real historical load data.Two kinds of error evaluation criteria, namely, mean absolute percentage error (MAPE) and root mean square error (RMSE), are used to evaluate the load forecasting results.The calculation formula of MAPE and RMSE are described in Equations ( 9) and ( 10):

Experimental Results and Analysis
The data used in the experiment are all real historical load data.Two kinds of error evaluation criteria, namely, mean absolute percentage error (MAPE) and root mean square error (RMSE), are used to evaluate the load forecasting results.The calculation formula of MAPE and RMSE are described in Equations ( 9) and ( 10): where n represents the number of sample points, y r (t) is the real load value of hour t, and y p (t) is the forecasting load value of hour t.

Dataset
The historical load data for 2012 of a city in the northeast of China are used for the experiment.This historical load dataset contains a total of 366 days.According to Jurado et al., a total of 9% of these days (33 days) are randomly selected as the test set, and the remaining 91% (333 days) are used as the training set [20].When the load data are sampled in this city, the sampling is 1 h.The load curve of the entire year is shown in Figure 2.

Dataset
The historical load data for 2012 of a city in the northeast of China are used for the experiment.This historical load dataset contains a total of 366 days.According to Jurado et al., a total of 9% of these days (33 days) are randomly selected as the test set, and the remaining 91% (333 days) are used as the training set [20].When the load data are sampled in this city, the sampling is 1 h.The load curve of the entire year is shown in Figure 2.  The entire year is divided into a training set and a test set by random sampling.To increase the reliability of the experiment, 33 days of the test set are randomly and equally distributed in four quarters (approximately eight days per quarter).The relevant information for each day of the test set is listed in Table 1.

Load Feature Selection Based on Permutation Importance (PI) Value
Based on the above dataset, two kinds of predictions, namely, 1-h-ahead prediction and day-ahead prediction, are conducted.
Assume the load is predicted starting from the time t.When the original load feature set of 1-hahead prediction is composed, the historical load values of 240 time points (10 days) before time t are considered, starting at time 1 t  (1 h ahead of time t).Meanwhile, whether the forecast date is a working day, the date type of forecast date, and the moment of t are also considered.A total of 243 features constitute the original feature set of 1-h-ahead prediction.
Contrary to 1-h-ahead prediction, the historical load values of the original feature set of dayahead prediction start from time 24 t  . The number of considered historical load value is 240.Whether the forecast date is a working day, the date type of forecast date, and the moment of t are considered as well.A total of 243 features comprise the original feature set of day-ahead prediction.The entire year is divided into a training set and a test set by random sampling.To increase the reliability of the experiment, 33 days of the test set are randomly and equally distributed in four quarters (approximately eight days per quarter).The relevant information for each day of the test set is listed in Table 1.

Load Feature Selection Based on Permutation Importance (PI) Value
Based on the above dataset, two kinds of predictions, namely, 1-h-ahead prediction and day-ahead prediction, are conducted.
Assume the load is predicted starting from the time t.When the original load feature set of 1-h-ahead prediction is composed, the historical load values of 240 time points (10 days) before time t are considered, starting at time t − 1 (1 h ahead of time t).Meanwhile, whether the forecast date is a working day, the date type of forecast date, and the moment of t are also considered.A total of 243 features constitute the original feature set of 1-h-ahead prediction.
Energies 2016, 9, 767 9 of 24 Contrary to 1-h-ahead prediction, the historical load values of the original feature set of day-ahead prediction start from time t − 24.The number of considered historical load value is 240.Whether the forecast date is a working day, the date type of forecast date, and the moment of t are considered as well.A total of 243 features comprise the original feature set of day-ahead prediction.
In this study, several covariates, such as temperature, are ignored.If temperature is added to the original feature set, then the temperature of moment t must be obtained first based on the numerical weather forecast, or simply replaced by a historical temperature.Nevertheless, the numerical weather forecast itself has a certain prediction error that can affect the accuracy of STLF [30].Considering the actual demand in the feature, covariates, such as temperature, can be easily added to the original feature set.Notably, changes in the original feature set can insignificantly influence the process of the proposed feature selection method.
Tables 2 and 3 list the composition of original feature sets of 1-h-ahead prediction and day-ahead prediction, respectively, in detail.
Table 2.The composition of original feature set used for 1-h-ahead prediction.

Names of the Features
Meanings of the Features The historical load value of (t − i) time Whether today is a working day?(1-Yes, 2-No) What day is today?(1-Mon., 2-Tues., 3-Wed., 4-Thur., 5-Fri., 6-Sat., 7-Sun.) The moment of t (from 0 to 23, corresponding to the 24 hours a day) Table 3.The composition of original feature set used for day-ahead prediction.

Names of the Features
Meanings of the Features The historical load value of (t Whether today is a working day?(1-Yes, 2-No) What day is today?(1-Mon., 2-Tues., 3-Wed., 4-Thur., 5-Fri., 6-Sat., 7-Sun.) The moment of t (from 0 to 23, corresponding to the 24 hours a day) The original load feature set is used as the input to train an RF.The PI value of each feature of the original load feature set can be obtained after completing the training process.In the experiment, n tree is set to the default value of 500, and m try also takes the default experience value of t/3.The same training process is conducted 10 times to ensure the reliability of the experimental results.Accordingly, several accidents that can increase the PI value of several irrelevant features and decrease that of several relevant features can be avoided.
The PI value boxplots of each feature of 1-h-ahead prediction and day-ahead prediction are shown in Figures 3 and 4, respectively.
Energies 2016, 9, 767 9 of 24 In this study, several covariates, such as temperature, are ignored.If temperature is added to the original feature set, then the temperature of moment t must be obtained first based on the numerical weather forecast, or simply replaced by a historical temperature.Nevertheless, the numerical weather forecast itself has a certain prediction error that can affect the accuracy of STLF [30].Considering the actual demand in the feature, covariates, such as temperature, can be easily added to the original feature set.Notably, changes in the original feature set can insignificantly influence the process of the proposed feature selection method.
Tables 2 and 3 list the composition of original feature sets of 1-h-ahead prediction and day-ahead prediction, respectively, in detail.
The historical load value of   Whether today is a working day?(1-Yes, 2-No) What day is today?(1-Mon., 2-Tues., 3-Wed., 4-Thur., 5-Fri., 6-Sat., 7-Sun.) The moment of t (from 0 to 23, corresponding to the 24 hours a day) The original load feature set is used as the input to train an RF.The PI value of each feature of the original load feature set can be obtained after completing the training process.In the experiment, tree n is set to the default value of 500, and try m also takes the default experience value of 3 t .The same training process is conducted 10 times to ensure the reliability of the experimental results.Accordingly, several accidents that can increase the PI value of several irrelevant features and decrease that of several relevant features can be avoided.
The historical load value of   The moment of t (from 0 to 23, corresponding to the 24 hours a day) The PI value boxplots of each feature of 1-h-ahead prediction and day-ahead prediction are shown in Figures 3 and 4, respectively.
(a) The dimension of the original load feature set is relatively high.The 243 features are divided into two groups to demonstrate the PI value distribution of each feature.The first 122 features of 1h-ahead prediction and day-ahead prediction are shown in Figures 2a and 3a, respectively, and the remaining 121 features are shown in Figures 2b and 3b, respectively.Considering that the abscissa is relatively dense, only the features with high PI values are marked in the figure.For ease of interpretation, Fi is used to replace the original 1 h i F  and 24 h i F  to represent the feature i in Figures 3 and 4, respectively.
Considerable information can be obtained from a boxplot.A small circle in a boxplot represents an outlier.If the possible outliers are ignored, from top to bottom, then the boxplot comprises the upper edge line, upper quartile ( 3 Q ) line, median line, lower quartile ( 1 Q ) line, and lower edge line.
Among them, the upper quartile line, median line, and lower quartile line constitute a small rectangle.The length of this rectangle represents the concentration degree of data distribution.The values of The dimension of the original load feature set is relatively high.The 243 features are divided into two groups to demonstrate the PI value distribution of each feature.The first 122 features of 1h-ahead prediction and day-ahead prediction are shown in Figures 2a and 3a, respectively, and the remaining 121 features are shown in Figures 2b and 3b, respectively.Considering that the abscissa is relatively dense, only the features with high PI values are marked in the figure.For ease of interpretation, Fi is used to replace the original 1 h i F  and 24 h i F  to represent the feature i in Figures 3 and 4, respectively.
Considerable information can be obtained from a boxplot.A small circle in a boxplot represents an outlier.If the possible outliers are ignored, from top to bottom, then the boxplot comprises the upper edge line, upper quartile ( 3 Q ) line, median line, lower quartile ( 1 Q ) line, and lower edge line.
Among them, the upper quartile line, median line, and lower quartile line constitute a small rectangle.The length of this rectangle represents the concentration degree of data distribution.The values of The dimension of the original load feature set is relatively high.The 243 features are divided into two groups to demonstrate the PI value distribution of each feature.The first 122 features of 1-h-ahead prediction and day-ahead prediction are shown in Figures 2a and 3a, respectively, and the remaining 121 features are shown in Figures 2b and 3b, respectively.Considering that the abscissa is relatively dense, only the features with high PI values are marked in the figure.For ease of interpretation, Fi is used to replace the original F 1−h i and F 24−h i to represent the feature i in Figures 3 and 4, respectively.
Energies 2016, 9, 767 11 of 24 Considerable information can be obtained from a boxplot.A small circle in a boxplot represents an outlier.If the possible outliers are ignored, from top to bottom, then the boxplot comprises the upper edge line, upper quartile (Q 3 ) line, median line, lower quartile (Q 1 ) line, and lower edge line.Among them, the upper quartile line, median line, and lower quartile line constitute a small rectangle.The length of this rectangle represents the concentration degree of data distribution.The values of Q u and Q d , representing the upper edge line and lower edge line, can be calculated using Equations ( 11) and ( 12): where IQR = Q 3 − Q 1 , which represents the interquartile range.The data sitting outside of the upper edge line and lower edge line are defined as outliers and are represented by small circles.Figure 3 shows that features , F 1−h 24 , and F 1−h 168 ) is very short and no abnormal values are found.All these results indicate that the three features are important for the forecasting outcome.This deduction is also consistent with the common sense of LF, which states that the historical load data for 1, 24, and 168 h before time t have significant importance for the forecasting result.As shown in Figure 4, the PI values of features F 24−h 1 , F 24−h 25 , F 24−h 145 , and F 24−h 242 are relatively higher than those of other features.Features F 24−h 1 and F 24−h 145 represent the historical load data for 24 and 168 h before time t.
As shown in Figures 3 and 4, numerous features have an outlier while some have two or more outliers.The existence of outliers has a significant effect on the importance of the feature.An important feature may be regarded as an irrelevant feature because of a very small outlier, or an irrelevant feature may be regarded as an important feature owing to a very large outlier.Therefore, all outliers are deleted.The final PI value of a feature can then be obtained by averaging all normal values.The PI values of features of 1-h-ahead prediction and day-ahead prediction are shown in Figure 5a,b, respectively.Considering the limited space, only the first 40 features with the highest PI values are presented.For ease of interpretation, Fi is used to replace the original F 1−h i and F 24−h i to represent the ith feature in Figure 5a,b.
Energies 2016, 9, 767 11 of 24 u Q and d Q , representing the upper edge line and lower edge line, can be calculated using Equations ( 11) and ( 12): where , which represents the interquartile range.The data sitting outside of the upper edge line and lower edge line are defined as outliers and are represented by small circles.
Figure 3 shows that features As shown in Figures 3 and 4, numerous features have an outlier while some have two or more outliers.The existence of outliers has a significant effect on the importance of the feature.An important feature may be regarded as an irrelevant feature because of a very small outlier, or an irrelevant feature may be regarded as an important feature owing to a very large outlier.Therefore, all outliers are deleted.The final PI value of a feature can then be obtained by averaging all normal values.The PI values of features of 1-h-ahead prediction and day-ahead prediction are shown in Figure 5a,b, respectively.Considering the limited space, only the first 40 features with the highest PI values are presented.For ease of interpretation, Fi is used to replace the original  After obtaining the PI values of all features, the improved SBS method is used for feature selection.First, the original feature set M is used to train an RF, and the prediction error P all of this model on the test set is recorded.Subsequently, features are rearranged in descending order according to their PI values.Then, 10 features are added into the preselection feature set Q pre in order every time and are removed from set M. Finally, set Q i pre , which is equal to set Q pre , is used to train a new RF and the prediction error P i pre is recorded.The process is repeated until the stop condition is reached or set M is empty and the preselection stage is completed.The prediction errors of different RFs with different training sets Q i pre are presented in Tables 4 and 5, respectively.In this paper, MAPE is used as the criteria of feature selection in the preselection stage.The MAPE and RMSE of different feature subsets are shown in Tables 4 and 5.It can be seen that the change trend of the RMSE is the same as that of the MAPE.The preselection feature sets of 1-h-ahead prediction and day-ahead prediction contain 30 and 40 features, respectively.Thus, the traditional SBS method is used for the two preselection feature sets with 30 and 40 iteration times.The traditional SBS method is directly used for the original feature set if no preselection stage is applied.The number of iteration times is 243, which is substantially higher than 30 and 40.Given the substantially larger iteration time, the improved SBS method is suitable for the load forecasting of high dimensional original feature sets.
According to their PI value, the features in the preselection feature set are deleted one by one, from smallest to largest.Whenever a feature is deleted, a new preselection feature set is used to train a new RF and the prediction error is recorded.The process is repeated until the preselection feature set is empty.The prediction errors of different feature subsets of 1-h-ahead prediction and day-ahead prediction using the traditional SBS method are shown in Figure 6a,b.
Figure 6a shows that, when the dimension of the feature subset is smaller than 6, the prediction error quickly increases with the decrease in the number of features.When the number of features is reduced from 18 to 6, the reduction of the prediction error is stable; however, 0.243% of the increase still remains (from 0.988% to 1.231%).When the feature subset dimension is larger than 18, the prediction error does not produce much volatility if a feature is deleted.The MAPE obtains the minimum value of 0.971% when the dimension of the feature subset is 24.When the feature subset dimension is smaller than 18, the prediction error begins to gradually increase if a feature is deleted.The same trend is observed in Figure 6b.The prediction error maintains a relatively stable value when the feature subset dimension is larger than 6, and the minimum value of 1.745% is obtained when the dimension of the feature subset is 33.Therefore, when conducting 1-h-ahead load prediction, the first 24 features with the highest PI values must be selected to form the optimal forecasting feature subset.Meanwhile, the first 33 features with the highest PI values must be selected to form the optimal forecasting feature subset when conducting day-ahead prediction.Figure 6a shows that, when the dimension of the feature subset is smaller than 6, the prediction error quickly increases with the decrease in the number of features.When the number of features is reduced from 18 to 6, the reduction of the prediction error is stable; however, 0.243% of the increase still remains (from 0.988% to 1.231%).When the feature subset dimension is larger than 18, the prediction error does not produce much volatility if a feature is deleted.The MAPE obtains the minimum value of 0.971% when the dimension of the feature subset is 24.When the feature subset dimension is smaller than 18, the prediction error begins to gradually increase if a feature is deleted.The same trend is observed in Figure 6b.The prediction error maintains a relatively stable value when the feature subset dimension is larger than 6, and the minimum value of 1.745% is obtained when the dimension of the feature subset is 33.Therefore, when conducting 1-h-ahead load prediction, the first 24 features with the highest PI values must be selected to form the optimal forecasting feature subset.Meanwhile, the first 33 features with the highest PI values must be selected to form the optimal forecasting feature subset when conducting day-ahead prediction.
Prediction efficiency and prediction accuracy must be considered when STLF is conducted.In 1h-ahead prediction, the increase in prediction error is insignificant when the feature number is reduced from 24 to 18 or 6, and the computation of the model training process is increased.However, the prediction accuracy is directly related to the economic and stability of power system in the STLF field.Therefore, the prediction accuracy must be guaranteed first before prediction efficiency.Kulkarni et al. emphasized that a decrease of 1% in prediction error can result in a decrease of approximately 10 million pounds in operational costs [4].Consequently, a 0.26% (from 1.231% to 0.971%) or a 0.017% (from 0.988% to 0.971%) decrease of prediction error is important for improving the safety and stability of a power system and reducing system operational costs.In the same way, the prediction accuracy must be used as the feature selection index when the day-ahead prediction Prediction efficiency and prediction accuracy must be considered when STLF is conducted.In 1-h-ahead prediction, the increase in prediction error is insignificant when the feature number is reduced from 24 to 18 or 6, and the computation of the model training process is increased.However, the prediction accuracy is directly related to the economic and stability of power system in the STLF field.Therefore, the prediction accuracy must be guaranteed first before prediction efficiency.Kulkarni et al. emphasized that a decrease of 1% in prediction error can result in a decrease of approximately 10 million pounds in operational costs [4].Consequently, a 0.26% (from 1.231% to 0.971%) or a 0.017% (from 0.988% to 0.971%) decrease of prediction error is important for improving the safety and stability of a power system and reducing system operational costs.In the same way, the prediction accuracy must be used as the feature selection index when the day-ahead prediction is conducted.
Two other feature selection algorithms, namely, Pearson correlation coefficient (PCC) and ReliefF, are used for comparative experiments to further verify the effectiveness of the proposed feature selection method.
The optimal feature subsets selected by the three kinds of feature selection algorithms discussed are used to train RF.The dimension of the optimal feature subsets and the prediction error of the three RFs on the test set in 1-h-ahead prediction and day-ahead prediction are listed in Tables 6 and 7, respectively.As shown in Tables 6 and 7, regardless of whether the dimension of the optimal feature subset selected by PCC and ReliefF is higher or lower than PI, the prediction errors of their corresponding RFs are higher than the RF corresponding to PI.Therefore, the proposed feature selection method is valid.
Moreover, to further validate the effectiveness of the proposed feature selection method, some other persistent methods are used for comparison.These methods select the previous load feature, the load feature from the previous day, and the load feature from the previous week by experience.Meanwhile, the selection of the daily and hourly feature refers to the result of the proposed new feature selection method.When 1-h-ahead prediction and day-ahead prediction are conducted, features F241 and F242 are included in the feature subsets selected by the proposed feature selection algorithm.When 1-h-ahead prediction is conducted, the composition of the feature subsets obtained from four persistent methods is as follows:

•
Persistent feature set 1: F1 (L t-1 , the load of the previous one hour), F241, and F242; • Persistent feature set 2: F24 (L t-24 , the load of the same time of the previous day), F241, and F242; • Persistent feature set 3: F168 (L t-168 , the load of the same time of the previous week), F241, and F242; • Persistent feature set 4: F1, F24, F168, F241, and F242; When day-ahead prediction is conducted, the composition of the feature subsets obtained from the four persistent methods is as follows:

•
Persistent feature set 5: from F1 to F24 (from L t-24 to L t-47 , the load of the previous 24 h), F241, and F242;  10, the same conclusion can be drawn from Table 11.Therefore, RF is proven to be more suitable than SVR, ANN, and ARIMA for STLF.The real load curve and forecasting load curves of seven days of the test set obtained from 1-h-aheah prediction and day-ahead prediction are shown in Figures 7 and 8, respectively.These seven days contain every day of the week and are randomly and equally extracted from four quarters (approximately two days per quarter).These seven days include June 18 (Mon.),September 25 (Tues.),June 27 (Wed.),February 9 (Thur.),September 21 (Fri.),March 3 (Sat.),and December 9 (Sun.).
Figure 7 shows that the load curve predicted by RF nearly overlaps with the real load curve.Compared with the load curve predicted by RF, the load curves predicted by SVR and ANN have a certain error with the real load curve.Although the fitting degree among all the forecasting load curves and the real load curve in Figure 8 are worse than those in Figure 7, the load curve predicted by RF is still better than the load curves predicted by SVR, ANN, and ARIMA.This result validates the high accuracy of RF when used as the load predictor.The MAPE and RMSE of different predictors of the seven days in Figures 7 and 8 are shown in Tables 12 and 13, respectively, when 1-h-ahead prediction and day-ahead prediction are conducted.Tables 12 and 13 indicate that the MAPE and RMSE of RF are always lower than those of the other methods.This result is attributed to that RF requires few parameters to be optimized and is resistant to overfitting.As a result, RF can obtain higher prediction accuracy than SVR, ANN, and ARIMA when used for STLF.
Each day of the test set is arranged according to the order of the date.The MAPE and RMSE of each test day of the three predictors for the 1-h-ahead prediction are shown in Figure 9a,b.The MAPE and RMSE of each test day of the four predictors for the day-ahead prediction are shown in Figure 10a,b.As shown in Figure 9, when the 1-h-ahead prediction is conducted, the MAPE and RMSE produced by RF are smaller than those produced by SVR and ANN. Figure 9a shows that the MAPE of RF is higher than 1.5% only when used to forecast the load of the 26th test day, whereas the MAPE of SVR and ANN are mostly higher than 1.5%.A similar conclusion is obtained from Figure 9b.When RF is used to forecast the load, in addition to the RMSE of the 26th test day being higher than 6 MW, the RMSE of the 25th test day is slightly higher than 6 MW.As shown in Figure 10a,b, the MAPE and RMSE of RF are basically smaller than those of the three other predictors.Therefore, the accuracy of RF applied to the load forecasting was verified.As shown in Figure 9, when the 1-h-ahead prediction is conducted, the MAPE and RMSE produced by RF are smaller than those produced by SVR and ANN. Figure 9a shows that the MAPE of RF is higher than 1.5% only when used to forecast the load of the 26th test day, whereas the MAPE of SVR and ANN are mostly higher than 1.5%.A similar conclusion is obtained from Figure 9b.When RF is used to forecast the load, in addition to the RMSE of the 26th test day being higher than 6 MW, the RMSE of the 25th test day is slightly higher than 6 MW.As shown in Figure 10a,b, the MAPE and RMSE of RF are basically smaller than those of the three other predictors.Therefore, the accuracy of RF applied to the load forecasting was verified.As shown in Figure 9, when the 1-h-ahead prediction is conducted, the MAPE and RMSE produced by RF are smaller than those produced by SVR and ANN. Figure 9a shows that the MAPE of RF is higher than 1.5% only when used to forecast the load of the 26th test day, whereas the MAPE of SVR and ANN are mostly higher than 1.5%.A similar conclusion is obtained from Figure 9b.When RF is used to forecast the load, in addition to the RMSE of the 26th test day being higher than 6 MW, the RMSE of the 25th test day is slightly higher than 6 MW.As shown in Figure 10a,b, the MAPE and RMSE of RF are basically smaller than those of the three other predictors.Therefore, the accuracy of RF applied to the load forecasting was verified.10 and 11.The prediction error of RF is lower than the errors of other methods.When 1-h-ahead prediction is conducted, the MAPE of the proposed method ranges from 0.893% to 1.218%, and the RMSE ranges from 4.012 MW to 5.023 MW.It can be seen that the MAPE and RMSE of the proposed method in 10-fold cross-validation are around 0.971% and 4.372 MW, respectively, for minor fluctuation.When day-ahead prediction is conducted, the MAPE of the proposed method ranges from 1.659% to 1.912%, and the RMSE ranges from 6.795 MW to 8.268 MW.Similarly, the MAPE and RMSE of the proposed method in 10-fold cross-validation are around 1.745% and 7.324 MW, respectively, for minor fluctuation.According to the experimental results of 10-fold cross-validation, it can be concluded that the proposed method can achieve satisfactory forecast results for different training and test sets.The effectiveness and robustness of the proposed method were verified.
Therefore, RF can obtain satisfying prediction results for different test sets, and the validity and accuracy of RF applied to STLF were once again verified.

Conclusions
A novel feature selection method for STLF is proposed in this paper.Compared with current STLF methods and feature selection methods, the following innovations are made in this study: 1.
Compared with other STLF methods that use another feature selection method with high time complexity, the proposed approach designs a novel feature selection method based on PI value obtained in the training process of RF.The optimal forecasting feature subset is selected only by the improved SBS method with simple principle and high efficiency.2.
In the process of feature selection, the prediction error of RF is used to determine the performance of each feature subset.Only two parameters of RF need to be adjusted, and the parameter selection method is clear.Considering this advantage, the proposed approach avoids the influence of unreasonable model parameters on the feature selection results.

3.
The traditional SBS method is optimized to reduce the number of iterations.Therefore, the efficiency of the search strategy is dramatically improved.
The experimental results based on real load data verify the effectiveness of the proposed RF-based feature selection method for STLF.In addition, the optimized RF has better generalization capability than SVR and ANN.Therefore, RF is suitable for STLF of power systems.Future work will focus on load forecasting in the distribution system, especially for residential load forecasting.

3. 2 .
Feature Selection for Load Forecasting Based on Permutation Importance (PI) and Optimal SBS Method 3.2.1.Feature Importance Analysis Based on Permutation Importance (PI)

Figure 2 .
Figure 2. Load curve of the entire year.

Figure 2
Figure 2 clearly shows three huge falls in the curve.These three falls correspond to the three main holidays of this year, that is, the Spring Festival, the Labor Day holiday, and National Day.The load consumed in the winter gradually increases because of the cold weather.The entire year is divided into a training set and a test set by random sampling.To increase the reliability of the experiment, 33 days of the test set are randomly and equally distributed in four quarters (approximately eight days per quarter).The relevant information for each day of the test set is listed in Table1.

Figure 2 .
Figure 2. Load curve of the entire year.

Figure 2
Figure 2 clearly shows three huge falls in the curve.These three falls correspond to the three main holidays of this year, that is, the Spring Festival, the Labor Day holiday, and National Day.The load consumed in the winter gradually increases because of the cold weather.The entire year is divided into a training set and a test set by random sampling.To increase the reliability of the experiment, 33 days of the test set are randomly and equally distributed in four quarters (approximately eight days per quarter).The relevant information for each day of the test set is listed in Table1.

Figure 3 .Figure 3 .Figure 4 .
Figure 3. (a) Boxplot of the PI value of the first 122 features of 1-h-ahead prediction; (b) boxplot of the PI value of the remaining 121 features of 1-h-ahead prediction.

Figure 4 .
Figure 4. (a) Boxplot of the PI value of the first 122 features of day-ahead prediction; (b) boxplot of the PI value of the remaining 121 features of day-ahead prediction.

1 ,
and F 1−h 242 have higher PI values compared with those of other features.The PI values of features F 1−h F 1−h 24 , and F 1−h 168 are obviously higher than those of features F 1−h 2 and F 1−h 242 .The length of the rectangle of the three features (F 1−h 1

Figure 5 .
Figure 5. (a) The permutation importance (PI) values of the first 40 features of 1-h-ahead prediction; (b) The permutation importance (PI) values of the first 40 features of day-ahead prediction.

Figure 5 .
Figure 5. (a) The permutation importance (PI) values of the first 40 features of 1-h-ahead prediction; (b) The permutation importance (PI) values of the first 40 features of day-ahead prediction.

Figure 6 .
Figure 6.(a) Prediction error (MAPE) of different feature subsets obtained from SBS method of 1-hahead prediction; (b) prediction error (MAPE) of different feature subsets obtained from SBS method of day-ahead prediction.

Figure 6 .
Figure 6.(a) Prediction error (MAPE) of different feature subsets obtained from SBS method of 1-h-ahead prediction; (b) prediction error (MAPE) of different feature subsets obtained from SBS method of day-ahead prediction.

Figure 7 .
Figure 7. (a) The comparison of prediction results obtained from 1-h-ahead prediction of June 18th; (b) the comparison of prediction results obtained from 1-h-ahead prediction of September 25th; (c) the comparison of prediction results obtained from 1-h-ahead prediction of June 27th; (d) the comparison of prediction results obtained from 1-h-ahead prediction of February 9th; (e) the comparison of prediction results obtained from 1-h-ahead prediction of September 21st; (f) the comparison of prediction results obtained from 1-h-ahead prediction of March 3rd; (g) the comparison of prediction results obtained from 1-h-ahead prediction of December 9th.

Figure 7 .
Figure 7. (a) The comparison of prediction results obtained from 1-h-ahead prediction of June 18th; (b) the comparison of prediction results obtained from 1-h-ahead prediction of September 25th; (c) the comparison of prediction results obtained from 1-h-ahead prediction of June 27th; (d) the comparison of prediction results obtained from 1-h-ahead prediction of February 9th; (e) the comparison of prediction results obtained from 1-h-ahead prediction of September 21st; (f) the comparison of prediction results obtained from 1-h-ahead prediction of March 3rd; (g) the comparison of prediction results obtained from 1-h-ahead prediction of December 9th.

Figure 8 .
Figure 8.(a) The comparison of prediction results obtained from day-ahead prediction of June 18th; (b) the comparison of prediction results obtained from day-ahead prediction of September 25th; (c) the comparison of prediction results obtained from day-ahead prediction of June 27th; (d) the comparison of prediction results obtained from day-ahead prediction of February 9th; (e) the comparison of prediction results obtained from day-ahead prediction of September 21st; (f) the comparison of prediction results obtained from day-ahead prediction of March 3rd; (g) the comparison of prediction results obtained from day-ahead prediction of December 9th.

Figure 8 .
Figure 8.(a) The comparison of prediction results obtained from day-ahead prediction of June 18th; (b) the comparison of prediction results obtained from day-ahead prediction of September 25th; (c) the comparison of prediction results obtained from day-ahead prediction of June 27th; (d) the comparison of prediction results obtained from day-ahead prediction of February 9th; (e) the comparison of prediction results obtained from day-ahead prediction of September 21st; (f) the comparison of prediction results obtained from day-ahead prediction of March 3rd; (g) the comparison of prediction results obtained from day-ahead prediction of December 9th.

Figure 9 .Figure 10 .
Figure 9. (a) MAPE of three predictors for each test day when 1-h-ahead prediction is conducted; (b) RMSE of three predictors for each test day when 1-h-ahead prediction is conducted.

Figure 9 .Figure 9 .Figure 10 .
Figure 9. (a) MAPE of three predictors for each test day when 1-h-ahead prediction is conducted; (b) RMSE of three predictors for each test day when 1-h-ahead prediction is conducted.

Figure 10 .
Figure 10.(a) MAPE of four predictors for each test day when day-ahead prediction is conducted; (b) RMSE of four predictors for each test day when day-ahead prediction is conducted.

Figure 12 .
Figure 12.(a) MAPE of four predictors on the 10 test sets when day-ahead prediction is conducted; (b) RMSE of four predictors on the 10 test sets when day-ahead prediction is conducted.

Figure 11 .
Figure 11.(a) MAPE of three predictors on the 10 test sets when 1-h-ahead prediction is conducted; (b) RMSE of three predictors on the 10 test sets when 1-h-ahead prediction is conducted.

Figures 11 and 12
Figures 11 and 12 show that the MAPE and RMSE in 10-fold cross-validation experiments are consistent with the results in Tables10 and 11.The prediction error of RF is lower than the errors of other methods.When 1-h-ahead prediction is conducted, the MAPE of the proposed method ranges from 0.893% to 1.218%, and the RMSE ranges from 4.012 MW to 5.023 MW.It can be seen that the MAPE and RMSE of the proposed method in 10-fold cross-validation are around 0.971% and 4.372 MW, respectively, for minor fluctuation.When day-ahead prediction is conducted, the MAPE of the proposed method ranges from 1.659% to 1.912%, and the RMSE ranges from 6.795 MW to 8.268 MW.Similarly, the MAPE and RMSE of the proposed method in 10-fold cross-validation are around 1.745% and 7.324 MW, respectively, for minor fluctuation.According to the experimental results of 10-fold cross-validation, it can be concluded that the proposed method can achieve satisfactory forecast results for different training and test sets.The effectiveness and robustness of the proposed method were verified.Therefore, RF can obtain satisfying prediction results for different test sets, and the validity and accuracy of RF applied to STLF were once again verified.

Figure 11 .Figure 12 .
Figure 11.(a) MAPE of three predictors on the 10 test sets when 1-h-ahead prediction is conducted; (b) RMSE of three predictors on the 10 test sets when 1-h-ahead prediction is conducted.

Figure 12 .
Figure 12.(a) MAPE of four predictors on the 10 test sets when day-ahead prediction is conducted; (b) RMSE of four predictors on the 10 test sets when day-ahead prediction is conducted.
1, 2, . . ., l) is the set composed of the samples belonging to the jth class.If feature F is used to split node A, then D is divided into o subsets (D 1 , D 2 , . . ., D o ).A total of d i samples exist in set D i (i = 1, 2, . . ., o).Accordingly, the Gini index of set D after segmentation is:

Table 1 .
The relevant information for each day of the test set.

Table 1 .
The relevant information for each day of the test set.

Table 2 .
The composition of original feature set used for 1-h-ahead prediction.

Table 3 .
The composition of original feature set used for day-ahead prediction.
) is very short and no abnormal values are found.All these results indicate that the three features are important for the forecasting outcome.This deduction is also consistent with the common sense of LF, which states that the historical load data for 1, 24, and 168 h before time t have significant importance for the forecasting result.As shown in Figure4, the PI h F  represent the historical load data for 24 and 168 h before time t.

Table 4 .
Prediction error of different feature subsets selected for 1-h-ahead prediction.

Table 5 .
Prediction error of different feature subsets selected for day-ahead prediction.

Table 6 .
Comparison of three feature selection algorithms when 1-h-ahead prediction is conducted.

Table 7 .
Comparison of three feature selection algorithms when day-ahead prediction is conducted.

Table 10 .
Mean absolute percentage error (MAPE) and root mean square error (RMSE) of random forest (RF), support vector regression (SVR), and artificial neural network (ANN) on the different testing sets of 1-h-ahead prediction.

Table 11 .
MAPE and RMSE of RF, SVR, ANN, and ARIMA on the different testing sets of day-ahead prediction.

Table 10
indicates that, no matter which test set is used, the MAPE and RMSE generated by RF are basically only half or less of those by SVR and ANN.When the entire test set is used to test the performance of RF, the MAPE is only 0.971%.Although the MAPE and RMSE in Table11are larger than those in Table

Table 12 .
Prediction error of different predictors of the selected seven test days when 1-h-ahead prediction is conducted.

Table 13 .
Prediction error of different predictors of the selected seven test days when day-ahead prediction is conducted.